[ensembl-dev] Python Script to read emf dump files?
Thomas Walsh
twalsh at ebi.ac.uk
Fri Mar 10 13:39:03 GMT 2023
Hi Omar,
Thanks for getting back to me. The reason I asked about whether it must
be EMF is that GERP scores are also included in comment lines (starting
with '#') of the MAF files generated from the same alignments as the EMF
files (e.g.
https://ftp.ensembl.org/pub/release-109/maf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/90_mammals.epo_extended.1_1.maf.gz
). I don't know of an existing Python script or resource that will load
the 90-mammal alignment from either EMF or MAF format complete with GERP
scores into Python. However, it's possible to load them from the MAF
files -- with a little coaxing -- using the Biopython MafIterator
function (
https://biopython.org/docs/1.81/api/Bio.AlignIO.MafIO.html#Bio.AlignIO.MafIO.MafIterator
).
The Biopython MafIterator ignores comment lines, but we could use a
class such as 'CommentStashFile' below, to iterate over the lines of the
input MAF file, extracting the comment lines and returning everything
else. This allows us to access the comments in the MAF file, while
letting 'MafIterator' take care of parsing the alignments.
> from collections.abc import Iterator
>
> class CommentStashFile(Iterator):
>
> def __init__(self, file_object):
> self.file_object = file_object
> self.comment_lines = []
>
> def __next__(self):
> for line in self.file_object:
> if line.startswith("#"):
> self.comment_lines.append(line)
> else:
> return line
> raise StopIteration
With the 'CommentStashFile' class, we can create a 'GerpMafIterator'
function, which stashes the comments for each MAF block, then checks for
the comment containing the GERP scores. These GERP scores can be added
as column annotations to the 'MultipleSeqAlignment' object representing
the corresponding MAF block:
> from Bio.AlignIO.MafIO import MafIterator
>
> def GerpMafIterator(handle):
> gerp_line_prefix = "# gerp scores: "
> comment_stash_file = CommentStashFile(handle)
> for maf_block in MafIterator(comment_stash_file):
> gerp_score_line = None
> for comment_line in comment_stash_file.comment_lines:
> if comment_line.startswith(gerp_line_prefix):
> gerp_score_line = comment_line
> break
> try:
> gerp_score_text = gerp_score_line[len(gerp_line_prefix):].rstrip("\n")
> except TypeError as exc:
> raise ValueError("GERP score line not found for MAF block") from exc
> maf_block.column_annotations["gerp_scores"] = [
> None if x == "." else float(x)
> for x in gerp_score_text.split()
> ]
> yield maf_block
> comment_stash_file.comment_lines.clear()
It should be possible to use 'GerpMafIterator' on any current Ensembl
MAF file containing GERP scores encoded as comments. For example, the
file "90_mammals.epo_extended.1_1.maf.gz" could be downloaded and then
processed as follows:
> import gzip
>
> with gzip.open("90_mammals.epo_extended.1_1.maf.gz", "rt") as
> in_file_obj:
> for maf_block in GerpMafIterator(in_file_obj):
> gerp_scores = maf_block.column_annotations["gerp_scores"]
> # ... do stuff with GERP scores ...
Do you think something like this might do the job?
All the best,
Thomas.
On 2023-03-09 22:18, Omar Gamel wrote:
> Hi Thomas,
>
> Thank you for your response.
> I am interested in loading the GERP and associated alignment data into
> Python, not necessarily from an EMF file.
> Right now I loaded the human genome and gerp scores separately from
> release 109 FTP, matching them by position.
> I would similarly like to be able to get the gerp for an entire
> alignment.
>
> Best
> Omar
>
> On Wed, Mar 8, 2023 at 5:56 PM Thomas Walsh <twalsh at ebi.ac.uk> wrote:
>
> Hello Omar,
>
> My name is Thomas Walsh from the Ensembl Compara team, and I helped
> create the EMF files that you've expressed an interest in.
>
> I understand that you're especially interested in accessing GERP and
> alignment data together. Is it necessary for this to be taken from an
> EMF file? Or are you primarily interested in loading the GERP and
> associated alignment data into Python, regardless of how it got there?
>
> All the best,
>
> Thomas Walsh.
>
> -------- Forwarded Message --------
> Subject: [ensembl-dev] Python Script to read emf dump files?
> Date: Thu, 2 Mar 2023 13:21:18 -0500
> From: Omar Gamel <omar.gamel at utoronto.ca>
> Reply-To: Ensembl developers list <dev at ensembl.org>
> To: dev at ensembl.org
>
> Hello,
>
> I am looking for a python package or script that can parse the
> alignment emf files in the ENSEMBL FTP dump into an object I can
> manipulate
> Example files here:
> https://ftp.ensembl.org/pub/current_emf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/
> I am particularly interested in the GERP data along with the alignment.
>
> I see the example scripts provided are largely in perl:
> https://github.com/Ensembl/ensembl-compara/tree/release/109/scripts/dumps
>
> Please let me know where I can find such a python resource.
>
> Thank you
> Omar
>
> Thomas Walsh
>
> Bioinformatics Developer, Ensembl Compara
>
> European Bioinformatics Institute (EMBL-EBI)
>
> Wellcome Genome Campus
>
> Hinxton
>
> Cambridge CB10 1SD
>
> United Kingdom
>
> Email: twalsh at ebi.ac.uk
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/
_______________________________________________
Dev mailing list Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info:
https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
Ensembl Blog: http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230310/ed2df3c0/attachment.html>
More information about the Dev
mailing list