[ensembl-dev] Python Script to read emf dump files?

Fri Mar 10 13:39:03 GMT 2023

Hi Omar,

Thanks for getting back to me. The reason I asked about whether it must 
be EMF is that GERP scores are also included in comment lines (starting 
with '#') of the MAF files generated from the same alignments as the EMF 
files (e.g. 
https://ftp.ensembl.org/pub/release-109/maf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/90_mammals.epo_extended.1_1.maf.gz 
). I don't know of an existing Python script or resource that will load 
the 90-mammal alignment from either EMF or MAF format complete with GERP 
scores into Python. However, it's possible to load them from the MAF 
files -- with a little coaxing -- using the Biopython MafIterator 
function ( 
https://biopython.org/docs/1.81/api/Bio.AlignIO.MafIO.html#Bio.AlignIO.MafIO.MafIterator 
).

The Biopython MafIterator ignores comment lines, but we could use a 
class such as 'CommentStashFile' below, to iterate over the lines of the 
input MAF file, extracting the comment lines and returning everything 
else. This allows us to access the comments in the MAF file, while 
letting 'MafIterator' take care of parsing the alignments.

> from collections.abc import Iterator
> 
> class CommentStashFile(Iterator):
> 
> def __init__(self, file_object):
> self.file_object = file_object
> self.comment_lines = []
> 
> def __next__(self):
> for line in self.file_object:
> if line.startswith("#"):
> self.comment_lines.append(line)
> else:
> return line
> raise StopIteration

With the 'CommentStashFile' class, we can create a 'GerpMafIterator' 
function, which stashes the comments for each MAF block, then checks for 
the comment containing the GERP scores. These GERP scores can be added 
as column annotations to the 'MultipleSeqAlignment' object representing 
the corresponding MAF block:

> from Bio.AlignIO.MafIO import MafIterator
> 
> def GerpMafIterator(handle):
> gerp_line_prefix = "# gerp scores: "
> comment_stash_file = CommentStashFile(handle)
> for maf_block in MafIterator(comment_stash_file):
> gerp_score_line = None
> for comment_line in comment_stash_file.comment_lines:
> if comment_line.startswith(gerp_line_prefix):
> gerp_score_line = comment_line
> break
> try:
> gerp_score_text = gerp_score_line[len(gerp_line_prefix):].rstrip("\n")
> except TypeError as exc:
> raise ValueError("GERP score line not found for MAF block") from exc
> maf_block.column_annotations["gerp_scores"] = [
> None if x == "." else float(x)
> for x in gerp_score_text.split()
> ]
> yield maf_block
> comment_stash_file.comment_lines.clear()

It should be possible to use 'GerpMafIterator' on any current Ensembl 
MAF file containing GERP scores encoded as comments. For example, the 
file "90_mammals.epo_extended.1_1.maf.gz" could be downloaded and then 
processed as follows:

> import gzip
> 
> with gzip.open("90_mammals.epo_extended.1_1.maf.gz", "rt") as 
> in_file_obj:
> for maf_block in GerpMafIterator(in_file_obj):
> gerp_scores = maf_block.column_annotations["gerp_scores"]
> # ... do stuff with GERP scores ...

Do you think something like this might do the job?

All the best,

Thomas.

On 2023-03-09 22:18, Omar Gamel wrote:

> Hi Thomas,
> 
> Thank you for your response.
> I am interested in loading the GERP and associated alignment data into 
> Python,  not necessarily from an EMF file.
> Right now I loaded the human genome and gerp scores separately from 
> release 109 FTP, matching them by position.
> I would similarly like to be able to get the gerp for an entire 
> alignment.
> 
> Best
> Omar
> 
> On Wed, Mar 8, 2023 at 5:56 PM Thomas Walsh <twalsh at ebi.ac.uk> wrote:
> 
> Hello Omar,
> 
> My name is Thomas Walsh from the Ensembl Compara team, and I helped 
> create the EMF files that you've expressed an interest in.
> 
> I understand that you're especially interested in accessing GERP and 
> alignment data together. Is it necessary for this to be taken from an 
> EMF file? Or are you primarily interested in loading the GERP and 
> associated alignment data into Python, regardless of how it got there?
> 
> All the best,
> 
> Thomas Walsh.
> 
> -------- Forwarded Message --------
> Subject: [ensembl-dev] Python Script to read emf dump files?
> Date: Thu, 2 Mar 2023 13:21:18 -0500
> From: Omar Gamel <omar.gamel at utoronto.ca>
> Reply-To: Ensembl developers list <dev at ensembl.org>
> To: dev at ensembl.org
> 
> Hello,
> 
> I am looking for a python package or script that can parse the 
> alignment emf files in the ENSEMBL FTP dump into an object I can 
> manipulate
> Example files here: 
> https://ftp.ensembl.org/pub/current_emf/ensembl-compara/multiple_alignments/90_mammals.epo_extended/
> I am particularly interested in the GERP data along with the alignment.
> 
> I see the example scripts provided are largely in perl: 
> https://github.com/Ensembl/ensembl-compara/tree/release/109/scripts/dumps
> 
> Please let me know where I can find such a python resource.
> 
> Thank you
> Omar
> 
> Thomas Walsh
> 
> Bioinformatics Developer, Ensembl Compara
> 
> European Bioinformatics Institute (EMBL-EBI)
> 
> Wellcome Genome Campus
> 
> Hinxton
> 
> Cambridge CB10 1SD
> 
> United Kingdom
> 
> Email: twalsh at ebi.ac.uk
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: 
> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/

_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: 
https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
Ensembl Blog: http://www.ensembl.info/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20230310/ed2df3c0/attachment.html>