[ensembl-dev] pseudogenes in ensembl bacteria

Wed Aug 21 13:06:27 BST 2013

>> Hi Dan,
>>
>> I am not too familiar with the core database schema, but what I would
>> like to be able to do is extract all gene sequences (including
>> pseudogenes) into a multi fasta file. What would be the easiest way for
>> me to do this?
>
> I'll devise a script for you...
>

OK, here's an example script using the ensemblgenomes and ensembl Perl 
APIs to dump FASTA files containing all gene and "gene" simple_feature 
sequences for each genome matching haemophilus_.*:
https://gist.github.com/danstaines/6293539

You should be able to customise this to fit your exact needs (e.g. the
contents of the header, or the range of genomes involved) but please let 
me know if you're not sure how to get a specific bit of information from 
Ensembl. One thing to note is that the "gene" simple_features stored 
don't include the locus_tag, because a very generic approach is taken to 
storing them.
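
To illustrate the kind of header customisation mentioned above, here is a 
minimal sketch of building a FASTA record with optional header fields. It 
is written in Python rather than Perl, is not taken from the gist, and 
does not call the Ensembl API at all. The function name fasta_record, the 
field names and the example identifiers are all hypothetical. Fields with 
no value (such as a missing locus_tag) are simply left out of the header.

```python
def fasta_record(seq_id, sequence, fields=None, width=60):
    """Format one FASTA record as a string.

    `fields` is an optional dict of extra header annotations; entries whose
    value is None are skipped, so a feature without a locus_tag still gets a
    clean header. All names here are illustrative assumptions.
    """
    parts = [f"{key}={value}" for key, value in (fields or {}).items()
             if value is not None]
    header = ">" + " ".join([seq_id] + parts)
    # Wrap the sequence at a fixed line width.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    # Hypothetical records: (identifier, sequence, header fields)
    records = [
        ("example_gene_1", "ATGAAACGTTAA", {"type": "gene", "locus_tag": "TAG_0001"}),
        ("example_feature_1", "ATGCCCTAA", {"type": "simple_feature", "locus_tag": None}),
    ]
    with open("example.fa", "w") as handle:
        for seq_id, sequence, fields in records:
            handle.write(fasta_record(seq_id, sequence, fields))
```

Changing what goes into the header is then just a matter of changing the 
dict passed as fields.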

Hope this helps - let me know if I can do anything else to help.

Dan.

-- 
Dan Staines, PhD
Technical Coordinator, Ensembl Genomes
European Bioinformatics Institute (EMBL-EBI)
http://www.ebi.ac.uk/
http://www.ensemblgenomes.org/