[ensembl-dev] pseudogenes in ensembl bacteria

Adam Witney awitney at sgul.ac.uk
Wed Aug 21 09:28:16 BST 2013



On 21. 8. 2013 8:57, Dan Staines wrote:
>
>>> Are details about pseudogenes stored anywhere? For example, HI_0006 from
>>> haemophilus_influenzae_rd_kw20 seems to be missing from any of the
>>> downloaded files (fasta, gff3, gtf) and is not searchable on the
>>> browser. Can it be accessed (i.e. coordinates and DNA sequence) from any
>>> of the database tables?
>>
>> For some reason, the pseudogenes from that record do not appear to
>> have been loaded into the core database (though they have been for
>> other genomes in the same database). I'll investigate and let you know
>> when I have a more detailed answer.
>
> Looking at this some more, it seems there is variability in how
> pseudogenes are annotated in INSDC records, which has led to missing
> pseudogenes from some records. This will be corrected in a future
> release of Ensembl Bacteria (unclear right now whether this will make
> the upcoming release in September though). For the moment, the currently
> unprocessed features are stored as simple_feature entries in the core
> database (visible in the browser as the "ENA features" track) so they
> can be retrieved if needed. I can help with retrieval here if needed.

Hi Dan,

I am not too familiar with the core database schema, but what I would 
like to be able to do is extract all gene sequences (including 
pseudogenes) into a multi fasta file. What would be the easiest way for 
me to do this?
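
To make the question concrete, here is a rough sketch of the last step as I 
imagine it, assuming the coordinates of each gene and simple_feature have 
already been pulled out of the core database together with the sequence of 
the region they sit on. I am assuming coordinates are 1-based and inclusive 
with a strand of 1 or -1; the function names and the example features are 
made up for illustration and are not part of any Ensembl API.

```python
# Sketch only: slice features out of a region sequence and emit multi-FASTA.
# Assumes 1-based inclusive start/end and strand of 1 or -1 (unverified
# against the core schema); all names here are hypothetical.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(region_seq, start, end, strand):
    """Cut start..end (1-based, inclusive) out of region_seq.

    Features on the reverse strand (-1) are reverse-complemented.
    """
    sub = region_seq[start - 1:end]
    return reverse_complement(sub) if strand == -1 else sub


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-FASTA, wrapped at `width`."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "".join(line + "\n" for line in lines)


def features_to_fasta(region_seq, features, width=60):
    """features: iterable of (name, start, end, strand) on one region."""
    records = [
        (name, extract_feature(region_seq, start, end, strand))
        for name, start, end, strand in features
    ]
    return to_fasta(records, width)


if __name__ == "__main__":
    # Toy region and made-up features, not real annotation.
    region = "AAACCCGGGTTT"
    toy = [("gene_a", 4, 6, 1), ("pseudo_b", 1, 6, -1)]
    print(features_to_fasta(region, toy), end="")
```

The part I do not know how to do is getting the (name, start, end, strand) 
rows and the region sequence out of the database in the first place.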

Thanks again for your help

Adam





