[ensembl-dev] pseudogenes in ensembl bacteria

Wed Aug 21 10:03:49 BST 2013

On 08/21/2013 09:28 AM, Adam Witney wrote:
>
>
> On 21. 8. 2013 8:57, Dan Staines wrote:
>>
>>>> Are details about pseudogenes stored anywhere? For example, HI_0006
>>>> from
>>>> haemophilus_influenzae_rd_kw20 seems to be missing from any of the
>>>> downloaded files (fasta, gff3, gtf) and is not searchable on the
>>>> browser. Can it be accessed (i.e. coordinates and DNA sequence) from any
>>>> of the database tables?
>>>
>>> For some reason, the pseudogenes from that record do not appear to
>>> have been loaded into the core database (though they have been for
>>> other genomes in the same database). I'll investigate and let you know
>>> when I have a more detailed answer.
>>
>> Looking at this some more, it seems there is variability in how
>> pseudogenes are annotated in INSDC records, which has led to missing
>> pseudogenes from some records. This will be corrected in a future
>> release of Ensembl Bacteria (unclear right now whether this will make
>> the upcoming release in September though). For the moment, the currently
>> unprocessed features are stored as simple_feature entries in the core
>> database (visible in the browser as the "ENA features" track) so they
>> can be retrieved if needed. I can help with retrieval here if needed.
>
> Hi Dan,
>
> I am not too familiar with the core database schema, but what I would
> like to be able to do is extract all gene sequences (including
> pseudogenes) into a multi fasta file. What would be the easiest way for
> me to do this?

I'll devise a script for you...
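
As a rough illustration of what such a script has to do once feature
coordinates are in hand, here is a minimal Python sketch. It assumes you
already have the sequence of one seq_region as a string and a list of
(name, start, end, strand) tuples using Ensembl-style 1-based inclusive
coordinates, with strand given as 1 or -1. The function names and the
example data are hypothetical, and the sketch does not query the core
database or use the Ensembl API.

```python
# Hypothetical helper names; coordinates are assumed to be 1-based and
# inclusive, with strand 1 (forward) or -1 (reverse).

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(region_seq, start, end, strand):
    """Slice one feature out of its seq_region, honouring the strand."""
    sub = region_seq[start - 1:end]  # convert 1-based inclusive to a slice
    return reverse_complement(sub) if strand == -1 else sub


def to_multi_fasta(region_seq, features, width=60):
    """Format (name, start, end, strand) features as multi-FASTA text."""
    out = []
    for name, start, end, strand in features:
        seq = feature_sequence(region_seq, start, end, strand)
        out.append(f">{name}\n")
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width] + "\n")
    return "".join(out)


if __name__ == "__main__":
    # Toy data only; real coordinates would come from the gene and
    # simple_feature entries for the genome of interest.
    region = "AAACCCGGGTTT"
    feats = [("gene_1", 1, 6, 1), ("pseudogene_1", 4, 8, -1)]
    print(to_multi_fasta(region, feats), end="")
```

Genes and the currently unprocessed pseudogene features can be passed in
the same list, so both end up in a single multi-FASTA file.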

-- 
Dan Staines, PhD
Technical Coordinator, Ensembl Genomes
European Bioinformatics Institute (EMBL-EBI)
http://www.ebi.ac.uk/
http://www.ensemblgenomes.org/