[ensembl-dev] ids from older versions of Ensembl.

Dan Staines dstaines at ebi.ac.uk
Sat Aug 17 11:04:03 BST 2013


As is always the case, things are slightly different with Ensembl 
Bacteria ;-)

As a general rule, Ensembl Genomes as a project does not do genebuilds, 
but instead uses gene models from third parties. This means we use 
identifiers assigned by those third parties rather than assigning our 
own as Ensembl do.

The exception to this was the first iteration of Ensembl Bacteria, where 
we sometimes made slight modifications to gene models from INSDC using 
additional curation from UniProt. We took the decision at this point to 
assign identifiers ourselves, with the consequent need to map those 
identifiers between releases.

Moving to the much expanded second iteration of Ensembl Bacteria, the 
general Ensembl Genomes strategy of using third party identifiers was 
adopted, so we use locus_tag and protein_id identifiers from INSDC 
(where available) as stable IDs. For the legacy identifiers for the 
200-odd genomes from the first iteration, we provide mappings based on 
protein_id identifiers where we can. Pairs of genes mapped in this way 
are generally identical in sequence, though, as I mentioned, some models 
in the first version were modified based on UniProt curation, so you 
might want to check the sequences if that is important for your purposes.
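
If sequence identity does matter to you, a check along the following lines 
might help. This is only a minimal sketch: it assumes you have already 
loaded the legacy-to-current mapping and both sets of sequences into plain 
dictionaries yourself, and the function name and the identifiers in the 
usage note are made up for illustration, not part of any Ensembl API.

```python
def find_sequence_mismatches(id_mapping, legacy_seqs, current_seqs):
    """Return (legacy_id, current_id) pairs whose sequences differ.

    id_mapping   -- dict of legacy stable ID -> current stable ID (assumed input)
    legacy_seqs  -- dict of legacy stable ID -> sequence string
    current_seqs -- dict of current stable ID -> sequence string

    A pair is also reported if either sequence is missing, since it
    cannot be confirmed as identical.
    """
    mismatches = []
    for legacy_id, current_id in sorted(id_mapping.items()):
        old_seq = legacy_seqs.get(legacy_id)
        new_seq = current_seqs.get(current_id)
        # Compare case-insensitively so soft-masked sequence still matches.
        if old_seq is None or new_seq is None or old_seq.upper() != new_seq.upper():
            mismatches.append((legacy_id, current_id))
    return mismatches
```

For example, calling it with a mapping such as {"OLD_G1": "TAG_0001"} and 
the two sequence dictionaries returns an empty list when every mapped pair 
is identical, and otherwise the list of pairs that need a closer look.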

As a slight wrinkle - there are a very small number of genes in the new 
Ensembl Bacteria that come from historical records for which INSDC does 
not currently provide a suitable unique identifier (these are usually, 
but not exclusively, ncRNA genes). For these, we do still assign our own 
identifiers. These are derived from the underlying feature coordinates 
within the record, so the same identifier is always used as long as the 
feature does not change. Given the small number of genes involved, and 
since any update by INSDC is likely to involve a correction that supplies 
identifiers, we don't provide any mapping beyond this.
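
To illustrate why a coordinate-derived identifier stays stable, here is a 
small sketch. The actual format we use is not described in this message, 
so the function name, the prefix and the layout of the identifier below 
are purely hypothetical; the only point is that the same record and 
coordinates always give the same identifier, and any change in 
coordinates gives a different one.

```python
def coordinate_based_id(record_accession, start, end, strand, prefix="EB"):
    """Build a deterministic identifier from a feature's location.

    Hypothetical layout: <prefix>_<accession>_<start>_<end>_<F|R>.
    This is NOT the real Ensembl Bacteria format, just an illustration.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    # Encode the strand as a letter so the identifier has no sign characters.
    strand_code = "F" if strand >= 0 else "R"
    return f"{prefix}_{record_accession}_{start}_{end}_{strand_code}"
```

Because nothing but the record and the coordinates goes into the 
identifier, rebuilding it in a later release needs no lookup table, which 
is also why there is nothing to map when the feature itself has changed.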

Hope this explains things a little more.

Dan.

-- 
Dan Staines, PhD
Technical Coordinator, Ensembl Genomes
European Bioinformatics Institute (EMBL-EBI)
http://www.ebi.ac.uk/
http://www.ensemblgenomes.org/




