[ensembl-dev] "stable" ids of genes

PATERSON Trevor trevor.paterson at roslin.ed.ac.uk
Thu Feb 21 11:27:55 GMT 2013


thanks Dan

that seems eminently sensible going forwards

but mapping tables would be good (and it would be good if these were hosted somewhere in the EG MySQL database?)

trevor


Trevor Paterson PhD
trevor.paterson at roslin.ed.ac.uk
Bioinformatics 
The Roslin Institute
Royal (Dick) School of Veterinary Studies
University of Edinburgh
Easter Bush
Midlothian
EH25 9RG
Scotland UK

phone +44 (0)131 651 9157

http://bioinformatics.roslin.ed.ac.uk/

The University of Edinburgh is a charitable body, registered in Scotland with registration number SC005336




-----Original Message-----
From: Dan Staines [mailto:dstaines at ebi.ac.uk] 
Sent: 21 February 2013 11:24
To: dev at ensembl.org; PATERSON Trevor
Subject: Re: [ensembl-dev] "stable" ids of genes

Hi Trevor,

Yes, this is a one-off change specific to bacteria. Previously, because we modified a small number of gene models in the bacterial genomes we provided, we maintained our own stable IDs. We now use the gene models directly from the INSDC record, so this is no longer done. Instead, we are taking the same approach to stable IDs as for the majority of Ensembl Genomes species: where possible, we use externally maintained identifiers, which we don't track separately ourselves.

In the case of bacteria, the source for our stable IDs is the INSDC record itself, using the locus_tag for the gene, and the protein_id for the transcript/translation. Having said that, there are a small number of genes from older bacterial genomes where the locus_tag is missing or duplicated. In these cases, we use the protein_id if we can, or in exceptional cases we assign our own identifiers (this affects 14,780 of 20,965,827 genes). These identifiers are stable provided the original gene or CDS feature does not change its location, as they are based on a key derived from the INSDC feature type and location, but we won't do any other mapping of them.
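
As a rough illustration of the fallback order described above (locus_tag, then protein_id, then a key built from feature type and location), here is a minimal Python sketch. It is not the actual Ensembl Genomes pipeline code: the Feature class, the "LOC_" prefix, and the hashing scheme in location_key are all assumptions made up for the example.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Feature:
    """Hypothetical, simplified view of an INSDC gene/CDS feature."""
    feature_type: str                 # e.g. "gene" or "CDS"
    seq_region: str                   # accession of the parent sequence
    start: int
    end: int
    strand: int                       # 1 or -1
    locus_tag: Optional[str] = None
    protein_id: Optional[str] = None


def location_key(f: Feature) -> str:
    """Build an ID from feature type and location only (assumed scheme)."""
    raw = f"{f.feature_type}:{f.seq_region}:{f.start}:{f.end}:{f.strand}"
    return "LOC_" + hashlib.sha256(raw.encode()).hexdigest()[:12]


def gene_stable_ids(features: List[Feature]) -> List[str]:
    """Pick one ID per feature: locus_tag, else protein_id, else location key."""
    # Count locus_tags so that duplicated ones can be rejected.
    counts = Counter(f.locus_tag for f in features if f.locus_tag)
    ids = []
    for f in features:
        if f.locus_tag and counts[f.locus_tag] == 1:
            ids.append(f.locus_tag)          # normal case
        elif f.protein_id:
            ids.append(f.protein_id)         # locus_tag missing or duplicated
        else:
            ids.append(location_key(f))      # exceptional case
    return ids
```

Because the last-resort key depends only on feature type and coordinates, it stays the same between releases as long as the feature does not move, and changes as soon as it does.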

Lots of information here:
http://bacteria.ensembl.org/info/building/building_ensembl_bacteria.html#identifiers
but please ask if anything is unclear.

Lastly, we will look at providing mapping tables from old to new identifiers as a one-off.

Hope this helps,

Dan.

-- 
Dan Staines, PhD               Ensembl Genomes Technical Coordinator
EMBL-EBI                       Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/



