[ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes
trevor.paterson at roslin.ed.ac.uk
Tue Feb 12 16:33:03 GMT 2013
the reason I am using 'species.url' is just to generate links back to the website...
(you changed the way the address for collection_species is generated - but I think I solved that...)
there have been renames in the vertebrate DB too (orang-utan and squirrel) - having worked with taxonomists for a few years a while back, I can confidently predict names will keep on changing - forever!
Trevor Paterson PhD
trevor.paterson at roslin.ed.ac.uk
The Roslin Institute
Royal (Dick) School of Veterinary Studies
University of Edinburgh
phone +44 (0)131 651 9157
The University of Edinburgh is a charitable body, registered in Scotland with registration number SC005336
From: Dan Staines [mailto:dstaines at ebi.ac.uk]
Sent: 12 February 2013 16:17
To: Ensembl developers list
Cc: PATERSON Trevor
Subject: Re: [ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes
On 02/12/2013 04:03 PM, PATERSON Trevor wrote:
> 1. Is the 'assembly.accession' always now a GCA identifier? I notice that some species in Ensembl vertebrates still lack an 'assembly.accession' (e.g. chicken), presumably because the assembly used doesn't have a GCA identifier.
Where set, yes - but it is not set universally for EG, as older assemblies aren't always in the assembly database (I can't speak for Ensembl (vertebrates) on this though).
> 2. Is there now really a distinction/difference between 'species.url' and 'species.production_name'
Yes - species.url often starts with a capital letter, whilst species.production_name does not (this reflects the desire to have an initial capital letter in the URL). The two can be independent of each other though. species.url really shouldn't be used outside the web interface - species.production_name is a computationally safe name that we use in our pipelines/APIs/dbs to identify species for a given release.
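To illustrate the split between the two keys, here is a minimal Python sketch. The base URL, the "/Info/Index" path, the helper names and the example meta values are all assumptions made for illustration, not something taken from the EG schema or website documentation; the only point is that species.url is used for display links and species.production_name for everything else.

```python
# Sketch only: base_url and the "/Info/Index" path are assumed, not confirmed.
def species_link(meta, base_url="http://bacteria.ensembl.org"):
    """Build a link back to the website from species.url (display use only)."""
    return f"{base_url}/{meta['species.url']}/Info/Index"


def species_key(meta):
    """Return the computationally safe name to use as a lookup key."""
    return meta["species.production_name"]


# Hypothetical meta values for one genome.
meta = {
    "species.url": "Escherichia_coli_str_k_12_substr_mg1655",
    "species.production_name": "escherichia_coli_str_k_12_substr_mg1655",
}

link = species_link(meta)  # for web pages only
key = species_key(meta)    # for pipelines, caches, joins
```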
> I think my problem is more complex and possibly not soluble
> automatically (for bacteria anyway)
> I'm not sure there is any set of rules that can automatically deduce that a bacterial species in one release is definitely the same as in another...
> - the 'assembly.accession' and 'assembly.name' change between major
> releases (maybe this isn't true for the stem of the
> 'assembly.accession' anymore, going forwards, but it is the case for
> existing archived releases)
If you use the "stem" of the assembly.accession, that should be stable (it's called the set chain in the jargon of the Genome Assembly database). For bacteria, you can use this to identify all but about 10 genomes from pre-EG17.
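As a rough sketch of what taking the "stem" could look like in Python: this assumes the accession has the shape "GCA_" plus digits, optionally followed by ".<version>", and that the set chain is everything before the version. The function name and the example accessions are illustrative only.

```python
import re

# Assumed shape: "GCA_" + digits, optionally followed by ".<version>".
_GCA = re.compile(r"^(GCA_\d+)(?:\.\d+)?$")


def set_chain(accession):
    """Return the unversioned stem of a GCA accession, or None if it is not one."""
    if not accession:
        return None
    match = _GCA.match(accession.strip())
    return match.group(1) if match else None


# Two versions of the same assembly share one stem.
same = set_chain("GCA_000005845.1") == set_chain("GCA_000005845.2")
```

Values that are missing or not GCA identifiers come back as None, so the caller can fall back to something else for those genomes.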
> - the taxon IDs are not necessarily unique (e.g. they become confusing
> once assemblies of sub-strains of sub-strains of bacteria start being
> - production names (and urls) change over time (this seems to be far
> more of a problem with bacteria than vertebrates etc.)
Yes, that's pretty much right. Just to make things harder for you, we also changed the naming conventions from EG16 to EG17, as abbreviating the species no longer made much sense. I can provide a concordance list if you need. Anyway, like I say - for bacteria, the set chain derived from the assembly accession is going to be your best bet - failing that, you can probably rely on the species name.
As you say, things are much more straightforward outside the bacterial world (though names do change, as happened for one of our fungi)
Dan Staines, PhD Ensembl Genomes Technical Coordinator
EMBL-EBI Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/