[ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes

Dan Staines dstaines at ebi.ac.uk
Tue Feb 12 16:16:53 GMT 2013

On 02/12/2013 04:03 PM, PATERSON Trevor wrote:
> 1. Is the 'assembly.accession' always now a GCA identifier? I notice that some species in Ensembl vertebrates still lack an 'assembly.accession' (eg chicken), presumably because the assembly used doesn't have a GCA identifier.

Where set, yes - but it is not set universally for EG, as older 
assemblies aren't always in the assembly database (I can't speak for 
Ensembl (vertebrates) on this though).

> 2. Is there now really a distinction/difference between 'species.url' and 'species.production_name'

Yes - species.url often has an initial capital letter, whilst 
species.production_name does not (this reflects the desire to have an 
initial capital letter in the URL). The two can be independent of each 
other though. species.url really shouldn't be used outside the web 
interface - species.production_name is a computationally safe name that 
we use in our pipelines/APIs/dbs to identify species for a given release.

> I think my problem is more complex and possibly not soluble automatically (for bacteria anyway)
> I'm not sure there is any set of rules that  can automatically deduce  that a bacterial species in one release is definitely the same as in another...
> - the 'assembly.accession' and 'assembly.name' change between major releases (maybe this isn't true for the stem of the 'assembly.accession' anymore, going forwards, but it is the case for existing archived releases)

If you use the "stem" of the assembly.accession, that should be stable 
(it's called the set chain in the jargon of the Genome Assembly 
database). For bacteria, you can use this to identify all but about 10 
genomes from pre-EG17.
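
As a rough illustration of taking the "stem", here is a minimal Python 
sketch. It assumes the accession looks like "GCA_" followed by digits and 
an optional ".version" suffix; the function name set_chain and the example 
accession strings are only illustrative and not part of any Ensembl API.

```python
import re
from typing import Optional

# Assumed layout: "GCA_" + digits, optionally followed by ".<version>".
_GCA_PATTERN = re.compile(r"^(GCA_\d+)(?:\.(\d+))?$")


def set_chain(assembly_accession: Optional[str]) -> Optional[str]:
    """Return the version-less stem of a GCA accession.

    Returns None when the value is missing or does not look like a
    GCA accession, since assembly.accession is not always set.
    """
    if not assembly_accession:
        return None
    match = _GCA_PATTERN.match(assembly_accession.strip())
    return match.group(1) if match else None


# Two versions of the same (illustrative) accession share one stem.
print(set_chain("GCA_000005845.1"))
print(set_chain("GCA_000005845.2"))
```

Comparing the returned stems rather than the full accessions means a 
version bump between releases does not look like a different genome.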

> - the taxon IDs are not necessarily unique (e.g. they become confusing  once assemblies of sub-strains of sub-strains of bacteria start being curated!)
> - production names (and urls) change over time ( this seems to be far more of a problem with bacteria than vertebrates etc.)

Yes, that's pretty much right. Just to make things harder for you, we 
also changed the naming conventions from EG16 to EG17 as abbreviating 
the species no longer made much sense. I can provide a concordance list 
if you need. Anyway, like I say - for bacteria, the set chain derived 
from the assembly accession is going to be your best bet - if not, you 
can probably rely on the species name.
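
The matching order described above (set chain first, species name as the 
fallback) could be sketched in Python roughly as follows. This is only an 
illustration under assumptions: each genome is represented as a plain dict 
of meta keys, the species name is assumed to live under 
"species.scientific_name", and match_genomes and the sample values are 
hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Genome = Dict[str, Optional[str]]

# Assumed meta key holding the species name; adjust to your own data.
NAME_KEY = "species.scientific_name"


def _stem(accession: Optional[str]) -> Optional[str]:
    """Drop the ".version" suffix from an accession, if there is one."""
    if not accession:
        return None
    return accession.split(".", 1)[0]


def match_genomes(
    old_release: List[Genome], new_release: List[Genome]
) -> List[Tuple[Genome, Optional[Genome], str]]:
    """Pair each old genome with a new one, by set chain then by name."""
    by_chain: Dict[str, Genome] = {}
    by_name: Dict[str, Genome] = {}
    for genome in new_release:
        chain = _stem(genome.get("assembly.accession"))
        if chain:
            by_chain.setdefault(chain, genome)
        name = genome.get(NAME_KEY)
        if name:
            by_name.setdefault(name, genome)

    pairs = []
    for genome in old_release:
        chain = _stem(genome.get("assembly.accession"))
        hit = by_chain.get(chain) if chain else None
        how = "set_chain"
        if hit is None:
            # Fall back to the species name when no set chain matches.
            name = genome.get(NAME_KEY)
            hit = by_name.get(name) if name else None
            how = "species_name" if hit is not None else "unmatched"
        pairs.append((genome, hit, how))
    return pairs
```

Genomes that come back as "unmatched" are the ones that would need a 
manual look or a concordance list.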

As you say, things are much more straightforward outside the bacterial 
world (though names do change, as happened for one of our fungi).


Dan Staines, PhD               Ensembl Genomes Technical Coordinator
EMBL-EBI                       Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
