[ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes
dstaines at ebi.ac.uk
Tue Feb 12 16:16:53 GMT 2013
On 02/12/2013 04:03 PM, PATERSON Trevor wrote:
> 1. Is the 'assembly.accession' always now a GCA identifier? I notice that some species in Ensembl vertebrates still lack an 'assembly.accession' (eg chicken), presumably because the assembly used doesn't have a GCA identifier.
Where set, yes - but it is not set universally for EG, as older
assemblies aren't always in the assembly database (I can't speak for
Ensembl (vertebrates) on this though).
> 2. Is there now really a distinction/difference between 'species.url' and 'species.production_name'
Yes - species.url often has an initial capital letter, whilst
species.production_name does not (this reflects the desire to have an
initial capital letter in the URL). The two can be independent of each
other though. species.url really shouldn't be used outside the web
interface - species.production_name is a computationally safe name that
we use in our pipelines/APIs/dbs to identify species for a given release.
> I think my problem is more complex and possibly not soluble automatically (for bacteria anyway)
> I'm not sure there is any set of rules that can automatically deduce that a bacterial species in one release is definitely the same as in another...
> - the 'assembly.accession' and 'assembly.name' change between major releases (maybe this isn't true for the stem of the 'assembly.accession' anymore, going forwards, but it is the case for existing archived releases)
If you use the "stem" of the assembly.accession, that should be stable
(it's called the set chain in the jargon of the Genome Assembly
database). For bacteria, you can use this to identify all but about 10
genomes from pre-EG17.
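A rough sketch of taking the stem, in Python. This is not EG code: it
assumes the accession has the usual "GCA_" + digits + "." + version
form, and the function name set_chain and the example accessions are
purely illustrative.

```python
import re

# Assumed format: "GCA_" followed by digits, then an optional ".<version>".
_GCA_PATTERN = re.compile(r"^(GCA_\d+)(?:\.\d+)?$")


def set_chain(accession):
    """Return the unversioned stem of a GCA accession.

    Returns None if the value is empty or does not look like a GCA
    accession, so callers can fall back to another identifier.
    """
    if not accession:
        return None
    match = _GCA_PATTERN.match(accession.strip())
    return match.group(1) if match else None


# Two versions of the same (illustrative) accession share one stem.
print(set_chain("GCA_000005845.1"))
print(set_chain("GCA_000005845.2"))
```

Returning None rather than raising makes it easy to skip genomes where
assembly.accession is not set or is not a GCA identifier.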
> - the taxon IDs are not necessarily unique (e.g. they become confusing once assemblies of sub-strains of sub-strains of bacteria start being curated!)
> - production names (and urls) change over time ( this seems to be far more of a problem with bacteria than vertebrates etc.)
Yes, that's pretty much right. Just to make things harder for you, we
also changed the naming conventions from EG16 to EG17, as abbreviating
the species name no longer made much sense. I can provide a concordance
list if you need one. Anyway, as I say - for bacteria, the set chain
derived from the assembly accession is going to be your best bet - if
not, you can probably rely on the species name.
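To illustrate that order of preference, here is a minimal Python sketch
that pairs genomes from two releases, first on the unversioned accession
and then on the species name. It is an illustration only, not our
pipeline code: it assumes each genome is a dict of meta values, that the
species name is held under 'species.scientific_name', and that keys are
unique within a release (a later duplicate simply overwrites an earlier
one). The function names are made up for the example.

```python
def stem(accession):
    """Drop the version suffix from an accession, e.g. 'X.2' -> 'X'."""
    return accession.split(".", 1)[0] if accession else None


def match_genomes(old_release, new_release):
    """Pair each genome in old_release with one in new_release.

    Both arguments are lists of dicts of meta values. Returns a list of
    (old, new, how) tuples, where how is 'set_chain', 'name' or
    'unmatched' (in which case new is None).
    """
    by_chain = {}
    by_name = {}
    for genome in new_release:
        chain = stem(genome.get("assembly.accession"))
        if chain:
            by_chain[chain] = genome
        name = genome.get("species.scientific_name")  # assumed meta key
        if name:
            by_name[name] = genome

    matches = []
    for genome in old_release:
        chain = stem(genome.get("assembly.accession"))
        name = genome.get("species.scientific_name")
        if chain and chain in by_chain:
            # Preferred: the set chain is stable across releases.
            matches.append((genome, by_chain[chain], "set_chain"))
        elif name and name in by_name:
            # Fallback: rely on the species name.
            matches.append((genome, by_name[name], "name"))
        else:
            matches.append((genome, None, "unmatched"))
    return matches
```

Anything that comes back as 'unmatched' would need a manual check or
the concordance list.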
As you say, things are much more straightforward outside the bacterial
world (though names do change, as happened for one of our fungi).
Dan Staines, PhD Ensembl Genomes Technical Coordinator
EMBL-EBI Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/