[ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes

PATERSON Trevor trevor.paterson at roslin.ed.ac.uk
Tue Feb 12 16:03:15 GMT 2013


Dan
Thanks for the clarifications.

I have a couple more minor questions...

1. Is the 'assembly.accession' now always a GCA identifier? I notice that some species in Ensembl vertebrates still lack an 'assembly.accession' (e.g. chicken), presumably because the assembly used doesn't have a GCA identifier.

2. Is there now really a distinction/difference between 'species.url' and 'species.production_name'?

I think my problem is more complex and possibly not solvable automatically (for bacteria anyway).

I'm not sure there is any set of rules that can automatically deduce that a bacterial species in one release is definitely the same as one in another...

- the 'assembly.accession' and 'assembly.name' change between major releases (maybe this is no longer true for the stem of the 'assembly.accession' going forwards, but it is the case for existing archived releases)
- the taxon IDs are not necessarily unique (e.g. they become confusing once assemblies of sub-strains of sub-strains of bacteria start being curated!)
- production names (and URLs) change over time (this seems to be far more of a problem with bacteria than with vertebrates etc.)
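
A minimal sketch of the "stem" idea from the first point above, assuming accessions follow the usual INSDC pattern of `GCA_` plus nine digits plus a `.version` suffix. The function names and the two dictionaries are hypothetical, the accession values are placeholders rather than real release data, and this would only help for releases where the stem is in fact stable:

```python
import re

# Assumed pattern: "GCA_" + 9 digits + "." + version number.
_GCA = re.compile(r"^(GCA_\d{9})\.(\d+)$")


def split_accession(accession):
    """Return (stem, version) for a GCA accession, or None if it does not parse."""
    m = _GCA.match(accession.strip())
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def match_by_stem(old_release, new_release):
    """Pair production names across two releases by shared accession stem.

    Both arguments map production_name -> assembly.accession
    (hypothetical dicts, e.g. built from each release's meta table).
    Returns {old_production_name: new_production_name}.
    """
    old_by_stem = {}
    for name, acc in old_release.items():
        parsed = split_accession(acc)
        if parsed:
            old_by_stem[parsed[0]] = name

    pairs = {}
    for name, acc in new_release.items():
        parsed = split_accession(acc)
        if parsed and parsed[0] in old_by_stem:
            pairs[old_by_stem[parsed[0]]] = name
    return pairs


# Placeholder values for illustration only.
old = {"species_a_old_name": "GCA_000000001.1", "species_b": "GCA_000000002.1"}
new = {"species_a_new_name": "GCA_000000001.2", "species_c": "GCA_000000003.1"}
print(match_by_stem(old, new))  # {'species_a_old_name': 'species_a_new_name'}
```

Genomes without a parseable accession are simply left unmatched, which is the safe default when no rule can confirm identity.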


Happily, in the non-collection databases, database names/production names are stable enough for this not to be an issue, and these can be used to provide 'species' identity.
On the rare occasions there is a species rename, I have a simple method for aliasing using a config file...
as long as I know a rename has happened, that is :)
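
For illustration, one way such config-file aliasing could look. This is a sketch under assumptions, not the actual method referred to above: the INI layout, the `[aliases]` section name and both function names are hypothetical.

```python
import configparser

# Hypothetical alias file: old production name = current production name.
ALIASES_INI = """
[aliases]
old_species_name = new_species_name
"""


def load_aliases(text):
    """Parse the alias config text into a dict of old name -> new name."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("aliases"):
        return {}
    return dict(parser["aliases"])


def canonical_name(name, aliases):
    """Follow renames until a name with no further alias is reached.

    The 'seen' set guards against an accidental cycle in the config.
    """
    seen = set()
    while name in aliases and name not in seen:
        seen.add(name)
        name = aliases[name]
    return name


aliases = load_aliases(ALIASES_INI)
print(canonical_name("old_species_name", aliases))  # new_species_name
print(canonical_name("unrelated_species", aliases))  # unrelated_species
```

Note that `configparser` lower-cases keys by default, which suits production names since they are lower case anyway.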


Thanks again
Trevor 



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-----Original Message-----
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Dan Staines
Sent: 12 February 2013 10:25
To: dev at ensembl.org
Subject: Re: [ensembl-dev] changes to organisation of bacterial collections in EnsemblGenomes

Hi Trevor,

> Could you please provide some details to help me out?
>
> *Is this change to collection organization finalized?
> *Is the distribution of species to collections arbitrary?
> *Will the distribution of particular species to particular collections 
> change with each release?

There is no taxonomic basis for assigning genomes to collections (something we considered, but the distribution doesn't lend itself to this), so the order is arbitrary as far as external users are concerned.

We aim to add new genomes to the last collection (creating a new collection database once 250 genomes have been reached) and to use the same collection name and species ID for existing genomes when reloading (leaving gaps in earlier collections when existing genomes are no longer available). However, we make no guarantee of this order or assignment, and you should not rely on collection names in your code.

Whilst the species.production_name is usually stable, true continuity between genomes can only be guaranteed via the assembly.accession, which uniquely identifies a given version of an assembly for a given genome in the INSDC Genome Assembly database (past experience suggests names are not always stable and taxon IDs are not always unique).
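
As an illustration of keying on assembly.accession rather than on collection name or species ID, here is a minimal sketch. It assumes a `meta` table reduced to three columns (`species_id`, `meta_key`, `meta_value`) and uses an in-memory SQLite database as a stand-in for a collection database; the function name and all inserted values are placeholders.

```python
import sqlite3


def species_by_accession(conn):
    """Map assembly.accession -> (species_id, production_name) for one database."""
    rows = conn.execute(
        """
        SELECT a.meta_value, a.species_id, p.meta_value
        FROM meta a
        JOIN meta p ON p.species_id = a.species_id
        WHERE a.meta_key = 'assembly.accession'
          AND p.meta_key = 'species.production_name'
        """
    ).fetchall()
    return {acc: (sid, name) for acc, sid, name in rows}


# In-memory stand-in for one collection database (placeholder values only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta (species_id INTEGER, meta_key TEXT, meta_value TEXT)"
)
conn.executemany(
    "INSERT INTO meta VALUES (?, ?, ?)",
    [
        (1, "species.production_name", "placeholder_species_a"),
        (1, "assembly.accession", "GCA_000000001.1"),
        (2, "species.production_name", "placeholder_species_b"),
        (2, "assembly.accession", "GCA_000000002.1"),
    ],
)

index = species_by_accession(conn)
print(index["GCA_000000002.1"])  # (2, 'placeholder_species_b')
```

Building this index per collection database at run time means client code never needs to hard-code which collection or species ID a genome lives under.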

> *Will homologies for bacterial genes (proteins) no longer be curated 
> in neither the 'ensembl_compara_bacteria' nor the 
> 'ensembl_compara_pan_homology' databases?

We will not provide a bacterial homology, but >100 bacterial genomes are present in pan compara (selection is based on a number of criteria, namely presence in previous versions of pan compara, presence in UniProt reference proteome sets, and level of literature citation). We do, however, provide a family-based compara for all bacteria, where proteins are grouped into families based on their PANTHER or HAMAP classification (this is all documented on bacteria.ensembl.org).

Hope this helps,

Dan.

-- 
Dan Staines, PhD               Ensembl Genomes Technical Coordinator
EMBL-EBI                       Tel: +44-(0)1223-492507
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
