[ensembl-dev] Coordinate System anomalies in EnsemblGenomes

PATERSON Trevor trevor.paterson at roslin.ed.ac.uk
Mon Mar 4 11:33:43 GMT 2013


Dan and Andy,

Thanks for your clarifications. I think that I am sorted now.

For my use case, which is to display the available assemblies for a given species:

Case A

For a species that does not have a coord_system named 'chromosome',
I use the lowest-ranked coord_system as my 'top display level' (which might, for example, be named 'supercontig' or 'scaffold').
The sequences that I pull out with this coord_system should all have the attribute code 'toplevel' [true?],
and theoretically there could be further sequences, with other coord_systems, which also have the attribute code 'toplevel'.

Case B

For species that DO have a coord_system named 'chromosome',
all the sequences that I pull out with this coord_system should have the attribute code 'toplevel',
but there may be other sequences, with different coord_systems, that also have the attribute code 'toplevel'.
(These will be fragments or supercontigs that haven't been assembled into chromosomes.)
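
A minimal sketch of the selection logic in cases A and B, in plain Python rather than against the Ensembl API. The class names, region names and ranks are hypothetical, and it assumes that "lowest ranked" means the smallest rank value (rank 1):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordSystem:
    name: str
    rank: int  # assumption: rank 1 is the "lowest ranked" (highest-level) system


@dataclass(frozen=True)
class SeqRegion:
    name: str
    coord_system: str
    toplevel: bool  # True if the region carries the 'toplevel' attribute


def display_coord_system(coord_systems):
    """Case B: use 'chromosome' if present; case A: fall back to smallest rank."""
    for cs in coord_systems:
        if cs.name == "chromosome":
            return cs
    return min(coord_systems, key=lambda cs: cs.rank)


def split_toplevel(regions, display_cs):
    """Split toplevel regions into those in the display system and the rest."""
    top = [r for r in regions if r.toplevel]
    in_display = [r for r in top if r.coord_system == display_cs.name]
    elsewhere = [r for r in top if r.coord_system != display_cs.name]
    return in_display, elsewhere


# Hypothetical case B species: one chromosome plus an unplaced supercontig.
coord_systems = [CoordSystem("contig", 3), CoordSystem("supercontig", 2),
                 CoordSystem("chromosome", 1)]
regions = [
    SeqRegion("1", "chromosome", True),
    SeqRegion("sc_0042", "supercontig", True),   # not assembled into a chromosome
    SeqRegion("sc_0001", "supercontig", False),  # part of chromosome 1
    SeqRegion("ctg_0001", "contig", False),
]

display_cs = display_coord_system(coord_systems)
in_display, elsewhere = split_toplevel(regions, display_cs)
print(display_cs.name, [r.name for r in in_display], [r.name for r in elsewhere])
```

The second list is the set of toplevel fragments that a chromosome-only display would otherwise miss.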

I think that my temporary crisis in comprehension is due to a particular problem with the bacterial assembly example given: all 244 of the sequence regions labelled as toplevel (with coord_system = 'supercontig') seem to be identical to the 244 sequence regions belonging to the sequence_level coord_system ('contig'). So it appears that in fact no "assembly" has been imported into EnsemblGenomes for this species.
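
One rough way to check for this situation, as a sketch only: it assumes "identical" means the same region names and lengths, and the accessions and lengths below are invented for illustration.

```python
from collections import Counter


def same_regions(top_level, sequence_level):
    """True if both collections hold exactly the same (name, length) pairs."""
    return Counter(top_level) == Counter(sequence_level)


# Hypothetical (name, length) pairs; a real check would cover all 244 regions.
supercontigs = [("AAAA01000001", 152340), ("AAAA01000002", 98211)]
contigs = [("AAAA01000002", 98211), ("AAAA01000001", 152340)]

if same_regions(supercontigs, contigs):
    print("toplevel regions mirror the sequence_level contigs: no assembly loaded")
```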

Thanks again,


Trevor Paterson PhD
trevor.paterson at roslin.ed.ac.uk
Bioinformatics 
The Roslin Institute
Royal (Dick) School of Veterinary Studies
University of Edinburgh
Easter Bush
Midlothian
EH25 9RG
Scotland UK

phone +44 (0)131 651 9157

http://bioinformatics.roslin.ed.ac.uk/

The University of Edinburgh is a charitable body, registered in Scotland with registration number SC005336

-----Original Message-----
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Andy Yates
Sent: 02 March 2013 18:46
To: Ensembl developers list
Cc: dev at ensembl.org
Subject: Re: [ensembl-dev] Coordinate System anomalies in EnsemblGenomes

Hi

Yes, the toplevel data flag is held in seq_region_attrib and joins into attrib_type for the cv/dict table. What I think has confused things a little is that toplevel sequences must be attached to the default assembly, whose version can normally be found in the lowest ranking coord system. As you'll see, in a number of species toplevel spans multiple coordinate systems but does not span more than one version.
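
For anyone following along, the join described above can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the real MySQL core database, the tables are trimmed to the columns needed here, and every row (including the version label 'ASM_v1') is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coord_system (
    coord_system_id INTEGER PRIMARY KEY, name TEXT, version TEXT, "rank" INTEGER);
CREATE TABLE seq_region (
    seq_region_id INTEGER PRIMARY KEY, name TEXT, coord_system_id INTEGER);
CREATE TABLE attrib_type (attrib_type_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE seq_region_attrib (
    seq_region_id INTEGER, attrib_type_id INTEGER, value TEXT);

-- Made-up rows: one chromosome, one unplaced supercontig, two contigs.
INSERT INTO coord_system VALUES
    (1, 'chromosome', 'ASM_v1', 1), (2, 'supercontig', 'ASM_v1', 2),
    (3, 'contig', NULL, 3);
INSERT INTO seq_region VALUES
    (10, 'Chromosome', 1), (20, 'sc_0042', 2), (30, 'ctg_1', 3), (31, 'ctg_2', 3);
INSERT INTO attrib_type VALUES (6, 'toplevel');
INSERT INTO seq_region_attrib VALUES (10, 6, '1'), (20, 6, '1');
""")

# Count toplevel seq_regions per coord system by joining through attrib_type.
rows = conn.execute("""
    SELECT cs.name, cs.version, COUNT(*)
    FROM seq_region sr
    JOIN seq_region_attrib sra ON sra.seq_region_id = sr.seq_region_id
    JOIN attrib_type aty ON aty.attrib_type_id = sra.attrib_type_id
    JOIN coord_system cs ON cs.coord_system_id = sr.coord_system_id
    WHERE aty.code = 'toplevel'
    GROUP BY cs.name, cs.version
    ORDER BY MIN(cs."rank")
""").fetchall()

# Toplevel may span several coord systems but should share one version.
versions = {version for _, version, _ in rows}
print(rows, versions)
```

In this toy data toplevel spans both 'chromosome' and 'supercontig', yet the set of versions has a single member.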

Hope this clears up any confusion. 

Cheers,

Andy

Sent from my mobile.

On 1 Mar 2013, at 17:21, Dan Staines <dstaines at ebi.ac.uk> wrote:

> Hi Trevor,
> 
> The quick (Friday afternoon!) answer is that (according to my colleagues in core), rank doesn't have to be an exact sequence, but you should always have toplevel sequences which may be (in the case of the bacterial load) from the coord_systems chromosome or supercontig, or a mixture of the two. You should use the toplevel seq_region attribute to identify top level sequences (maybe someone from core can comment about the schema doc you reference).
> 
> As you point out, there are a significant number without chromosomal assemblies. There are also a handful of cases where a chromosome is incorrectly labelled as a supercontig due to a description line (which can be quite variable...) not matching the set of expected values - I'm looking into fixing these for the next release.
> 
> Having said that, there does seem to be something else up with the specific example in your mail, where the underlying WGS sequences have been retrieved as toplevel rather than the single assembled chromosome. This may be to do with the logic of how the assembly is retrieved from the INSDC assembly database, or the state of the assembly database at load time. I'll look into it and let you know.
> 
> Dan.
> 
> -- 
> Dan Staines, PhD               Ensembl Genomes Technical Coordinator
> EMBL-EBI                       Tel: +44-(0)1223-492507
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
