[ensembl-dev] Coordinate System anomalies in EnsemblGenomes

Magali mr6 at ebi.ac.uk
Mon Mar 4 11:42:15 GMT 2013


Hi Trevor,

If you use the Perl API, the SliceAdaptor method fetch_all can take the
argument 'toplevel', in the same way as you would call fetch_all('chromosome').

The SQL equivalent would be:

  select distinct s.seq_region_id
  from seq_region s, coord_system cs, seq_region_attrib sa, attrib_type at
  where s.seq_region_id = sa.seq_region_id
    and sa.attrib_type_id = at.attrib_type_id
    and at.code = 'toplevel'
    and s.coord_system_id = cs.coord_system_id
    and cs.species_id = 1;
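
If you want to try the query without a connection to a real database, here is a minimal Python sketch that runs it against an in-memory SQLite copy of the four tables. Everything in it is an assumption for illustration only: the tables are cut down to the columns the query touches, the rows are made up, and the helper names build_toy_db and toplevel_ids are hypothetical (they are not part of any Ensembl API).

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory stand-in for the four tables (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE coord_system (coord_system_id INTEGER PRIMARY KEY,
                                   species_id INTEGER, name TEXT, rank INTEGER);
        CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY,
                                 name TEXT, coord_system_id INTEGER);
        CREATE TABLE attrib_type (attrib_type_id INTEGER PRIMARY KEY, code TEXT);
        CREATE TABLE seq_region_attrib (seq_region_id INTEGER,
                                        attrib_type_id INTEGER, value TEXT);
    """)
    conn.executemany("INSERT INTO coord_system VALUES (?, ?, ?, ?)", [
        (1, 1, "chromosome", 1), (2, 1, "scaffold", 2),
        (3, 1, "contig", 3), (4, 2, "chromosome", 1),
    ])
    conn.executemany("INSERT INTO seq_region VALUES (?, ?, ?)", [
        (1, "chr1", 1),        # chromosome, toplevel
        (2, "scaffold_7", 2),  # unplaced scaffold, toplevel
        (3, "scaffold_1", 2),  # scaffold placed in chr1, not toplevel
        (4, "contig_1", 3),    # contig, not toplevel
        (5, "chrA", 4),        # belongs to species_id 2
    ])
    conn.executemany("INSERT INTO attrib_type VALUES (?, ?)",
                     [(1, "some_other_code"), (6, "toplevel")])
    conn.executemany("INSERT INTO seq_region_attrib VALUES (?, ?, ?)", [
        (1, 6, "1"), (2, 6, "1"), (5, 6, "1"), (3, 1, "x"),
    ])
    return conn


def toplevel_ids(conn, species_id):
    """Return the seq_region_ids carrying the 'toplevel' attribute for one species."""
    rows = conn.execute("""
        SELECT DISTINCT s.seq_region_id
        FROM seq_region s
        JOIN coord_system cs ON s.coord_system_id = cs.coord_system_id
        JOIN seq_region_attrib sra ON s.seq_region_id = sra.seq_region_id
        JOIN attrib_type aty ON sra.attrib_type_id = aty.attrib_type_id
        WHERE aty.code = 'toplevel' AND cs.species_id = ?
        ORDER BY s.seq_region_id
    """, (species_id,)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(toplevel_ids(build_toy_db(), 1))
```

The explicit joins are equivalent to the comma-separated form above; the species is passed as a bound parameter rather than hard-coded.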

In an assembly, every sequence that is part of the assembly can be
assembled into a toplevel sequence.
Depending on the strategy used by the assembly provider, contigs that
could not be assembled into bigger fragments are either duplicated as
scaffolds, which are themselves toplevel, or labelled as toplevel
directly.

Also, for the lowest ranked coordsystem, all sequences will be toplevel
sequences, but for the other coordsystems any combination is possible.
For example, if you have chromosomes, they should all be toplevel, but
any scaffolds that could not be assembled into chromosomes would also
be toplevel.
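
One way to see this mix for a given species is to count, per coordsystem, how many sequences exist and how many of them are toplevel. The following is only a sketch in Python against an in-memory SQLite toy database: the rows are invented, the tables contain only the columns needed, and the helper names build_toy_db and toplevel_counts are hypothetical.

```python
import sqlite3


def build_toy_db():
    """In-memory stand-in: 2 chromosomes, 3 scaffolds, 4 contigs (invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE coord_system (coord_system_id INTEGER PRIMARY KEY,
                                   species_id INTEGER, name TEXT, rank INTEGER);
        CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY,
                                 name TEXT, coord_system_id INTEGER);
        CREATE TABLE attrib_type (attrib_type_id INTEGER PRIMARY KEY, code TEXT);
        CREATE TABLE seq_region_attrib (seq_region_id INTEGER,
                                        attrib_type_id INTEGER, value TEXT);
    """)
    conn.executemany("INSERT INTO coord_system VALUES (?, ?, ?, ?)", [
        (1, 1, "chromosome", 1), (2, 1, "scaffold", 2), (3, 1, "contig", 3),
    ])
    conn.executemany("INSERT INTO seq_region VALUES (?, ?, ?)", [
        (1, "chr1", 1), (2, "chr2", 1),
        (3, "scaffold_1", 2), (4, "scaffold_2", 2), (5, "scaffold_99", 2),
        (6, "contig_1", 3), (7, "contig_2", 3),
        (8, "contig_3", 3), (9, "contig_4", 3),
    ])
    conn.execute("INSERT INTO attrib_type VALUES (6, 'toplevel')")
    # Both chromosomes are toplevel, plus one scaffold that was never placed.
    conn.executemany("INSERT INTO seq_region_attrib VALUES (?, ?, ?)", [
        (1, 6, "1"), (2, 6, "1"), (5, 6, "1"),
    ])
    return conn


def toplevel_counts(conn, species_id):
    """Return (coordsystem name, rank, total regions, toplevel regions) rows."""
    return conn.execute("""
        SELECT cs.name, cs.rank,
               COUNT(s.seq_region_id) AS total,
               SUM(CASE WHEN tl.seq_region_id IS NOT NULL THEN 1 ELSE 0 END)
                   AS toplevel
        FROM coord_system cs
        JOIN seq_region s ON s.coord_system_id = cs.coord_system_id
        LEFT JOIN (
            SELECT DISTINCT sra.seq_region_id
            FROM seq_region_attrib sra
            JOIN attrib_type aty ON aty.attrib_type_id = sra.attrib_type_id
            WHERE aty.code = 'toplevel'
        ) tl ON tl.seq_region_id = s.seq_region_id
        WHERE cs.species_id = ?
        GROUP BY cs.coord_system_id, cs.name, cs.rank
        ORDER BY cs.rank
    """, (species_id,)).fetchall()


if __name__ == "__main__":
    for name, rank, total, toplevel in toplevel_counts(build_toy_db(), 1):
        print(f"{name} (rank {rank}): {toplevel} of {total} toplevel")
```

In the toy data every chromosome is toplevel, one of the three scaffolds is toplevel because it was not placed on a chromosome, and none of the contigs is.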


Hope that helps,
Magali

On 04/03/13 11:33, PATERSON Trevor wrote:
> Dan and Andy
>
> thanks for your clarifications.....I think that I am sorted now 
>
> For my use case, displaying the available assemblies for a given species:
>
> case A
>
> For a species that does not have a coordsystem named 'chromosome',
> I use the lowest ranked coordsystem as my 'top display level' (which might, for example, be named 'supercontig' or 'scaffold').
> The sequences that I pull out with this coordsystem should all have the attribute code 'toplevel' [true?],
> and theoretically there could be further sequences, with other coordsystems, that also have the attribute code 'toplevel'.
>
> case B
>
> For a species that DOES have a coordsystem named 'chromosome',
> all the sequences that I pull out with this coordsystem should have the attribute code 'toplevel',
> but there may be other sequences, with different coordsystems, that also have the attribute code 'toplevel'.
> (These will be fragments or supercontigs that haven't been assembled into chromosomes.)
>
> I think that my temporary crisis in comprehension is due to a particular problem with the bacterial assembly example given: all 244 of the sequence regions labelled as toplevel (with coordsystem = 'supercontig') seem to be identical to the 244 sequence regions belonging to the sequence_level coordsystem ('contig'), so it appears that in fact no "assembly" has been imported into EnsemblGenomes for this species.
>
> thanks again
>
>
> Trevor Paterson PhD
> trevor.paterson at roslin.ed.ac.uk
> Bioinformatics 
> The Roslin Institute
> Royal (Dick) School of Veterinary Studies
> University of Edinburgh
> Easter Bush
> Midlothian
> EH25 9RG
> Scotland UK
>
> phone +44 (0)131 651 9157
>
> http://bioinformatics.roslin.ed.ac.uk/
>
>
>




