[ensembl-dev] assembly with multiple mapping paths

Sharon Wei weix at cshl.edu
Thu Jan 13 02:02:09 GMT 2011


Dear ensemblers,

Does anyone know how to use the Ensembl core API to fetch sequences from 
an assembly with multiple mapping paths? I have trouble getting correct 
sequences from $SliceAdaptor->fetch_by_region( $cs, $seq_region_name) when 
the tiling path is made up of 2 different component coordinate systems.

In this genome, there are 3 coordinate systems: chromosome, scaffold and 
contig. The sequence-level coordinate system is contig. There are two AGP 
files: one contains the scaffold tiling path built from contigs, the other 
contains the chromosome tiling path built from both scaffolds and contigs. 
Both tiling paths were loaded into the assembly table. In the meta table, 
multiple mapping paths were assigned to "assembly.mapping", including 
"chromosome|scaffold" and "chromosome|contig"; see the following tables.

mysql> select * from coord_system;
coord_system_id species_id      name    version rank    attrib
1       1       chromosome      454.2pools.2009 1       default_version
2       1       scaffold        454.2pools.2009 2       default_version
3       1       contig  454.2pools.2009 3       default_version,sequence_level


mysql> select * from meta where meta_key='assembly.mapping';
meta_id species_id      meta_key        meta_value
30      1       assembly.mapping        chromosome:454.2pools.2009|scaffold:454.2pools.2009
87      1       assembly.mapping        chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009
29      1       assembly.mapping        scaffold:454.2pools.2009|contig:454.2pools.2009
117     1       assembly.mapping        chromosome:454.2pools.2009|contig:454.2pools.2009
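
To see which pairs of coordinate systems end up with more than one path, the four assembly.mapping values above can be grouped by their first and last coordinate system. This is only a Python sketch of that bookkeeping, not the API's own logic; the names MAPPINGS and ambiguous_pairs are made up for illustration.

```python
from collections import defaultdict

# meta_value strings copied from the meta table above.
MAPPINGS = [
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009",
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009",
    "scaffold:454.2pools.2009|contig:454.2pools.2009",
    "chromosome:454.2pools.2009|contig:454.2pools.2009",
]


def ambiguous_pairs(mappings):
    """Group paths by (first, last) coordinate system; keep pairs with >1 path."""
    by_endpoints = defaultdict(list)
    for value in mappings:
        # Drop the ":version" suffix from each step of the path.
        path = [step.split(":")[0] for step in value.split("|")]
        by_endpoints[(path[0], path[-1])].append(path)
    return {pair: paths for pair, paths in by_endpoints.items() if len(paths) > 1}


if __name__ == "__main__":
    for (src, dst), paths in ambiguous_pairs(MAPPINGS).items():
        print(src, "->", dst, ["|".join(p) for p in paths])
```

With these four rows, only chromosome -> contig has two paths: one through scaffold and one direct.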

An excerpt of the chromosome AGP file follows (note that the components 
include both scaffolds and a contig):
...
O.brachyantha_V1.0      1       152187  1       W       Obrachyantha03S_1.scaffold00624 1       152187  +
O.brachyantha_V1.0      152188  152287  2       N       100     fragment        no
O.brachyantha_V1.0      152288  159378  3       W       Obrachyantha03S_1.contig00345   1       7091    -
O.brachyantha_V1.0      159379  159478  4       N       100     fragment        no
O.brachyantha_V1.0      159479  383477  5       W       Obrachyantha03S_1.scaffold00031 1       223999  +
O.brachyantha_V1.0      383478  383577  6       N       100     fragment        no
O.brachyantha_V1.0      383578  404096  7       W       Obrachyantha03S_1.scaffold00086 1       20519   -
O.brachyantha_V1.0      404097  404196  8       N       100     fragment        no
...
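
To make the mix of component types explicit, the W lines of the excerpt can be tallied by coordinate system. A minimal Python sketch follows; it assumes the coordinate system can be guessed from the component name (".scaffold" versus ".contig"), which fits this assembly's naming but is not part of the AGP format. The function names are illustrative only.

```python
# Excerpt of the chromosome AGP, one record per line (whitespace-separated).
AGP = """\
O.brachyantha_V1.0 1 152187 1 W Obrachyantha03S_1.scaffold00624 1 152187 +
O.brachyantha_V1.0 152188 152287 2 N 100 fragment no
O.brachyantha_V1.0 152288 159378 3 W Obrachyantha03S_1.contig00345 1 7091 -
O.brachyantha_V1.0 159379 159478 4 N 100 fragment no
O.brachyantha_V1.0 159479 383477 5 W Obrachyantha03S_1.scaffold00031 1 223999 +
O.brachyantha_V1.0 383478 383577 6 N 100 fragment no
O.brachyantha_V1.0 383578 404096 7 W Obrachyantha03S_1.scaffold00086 1 20519 -
O.brachyantha_V1.0 404097 404196 8 N 100 fragment no
"""


def component_spans(agp_text):
    """Return (coord_system, chr_start, chr_end, component_id) for each W line."""
    spans = []
    for line in agp_text.splitlines():
        cols = line.split()
        if len(cols) < 9 or cols[4] != "W":
            continue  # skip gap (N) lines and anything malformed
        comp_id = cols[5]
        # Assumption: the component name encodes its coordinate system.
        cs = "scaffold" if ".scaffold" in comp_id else "contig"
        spans.append((cs, int(cols[1]), int(cols[2]), comp_id))
    return spans


def bases_by_coord_system(spans):
    """Sum the chromosome bases contributed by each coordinate system."""
    totals = {}
    for cs, start, end, _ in spans:
        totals[cs] = totals.get(cs, 0) + (end - start + 1)
    return totals


if __name__ == "__main__":
    print(bases_by_coord_system(component_spans(AGP)))
```

For the excerpt this gives 396705 bases placed from scaffolds and 7091 bases placed directly from a contig, so a single chromosome draws on both coordinate systems.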

However, when I use $SliceAdaptor->fetch_by_region('chromosome', 
$seq_region_name) to retrieve chromosome sequences, all the 
scaffold-contributed regions are returned as Ns; only the regions assembled 
directly from contigs have actual sequence. I also get this warning:
"Meta table specifies multiple mapping paths between coord systems 
chromosome and contig.
   Choosing shorter path arbitrarily.
"
If I delete meta_id=117 (the chromosome|contig mapping path), the warning 
disappears, but then the regions assembled directly from contigs are 
returned as Ns.
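
The symptom can be written down as a toy model. The Python sketch below is not the Ensembl mapper; it simply assumes that only one chromosome-to-contig path is used at a time and that components reachable only through the other path come back as Ns, which matches the behaviour described above. PARTS and covered_regions are invented names for the illustration.

```python
# (component coordinate system, chromosome start, chromosome end) for the
# W lines of the AGP excerpt; gap lines are left out.
PARTS = [
    ("scaffold", 1, 152187),
    ("contig", 152288, 159378),
    ("scaffold", 159479, 383477),
    ("scaffold", 383578, 404096),
]


def covered_regions(parts, path):
    """Split parts into (with_sequence, as_Ns), assuming a single path is used.

    path is "chromosome|contig" (direct) or "chromosome|scaffold|contig".
    """
    # Assumption: the direct path resolves only contig components, and the
    # path through scaffold resolves only scaffold components.
    wanted = "contig" if path == "chromosome|contig" else "scaffold"
    with_sequence = [(s, e) for cs, s, e in parts if cs == wanted]
    as_ns = [(s, e) for cs, s, e in parts if cs != wanted]
    return with_sequence, as_ns


if __name__ == "__main__":
    for path in ("chromosome|contig", "chromosome|scaffold|contig"):
        seq, ns = covered_regions(PARTS, path)
        print(path, "sequence:", seq, "Ns:", ns)
```

Under this assumption neither path alone covers the whole chromosome, which is the problem in a nutshell.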

It seems there is no way to fetch the complete, correct chromosome 
sequence when it is made up of both scaffolds and contigs. The AGP 
specification does not restrict an object to components from a single 
coordinate system, so this should be a legitimate case.

My question is, is this a potential bug in the API? Is there any way to 
make it work by playing with the mapping paths in the meta or assembly 
table?

Any help is appreciated.

Thanks,


Sharon






