[ensembl-dev] assembly with multiple mapping paths
Sharon Wei
weix at cshl.edu
Thu Jan 13 02:02:09 GMT 2011
Dear ensemblers,

Does anyone know how to use the Ensembl core API to fetch sequences from
an assembly with multiple mapping paths? I have trouble getting correct
sequences with $SliceAdaptor->fetch_by_region( $cs, $seq_region_name ) when
the tiling path is made up of 2 different component coordinate systems.

In this genome there are 3 coordinate systems: chromosome, scaffold and
contig. The sequence-level coordinate system is contig. There are two AGP
files: one contains the scaffold tiling path built from contigs, the other
contains the chromosome tiling path built from both scaffolds and contigs.
Both tiling paths were loaded into the assembly table. In the meta table,
multiple mapping paths were assigned to "assembly.mapping", including
"chromosome|scaffold" and "chromosome|contig"; see the following tables.
mysql> select * from coord_system;
coord_system_id  species_id  name        version          rank  attrib
1                1           chromosome  454.2pools.2009  1     default_version
2                1           scaffold    454.2pools.2009  2     default_version
3                1           contig      454.2pools.2009  3     default_version,sequence_level
mysql> select * from meta where meta_key='assembly.mapping';
meta_id  species_id  meta_key          meta_value
30       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009
87       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009
29       1           assembly.mapping  scaffold:454.2pools.2009|contig:454.2pools.2009
117      1           assembly.mapping  chromosome:454.2pools.2009|contig:454.2pools.2009
An excerpt of the chromosome AGP file (note that it contains both scaffold
and contig components):
...
O.brachyantha_V1.0 1 152187 1 W Obrachyantha03S_1.scaffold00624 1 152187 +
O.brachyantha_V1.0 152188 152287 2 N 100 fragment no
O.brachyantha_V1.0 152288 159378 3 W Obrachyantha03S_1.contig00345 1 7091 -
O.brachyantha_V1.0 159379 159478 4 N 100 fragment no
O.brachyantha_V1.0 159479 383477 5 W Obrachyantha03S_1.scaffold00031 1 223999 +
O.brachyantha_V1.0 383478 383577 6 N 100 fragment no
O.brachyantha_V1.0 383578 404096 7 W Obrachyantha03S_1.scaffold00086 1 20519 -
O.brachyantha_V1.0 404097 404196 8 N 100 fragment no
...
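To confirm which component coordinate systems a chromosome AGP file mixes, the records can be tallied with a short script. The following is only a sketch: it assumes whitespace-separated AGP columns and that the coordinate system can be guessed from the component ID containing ".scaffold" or ".contig", as in this excerpt. The helper names `component_kind` and `tally_components` are hypothetical and not part of any Ensembl tool.

```python
from collections import Counter

# Records copied from the AGP excerpt above, one record per line.
AGP_EXCERPT = """\
O.brachyantha_V1.0 1 152187 1 W Obrachyantha03S_1.scaffold00624 1 152187 +
O.brachyantha_V1.0 152188 152287 2 N 100 fragment no
O.brachyantha_V1.0 152288 159378 3 W Obrachyantha03S_1.contig00345 1 7091 -
O.brachyantha_V1.0 159379 159478 4 N 100 fragment no
O.brachyantha_V1.0 159479 383477 5 W Obrachyantha03S_1.scaffold00031 1 223999 +
O.brachyantha_V1.0 383478 383577 6 N 100 fragment no
O.brachyantha_V1.0 383578 404096 7 W Obrachyantha03S_1.scaffold00086 1 20519 -
O.brachyantha_V1.0 404097 404196 8 N 100 fragment no
"""


def component_kind(component_id):
    """Guess the coordinate system from the component ID.

    Assumption: IDs embed '.scaffold' or '.contig', as in the excerpt.
    """
    for kind in ("scaffold", "contig"):
        if f".{kind}" in component_id:
            return kind
    return "unknown"


def tally_components(agp_text):
    """Count gap, scaffold and contig records in AGP text."""
    counts = Counter()
    for line in agp_text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and AGP comment lines
        if fields[4] in ("N", "U"):  # column 5: gap record types
            counts["gap"] += 1
        else:
            counts[component_kind(fields[5])] += 1  # column 6: component ID
    return counts


if __name__ == "__main__":
    print(dict(tally_components(AGP_EXCERPT)))
```

If both "scaffold" and "contig" appear in the tally for one chromosome, that chromosome's tiling path mixes the two component coordinate systems described here.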
However, when I use $SliceAdaptor->fetch_by_region( 'chromosome',
$seq_region_name ) to retrieve chromosome sequences, all the regions
contributed by scaffolds are returned as Ns; only the regions assembled
directly from contigs have actual sequence. I also get this warning:
"Meta table specifies multiple mapping paths between coord systems
chromosome and contig.
Choosing shorter path arbitrarily."
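The warning names chromosome and contig because two of the assembly.mapping entries share those endpoints: chromosome|scaffold|contig and chromosome|contig. The sketch below only illustrates that observation on the meta values listed earlier; it is not the Ensembl API's own logic, and the function name `ambiguous_endpoints` is hypothetical.

```python
from collections import defaultdict

# assembly.mapping values copied from the meta table (meta_id 30, 87, 29, 117).
MAPPINGS = [
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009",
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009",
    "scaffold:454.2pools.2009|contig:454.2pools.2009",
    "chromosome:454.2pools.2009|contig:454.2pools.2009",
]


def ambiguous_endpoints(mappings):
    """Return {(first, last): [paths]} for endpoint pairs with more than one path."""
    by_endpoints = defaultdict(list)
    for value in mappings:
        # Keep only the coordinate-system names; drop the ':version' suffix.
        path = tuple(part.split(":")[0] for part in value.split("|"))
        by_endpoints[(path[0], path[-1])].append(path)
    return {ends: paths for ends, paths in by_endpoints.items() if len(paths) > 1}


if __name__ == "__main__":
    for (start, end), paths in ambiguous_endpoints(MAPPINGS).items():
        print(f"{start} -> {end}: {len(paths)} paths")
        for path in paths:
            print("   ", "|".join(path))
```

For these four entries, the only endpoint pair with more than one path is chromosome to contig, which matches the pair named in the warning.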
If I delete meta_id=117 (the chromosome|contig mapping path), the warning
disappears, but then the regions assembled directly from contigs are
returned as Ns instead. It seems there is no way to fetch the complete,
correct chromosome sequence made up of both scaffolds and contigs. The AGP
specification does not prohibit mixing component coordinate systems, so
this should be a legitimate case.
My question is, is this a potential bug in the API? Is there any way to
make it work by playing with the mapping paths in the meta or assembly
table?
Any help is appreciated.
Thanks,
Sharon