[ensembl-dev] assembly with multiple mapping paths
ian Longden
ianl at ebi.ac.uk
Thu Jan 13 10:40:33 GMT 2011
The mapping code can only go from one coordinate system to another via
a single path; otherwise, how would it know which one is the proper
one?
In the mapping data, are the chromosomes assembled from contigs in one
mapping and from scaffolds in another (case 1)
Chr A
---------------------------
contig 1
---------------------------
and also
Chr A
---------------------------
scaffold 1
---------------------------
or is it more like (case 2)
Chr A
---------------------------
contig 1     scaffold 1
------------ ---------------
             contig 2
             ---------------
or (case 3)
Chr A
---------------------------
scaffold 1
---------------------------
contig 1
---------------------------
Hopefully the email reader will not mess up all the spaces....
Cases 1 (though not very good) and 3 are doable but case 2 is not.
In case 2 you would have to add a dummy scaffold to make it work:-
Chr A
---------------------------
dummy 1      scaffold 1
------------ ---------------
contig 1     contig 2
------------ ---------------
So that the path is always chromosome -> scaffold -> contig.
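
The dummy-scaffold idea could be applied to a chromosome AGP file with something like the Python sketch below. It is only an illustration under stated assumptions: the function name, the `dummy_` prefix, and the rule that contig IDs contain ".contig" are made up for this example (the last one is taken from the naming in the quoted AGP excerpt), and each contig is assumed to be placed only once.

```python
def wrap_contigs_in_dummy_scaffolds(chr_agp_lines, is_contig=None,
                                    dummy_prefix="dummy_"):
    """Rewrite chromosome AGP rows so every directly placed contig sits
    inside a one-contig dummy scaffold.

    Returns (new_chr_lines, extra_scaffold_lines); both are lists of
    tab-separated strings without trailing newlines.
    """
    if is_contig is None:
        # Assumption: contig IDs contain ".contig", as in the quoted excerpt.
        is_contig = lambda component_id: ".contig" in component_id

    new_chr, extra_scaffold = [], []
    for raw in chr_agp_lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Pass through blank lines, comments, gap rows (N/U) and scaffold rows.
        if (not line.strip() or line.startswith("#")
                or cols[4] in ("N", "U") or not is_contig(cols[5])):
            new_chr.append(line)
            continue

        contig_id, orient = cols[5], cols[8]
        c_beg, c_end = int(cols[6]), int(cols[7])
        length = c_end - c_beg + 1
        dummy_id = dummy_prefix + contig_id  # hypothetical naming scheme

        # The chromosome row now points at the dummy scaffold and keeps
        # the original orientation.
        new_chr.append("\t".join(cols[:5] + [dummy_id, "1", str(length), orient]))
        # The dummy scaffold is built from the contig range in "+" orientation,
        # so the combined orientation is unchanged.
        extra_scaffold.append("\t".join(
            [dummy_id, "1", str(length), "1", cols[4],
             contig_id, str(c_beg), str(c_end), "+"]))
    return new_chr, extra_scaffold


if __name__ == "__main__":
    chr_agp = [
        "O.brachyantha_V1.0\t1\t152187\t1\tW\tObrachyantha03S_1.scaffold00624\t1\t152187\t+",
        "O.brachyantha_V1.0\t152188\t152287\t2\tN\t100\tfragment\tno",
        "O.brachyantha_V1.0\t152288\t159378\t3\tW\tObrachyantha03S_1.contig00345\t1\t7091\t-",
    ]
    new_chr, extra = wrap_contigs_in_dummy_scaffolds(chr_agp)
    print("\n".join(new_chr))    # chromosome tiling path, scaffolds only
    print("\n".join(extra))      # rows to append to the scaffold AGP
```

The extra rows would go into the scaffold AGP, and the dummy scaffolds would presumably need their own entries in the scaffold coordinate system before the assembly is reloaded.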
Does this make sense? If I could get a better idea of the data, I
would be able to help more.
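
A quick way to get that picture of the data might be to check, per chromosome, which kinds of component the AGP tiling path uses. The Python sketch below is an assumption-laden illustration: the helper names are hypothetical, and components are classified purely by whether their ID contains ".scaffold" or ".contig", as in the quoted excerpt. A chromosome that mixes both kinds in one tiling path looks like case 2.

```python
from collections import defaultdict


def classify_by_name(component_id):
    # Assumption: the component kind can be read off the ID, as in the
    # quoted excerpt (e.g. "Obrachyantha03S_1.contig00345").
    if ".scaffold" in component_id:
        return "scaffold"
    if ".contig" in component_id:
        return "contig"
    return "unknown"


def component_kinds(agp_lines, classify=classify_by_name):
    """Map each AGP object (column 1) to the set of component kinds it uses."""
    kinds = defaultdict(set)
    for raw in agp_lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if cols[4] in ("N", "U"):
            continue  # gap rows carry no component
        kinds[cols[0]].add(classify(cols[5]))
    return dict(kinds)


def mixed_objects(agp_lines, classify=classify_by_name):
    """Return the objects whose tiling path mixes more than one kind."""
    return sorted(obj for obj, found in component_kinds(agp_lines, classify).items()
                  if len(found) > 1)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python check_agp.py chromosome.agp
    with open(sys.argv[1]) as handle:
        for name in mixed_objects(handle):
            print(name, "mixes scaffolds and contigs (looks like case 2)")
```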
-Ian
Ensembl Developer.
On Thu, Jan 13, 2011 at 2:02 AM, Sharon Wei <weix at cshl.edu> wrote:
> Dear ensemblers,
>
> Does anyone know how to use the Ensembl core API to fetch sequences from an
> assembly with multiple mapping paths? I have trouble getting correct
> sequences with $SliceAdaptor->fetch_by_region( $cs, $seq_region_name) when the
> tiling path is made up of 2 different component coordinate systems.
>
> In this genome, there are 3 coordinate systems: chromosome, scaffold, and
> contig. The sequence-level coordinate system is contig. There are two AGP
> files: one contains the scaffold tiling path from contigs, the other contains
> the chromosome tiling path from both scaffolds and contigs. Both tiling paths
> were loaded into the assembly table. In the meta table, multiple mapping paths
> were assigned to "assembly.mapping", including "chromosome|scaffold" and
> "chromosome|contig"; see the following tables.
>
> mysql> select * from coord_system;
> coord_system_id  species_id  name        version          rank  attrib
> 1                1           chromosome  454.2pools.2009  1     default_version
> 2                1           scaffold    454.2pools.2009  2     default_version
> 3                1           contig      454.2pools.2009  3     default_version,sequence_level
>
>
> mysql> select * from meta where meta_key='assembly.mapping';
> meta_id  species_id  meta_key          meta_value
> 30       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009
> 87       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009
> 29       1           assembly.mapping  scaffold:454.2pools.2009|contig:454.2pools.2009
> 117      1           assembly.mapping  chromosome:454.2pools.2009|contig:454.2pools.2009
>
> An excerpt of the chromosome AGP file (notice there are both scaffolds and
> contigs):
> ...
> O.brachyantha_V1.0  1       152187  1  W  Obrachyantha03S_1.scaffold00624  1  152187  +
> O.brachyantha_V1.0  152188  152287  2  N  100  fragment  no
> O.brachyantha_V1.0  152288  159378  3  W  Obrachyantha03S_1.contig00345    1  7091    -
> O.brachyantha_V1.0  159379  159478  4  N  100  fragment  no
> O.brachyantha_V1.0  159479  383477  5  W  Obrachyantha03S_1.scaffold00031  1  223999  +
> O.brachyantha_V1.0  383478  383577  6  N  100  fragment  no
> O.brachyantha_V1.0  383578  404096  7  W  Obrachyantha03S_1.scaffold00086  1  20519   -
> O.brachyantha_V1.0  404097  404196  8  N  100  fragment  no
> ...
>
> However, when I use $SliceAdaptor->fetch_by_region( chromosome,
> $seq_region_name) to retrieve chromosome sequences, all the regions
> contributed by scaffolds are returned as Ns; only regions assembled directly
> from contigs have actual sequence. I also get this warning:
> "Meta table specifies multiple mapping paths between coord systems
> chromosome and contig.
> Choosing shorter path arbitrarily."
> If I delete meta_id=117 (the chromosome|contig mapping path), the warning
> disappears, but the regions assembled directly from contigs become Ns.
>
> It seems there is no way to fetch the complete, correct chromosome sequence
> when it is made up of both scaffolds and contigs. The AGP specification has
> no restriction that prevents multiple component coordinate systems, so this
> should be a legitimate case.
>
> My question is, is this a potential bug in the API? Is there any way to make
> it work by playing with the mapping paths in the meta or assembly table?
>
> Any help is appreciated.
>
> Thanks,
>
>
> Sharon
>
>
>
>