[ensembl-dev] assembly with multiple mapping paths

Ian Longden ianl at ebi.ac.uk
Thu Jan 13 10:40:33 GMT 2011


The mapping code can only go from one coordinate system to another
via a single path; with more than one path it has no way of knowing
which is the proper one.
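
To illustrate the ambiguity, here is a rough Python sketch (not the
Ensembl mapper code, just a stand-alone illustration) that takes the
assembly.mapping values quoted further down in this thread and reports
every pair of coordinate systems that can be reached by more than one
path. The function name ambiguous_pairs is made up for this example.

```python
from collections import defaultdict

# assembly.mapping values copied from the meta table quoted in this thread
MAPPINGS = [
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009",
    "chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009",
    "scaffold:454.2pools.2009|contig:454.2pools.2009",
    "chromosome:454.2pools.2009|contig:454.2pools.2009",
]


def ambiguous_pairs(mappings):
    """Return {(start, end): [paths]} for every pair with more than one path."""
    paths = defaultdict(list)
    for value in mappings:
        # Drop the ":version" suffix and keep only the coordinate system names.
        names = tuple(part.split(":")[0] for part in value.split("|"))
        paths[(names[0], names[-1])].append(names)
    return {pair: found for pair, found in paths.items() if len(found) > 1}


for (start, end), found in ambiguous_pairs(MAPPINGS).items():
    print(f"{start} -> {end} has {len(found)} paths:")
    for names in found:
        print("    " + " -> ".join(names))
```

With these four values only chromosome -> contig is reported, which is
the same pair the API warning complains about.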

In the mapping data, do you have the chromosomes assembled from contigs
and, separately, from scaffolds (case 1)?


                                        Chr A
                                ---------------------------

                                       contig 1
                                ---------------------------



and also


                                        Chr A
                                ---------------------------

                                       scaffold 1
                                ---------------------------



or is it more like (case 2)



                                        Chr A
                                ---------------------------

                                 contig 1  scaffold 1
                                ------------   ---------------

                                                 contig 2
                                               ---------------

or  (case 3)



                                        Chr A
                                ---------------------------

                                       scaffold 1
                                ---------------------------
                                       contig 1
                                ---------------------------



Hopefully the email reader will not mess up all the spaces....

Cases 1 (though it is not a very good setup) and 3 are doable, but case 2 is not.

In case 2 you would have to add a dummy scaffold to make it work:



                                        Chr A
                                ---------------------------

                                dummy 1   scaffold 1
                                ------------   ---------------

                                 contig 1    contig 2
                                -------------- ---------------


So that the path is always chromosome -> scaffold -> contig.
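
As a rough idea of how the dummy scaffolds could be generated, here is
a Python sketch that works on the chromosome AGP lines before loading.
It is only an illustration under some assumptions: contigs are
recognised by ".contig" in the component name (as in the excerpt quoted
below), and the "dummy_" prefix and the function name
add_dummy_scaffolds are invented for this example.

```python
EXCERPT = """\
O.brachyantha_V1.0 1 152187 1 W Obrachyantha03S_1.scaffold00624 1 152187 +
O.brachyantha_V1.0 152188 152287 2 N 100 fragment no
O.brachyantha_V1.0 152288 159378 3 W Obrachyantha03S_1.contig00345 1 7091 -
"""


def add_dummy_scaffolds(chr_rows, is_contig=lambda name: ".contig" in name):
    """Split a mixed chromosome AGP into chromosome->scaffold rows
    plus the extra scaffold->contig rows for the dummy scaffolds."""
    new_chr_rows, dummy_rows = [], []
    for row in chr_rows:
        if row[4] == "W" and is_contig(row[5]):
            obj, obj_beg, obj_end, part, _, contig, c_beg, c_end, strand = row
            length = str(int(c_end) - int(c_beg) + 1)
            dummy = "dummy_" + contig  # hypothetical naming scheme
            # The chromosome now points at the dummy scaffold, same strand.
            new_chr_rows.append(
                [obj, obj_beg, obj_end, part, "W", dummy, "1", length, strand]
            )
            # The dummy scaffold is a forward-strand copy of the contig range.
            dummy_rows.append(
                [dummy, "1", length, "1", "W", contig, c_beg, c_end, "+"]
            )
        else:
            # Scaffold components and gap lines are passed through unchanged.
            new_chr_rows.append(list(row))
    return new_chr_rows, dummy_rows


rows = [line.split() for line in EXCERPT.splitlines()]
chr_out, dummy_out = add_dummy_scaffolds(rows)
for row in chr_out + dummy_out:
    print("\t".join(row))
```

The dummy scaffold rows would then be loaded together with the real
scaffold AGP, and the dummy scaffolds would need their own seq_region
entries in the scaffold coordinate system like any other scaffold.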

Does this make sense? If I could get a better idea of the data, I
would be able to help more.


-Ian
Ensembl Developer.


On Thu, Jan 13, 2011 at 2:02 AM, Sharon Wei <weix at cshl.edu> wrote:
> Dear ensemblers,
>
> Does anyone know how to use the Ensembl core API to fetch sequences from
> an assembly with multiple mapping paths? I have trouble getting correct
> sequences with $SliceAdaptor->fetch_by_region( $cs, $seq_region_name) when
> the tiling path is made up of 2 different component coordinate systems.
>
> In this genome, there are 3 coordinate systems: chromosome, scaffold,
> contig. The sequence level coordinate system is contig. There are two AGP
> files: one contains the scaffold tiling path built from contigs, the other
> contains the chromosome tiling path built from both scaffolds and contigs.
> Both tiling paths were loaded into the assembly table. In the meta table,
> multiple mapping paths were assigned to "assembly.mapping", including
> "chromosome|scaffold" and "chromosome|contig"; see the following tables.
>
> mysql> select * from coord_system;
> coord_system_id species_id      name    version rank    attrib
> 1       1       chromosome      454.2pools.2009 1       default_version
> 2       1       scaffold        454.2pools.2009 2       default_version
> 3       1       contig  454.2pools.2009 3       default_version,sequence_level
>
>
> mysql> select * from meta where meta_key='assembly.mapping';
> meta_id species_id      meta_key        meta_value
> 30      1       assembly.mapping        chromosome:454.2pools.2009|scaffold:454.2pools.2009
> 87      1       assembly.mapping        chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009
> 29      1       assembly.mapping        scaffold:454.2pools.2009|contig:454.2pools.2009
> 117     1       assembly.mapping        chromosome:454.2pools.2009|contig:454.2pools.2009
>
> An excerpt of the chromosome AGP file (notice there are both scaffold and
> contig components):
> ...
> O.brachyantha_V1.0      1       152187  1       W       Obrachyantha03S_1.scaffold00624 1       152187  +
> O.brachyantha_V1.0      152188  152287  2       N       100     fragment        no
> O.brachyantha_V1.0      152288  159378  3       W       Obrachyantha03S_1.contig00345   1       7091    -
> O.brachyantha_V1.0      159379  159478  4       N       100     fragment        no
> O.brachyantha_V1.0      159479  383477  5       W       Obrachyantha03S_1.scaffold00031 1       223999  +
> O.brachyantha_V1.0      383478  383577  6       N       100     fragment        no
> O.brachyantha_V1.0      383578  404096  7       W       Obrachyantha03S_1.scaffold00086 1       20519   -
> O.brachyantha_V1.0      404097  404196  8       N       100     fragment        no
> ...
>
> However, when I use $SliceAdaptor->fetch_by_region( chromosome,
> $seq_region_name) to retrieve chromosome sequences, all the regions
> contributed by scaffolds are returned as Ns; only regions assembled
> directly from contigs have actual sequences. I also get a warning of
> "Meta table specifies multiple mapping paths between coord systems
> chromosome and contig.
>  Choosing shorter path arbitrarily.
> "
> If I delete meta_id=117, the chromosome|contig mapping path, the warning
> disappears, but then the regions assembled directly from contigs are Ns.
>
> It seems there is no way to fetch the complete, correct chromosome sequence
> made up of both scaffolds and contigs. The AGP specification does not
> prohibit multiple component coordinate systems, so this should be a
> legitimate case.
>
> My question is, is this a potential bug in the API? Is there any way to make
> it work by playing with the mapping paths in the meta or assembly table?
>
> Any help is appreciated.
>
> Thanks,
>
>
> Sharon
>
>
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>



