[ensembl-dev] assembly with multiple mapping paths
Sharon Wei
weix at cshl.edu
Thu Jan 13 20:18:40 GMT 2011
Hi Ian,
Thanks for the reply. Case 2 is what we have here, so the solution
is to create dummy scaffolds for those contigs.
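
A minimal sketch of that fix in Python, not a tested loader. It rewrites the chromosome AGP so that every contig placed directly on a chromosome is wrapped in a one-contig dummy scaffold, and it emits the matching scaffold-level AGP rows. The function name `add_dummy_scaffolds`, the `dummy_scaffold_` prefix, and the rule that a component is a contig when its name contains ".contig" are all assumptions made for illustration (the last one is based only on the names in the excerpt quoted below).

```python
def add_dummy_scaffolds(chr_agp_lines,
                        is_contig=lambda name: ".contig" in name,  # assumed naming rule
                        dummy_prefix="dummy_scaffold_"):           # hypothetical prefix
    """Return (chromosome AGP rows, extra scaffold AGP rows).

    Contigs placed directly on a chromosome are replaced by a dummy
    scaffold, so every path becomes chromosome -> scaffold -> contig.
    """
    chr_out, scaf_out = [], []
    n_dummy = 0
    for raw in chr_agp_lines:
        line = raw.rstrip("\n")
        cols = line.split()
        # Component rows have 9 columns; gap rows have component type N or U.
        is_component_row = (len(cols) == 9
                            and not line.startswith("#")
                            and cols[4] not in ("N", "U"))
        if not is_component_row or not is_contig(cols[5]):
            chr_out.append(line)  # scaffolds, gaps, comments pass through unchanged
            continue
        obj, obj_beg, obj_end, part, ctype, comp_id, comp_beg, comp_end, orient = cols
        n_dummy += 1
        dummy = f"{dummy_prefix}{n_dummy:06d}"
        length = int(comp_end) - int(comp_beg) + 1
        # The chromosome row now points at the dummy scaffold, keeping the orientation.
        chr_out.append("\t".join(
            [obj, obj_beg, obj_end, part, ctype, dummy, "1", str(length), orient]))
        # The dummy scaffold is built from the contig span in forward orientation.
        scaf_out.append("\t".join(
            [dummy, "1", str(length), "1", ctype, comp_id, comp_beg, comp_end, "+"]))
    return chr_out, scaf_out
```

The extra scaffold rows would be loaded together with the existing scaffold AGP, and the dummy scaffolds would need their own entries in the scaffold coordinate system. With that in place, the direct chromosome|contig entry in assembly.mapping should presumably no longer be needed.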
Sharon
On 1/13/11 5:40 AM, ian Longden wrote:
> The mapping code can only go from one coordinate system to another
> via a single path; otherwise, how would it know which path is the
> proper one?
>
> In the mapping data, do you have the chromosomes being assembled from
> contigs and also from scaffolds (case 1)?
>
>
> Chr A
> ---------------------------
>
> contig 1
> ---------------------------
>
>
>
> and also
>
>
> Chr A
> ---------------------------
>
> scaffold 1
> ---------------------------
>
>
>
> or is it more like (case 2)
>
>
>
> Chr A
> ---------------------------
>
> contig 1       scaffold 1
> ------------   ---------------
>
>                contig 2
>                ---------------
>
> or (case 3)
>
>
>
> Chr A
> ---------------------------
>
> scaffold 1
> ---------------------------
> contig 1
> ---------------------------
>
>
>
> Hopefully the email reader will not mess up all the spaces....
>
> Cases 1 (though not very good) and 3 are doable but case 2 is not.
>
> In case 2 you would have to add a dummy scaffold to make it work:-
>
>
>
> Chr A
> ---------------------------
>
> dummy 1        scaffold 1
> ------------   ---------------
>
> contig 1       contig 2
> ------------   ---------------
>
>
> So that the path is always chromosome -> scaffold -> contig.
>
> Does this make sense? If I could get a better idea about the data, I
> would be able to help more.
>
>
> -Ian
> Ensembl Developer.
>
>
> On Thu, Jan 13, 2011 at 2:02 AM, Sharon Wei<weix at cshl.edu> wrote:
>> Dear ensemblers,
>>
>> Does anyone know how to use the Ensembl core API to fetch sequences from
>> an assembly with multiple mapping paths? I have trouble getting correct
>> sequences with $SliceAdaptor->fetch_by_region( $cs, $seq_region_name) when the
>> tiling path is made up of 2 different component coordinate systems.
>>
>> In this genome, there are 3 coordinate systems: chromosome, scaffold, and
>> contig. The sequence-level coordinate system is contig. There are two AGP
>> files: one contains the scaffold tiling path from contigs, the other contains
>> the chromosome tiling path from both scaffolds and contigs. Both tiling paths
>> were loaded into the assembly table. In the meta table, multiple mapping paths
>> were assigned to "assembly.mapping", including "chromosome|scaffold" and
>> "chromosome|contig"; see the following tables.
>>
>> mysql> select * from coord_system;
>> coord_system_id  species_id  name        version          rank  attrib
>> 1                1           chromosome  454.2pools.2009  1     default_version
>> 2                1           scaffold    454.2pools.2009  2     default_version
>> 3                1           contig      454.2pools.2009  3     default_version,sequence_level
>>
>>
>> mysql> select * from meta where meta_key='assembly.mapping';
>> meta_id  species_id  meta_key          meta_value
>> 30       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009
>> 87       1           assembly.mapping  chromosome:454.2pools.2009|scaffold:454.2pools.2009|contig:454.2pools.2009
>> 29       1           assembly.mapping  scaffold:454.2pools.2009|contig:454.2pools.2009
>> 117      1           assembly.mapping  chromosome:454.2pools.2009|contig:454.2pools.2009
>>
>> An excerpt of the chromosome AGP file (notice that there are both scaffold
>> and contig components):
>> ...
>> O.brachyantha_V1.0  1       152187  1  W  Obrachyantha03S_1.scaffold00624  1  152187  +
>> O.brachyantha_V1.0  152188  152287  2  N  100  fragment  no
>> O.brachyantha_V1.0  152288  159378  3  W  Obrachyantha03S_1.contig00345    1  7091    -
>> O.brachyantha_V1.0  159379  159478  4  N  100  fragment  no
>> O.brachyantha_V1.0  159479  383477  5  W  Obrachyantha03S_1.scaffold00031  1  223999  +
>> O.brachyantha_V1.0  383478  383577  6  N  100  fragment  no
>> O.brachyantha_V1.0  383578  404096  7  W  Obrachyantha03S_1.scaffold00086  1  20519   -
>> O.brachyantha_V1.0  404097  404196  8  N  100  fragment  no
>> ...
>>
>> However, when I use $SliceAdaptor->fetch_by_region( 'chromosome',
>> $seq_region_name) to retrieve chromosome sequences, all the regions
>> contributed by scaffolds are returned as Ns; only the regions assembled
>> directly from contigs have actual sequence. I also get this warning:
>> "Meta table specifies multiple mapping paths between coord systems
>> chromosome and contig.
>> Choosing shorter path arbitrarily.
>> "
>> If I delete meta_id=117, the chromosome|contig mapping path, the warning
>> disappears, but the regions assembled directly from contigs become Ns.
>>
>> It seems there is no way to fetch the complete, correct chromosome sequence
>> when it is made up of both scaffolds and contigs. The AGP specification has
>> no restriction against multiple component coordinate systems, so this
>> should be a legitimate case.
>>
>> My question is, is this a potential bug in the API? Is there any way to make
>> it work by playing with the mapping paths in the meta or assembly table?
>>
>> Any help is appreciated.
>>
>> Thanks,
>>
>>
>> Sharon
>>
>>
>>
>>
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>>