[ensembl-dev] identifying known variants that map to multiple genome locations

Pablo Marin-Garcia pg4 at sanger.ac.uk
Thu Dec 9 12:33:37 GMT 2010


I have explored a bit further how ensembl deals with the haplotypic regions and 
map_weight:


select vf.variation_id, vf.variation_name, vf.map_weight, sr.name as chr, 
vf.seq_region_start as start  from variation_feature vf, seq_region sr where 
vf.seq_region_id=sr.seq_region_id and  vf.variation_id in (select variation_id 
from variation_feature where seq_region_id=27796);
variation_id    variation_name  map_weight      chr     start
48783   rs140811        1       6       9152453
48783   rs140811        1       HSCHR6_MHC_APD  9152453
48783   rs140811        1       HSCHR6_MHC_COX  9152453
48783   rs140811        1       HSCHR6_MHC_DBB  9152453
48783   rs140811        1       HSCHR6_MHC_MANN 9152453

150381  rs262389        3       17      67738061
150381  rs262389        3       2       19510112
150381  rs262389        3       9       69077043
150381  rs262389        3       HSCHR6_MHC_COX  133386353
150381  rs262389        3       HSCHR6_MHC_QBL  133393565

262381  rs394853        1       6       9154689
262381  rs394853        1       HSCHR6_MHC_APD  9154689
262381  rs394853        1       HSCHR6_MHC_COX  9154689
262381  rs394853        1       HSCHR6_MHC_MCF  9154689
262381  rs394853        1       HSCHR6_MHC_QBL  9154689
262381  rs394853        1       HSCHR6_MHC_SSTO 9154689

267399  rs401174        1       6       9147062
267399  rs401174        1       HSCHR6_MHC_COX  9147062
267399  rs401174        1       HSCHR6_MHC_MANN 9147062
267399  rs401174        1       HSCHR6_MHC_MCF  9147062
267399  rs401174        1       HSCHR6_MHC_QBL  9147062
267399  rs401174        1       HSCHR6_MHC_SSTO 9147062


As you can see here seems that ensembl API is doing the right thing with the 
map_weight. rs262389 has multiple mapping but the rest only 1


Then you have interesting cases where things does not have a clear cut:

296965  rs438053        1       15      20000521
296965  rs438053        1       HSCHR6_MHC_APD  61888502
296965  rs438053        1       HSCHR6_MHC_APD  61892262
296965  rs438053        1       HSCHR6_MHC_COX  61826727
296965  rs438053        1       HSCHR6_MHC_COX  61830487
296965  rs438053        1       HSCHR6_MHC_MANN 62058991
296965  rs438053        1       HSCHR6_MHC_MANN 62062751
296965  rs438053        1       HSCHR6_MHC_MCF  62075461
296965  rs438053        1       HSCHR6_MHC_MCF  62079221

It maps in the chr6 haplotypic region but not in chr6 reference (that is fine), 
but maps in another reference chr (still could be ok due a translocation to the 
haplotypic region), but then you have duplications in all the haplotypic 
regions. In this case map_weight should be more than one. One reason to filter 
by mapweight is to clear SNPs that would be having problems in genotyping 
studies and probably this one would give you some trouble in a GWAS.

any views about this case for rs438053?


   -Pablo


On Thu, 9 Dec 2010, Pablo Marin-Garcia wrote:

> On Thu, 9 Dec 2010, Andrea Edwards wrote:
>
>> Hi
>> 
>> My email was very confusing - I'm sorry about that. I'll try again
>> 
>> I am retrieving a set of variation feature objects like so
>> 
>> 
>> # get all VariationFeatures in a region
>> $slice = $sa->fetch_by_region('chromosome', '7');
>> foreach $vf (@{$vfa->fetch_all_by_Slice($slice)}) {
>> 
>> 
>> how would i check the variation feature object that i have returned maps to 
>> a single genome location?
>> 
>> I don't have a variation object, but now i have a vf object (which is what 
>> i was getting at in my last email but explained badly) so I could create a 
>> variation object from it and do this
>> 
>> $v = $va->fetch_by_name(vf->variation_name);
>> $count =  @{$vfa->fetch_all_by_Variation($v)}
>> if $count>  1 then it maps to more than one place
>> 
>> But this seemed circuitous and I was wondering if there was a better way.
>
> Hello Andrea,
>
> If you have a variation_feature you can look at $vf->map_weight to see if it 
> has multiple mapping but.....
>
> You need to create some filters yourself for the following cases:
>
> - before $vfa->fetch_all_by_Variation($v) check that $v exist
> - $vfa->fetch_all_by_Variation($v) could be void: no mapping or more than 3 
> mappings
> - the problem of using your $count is that you end  filtering out valid snps 
> with only one mapping on the refernece but present also in the haplotypic 
> regions like chr6 COX and others, or both chrX and Y (at least en previous 
> APIs).
>  - I deal with this by checking the count of variations with chr
>    starting by number or XY:
>
>    $count = [grep{$_->seq_region_name =~ /^[1-9XY]/} @vfeats_for_rs]
>
>    But probably there are better options.
>
> - questions to Will/Pontus:
>
> Is map_weight reporting only 1 if you map only once in the reference chrs but 
> also to the haplotypic regions?
> How are snps in chrX and Y reported, you will have two variation features for 
> a single variation?
> And which map_weight would be assigned to these SNPs?
>
>
> Finally, the reason why you should go through a variation object instead 
> passing a rs to variation_feature is beacause the variarion_feature mysql 
> table only have indexes for variarion_id and position, so you need to acces 
> it passing a Slice or Variation object.
>
> mysql> show index from variation_feature;
> +-------------------+------------+---------------+--------------+---------------------
> | Table             | Non_unique | Key_name      | Seq_in_index | Column_name
> +-------------------+------------+---------------+--------------+---------------------
> | variation_feature |          0 | PRIMARY       |            1 | 
> variation_feature_id
> | variation_feature |          1 | variation_idx |            1 | 
> variation_id
> | variation_feature |          1 | pos_idx       |            1 | 
> seq_region_id
> | variation_feature |          1 | pos_idx       |            2 | 
> seq_region_start
> | variation_feature |          1 | pos_idx       |            3 | 
> seq_region_end
> +-------------------+------------+---------------+--------------+----------------------
>
> Hope this helps
>
>  -Pablo
>
>> 
>> 
>> 
>> 
>> 
>> On 09/12/2010 09:43, Will McLaren wrote:
>>> Hi Andrea,
>>> 
>>> I'm not quite sure why you're creating a variation feature object for
>>> a known variant in that way; you should be retrieving it via the
>>> variation object using either
>>> 
>>> $v->get_all_VariationFeatures
>>> 
>>> or
>>> 
>>> $variation_feature_adaptor->fetch_all_by_Variation($v)
>>> 
>>> There is no way to get the other variation features from the variation
>>> feature itself; the quickest way (or the least amount of code) would
>>> be:
>>> 
>>> $vf->variation->get_all_VariationFeatures
>>> 
>>> Will
>>> 
>>> On 8 December 2010 23:38, Andrea Edwards<edwardsa at cs.man.ac.uk>  wrote:
>>>> Esteemed api developers,
>>>> 
>>>> Is there any easy way with the perl api to tell if a known variant (by 
>>>> this
>>>> i mean one that has a an rsID and that i have created like this:
>>>> 
>>>> # Variation feature representing a single nucleotide polymorphism
>>>> $vf = Bio::EnsEMBL::Variation::VariationFeature->new
>>>> (-start =>  100,
>>>> -end =>  100,
>>>> -strand =>  1,
>>>> -slice =>  $slice,
>>>> -allele_string =>  'A/T',
>>>> -variation_name =>  'rs635421',
>>>> -map_weight =>  1,
>>>> -variation =>  $v);
>>>> 
>>>> is known to map to another region of the genome other than the region 
>>>> have
>>>> specified with the slice/start/end properties?
>>>> 
>>>> dbSNP variants can be assigned to multiple genome locations when it 
>>>> hasn't
>>>> been possible to unambiguously assign them to a single location.
>>>> 
>>>> If i can't tell from the variation feature directly i believe i can do 
>>>> this:
>>>> 
>>>> $vfa = $reg->get_adaptor("human","variation","variationfeature");
>>>> $va = $reg->get_adaptor("human","variation","variation");
>>>> 
>>>> # fetch all genome hits for a particular variation
>>>> $v = $va->fetch_by_name('rs56');
>>>> $count =  @{$vfa->fetch_all_by_Variation($v)}
>>>> if $count>  1 then it maps to more than one place?
>>>> 
>>>> I was hoping to be able to do it directly from teh variation feature
>>>> 
>>>> thanks
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list
>>>> Dev at ensembl.org
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> 
>>>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>> 
>
>
> -----
>
>  Pablo Marin-Garcia
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>


-----

   Pablo Marin-Garcia





More information about the Dev mailing list