[ensembl-dev] identifying known variants that map to multiple genome locations
Pablo Marin-Garcia
pg4 at sanger.ac.uk
Thu Dec 9 12:33:37 GMT 2010
I have explored a bit further how ensembl deals with the haplotypic regions and
map_weight:
select vf.variation_id, vf.variation_name, vf.map_weight, sr.name as chr,
vf.seq_region_start as start from variation_feature vf, seq_region sr where
vf.seq_region_id=sr.seq_region_id and vf.variation_id in (select variation_id
from variation_feature where seq_region_id=27796);
variation_id variation_name map_weight chr start
48783 rs140811 1 6 9152453
48783 rs140811 1 HSCHR6_MHC_APD 9152453
48783 rs140811 1 HSCHR6_MHC_COX 9152453
48783 rs140811 1 HSCHR6_MHC_DBB 9152453
48783 rs140811 1 HSCHR6_MHC_MANN 9152453
150381 rs262389 3 17 67738061
150381 rs262389 3 2 19510112
150381 rs262389 3 9 69077043
150381 rs262389 3 HSCHR6_MHC_COX 133386353
150381 rs262389 3 HSCHR6_MHC_QBL 133393565
262381 rs394853 1 6 9154689
262381 rs394853 1 HSCHR6_MHC_APD 9154689
262381 rs394853 1 HSCHR6_MHC_COX 9154689
262381 rs394853 1 HSCHR6_MHC_MCF 9154689
262381 rs394853 1 HSCHR6_MHC_QBL 9154689
262381 rs394853 1 HSCHR6_MHC_SSTO 9154689
267399 rs401174 1 6 9147062
267399 rs401174 1 HSCHR6_MHC_COX 9147062
267399 rs401174 1 HSCHR6_MHC_MANN 9147062
267399 rs401174 1 HSCHR6_MHC_MCF 9147062
267399 rs401174 1 HSCHR6_MHC_QBL 9147062
267399 rs401174 1 HSCHR6_MHC_SSTO 9147062
As you can see here seems that ensembl API is doing the right thing with the
map_weight. rs262389 has multiple mapping but the rest only 1
Then you have interesting cases where things does not have a clear cut:
296965 rs438053 1 15 20000521
296965 rs438053 1 HSCHR6_MHC_APD 61888502
296965 rs438053 1 HSCHR6_MHC_APD 61892262
296965 rs438053 1 HSCHR6_MHC_COX 61826727
296965 rs438053 1 HSCHR6_MHC_COX 61830487
296965 rs438053 1 HSCHR6_MHC_MANN 62058991
296965 rs438053 1 HSCHR6_MHC_MANN 62062751
296965 rs438053 1 HSCHR6_MHC_MCF 62075461
296965 rs438053 1 HSCHR6_MHC_MCF 62079221
It maps in the chr6 haplotypic region but not in chr6 reference (that is fine),
but maps in another reference chr (still could be ok due a translocation to the
haplotypic region), but then you have duplications in all the haplotypic
regions. In this case map_weight should be more than one. One reason to filter
by mapweight is to clear SNPs that would be having problems in genotyping
studies and probably this one would give you some trouble in a GWAS.
any views about this case for rs438053?
-Pablo
On Thu, 9 Dec 2010, Pablo Marin-Garcia wrote:
> On Thu, 9 Dec 2010, Andrea Edwards wrote:
>
>> Hi
>>
>> My email was very confusing - I'm sorry about that. I'll try again
>>
>> I am retrieving a set of variation feature objects like so
>>
>>
>> # get all VariationFeatures in a region
>> $slice = $sa->fetch_by_region('chromosome', '7');
>> foreach $vf (@{$vfa->fetch_all_by_Slice($slice)}) {
>>
>>
>> how would i check the variation feature object that i have returned maps to
>> a single genome location?
>>
>> I don't have a variation object, but now i have a vf object (which is what
>> i was getting at in my last email but explained badly) so I could create a
>> variation object from it and do this
>>
>> $v = $va->fetch_by_name(vf->variation_name);
>> $count = @{$vfa->fetch_all_by_Variation($v)}
>> if $count> 1 then it maps to more than one place
>>
>> But this seemed circuitous and I was wondering if there was a better way.
>
> Hello Andrea,
>
> If you have a variation_feature you can look at $vf->map_weight to see if it
> has multiple mapping but.....
>
> You need to create some filters yourself for the following cases:
>
> - before $vfa->fetch_all_by_Variation($v) check that $v exist
> - $vfa->fetch_all_by_Variation($v) could be void: no mapping or more than 3
> mappings
> - the problem of using your $count is that you end filtering out valid snps
> with only one mapping on the refernece but present also in the haplotypic
> regions like chr6 COX and others, or both chrX and Y (at least en previous
> APIs).
> - I deal with this by checking the count of variations with chr
> starting by number or XY:
>
> $count = [grep{$_->seq_region_name =~ /^[1-9XY]/} @vfeats_for_rs]
>
> But probably there are better options.
>
> - questions to Will/Pontus:
>
> Is map_weight reporting only 1 if you map only once in the reference chrs but
> also to the haplotypic regions?
> How are snps in chrX and Y reported, you will have two variation features for
> a single variation?
> And which map_weight would be assigned to these SNPs?
>
>
> Finally, the reason why you should go through a variation object instead
> passing a rs to variation_feature is beacause the variarion_feature mysql
> table only have indexes for variarion_id and position, so you need to acces
> it passing a Slice or Variation object.
>
> mysql> show index from variation_feature;
> +-------------------+------------+---------------+--------------+---------------------
> | Table | Non_unique | Key_name | Seq_in_index | Column_name
> +-------------------+------------+---------------+--------------+---------------------
> | variation_feature | 0 | PRIMARY | 1 |
> variation_feature_id
> | variation_feature | 1 | variation_idx | 1 |
> variation_id
> | variation_feature | 1 | pos_idx | 1 |
> seq_region_id
> | variation_feature | 1 | pos_idx | 2 |
> seq_region_start
> | variation_feature | 1 | pos_idx | 3 |
> seq_region_end
> +-------------------+------------+---------------+--------------+----------------------
>
> Hope this helps
>
> -Pablo
>
>>
>>
>>
>>
>>
>> On 09/12/2010 09:43, Will McLaren wrote:
>>> Hi Andrea,
>>>
>>> I'm not quite sure why you're creating a variation feature object for
>>> a known variant in that way; you should be retrieving it via the
>>> variation object using either
>>>
>>> $v->get_all_VariationFeatures
>>>
>>> or
>>>
>>> $variation_feature_adaptor->fetch_all_by_Variation($v)
>>>
>>> There is no way to get the other variation features from the variation
>>> feature itself; the quickest way (or the least amount of code) would
>>> be:
>>>
>>> $vf->variation->get_all_VariationFeatures
>>>
>>> Will
>>>
>>> On 8 December 2010 23:38, Andrea Edwards<edwardsa at cs.man.ac.uk> wrote:
>>>> Esteemed api developers,
>>>>
>>>> Is there any easy way with the perl api to tell if a known variant (by
>>>> this
>>>> i mean one that has a an rsID and that i have created like this:
>>>>
>>>> # Variation feature representing a single nucleotide polymorphism
>>>> $vf = Bio::EnsEMBL::Variation::VariationFeature->new
>>>> (-start => 100,
>>>> -end => 100,
>>>> -strand => 1,
>>>> -slice => $slice,
>>>> -allele_string => 'A/T',
>>>> -variation_name => 'rs635421',
>>>> -map_weight => 1,
>>>> -variation => $v);
>>>>
>>>> is known to map to another region of the genome other than the region
>>>> have
>>>> specified with the slice/start/end properties?
>>>>
>>>> dbSNP variants can be assigned to multiple genome locations when it
>>>> hasn't
>>>> been possible to unambiguously assign them to a single location.
>>>>
>>>> If i can't tell from the variation feature directly i believe i can do
>>>> this:
>>>>
>>>> $vfa = $reg->get_adaptor("human","variation","variationfeature");
>>>> $va = $reg->get_adaptor("human","variation","variation");
>>>>
>>>> # fetch all genome hits for a particular variation
>>>> $v = $va->fetch_by_name('rs56');
>>>> $count = @{$vfa->fetch_all_by_Variation($v)}
>>>> if $count> 1 then it maps to more than one place?
>>>>
>>>> I was hoping to be able to do it directly from teh variation feature
>>>>
>>>> thanks
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list
>>>> Dev at ensembl.org
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>
>>>>
>>
>>
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>>
>
>
> -----
>
> Pablo Marin-Garcia
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>
-----
Pablo Marin-Garcia
More information about the Dev
mailing list