[ensembl-dev] Why do I get duplicated Variation Features
Andy Yates
ayates at ebi.ac.uk
Wed Jul 4 17:27:23 BST 2012
Hi Ben,
The API handles the human PAR region by holding features mapped to it only once, on the X chromosome. When you request a region which contains the PAR, the API converts your query into one which queries X for the projected region & queries Y for the non-projected regions. In the examples you are looking at we have the same RS mapped once on the X chromosome PAR and once on the Y. When querying, the X features are returned along with the Y features, but the X features are projected into Y coordinates.
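As a rough illustration of that projection (a toy sketch, not the API's actual code): assuming the first PAR on GRCh37 spans X:60001-2699520 and Y:10001-2649520, a PAR position on X differs from its Y counterpart by a constant 50000, which agrees with the rs71900610 rows quoted further down. The sub name and constants here are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical constants for the first PAR on GRCh37 (assumed, not read from the API)
my $X_PAR_START  = 60_001;
my $X_PAR_END    = 2_699_520;
my $X_TO_Y_SHIFT = 50_000;    # X position minus Y position inside this PAR

# Toy helper: returns the Y coordinate for an X position inside the PAR,
# or undef when the position lies outside it
sub project_x_to_y {
  my ($x_pos) = @_;
  return if $x_pos < $X_PAR_START || $x_pos > $X_PAR_END;
  return $x_pos - $X_TO_Y_SHIFT;
}

my $y_pos = project_x_to_y(397_778);
print "$y_pos\n";    # prints 347778
```

So a feature stored at X:397778 shows up at Y:347778 when you query Y, right on top of the copy that is stored on Y itself.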
Requesting the name of the feature slice is a programmatically cheap way (not computationally, but in terms of you coding a solution) of getting the location of variations. This can be used in a hash like so:
# Assume we have already tested somewhere that we are querying the Y chromosome
# and have assigned the result to this variable
my $is_y = 1;
my %seen;
foreach my $vf (@{$features}) {
  # Only features inside the Y PAR need checking
  my $test_location = ($is_y && $vf->seq_region_start() < 2649521) ? 1 : 0;
  my $loc;    # declared out here so it is still in scope after the if block
  if($test_location) {
    $loc = $vf->feature_Slice()->name();
    next if $seen{$loc};
  }
  #do something
  $seen{$loc} = 1 if $test_location;
}
This should eliminate your duplicates.
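For anyone who wants to try the idea without a database connection, here is a minimal self-contained sketch. It uses plain hash records as stand-ins for VariationFeature objects; the field names are hypothetical, with 'slice_name' playing the role of $vf->feature_Slice()->name(). The records reuse values from the examples in this thread.

```perl
use strict;
use warnings;

# Stand-in records instead of real VariationFeature objects (hypothetical layout)
my @features = (
  { name => 'rs71900610', start => 347778, slice_name => 'chromosome:GRCh37:Y:347778:347779:1' },
  { name => 'rs71900610', start => 347778, slice_name => 'chromosome:GRCh37:Y:347778:347779:1' },
  { name => 'rs10600708', start => 386864, slice_name => 'chromosome:GRCh37:Y:386864:386867:1' },
);

my $is_y    = 1;          # assume the query slice is on the Y chromosome
my $par_end = 2649521;    # same threshold as in the snippet above

my (%seen, @unique);
foreach my $vf (@features) {
  my $in_par = ($is_y && $vf->{start} < $par_end) ? 1 : 0;
  if ($in_par) {
    # Skip any location string we have already met; record it otherwise
    next if $seen{ $vf->{slice_name} }++;
  }
  push @unique, $vf;
}

print scalar(@unique), " unique features\n";    # prints "2 unique features"
```

Keying on the location alone would also merge two different variations that happen to share coordinates, so it may be safer to append the variation name to the key if that matters for your data.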
Andy
Andrew Yates Ensembl Core Software Project Leader
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensembl.org/
On 4 Jul 2012, at 15:59, Benoît Ballester wrote:
> Thanks Andy for the quick reply.
>
> I don't get how the $vf->feature_Slice()->name() would help in getting rid of the duplicate here (see 5th column)
>
> It still prints chromosome:GRCh37:Y:xxx:xxx
> I would have expected one to be on chromosome X, the other on the Y...
>
>
> 22:Y:116772:116785:-1 rs36189917 G/A chromosome:GRCh37:Y:116775:116775:-1 Y 116775 116775 SNP dbSNP UPSTREAM
> 22:Y:116772:116785:-1 rs36189917 G/A chromosome:GRCh37:Y:116775:116775:-1 Y 116775 116775 SNP dbSNP INTERGENIC
>
> 22:Y:116798:116811:-1 rs35600455 G/A chromosome:GRCh37:Y:116801:116801:-1 Y 116801 116801 SNP dbSNP UPSTREAM
> 22:Y:116798:116811:-1 rs35600455 G/A chromosome:GRCh37:Y:116801:116801:-1 Y 116801 116801 SNP dbSNP INTERGENIC
>
>
> Ben
>
>
> On 4 Jul 2012, at 16:40, Andy Yates wrote:
>> Hi Ben,
>>
>> It seems that the duplication is in the database:
>>
>> select sr.name, vf.seq_region_start, vf.seq_region_end, vf.seq_region_strand, vf.allele_string
>> from variation v
>> join variation_feature vf using (variation_id)
>> join seq_region sr using (seq_region_id)
>> where v.name = 'rs71900610'
>>
>>
>> name seq_region_start seq_region_end seq_region_strand allele_string
>> Y 347778 347779 1 GA/-
>> X 397778 397779 1 GA/-
>>
>>
>> For the moment if you are on the Y chromosome & are in a PAR region then I would start hashing the duplicates out (anything less than position 2649521 in Y). Fastest way to do that I think would be to make a call to $vf->feature_Slice()->name() which will give you the coordinates along with the Sequence region information in a single string.
>>
>> Andy
>>
>> On 4 Jul 2012, at 15:18, Benoît Ballester wrote:
>>
>>> It looks like the usual HAP/PAR headache _again_ :(
>>>
>>> Any idea on how not to get a VariationFeature twice when giving a slice falling in those regions?
>>>
>>> Ps: my slice comes from a $feature->slice. Do I have to transform/project my slice to top-level ? I thought it was done by default.
>>>
>>> Ben
>>>
>>>
>>> On 4 Jul 2012, at 15:46, Benoît Ballester wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to fetch some variants for some slices but get duplicated variation features. I don't understand why, as I would expect one variant per slice.
>>>>
>>>> eg:
>>>> 7:Y:347772:347784:1 rs71900610 GA/- Y 347778 347779 deletion dbSNP INTERGENIC
>>>> 7:Y:347772:347784:1 rs71900610 GA/- Y 347778 347779 deletion dbSNP INTERGENIC
>>>>
>>>> or
>>>>
>>>> 48:Y:386863:386872:1 rs10600708 ACAC/- Y 386864 386867 deletion dbSNP INTERGENIC
>>>> 48:Y:386863:386872:1 rs10600708 ACAC/- Y 386864 386867 deletion dbSNP INTERGENIC
>>>>
>>>> or
>>>>
>>>> 22:Y:116772:116785:-1 rs36189917 G/A Y 116775 116775 SNP dbSNP UPSTREAM
>>>> 22:Y:116772:116785:-1 rs36189917 G/A Y 116775 116775 SNP dbSNP INTERGENIC
>>>> (here UPSTREAM/INTERGENIC difference)
>>>>
>>>>
>>>> I am sure I am missing something obvious somewhere, but so far I couldn't put my finger on it.
>>>>
>>>>
>>>> My code is pretty straightforward :
>>>>
>>>> my $vfs = $vfa->fetch_all_by_Slice($slice);
>>>> foreach my $vf (@$vfs) {
>>>>   my $v = $vf->variation();
>>>>   #print info on slice/variant/variation-feature
>>>> }
>>>>
>>>>
>>>> Any feedback appreciated,
>>>>
>>>> Ben
>>>>
>>>> --
>>>> Benoit Ballester, PhD
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>> --
>>> Benoit Ballester, PhD
>>> Vertebrate Genomics - Ensembl
>>> European Bioinformatics Institute (EMBL-EBI)
>>> Wellcome Trust Genome Campus, Hinxton
>>> Cambridge CB10 1SD, United Kingdom
>>>
>>>
>>
>>
>
> --
> Benoit Ballester, PhD