[ensembl-dev] Why do I get duplicated Variation Features

Andy Yates ayates at ebi.ac.uk
Wed Jul 4 17:27:23 BST 2012


Hi Ben,

The API handles the human PAR region by holding features mapped to it only once, on the X chromosome. When you request a region that contains the PAR, the API converts your query into one that queries X for the projected region and Y for the non-projected regions. In the examples you are looking at, the same rs ID is mapped once on the X chromosome PAR and once on the Y. When you query Y, the X features are returned along with the Y features, but the X features are projected into Y coordinates, which is why both copies report the same Y location.
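
To make the projection concrete, here is a minimal sketch using the two rs71900610 rows from the database query quoted further down this thread (X:397778 and Y:347778). The constant offset and the helper name x_par_to_y are assumptions made for illustration; the API performs the real projection itself through its assembly mapping, not through code like this.

```perl
use strict;
use warnings;

# Offset between X and Y PAR coordinates, inferred from the rs71900610 rows
# (X:397778 vs Y:347778). Assumed constant across the PAR for this sketch.
my $PAR_OFFSET = 397778 - 347778;    # 50000

# Hypothetical helper: express an X PAR position in Y coordinates.
sub x_par_to_y {
  my ($x_pos) = @_;
  return $x_pos - $PAR_OFFSET;
}

# The projected X copy lands on the same Y position as the row stored on Y.
printf "X:%d -> Y:%d\n", 397778, x_par_to_y(397778);
```

Under that assumption the X copy projects to Y:347778, the same position as the copy stored on Y, so the two features cannot be told apart by location.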

Requesting the name of the feature slice is a programmatically cheap way of getting the location of a variation (cheap in terms of the code you have to write, not computationally). Because the projected X copy and the Y copy end up with the same name, it can be used as a hash key like so:

# Assume we have already checked elsewhere that we are querying the
# Y chromosome and stored the result in this variable
my $is_y = 1;
my %seen;
foreach my $vf (@{$features}) {
  # Only features inside the Y PAR (start < 2649521) can be duplicated
  my $test_location = ($is_y && $vf->seq_region_start() < 2649521) ? 1 : 0;
  my $loc;   # declared here so it is still in scope after the if block
  if($test_location) {
    $loc = $vf->feature_Slice()->name();
    next if $seen{$loc};
  }
  #do something

  $seen{$loc} = 1 if $test_location;
}

This should eliminate your duplicates.
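
For trying the filter without a database connection, here is a rough, self-contained sketch of the same idea. Plain hashes stand in for VariationFeature objects, and location_name is a hypothetical substitute for $vf->feature_Slice()->name(); the positions come from the examples in this thread, apart from one invented feature outside the PAR.

```perl
use strict;
use warnings;

# Positions on Y below this value are treated as PAR, as in the message above.
my $Y_PAR_LIMIT = 2649521;

# Hypothetical stand-in for $vf->feature_Slice()->name(), built from a plain hash.
sub location_name {
  my ($f) = @_;
  return join ':', 'chromosome', 'GRCh37', 'Y', $f->{start}, $f->{end}, $f->{strand};
}

# Return the features with PAR duplicates removed, keeping the first per location.
sub dedup_par_features {
  my ($features, $is_y) = @_;
  my (%seen, @kept);
  foreach my $f (@{$features}) {
    if ($is_y && $f->{start} < $Y_PAR_LIMIT) {
      my $loc = location_name($f);
      next if $seen{$loc}++;    # skip anything already seen at this location
    }
    push @kept, $f;
  }
  return \@kept;
}

# Mock data modelled on the rows quoted in this thread, plus one made-up
# feature outside the PAR that the filter never inspects.
my @features = (
  { name => 'rs71900610',      start => 347778,  end => 347779,  strand => 1 },
  { name => 'rs71900610',      start => 347778,  end => 347779,  strand => 1 },
  { name => 'rs10600708',      start => 386864,  end => 386867,  strand => 1 },
  { name => 'rs10600708',      start => 386864,  end => 386867,  strand => 1 },
  { name => 'example_non_par', start => 5000000, end => 5000000, strand => 1 },
);

my $kept = dedup_par_features(\@features, 1);
print scalar(@{$kept}), " features kept\n";
```

Note that keying on location alone would also drop two distinct variants that share exactly the same coordinates; adding the variation name to the key is one way to avoid that if it matters for your data.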

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 4 Jul 2012, at 15:59, Benoît Ballester wrote:

> Thanks Andy for the quick reply. 
> 
> I don't get how the $vf->feature_Slice()->name() would help in getting rid of the duplicate here (see 5th column)
> 
> It still prints chromosome:GRCh37:Y:xxx:xxx
> I would have expected one to be on chro X the other on the Y...
> 
> 
> 22:Y:116772:116785:-1  rs36189917      G/A     chromosome:GRCh37:Y:116775:116775:-1    Y       116775  116775  SNP     dbSNP   UPSTREAM
> 22:Y:116772:116785:-1  rs36189917      G/A     chromosome:GRCh37:Y:116775:116775:-1    Y       116775  116775  SNP     dbSNP   INTERGENIC
> 
> 22:Y:116798:116811:-1  rs35600455      G/A     chromosome:GRCh37:Y:116801:116801:-1    Y       116801  116801  SNP     dbSNP   UPSTREAM
> 22:Y:116798:116811:-1  rs35600455      G/A     chromosome:GRCh37:Y:116801:116801:-1    Y       116801  116801  SNP     dbSNP   INTERGENIC
> 
> 
> Ben
> 
> 
> On 4 Jul 2012, at 16:40, Andy Yates wrote:
>> Hi Ben,
>> 
>> It seems that the duplication is in the database:
>> 
>> select sr.name, vf.seq_region_start, vf.seq_region_end, vf.seq_region_strand, vf.allele_string
>> from variation v 
>> join variation_feature vf using (variation_id)
>> join seq_region sr using (seq_region_id)
>> where v.name = 'rs71900610'
>> 
>> 
>> name  seq_region_start  seq_region_end  seq_region_strand  allele_string
>> Y     347778            347779          1                  GA/-
>> X     397778            397779          1                  GA/-
>> 
>> 
>> For the moment, if you are on the Y chromosome and in a PAR region (anything below position 2649521 on Y), I would start hashing the duplicates out. I think the fastest way to do that is to call $vf->feature_Slice()->name(), which gives you the coordinates along with the sequence region information in a single string.
>> 
>> Andy
>> 
>> 
>> On 4 Jul 2012, at 15:18, Benoît Ballester wrote:
>> 
>>> It looks like the usual HAP/PAR headache _again_  :(
>>> 
>>> Any idea how to avoid getting each VariationFeature twice when the slice falls in those regions?
>>> 
>>> PS: my slice comes from a $feature->slice. Do I have to transform/project my slice to top-level? I thought it was done by default.
>>> 
>>> Ben
>>> 
>>> 
>>> On 4 Jul 2012, at 15:46, Benoît Ballester wrote:
>>> 
>>>> Hi, 
>>>> 
>>>> I am trying to fetch some variants for some slices but get duplicated variation features. I don't understand why, as I would expect one variant per slice.
>>>> 
>>>> eg: 
>>>> 7:Y:347772:347784:1  rs71900610      GA/-    Y       347778  347779  deletion        dbSNP   INTERGENIC
>>>> 7:Y:347772:347784:1  rs71900610      GA/-    Y       347778  347779  deletion        dbSNP   INTERGENIC
>>>> 
>>>> or 
>>>> 
>>>> 48:Y:386863:386872:1     rs10600708      ACAC/-  Y       386864  386867  deletion        dbSNP   INTERGENIC
>>>> 48:Y:386863:386872:1     rs10600708      ACAC/-  Y       386864  386867  deletion        dbSNP   INTERGENIC
>>>> 
>>>> or 
>>>> 
>>>> 22:Y:116772:116785:-1   rs36189917      G/A     Y       116775  116775  SNP     dbSNP   UPSTREAM
>>>> 22:Y:116772:116785:-1   rs36189917      G/A     Y       116775  116775  SNP     dbSNP   INTERGENIC
>>>> (here UPSTREAM/INTERGENIC difference)
>>>> 
>>>> 
>>>> I am sure I am missing something obvious somewhere, but so far I couldn't put my finger on it.  
>>>> 
>>>> 
>>>> My code is pretty straightforward:
>>>> 
>>>> my $vfs = $vfa->fetch_all_by_Slice($slice);
>>>> foreach my $vf (@$vfs) {
>>>>     my $v = $vf->variation();
>>>>     # print info on slice/variant/variation-feature
>>>> }
>>>> 
>>>> 
>>>> Any feedback appreciated,
>>>> 
>>>> Ben
>>>> 
>>>> --
>>>> Benoit Ballester, PhD
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>>> --
>>> Benoit Ballester, PhD
>>> Vertebrate Genomics - Ensembl
>>> European Bioinformatics Institute (EMBL-EBI)
>>> Wellcome Trust Genome Campus, Hinxton
>>> Cambridge CB10 1SD, United Kingdom
>>> 
>>> 
>> 
>> 
> 
> --
> Benoit Ballester, PhD




