[ensembl-dev] r2-pair calculations

andrew126 at mac.com andrew126 at mac.com
Tue Jun 5 09:01:14 BST 2018


Hi Anja,

Thank you for the suggestion.  This seems similar to what I have been doing (checking for 1KG data in the variant-feature's "variant sets").  Unfortunately, it would make me think that the API 90 result is valid.

And after some digging, I can understand it is a tough problem, as reasons for markers being eliminated from LD calculations are spread across the perl modules and the c code (id doesn't exist, poor genotype/allele behavior, etc.).

I was hopeful and tried calling this:

	my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_VariationFeatures([$query_variation_feature,$query_variation_feature], $population);

... i.e. testing for a marker's LD to itself, which would empirically answer the question.  It didn't work, which, on reflection, isn't surprising .. smile.

Thanks again for the help.  I look forward to API 93.

Best,

Andrew





> On Jun 4, 2018, at 7:19 AM, Anja Thormann <anja at ebi.ac.uk> wrote:
> 
> Hi Andrew,
> 
> the best way is to check the evidence attributes of a variation feature object: http://www.ensembl.org/info/docs/Doxygen/variation-api/classBio_1_1EnsEMBL_1_1Variation_1_1VariationFeature.html#a698bae1ff1f3999af3f204f820a4c8b7 <http://www.ensembl.org/info/docs/Doxygen/variation-api/classBio_1_1EnsEMBL_1_1Variation_1_1VariationFeature.html#a698bae1ff1f3999af3f204f820a4c8b7>. If they contain the 1000Genomes attribute it is almost certain that the variant can be used for LD computation with the caveat that some will be excluded due to ID changes. This will be fixed in release/93 API code.
> 
> Anja
> 
>> On 4 Jun 2018, at 14:25, andrew126 at mac.com <mailto:andrew126 at mac.com> wrote:
>> 
>> Hi Anja,
>> 
>> Thanks very much for the helpful answer.
>> 
>> In general, is there a function/method that lets me query a particular variation feature to see if it is included in a particular API's LD calculations?  I very much would like to distinguish variation features with truly no r2 pairs >= .05 and those variation features that simply aren't included.
>> 
>> Thanks.
>> 
>> Best,
>> 
>> Andrew
>> 
>>> On Jun 4, 2018, at 4:50 AM, Anja Thormann <anja at ebi.ac.uk <mailto:anja at ebi.ac.uk>> wrote:
>>> 
>>> Hi Andrew,
>>> 
>>> we updated our 1000 Genomes Project files since release 90.
>>> 
>>> For release/90 we used ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz>
>>> 
>>> In those files rs1406071 was known under an older variant ID rs566458560
>>> 
>>> 
>>> We updated the files to ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz>
>>> 
>>> The new files correct problems that were present in the first version.
>>> 
>>> You can find a detailed report about the new files here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/README_GRCh38_liftover_20170504.txt <ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/README_GRCh38_liftover_20170504.txt>
>>> 
>>> 
>>> For release/92 we used ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz>
>>> 
>>> And in those files rs1406071 is used.
>>> 
>>> tabix ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz <ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ALL.chr17_GRCh38.genotypes.20170504.vcf.gz> 17:46148119-46148119
>>> 
>>> We are aware that mapping by ID only is not always reliable and added a mapping by location for release/93 (due end of June 2018) to our LD code.
>>> 
>>> 
>>> For human we only have genotype data from the 1000 Genomes Project. We added an evidence attribute to each variant in our database with 1000 Genomes Data. You can retrieve the attribute by calling get_all_evidence_values <http://www.ensembl.org/info/docs/Doxygen/variation-api/classBio_1_1EnsEMBL_1_1Variation_1_1VariationFeature.html#a698bae1ff1f3999af3f204f820a4c8b7> on a variation feature object and then filter for the attribute called 1000Genomes.
>>> 
>>> I don’t know how feasible it is for you to switch to the most recent API version. If you need to stay with release/90 code you could point to the new VCF files by updating you vcf_config file to
>>> https://github.com/Ensembl/ensembl-variation/blob/release/92/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json <https://github.com/Ensembl/ensembl-variation/blob/release/92/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json>
>>> 
>>> Most of all you need to use the highlighted line in your vcf_config file https://github.com/Ensembl/ensembl-variation/blob/release/92/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json#L22 <https://github.com/Ensembl/ensembl-variation/blob/release/92/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json#L22>
>>> 
>>> Please let me know if you have any further questions. Thank you also for the script you provided for debugging the problem.
>>> 
>>> Anja
>>> 
>>>> On 2 Jun 2018, at 16:10, andrew126 at mac.com <mailto:andrew126 at mac.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> rs1406071 shows identical mapping information, 1KG variant-set participation, and 1KG allele frequencies between Ensembl API 90 and 92; however, API 90 returns no high-r2 LD results, while API 92 returns high-r2 LD results.  Script to reproduce and output shown below.  I'm not sure why API 90 returns no results?
>>>> 
>>>> In general, when $ldFeatureContainerAdaptor->fetch_by_VariationFeature returns no r2 results, it either means the variation feature has no high-LD pairs or the variation feature is not in the LD dataset and so no r2 pairs can be calculated.  It seems very useful (and important) to distinguish those two possibilities when interpretting r2 results.
>>>> 
>>>> Based on the API 90 result, pre-screening for query variant features which have 1KG frequencies and 1KG variant sets is insufficient for determining if a variation feature is capable of returning 1KG high-LD results.  Is there a method to answer "can this variant feature generate any LD results?" .. presumably it would also cover tri-allelics, which I don't think report LD results either?
>>>> 
>>>> Thanks for any guidance.
>>>> 
>>>> Best regards,
>>>> 
>>>> Andrew
>>>> 
>>>> API 90 output:
>>>> 
>>>> 	The API version used is 90
>>>> 	variation-feature chromosomes:
>>>> 	        17
>>>> 	        CHR_HSCHR17_1_CTG5
>>>> 	        CHR_HSCHR17_2_CTG5
>>>> 	allele frequencies:
>>>> 	        Allele T has frequency: 0.759443339960239 in population 1000GENOMES:phase_3:EUR.
>>>> 	        Allele C has frequency: 0.240556660039761 in population 1000GENOMES:phase_3:EUR.
>>>> 	variant sets:
>>>> 	        1000 Genomes 3 - EUR
>>>> 	        1000 Genomes 3 - EUR - common
>>>> 	LD results >= r2 0.99:
>>>> 
>>>> API 92 output:
>>>> 
>>>> 	The API version used is 92
>>>> 	variation-feature chromosomes:
>>>> 		17
>>>> 		CHR_HSCHR17_1_CTG5
>>>> 		CHR_HSCHR17_2_CTG5
>>>> 	allele frequencies:
>>>> 		Allele C has frequency: 0.240556660039761 in population 1000GENOMES:phase_3:EUR.
>>>> 		Allele T has frequency: 0.759443339960239 in population 1000GENOMES:phase_3:EUR.
>>>> 	variant sets:
>>>> 		1000 Genomes 3 - EUR - common
>>>> 		1000 Genomes 3 - EUR
>>>> 	LD results >= r2 0.99:
>>>> 		rs1406071 rs74348235 1.000000
>>>> 		rs1406071 rs17660251 1.000000
>>>> 		rs1406071 rs2696707 1.000000
>>>> 		rs1406071 rs2696590 0.994581
>>>> 		rs1406071 rs2696703 1.000000
>>>> 		rs1406071 rs62061766 1.000000
>>>> 		rs1406071 rs2732596 0.994566
>>>> 		rs1406071 rs62061821 1.000000
>>>> 	Etc.
>>>> 
>>>> Script:
>>>> 
>>>> use strict;
>>>> $|=1;
>>>> use Bio::EnsEMBL::Registry;
>>>> use Bio::EnsEMBL::ApiVersion;
>>>> printf( "The API version used is %s\n", software_version() );
>>>> 
>>>> my $registry = 'Bio::EnsEMBL::Registry';
>>>> $registry->load_all();
>>>> $registry->load_registry_from_db(
>>>>    -host => 'ensembldb.ensembl.org <http://ensembldb.ensembl.org/>', # alternatively 'useastdb.ensembl.org <http://useastdb.ensembl.org/>'
>>>>    -user => 'anonymous',
>>>>    );
>>>> 
>>>> $registry->set_reconnect_when_lost(1);
>>>> my $v_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
>>>> $v_adaptor->db->use_vcf(1);
>>>> my $pop_adaptor = $registry->get_adaptor("human","variation","population");
>>>> my $ldFeatureContainerAdaptor = $registry->get_adaptor('human', 'variation', 'ldfeaturecontainer');
>>>> $ldFeatureContainerAdaptor->max_snp_distance(100000);
>>>> my $population = $pop_adaptor->fetch_by_name("1000GENOMES:phase_3:EUR");
>>>> my $variation = $v_adaptor->fetch_by_name("rs1406071");
>>>> my $vfref = $variation->get_all_VariationFeatures();
>>>> my $query_variation_feature;
>>>> print "variation-feature chromosomes:\n";
>>>> foreach my $vf (@{$vfref}) {
>>>>    print "\t".$vf->seq_region_name."\n";
>>>>    if ($vf->seq_region_name eq "17") {
>>>>        $query_variation_feature = $vf;
>>>>    }
>>>> }
>>>> print "allele frequencies:\n";
>>>> my $alleles = $query_variation_feature->variation->get_all_Alleles();
>>>> foreach my $allele (@{$alleles}) {
>>>>    next unless (defined $allele->population);
>>>>    my $allele_string   = $allele->allele;
>>>>    my $frequency       = $allele->frequency || 'NA';
>>>>    my $population_name = $allele->population->name;
>>>>    if ($population_name=~/EUR/ && $population_name=~/1000/) {
>>>>        printf("\tAllele %s has frequency: %s in population %s.\n", $allele_string, $frequency, $population_name);
>>>>    }
>>>> }
>>>> my $variant_sets = $query_variation_feature->get_all_VariationSets();
>>>> print "variant sets:\n";
>>>> if (defined $variant_sets) {
>>>>    foreach my $vs (@{$variant_sets}){
>>>>        if ($vs->name()=~/EUR/) {
>>>>            print "\t".$vs->name()."\n";
>>>>        }
>>>>    }
>>>> }
>>>> print "LD results >= r2 0.99:\n";
>>>> my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_VariationFeature($query_variation_feature,$population);
>>>> foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}){
>>>>    if ($r_square->{r2} >= 0.99){
>>>>        print "\t".$r_square->{variation1}->variation_name." ".$r_square->{variation2}->variation_name." ".$r_square->{r2}."\n";
>>>>    }
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180605/5445637f/attachment.html>


More information about the Dev mailing list