[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Laura Clarke laura at ebi.ac.uk
Mon Nov 17 11:39:00 GMT 2014


Hello G,

The 1000 Genomes Project is currently undertaking a QC of the phase 3
call set.

We aim to provide the community with a list categorizing the different
reasons variants are absent from the phase 3 sites list.  This will let
everyone see whether a site was no longer called, was filtered out, or
was missed from the final build.  We hope to release the list as soon as
possible.

We will ensure that Ensembl gets that list and it is available for
display as soon as is feasible.

thanks

Laura

On 17 November 2014 10:50, Genomeo Dev <genomeodev at gmail.com> wrote:
> Hi Will,
>
> As you might be aware phase 3 has some known issues at the moment which
> include a considerable fraction of variants missing compared to phase 1 v3
> with no apparent reason.
>
> For example, for EUR super population, there are about 1.6 million variants
> missing.
>
>>> bcftools isec -p dir -n-1 -w1 1000GENOMES-phase_1_EUR.vcf.gz \
>>>     EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz
>
>>> gawk '{print $5}' dir/sites.txt | sort -V | uniq -c
>
> 1,619,556 10
> 68,792,800 01
>
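(For anyone reading along: column 5 of the sites.txt that bcftools isec writes is a per-file presence flag, so "10" means the site is only in the first input file, here phase 1, and "01" means it is only in the second, here phase 3. A toy reproduction of the counting step, with made-up data; plain awk works as well as gawk:)

```shell
# Toy sites.txt: CHROM POS REF ALT FLAGS (not real data).
printf '1\t100\tA\tG\t10\n1\t200\tC\tT\t01\n1\t300\tG\tA\t01\n' > sites_demo.txt
# Count sites by presence flag, as in the command above.
awk '{print $5}' sites_demo.txt | sort | uniq -c
# prints (counts left-padded by uniq -c):
#   2 01
#   1 10
```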
> See:
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_known_issues_20141030
>
> See also a recent thread on the Ensembl developers mailing list.
>
> I wonder whether Ensembl could merge the phase 1 and phase 3 data while
> waiting for a fix from 1000 Genomes.
>
> G.
>
> On 17 November 2014 09:27, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>
>> Hi Drew,
>>
>> We hope to have 1000 genomes phase 3 data available in release 79 of
>> Ensembl, which is currently due for release early next year.
>>
>> Thanks for the valuable feedback.
>>
>> Regards
>>
>> Will
>>
>> On 15 November 2014 02:58, Roberts, Drew (RY) <andrew_roberts at merck.com>
>> wrote:
>>>
>>> Will-
>>>
>>>
>>>
>>> This is very useful information, thank you very much.  If nothing else, it
>>> makes me feel better about having problems loading 1000GENOMES!
>>>
>>>
>>>
>>> We will most likely follow your suggestions and try to implement the new
>>> features you mention, but in the immediate short term I think we'll move our
>>> focus to some simpler tasks.  When we're ready to move forward I'll post to
>>> this list again and take you up on your offer of configuration help.
>>>
>>>
>>>
>>> For now, I will add 1 question and 1 comment:
>>>
>>>
>>>
>>> Question:  Do you have any idea when 1000GENOMES phase 3 data will be
>>> included in the standard EnsEMBL release?
>>>
>>>
>>>
>>> Comment:  I would request that a small update be made to the
>>> import_vcf.pl documentation online, at:
>>>
>>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
>>> The current text implies that loading transcript_variation after the
>>> initial import is a simple, supported pipeline, which it sounds like it isn't
>>> quite yet.  If the wording were changed to warn people that it isn't so easy,
>>> they would know up front that they need to add "add_tables
>>> transcript_variation" to their initial import_vcf.pl run if that data is
>>> required.
>>>
>>>
>>>
>>> Again, thanks for the enlightening response, and I expect that I will be
>>> following up on your ideas at some point in the not-too-horribly-distant
>>> future.
>>>
>>>
>>>
>>> Cheers!
>>>
>>> Drew
>>>
>>>
>>>
>>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf
>>> Of Will McLaren
>>> Sent: Friday, November 14, 2014 12:52 AM
>>> To: Ensembl developers list
>>> Cc: Hughes, Jason
>>> Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data
>>>
>>>
>>>
>>> Hi Drew,
>>>
>>>
>>>
>>> The Ensembl Variation team created the variation DB for the 1000 genomes
>>> browser site, so we do have relevant experience here.
>>>
>>>
>>>
>>> We too found that the volume of data stretches a lot of resources in ways
>>> they don't want to be stretched. To this end we have a couple of solutions
>>> in the API that mitigate these issues to some extent.
>>>
>>>
>>>
>>> 1) We don't load the allele or population_genotype table with 1000
>>> genomes data; in fact we started doing this with phase 1. There exists code
>>> in the API to generate allele and population_genotype objects on the fly
>>> from just the data in the compressed_genotype_var table; in order to enable
>>> this, you simply need to set the freqs_from_gts column to 1 in the
>>> population table for the relevant populations.
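(For anyone following along, the flag Will describes can be set with a one-line UPDATE. A minimal sketch, assuming a MySQL variation database; the population name pattern in the WHERE clause is a placeholder, so check your own population table for the real names first:)

```shell
# Sketch: enable on-the-fly allele/population_genotype generation by
# setting freqs_from_gts = 1 for the relevant populations.
# The population name pattern is a placeholder, not a known real name.
SQL="UPDATE population SET freqs_from_gts = 1 WHERE name LIKE '1000GENOMES:phase_1%';"
echo "$SQL"
# Then run it against your variation DB, e.g.:
#   mysql -h HOST -u USER -p VARIATION_DB -e "$SQL"
```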
>>>
>>>
>>>
>>> Note that this data is then unavailable via direct SQL query as the
>>> genotype data in compressed_genotype_var is readable only by the API.
>>>
>>>
>>>
>>> 2) For the phase 3 data, we realised that populating
>>> compressed_genotype_region and compressed_genotype_var would also become
>>> impractical; tests indicated that each table would be around 1.5TB on disk!
>>>
>>>
>>>
>>> To this end we developed some more API trickery to load genotype data on
>>> the fly from VCF files. The genotype objects from this are then also fed
>>> into the features mentioned above, such that all genotype and frequency data
>>> that you see at e.g.
>>> http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699
>>> come from the same VCFs you are using to load your DB. The same code also
>>> powers the LD views.
>>>
>>>
>>>
>>> The API code for this is currently available on release/76 of the
>>> ensembl-variation Git repo. If you'd like some help getting this set up, let
>>> me know and I can give you more details on what to do; the code introduces a
>>> couple of extra dependencies and requires some (very simple) configuration.
>>> It has not yet made it into the master branch as it's still somewhat under
>>> development, but it is stable enough for use for a number of applications.
>>>
>>>
>>>
>>> I realise you said you were on release/75, but it should be possible for
>>> you to simply patch your DB to release/76 using the
>>> ensembl/misc-scripts/schema_patcher.pl script. If not, it may also be
>>> possible to push the code onto the release/75 branch.
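(The patch step is a single command. The flag names below are assumptions from memory of the script, not verified against release/76, so check `schema_patcher.pl --help` before running; host, credentials, and database name are placeholders:)

```shell
# Sketch only: flag names are assumptions -- verify with --help first.
CMD="perl ensembl/misc-scripts/schema_patcher.pl \
  --host HOST --port 3306 --user USER --pass PASS \
  --database my_variation_db --type variation --from 75 --release 76"
echo "$CMD"
```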
>>>
>>>
>>>
>>> Regarding transcript variation, the pipeline we use to populate the table
>>> is not currently documented for users outside the project. It may be
>>> possible to prepare some documentation suitable for external use. There is
>>> also a way to have import_vcf.pl use the same code and optimised caches that
>>> the Variant Effect Predictor uses, though this is alpha code at best.
>>>
>>>
>>>
>>> I hope this helps, though I realise it doesn't necessarily address all of
>>> the issues you've been having!
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> Will McLaren
>>>
>>> Ensembl Variation
>>>
>>>
>>>
>>>
>>>
>>> On 14 November 2014 01:47, Roberts, Drew (RY) <andrew_roberts at merck.com>
>>> wrote:
>>>
>>> Hello all-
>>>
>>> I am new to EnsEMBL software.  I am trying to use the import_vcf.pl
>>> script to load a set of VCF files produced by phase 3 of the 1000Genomes
>>> project into a custom EnsEMBL variation database I created in-house.  I
>>> confirmed that this process works for a smaller dataset, but am having
>>> issues with 1000Genomes.  The sheer size of the dataset is proving
>>> difficult.
>>>
>>> We are using an in-house EnsEMBL database and API tools with version
>>> 75_37...a little behind the latest, I know, but we wanted to stay with the
>>> human genome build 37 for now as most of our in-house data uses it.
>>>
>>> I ran a number of import_vcf.pl jobs in parallel -- one for each
>>> chromosome -- with the following config file:
>>>
>>> registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
>>> input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
>>> source        1000Genomes
>>> population    1000Genomes:phase_3:MERCK_PRELOAD
>>> panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
>>> species       homo_sapiens
>>> tmpdir        /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
>>> add_tables    compressed_genotype_region
>>>
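(The per-chromosome parallel launch described above can be sketched as a simple loop. The config file names are hypothetical, not the poster's actual setup; `echo` prints each command rather than running it, so replace `echo` with your batch submission or run directly:)

```shell
# Hypothetical sketch: one import_vcf.pl job per chromosome, each with
# its own config file (names illustrative only).
for chr in $(seq 1 22) X; do
  echo perl import_vcf.pl --config "vcf_import_chr${chr}.cfg"
done
```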
>>> The notable part of this is that we are loading all standard tables plus
>>> compressed_genotype_region.  We would also like to load
>>> transcript_variation, but tests on small data sets showed that this slowed
>>> down the import a lot, and the vcf_import web page claimed it was faster to
>>> populate this table later "using the standard transcript_variation
>>> pipeline", see:
>>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
>>> However I have not been able to find any documentation for this "standard
>>> pipeline", and I found an exchange on this mailing list where a user was
>>> told not to try to use it.  So:
>>>
>>> Question 1:  Is there still not a standard transcript_variation pipeline?
>>> If it exists, can somebody point me to it?
>>>
>>> This upload using the config above runs, but still quite slowly...a
>>> little over 200 variants per minute.  It looked like it would take at least
>>> a week and a half to finish, running 10 or 12 jobs in parallel.  The MySQL
>>> database seemed to be holding up OK, while each individual perl script
>>> consumed close to 100% of the CPU time for the processor it was running on.
>>>
>>> About halfway through the process everything halted.  It turned out that
>>> the auto-increment "allele_id" column in the "allele" table had run out of
>>> integers!  It hit the maximum value of the signed "int" datatype which
>>> EnsEMBL uses for this column.  I have been converting the column from
>>> "int" to "bigint".  However I wondered:
>>>
>>> Question 2:  Has anybody else run into this problem?  In particular, have
>>> the EnsEMBL folks tried to load phase 3 1000GENOMES data yet?
>>> It feels like I must be doing something wrong, but I've been using the
>>> standard script.
>>>
>>> In general I notice that the standard EnsEMBL variation database has far
>>> fewer allele entries per variation than my custom import_vcf-loaded database
>>> does.  My custom database is also much larger (over a terabyte, and only
>>> halfway through the load) than the standard database, which manages to
>>> include the earlier 1000GENOMES data plus plenty of other material in far
>>> less space.  And so:
>>>
>>> Question 3:  Am I totally going about this the wrong way?  The website I
>>> linked above says the EnsEMBL team uses vcf_import.pl to load 1000GENOMES
>>> data.  If that is true, can they tell me what table options they use?
>>> Perhaps they are skipping tables I am keeping, or doing something else that
>>> I should know about.  Any suggestions would be welcome.
>>>
>>> Hope this makes sense -- I am new to all this, so I might have forgotten
>>> to provide information you need.
>>>
>>> In any event, thanks for any suggestions!
>>> Drew
>>> Notice:  This e-mail message, together with any attachments, contains
>>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
>>> New Jersey, USA 08889), and/or its affiliates Direct contact information
>>> for affiliates is available at
>>> http://www.merck.com/contact/contacts.html) that may be confidential,
>>> proprietary copyrighted and/or legally privileged. It is intended solely
>>> for the use of the individual or entity named on this message. If you are
>>> not the intended recipient, and have received this message in error,
>>> please notify us immediately by reply e-mail and then delete it from
>>> your system.
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> G.
>
>



