[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Roberts, Drew (RY) andrew_roberts at merck.com
Fri Nov 14 17:58:39 GMT 2014


Will-

This is very useful information, thank you very much.  If nothing else, it makes me feel better about having problems loading 1000GENOMES!

We will most likely follow your suggestions and try to implement the new features you mention, but in the immediate short term I think we’ll move our focus to some simpler tasks.  When we’re ready to move forward I’ll post to this list again and take you up on your offer of configuration help.

For now, I will add 1 question and 1 comment:

Question:  Do you have any idea when 1000GENOMES phase 3 data will be included in the standard EnsEMBL release?

Comment:  I would request that a small update be made to the vcf_import.pl documentation online, at:
http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
The current text implies that loading transcript_variation after import_vcf is a simple and supported pipeline, which it sounds like it isn’t quite yet.  If the wording were changed to warn people that it isn’t so easy, they would know up front that they need to add “add_tables transcript_variation” to their initial import_vcf.pl run if that data is required.
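
For example, assuming add_tables accepts a comma-separated list (as I read the documentation), the only change needed in a config file like ours would be one extra table name:

add_tables   compressed_genotype_region,transcript_variation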

Again, thanks for the enlightening response, and I expect that I will be following up on your ideas at some point in the not-too-horribly-distant future.

Cheers!
Drew

From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Will McLaren
Sent: Friday, November 14, 2014 12:52 AM
To: Ensembl developers list
Cc: Hughes, Jason
Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Hi Drew,

The Ensembl Variation team created the variation DB for the 1000 genomes browser site, so we do have relevant experience here.

We too found that the volume of data stretches a lot of resources in ways they don't want to be stretched. To this end we have a couple of solutions in the API that mitigate these issues to some extent.

1) We don't load the allele or population_genotype table with 1000 genomes data; in fact we started doing this with phase 1. There exists code in the API to generate allele and population_genotype objects on the fly from just the data in the compressed_genotype_var table; in order to enable this, you simply need to set the freqs_from_gts column to 1 in the population table for the relevant populations.

Note that this data is then unavailable via direct SQL query as the genotype data in compressed_genotype_var is readable only by the API.
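
To enable that for your load, the change would be something along the lines of the following, using the population name from your config (and repeating for any other populations created from your panel file):

UPDATE population SET freqs_from_gts = 1 WHERE name = '1000Genomes:phase_3:MERCK_PRELOAD';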

2) For the phase 3 data, we realised that populating compressed_genotype_region and compressed_genotype_var would also become impractical; tests indicated that each table would be around 1.5TB on disk!

To this end we developed some more API trickery to load genotype data on the fly from VCF files. The genotype objects from this are then also fed into the features mentioned above, such that all the genotype and frequency data you see at e.g. http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699 comes from the same VCFs you are using to load your DB. The same code also powers the LD views.

The API code for this is currently available on release/76 of the ensembl-variation Git repo. If you'd like some help getting this set up, let me know and I can give you more details on what to do; the code introduces a couple of extra dependencies and requires some (very simple) configuration. It has not yet made it into the master branch as it's still somewhat under development, but it is stable enough for use in a number of applications.

I realise you said you were on release/75, but it should be possible for you to simply patch your DB to release/76 using the ensembl/misc-scripts/schema_patcher.pl script. If not, it may also be possible to push the code onto the release/75 branch.
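
Roughly speaking, the patch run would look something like the command below; I'm writing the flags from memory here, so do check the script's help output for the exact names:

perl ensembl/misc-scripts/schema_patcher.pl --host <db_host> --port <db_port> --user <db_user> --pass <db_pass> --database <your_variation_db> --type variation --release 76

This applies the 75-to-76 variation schema patches to your existing database in place.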

Regarding transcript variation, the pipeline we use to populate the table is not currently documented for users outside the project. It may be possible to prepare some documentation suitable for external use. There is also a way to have import_vcf.pl use the same code and optimised caches that the Variant Effect Predictor uses, though this is alpha code at best.

I hope this helps, though I realise it doesn't necessarily address all of the issues you've been having!

Regards

Will McLaren
Ensembl Variation


On 14 November 2014 01:47, Roberts, Drew (RY) <andrew_roberts at merck.com> wrote:
Hello all-

I am new to EnsEMBL software.  I am trying to use the vcf_import.pl script to load a set of VCF files produced by phase 3 of the 1000Genomes project into a custom EnsEMBL variation database I created in-house.  I confirmed that this process works for a smaller dataset, but am having issues with 1000Genomes.  The sheer size of the dataset is proving difficult.

We are using an in-house EnsEMBL database and API tools with version 75_37...a little behind the latest, I know, but we wanted to stay with the human genome build 37 for now as most of our in-house data uses it.

I ran a number of vcf_import.pl jobs in parallel -- one for each chromosome -- with the following config file:

registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
source        1000Genomes
population    1000Genomes:phase_3:MERCK_PRELOAD
panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
species       homo_sapiens
tmpdir       /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
add_tables   compressed_genotype_region

The notable part of this is that we are loading all standard tables plus compressed_genotype_region.  We would also like to load transcript_variation, but tests on small data sets showed that this slowed down the import a lot, and the vcf_import web page claimed it was faster to populate this table later "using the standard transcript_variation pipeline", see:
http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
However I have not been able to find any documentation for this "standard pipeline", and I found an exchange on this mailing list where a user was told not to try to use it.  So:

Question 1:  Is there still not a standard transcript_variation pipeline?  If it exists, can somebody point me to it?

This upload using the config above runs, but quite slowly: a little over 200 variants per minute.  It looked like it would take at least a week and a half to finish, running 10 or 12 jobs in parallel.  The MySQL database seemed to be holding up OK, while each individual perl script consumed close to 100% of the CPU time on the processor it was running on.

About halfway through the process everything halted.  It turned out that the auto-increment "allele_id" column in the "allele" table had run out of integers!  It hit the maximum value for the 'signed int' datatype that EnsEMBL uses for this column.  I have been converting this column to use "bigint" instead of "int".  However, I wondered:

Question 2:  Has anybody else run into this problem?  In particular, have the EnsEMBL folks tried to load phase 3 1000GENOMES data yet?  It feels like I must be doing something wrong, but I've been using the standard script.
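
For reference, the conversion I am running is just a straightforward MySQL column change, something along the lines of:

ALTER TABLE allele MODIFY allele_id BIGINT NOT NULL AUTO_INCREMENT;

On a table this size, though, the rebuild itself takes quite a while.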

In general I notice that the standard EnsEMBL variation database has far fewer allele entries per variation than my custom vcf_import-loaded database does.  My custom database is also far larger overall (over a terabyte, and only halfway through the load) than the standard database, which manages to include the earlier 1000GENOMES data plus plenty of other content in much less space.  And so:

Question 3:  Am I totally going about this the wrong way?  The website I linked above says the EnsEMBL team uses vcf_import.pl to load 1000GENOMES data.  If that is true, can they tell me what table options they use?  Perhaps they are skipping tables I am keeping, or doing something else that I should know about.  Any suggestions would be welcome.

Hope this makes sense -- I am new to all this, so I might have forgotten to provide information you need.

In any event, thanks for any suggestions!
Drew


_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/
