[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Roberts, Drew (RY) andrew_roberts at merck.com
Thu Nov 13 16:47:47 GMT 2014


Hello all-

I am new to the EnsEMBL software.  I am trying to use the vcf_import.pl script to load the VCF files produced by phase 3 of the 1000Genomes project into a custom EnsEMBL variation database I created in-house.  I have confirmed that the process works for a smaller dataset, but the sheer size of the 1000Genomes data is proving difficult.

We are using an in-house EnsEMBL database and the API tools at version 75_37 -- a little behind the latest release, I know, but we want to stay on human genome build 37 for now since most of our in-house data uses it.

I ran a number of vcf_import.pl jobs in parallel -- one per chromosome -- with the following config file (a redacted sketch of the registry file it points to follows just after):

registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
source        1000Genomes
population    1000Genomes:phase_3:MERCK_PRELOAD
panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
species       homo_sapiens
tmpdir        /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
add_tables    compressed_genotype_region
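
For completeness, the registry file is just a standard registry defining core and variation adaptors for homo_sapiens; the host names, credentials and database names below are placeholders for our in-house values:

# vcf_import_ensembl_registry.pl -- redacted sketch; host/user/pass/dbname
# values here stand in for our in-house 75_37 databases.
use strict;
use warnings;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

# Core database adaptor (build 37, release 75 schema)
Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  -host    => 'inhouse-mysql-host',
  -port    => 3306,
  -user    => 'write_user',
  -pass    => 'write_pass',
  -species => 'homo_sapiens',
  -group   => 'core',
  -dbname  => 'homo_sapiens_core_75_37',
);

# Variation database adaptor -- the database the VCFs are loaded into
Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  -host    => 'inhouse-mysql-host',
  -port    => 3306,
  -user    => 'write_user',
  -pass    => 'write_pass',
  -species => 'homo_sapiens',
  -group   => 'variation',
  -dbname  => 'homo_sapiens_variation_75_37',
);

1;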

The notable part of the config is that we are loading all the standard tables plus compressed_genotype_region.  We would also like to load transcript_variation, but tests on small datasets showed that this slowed the import down considerably, and the vcf_import web page says it is faster to populate this table afterwards "using the standard transcript_variation pipeline"; see:
http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
However, I have not been able to find any documentation for this "standard pipeline", and I did find an exchange on this mailing list in which a user was told not to try to use it.  So:

Question 1:  Is there still no standard transcript_variation pipeline?  If it does exist, can somebody point me to it?

The import with the config above runs, but still quite slowly -- a little over 200 variants per minute.  At that rate it looked like it would take at least a week and a half to finish, even running 10 or 12 jobs in parallel.  The MySQL database seemed to be holding up OK, while each individual Perl script consumed close to 100% of the CPU it was running on.
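
For reference, the jobs are fanned out with a small wrapper roughly like the one below; paths are shortened, the chromosome list is abbreviated, and the invocation is simplified:

# launch_vcf_import_jobs.pl -- rough sketch of the per-chromosome fan-out
# (in practice the jobs are started in batches of 10-12 rather than all at once).
use strict;
use warnings;

my $base = '/tmp60days/robertsa/vcf_file_import';

for my $chr (1 .. 22, 'X') {
    # one copy of the config shown above per chromosome, differing only in input_file
    my $config = "$base/configs/chr${chr}.config";
    my $log    = "$base/logs/chr${chr}.log";
    system("perl vcf_import.pl --config $config > $log 2>&1 &") == 0
        or warn "could not launch job for chromosome $chr\n";
}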

About halfway through the process everything ground to a halt.  It turned out that the auto-increment "allele_id" column in the "allele" table had run out of values: it hit the maximum of the signed INT datatype the EnsEMBL schema uses for this column.  I have been converting that column from INT to BIGINT (the statement is sketched below Question 2).  However, I wondered:

Question 2:  Has anybody else run into this problem?  In particular, have the EnsEMBL folks tried to load the phase 3 1000GENOMES data yet?  It feels like I must be doing something wrong, but I have been using the standard script.
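
For completeness, the conversion I am applying is just a column widening, roughly as below; connection details are placeholders, and the same ALTER could of course be run straight from the mysql client:

# widen_allele_id.pl -- sketch of the allele_id widening; host/user/pass/dbname
# are placeholders for our in-house variation database.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:mysql:database=homo_sapiens_variation_75_37;host=inhouse-mysql-host',
    'write_user', 'write_pass',
    { RaiseError => 1 },
);

# Widen allele_id from INT to BIGINT, keeping AUTO_INCREMENT so the
# interrupted import can carry on allocating new ids where it stopped.
$dbh->do('ALTER TABLE allele MODIFY allele_id BIGINT NOT NULL AUTO_INCREMENT');

$dbh->disconnect;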

In general I notice that the standard EnsEMBL variation database has far fewer allele rows per variation than my vcf_import-loaded database does.  More broadly, my custom database is already enormous (over a terabyte, and only halfway through the load), whereas the standard database manages to include the earlier 1000GENOMES data plus plenty of other content in far less space.  And so:

Question 3:  Am I totally going about this the wrong way?  The website I linked above says the EnsEMBL team uses vcf_import.pl to load 1000GENOMES data.  If that is true, can they tell me what table options they use?  Perhaps they are skipping tables I am keeping, or doing something else that I should know about.  Any suggestions would be welcome.

Hope this makes sense -- I am new to all this, so I might have forgotten to provide information you need.

In any event, thanks for any suggestions!
Drew