[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Fri Nov 14 05:51:44 GMT 2014

Hi Drew,

The Ensembl Variation team created the variation DB for the 1000 genomes
browser site, so we do have relevant experience here.

We too found that the volume of data stretches a lot of resources in ways
they don't want to be stretched. To this end we have a couple of solutions
in the API that mitigate these issues to some extent.

1) We don't load the allele or population_genotype table with 1000 genomes
data; in fact we started doing this with phase 1. There exists code in the
API to generate allele and population_genotype objects on the fly from just
the data in the compressed_genotype_var table; in order to enable this, you
simply need to set the freqs_from_gts column to 1 in the population table
for the relevant populations.

Note that this data is then unavailable via direct SQL query as the
genotype data in compressed_genotype_var is readable only by the API.

2) For the phase 3 data, we realised that populating
compressed_genotype_region and compressed_genotype_var would also become
impractical; tests indicated that each table would be around 1.5TB on disk!

To this end we developed some more API trickery to load genotype data on
the fly from VCF files. The genotype objects from this are then also fed
into the features mentioned above, such that all genotype and frequency
data that you see at e.g.
http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699
come from the same VCFs you are using to load your DB. The same code also
powers the LD views.

The API code for this is currently available on release/76 of the
ensembl-variation Git repo. If you'd like some help getting this set up,
let me know and I can give you more details on what to do; the code
introduces a couple of extra dependencies and requires some (very simple)
configuration. It has not yet made it into the master branch as it's still
somewhat under development, but it is stable enough for use for a number of
applications.

I realise you said you were on release/75, but it should be possible for
you to simply patch your DB to release/76 using the ensembl/misc-scripts/
schema_patcher.pl script. If not it may also be possible to push the code
onto the release/75 branch too.

Regarding transcript variation, the pipeline we use to populate the table
is not currently documented for users outside the project. It may be
possible to prepare some documentation suitable for external use. There is
also a way to have import_vcf.pl use the same code and optimised caches
that the Variant Effect Predictor uses, though this is alpha code at best.

I hope this helps, though I realise it doesn't necessarily address all of
the issues you've been having!

Regards

Will McLaren
Ensembl Variation

On 14 November 2014 01:47, Roberts, Drew (RY) <andrew_roberts at merck.com>
wrote:

> Hello all-
>
> I am new to EnsEMBL software.  I am trying to use the vcf_import.pl
> script to load a set of VCF files produced by phase 3 of the 1000Genomes
> project into a custom EnsEMBL variation database I created in-house.  I
> confirmed that this process works for a smaller dataset, but am having
> issues with 1000Genomes.  The sheer size of the dataset is proving
> difficult.
>
> We are using an in-house EnsEMBL database and API tools with version
> 75_37...a little behind the latest, I know, but we wanted to stay with the
> human genome build 37 for now as most of our in-house data uses it.
>
> I ran a number of vcf_import.pl jobs in parallel -- one for each
> chromosome -- with the following config file:
>
> registry      /tmp60days/robertsa/vcf_file_import/
> vcf_import_ensembl_registry.pl
> input_file
> /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
> source        1000Genomes
> population    1000Genomes:phase_3:MERCK_PRELOAD
> panel
>  /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
> species       homo_sapiens
> tmpdir       /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
> add_tables   compressed_genotype_region
>
> The notable part of this is that we are loading all standard tables plus
> compressed_genotype_region.  We would also like to load
> transcript_variation, but tests on small data sets showed that this slowed
> down the import a lot, and the vcf_import web page claimed it was faster to
> populate this table later "using the standard transcript_variation
> pipeline", see:
> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
> However I have not been able to find any documentation for this "standard
> pipeline", and I found an exchange on this mailing list where a user was
> told not to try to use it.  So:
>
> Question 1:  Is there still not a standard transcript_variation pipeline?
> If it exists, can somebody point me to it?
>
> This upload using the config above runs, but still quite slowly...a little
> over 200 variants per minute.  It looked like it would take at least a week
> and a half to finish, running 10 or 12 jobs in parallel.  The MySQL
> database seemed to be holding up OK, while each individual perl script
> consumed close to 100% of the CPU time for the processor it was running on.
>
> About halfway through the process everything halted.  It turned out that
> the auto-increment "allele_id" column in the "allele" table had run out of
> integers!  It hit the maximum integer for the 'signed int' datatype which
> EnsEMBL uses for this column.  I have been converting the table to use
> "bigint" instead of "int" for this column.  However I wondered:
>
> Question 2:  Has anybody else run into this problem?  In particular, have
> the EnsEMBL folks encountered tried to load phase 3 1000GENOMES data yet?
> It feels like I must be doing something wrong, but I've been using the
> standard script.
>
> In general I notice that the standard EnsEMBL variation database has far
> fewer allele entries per variation than my custom vcf_import loaded
> database does.  In other ways it seems that the sheer size of my custom
> database (over a terabyte and only halfway through the load) is larger than
> the standard database, which manages to include earlier 1000GENOMES data
> plus plenty of other stuff despite its smaller size.  And so:
>
> Question 3:  Am I totally going about this the wrong way?  The website I
> linked above says the EnsEMBL team uses vcf_import.pl to load 1000GENOMES
> data.  If that is true, can they tell me what table options they use?
> Perhaps they are skipping tables I am keeping, or doing something else that
> I should know about.  Any suggestions would be welcome.
>
> Hope this makes sense -- I am new to all this, so I might have forgotten
> to provide information you need.
>
> In any event, thanks for any suggestions!
> Drew
> Notice:  This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> New Jersey, USA 08889), and/or its affiliates Direct contact information
> for affiliates is available at
> http://www.merck.com/contact/contacts.html) that may be confidential,
> proprietary copyrighted and/or legally privileged. It is intended solely
> for the use of the individual or entity named on this message. If you are
> not the intended recipient, and have received this message in error,
> please notify us immediately by reply e-mail and then delete it from
> your system.
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20141114/56d45413/attachment.html>