[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Will McLaren wm2 at ebi.ac.uk
Mon Nov 17 09:27:40 GMT 2014


Hi Drew,

We hope to have 1000 genomes phase 3 data available in release 79 of
Ensembl, which is currently due out early next year.

Thanks for the valuable feedback.

Regards

Will

On 15 November 2014 02:58, Roberts, Drew (RY) <andrew_roberts at merck.com>
wrote:

> Will-
>
>
>
> This is very useful information, thank you very much.  If nothing else, it
> makes me feel better about having problems loading 1000GENOMES!
>
>
>
> We will most likely follow your suggestions and try to implement the new
> features you mention, but in the immediate short term I think we’ll move our
> focus to some simpler tasks.  When we’re ready to move forward I’ll post to
> this list again and take you up on your offer of configuration help.
>
>
>
> For now, I will add 1 question and 1 comment:
>
>
>
> Question:  Do you have any idea when 1000GENOMES phase 3 data will be
> included in the standard Ensembl release?
>
>
>
> Comment:  I would request that a small update be made to the import_vcf.pl
> documentation online, at:
>
> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
> The current text implies that loading transcript_variation after the
> import_vcf.pl run is a simple and supported pipeline, which it sounds like
> it isn’t quite yet.  If the wording were changed to warn people that it
> isn’t so easy, they would know up front that they need to do an “add_tables
> transcript_variation” during their initial import_vcf.pl run if that data
> is required.
>
>
>
> Again, thanks for the enlightening response, and I expect that I will be
> following up on your ideas at some point in the not-too-horribly-distant
> future.
>
>
>
> Cheers!
>
> Drew
>
>
>
> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On
> Behalf Of Will McLaren
> Sent: Friday, November 14, 2014 12:52 AM
> To: Ensembl developers list
> Cc: Hughes, Jason
> Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data
>
>
>
> Hi Drew,
>
>
>
> The Ensembl Variation team created the variation DB for the 1000 genomes
> browser site, so we do have relevant experience here.
>
>
>
> We too found that the volume of data stretches a lot of resources in ways
> they don't want to be stretched. To address this, we have a couple of
> solutions in the API that mitigate these issues to some extent.
>
>
>
> 1) We don't load the allele or population_genotype tables with 1000 genomes
> data; in fact we have skipped them since phase 1. The API contains code to
> generate allele and population_genotype objects on the fly from just the
> data in the compressed_genotype_var table; to enable this, simply set the
> freqs_from_gts column to 1 in the population table for the relevant
> populations.
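>
> For example (the population names here are illustrative; match whatever
> you have loaded):
>
>   UPDATE population
>   SET    freqs_from_gts = 1
>   WHERE  name LIKE '1000GENOMES:phase_3:%';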
>
>
>
> Note that this data is then unavailable via direct SQL query as the
> genotype data in compressed_genotype_var is readable only by the API.
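>
> If you need programmatic access, go through the API instead; a minimal
> sketch (connection details are placeholders for your in-house server):
>
>   use strict;
>   use warnings;
>   use Bio::EnsEMBL::Registry;
>
>   # Placeholder connection details for your in-house mirror.
>   my $registry = 'Bio::EnsEMBL::Registry';
>   $registry->load_registry_from_db(
>       -host => 'your-db-host',
>       -user => 'anonymous',
>   );
>
>   my $va  = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');
>   my $var = $va->fetch_by_name('rs699');
>
>   # With freqs_from_gts set, these Allele objects are generated on the
>   # fly from compressed_genotype_var rather than read from the allele table.
>   foreach my $al (@{ $var->get_all_Alleles }) {
>       printf "%s\t%s\t%s\n",
>           $al->allele,
>           $al->population ? $al->population->name : '-',
>           defined $al->frequency ? $al->frequency : 'NA';
>   }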
>
>
>
> 2) For the phase 3 data, we realised that populating
> compressed_genotype_region and compressed_genotype_var would also become
> impractical; tests indicated that each table would be around 1.5TB on disk!
>
>
>
> To this end we developed some more API trickery to load genotype data on
> the fly from VCF files. The genotype objects from this are then also fed
> into the features mentioned above, such that all genotype and frequency
> data that you see at e.g.
> http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699
> come from the same VCFs you are using to load your DB. The same code also
> powers the LD views.
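>
> From the API side the access pattern is unchanged; the same calls return
> genotype objects whether they are read from the database or parsed from
> the VCFs. Continuing the sketch above:
>
>   # Individual genotypes for the variant; with the VCF code enabled
>   # these are read from the VCF files on the fly.
>   foreach my $gt (@{ $var->get_all_IndividualGenotypes }) {
>       printf "%s\t%s\n", $gt->individual->name, $gt->genotype_string;
>   }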
>
>
>
> The API code for this is currently available on the release/76 branch of
> the ensembl-variation Git repo. If you'd like some help getting this set
> up, let me know and I can give you more details on what to do; the code
> introduces a couple of extra dependencies and requires some (very simple)
> configuration. It has not yet made it into the master branch as it's still
> somewhat under development, but it is stable enough for use in a number of
> applications.
>
>
>
> I realise you said you were on release/75, but it should be possible for
> you to simply patch your DB to release/76 using the
> ensembl/misc-scripts/schema_patcher.pl script. If not, it may also be
> possible to push the code onto the release/75 branch.
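>
> The invocation is along these lines (a sketch from memory; run the script
> with --help to check the exact options):
>
>   perl ensembl/misc-scripts/schema_patcher.pl \
>     --host dbhost --port 3306 --user admin --pass secret \
>     --type variation --database your_variation_db \
>     --from 75 --release 76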
>
>
>
> Regarding transcript variation, the pipeline we use to populate the table
> is not currently documented for users outside the project. It may be
> possible to prepare some documentation suitable for external use. There is
> also a way to have import_vcf.pl use the same code and optimised caches
> that the Variant Effect Predictor uses, though this is alpha code at best.
>
>
>
> I hope this helps, though I realise it doesn't necessarily address all of
> the issues you've been having!
>
>
>
> Regards
>
>
>
> Will McLaren
>
> Ensembl Variation
>
>
>
>
>
> On 14 November 2014 01:47, Roberts, Drew (RY) <andrew_roberts at merck.com>
> wrote:
>
> Hello all-
>
> I am new to the Ensembl software.  I am trying to use the import_vcf.pl
> script to load a set of VCF files produced by phase 3 of the 1000Genomes
> project into a custom Ensembl variation database I created in-house.  I
> confirmed that this process works for a smaller dataset, but am having
> issues with 1000Genomes: the sheer size of the dataset is proving
> difficult.
>
> We are using an in-house Ensembl database and API tools with version
> 75_37... a little behind the latest, I know, but we wanted to stay with
> human genome build 37 for now, as most of our in-house data uses it.
>
> I ran a number of import_vcf.pl jobs in parallel -- one for each
> chromosome -- with the following config file:
>
> registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
> input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
> source        1000Genomes
> population    1000Genomes:phase_3:MERCK_PRELOAD
> panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
> species       homo_sapiens
> tmpdir        /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
> add_tables    compressed_genotype_region
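>
> For reference, the registry file it points to looks roughly like this
> (connection details anonymised; it registers the standard core +
> variation adaptor pair):
>
>   use Bio::EnsEMBL::DBSQL::DBAdaptor;
>   use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
>
>   # In-house core database (build 37, release 75).
>   Bio::EnsEMBL::DBSQL::DBAdaptor->new(
>       -host    => 'our-db-host',
>       -port    => 3306,
>       -user    => 'loader',
>       -pass    => 'xxx',
>       -species => 'homo_sapiens',
>       -group   => 'core',
>       -dbname  => 'homo_sapiens_core_75_37',
>   );
>
>   # The custom variation database being loaded.
>   Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
>       -host    => 'our-db-host',
>       -port    => 3306,
>       -user    => 'loader',
>       -pass    => 'xxx',
>       -species => 'homo_sapiens',
>       -group   => 'variation',
>       -dbname  => 'our_variation_75_37',
>   );
>
>   1;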
>
> The notable part of this is that we are loading all the standard tables
> plus compressed_genotype_region.  We would also like to load
> transcript_variation, but tests on small data sets showed that this slowed
> down the import a lot, and the import_vcf web page claimed it is faster to
> populate this table later "using the standard transcript_variation
> pipeline"; see:
> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
> However, I have not been able to find any documentation for this "standard
> pipeline", and I found an exchange on this mailing list where a user was
> told not to try to use it.  So:
>
> Question 1:  Is there still not a standard transcript_variation pipeline?
> If it exists, can somebody point me to it?
>
> This upload using the config above runs, but quite slowly... a little over
> 200 variants per minute.  At that rate it looked like it would take at
> least a week and a half to finish with 10 or 12 jobs running in parallel.
> The MySQL database seemed to be holding up OK, while each individual Perl
> script consumed close to 100% of the CPU time on the processor it was
> running on.
>
> About halfway through the process everything halted.  It turned out that
> the auto-increment allele_id column in the allele table had run out of
> integers!  It hit the maximum value of the signed INT datatype that
> Ensembl uses for this column.  I have been converting the column to
> BIGINT.
>
> Question 2:  Has anybody else run into this problem?  In particular, have
> the Ensembl folks tried to load phase 3 1000GENOMES data yet?  It feels
> like I must be doing something wrong, but I've been using the standard
> script.
>
> In general I notice that the standard Ensembl variation database has far
> fewer allele entries per variation than my custom import_vcf-loaded
> database does.  More broadly, my custom database (over a terabyte, and
> only halfway through the load) is already larger than the standard
> database, which manages to include the earlier 1000GENOMES data plus
> plenty of other content despite its smaller size.  And so:
>
> Question 3:  Am I going about this totally the wrong way?  The web page I
> linked above says the Ensembl team uses import_vcf.pl to load 1000GENOMES
> data.  If so, can they tell me what table options they use?  Perhaps they
> are skipping tables I am keeping, or doing something else that I should
> know about.  Any suggestions would be welcome.
>
> Hope this makes sense -- I am new to all this, so I might have forgotten
> to provide information you need.
>
> In any event, thanks for any suggestions!
> Drew
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>

