[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Genomeo Dev genomeodev at gmail.com
Mon Nov 17 10:50:48 GMT 2014


Hi Will,

As you may be aware, phase 3 currently has some known issues, including a
considerable fraction of variants that are missing relative to phase 1 v3
for no apparent reason.

For example, for the EUR super population, about 1.6 million variants are
missing:

>> bcftools isec -p dir -n-1 -w1 1000GENOMES-phase_1_EUR.vcf.gz EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz
>> gawk '{print $5}' dir/sites.txt | sort -V | uniq -c

 1,619,556 10
68,792,800 01

(The fifth column of sites.txt flags which input file each site is present in, so "10" counts sites found only in the phase 1 file and "01" counts sites found only in the phase 3 file.)

See:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_known_issues_20141030

See also a recent thread on the Ensembl developers mailing list.

I wonder whether Ensembl could merge the phase 1 and phase 3 data while
waiting for a fix from the 1000 Genomes project.

G.

On 17 November 2014 09:27, Will McLaren <wm2 at ebi.ac.uk> wrote:

> Hi Drew,
>
> We hope to have 1000 genomes phase 3 data available in release 79 of
> Ensembl, which is currently due for release early next year.
>
> Thanks for the valuable feedback.
>
> Regards
>
> Will
>
> On 15 November 2014 02:58, Roberts, Drew (RY) <andrew_roberts at merck.com>
> wrote:
>
>> Will-
>>
>>
>>
>> This is very useful information, thank you much.  If nothing else, it
>> makes me feel better about having problems loading 1000GENOMES!
>>
>>
>>
>> We will most likely follow your suggestions and try to implement the new
>> features you mention, but in the immediate short term I think we’ll move our
>> focus to some simpler tasks.  When we’re ready to move forward, I’ll post to
>> this list again and take you up on your offer of configuration help.
>>
>>
>>
>> For now, I will add 1 question and 1 comment:
>>
>>
>>
>> Question:  Do you have any idea when 1000GENOMES phase 3 data will be
>> included in the standard EnsEMBL release?
>>
>>
>>
>> Comment:  I would request that a small update be made to the
>> import_vcf.pl documentation online, at:
>>
>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
>> The current text implies that loading transcript_variation after the
>> import_vcf run is a simple and supported pipeline, which it sounds like it
>> isn’t quite yet.  If the wording were changed to warn people that it isn’t
>> so easy, they would know up front that they need to do an “add_tables
>> transcript_variation” during their initial import_vcf.pl run if that
>> data is required.
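>>
>> Concretely, that would mean a config line along these lines (assuming
>> add_tables accepts a comma-separated list; I have only tested the
>> single-table form shown in my config further down):
>>
>> add_tables   compressed_genotype_region,transcript_variation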
>>
>>
>>
>> Again, thanks for the enlightening response, and I expect that I will be
>> following up on your ideas at some point in the not-too-horribly-distant
>> future.
>>
>>
>>
>> Cheers!
>>
>> Drew
>>
>>
>>
>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of Will McLaren
>> Sent: Friday, November 14, 2014 12:52 AM
>> To: Ensembl developers list
>> Cc: Hughes, Jason
>> Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data
>>
>>
>>
>> Hi Drew,
>>
>>
>>
>> The Ensembl Variation team created the variation DB for the 1000 genomes
>> browser site, so we do have relevant experience here.
>>
>>
>>
>> We too found that the volume of data stretches a lot of resources in ways
>> they don't want to be stretched. To address this, we have a couple of solutions
>> in the API that mitigate these issues to some extent.
>>
>>
>>
>> 1) We don't load the allele or population_genotype table with 1000
>> genomes data; in fact we started doing this with phase 1. There exists code
>> in the API to generate allele and population_genotype objects on the fly
>> from just the data in the compressed_genotype_var table; in order to enable
>> this, you simply need to set the freqs_from_gts column to 1 in the
>> population table for the relevant populations.
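>>
>> As a rough sketch (the population name pattern here is only a placeholder;
>> match it to the populations you actually loaded):
>>
>> UPDATE population
>> SET    freqs_from_gts = 1
>> WHERE  name LIKE '1000GENOMES:phase_3:%';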
>>
>>
>>
>> Note that this data is then unavailable via direct SQL query as the
>> genotype data in compressed_genotype_var is readable only by the API.
>>
>>
>>
>> 2) For the phase 3 data, we realised that populating
>> compressed_genotype_region and compressed_genotype_var would also become
>> impractical; tests indicated that each table would be around 1.5TB on disk!
>>
>>
>>
>> To this end we developed some more API trickery to load genotype data on
>> the fly from VCF files. The genotype objects from this are then also fed
>> into the features mentioned above, such that all genotype and frequency
>> data that you see at e.g.
>> http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699
>> comes from the same VCFs you are using to load your DB. The same code also
>> powers the LD views.
>>
>>
>>
>> The API code for this is currently available on release/76 of the
>> ensembl-variation Git repo. If you'd like some help getting this set up,
>> let me know and I can give you more details on what to do; the code
>> introduces a couple of extra dependencies and requires some (very simple)
>> configuration. It has not yet made it into the master branch as it's still
>> somewhat under development, but it is stable enough for use in a number of
>> applications.
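>>
>> For example, assuming you already have a clone of the repo:
>>
>> cd ensembl-variation
>> git fetch origin
>> git checkout release/76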
>>
>>
>>
>> I realise you said you were on release/75, but it should be possible for
>> you to simply patch your DB to release/76 using the
>> ensembl/misc-scripts/schema_patcher.pl script. If not, it may also be
>> possible to push the code onto the release/75 branch.
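>>
>> The patching step would look roughly like this (connection details and
>> database name are placeholders; check the script's help output for the
>> exact option names):
>>
>> perl ensembl/misc-scripts/schema_patcher.pl \
>>     --host dbhost --port 3306 --user admin --pass secret \
>>     --database homo_sapiens_variation_75_37 \
>>     --type variation --from 75 --release 76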
>>
>>
>>
>> Regarding transcript variation, the pipeline we use to populate the table
>> is not currently documented for users outside the project. It may be
>> possible to prepare some documentation suitable for external use. There is
>> also a way to have import_vcf.pl use the same code and optimised caches
>> that the Variant Effect Predictor uses, though this is alpha code at best.
>>
>>
>>
>> I hope this helps, though I realise it doesn't necessarily address all of
>> the issues you've been having!
>>
>>
>>
>> Regards
>>
>>
>>
>> Will McLaren
>>
>> Ensembl Variation
>>
>>
>>
>>
>>
>> On 14 November 2014 01:47, Roberts, Drew (RY) <andrew_roberts at merck.com>
>> wrote:
>>
>> Hello all-
>>
>> I am new to EnsEMBL software.  I am trying to use the import_vcf.pl
>> script to load a set of VCF files produced by phase 3 of the 1000Genomes
>> project into a custom EnsEMBL variation database I created in-house.  I
>> confirmed that this process works for a smaller dataset, but am having
>> issues with 1000Genomes.  The sheer size of the dataset is proving
>> difficult.
>>
>> We are using an in-house EnsEMBL database and API tools with version
>> 75_37...a little behind the latest, I know, but we wanted to stay with the
>> human genome build 37 for now as most of our in-house data uses it.
>>
>> I ran a number of import_vcf.pl jobs in parallel -- one for each
>> chromosome -- with the following config file:
>>
>> registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
>> input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
>> source        1000Genomes
>> population    1000Genomes:phase_3:MERCK_PRELOAD
>> panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
>> species       homo_sapiens
>> tmpdir        /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
>> add_tables    compressed_genotype_region
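>>
>> As a rough sketch of how the per-chromosome jobs get launched (file names
>> here are placeholders; adjust how the config file is passed to whatever
>> your copy of import_vcf.pl expects):
>>
>> for chr in {1..22} X; do
>>     perl import_vcf.pl --config config_chr${chr}.txt &
>> done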
>>
>> The notable part of this is that we are loading all standard tables plus
>> compressed_genotype_region.  We would also like to load
>> transcript_variation, but tests on small data sets showed that this slowed
>> down the import a lot, and the import_vcf web page claimed it was faster to
>> populate this table later "using the standard transcript_variation
>> pipeline", see:
>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
>> However I have not been able to find any documentation for this "standard
>> pipeline", and I found an exchange on this mailing list where a user was
>> told not to try to use it.  So:
>>
>> Question 1:  Is there still not a standard transcript_variation
>> pipeline?  If it exists, can somebody point me to it?
>>
>> This upload using the config above runs, but still quite slowly...a
>> little over 200 variants per minute.  It looked like it would take at least
>> a week and a half to finish, running 10 or 12 jobs in parallel.  The MySQL
>> database seemed to be holding up OK, while each individual perl script
>> consumed close to 100% of the CPU time for the processor it was running on.
>>
>> About halfway through the process everything halted.  It turned out that
>> the auto-increment "allele_id" column in the "allele" table had run out of
>> integers!  It hit the maximum value for the 'signed int' datatype which
>> EnsEMBL uses for this column.  I have been converting the table to use
>> "bigint" instead of "int" for this column.  However I wondered:
>>
>> Question 2:  Has anybody else run into this problem?  In particular, have
>> the EnsEMBL folks tried to load phase 3 1000GENOMES data yet?
>> It feels like I must be doing something wrong, but I've been using the
>> standard script.
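>>
>> For reference, the column conversion I'm running is essentially the
>> following (adjust signedness and other column options to match the
>> original definition in your schema):
>>
>> ALTER TABLE allele
>>     MODIFY allele_id BIGINT NOT NULL AUTO_INCREMENT;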
>>
>> In general I notice that the standard EnsEMBL variation database has far
>> fewer allele entries per variation than my custom import_vcf-loaded
>> database does.  My custom database is also simply much larger (over a
>> terabyte, and only halfway through the load) than the standard database,
>> which manages to include the earlier 1000GENOMES data plus plenty of other
>> content.  And so:
>>
>> Question 3:  Am I totally going about this the wrong way?  The website I
>> linked above says the EnsEMBL team uses import_vcf.pl to load
>> 1000GENOMES data.  If that is true, can they tell me what table options
>> they use?  Perhaps they are skipping tables I am keeping, or doing
>> something else that I should know about.  Any suggestions would be welcome.
>>
>> Hope this makes sense -- I am new to all this, so I might have forgotten
>> to provide information you need.
>>
>> In any event, thanks for any suggestions!
>> Drew
>


-- 
G.