[ensembl-dev] vcf_import of 1000GENOMES phase 3 data

Genomeo Dev genomeodev at gmail.com
Thu Nov 20 10:43:54 GMT 2014


Hi Laura/Will,

Thanks for the response. While waiting for that to be fixed, I have two
non-overlapping sets of variants, one from phase 1 EUR and one from phase 3
EUR. I want to combine them into one VCF so I can do LD calculations using
Plink. I can't find a way to combine them because only 364 samples are
shared between the two sets: 15 samples are unique to phase 1 and 139 are
unique to phase 3. Any suggestions on how I could do that? BCFtools merge
is no good because it renames the overlapping samples. Is there a tool to
transpose a VCF, concatenate all the variants and then convert back to VCF?
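
One workaround I am considering (untested sketch; it assumes the two
variant sets really are disjoint and that dropping the non-shared samples
is acceptable): restrict both files to the 364 shared samples so the
sample sets match exactly, then concatenate instead of merging, so nothing
gets renamed:

  # intersect the two sample lists (comm needs sorted input)
  bcftools query -l 1000GENOMES-phase_1_EUR.vcf.gz | sort > p1.samples
  bcftools query -l EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz | sort > p3.samples
  comm -12 p1.samples p3.samples > shared.samples

  # subset each file to the shared samples (same order in both) and index
  bcftools view -S shared.samples -Oz -o p1.shared.vcf.gz 1000GENOMES-phase_1_EUR.vcf.gz
  bcftools view -S shared.samples -Oz -o p3.shared.vcf.gz EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz
  bcftools index p1.shared.vcf.gz
  bcftools index p3.shared.vcf.gz

  # concatenate the disjoint variant sets; -a lets records interleave by position
  bcftools concat -a -Oz -o combined_EUR.vcf.gz p1.shared.vcf.gz p3.shared.vcf.gz

Plink would then see a single VCF with a consistent set of 364 samples.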

Thanks.

G.

On 17 November 2014 11:39, Laura Clarke <laura at ebi.ac.uk> wrote:

> Hello G
>
> The 1000 Genomes Project is currently undertaking a QC of P3.
>
> We aim to provide a list to the community categorizing the different
> reasons variants aren't part of the P3 sites list.  This will allow
> everyone to see whether we no longer called a site, filtered it out or
> missed it from our final build.  We hope to release the list as soon as
> possible.
>
> We will ensure that Ensembl gets that list and that it is made available
> for display as soon as is feasible.
>
> thanks
>
> Laura
>
> On 17 November 2014 10:50, Genomeo Dev <genomeodev at gmail.com> wrote:
> > Hi Will,
> >
> > As you might be aware, phase 3 has some known issues at the moment,
> > which include a considerable fraction of variants missing compared to
> > phase 1 v3 for no apparent reason.
> >
> > For example, for the EUR super population, there are about 1.6 million
> > variants missing:
> >
> > bcftools isec -p dir -n-1 -w1 1000GENOMES-phase_1_EUR.vcf.gz \
> >   EUR.all.phase3_shapeit2_mvncall_integrated_v5.20130502.vcf.gz
> >
> > gawk '{print $5}' dir/sites.txt | sort -V | uniq -c
> >
> >  1,619,556 10    (in phase 1 only)
> > 68,792,800 01    (in phase 3 only)
> >
> > See:
> > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_known_issues_20141030
> >
> > See also a recent thread on the Ensembl developers mailing list.
> >
> > I wonder whether Ensembl could merge the phase 1 and phase 3 data while
> > waiting for a fix from the 1000 Genomes Project.
> >
> > G.
> >
> > On 17 November 2014 09:27, Will McLaren <wm2 at ebi.ac.uk> wrote:
> >>
> >> Hi Drew,
> >>
> >> We hope to have 1000 genomes phase 3 data available in release 79 of
> >> Ensembl, which is currently due for release early next year.
> >>
> >> Thanks for the valuable feedback.
> >>
> >> Regards
> >>
> >> Will
> >>
> >> On 15 November 2014 02:58, Roberts, Drew (RY)
> >> <andrew_roberts at merck.com> wrote:
> >>>
> >>> Will-
> >>>
> >>>
> >>>
> >>> This is very useful information, thank you very much.  If nothing
> >>> else, it makes me feel better about having problems loading
> >>> 1000GENOMES!
> >>>
> >>>
> >>>
> >>> We will most likely follow your suggestions and try to implement the
> >>> new features you mention, but in the immediate short term I think we'll
> >>> move our focus to some simpler tasks.  When we're ready to move forward
> >>> I'll post to this list again and take you up on your offer of
> >>> configuration help.
> >>>
> >>>
> >>>
> >>> For now, I will add 1 question and 1 comment:
> >>>
> >>>
> >>>
> >>> Question:  Do you have any idea when 1000GENOMES phase 3 data will be
> >>> included in the standard EnsEMBL release?
> >>>
> >>>
> >>>
> >>> Comment:  I would request that a small update be made to the
> >>> import_vcf.pl documentation online, at:
> >>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
> >>> The current text implies that loading transcript_variation after the
> >>> import_vcf run is a simple and supported pipeline, which it sounds like
> >>> it isn't quite yet.  If the wording were changed to warn people that it
> >>> isn't so easy, they would know up front that they need to do an
> >>> "add_tables transcript_variation" during their initial import_vcf.pl
> >>> run if that data is required.
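> >>>
> >>> For example, the relevant line in the config I used (quoted later in
> >>> this thread) would become something like this, assuming add_tables
> >>> accepts a comma-separated list of table names:
> >>>
> >>> add_tables   transcript_variation,compressed_genotype_region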
> >>>
> >>>
> >>>
> >>> Again, thanks for the enlightening response, and I expect that I will
> >>> be following up on your ideas at some point in the
> >>> not-too-horribly-distant future.
> >>>
> >>>
> >>>
> >>> Cheers!
> >>>
> >>> Drew
> >>>
> >>>
> >>>
> >>> From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org]
> >>> On Behalf Of Will McLaren
> >>> Sent: Friday, November 14, 2014 12:52 AM
> >>> To: Ensembl developers list
> >>> Cc: Hughes, Jason
> >>> Subject: Re: [ensembl-dev] vcf_import of 1000GENOMES phase 3 data
> >>>
> >>>
> >>>
> >>> Hi Drew,
> >>>
> >>>
> >>>
> >>> The Ensembl Variation team created the variation DB for the 1000
> >>> Genomes browser site, so we do have relevant experience here.
> >>>
> >>>
> >>>
> >>> We too found that the volume of data stretches a lot of resources in
> >>> ways they were never designed to be stretched.  To this end we have a
> >>> couple of solutions in the API that mitigate these issues to some
> >>> extent.
> >>>
> >>>
> >>>
> >>> 1) We don't load the allele or population_genotype table with 1000
> >>> Genomes data; in fact we started doing this with phase 1.  There is
> >>> code in the API to generate allele and population_genotype objects on
> >>> the fly from just the data in the compressed_genotype_var table; to
> >>> enable this, you simply need to set the freqs_from_gts column to 1 in
> >>> the population table for the relevant populations.
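> >>>
> >>> For example, something along these lines (hypothetical connection
> >>> details; the population name is the one from your config):
> >>>
> >>> # flag the population so frequencies are derived from genotypes
> >>> mysql -h my_host -u my_user -p my_variation_db \
> >>>   -e "UPDATE population SET freqs_from_gts = 1
> >>>       WHERE name = '1000Genomes:phase_3:MERCK_PRELOAD';"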
> >>>
> >>>
> >>>
> >>> Note that this data is then unavailable via direct SQL query as the
> >>> genotype data in compressed_genotype_var is readable only by the API.
> >>>
> >>>
> >>>
> >>> 2) For the phase 3 data, we realised that populating
> >>> compressed_genotype_region and compressed_genotype_var would also
> >>> become impractical; tests indicated that each table would be around
> >>> 1.5TB on disk!
> >>>
> >>>
> >>>
> >>> To this end we developed some more API trickery to load genotype data
> >>> on the fly from VCF files.  The genotype objects from this are then
> >>> also fed into the features mentioned above, such that all genotype and
> >>> frequency data that you see at e.g.
> >>> http://browser.1000genomes.org/Homo_sapiens/Variation/Population?v=rs699
> >>> come from the same VCFs you are using to load your DB.  The same code
> >>> also powers the LD views.
> >>>
> >>>
> >>>
> >>> The API code for this is currently available on the release/76 branch
> >>> of the ensembl-variation Git repo.  If you'd like some help getting
> >>> this set up, let me know and I can give you more details on what to do;
> >>> the code introduces a couple of extra dependencies and requires some
> >>> (very simple) configuration.  It has not yet made it into the master
> >>> branch as it's still somewhat under development, but it is stable
> >>> enough for a number of applications.
> >>>
> >>>
> >>>
> >>> I realise you said you were on release/75, but it should be possible
> >>> for you to simply patch your DB to release/76 using the
> >>> ensembl/misc-scripts/schema_patcher.pl script.  If not, it may also be
> >>> possible to push the code onto the release/75 branch.
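> >>>
> >>> An invocation along these lines should do it (hypothetical connection
> >>> details; check the script's --help for the full set of options):
> >>>
> >>> perl ensembl/misc-scripts/schema_patcher.pl \
> >>>   --host my_host --port 3306 --user my_user --pass my_pass \
> >>>   --database my_variation_db --type variation \
> >>>   --from 75 --release 76 --verbose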
> >>>
> >>>
> >>>
> >>> Regarding transcript_variation, the pipeline we use to populate the
> >>> table is not currently documented for users outside the project.  It
> >>> may be possible to prepare some documentation suitable for external
> >>> use.  There is also a way to have import_vcf.pl use the same code and
> >>> optimised caches that the Variant Effect Predictor uses, though this is
> >>> alpha code at best.
> >>>
> >>>
> >>>
> >>> I hope this helps, though I realise it doesn't necessarily address all
> >>> of the issues you've been having!
> >>>
> >>>
> >>>
> >>> Regards
> >>>
> >>>
> >>>
> >>> Will McLaren
> >>>
> >>> Ensembl Variation
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 14 November 2014 01:47, Roberts, Drew (RY)
> >>> <andrew_roberts at merck.com> wrote:
> >>>
> >>> Hello all-
> >>>
> >>> I am new to the EnsEMBL software.  I am trying to use the import_vcf.pl
> >>> script to load a set of VCF files produced by phase 3 of the
> >>> 1000Genomes project into a custom EnsEMBL variation database I created
> >>> in-house.  I confirmed that this process works for a smaller dataset,
> >>> but am having issues with 1000Genomes.  The sheer size of the dataset
> >>> is proving difficult.
> >>>
> >>> We are using an in-house EnsEMBL database and API tools at version
> >>> 75_37... a little behind the latest, I know, but we wanted to stay with
> >>> human genome build 37 for now as most of our in-house data uses it.
> >>>
> >>> I ran a number of import_vcf.pl jobs in parallel -- one for each
> >>> chromosome -- with the following config file:
> >>>
> >>> registry      /tmp60days/robertsa/vcf_file_import/vcf_import_ensembl_registry.pl
> >>> input_file    /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/reduced.ALL.chr15.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz
> >>> source        1000Genomes
> >>> population    1000Genomes:phase_3:MERCK_PRELOAD
> >>> panel         /tmp60days/robertsa/vcf_file_import/1000Genomes_phase3_v5_vcf_files/sample_population_panel_file.txt
> >>> species       homo_sapiens
> >>> tmpdir        /tmp60days/robertsa/vcf_file_import/vcf_load_tmpdir
> >>> add_tables    compressed_genotype_region
> >>>
> >>> The notable part of this is that we are loading all the standard tables
> >>> plus compressed_genotype_region.  We would also like to load
> >>> transcript_variation, but tests on small data sets showed that this
> >>> slowed down the import a lot, and the import_vcf web page claimed it
> >>> was faster to populate this table later "using the standard
> >>> transcript_variation pipeline"; see:
> >>> http://www.ensembl.org/info/genome/variation/import_vcf.html#tables
> >>> However, I have not been able to find any documentation for this
> >>> "standard pipeline", and I found an exchange on this mailing list where
> >>> a user was told not to try to use it.  So:
> >>>
> >>> Question 1:  Is there still no standard transcript_variation pipeline?
> >>> If it exists, can somebody point me to it?
> >>>
> >>> The upload using the config above runs, but quite slowly... a little
> >>> over 200 variants per minute.  It looked like it would take at least a
> >>> week and a half to finish, running 10 or 12 jobs in parallel.  The
> >>> MySQL database seemed to be holding up OK, while each individual Perl
> >>> script consumed close to 100% of the CPU time on the processor it was
> >>> running on.
> >>>
> >>> About halfway through the process everything halted.  It turned out
> >>> that the auto-increment "allele_id" column in the "allele" table had
> >>> run out of integers!  It hit the maximum value of the 'signed int'
> >>> datatype which EnsEMBL uses for this column.  I have been converting
> >>> the table to use "bigint" instead of "int" for this column.
> >>>
> >>> Question 2:  Has anybody else run into this problem?  In particular,
> >>> have the EnsEMBL folks tried to load phase 3 1000GENOMES data yet?  It
> >>> feels like I must be doing something wrong, but I've been using the
> >>> standard script.
> >>>
> >>> In general I notice that the standard EnsEMBL variation database has
> >>> far fewer allele entries per variation than my custom import_vcf-loaded
> >>> database does.  My custom database (over a terabyte, and only halfway
> >>> through the load) is also simply larger than the standard database,
> >>> which manages to include the earlier 1000GENOMES data plus plenty of
> >>> other stuff despite its smaller size.  And so:
> >>>
> >>> Question 3:  Am I going about this totally the wrong way?  The website
> >>> I linked above says the EnsEMBL team uses import_vcf.pl to load
> >>> 1000GENOMES data.  If that is true, can they tell me what table options
> >>> they use?  Perhaps they are skipping tables I am keeping, or doing
> >>> something else that I should know about.  Any suggestions would be
> >>> welcome.
> >>>
> >>> Hope this makes sense -- I am new to all this, so I might have
> >>> forgotten to provide information you need.
> >>>
> >>> In any event, thanks for any suggestions!
> >>> Drew
> >>> Notice:  This e-mail message, together with any attachments, contains
> >>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> >>> New Jersey, USA 08889), and/or its affiliates (direct contact
> >>> information for affiliates is available at
> >>> http://www.merck.com/contact/contacts.html) that may be confidential,
> >>> proprietary, copyrighted and/or legally privileged. It is intended
> >>> solely for the use of the individual or entity named on this message.
> >>> If you are not the intended recipient, and have received this message
> >>> in error, please notify us immediately by reply e-mail and then delete
> >>> it from your system.
> >>>
> >>>
> >>> _______________________________________________
> >>> Dev mailing list    Dev at ensembl.org
> >>> Posting guidelines and subscribe/unsubscribe info:
> >>> http://lists.ensembl.org/mailman/listinfo/dev
> >>> Ensembl Blog: http://www.ensembl.info/
> >>>
> >>
> >
> >
> >
> > --
> > G.
> >
>



-- 
G.