[ensembl-dev] getting watson snps

Pablo Marin-Garcia pg4 at sanger.ac.uk
Tue Dec 14 13:26:56 GMT 2010


Hello Andrea

there are several things that can help you (hints rather than a full solution, I'm afraid):

A) First, when you are loading big data sets into mysql, you should disable
the indexes first or it will take ages. For a few tables you can do it
manually, commenting out the index creation in the SQL and creating the
indexes afterwards with 'create index' (or 'alter table ... add index') once
the data are in. I don't know whether mysql can load everything with the
indexes skipped via mysqladmin and then rebuild all the indexes automatically.
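
For example (a minimal sketch, not tested here), for a MyISAM table you can
switch non-unique index maintenance off around the bulk load and rebuild the
indexes in one pass afterwards; ens_vf_b37 is the table defined in the
loading script further down:

    alter table ens_vf_b37 disable keys;
    -- ... load data local infile / insert statements go here ...
    alter table ens_vf_b37 enable keys;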


B) The approach that I follow when working with full-genome SNPs is to download 
the tables that I need once, directly with mysql, and then use Ensembl-like 
adaptors to work with them:

mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org \
    homo_sapiens_variation_60_37e -e \
    'select variation_id as var_id, variation_name as snp, sr.name as chr,
     seq_region_start as start, seq_region_end as end,
     seq_region_strand as strand, allele_string, map_weight, flags,
     validation_status, consequence_type
     from variation_feature vf, seq_region sr
     where vf.seq_region_id = sr.seq_region_id' > variations_build_37_ens-60.tab

note:
   This produces a 2 GB file, variations_build_37_ens-60.tab (you can make
   more specific queries, joining source, variation_synonym and
   variation_feature, in order to reduce the data). Also, the memory monitor
   told me that mysql was using 4.5 GB during the download, so on small
   machines you should make more specific queries. You can always download
   these three tables (plus seq_region for the chromosome names) from the
   Ensembl database dumps and load them into your local mysql (it is not
   necessary to download the full variation database).
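
For example, a sketch of a more restricted dump, limiting the same query to a
single chromosome (same tables and columns as above; chromosome '22' is just
an illustration):

mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org \
    homo_sapiens_variation_60_37e -e \
    'select variation_id as var_id, variation_name as snp, sr.name as chr,
     seq_region_start as start, seq_region_end as end
     from variation_feature vf, seq_region sr
     where vf.seq_region_id = sr.seq_region_id
     and sr.name = "22"' > variations_chr22_build_37_ens-60.tab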

You can load the dumped data into a local mysql with:

============ loading script: load_ens_vf_b37.sql

-- load with:
--    mysql --host=variation_local  --port=3636 --user=pmg -p < load_ens_vf_b37.sql
--

use pmGWAS;

drop table if exists ens_vf_b37;
create table ens_vf_b37 (
    var_id              int(20),
    snp                 varchar(20),
    chr                 varchar(20),
    start               int(10) UNSIGNED NOT NULL,
    end                 int(10) UNSIGNED NOT NULL,
    strand              enum('-1','1'),
    allele_string       varchar(10),
    map_weight          tinyint(2),
    flags               varchar(100),
    validation_status   varchar(100),
    consequence_type    varchar(100)
);

load data local infile 
'/homes/pmg/ens_dump/variations_build_37_ens-60.tab'
into table ens_vf_b37
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(
var_id
,snp
,chr
,start
,end
,strand
,allele_string
,map_weight
,flags
,validation_status
,consequence_type
);

-- create the indexes after the load, so that they do not slow down the insertions:
-- alter table ens_vf_b37 add index snp(snp);
-- alter table ens_vf_b37 add index pos(chr, start);

=============


If I need specific data sets (Watson, Venter, dbSNP, ...) I make a second
table from the output of a join between the 'source' and 'variation_synonym'
tables, so that I know which SNPs belong to a specific source. That way I can
run any genomic query faster locally, joining my variation_feature table with
my source-synonym table by variation_id to extract the SNPs I want (see the
sketch below).
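
As a sketch of that second table and the final join (assuming source and
variation_synonym have been loaded locally from the Ensembl dumps;
'watson_snps' is a hypothetical table name and 'ENSEMBL:Watson' is the
build-36 source name mentioned below):

    drop table if exists watson_snps;
    create table watson_snps as
        select vs.variation_id as var_id
        from variation_synonym vs, source s
        where vs.source_id = s.source_id
          and s.name = 'ENSEMBL:Watson';

    alter table watson_snps add index var_id(var_id);

    -- any genomic query can now be restricted to that source by joining
    -- on variation_id, e.g. all Watson SNPs on chromosome 22:
    select e.snp, e.chr, e.start, e.end, e.allele_string
    from ens_vf_b37 e, watson_snps w
    where e.var_id = w.var_id
      and e.chr = '22';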


In your case, build 36 was straightforward because one of the sources was
"ENSEMBL:Watson", but in the new build 37 source table they have added a lot
more sources and there is no explicit Watson source anymore, so I don't know
where it is defined (now you only have ENSEMBL, whereas in 36 there were
ENSEMBL:Celera/Watson/Venter):

$  mysql_ensembl_variation_b37 'select name  from source'
+----------------------------+
| name                       |
+----------------------------+
| dbSNP                      |
| Affy GeneChip 100K Array   |
| Affy GeneChip 500K Array   |
| Affy GenomeWideSNP_6.0     |
| NHGRI_GWAS_catalog         |
| EGA                        |
| Illumina_CytoSNP12v1       |
| Illumina_Human660W-quad    |
| Illumina_Human1M-duoV3     |
| Uniprot                    |
| COSMIC                     |
| ENSEMBL                    |
| DGVa:estd1                 |
| DGVa:estd3                 |
| Open Access GWAS Database  |
| HGMD-PUBLIC                |
| DGVa:nstd1                 |
| DGVa:nstd2                 |
| DGVa:nstd4                 |
| DGVa:nstd9                 |
| DGVa:nstd14                |
| DGVa:nstd16                |
| DGVa:nstd17                |
| DGVa:nstd20                |
| DGVa:nstd11                |
| DGVa:nstd22                |
| DGVa:nstd23                |
| DGVa:nstd27                |
| DGVa:nstd28                |
| DGVa:nstd29                |
| DGVa:nstd30                |
| DGVa:nstd31                |
| DGVa:nstd32                |
| DGVa:nstd34                |
| DGVa:nstd35                |
| Affy GenomeWideSNP_6.0 CNV |
| DGVa:nstd8                 |
| DGVa:estd19                |
| DGVa:estd20                |
| DGVa:estd21                |
| DGVa:estd22                |
| DGVa:estd24                |
| DGVa:nstd36                |
| DGVa:nstd39                |
| DGVa:estd48                |
| DGVa:estd49                |
| DGVa:estd50                |
+----------------------------+
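
If it helps to track down where the Watson SNPs ended up, a quick (if slow)
check is to count how many synonyms each source actually carries, using the
same source and variation_synonym tables (a sketch; better run against a
local copy, since it scans the whole variation_synonym table):

    select s.name, count(*) as n_synonyms
    from variation_synonym vs, source s
    where vs.source_id = s.source_id
    group by s.name
    order by n_synonyms desc;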



  On Sun, 12 Dec 2010, Andrea Edwards wrote:

> Hi
>
> The dump files would be great, but I am also retrieving lots of other 
> information about the SNPs along with the SNPs, and that might not necessarily be in 
> your dump file, so I think I have to try other options too.
>
> This is what I have tried so far to get the Watson SNPs, and I am not getting 
> anywhere fast :)
>
> 1. Written a Perl script to download them from the Ensembl human variation 
> database. This works but will take over a month to get all the SNPs at the 
> rate at which it seems to be running, and I imagine you'll block my IP 
> address if I leave it running :) Plus I can't leave it a month anyway.
>
> 2. I've tried to install the human variation database locally, but that also 
> seems to be having problems. It has been loading the allele table for 
> 3 days now, I think. It is running on a very slow machine, but there are far bigger 
> tables than the allele table, so I dread to think how long they will take. I 
> tried to get access to a better machine but I wasn't given enough hard disk 
> space; perhaps that will solve the problem! How long should it take to 
> install the human variation database (roughly) on a 64-bit Linux machine with 
> 2 GB of RAM and an Intel Xeon @ 2.27 GHz? Will it take hours or days?
>
> Is there anything else I can try? I do appreciate that the dataset is vast 
> and these things will be slow. Perhaps the answer is simply a faster machine 
> to install the local database on, and I am looking into this.
>
> I have already looked at getting the SNPs from dbSNP or directly from the source, 
> but I need to get information associated with the SNPs, so I will have the same 
> problems, I think, of retrieving the associated data even if I got the 'raw 
> SNPs' by other means.
>
> Many thanks
>
> On 09/12/2010 16:53, Fiona Cunningham wrote:
>>   Dear Andrea,
>> 
>> We will look into producing the dump file of all SNPs in Watson for
>> the next release, which should make your life easier. BioMart is really
>> best suited to specific queries, and so we should provide dump files
>> where large amounts of information across the entire genome are
>> required.
>> 
>> Fiona
>> 
>> ------------------------------------------------------
>> Fiona Cunningham
>> Ensembl Variation Project Leader, EBI
>> www.ensembl.org
>> www.lrg-sequence.org
>> t: 01223 494612 || e: fiona at ebi.ac.uk
>> 
>> 
>> 
>> On 9 December 2010 13:46, Andrea Edwards <edwardsa at cs.man.ac.uk> wrote:
>>> Dear all
>>> 
>>> I've tried downloading the Watson SNPs from BioMart, both a) as the whole set and b)
>>> chromosome by chromosome, and I can't get the data. I have tried requesting
>>> the data by email (no email received) and by direct download (the download starts
>>> but at a rate of 1 KB per second and times out after about 12 hours / 10 MB
>>> downloaded).
>>> 
>>> I have written a script to get the Watson SNPs via the Perl API, but it is
>>> running and taking hours, so I am scared I will get my IP blocked! There are
>>> 3 million SNPs and it took an hour to get 3000, I think.
>>> 
>>> I was thinking of getting the human databases directly, but I am awaiting a
>>> new machine and am totally out of disk space. Does anyone know how big the
>>> human core and variation databases are when installed?
>>> 
>>> thanks a lot
>>> 
>>> _______________________________________________
>>> Dev mailing list
>>> Dev at ensembl.org
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> 
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>


-----

   Pablo Marin-Garcia




