[ensembl-dev] getting watson snps

Andrea Edwards edwardsa at cs.man.ac.uk
Fri Dec 17 14:29:59 GMT 2010


Pablo

Thanks for your very helpful reply.

How do you know all of this information about the database schema? My 
plan was to install the database and work out the schema by 'playing' 
with the data, but as yet I can't get the database installed.

How do you know which tables contain the data you need, and the 
relationships between them? Did you just get it from these docs?
http://www.ensembl.org/info/docs/api/core/core_schema.html

I'm on a much faster machine now and the core database import is still 
taking over 3 days; it is only at the ditag table. I think I'll stop the 
process, comment out the indexes, and start again!
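
Something like the following is what I have in mind (a rough sketch 
only; the table names and credentials are placeholders, and note that 
ALTER TABLE ... DISABLE KEYS only skips non-unique indexes, and only on 
MyISAM tables):

#!/usr/bin/perl
# Rough sketch: turn off index maintenance on the big tables before a
# bulk load and rebuild the indexes in one pass afterwards. Table names
# and credentials below are placeholders.
use strict;
use warnings;
use DBI;

my $db = DBI->connect(
    'DBI:mysql:database=homo_sapiens_core_60_37e;host=localhost;port=3306',
    'root', 'password', { RaiseError => 1 }
) or die "Cannot connect: $DBI::errstr";

my @big_tables = qw(ditag ditag_feature);    # example table names

$db->do("ALTER TABLE $_ DISABLE KEYS") for @big_tables;

# ... run the bulk load here (e.g. mysqlimport or LOAD DATA INFILE) ...

$db->do("ALTER TABLE $_ ENABLE KEYS") for @big_tables;
$db->disconnect;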

I'm happy not to download the whole database, but I don't know which 
tables I need. I have posted the script I was going to run against the 
remote Ensembl database (but it would take 2 weeks to run remotely!!) in 
case you can offer any advice on which tables to get. It's a very 
simple script even if it looks longish. It basically creates one big 
table with a row for each Watson snp consequence. Each consequence is a 
unique gene/exon/transcript combination that the snp affects (e.g. if 
the snp is in a gene, called gene1, and is in exon1 of that gene, and 
that exon is in 2 splice variants, transcript1 and transcript2, then 
there are 2 rows in the table: gene1/exon1/transcript1 and 
gene1/exon1/transcript2). I used this denormalised database design just 
as a shortcut while I was trying to actually get the data. I'm going to 
model it differently if/when I get it :)
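
For reference, the script assumes a local table roughly like the one 
created below (a sketch only; the column types are my guesses based on 
the values the script inserts):

#!/usr/bin/perl
# One-off setup: create the denormalised 'watson' table the script below
# inserts into. Column types are guesses based on the inserted values.
use strict;
use warnings;
use DBI;

my $db = DBI->connect(
    'DBI:mysql:database=annotationDB;host=localhost;port=3306',
    'root', 'password', { RaiseError => 1 }
) or die "Cannot connect: $DBI::errstr";

$db->do(<<'SQL');
create table if not exists watson (
    chromosome       varchar(20),
    locus            int(10) unsigned,
    gene_id          varchar(30),
    biotype          varchar(40),
    description      text,
    gene_name        varchar(40),
    transcript_id    varchar(30),
    exon_id          varchar(30),
    exon_start       int(10) unsigned,
    exon_end         int(10) unsigned,
    dbsnp_ref        varchar(20),
    dbsnp_synonyms   text,
    reference        char(1),
    allele           varchar(20),
    genotype         char(1),
    main_consequence varchar(100),
    all_consequences varchar(255),
    peptide_alleles  varchar(20),
    ens_schema       varchar(20),
    unique_mapping   enum('TRUE','FALSE')
);
SQL
$db->disconnect;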

I really appreciate all the help you have already given me. I just 
assumed I could download the whole database and then query it with the 
API. Perhaps I need to get more familiar with the schema so I can just 
get the subsets of data I need.

#!/usr/bin/perl -w

#use lib 'C:\Perl\site\lib\ensembl-api\ensembl\modules';
#use lib 'C:\Perl\site\lib\ensembl-api\ensembl-variation\modules';

use lib '/home/andrea/ensembl-src/ensembl/modules';
use lib '/home/andrea/ensembl-src/ensembl-variation/modules';

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);
use DBI;
use DBD::mysql;

my $species = 'homo_sapiens';

#my $reg = 'Bio::EnsEMBL::Registry';
#my $registry_file = "C:\\Documents and Settings\\Administrator\\Desktop\\PHD\\java\\Code\\SNP\\config.txt";
#$reg->load_all($registry_file);

my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',
                            -user => 'anonymous');

my $variation_set_adaptor     = $reg->get_adaptor('human', 'variation', 'variationset');
my $watson_set                = $variation_set_adaptor->fetch_by_name('ENSEMBL:Watson');
my $variation_feature_adaptor = $reg->get_adaptor('human', 'variation', 'variationfeature');
my $variation_adaptor         = $reg->get_adaptor('human', 'variation', 'variation');
my $slice_adaptor             = $reg->get_adaptor($species, 'core', 'slice');
my $gene_adaptor              = $reg->get_adaptor($species, 'core', 'gene');
my $transcript_adaptor        = $reg->get_adaptor($species, 'core', 'Transcript');



###################################
# get the database schema version
###################################

my @db_adaptors = @{ $reg->get_all_DBAdaptors(-group => 'variation') };
my $ens_schema;
foreach my $db_adaptor (@db_adaptors) {
    # _get_schema_build is an internal method; it returns e.g. '60_37e'
    $ens_schema = $db_adaptor->_get_schema_build;
}

my $snp = 0;    # running count of SNPs processed

my $slices = $slice_adaptor->fetch_all('chromosome', undef, 0, 1);


# Base pair overlap between returned slices
my $overlap = 0;

# Maximum size of returned slices
my $max_size = 100_000;

# Break chromosomal slices into smaller 100k component slices
my @slices = @{ split_Slices($slices, $max_size, $overlap) };

foreach my $slice (@slices) {

    unless ($slice->seq_region_name() =~ /Un/) {
        my @vfs = @{ $watson_set->get_all_VariationFeatures_by_Slice($slice) };
        foreach my $vf (@vfs) {
            next if ($vf->var_class ne 'snp');
            $snp++;

            print "variation feature found: $snp\n";

            # 'my' initialises all of these to undef
            my ($chromosome, $locus, $gene_id, $biotype, $description,
                $gene_name, $transcript_id, $exon_id, $exon_start, $exon_end,
                $dbsnp_ref, $dbsnp_synonyms, $reference, $allele, $genotype,
                $main_consequence, $all_consequences, $peptide_alleles,
                $unique_mapping);

            $chromosome = $slice->seq_region_name();

            #$a_allele, $t_allele, $g_allele, $c_allele, $a_main_consequence,
            #$a_all_consequences, $t_main_consequence, $t_all_consequences,
            #$c_main_consequence, $c_all_consequences, $g_main_consequence,
            #$g_all_consequences, $a_aa, $t_aa, $c_aa, $g_aa,

            $locus  = $vf->start();
            $allele = $vf->allele_string();
            # just a quick hack; the workflow will convert it back again
            $genotype = $vf->ambig_code();

            # my @alleles = split /\//, $vf->allele_string();
            # foreach (@alleles) {
            #     if (/A/i) { $a_allele = 'Y'; }
            #     if (/T/i) { $t_allele = 'Y'; }
            #     if (/C/i) { $c_allele = 'Y'; }
            #     if (/G/i) { $g_allele = 'Y'; }
            # }

            $dbsnp_ref = $vf->variation_name();
            $dbsnp_synonyms = join ',',
                @{ $vf->variation->get_all_synonyms('dbSNP') };
            $reference        = substr $allele, 0, 1;
            $main_consequence = $vf->display_consequence();
            $all_consequences = join ',', @{ $vf->get_consequence_type() };


            # fetch all genome hits for this variation, to flag snps that
            # map to more than one location
            my $variation = $variation_adaptor->fetch_by_name($dbsnp_ref);
            my @vars =
                @{ $variation_feature_adaptor->fetch_all_by_Variation($variation) };
            my $count = @vars;
            $unique_mapping = $count > 1 ? 'FALSE' : 'TRUE';

            my @tvs = @{ $vf->get_all_TranscriptVariations };
            if (@tvs) {

                print "vf has transcript variants\n";
                foreach my $tv (@tvs) {

                    my $transcript = $tv->transcript();
                    if ($transcript) {

                        $transcript_id = $transcript->stable_id();
                        my $gene =
                            $gene_adaptor->fetch_by_transcript_stable_id($transcript_id);
                        ($gene_id, $biotype, $description, $gene_name) = (
                            $gene->stable_id,   $gene->biotype,
                            $gene->description, $gene->external_name
                        );
                        $main_consequence = $tv->display_consequence();
                        $all_consequences = join ',', @{ $tv->consequence_type() };
                        $peptide_alleles  = $tv->pep_allele_string();

                        # find the exon (if any) containing the snp
                        my @exons = @{ $tv->transcript->get_all_Exons() };
                        foreach my $exon (@exons) {
                            if (    $vf->seq_region_start >= $exon->seq_region_start
                                and $vf->seq_region_start <= $exon->seq_region_end )
                            {
                                ($exon_start, $exon_end, $exon_id) = (
                                    $exon->seq_region_start,
                                    $exon->seq_region_end,
                                    $exon->stable_id
                                );
                                last;
                            }    # end if in the exon
                        }    # end for each exon

                    }    # end if transcript

                    # insert row here; the record can still hold nulls if
                    # the transcript is not available
                    insert_snp($chromosome, $locus, $gene_id, $biotype,
                        $description, $gene_name, $transcript_id, $exon_id,
                        $exon_start, $exon_end, $dbsnp_ref, $dbsnp_synonyms,
                        $reference, $allele, $genotype, $main_consequence,
                        $all_consequences, $peptide_alleles, $ens_schema,
                        $unique_mapping);
                }    # end for each transcript variant

            }
            else {

                print "vf does not have transcript variants\n";

                # insert row here for a snp not in a gene
                insert_snp($chromosome, $locus, $gene_id, $biotype,
                    $description, $gene_name, $transcript_id, $exon_id,
                    $exon_start, $exon_end, $dbsnp_ref, $dbsnp_synonyms,
                    $reference, $allele, $genotype, $main_consequence,
                    $all_consequences, $peptide_alleles, $ens_schema,
                    $unique_mapping);

            }    # end if any transcript variants
        }    # end for each variation feature
    }    # end unless UnChr
}    # end for each chromosome

sub insert_snp {

    # expects 20 values, in the column order of the insert below
    my @params = @_;

    # NB: this opens a new connection for every row; in practice you
    # would open one handle once and reuse it
    my $db = DBI->connect(
        "DBI:mysql:database=annotationDB;host=localhost;port=3306",
        "root", "and1jon1", { 'RaiseError' => 0 }
    ) or die "Cannot connect: " . $DBI::errstr;

    my $sth = $db->prepare(
        "insert into watson(chromosome, locus, gene_id, biotype, description,
            gene_name, transcript_id, exon_id, exon_start, exon_end,
            dbsnp_ref, dbsnp_synonyms, reference, allele, genotype,
            main_consequence, all_consequences, peptide_alleles, ens_schema,
            unique_mapping)
         values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
    ) or die "Cannot prepare: " . $db->errstr();
    $sth->execute(@params) or die "Cannot execute: " . $db->errstr();

}

On 14/12/2010 13:26, Pablo Marin-Garcia wrote:
>
> Hello Andrea
>
> there are several things that can help you (as a hint, not as a
> solution :-():
>
> A) First, when you are uploading big data sets to mysql, you should
> turn off the indexes or it will take ages. For a few tables you can do
> it manually, commenting out the index creation in the sql and later
> using the 'create index' command. I don't know if in mysql you can load
> everything skipping the indexes with mysqladmin and then redo all the
> indexes automatically.
>
>
> B) The approach that I follow when working with full-genome snps is to
> download the tables that I need once, directly with mysql, and then I
> have ensembl-like adaptors to work with them:
>
> mysql -B --port 5306 -u anonymous -h ensembldb.ensembl.org \
>   homo_sapiens_variation_60_37e -e 'select variation_id as var_id,
>   variation_name as snp, sr.name as chr, seq_region_start as start,
>   seq_region_end as end, seq_region_strand as strand, allele_string,
>   map_weight, flags, validation_status, consequence_type
>   from variation_feature vf, seq_region sr
>   where vf.seq_region_id = sr.seq_region_id' > variations_build_37_ens-60.tab
>
> note:
>   This makes a 2Gb file, variations_build_37_ens-60.tab (you can make
>   more specific queries, joining source-variation_synonym-variation_feature,
>   in order to reduce the data). Also, the memory report told me that
>   mysql was taking 4.5 Gb during the download, so you should make more
>   specific queries on small computers. You can always download these
>   three tables (plus seq_region for the chr names) from the Ensembl db
>   dumps and load them into your mysql locally (it is not necessary to
>   download the full variation database).
>
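> For example, something like this sketch would fetch just those four
> table dumps (untested; the FTP directory is assumed from the usual
> Ensembl layout for release 60, so adjust it as needed):
>
> #!/usr/bin/perl
> # Sketch: download four table dumps instead of the whole variation db.
> # The directory path is an assumption based on the standard Ensembl
> # FTP layout; adjust the release and database name to yours.
> use strict;
> use warnings;
> use Net::FTP;
>
> my $dir = '/pub/release-60/mysql/homo_sapiens_variation_60_37e';
> my $ftp = Net::FTP->new('ftp.ensembl.org') or die "connect failed: $@";
> $ftp->login('anonymous', '-anonymous@') or die $ftp->message;
> $ftp->binary;
> $ftp->cwd($dir) or die $ftp->message;
> for my $table (qw(variation_feature variation_synonym source seq_region)) {
>     $ftp->get("$table.txt.gz") or die $ftp->message;
> }
> $ftp->quit;
>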
> You can upload the query-downloaded data to a local mysql with:
>
> ============ loading script: load_ens_vf_b37.sql
>
> -- load with:
> --   mysql --host=variation_local --port=3636 --user=pmg -p < load_ens_vf_b37.sql
> --
>
> use pmGWAS;
>
> drop table if exists ens_vf_b37;
> create table ens_vf_b37 (
>    var_id              int(20),
>    snp                 varchar(20),
>    chr                 varchar(20),
>    start               int(10) UNSIGNED NOT NULL,
>    end                 int(10) UNSIGNED NOT NULL,
>    strand              enum('-1','1'),
>    allele_string       varchar(10),
>    map_weight          tinyint(2),
>    flags               varchar(100),
>    validation_status   varchar(100),
>    consequence_type    varchar(100)
> );
>
> load data local infile 
> '/homes/pmg/ens_dump/variations_build_37_ens-60.tab'
> into table ens_vf_b37
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> IGNORE 1 LINES
> (
> var_id
> ,snp
> ,chr
> ,start
> ,end
> ,strand
> ,allele_string
> ,map_weight
> ,flags
> ,validation_status
> ,consequence_type
> );
>
> -- create the indexes after the insertions, to avoid slowing them down:
> -- alter table ens_vf_b37 add index snp(snp);
> -- alter table ens_vf_b37 add index pos(chr, start);
>
> =============
>
>
> If I need specific data sets (Watson, Venter, dbSNP, ...) I make a
> second table from the output of a join of the 'source' table with
> 'variation_synonym', to know which snps belong to a specific source.
> That way I can make any genomic query faster locally, joining my
> variation_feature table with my source-synonym table by variation_id
> to extract the SNPs wanted.
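>
> Something like this sketch does that join with DBI (untested; the
> database name is an example, and the 'ENSEMBL:Watson' source only
> exists like that in build 36, as noted below):
>
> #!/usr/bin/perl
> # Sketch: list the rsIDs belonging to one source by joining
> # variation_synonym with source. Database/source names are examples.
> use strict;
> use warnings;
> use DBI;
>
> my $db = DBI->connect(
>     'DBI:mysql:database=homo_sapiens_variation_60_37e;'
>         . 'host=ensembldb.ensembl.org;port=5306',
>     'anonymous', '', { RaiseError => 1 });
>
> my $sth = $db->prepare(q{
>     select vs.variation_id, vs.name
>     from variation_synonym vs
>     join source s on s.source_id = vs.source_id
>     where s.name = ?
> });
> $sth->execute('ENSEMBL:Watson');
> while (my ($var_id, $rsid) = $sth->fetchrow_array) {
>     print "$var_id\t$rsid\n";
> }
> $db->disconnect;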
>
>
> For your case, build 36 was straightforward because one of the sources
> was "ENSEMBL:Watson", but in the new build 37 they have added a lot
> more sources and there is no explicit Watson anymore, so I don't know
> where to find it (now you have only ENSEMBL, but in 36 it was
> ENSEMBL:celera/watson/venter):
>
> $  mysql_ensembl_variation_b37 'select name  from source'
> +----------------------------+
> | name                       |
> +----------------------------+
> | dbSNP                      |
> | Affy GeneChip 100K Array   |
> | Affy GeneChip 500K Array   |
> | Affy GenomeWideSNP_6.0     |
> | NHGRI_GWAS_catalog         |
> | EGA                        |
> | Illumina_CytoSNP12v1       |
> | Illumina_Human660W-quad    |
> | Illumina_Human1M-duoV3     |
> | Uniprot                    |
> | COSMIC                     |
> | ENSEMBL                    |
> | DGVa:estd1                 |
> | DGVa:estd3                 |
> | Open Access GWAS Database  |
> | HGMD-PUBLIC                |
> | DGVa:nstd1                 |
> | DGVa:nstd2                 |
> | DGVa:nstd4                 |
> | DGVa:nstd9                 |
> | DGVa:nstd14                |
> | DGVa:nstd16                |
> | DGVa:nstd17                |
> | DGVa:nstd20                |
> | DGVa:nstd11                |
> | DGVa:nstd22                |
> | DGVa:nstd23                |
> | DGVa:nstd27                |
> | DGVa:nstd28                |
> | DGVa:nstd29                |
> | DGVa:nstd30                |
> | DGVa:nstd31                |
> | DGVa:nstd32                |
> | DGVa:nstd34                |
> | DGVa:nstd35                |
> | Affy GenomeWideSNP_6.0 CNV |
> | DGVa:nstd8                 |
> | DGVa:estd19                |
> | DGVa:estd20                |
> | DGVa:estd21                |
> | DGVa:estd22                |
> | DGVa:estd24                |
> | DGVa:nstd36                |
> | DGVa:nstd39                |
> | DGVa:estd48                |
> | DGVa:estd49                |
> | DGVa:estd50                |
> +----------------------------+
>
>
>
>  On Sun, 12 Dec 2010, Andrea Edwards wrote:
>
>> Hi
>>
>> The dump files would be great, but I am also retrieving lots of other 
>> information about the snps along with the snps themselves, and that 
>> might not necessarily be in your dump file, so I think I have to try 
>> other options too.
>>
>> This is what I have tried so far to get the Watson snps, without 
>> getting anywhere fast :)
>>
>> 1. I have written a perl script to download them from the Ensembl 
>> human variation database. This works, but it will take over a month 
>> to get all the snps at the rate at which it seems to be running, and 
>> I imagine you'll block my IP address if I leave it running :) Plus I 
>> can't leave it a month anyway.
>>
>> 2. I've tried to install the human variation database locally, but 
>> that also seems to be having problems. It has been installing the 
>> allele table for 3 days now, I think. It is running on a very slow 
>> machine, but there are far bigger tables than the allele table, so I 
>> dread to think how long they will take. I tried to get access to a 
>> better machine, but I wasn't given enough hard disk space; perhaps 
>> that will solve the problem! How long (roughly) should it take to 
>> install the human variation database on a 64-bit linux machine with 
>> 2 GB of RAM and an Intel Xeon @ 2.27GHz? Will it take hours or days?
>>
>> Is there anything else I can try? I do appreciate that the dataset is 
>> vast and these things will be slow. Perhaps the answer is simply a 
>> faster machine to install the local database on, and I am looking 
>> into this.
>>
>> I have already looked at getting the snps from dbSNP or directly from 
>> the source, but I need to get the information associated with the 
>> snps, so I think I will have the same problems retrieving the 
>> associated data even if I got the 'raw snps' by other means.
>>
>> Many thanks
>>
>> On 09/12/2010 16:53, Fiona Cunningham wrote:
>>>   Dear Andrea,
>>>
>>> We will look into producing a dump file of all SNPs in Watson for
>>> the next release, which should make your life easier. BioMart is
>>> really best suited to specific queries, so we should provide dump
>>> files where large amounts of information across the entire genome
>>> are required.
>>>
>>> Fiona
>>>
>>> ------------------------------------------------------
>>> Fiona Cunningham
>>> Ensembl Variation Project Leader, EBI
>>> www.ensembl.org
>>> www.lrg-sequence.org
>>> t: 01223 494612 || e: fiona at ebi.ac.uk
>>>
>>>
>>>
>>> On 9 December 2010 13:46, Andrea Edwards<edwardsa at cs.man.ac.uk>  wrote:
>>>> Dear all
>>>>
>>>> I've tried downloading the Watson snps from BioMart, both a) as the 
>>>> whole set and b) chromosome by chromosome, and I can't get the 
>>>> data. I have tried requesting the data by email (no email received) 
>>>> and by direct download (the download starts, but at a rate of 1 KB 
>>>> per second, and it times out after about 12 hours / 10 MB 
>>>> downloaded).
>>>>
>>>> I have written a script to get the Watson snps via the perl API, 
>>>> but it is running and taking hours, so I am scared I will get my IP 
>>>> blocked! There are 3 million snps and it took an hour to get 3000, 
>>>> I think.
>>>>
>>>> I was thinking of getting the human databases directly, but I am 
>>>> awaiting a new machine and am totally out of disk space. Does 
>>>> anyone know how big the human core and variation databases are 
>>>> when installed?
>>>>
>>>> thanks a lot
>>>>
>>>>
>>
>>
>>
>
>
> -----
>
>   Pablo Marin-Garcia
>
>
>


