[ensembl-dev] getting gene exons and transcripts that overlap only the original slice

Wed Jan 12 14:50:29 GMT 2011

Hello

When i was looking at just exons I used to use exactly the same approach 
as Alison. Now i want to annotate my snps to store their relationships 
to the exons / genes / transcripts they affect I have decided to 
approach the problem from the other side as it were.

Pablo, the idea about flattening the data once per release is brilliant. 
I shall defininitely adopt that approach in the long term. Would you be 
willing to post your script to the group? I'm glad I asked now. I bet 
flattening the data takes hours off the run time?

Many thanks :)

On 12/01/2011 10:36, Pablo Marin-Garcia wrote:
> On Wed, 12 Jan 2011, Alison Meynert wrote:
>
>> Hi Andrea,
>>
>> I have the same issue with mapping lots of SNPs to Ensembl features. 
>> My solution, at least for exons or other reasonably small sets, is to 
>> invert the problem by iterating over all features and asking which of 
>> my SNPs overlap a given feature. Also, I have a local installation of 
>> some Ensembl databases, which really helps.
>>
>> To do the comparison, I load all my SNPs into a MySQL database table 
>> indexed on the sequence region/chromosome and position fields, and 
>> query that table with given feature coordinates. Example Perl code 
>> below.
>>
>> I'd be interested to hear if anyone has more efficient/faster 
>> solutions as this is something that seems good enough for my purposes 
>> at the moment but may not scale well.
>
>
> Hello Alison,
>
> I follow a similar approach when I want to describe all snps in a 
> gene, but also when my starting point, like Andrea, are SNPs (usually 
> a sparse set in my case) not genes I do the inverse than you, because 
> a SNP could lay in several exons or genes so is more natural for me 
> going the other way around.
>
> For each ensembl release I run a script that loop all 
> genes-transcript-exons and flattens the data by exons and then load 
> the tab file to mysql into a table [exon_id exon_start exon_end 
> transcript_id gen_id gen_name] being the exon_id-transcript_id the 
> uniq key, and exon_start, exon_end, and gene_id being indexed.
>
> That way I can quickly find which genes, transcript, exons overlap my 
> SNP. And also I am able to find exons/genes n-bases nearby my SNP so I 
> can look for other LD SNPs in this genes.
>
> So my scripts and SQL queries are basically like yours but starting 
> the problem from the other end (looking which exons overlap the SNP).
>
> I think that for a whole genome approach this is faster than 
> connecting directly to ensembl each time, and instantiating all the 
> time ensembl objects. The drawback is that this is a specific 
> solution, so you loose flexibility in order to gain efficiency, but 
> this is how life is. Finally, as you said, hearing about other 
> approaches is welcome.
>
>
>    -Pablo
>
>
>>
>> Cheers,
>> Alison
>>
>> # connect to the database
>> my $dbh = DBI->connect("DBI:mysql:database=$dbname;host=$hostname", 
>> $user, $pass) or die "Cannot connect to database\n$!";
>>
>> # set up the select statement
>> my $sth = $dbh->prepare(SELECT * FROM snp WHERE seq_region = ? AND 
>> pos >= ? AND pos <= ?);
>>
>> # iterate over exons or other features
>> my $exons = $exon_adaptor->fetch_all ... ;
>> while (my $exon = shift @{ $exons })
>> {
>>    # select SNPs
>>    $sth->execute($exon->seq_region_name, $exon->start, $exon->end);
>>    while (my $ref = $sth->fetchrow_hashref())
>>    {
>>        # do something with SNPs
>>        my $snp_id = $ref->{'snp_id'};
>>        ...
>>    }
>> }
>>
>> On 11/01/2011 21:14, Andrea Edwards wrote:
>>> Sorry, missed the last bit off my last message. This was what i did but
>>> was wondering if there was more efficient way as I have such a lot 
>>> of data.
>>> I have just given an example for one locus as i'm sure you can imagine
>>> this code will be looped millions of times
>>>
>>> $slice = $slice_adaptor->fetch_by_region( 'chromosome', '9', 21816758,
>>> 21816758 );
>>> $exons = $slice->get_all_Exons;
>>> while (my $exon = shift @{$exons}) {
>>>
>>> my $gene = $gene_adaptor->fetch_by_exon_stable_id($exon->stable_id);
>>> #process gene
>>> my $transcripts = $transcript_adaptor->fetch_all_by_exon_stable_id
>>> ($exon->stable_id);
>>>
>>> while (my $transcript = shift @{$transcripts}) {
>>> #process transcript
>>> }
>>>
>>> }
>>>
>>>
>>>
>>>
>>> On 11/01/2011 19:38, Andrea Edwards wrote:
>>>> Hello
>>>>
>>>> i have this code below taken from the core api tutorial which gets me
>>>> all the exons and transcripts for the gene(s) that overlap a slice.
>>>>
>>>> I was hoping for an easy way to get those features of the gene that
>>>> only overlap the original one bp slice; this code gets all exons and
>>>> transcripts
>>>> associated with the gene
>>>>
>>>> I thought you might be able to call 'get_all_Object' methods with a
>>>> parameter which represents a region of sequence overlap but it 
>>>> seems not.
>>>> I also thought they might be filtered automatically based on the
>>>> underlying slice but it seems not.
>>>>
>>>> Naturally i can filter the features in the list based on their start
>>>> and end positions but for speed it would be easier not to retrieve
>>>> them all at.
>>>> I have a lot of data so speed is important. Please can you advise the
>>>> best way to do this.
>>>>
>>>> $slice = $slice_adaptor->fetch_by_region( 'chromosome', '9', 21816758,
>>>> 21816758 );
>>>>
>>>> my $genes = $slice->get_all_Genes();
>>>> while ( my $gene = shift @{$genes} ) {
>>>> my $gstring = feature2string($gene);
>>>> print "$gstring\n";
>>>>
>>>> my $transcripts = $gene->get_all_Transcripts();
>>>> while ( my $transcript = shift @{$transcripts} ) {
>>>> my $tstring = feature2string($transcript);
>>>> print "\t$tstring\n";
>>>>
>>>> foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
>>>> my $estring = feature2string($exon);
>>>> print "\t\t$estring\n";
>>>> }
>>>> }
>>>> }
>>>>
>>>> print "done\n";
>>>>
>>>> Many thanks
>>>>
>>>> _______________________________________________
>>>> Dev mailing list
>>>> Dev at ensembl.org
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list
>>> Dev at ensembl.org
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>
>> -- 
>> Alison Meynert
>> MRC Human Genetics Unit, Edinburgh
>> alison.meynert at hgu.mrc.ac.uk
>>
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>>
>
>
> -----
>
>   Pablo Marin-Garcia
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev