[ensembl-dev] getting gene exons and transcripts that overlap only the original slice

Wed Jan 12 10:36:11 GMT 2011

On Wed, 12 Jan 2011, Alison Meynert wrote:

> Hi Andrea,
>
> I have the same issue with mapping lots of SNPs to Ensembl features. My 
> solution, at least for exons or other reasonably small sets, is to invert the 
> problem by iterating over all features and asking which of my SNPs overlap a 
> given feature. Also, I have a local installation of some Ensembl databases, 
> which really helps.
>
> To do the comparison, I load all my SNPs into a MySQL database table indexed 
> on the sequence region/chromosome and position fields, and query that table 
> with given feature coordinates. Example Perl code below.
>
> I'd be interested to hear if anyone has more efficient/faster solutions as 
> this is something that seems good enough for my purposes at the moment but 
> may not scale well.

Hello Alison,

I follow a similar approach when I want to describe all snps in a gene, but also 
when my starting point, like Andrea, are SNPs (usually a sparse set in my case) 
not genes I do the inverse than you, because a SNP could lay in several exons or 
genes so is more natural for me going the other way around.

For each ensembl release I run a script that loop all genes-transcript-exons and 
flattens the data by exons and then load the tab file to mysql into a table 
[exon_id exon_start exon_end transcript_id gen_id gen_name] being the 
exon_id-transcript_id the uniq key, and exon_start, exon_end, and gene_id being 
indexed.

That way I can quickly find which genes, transcript, exons overlap my SNP. And 
also I am able to find exons/genes n-bases nearby my SNP so I can look for other 
LD SNPs in this genes.

So my scripts and SQL queries are basically like yours but starting the problem 
from the other end (looking which exons overlap the SNP).

I think that for a whole genome approach this is faster than connecting directly 
to ensembl each time, and instantiating all the time ensembl objects. The 
drawback is that this is a specific solution, so you loose flexibility in order 
to gain efficiency, but this is how life is. Finally, as you said, hearing about 
other approaches is welcome.

    -Pablo

>
> Cheers,
> Alison
>
> # connect to the database
> my $dbh = DBI->connect("DBI:mysql:database=$dbname;host=$hostname", $user, 
> $pass) or die "Cannot connect to database\n$!";
>
> # set up the select statement
> my $sth = $dbh->prepare(SELECT * FROM snp WHERE seq_region = ? AND pos >= ? 
> AND pos <= ?);
>
> # iterate over exons or other features
> my $exons = $exon_adaptor->fetch_all ... ;
> while (my $exon = shift @{ $exons })
> {
>    # select SNPs
>    $sth->execute($exon->seq_region_name, $exon->start, $exon->end);
>    while (my $ref = $sth->fetchrow_hashref())
>    {
>        # do something with SNPs
>        my $snp_id = $ref->{'snp_id'};
>        ...
>    }
> }
>
> On 11/01/2011 21:14, Andrea Edwards wrote:
>> Sorry, missed the last bit off my last message. This was what i did but
>> was wondering if there was more efficient way as I have such a lot of data.
>> I have just given an example for one locus as i'm sure you can imagine
>> this code will be looped millions of times
>> 
>> $slice = $slice_adaptor->fetch_by_region( 'chromosome', '9', 21816758,
>> 21816758 );
>> $exons = $slice->get_all_Exons;
>> while (my $exon = shift @{$exons}) {
>> 
>> my $gene = $gene_adaptor->fetch_by_exon_stable_id($exon->stable_id);
>> #process gene
>> my $transcripts = $transcript_adaptor->fetch_all_by_exon_stable_id
>> ($exon->stable_id);
>> 
>> while (my $transcript = shift @{$transcripts}) {
>> #process transcript
>> }
>> 
>> }
>> 
>> 
>> 
>> 
>> On 11/01/2011 19:38, Andrea Edwards wrote:
>>> Hello
>>> 
>>> i have this code below taken from the core api tutorial which gets me
>>> all the exons and transcripts for the gene(s) that overlap a slice.
>>> 
>>> I was hoping for an easy way to get those features of the gene that
>>> only overlap the original one bp slice; this code gets all exons and
>>> transcripts
>>> associated with the gene
>>> 
>>> I thought you might be able to call 'get_all_Object' methods with a
>>> parameter which represents a region of sequence overlap but it seems not.
>>> I also thought they might be filtered automatically based on the
>>> underlying slice but it seems not.
>>> 
>>> Naturally i can filter the features in the list based on their start
>>> and end positions but for speed it would be easier not to retrieve
>>> them all at.
>>> I have a lot of data so speed is important. Please can you advise the
>>> best way to do this.
>>> 
>>> $slice = $slice_adaptor->fetch_by_region( 'chromosome', '9', 21816758,
>>> 21816758 );
>>> 
>>> my $genes = $slice->get_all_Genes();
>>> while ( my $gene = shift @{$genes} ) {
>>> my $gstring = feature2string($gene);
>>> print "$gstring\n";
>>> 
>>> my $transcripts = $gene->get_all_Transcripts();
>>> while ( my $transcript = shift @{$transcripts} ) {
>>> my $tstring = feature2string($transcript);
>>> print "\t$tstring\n";
>>> 
>>> foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
>>> my $estring = feature2string($exon);
>>> print "\t\t$estring\n";
>>> }
>>> }
>>> }
>>> 
>>> print "done\n";
>>> 
>>> Many thanks
>>> 
>>> _______________________________________________
>>> Dev mailing list
>>> Dev at ensembl.org
>>> http://lists.ensembl.org/mailman/listinfo/dev
>> 
>> 
>> _______________________________________________
>> Dev mailing list
>> Dev at ensembl.org
>> http://lists.ensembl.org/mailman/listinfo/dev
>
> -- 
> Alison Meynert
> MRC Human Genetics Unit, Edinburgh
> alison.meynert at hgu.mrc.ac.uk
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>

-----

   Pablo Marin-Garcia