[ensembl-dev] SQL query to retrieve gene sequence...

Andrew Yates ayates at ebi.ac.uk
Tue Dec 16 16:58:56 GMT 2014


Hi Rob,

I won’t ever say it cannot be done just in the database. More I don’t think it’ll be as performant or as easy as the alternatives myself & Kieron suggested :).

Andy

------------
Andrew Yates - Ensembl Support Coordinator
European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44-(0)1223-492538
Fax: +44-(0)1223-494468
Skype: andrewyatz
http://www.ensembl.org/

> On 16 Dec 2014, at 16:40, Rob Sargent <rob.sargent at utah.edu> wrote:
> 
> A function/stored-procedure using a recursive CTE might be the way for Steve to go.
> 
> On 12/16/2014 09:28 AM, Andrew Yates wrote:
>> Hey Steve,
>> 
>> The problem with using the database is that sequence is not stored against the top-level sequences annotation is held against. Instead sequence is held against the contig sequence regions which requires descending through the assembly table an unspecified number of times (once for each mapping e.g. chromosome -> supercontig -> contig). 
>> 
>> I would seriously *not* recommend doing this. Not only do you have to deal with descending down the assembly but also having to think about concatenating the sequence & paying attention to the orientation of assembly. Instead you could use the Perl API (probably not an option considering you’re a Python guy), BioMart (you can access unspliced gene sequence quite easily), the REST API or download the full genome sequence from FTP and doing subslices. The faindex index tool from htslib/samtools is pretty good at extracting arbitrary sequence from very large FASTA files.
>> 
>> Andy
>> 
>> ------------
>> Andrew Yates - Ensembl Support Coordinator
>> European Molecular Biology Laboratory
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton, Cambridge
>> CB10 1SD, United Kingdom
>> Tel: +44-(0)1223-492538
>> Fax: +44-(0)1223-494468
>> Skype: andrewyatz
>> http://www.ensembl.org/ <http://www.ensembl.org/>
>>> On 16 Dec 2014, at 16:15, Steve Moss <gawbul at gmail.com <mailto:gawbul at gmail.com>> wrote:
>>> 
>>> Dear EnsEMBL Dev,
>>> 
>>> I'm trying to write a raw SQL query to retrieve the sequence for the human BRCA2 gene to compare different methods of accessing EnsEMBL data. I'm currently doing the following, but getting an empty set.
>>> 
>>> SELECT SUBSTRING(sequence, g.seq_region_start, g.seq_region_end)
>>> FROM dna d
>>> JOIN gene g
>>> ON d.seq_region_id = g.seq_region_id
>>> WHERE g.stable_id="ENSG00000139618"
>>> 
>>> What am I missing? I think I'm falling short on working out the coord. system mapping stuff. Any pointers to help in fixing please?
>>> 
>>> Cheers,
>>> 
>>> Steve
>>> 
>>> -- 
>>>  
>>> Steve Moss
>>> about.me/gawbul
>>> 
>>>   <http://about.me/gawbul>_______________________________________________
>>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org <mailto:Dev at ensembl.org>
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev <http://lists.ensembl.org/mailman/listinfo/dev>
>> Ensembl Blog: http://www.ensembl.info/ <http://www.ensembl.info/>
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20141216/69582488/attachment.html>


More information about the Dev mailing list