[ensembl-dev] getting gene exons and transcripts that overlap only the original slice

Pablo Marin-Garcia pg4 at sanger.ac.uk
Wed Jan 12 15:58:04 GMT 2011


On Wed, 12 Jan 2011, Andrea Edwards wrote:

> Hello
>
> When I was looking at just exons I used to use exactly the same approach as 
> Alison. Now that I want to annotate my SNPs and store their relationships to 
> the exons / genes / transcripts they affect, I have decided to approach the 
> problem from the other side, as it were.
>
> Pablo, the idea about flattening the data once per release is brilliant.

But this should be done in a very polite way. In my case, when using 
ensembldb.ensembl.org to extract whole-genome data once, I sleep(1) between 
genes and sleep(600) between chromosomes, so it takes 10 hours or so (just in 
waiting). I don't know whether it is still necessary to be so careful, because 
the current Ensembl servers seem to be powerful, but better safe than sorry.
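
As a rough illustration of the throttling described above, here is a minimal
Python sketch. It is not the script mentioned in this thread: the names
polite_fetch, fetch_gene and total_wait_seconds are hypothetical, and
fetch_gene stands in for whatever call actually retrieves one gene from the
server. Only the pause lengths (1 second between genes, 600 seconds between
chromosomes) come from the message.

```python
import time


def polite_fetch(genes_by_chrom, fetch_gene,
                 gene_pause=1.0, chrom_pause=600.0, sleep=time.sleep):
    """Fetch every gene, pausing between genes and between chromosomes.

    genes_by_chrom: dict mapping chromosome name -> list of gene identifiers.
    fetch_gene: hypothetical callable that retrieves one gene from the server.
    sleep: injectable so the pauses can be skipped or recorded in a dry run.
    """
    results = {}
    for chrom_index, chrom in enumerate(genes_by_chrom):
        if chrom_index > 0:
            sleep(chrom_pause)      # long pause between chromosomes
        for gene_index, gene_id in enumerate(genes_by_chrom[chrom]):
            if gene_index > 0:
                sleep(gene_pause)   # short pause between genes
            results[gene_id] = fetch_gene(gene_id)
    return results


def total_wait_seconds(genes_per_chrom, gene_pause=1.0, chrom_pause=600.0):
    """Estimate the time spent only in pauses, given gene counts per chromosome."""
    gene_gaps = sum(max(count - 1, 0) for count in genes_per_chrom)
    chrom_gaps = max(len(genes_per_chrom) - 1, 0)
    return gene_gaps * gene_pause + chrom_gaps * chrom_pause
```

Calling total_wait_seconds with your own per-chromosome gene counts gives an
estimate of the waiting time before you start a run.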


> shall definitely adopt that approach in the long term. Would you be willing 
> to post your script to the group? I'm glad I asked now.

I will try to find time next week to upload it to GitHub. If you send me 
a personal reminder next week I will tell you where to find it.

> I bet flattening the 
> data takes hours off the run time?

Well, genome-wide approaches are going to take a long time unless you are able 
to parallelize. For single-user bioinformatics on a single multicore computer, 
it is still safe to use MySQL with 26 concurrent scripts (one per autosome, plus 
X, XY, Y and Mt), but this depends on how powerful your machine is (you can 
also split tables per chromosome). Remember that large-scale parallelization 
against the public server is not permitted. To speed things up, one way to go 
is to combine parallelization with local copies of the data in MySQL or, 
better, in-memory hashes built from flat files. If you cannot parallelize at 
all and your computer is not powerful, you will not see much difference, I 
would say, but YMMV.
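
To make the "memory hashes from flat files" idea more concrete, here is a
minimal Python sketch that runs entirely on local data. Everything in it is an
assumption made for illustration: the tab-separated column layout (chrom,
start, end, exon_id), the function names, and the simple linear scan used for
the overlap test. It loads the flat file into a dictionary keyed by chromosome
and then runs one task per chromosome.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def load_exon_index(lines):
    """Build {chromosome: [(start, end, exon_id), ...]} from tab-separated lines.

    The column layout (chrom, start, end, exon_id) is a made-up example format.
    """
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        chrom, start, end, exon_id = line.split("\t")[:4]
        index[chrom].append((int(start), int(end), exon_id))
    for exons in index.values():
        exons.sort()
    return dict(index)


def annotate_chromosome(exons, positions):
    """Return {position: [exon_id, ...]} for one chromosome (inclusive ends)."""
    return {
        pos: [exon_id for start, end, exon_id in exons if start <= pos <= end]
        for pos in positions
    }


def annotate_genome(index, snps_by_chrom,
                    executor_cls=ThreadPoolExecutor, max_workers=4):
    """Run one annotation task per chromosome and collect the results.

    Pass concurrent.futures.ProcessPoolExecutor as executor_cls to use
    several cores for CPU-bound work.
    """
    with executor_cls(max_workers=max_workers) as pool:
        futures = {
            chrom: pool.submit(annotate_chromosome, index.get(chrom, []), positions)
            for chrom, positions in snps_by_chrom.items()
        }
        return {chrom: future.result() for chrom, future in futures.items()}
```

The linear scan keeps the sketch short; for genome-wide SNP sets a sorted
search or an interval tree would likely be a better fit. The max_workers value
should match what your own machine can handle.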


  -Pablo



------------------------

Pablo Marin-Garcia
Team: EGA (vertebrate genomics)
European Bioinformatics Institute.
Cambridge (UK)



