[ensembl-dev] Parsing Variant files and some way of speeding up orthologue mapping between species.

Thu Feb 25 11:22:37 GMT 2021

Hello,

I am downloading various Ensembl data in order to build a new custom database for a product called geneweaver <http://www.geneweaver.org/>.

I can ingest GVF/GTF files reasonably quickly using a library <https://mvnrepository.com/artifact/org.geneweaver/gweaver-stream-io>. It averages around 0.01 ms per variant or gene-like object over the full data set.

In addition to this, the database adds other species and links their orthologues to human. To find orthologues I hit the web service at http://rest.ensembl.org using the "/homology/id/{geneId}" endpoint. This lookup is slow (admittedly it is only needed for genes and not for all variants): my times are coming out at 3-4 s per node.
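
For reference, the per-gene lookup described above looks roughly like the following Python sketch. It is not the actual geneweaver code. The query parameters (type=orthologues, format=condensed) and the response layout (a "data" list whose entries carry a "homologies" list with "species" and "id" fields) are assumptions about the REST service, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "http://rest.ensembl.org"  # server named in this message


def homology_url(gene_id, server=REST_SERVER):
    """Build the /homology/id/{geneId} URL, asking for JSON orthologues only."""
    # Assumed query parameters: type, format and content-type.
    query = urllib.parse.urlencode({
        "type": "orthologues",
        "format": "condensed",
        "content-type": "application/json",
    })
    return f"{server}/homology/id/{urllib.parse.quote(gene_id)}?{query}"


def extract_orthologues(payload):
    """Return (species, gene id) pairs from a decoded response (assumed layout)."""
    pairs = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            pairs.append((hom.get("species"), hom.get("id")))
    return pairs


def fetch_orthologues(gene_id):
    """One HTTP round trip per gene; this is the slow step."""
    with urllib.request.urlopen(homology_url(gene_id), timeout=30) as resp:
        return extract_orthologues(json.load(resp))
```

Calling fetch_orthologues once per gene means one network round trip per gene, which is where the 3-4 s per node goes.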

That makes finding orthologues a very time-consuming process: it takes longer than adding all 700 million or so human variants.

My question is whether it is possible to do this faster. For instance, can I download the orthologue data as a file or database? Or can I get the data in the form in which it is held and parse it myself? Can I bulk export somehow, perhaps fetching more than one orthologue at a time? (If this is not the correct email list, what is? 😊)

Thanks,

Matt Gerring
