[ensembl-dev] Parsing Variant files and some way of speeding up orthologue mapping between species.

Matthew Gerring Matthew.Gerring at jax.org
Thu Feb 25 11:22:37 GMT 2021


I am downloading various Ensembl data to build a new custom database for a product called geneweaver<http://www.geneweaver.org/>.

I can ingest GVF/GTF files reasonably quickly using a library<https://mvnrepository.com/artifact/org.geneweaver/gweaver-stream-io>. It averages around 0.01 ms per Variant or Gene-like object over the full data.

In addition, the database adds other species and links their orthologues to human. To find orthologues I query the web service at http://rest.ensembl.org using the "/homology/id/{geneId}" endpoint. This lookup is slow (admittedly it is only needed for genes, not for all variants): my times are coming out at 3-4 s per node.
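
For reference, here is a minimal sketch of the kind of per-gene lookup described above, written in Python for brevity. The endpoint path is the one quoted above. The query parameters (type, format, target_species) and the condensed response layout are assumptions about the REST API and should be checked against the documentation at rest.ensembl.org. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://rest.ensembl.org"  # service named in this message


def homology_url(gene_id, target_species=None):
    """Build the URL for one /homology/id/{geneId} lookup."""
    # Assumed query parameters; verify them against the REST documentation.
    params = {
        "type": "orthologues",
        "format": "condensed",
        "content-type": "application/json",
    }
    if target_species:
        params["target_species"] = target_species
    path = f"/homology/id/{urllib.parse.quote(gene_id)}"
    return f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"


def extract_orthologues(payload):
    """Return (species, gene id, homology type) tuples from a decoded response."""
    # Assumed layout: {"data": [{"id": ..., "homologies": [{...}, ...]}]}
    results = []
    for entry in payload.get("data", []):
        for hom in entry.get("homologies", []):
            results.append((hom.get("species"), hom.get("id"), hom.get("type")))
    return results


def fetch_orthologues(gene_id, target_species=None, timeout=30):
    """Perform one network round trip for a single gene."""
    url = homology_url(gene_id, target_species)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_orthologues(json.load(resp))
```

Each call to fetch_orthologues is a separate HTTP request, so the total time grows linearly with the number of genes.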

That makes finding orthologues a considerably time-consuming process; it takes longer than adding all 700 million or so human variants.
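
To put rough numbers on that comparison, here is a back-of-the-envelope calculation using only the figures quoted in this message. The 3.5 s value is an assumed midpoint of the 3-4 s range, and the variant count is approximate.

```python
VARIANTS = 700_000_000        # approximate number of human variants
INGEST_MS_PER_OBJECT = 0.01   # average ingest time per object
LOOKUP_S = 3.5                # assumed midpoint of the 3-4 s per lookup

# Total time to ingest all variants, in seconds.
ingest_seconds = VARIANTS * INGEST_MS_PER_OBJECT / 1000

# Number of sequential lookups that would take the same amount of time.
break_even_lookups = ingest_seconds / LOOKUP_S

print(f"variant ingest: {ingest_seconds / 3600:.1f} h")
print(f"lookups taking as long: {break_even_lookups:.0f}")
```

On these figures the variant ingest takes a little under two hours, and roughly 2,000 sequential lookups take just as long.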

My question is whether it is possible to do this faster. For instance, can I download the orthologue data as a file or database? Or can I get the data in whatever form it is held and parse it myself? Can I bulk export somehow, perhaps requesting more than one orthologue at a time? (If this is not the correct email list, what is? 😊)


Matt Gerring


