[ensembl-dev] Parsing Variant files and some way of speeding up orthologue mapping between species.

Olson, Andrew olson at cshl.edu
Thu Feb 25 13:15:36 GMT 2021


Hi Matt,
There is also a public MySQL server where you can access the compara database: https://useast.ensembl.org/info/data/mysql.html
It has a fairly complex schema but if you need rapid ETL once every few months, it’s worth the effort.
Andrew
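
As a rough illustration of the MySQL route suggested above, the sketch below pulls orthologue pairs for two species in one query instead of one web request per gene. The table and column names (homology, homology_member, gene_member, genome_db) are assumptions based on the Compara schema and should be checked against the schema documentation for the release you use. The helper names are hypothetical, and the connection is any Python DB-API connection that accepts "%s" placeholders.

```python
# Sketch only: table and column names are ASSUMED from the Ensembl Compara
# schema and must be verified against the release being queried.
ORTHOLOGUE_SQL = """
SELECT gm1.stable_id AS source_gene,
       gm2.stable_id AS target_gene,
       h.description AS homology_type
FROM homology h
JOIN homology_member hm1 ON hm1.homology_id = h.homology_id
JOIN gene_member gm1 ON gm1.gene_member_id = hm1.gene_member_id
JOIN genome_db gd1 ON gd1.genome_db_id = gm1.genome_db_id
JOIN homology_member hm2 ON hm2.homology_id = h.homology_id
JOIN gene_member gm2 ON gm2.gene_member_id = hm2.gene_member_id
JOIN genome_db gd2 ON gd2.genome_db_id = gm2.genome_db_id
WHERE gd1.name = %s
  AND gd2.name = %s
  AND h.description LIKE %s
"""


def orthologue_query(source_species, target_species):
    """Return the SQL text and its parameters (hypothetical helper)."""
    # The LIKE pattern is passed as a parameter so that the literal '%'
    # never collides with the driver's placeholder syntax.
    return ORTHOLOGUE_SQL, (source_species, target_species, "ortholog%")


def fetch_orthologues(connection, source_species, target_species):
    """Run the query on an open DB-API connection and return all rows."""
    sql, params = orthologue_query(source_species, target_species)
    cursor = connection.cursor()
    try:
        cursor.execute(sql, params)
        return list(cursor.fetchall())
    finally:
        cursor.close()
```

The connection would be opened by the caller with a MySQL driver of their choice, using the host, port, user and database name listed on the page linked above. A call might then look like fetch_orthologues(conn, "homo_sapiens", "mus_musculus"), assuming species are stored under those names. Joining on species name may be slow on a database this size, so treat this as a starting point rather than a tuned query.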

On Feb 25, 2021, at 6:23 AM, Matthew Gerring <Matthew.Gerring at jax.org> wrote:


Hello,

I am downloading various ENSEMBL data in order to build a new custom database for a product called geneweaver <http://www.geneweaver.org/>.

I can ingest GVF/GTF reasonably quickly using a library <https://mvnrepository.com/artifact/org.geneweaver/gweaver-stream-io>. It averages around 0.01 ms per Variant or Gene-like object over the full data.

In addition to this, the database adds other species and links their orthologues to human. To find orthologues I hit the web service at http://rest.ensembl.org using the "/homology/id/{geneId}" API. This lookup is slow (admittedly it is only for genes and not all variants): my times are coming out at 3-4 s per node.

That makes finding orthologues a considerably time-consuming process; it takes longer than adding all 700 million-odd human variants.
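
For reference, the per-gene lookup described above can be sketched as below, with several requests in flight at once so that the 3-4 s latency overlaps rather than adds up. This is a minimal sketch under assumptions: the function names are hypothetical, the "type" and "content-type" query parameters are taken from the REST documentation as I understand it, and the public service is rate limited, so the worker count should stay small.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

REST_SERVER = "http://rest.ensembl.org"  # server named in the message above


def homology_url(gene_id):
    """Build the /homology/id/{geneId} URL for one gene (hypothetical helper)."""
    # Query parameters are assumptions; check them against the REST docs.
    query = urllib.parse.urlencode(
        {"type": "orthologues", "content-type": "application/json"}
    )
    return f"{REST_SERVER}/homology/id/{urllib.parse.quote(gene_id)}?{query}"


def fetch_json(url):
    """Fetch one URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def lookup_many(gene_ids, fetch=fetch_json, max_workers=4):
    """Look up several genes concurrently; returns {gene_id: response}."""
    gene_ids = list(gene_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with gene_ids.
        results = list(pool.map(lambda g: fetch(homology_url(g)), gene_ids))
    return dict(zip(gene_ids, results))
```

The fetch argument is injectable so the concurrency logic can be exercised without touching the network, for example lookup_many(ids, fetch=lambda url: url). This only hides latency; it does not reduce the number of requests, so a bulk source is still preferable for whole-genome runs.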

My question is whether it is possible to do this faster. For instance, can I download the orthologue data as a file or database? Or download the data in which it is held and parse it myself? Can I bulk export somehow? Perhaps look up more than one orthologue at a time? (If this is not the correct email list, what is? 😊)

Thanks,

Matt Gerring

_______________________________________________
Dev mailing list    Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
Ensembl Blog: http://www.ensembl.info/

