[ensembl-dev] Parsing Variant files and some way of speeding up orthologue mapping between species.

Nicolas Thierry-Mieg Nicolas.Thierry-Mieg at univ-grenoble-alpes.fr
Fri Feb 26 09:55:57 GMT 2021


Hi Matt,

I second Andrew's suggestion.

For a much simpler but similar issue I had a while back, following his 
advice (a direct MySQL query rather than a query through the API) took the 
runtime for a simple task from 16h34m to 1.2s.

Check out the thread here:
https://lists.ensembl.org/pipermail/dev_ensembl.org/2019-September/000189.html
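
To give an idea of what a direct query could look like, here is a rough 
Python sketch that pulls all orthologue pairs between two species in one 
go. It is not taken from that thread. The table and column names 
(homology, homology_member, gene_member, genome_db) are my reading of the 
Compara schema, so check them against the schema documentation for your 
release. The function names, the PyMySQL driver and the release number are 
assumptions made for illustration, and the query may well need tuning 
(for example filtering on method_link_species_set_id) to run quickly.

```python
from typing import Any, List, Tuple

# One row per orthologous gene pair between two genomes.
# Assumed Compara tables: homology, homology_member, gene_member, genome_db.
ORTHOLOGUE_SQL = """
SELECT gm1.stable_id, gm2.stable_id, h.description
FROM homology h
JOIN homology_member hm1 ON hm1.homology_id = h.homology_id
JOIN gene_member gm1 ON gm1.gene_member_id = hm1.gene_member_id
JOIN genome_db gd1 ON gd1.genome_db_id = gm1.genome_db_id
JOIN homology_member hm2 ON hm2.homology_id = h.homology_id
JOIN gene_member gm2 ON gm2.gene_member_id = hm2.gene_member_id
JOIN genome_db gd2 ON gd2.genome_db_id = gm2.genome_db_id
WHERE gd1.name = %s
  AND gd2.name = %s
  AND h.description LIKE %s
"""


def fetch_orthologue_pairs(
    conn: Any, source_species: str, target_species: str
) -> List[Tuple[Any, ...]]:
    """Return (source_gene_id, target_gene_id, homology_type) rows.

    `conn` is any DB-API connection whose driver uses the %s placeholder
    style (PyMySQL and mysqlclient both do).
    """
    cur = conn.cursor()
    try:
        # The LIKE pattern is passed as a parameter so that its '%' does
        # not clash with the driver's own %s substitution.
        cur.execute(ORTHOLOGUE_SQL, (source_species, target_species, "ortholog%"))
        return [tuple(row) for row in cur.fetchall()]
    finally:
        cur.close()


def connect_public_compara(release: int) -> Any:
    """Open an anonymous connection to the public Ensembl MySQL server.

    Assumes PyMySQL is installed and that the database is named
    ensembl_compara_<release>; both are assumptions to verify.
    """
    import pymysql  # imported lazily so the rest works without the driver

    return pymysql.connect(
        host="ensembldb.ensembl.org",
        port=3306,
        user="anonymous",
        database=f"ensembl_compara_{release}",
    )
```

Usage would be something like 
fetch_orthologue_pairs(connect_public_compara(103), "homo_sapiens", "mus_musculus"), 
where 103 is only an example release number.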

Regards,
Nicolas


On 25/02/2021 14:15, Olson, Andrew wrote:
> Hi Matt,
> There is also a public MySQL server where you can access the Compara database: https://useast.ensembl.org/info/data/mysql.html
> It has a fairly complex schema but if you need rapid ETL once every few months, it’s worth the effort.
> Andrew
> 
> On Feb 25, 2021, at 6:23 AM, Matthew Gerring <Matthew.Gerring at jax.org> wrote:
> 
> 
> Hello,
> 
> I am downloading various Ensembl data in order to build a new custom database for a product called geneweaver (http://www.geneweaver.org/).
> 
> I can ingest GVF/GTF reasonably quickly using a library (https://mvnrepository.com/artifact/org.geneweaver/gweaver-stream-io). It averages around 0.01 ms per Variant or Gene-like object over the full data.
> 
> In addition to this, the database adds other species and links their orthologues to human. To find orthologues I hit the web service at http://rest.ensembl.org using the "/homology/id/{geneId}" endpoint. This lookup is slow (admittedly it is only for genes and not all variants): my times are coming out at 3-4s per node.
> 
> That makes finding orthologues a considerably time-consuming process; it takes longer than adding all 700 million or so human variants.
> 
> My question is whether it is possible to do this faster. For instance, can I download the orthologue data as a file or database? Or the data in which it is held, and parse it myself? Can I bulk export somehow? Perhaps more than one orthologue at a time? (If this is not the correct email list, what is? 😊)
> 
> Thanks,
> 
> Matt Gerring
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/
> 

-- 
-----------------------------------------------------------
Nicolas Thierry-Mieg
Laboratoire TIMC-IMAG/BCM, CNRS UMR 5525
Pavillon Taillefer, Faculte de Medecine
38700 La Tronche, France
tel: (+33)456.520.067, fax: (+33)456.520.055
------------------------------------------------------------



