[ensembl-dev] Using the Ensembl REST API to determine FTP URLs for genomes

Thu Dec 19 18:35:33 GMT 2019

Okay, I agree that's definitely better than crawling the FTP site!

Thanks for answering this; it's very helpful.

> My answer to your other question might be a good source of unique species
> names depending on how you're working with our services.

It is! I was actually already using that endpoint; I just didn't realize it
had all the assemblies in that response and thought I needed an additional
endpoint. As far as I can tell, I now know everything I need to make this
work.

Thanks a bunch,

- Kurt

On Thu, Dec 19, 2019 at 12:50 PM Kieron Taylor <ktaylor at ebi.ac.uk> wrote:

> Unfortunately, you are limited to rather inconvenient methods for this. We
> have plans for a service that indexes the FTP content and will make it
> searchable, but you'll have to wait for that.
>
> In the meantime, you can:
>
> Get
> ftp://ftp.ensemblgenomes.org/pub/bacteria/current/species_EnsemblBacteria.txt
> Extract column 13 for the rows you are interested in
> The column contains values like bacteria_177_collection_core_45_98_1
> Strip everything after _collection with a regex and you have the collection
> name: bacteria_177_collection
> Put that into your URL, then add the "species" name as given in the
> tab-separated file above
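
For anyone reading this thread later, the steps above can be sketched in
Python roughly as follows. This is only a minimal sketch, not an official
Ensembl client. The function names are hypothetical, the position of the
"species" column (second column) is an assumption about the file layout, the
URL pattern is copied from the release-45 example quoted further down, and
the core database name used in the usage example is made up for
illustration. Fetching the file itself is left out; the sketch works on lines
of text that have already been downloaded.

```python
import csv
import re

FTP_ROOT = "ftp://ftp.ensemblgenomes.org/pub/bacteria"
CORE_DB_COLUMN = 12  # column 13 when counting from 1, as described above
SPECIES_COLUMN = 1   # ASSUMPTION: the "species" name is the second column

# Keep everything up to and including "_collection", drop "_core_..." onwards.
_COLLECTION_RE = re.compile(r"^(?P<collection>.+_collection)_core_")


def collection_from_core_db(core_db):
    """Return e.g. 'bacteria_177_collection' from a core database name."""
    match = _COLLECTION_RE.match(core_db)
    if match is None:
        raise ValueError(f"unexpected core database name: {core_db!r}")
    return match.group("collection")


def dna_fasta_url(species, core_db, release="current"):
    """Build the DNA FASTA directory URL for one species (hypothetical helper)."""
    collection = collection_from_core_db(core_db)
    return f"{FTP_ROOT}/{release}/fasta/{collection}/{species}/dna/"


def urls_from_species_table(lines, wanted, release="current"):
    """Map each wanted species name to its URL, given the tab-separated lines."""
    urls = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the '#'-prefixed header, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) <= CORE_DB_COLUMN:
            continue
        species = row[SPECIES_COLUMN]
        if species in wanted:
            urls[species] = dna_fasta_url(species, row[CORE_DB_COLUMN], release)
    return urls


if __name__ == "__main__":
    # The core database name here is illustrative, not taken from the real file.
    print(dna_fasta_url("pseudomonas_aeruginosa_pao1",
                        "bacteria_13_collection_core_45_98_1",
                        release="release-45"))
```

Run directly, this prints the same release-45 URL that appears in the
original question below. Leaving release at its default of "current" follows
the symlink instead of pinning a specific release.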
>
> It's not pretty but it's better than crawling the FTP site! My answer to
> your other question might be a good source of unique species names
> depending on how you're working with our services.
>
> Hopefully you are already aware of the "current" link in the path, which
> you can use if you do not need to work on a specific release.
>
>
> Hope that helps,
>
> Kieron
>
>
> Kieron Taylor PhD.
> Ensembl Developer
>
> EMBL, European Bioinformatics Institute
>
> > On 19 Dec 2019, at 15:10, Kurt Wheeler <kurt.wheeler91 at gmail.com> wrote:
> >
> > Hello,
> >
> > I'm trying to figure out how to programmatically find this URL:
> >
> > ftp://ftp.ensemblgenomes.org/pub/bacteria/release-45/fasta/bacteria_13_collection/pseudomonas_aeruginosa_pao1/dna/
> >
> > I got that URL by going to
> > https://bacteria.ensembl.org/Pseudomonas_aeruginosa_pao1/Info/Index/ and
> > clicking a link that said: "Download DNA sequence (FASTA)". However I can't
> > figure out how to get the API to tell me that and I don't want to scrape
> > the HTML for the link.
> >
> > Does anyone know how to find that URL for a given organism/strain?
> >
> > Thanks,
> >
> > - Kurt
> >
> > P.S. I solved this problem for divisions other than bacteria by building
> > the URLs with information that the API does provide:
> > https://github.com/AlexsLemonade/refinebio/blob/dev/foreman/data_refinery_foreman/surveyor/transcriptome_index.py#L48
> >
> > However in the FTP server the bacteria are broken up into collections
> > which I'm having trouble figuring out how to determine.
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > Posting guidelines and subscribe/unsubscribe info:
> https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> > Ensembl Blog: http://www.ensembl.info/