[ensembl-dev] Using the Ensembl REST API to determine FTP URLs for genomes

Kieron Taylor ktaylor at ebi.ac.uk
Thu Dec 19 17:50:13 GMT 2019


Unfortunately, there is currently no convenient way to do this. We have plans for a service that indexes the FTP content and makes it searchable, but you'll have to wait for that.

In the meantime, you can:

Get ftp://ftp.ensemblgenomes.org/pub/bacteria/current/species_EnsemblBacteria.txt 
Extract column 13 for the rows you are interested in
The column contains values like bacteria_177_collection_core_45_98_1
Use a regex to strip everything after _collection, which leaves the collection name: bacteria_177_collection
Put that into your URL, then add the "species" name as given in the tab-separated file above
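
As a rough illustration, the steps above could be sketched in Python along these lines. This is only a sketch under some assumptions: the function names are made up for the example, the species name is assumed to sit in column 2 of the file (only column 13 is stated above), header lines are assumed to start with "#", and the URL layout is copied from the release-45 example in the original question.

```python
import re

# Base path taken from the URLs quoted in this thread.
FTP_BASE = "ftp://ftp.ensemblgenomes.org/pub/bacteria"

# 0-based column positions in species_EnsemblBacteria.txt.
SPECIES_COL = 1   # ASSUMPTION: "species" name is the 2nd column
CORE_DB_COL = 12  # column 13, as described above


def collection_from_core_db(core_db):
    """Return the collection name from a core database name, or None.

    'bacteria_177_collection_core_45_98_1' -> 'bacteria_177_collection'
    """
    match = re.match(r"^(.+_collection)_core_", core_db)
    return match.group(1) if match else None


def fasta_dna_url(species, core_db, release="current"):
    """Build the FASTA DNA directory URL (hypothetical helper).

    `release` is either "current" or something like "release-45".
    """
    collection = collection_from_core_db(core_db)
    if collection is None:
        raise ValueError(f"no collection found in {core_db!r}")
    return f"{FTP_BASE}/{release}/fasta/{collection}/{species}/dna/"


def urls_from_species_table(text, wanted, release="current"):
    """Map each wanted species name to its URL, given the file's text."""
    urls = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        row = line.split("\t")
        if len(row) <= CORE_DB_COL:
            continue  # skip malformed rows
        species = row[SPECIES_COL]
        if species in wanted:
            urls[species] = fasta_dna_url(species, row[CORE_DB_COL], release)
    return urls
```

The text of species_EnsemblBacteria.txt would need to be downloaded first (for example with urllib.request.urlopen, which accepts ftp:// URLs) and passed in as `text`. Passing release="release-45" instead of the default "current" pins the URL to a specific release.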

It's not pretty but it's better than crawling the FTP site! My answer to your other question might be a good source of unique species names depending on how you're working with our services.

Hopefully you are already aware of the "current" link in the path, which you can use if you do not need to work with a specific release.


Hope that helps,

Kieron


Kieron Taylor PhD.
Ensembl Developer

EMBL, European Bioinformatics Institute






> On 19 Dec 2019, at 15:10, Kurt Wheeler <kurt.wheeler91 at gmail.com> wrote:
> 
> Hello,
> 
> I'm trying to figure out how to programmatically find this URL:
> ftp://ftp.ensemblgenomes.org/pub/bacteria/release-45/fasta/bacteria_13_collection/pseudomonas_aeruginosa_pao1/dna/
> 
> I got that URL by going to https://bacteria.ensembl.org/Pseudomonas_aeruginosa_pao1/Info/Index/ and clicking a link that said: "Download DNA sequence (FASTA)". However I can't figure out how to get the API to tell me that and I don't want to scrape the HTML for the link.
> 
> Does anyone know how to find that URL for a given organism/strain?
> 
> Thanks,
> 
> - Kurt
> 
> P.S. I solved this problem for divisions other than bacteria by building the URLs with information that the API does provide: https://github.com/AlexsLemonade/refinebio/blob/dev/foreman/data_refinery_foreman/surveyor/transcriptome_index.py#L48
> 
> However in the FTP server the bacteria are broken up into collections which I'm having trouble figuring out how to determine.
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
> Ensembl Blog: http://www.ensembl.info/




