[ensembl-dev] wget/curl access to ftp data among other frustrations

W. Augustine Dunn III wadunn83 at gmail.com
Mon Oct 31 20:54:43 GMT 2011


In short:
why is this not enabled?!

But with a few more words and hopefully a bit less fragrant frustration
bleeding through (I'll TRY anyway):

I have spent a very frustrating few hours trying to mirror a few species
from one of the metazoa.ensembl releases (Aedes, Culex, Anopheles, Dmel).
You may prefer that people go to the html ftp interface (
http://metazoa.ensembl.org/info/data/ftp/index.html) and repetitively sit
at their desk and right-click>download for every file needed for every
species needed, but this is very annoying and results in an unorganized set
of files that then need to be put BACK in the nice little order that they
were ALREADY in before I had the misfortune of realizing that I needed to
try to get data from you guys.

I appreciate that for the lay-person it is useful to have point-and-click
a-la-carte downloads but why prevent those that can actually make the MOST
use of your data (people who know their way around a cmd line and
understand the need to keep data in a predetermined/organized structure)
from using standard methods to access your data in a way that is able to be
automated and scripted for reliable updating in the future?

Actually, on that note, can you PLEASE standardize the names of the files
included in different "release" folders? E.g., in release 9 the gtf for
anopheles is
ftp://ftp.ensemblgenomes.org/pub/metazoa/release-9/gtf/anopheles_gambiae/anopheles_gambiae.AgamP3.62.gtf.gz
which is GREAT because it gives me some idea of which DB version the gtf is
based on.  BUT in release 10 it's now named:
ftp://ftp.ensemblgenomes.org/pub/metazoa/release-10/gtf/anopheles_gambiae/Anopheles_gambiae.AgamP3.gtf.gz
This gives people no idea which gene-build this gtf represents because
AgamP3 refers to a genome assembly version, NOT a gene-build.
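
To make the naming problem concrete, here is a minimal Python sketch of what
a script has to do to recover the version from a gtf file name.  The pattern
<species>.<assembly>.<version>.gtf.gz is an assumption inferred from the
release 9 name above, not a documented convention, and the helper name
version_from_gtf_name is hypothetical.

```python
import re
from typing import Optional

# Assumed layout, inferred from the release 9 file name only:
#   <species>.<assembly>[.<version>].gtf.gz
_GTF_NAME = re.compile(
    r"^(?P<species>[^.]+)\.(?P<assembly>[^.]+)(?:\.(?P<version>\d+))?\.gtf\.gz$"
)


def version_from_gtf_name(url: str) -> Optional[str]:
    """Return the numeric version embedded in a gtf file name, or None.

    None means either the name does not follow the assumed layout or the
    version field is simply absent (as in the release 10 name).
    """
    file_name = url.rsplit("/", 1)[-1]  # keep only the last path component
    match = _GTF_NAME.match(file_name)
    if match is None:
        return None
    return match.group("version")


RELEASE_9 = (
    "ftp://ftp.ensemblgenomes.org/pub/metazoa/release-9/gtf/"
    "anopheles_gambiae/anopheles_gambiae.AgamP3.62.gtf.gz"
)
RELEASE_10 = (
    "ftp://ftp.ensemblgenomes.org/pub/metazoa/release-10/gtf/"
    "anopheles_gambiae/Anopheles_gambiae.AgamP3.gtf.gz"
)

print(version_from_gtf_name(RELEASE_9))
print(version_from_gtf_name(RELEASE_10))
```

With the release 9 name the version field is recoverable; with the release 10
name there is nothing to recover, so any script relying on it silently gets
None.  Note also that the species part changed capitalization between the two
releases, which breaks exact-match lookups as well.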

Basically, PLEASE PLEASE PLEASE decide on a standard way to name files so
that they reflect what version of data they contain. PREFERABLY just use
Ensembl's conventions!  If you guys are gonna split these orphan genomes
out into their own little "kids' table at Christmas dinner" the LEAST you
could do is try to make that as un-obstructive to the science that relies
on this data as possible.  It's not like people work only in metazoa.ensembl
OR the "blessed" normal ensembl.  Why put easily avoidable stumbling blocks
in the way?  All that we should have to do is use a different url. All
other internals should be made to behave identically.  Anything else is
directly contributing to important science NOT GETTING DONE.  Not because
the data is unavailable or tainted but because the scientists are spending
all their time dealing with cryptically broken scripts and trying to learn
another schema if there even IS a coherent new one to begin with.

PLEASE make this better!  At the VERY least please let us use standard tools
like wget/curl/rsync to pull whole directory structures down intact.
PLEASE PLEASE PLEASE.
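
As a rough illustration of the kind of scripted, structure-preserving pull
being asked for, here is a minimal Python sketch using the standard ftplib
module.  It ASSUMES the server accepts anonymous FTP logins and directory
listings, which is exactly what this message says is not working; the
function names and the local "mirror" directory are hypothetical, and the
sketch handles only a flat directory of plain files.

```python
import os
from ftplib import FTP


def mirror_directory(ftp, remote_dir, local_root):
    """Download every file in remote_dir, recreating its path under local_root.

    Assumes remote_dir contains only plain files (no sub-directories).
    Returns the list of local paths written.
    """
    local_dir = os.path.join(local_root, remote_dir.strip("/"))
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    saved = []
    for entry in ftp.nlst():
        name = os.path.basename(entry)  # some servers return full paths
        target = os.path.join(local_dir, name)
        with open(target, "wb") as handle:
            ftp.retrbinary("RETR " + name, handle.write)
        saved.append(target)
    return saved


def mirror_anopheles_gtf(local_root="mirror"):
    """Hypothetical usage: pull the release 10 Anopheles gtf directory."""
    # Host and path are taken from the URLs quoted earlier in this message;
    # anonymous login being permitted is an assumption.
    with FTP("ftp.ensemblgenomes.org") as ftp:
        ftp.login()
        return mirror_directory(
            ftp, "/pub/metazoa/release-10/gtf/anopheles_gambiae", local_root
        )
```

Because the remote path is recreated under the local root, re-running the
same call for each species and release keeps the files in the order they
already have on the server.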

I am sorry if I have offended.  It is NOT my intention.  I am simply
incredibly frustrated because I have to work in both these domains and
simple things like easy access and standardized naming conventions do NOT
seem like that hard a thing to get right.

Gus Dunn

 --

W. Augustine Dunn, III
Ph.D. Candidate
Laboratory of Dr. Anthony James
Department of Molecular Biology and Biochemistry
University of California, Irvine
(949) 824-3210 - Lab
(949) 824-8551 - Fax