[ensembl-dev] wget/curl access to ftp data among other frustrations

W. Augustine Dunn III wadunn83 at gmail.com
Mon Oct 31 22:29:21 GMT 2011


I owe the group an apology:

Here is my foot: *me eating it*

I am afraid that my frustrations with wget were due to a problem with my
Linux 'group' permissions not playing nice, which resulted in this ftp error:

wget -m ftp://ftp.ensemblgenomes.org/pub/metazoa/release-9/
--2011-10-31 15:19:08--  ftp://ftp.ensemblgenomes.org/pub/metazoa/release-9/
           => `ftp.ensemblgenomes.org/pub/metazoa/release-9/.listing'
Resolving ftp.ensemblgenomes.org... 193.62.197.94
Connecting to ftp.ensemblgenomes.org|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/metazoa/release-9 ... done.
==> PASV ... done.    ==> LIST ... done.
ftp.ensemblgenomes.org/pub/metazoa/release-9: No such file or directory
ftp.ensemblgenomes.org/pub/metazoa/release-9/.listing: No such file or directory

I had never encountered this error before, and it was not obvious what was
wrong.  All the Stack Overflow searching and googling I did blamed that
response on the robots.txt file, which is why I thought wget was being blocked.

Fixing my "group" rules has fixed the problem.  Thank you for your help.
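
For anyone who hits the same "No such file or directory" message: one cheap
sanity check before blaming the server is to confirm that the local directory
the mirror will be written into is actually writable. The following is only a
minimal sketch. It assumes the failure was wget being unable to create its
local directory tree, and the function name can_write_under is made up for
illustration.

```python
import os
import tempfile


def can_write_under(parent: str) -> bool:
    """Return True if the current user can create a file under `parent`.

    Hypothetical helper: it tries to create (and automatically delete) a
    temporary file, which exercises the effective user/group permissions.
    """
    try:
        with tempfile.TemporaryFile(dir=parent):
            return True
    except OSError:
        # Covers a missing directory as well as permission problems.
        return False


if __name__ == "__main__":
    target = os.getcwd()  # assumption: wget -m is run from the current directory
    state = "writable" if can_write_under(target) else "NOT writable"
    print(f"{target}: {state}")
```

If this reports the directory as not writable, the problem is local and no
change on the ftp server will help.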

Even though I am quite ashamed about such an oversight leading to such a
diatribe on my part with the wget problem, I would like to humbly maintain
my suggestion about trying to get ALL the ensembls on the same page at
least with file version naming conventions to allow unified file handling
from species to species and sub-site to sub-site.
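
To make the file-handling point concrete, here is a rough sketch of what a
script currently has to do to cope with both naming styles quoted in my earlier
message below (anopheles_gambiae.AgamP3.62.gtf.gz in release-9 and
Anopheles_gambiae.AgamP3.gtf.gz in release-10). The regular expression and the
names GtfName and parse_gtf_name are illustrative assumptions based only on
those two file names, not an official Ensembl naming specification.

```python
import re
from typing import NamedTuple, Optional


class GtfName(NamedTuple):
    species: str            # lower-cased, e.g. "anopheles_gambiae"
    assembly: str           # e.g. "AgamP3"
    version: Optional[str]  # e.g. "62", or None when the name omits it


# Assumed pattern: <species>.<assembly>[.<version>].gtf.gz
_GTF_RE = re.compile(
    r"^(?P<species>[A-Za-z_]+)"
    r"\.(?P<assembly>[A-Za-z0-9]+)"
    r"(?:\.(?P<version>\d+))?"
    r"\.gtf\.gz$"
)


def parse_gtf_name(filename: str) -> GtfName:
    """Split a GTF file name into species, assembly and optional version."""
    match = _GTF_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised GTF file name: {filename!r}")
    return GtfName(
        species=match.group("species").lower(),  # normalise the case difference
        assembly=match.group("assembly"),
        version=match.group("version"),
    )


if __name__ == "__main__":
    for name in ("anopheles_gambiae.AgamP3.62.gtf.gz",
                 "Anopheles_gambiae.AgamP3.gtf.gz"):
        print(name, "->", parse_gtf_name(name))
```

The second name parses with version set to None, which is exactly the
information that was lost between the two releases.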

Enjoying a beautifully bitter slice of humble-pie:

Gus


On Mon, Oct 31, 2011 at 1:54 PM, W. Augustine Dunn III
<wadunn83 at gmail.com> wrote:

> In short:
> why is this not enabled?!
>
> But with a few more words and hopefully a bit less fragrant frustration
> bleeding through (I'll TRY anyway):
>
> I have spent a very frustrating few hours trying to mirror a few species
> from one of the metazoa.ensembl releases (Aedes, Culex, Anopheles, Dmel).
> You may prefer that people go to the html ftp interface (
> http://metazoa.ensembl.org/info/data/ftp/index.html) and repetitively sit
> at their desk and right-click>download for every file needed for every
> species needed, but this is very annoying and results in an un-organized set
> of files that then need to be put BACK in the nice little order that they
> were ALREADY in before I had the misfortune of realizing that I needed to try
> to get data from you guys.
>
> I appreciate that for the lay-person it is useful to have point-and-click
> a-la-carte downloads but why prevent those that can actually make the MOST
> use of your data (people who know their way around a cmd line and
> understand the need to keep data in a predetermined/organized structure)
> from using standard methods to access your data in a way that is able to be
> automated and scripted for reliable updating in the future?
>
> Actually, on that note, can you PLEASE standardize the names of the files
> included in different "release" folders? E.g., in release-9 the gtf for
> anopheles is
> ftp://ftp.ensemblgenomes.org/pub/metazoa/release-9/gtf/anopheles_gambiae/anopheles_gambiae.AgamP3.62.gtf.gz
> which is GREAT because it gives me some idea of which DB version the gtf is
> based on.  BUT in release-10 it's now named:
> ftp://ftp.ensemblgenomes.org/pub/metazoa/release-10/gtf/anopheles_gambiae/Anopheles_gambiae.AgamP3.gtf.gz
> This gives people no idea which gene-build this gtf represents because AgamP3
> refers to a genome assembly version, NOT a gene-build.
>
> Basically, PLEASE PLEASE PLEASE decide on a standard way to name files so
> that they reflect what version of data they contain. PREFERABLY just use
> Ensembl's conventions!  If you guys are gonna split these orphan genomes
> out into their own little "kids' table at Christmas dinner" the LEAST you
> could do is try to make that as un-obstructive to the science that relies
> on this data as possible.  It's not like people work only in metazoa.ensembl
> OR the "blessed" normal ensembl.  Why put easily avoidable stumbling blocks
> in the way?  All that we should have to do is use a different url. All
> other internals should be made to behave identically.  Anything else is
> directly contributing to important science NOT GETTING DONE.  Not because
> the data is unavailable or tainted but because the scientists are spending
> all their time dealing with cryptically broken scripts and trying to learn
> another schema, if there even IS a coherent new one to begin with.
>
> PLEASE make this better!  At the VERY least please let us use standard
> tools like wget/curl/rsync to pull whole directory structures down intact.
> PLEASE PLEASE PLEASE.
>
> I am sorry if I have offended.  It is NOT my intention.  I am simply
> incredibly frustrated because I have to work in both these domains and
> simple things like easy access and standardized naming conventions do NOT
> seem like that hard of a thing to get right.
>
> Gus Dunn
>
>  --
>
> W. Augustine Dunn, III
> Ph.D. Candidate
> Laboratory of Dr. Anthony James
> Department of Molecular Biology and Biochemistry
> University of California, Irvine
> (949) 824-3210 - Lab
> (949) 824-8551 - Fax
>



-- 
In science, "fact" can only mean "confirmed to such a degree that it would
be perverse to withhold provisional assent." I suppose that apples might
start to rise tomorrow, but the possibility does not merit equal time in
physics classrooms.
*-Stephen Jay Gould*