[ensembl-dev] Ensembl annotation pipeline software available?
David Mathog
mathog at caltech.edu
Fri Apr 6 22:12:19 BST 2018
Greetings all,
Is the software used for this
http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html
publicly available? That is, can it be downloaded and run locally?
Background:
We have been working on a couple of echinoderm genomes, including
S. purpuratus, for which a different assembly is available at the NCBI:
https://www.ncbi.nlm.nih.gov/genome/?term=strongylocentrotus
When we annotate these with Maker and then compare those results with
the NCBI annotations it is largely an apples and oranges situation,
making it hard to determine whether a given genome region is actually an
improvement over the earlier assembly. This was true even when
we used Maker to reannotate the 4.2 Sp genome itself using the NCBI's
own annotation files for that same assembly to train Snap and Augustus.
For instance, in 16 randomly selected test scaffolds (2.5 Mbp total)
Maker predicted 90 proteins and only 3 of the amino acid sequences were
identical to the NCBI predictions.
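A minimal sketch of how such an identity count could be made, assuming both
sets of predicted proteins are available as FASTA text. The function names
(read_fasta, normalize, count_identical) are hypothetical helpers written for
this illustration; they are not part of Maker or of any NCBI tool. Sequences
are compared as exact strings after upper-casing and dropping a trailing stop
symbol.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize(seq):
    """Upper-case the sequence and drop a trailing '*' stop symbol."""
    return seq.upper().rstrip("*")


def count_identical(query_handle, reference_handle):
    """Return (matched, total): query proteins identical to any reference protein."""
    reference = {normalize(seq) for _, seq in read_fasta(reference_handle)}
    total = matched = 0
    for _, seq in read_fasta(query_handle):
        total += 1
        if normalize(seq) in reference:
            matched += 1
    return matched, total
```

With real data the two handles would be open files, for example
open("maker_proteins.fasta") and open("ncbi_proteins.fasta") (both file names
are placeholders). Exact string matching is deliberately strict: a prediction
that differs by one residue or one exon boundary counts as a mismatch.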
The best thing to do would be to just run the NCBI's annotation
pipeline. Alas, we cannot, at least not locally. The NCBI will not
release enough information about this:
https://www.ncbi.nlm.nih.gov/core/assets/genome/images/Pipeline_RFAM.png
to reproduce any of it accurately. The situation is at times baroque -
we know they use prosplign for part of it, and they even distribute
prosplign binaries. But I had problems running those binaries locally
(library version mismatch), so I tried to build it from the toolkit, and
that cannot be done either. The prosplign algorithm implementation is
in there but not the code for the program itself.
Taking another tack, it seemed likely that Ensembl at some point made an
effort to make its gene prediction pipeline produce results which are
largely compatible with those from the NCBI. So if the Ensembl software
is available perhaps it will give predictions which match the NCBI
results more closely than do those from Maker.
I'm not running down Maker here; it isn't wrong, just different. The
issues we are having seem mostly to result from the large number of
rough spots in these genomes, and the NCBI's software is able to gloss
over some of that. For instance in the full Sp 4.2 genome NCBI
predicted 273 genes (not including tRNA and Mir genes) with assigned
gene names (e.g., NP_999661.1, gene=dlp) and of those a full 207 are
annotated with "exception:" (200 "annotated by transcript or proteomic
data", 6 "unclassified", and 1 "ribosomal slippage"). So 76% of the
_best_ NCBI results have iffy sequence regions of one type or another.
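The arithmetic behind that percentage can be checked with a short sketch. The
counts are the ones quoted above; the variable names are only illustrative.

```python
# Counts of "exception:" annotations among named genes, as quoted in the text.
exceptions = {
    "annotated by transcript or proteomic data": 200,
    "unclassified": 6,
    "ribosomal slippage": 1,
}
named_genes = 273  # named genes, excluding tRNA and Mir genes

flagged = sum(exceptions.values())      # genes carrying any exception
percent_flagged = 100 * flagged / named_genes

print(f"{flagged} of {named_genes} named genes flagged ({percent_flagged:.0f}%)")
```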
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech