[ensembl-dev] Ensembl annotation pipeline software available?

Fri Apr 6 22:12:19 BST 2018

Greetings all,

Is the software used for this

    http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html

publicly available?  That is, can it be downloaded and run locally?

Background:

We have been working on a couple of echinoderm genomes, including S. 
purpuratus for which a different assembly is available at the NCBI.

    https://www.ncbi.nlm.nih.gov/genome/?term=strongylocentrotus

When we annotate these with Maker and then compare those results with 
the NCBI annotations it is largely an apples and oranges situation, 
making it hard to determine if a given genome region is actually an 
improvement or not over the earlier assembly.  This was true even when 
we used Maker to reannotate the 4.2 Sp genome itself using the NCBI's 
own annotation files for that same assembly to train Snap and Augustus.  
For instance, in 16 randomly selected test scaffolds (2.5 Mbp total)
Maker predicted 90 proteins and only 3 of the amino acid sequences were
identical with the NCBI predictions.
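
A count like the one above can be reproduced with a short script.  What
follows is only a minimal sketch, not the procedure actually used for
the numbers in this post: it assumes both protein sets are available as
FASTA, treats two predictions as identical only when the full amino acid
strings match exactly (ignoring case and a trailing "*" stop symbol),
and every function name in it is made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def normalize(seq):
    # Assumption: upper-case and drop a trailing stop symbol,
    # so that "MSSLQ*" compares equal to "MSSLQ".
    return seq.upper().rstrip("*")


def count_identical(maker_lines, ncbi_lines):
    """Return (Maker proteins with an exact NCBI match, total Maker proteins)."""
    ncbi_seqs = {normalize(seq) for _, seq in read_fasta(ncbi_lines)}
    maker_seqs = [normalize(seq) for _, seq in read_fasta(maker_lines)]
    matched = sum(1 for seq in maker_seqs if seq in ncbi_seqs)
    return matched, len(maker_seqs)


# Toy data standing in for the real protein files.
maker = [">m1", "MKTAYIAK", ">m2", "MSSLQ*", ">m3", "MPPRT"]
ncbi = [">n1", "MKTAYIAK", ">n2", "MSSLQ", ">n3", "MAAAA"]
matched, total = count_identical(maker, ncbi)
print(f"{matched} of {total} Maker proteins have an identical NCBI protein")
```

With real data, open file handles can be passed in place of the lists.
Exact string identity is a deliberately strict test; a prediction that
differs by a single residue counts as a mismatch.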

The best thing to do would be to just run the NCBI's annotation 
pipeline. Alas, we cannot, at least not locally.  The NCBI will not 
release enough information about this:

    https://www.ncbi.nlm.nih.gov/core/assets/genome/images/Pipeline_RFAM.png

to reproduce any of it accurately.  The situation is at times baroque - 
we know they use prosplign for part of it, and they even distribute 
prosplign binaries.  But I had problems running those binaries locally
(library version mismatch), so I tried to build it from the toolkit - and
that cannot be done either.  The prosplign algorithm implementation is
in there but not the code for the program itself.

Taking another tack, it seemed likely that Ensembl at some point made an 
effort to make its gene prediction pipeline produce results which are 
largely compatible with those from the NCBI.  So if the Ensembl software 
is available perhaps it will give predictions which match the NCBI 
results more closely than do those from Maker.

I'm not running down Maker here; it isn't wrong, just different.  The
issues we are having seem mostly to result from the large number of 
rough spots in these genomes, and the NCBI's software is able to gloss 
over some of that.  For instance, in the full Sp 4.2 genome the NCBI
predicted 273 genes (not including tRNA and Mir genes) with assigned
gene names (e.g., NP_999661.1, gene=dlp) and of those a full 207 are
annotated with "exception:" (200 "annotated by transcript or proteomic 
data", 6 "unclassified", and 1 "ribosomal slippage").  So 76% of the 
_best_ NCBI results have iffy sequence regions of one type or another.
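
The arithmetic behind that percentage is easy to check.  This is just a
sketch using the counts quoted above; the variable and function names
are illustrative, and nothing here reads the actual annotation files.

```python
# Counts of "exception:" categories among the named genes, as quoted above.
exception_counts = {
    "annotated by transcript or proteomic data": 200,
    "unclassified": 6,
    "ribosomal slippage": 1,
}
named_genes = 273  # genes with assigned names, excluding tRNA and Mir genes


def exception_fraction(counts, total):
    """Return (number of flagged genes, flagged genes as a fraction of total)."""
    flagged = sum(counts.values())
    return flagged, flagged / total


flagged, fraction = exception_fraction(exception_counts, named_genes)
print(f"{flagged} of {named_genes} named genes flagged ({fraction:.0%})")
# prints: 207 of 273 named genes flagged (76%)
```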

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech