[ensembl-dev] Ensembl annotation pipeline software available?

Tue Apr 10 17:23:03 BST 2018

On 10-Apr-2018 03:10, Thibaut Hourlier wrote:
> Yes, all the code used by Ensembl is free to use and can be found on
> github.com/Ensembl. Unfortunately we do not have proper documentation
> on how to install the pipelines and how to use them, but we are
> working on it.

OK

> 
> If by locally you mean on your laptop, it might take some time,
> probably more than a month but it is hard to predict. Our pipeline is
> made to be run on a cluster with hundreds of jobs running in parallel.

This would be on a large Dell server with ~40 threads.

> 
> All our pipelines are made to use MySQL databases which are created
> when the pipeline needs them. You need to have a database with the
> Ensembl schema containing your dna.

Why?  The input dna consists of a fasta header (with completely 
arbitrary information, might as well just be the numbers 1->N) and the 
sequence.  That's it.  Other than the read mappings, what other 
information would there be in a pre-annotated genome?
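
To make the point concrete, here is a minimal Python sketch of everything 
a plain FASTA file carries: a header string and a sequence per record.  
The function name read_fasta is mine, not anything from the Ensembl code, 
and this says nothing about what the Ensembl schema stores beyond that.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Called on an open file handle, this yields one (header, sequence) pair 
per record and nothing else.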

> If the assembly is available at
> NCBI the pipeline will do the right thing. Otherwise you will need to
> manually load your assembly into the database.

Nope, not there.

> 
> We are using linuxbrew to install all the software we need:
> https://github.com/Ensembl/homebrew-ensembl
> https://github.com/Ensembl/homebrew-cask
> https://github.com/Ensembl/homebrew-external
> https://github.com/Ensembl/homebrew-moonshine (you will need to get
> the license and archive for software like genscan)

This genscan? http://genes.mit.edu/license.html

Is there a list somewhere in the github repository of the dependencies?  
Does one of the scripts check for these and report when it starts up?
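
The kind of startup check I have in mind could be as simple as the 
following Python sketch.  The tool list is purely illustrative: genscan 
and mysql are the only names I can take from this thread, and I do not 
know the pipeline's real dependency list or the actual executable names.

```python
import shutil

# Illustrative names only: genscan and MySQL are mentioned in this thread,
# but the pipeline's real dependency list is an unknown here.
EXAMPLE_TOOLS = ["genscan", "mysql"]


def find_missing(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(EXAMPLE_TOOLS)
    if missing:
        print("Missing executables: " + ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```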

> brew tap ensembl/ensembl
> brew tap ensembl/cask
> brew tap ensembl/external
> brew tap ensembl/moonshine
> brew install genebuild-annotation
> brew install rnaseq-pipeline
> 
> Once all the software is installed, you will need these repositories
> to run the pipeline:
> https://github.com/Ensembl/ensembl
> https://github.com/Ensembl/ensembl-analysis dev/hive_master (branch)
> https://github.com/Ensembl/ensembl-hive
> https://github.com/Ensembl/ensembl-compara
> https://github.com/Ensembl/ensembl-io
> https://github.com/Ensembl/ensembl-killlist
> https://github.com/Ensembl/ensembl-production
> https://github.com/bioperl/bioperl-live release-1-6-924 (tag)
> 
> ensembl-hive is our job manager, which we use with LSF; SGE and some
> other job schedulers are supported too. If you want to run jobs
> locally, a bit more tuning might be required.
> 
> The configuration of the pipeline will need some tweaking but we will
> be happy to help.

Before going through all of that, is there a way I could manually run a 
few tests through just the mapping and gene prediction phases?   As 
noted in an earlier post the biggest problem seems to be when protein 
and mRNAs are mapped onto the DNA, and the DNA typically has some rough 
spots.  NCBI's code notes and works around those rough spots; Maker 
by and large does not.  It would be good to put through a few test sets 
of known genomic DNA, corresponding mRNA and protein to see if the 
results are "NCBI like" or "Maker like".

Basically this would just be:

0.  mask (by whichever method is preferred, repeats are known)
1.  map corresponding mRNA to genome
2.  map corresponding protein to genome
3.  run gene prediction on raw genome + mapping

Ideally the predicted gene's mRNA/protein will match the input fairly 
closely.
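
For scoring "fairly closely" I would start with something crude like 
the Python sketch below.  It uses difflib from the standard library, 
which is a generic string matcher and not a biological alignment, so the 
number is only a rough screen; the function name percent_similarity is 
my own.

```python
from difflib import SequenceMatcher


def percent_similarity(expected, predicted):
    """Rough similarity (0-100) between two sequences, case-insensitive.

    This is difflib's ratio (2 * matches / total length), not an
    alignment-based percent identity.
    """
    matcher = SequenceMatcher(None, expected.upper(), predicted.upper(),
                              autojunk=False)  # autojunk skews long sequences
    return 100.0 * matcher.ratio()
```

Anything well below 100 for a known gene would then get a proper 
alignment and a look by eye.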

If you could tell me the names of the programs used at each of these 
steps it would be a big help in finding the corresponding commands in 
all of the code you cited.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

> 
> Thanks
> Thibaut
> 
> 
>> On 9 Apr 2018, at 17:46, David Mathog <mathog at caltech.edu> wrote:
>> 
>> On 06-Apr-2018 14:12, David Mathog wrote:
>>> Greetings all,
>>> Is the software used for this
>>> http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html
>>> publicly available?  That is, can it be downloaded and run locally?
>> 
>> Found these:
>> 
>> https://github.com/Ensembl/ensembl-analysis
>>  Modules to interface with tools used in Ensembl Gene Annotation
>>  Process and scripts to run pipelines
>> 
>> https://github.com/Ensembl/ensembl
>>  The Ensembl Core Perl API and SQL schema
>> 
>> https://github.com/Ensembl/ensembl-annotation
>>  The Ensembl gene annotation pipeline (a work in progress)
>> 
>> and dozens of others.  I have not located any documentation about how 
>> to install and run the pipeline, though.  Does anybody know where that 
>> might be, or whom to ask?
>> 
>> I only need the parts to work from data in (genome, proteins, RNA) to 
>> gff output.  Anything having to do with checking data into or out of 
>> EMBL databases is not required.
>> 
>> Thanks,
>> 
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: 
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/