[ensembl-dev] Ensembl annotation pipeline software available?

Mon Apr 16 14:46:12 BST 2018

> On 10 Apr 2018, at 17:23, David Mathog <mathog at caltech.edu> wrote:
> 
> On 10-Apr-2018 03:10, Thibaut Hourlier wrote:
>> Yes, all the code used by Ensembl is free to use and can be found on
>> github.com/Ensembl. Unfortunately we do not have proper
>> documentation on how to install the pipelines and how to use them,
>> but we are working on it.
> 
> OK
> 
>> If by locally you mean on your laptop, it might take some time,
>> probably more than a month but it is hard to predict. Our pipeline is
>> made to be run on a cluster with hundreds of jobs running in parallel.
> 
> This would be on a ~40 thread large Dell server.
> 
>> All our pipelines are made to use MySQL databases which are created
>> when the pipeline needs them. You need to have a database with the
>> Ensembl schema containing your dna.
> 
> Why?  The input dna consists of a fasta header (with completely arbitrary information, might as well just be the numbers 1->N) and the sequence.  That's it.  Other than the read mappings, what other information would there be in a pre-annotated genome?

Yes, technically you only need your sequences. We store all the data we produce in databases. The first step of our annotation pipeline is to store the sequences in the database that will later be used for the website, for the public MySQL instance, and by anyone in Ensembl who needs the sequences for a species. It makes more sense for us to have this database ready at the beginning of our production cycle rather than at the end.
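
To illustrate the idea of storing the sequences first, here is a minimal sketch in Python. It is not the Ensembl loader: it uses an in-memory SQLite database instead of MySQL, and the `toy_sequence` table and the `parse_fasta` and `load_sequences` helpers are hypothetical names made up for this example, not part of the Ensembl schema or API. It only shows that a FASTA header plus sequence is enough input for this step.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; the name is the first word of the header."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def load_sequences(conn: sqlite3.Connection,
                   records: Iterable[Tuple[str, str]]) -> int:
    """Store each sequence in a toy table (NOT the Ensembl schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS toy_sequence ("
        "name TEXT PRIMARY KEY, length INTEGER NOT NULL, sequence TEXT NOT NULL)"
    )
    rows = [(name, len(seq), seq) for name, seq in records]
    conn.executemany("INSERT INTO toy_sequence VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up input: headers carry arbitrary information, as David describes.
EXAMPLE_FASTA = """>1 arbitrary description
ACGTACGT
ACGT
>2
GGGCCC
"""

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    count = load_sequences(connection, parse_fasta(EXAMPLE_FASTA.splitlines()))
    print(count, "sequences stored")
```

In the real pipeline the target would be a MySQL database with the Ensembl core schema, created by the pipeline itself or loaded manually as described in this thread.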

> 
>> If the assembly is available at
>> NCBI the pipeline will do the right thing. Otherwise you will need to
>> manually load your assembly into the database.
> 
> Nope, not there.
> 
>> We are using linuxbrew to install all the software we need:
>> https://github.com/Ensembl/homebrew-ensembl
>> https://github.com/Ensembl/homebrew-cask
>> https://github.com/Ensembl/homebrew-external
>> https://github.com/Ensembl/homebrew-moonshine (you will need to get
>> the license and archive for software like genscan)
> 
> This genscan? http://genes.mit.edu/license.html

Yes, this Genscan. I understand that in your case you will not want to run it, so there is no point in asking for the license. As it is a dependency of the pipeline, I wanted to make you aware that some of the software might need a license.

> 
> Is there a list somewhere in the github repository of the dependencies?  Does one of the scripts check for these and report when it starts up?

Linuxbrew is a package manager which doesn't require admin rights. So by installing the software with the commands below, you should have all the software and dependencies required.

> 
>> brew tap ensembl/ensembl
>> brew tap ensembl/cask
>> brew tap ensembl/external
>> brew tap ensembl/moonshine
>> brew install genebuild-annotation
>> brew install rnaseq-pipeline
>> Once all the software is installed, you will need these repositories
>> to run the pipeline:
>> https://github.com/Ensembl/ensembl
>> https://github.com/Ensembl/ensembl-analysis dev/hive_master (branch)
>> https://github.com/Ensembl/ensembl-hive
>> https://github.com/Ensembl/ensembl-compara
>> https://github.com/Ensembl/ensembl-io
>> https://github.com/Ensembl/ensembl-killlist
>> https://github.com/Ensembl/ensembl-production
>> https://github.com/bioperl/bioperl-live release-1-6-924 (tag)
>> ensembl-hive is our job manager, which we use with LSF; SGE and
>> some other job schedulers are supported too. If you want to run jobs
>> locally a bit more tuning might be required.
>> The configuration of the pipeline will need some tweaking but we will
>> be happy to help.
> 
> Before going through all of that, is there a way I could manually run a few tests through just the mapping and gene prediction phases?   As noted in an earlier post the biggest problem seems to be when protein and mRNAs are mapped onto the DNA, and the DNA typically has some rough spots.  The NCBI's code notes and works around those rough spots, Maker by and large does not.  It would be good to put through a few test sets of known genomic DNA, corresponding mRNA and protein to see if the results are "NCBI like" or "Maker like".
> 
> Basically this would just be:
> 
> 0.  mask (by whichever method is preferred, repeats are known)
> 1.  map corresponding mRNA to genome
> 2.  map corresponding protein to genome
> 3.  run gene prediction on raw genome + mapping
> 
> Ideally the predicted gene's mRNA/protein will match the input fairly closely.

We do not use gene prediction to generate the annotation. We base our annotation only on cDNA/transcriptomic data and protein data, by aligning them to the genome. What we do is:
• Mask the genome using RepeatMasker and Repbase. In some cases we use RepeatModeler to create a repeat library
• Align species-specific data with Exonerate/GeneWise
• Align proteins from other species with GenBlast
• Select the best gene model at each location using Perl code in ensembl-analysis
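
To give a rough idea of what the last step means, here is a toy sketch in Python. It is not the ensembl-analysis code, which is written in Perl and is considerably more involved. The `GeneModel` class, the `score` field, and the rule "group overlapping models on the same strand into a locus, keep the highest score" are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GeneModel:
    seq_name: str   # sequence the model was aligned to
    start: int      # 1-based inclusive start
    end: int        # 1-based inclusive end
    strand: int     # +1 or -1
    source: str     # e.g. "cdna" or "protein" (hypothetical labels)
    score: float    # hypothetical alignment quality, higher is better


def select_best_models(models: List[GeneModel]) -> List[GeneModel]:
    """Group overlapping same-strand models into loci and keep the best of each."""
    ordered = sorted(models, key=lambda m: (m.seq_name, m.strand, m.start, m.end))
    best: List[GeneModel] = []
    locus: List[GeneModel] = []
    locus_end = 0
    for model in ordered:
        same_locus = bool(locus) and (
            model.seq_name == locus[0].seq_name
            and model.strand == locus[0].strand
            and model.start <= locus_end
        )
        if same_locus:
            locus.append(model)
            locus_end = max(locus_end, model.end)
        else:
            if locus:
                best.append(max(locus, key=lambda m: m.score))
            locus = [model]
            locus_end = model.end
    if locus:
        best.append(max(locus, key=lambda m: m.score))
    return best


if __name__ == "__main__":
    # Made-up candidate models from the two alignment steps.
    candidates = [
        GeneModel("chr1", 100, 500, 1, "cdna", 0.9),
        GeneModel("chr1", 400, 900, 1, "protein", 0.7),
        GeneModel("chr1", 2000, 2500, 1, "protein", 0.8),
        GeneModel("chr1", 150, 450, -1, "cdna", 0.6),
    ]
    for chosen in select_best_models(candidates):
        print(chosen)
```

The design choice to illustrate is that species-specific and cross-species alignments compete at each location, instead of an ab initio predictor producing the final model.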

Thanks
Thibaut

> 
> If you could tell me the names of the programs used at each of these steps it would be a big help in finding the corresponding commands in all of the code you cited.
> 
> Thanks,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
>> Thanks
>> Thibaut
>>> On 9 Apr 2018, at 17:46, David Mathog <mathog at caltech.edu> wrote:
>>> On 06-Apr-2018 14:12, David Mathog wrote:
>>>> Greetings all,
>>>> Is the software used for this
>>>>  http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html
>>>> publicly available?  That is, can it be downloaded and run locally?
>>> Found these:
>>> https://github.com/Ensembl/ensembl-analysis
>>> Modules to interface with tools used in Ensembl Gene Annotation
>>> Process and scripts to run pipelines
>>> https://github.com/Ensembl/ensembl
>>> The Ensembl Core Perl API and SQL schema
>>> https://github.com/Ensembl/ensembl-annotation
>>> The Ensembl gene annotation pipeline (a work in progress)
>>> and dozens of others.  Have not located any documentation about how to install and run the pipeline though.  Anybody know where that might be, or who to ask???
>>> I only need the parts to work from data in (genome, proteins, RNA) to gff output.  Anything having to do with checking data into or out of EMBL databases is not required.
>>> Thanks,
>>> David Mathog
>>> mathog at caltech.edu
>>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
> 
