[ensembl-dev] Ensembl annotation pipeline software available?
David Mathog
mathog at caltech.edu
Tue Apr 10 17:23:03 BST 2018
On 10-Apr-2018 03:10, Thibaut Hourlier wrote:
> Yes, all the code used by Ensembl is free to use and can be found on
> github.com/Ensembl. Unfortunately we do not have proper documentation
> on how to install the pipelines and how to use them, but we are
> working on it.
OK
>
> If by locally you mean on your laptop, it might take some time,
> probably more than a month, but it is hard to predict. Our pipeline is
> made to be run on a cluster with hundreds of jobs running in parallel.
This would be on a large Dell server with ~40 threads.
>
> All our pipelines are made to use MySQL databases which are created
> when the pipeline needs them. You need to have a database with the
> Ensembl schema containing your DNA.
Why? The input DNA consists of a FASTA header (with completely
arbitrary information, might as well just be the numbers 1 to N) and
the sequence. That's it. Other than the read mappings, what other
information would there be in a pre-annotated genome?
> If the assembly is available at
> NCBI the pipeline will do the right thing. Otherwise you will need to
> manually load your assembly into the database.
Nope, not there.
>
> We are using linuxbrew to install all the software we need:
> https://github.com/Ensembl/homebrew-ensembl
> https://github.com/Ensembl/homebrew-cask
> https://github.com/Ensembl/homebrew-external
> https://github.com/Ensembl/homebrew-moonshine (you will need to get
> the license and archive for software like genscan)
This genscan? http://genes.mit.edu/license.html
Is there a list somewhere in the github repository of the dependencies?
Does one of the scripts check for these and report when it starts up?
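A startup check of the kind asked about here can be sketched in a few lines of shell. This is only an illustration, not something taken from the Ensembl repositories: the function name check_tools is made up, and the tool names passed to it are whatever the pipeline turns out to need.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Ensembl repository): report which of
# the named executables are missing from PATH.
# Returns 0 if every tool was found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `check_tools genscan perl mysql` before launching anything; that list is an assumption based on names mentioned in this thread, not a complete dependency list.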
> brew tap ensembl/ensembl
> brew tap ensembl/cask
> brew tap ensembl/external
> brew tap ensembl/moonshine
> brew install genebuild-annotation
> brew install rnaseq-pipeline
>
> Once all the software is installed, you will need these repositories
> to run the pipeline:
> https://github.com/Ensembl/ensembl
> https://github.com/Ensembl/ensembl-analysis dev/hive_master (branch)
> https://github.com/Ensembl/ensembl-hive
> https://github.com/Ensembl/ensembl-compara
> https://github.com/Ensembl/ensembl-io
> https://github.com/Ensembl/ensembl-killlist
> https://github.com/Ensembl/ensembl-production
> https://github.com/bioperl/bioperl-live release-1-6-924 (tag)
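Fetching the repositories listed above could be scripted roughly as follows. This is a sketch under assumptions: the function name clone_repos and the RUN switch are made up for illustration, and repositories with no branch or tag named in the list are cloned at their default branch, which may not be what the pipeline expects. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print the git clone commands for the repositories
# named in this thread, or execute them when RUN=1 is set.
# Only ensembl-analysis (branch) and bioperl-live (tag) have a ref given in
# the thread; the others use the default branch, which is an assumption.
clone_repos() {
    dest=$1
    while read -r url ref; do
        name=${url##*/}                     # last path component of the URL
        if [ -n "$ref" ]; then
            # --branch accepts a tag as well as a branch name
            set -- git clone --branch "$ref" "$url" "$dest/$name"
        else
            set -- git clone "$url" "$dest/$name"
        fi
        if [ "${RUN:-0}" = 1 ]; then
            "$@" </dev/null                 # keep git away from the list on stdin
        else
            echo "$*"
        fi
    done <<EOF
https://github.com/Ensembl/ensembl
https://github.com/Ensembl/ensembl-analysis dev/hive_master
https://github.com/Ensembl/ensembl-hive
https://github.com/Ensembl/ensembl-compara
https://github.com/Ensembl/ensembl-io
https://github.com/Ensembl/ensembl-killlist
https://github.com/Ensembl/ensembl-production
https://github.com/bioperl/bioperl-live release-1-6-924
EOF
}
```

Running `clone_repos "$HOME/ensembl-src"` shows what would be fetched; `RUN=1 clone_repos "$HOME/ensembl-src"` performs the clones. The destination directory is arbitrary.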
>
> ensembl-hive is our job manager, which we use with LSF; SGE is
> supported, and some other job schedulers too. If you want to run jobs
> locally a bit more tuning might be required.
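For a single machine, the tuning would presumably start with how the ensembl-hive beekeeper is invoked. The sketch below only assembles a command line; it assumes that the installed ensembl-hive version accepts the -meadow_type and -total_running_workers_max options of beekeeper.pl, which should be confirmed against `beekeeper.pl -help`. The function name and the database URL are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) a beekeeper.pl command line for
# running workers on the local machine instead of through LSF/SGE.
# ASSUMPTION: the option names below exist in the installed ensembl-hive;
# check them with `beekeeper.pl -help` before relying on this.
beekeeper_local_cmd() {
    url=$1                # pipeline database URL (placeholder value expected)
    workers=${2:-40}      # cap on concurrent workers; 40 matches the server above
    printf '%s\n' "beekeeper.pl -url $url -loop -meadow_type LOCAL -total_running_workers_max $workers"
}
```

For example, `beekeeper_local_cmd mysql://user:pass@localhost:3306/my_pipeline_db 40` prints the command to inspect before running it by hand. Capping workers at the thread count is a starting guess, since individual jobs may themselves be multi-threaded or memory-hungry.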
>
> The configuration of the pipeline will need some tweaking but we will
> be happy to help.
Before going through all of that, is there a way I could manually run a
few tests through just the mapping and gene prediction phases? As
noted in an earlier post the biggest problem seems to be when protein
and mRNAs are mapped onto the DNA, and the DNA typically has some rough
spots. NCBI's code notes and works around those rough spots; Maker
by and large does not. It would be good to put through a few test sets
of known genomic DNA, corresponding mRNA and protein to see if the
results are "NCBI like" or "Maker like".
Basically this would just be:
0. mask (by whichever method is preferred; repeats are known)
1. map corresponding mRNA to genome
2. map corresponding protein to genome
3. run gene prediction on raw genome + mapping
Ideally the predicted gene's mRNA/protein will match the input fairly
closely.
If you could tell me the names of the programs used at each of these
steps it would be a big help in finding the corresponding commands in
all of the code you cited.
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
>
> Thanks
> Thibaut
>
>
>> On 9 Apr 2018, at 17:46, David Mathog <mathog at caltech.edu> wrote:
>>
>> On 06-Apr-2018 14:12, David Mathog wrote:
>>> Greetings all,
>>> Is the software used for this
>>>
>>> http://uswest.ensembl.org/info/genome/genebuild/automatic_coding.html
>>> publicly available? That is, can it be downloaded and run locally?
>>
>> Found these:
>>
>> https://github.com/Ensembl/ensembl-analysis
>> Modules to interface with tools used in Ensembl Gene Annotation
>> Process and scripts to run pipelines
>>
>> https://github.com/Ensembl/ensembl
>> The Ensembl Core Perl API and SQL schema
>>
>> https://github.com/Ensembl/ensembl-annotation
>> The Ensembl gene annotation pipeline (a work in progress)
>>
>> and dozens of others. Have not located any documentation about how to
>> install and run the pipeline though. Anybody know where that might
>> be, or who to ask???
>>
>> I only need the parts to work from data in (genome, proteins, RNA) to
>> gff output. Anything having to do with checking data into or out of
>> EMBL databases is not required.
>>
>> Thanks,
>>
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech