[ensembl-dev] Ensembl annotation pipeline software available?

David Mathog mathog at caltech.edu
Mon Apr 16 20:15:51 BST 2018


On 16-Apr-2018 06:46, Thibaut Hourlier wrote:
> • Mask the genome using RepeatMasker and repbase. In some case we
> would use repeatmodeler to create a repeat library

NP_999661.1.fasta    (NCBI predicted protein 3908aa)
dystrophin_pep.fasta (evigene predicted protein, 3908aa,
    differs from preceding by 12 aa changes, no indels).
Scaffold162.fasta    (Sea urchin genome DNA for region, not masked)


> • Align species specific data with exonerate/genewise

Tried NP_999661 and Scaffold162 on the genewise web
service with these results:

https://www.ebi.ac.uk/Tools/services/web_genewise/toolresult.ebi?jobId=genewise-E20180416-173516-0354-73422181-p2m&analysis=alignment

The prediction is on the wrong strand.  I don't see a way to set that.

exonerate gives slightly different, and not so great, results with
these two peptide sequences:

exonerate --model protein2genome --percent 20 -q dystrophin_pep.fasta \
   -Q protein -T dna -t Scaffold162.fasta >/tmp/evigenePep_vs_genome.txt
#maps 486-2794  402842-264692
#maps 2798-2845 392816-392724 with 141348bp intron and then
#maps 2846-3668 251325-193226

So missing 485aa at the start, 4aa in the middle, 240aa at the end, plus 
a few aa inside the alignments.  When run instead with NP_999661.1.fasta 
the first and last alignments are still found but the center one is not. 
  If the missing pieces of the protein query are extracted and run 
separately more alignments are found, but exonerate doesn't map 
everything in one pass.  The "--percent 20" came from Maker.  If it is 
lowered to 2 from 20, or omitted, then NP_999661 maps like:

    0 -   29  517984 - 517897
    0 -  455  513283 - 404660
  485 - 2794  402842 - 264691
2845 - 3668  251324 - 193225
3667 - 3835  186617 - 380919
3749 - 3908  184419 - 181360

and dystrophin_pep like:
    0 -   29  517984 - 517897
    0 -  455  513283 - 404660
  485 - 2794  402842 - 264691
2797 - 3668  392816 - 193225
3667 - 3835  186617 - 380919
3749 - 3908  184419 - 181360

Which has things in more or less the right order, but it seems pretty 
confused about the 181k-392k region.

> • Align protein from other species with genBlast

Cheating here, using the same species, positive control...

#downloaded genblastg, link "genblast" to it.
ln -s /home/mathog/src/genblast/alignscore.txt alignscore.txt
ln -s /home/mathog/bin/blastall blastall
genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
   -o myoutput -gff -id 58 -cdna -pro

and

genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
   -o myoutput -gff -id 58 -cdna -pro

Both have the same 3725aa peptide for the best prediction.  See the 
attached dotplot to see which pieces are missing, which not surprisingly 
are somewhat correlated with the ends of the exonerate ranges. So far 
genblast has given the best non-NCBI mapping of this protein to the 
genome.  Unfortunately it is still missing chunks here and there.  
Perhaps one of genblast's many parameters compensates for bad sequence 
and will put them back in?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dystrophin_pep_vs_genblast.PNG
Type: image/png
Size: 43147 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180416/07cda21d/attachment.png>


More information about the Dev mailing list