[ensembl-dev] Ensembl annotation pipeline software available?

Tue Apr 17 10:51:15 BST 2018

Hi.
Recently, I kept receving your emails.
But, I have done nothing about this work.
So, is there any wrong message?
Best wish for your reaserch.

At 2018-04-17 03:15:51, "David Mathog" <mathog at caltech.edu> wrote:
>On 16-Apr-2018 06:46, Thibaut Hourlier wrote:
>> • Mask the genome using RepeatMasker and repbase. In some case we
>> would use repeatmodeler to create a repeat library
>
>NP_999661.1.fasta    (NCBI predicted protein 3908aa)
>dystrophin_pep.fasta (evigene predicted protein, 3908aa,
>    differs from preceding by 12 aa changes, no indels).
>Scaffold162.fasta    (Sea urchin genome DNA for region, not masked)
>
>
>> • Align species specific data with exonerate/genewise
>
>Tried NP_999661 and Scaffold162 on the genewise web
>service with these results:
>
>https://www.ebi.ac.uk/Tools/services/web_genewise/toolresult.ebi?jobId=genewise-E20180416-173516-0354-73422181-p2m&analysis=alignment
>
>The prediction is on the wrong strand.  I don't see a way to set that.
>
>exonerate gives slightly different, and not so great, results with
>these two peptide sequences:
>
>exonerate --model protein2genome --percent 20 -q dystrophin_pep.fasta \
>   -Q protein -T dna -t Scaffold162.fasta >/tmp/evigenePep_vs_genome.txt
>#maps 486-2794  402842-264692
>#maps 2798-2845 392816-392724 with 141348bp intron and then
>#maps 2846-3668 251325-193226
>
>So missing 485aa at the start, 4aa in the middle, 240aa at the end, plus 
>a few aa inside the alignments.  When run instead with NP_999661.1.fasta 
>the first and last alignments are still found but the center one is not. 
>  If the missing pieces of the protein query are extracted and run 
>separately more alignments are found, but exonerate doesn't map 
>everything in one pass.  The "--percent 20" came from Maker.  If it is 
>lowered to 2 from 20, or omitted, then NP_999661 maps like:
>
>    0 -   29  517984 - 517897
>    0 -  455  513283 - 404660
>  485 - 2794  402842 - 264691
>2845 - 3668  251324 - 193225
>3667 - 3835  186617 - 380919
>3749 - 3908  184419 - 181360
>
>and dystrophin_pep like:
>    0 -   29  517984 - 517897
>    0 -  455  513283 - 404660
>  485 - 2794  402842 - 264691
>2797 - 3668  392816 - 193225
>3667 - 3835  186617 - 380919
>3749 - 3908  184419 - 181360
>
>Which has things in more or less the right order, but it seems pretty 
>confused about the 181k-392k region.
>
>> • Align protein from other species with genBlast
>
>Cheating here, using the same species, positive control...
>
>#downloaded genblastg, link "genblast" to it.
>ln -s /home/mathog/src/genblast/alignscore.txt alignscore.txt
>ln -s /home/mathog/bin/blastall blastall
>genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
>   -o myoutput -gff -id 58 -cdna -pro
>
>and
>
>genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
>   -o myoutput -gff -id 58 -cdna -pro
>
>Both have the same 3725aa peptide for the best prediction.  See the 
>attached dotplot to see which pieces are missing, which not surprisingly 
>are somewhat correlated with the ends of the exonerate ranges. So far 
>genblast has given the best non-NCBI mapping of this protein to the 
>genome.  Unfortunately it is still missing chunks here and there.  
>Perhaps one of genblast's many parameters compensates for bad sequence 
>and will put them back in?
>
>Thanks,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180417/743b1220/attachment.html>