[ensembl-dev] Ensembl annotation pipeline software available?
mawenlong
mawenlong_nwsuaf at 163.com
Tue Apr 17 10:51:15 BST 2018
Hi.
Recently, I kept receving your emails.
But, I have done nothing about this work.
So, is there any wrong message?
Best wish for your reaserch.
At 2018-04-17 03:15:51, "David Mathog" <mathog at caltech.edu> wrote:
>On 16-Apr-2018 06:46, Thibaut Hourlier wrote:
>> • Mask the genome using RepeatMasker and repbase. In some case we
>> would use repeatmodeler to create a repeat library
>
>NP_999661.1.fasta (NCBI predicted protein 3908aa)
>dystrophin_pep.fasta (evigene predicted protein, 3908aa,
> differs from preceding by 12 aa changes, no indels).
>Scaffold162.fasta (Sea urchin genome DNA for region, not masked)
>
>
>> • Align species specific data with exonerate/genewise
>
>Tried NP_999661 and Scaffold162 on the genewise web
>service with these results:
>
>https://www.ebi.ac.uk/Tools/services/web_genewise/toolresult.ebi?jobId=genewise-E20180416-173516-0354-73422181-p2m&analysis=alignment
>
>The prediction is on the wrong strand. I don't see a way to set that.
>
>exonerate gives slightly different, and not so great, results with
>these two peptide sequences:
>
>exonerate --model protein2genome --percent 20 -q dystrophin_pep.fasta \
> -Q protein -T dna -t Scaffold162.fasta >/tmp/evigenePep_vs_genome.txt
>#maps 486-2794 402842-264692
>#maps 2798-2845 392816-392724 with 141348bp intron and then
>#maps 2846-3668 251325-193226
>
>So missing 485aa at the start, 4aa in the middle, 240aa at the end, plus
>a few aa inside the alignments. When run instead with NP_999661.1.fasta
>the first and last alignments are still found but the center one is not.
> If the missing pieces of the protein query are extracted and run
>separately more alignments are found, but exonerate doesn't map
>everything in one pass. The "--percent 20" came from Maker. If it is
>lowered to 2 from 20, or omitted, then NP_999661 maps like:
>
> 0 - 29 517984 - 517897
> 0 - 455 513283 - 404660
> 485 - 2794 402842 - 264691
>2845 - 3668 251324 - 193225
>3667 - 3835 186617 - 380919
>3749 - 3908 184419 - 181360
>
>and dystrophin_pep like:
> 0 - 29 517984 - 517897
> 0 - 455 513283 - 404660
> 485 - 2794 402842 - 264691
>2797 - 3668 392816 - 193225
>3667 - 3835 186617 - 380919
>3749 - 3908 184419 - 181360
>
>Which has things in more or less the right order, but it seems pretty
>confused about the 181k-392k region.
>
>> • Align protein from other species with genBlast
>
>Cheating here, using the same species, positive control...
>
>#downloaded genblastg, link "genblast" to it.
>ln -s /home/mathog/src/genblast/alignscore.txt alignscore.txt
>ln -s /home/mathog/bin/blastall blastall
>genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
> -o myoutput -gff -id 58 -cdna -pro
>
>and
>
>genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
> -o myoutput -gff -id 58 -cdna -pro
>
>Both have the same 3725aa peptide for the best prediction. See the
>attached dotplot to see which pieces are missing, which not surprisingly
>are somewhat correlated with the ends of the exonerate ranges. So far
>genblast has given the best non-NCBI mapping of this protein to the
>genome. Unfortunately it is still missing chunks here and there.
>Perhaps one of genblast's many parameters compensates for bad sequence
>and will put them back in?
>
>Thanks,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180417/743b1220/attachment.html>
More information about the Dev
mailing list