[ensembl-dev] Ensembl annotation pipeline software available?
David Mathog
mathog at caltech.edu
Mon Apr 16 20:15:51 BST 2018
On 16-Apr-2018 06:46, Thibaut Hourlier wrote:
> • Mask the genome using RepeatMasker and repbase. In some case we
> would use repeatmodeler to create a repeat library
NP_999661.1.fasta (NCBI predicted protein 3908aa)
dystrophin_pep.fasta (evigene predicted protein, 3908aa,
differs from preceding by 12 aa changes, no indels).
Scaffold162.fasta (Sea urchin genome DNA for region, not masked)
> • Align species specific data with exonerate/genewise
Tried NP_999661 and Scaffold162 on the genewise web
service with these results:
https://www.ebi.ac.uk/Tools/services/web_genewise/toolresult.ebi?jobId=genewise-E20180416-173516-0354-73422181-p2m&analysis=alignment
The prediction is on the wrong strand. I don't see a way to set that.
exonerate gives slightly different, and not so great, results with
these two peptide sequences:
exonerate --model protein2genome --percent 20 -q dystrophin_pep.fasta \
-Q protein -T dna -t Scaffold162.fasta >/tmp/evigenePep_vs_genome.txt
#maps 486-2794 402842-264692
#maps 2798-2845 392816-392724 with 141348bp intron and then
#maps 2846-3668 251325-193226
So missing 485aa at the start, 4aa in the middle, 240aa at the end, plus
a few aa inside the alignments. When run instead with NP_999661.1.fasta
the first and last alignments are still found but the center one is not.
If the missing pieces of the protein query are extracted and run
separately more alignments are found, but exonerate doesn't map
everything in one pass. The "--percent 20" came from Maker. If it is
lowered to 2 from 20, or omitted, then NP_999661 maps like:
0 - 29 517984 - 517897
0 - 455 513283 - 404660
485 - 2794 402842 - 264691
2845 - 3668 251324 - 193225
3667 - 3835 186617 - 380919
3749 - 3908 184419 - 181360
and dystrophin_pep like:
0 - 29 517984 - 517897
0 - 455 513283 - 404660
485 - 2794 402842 - 264691
2797 - 3668 392816 - 193225
3667 - 3835 186617 - 380919
3749 - 3908 184419 - 181360
Which has things in more or less the right order, but it seems pretty
confused about the 181k-392k region.
> • Align protein from other species with genBlast
Cheating here, using the same species, positive control...
#downloaded genblastg, link "genblast" to it.
ln -s /home/mathog/src/genblast/alignscore.txt alignscore.txt
ln -s /home/mathog/bin/blastall blastall
genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
-o myoutput -gff -id 58 -cdna -pro
and
genblast -p genblastg -q NP_999661.1.fasta -t Scaffold162.fasta \
-o myoutput -gff -id 58 -cdna -pro
Both have the same 3725aa peptide for the best prediction. See the
attached dotplot to see which pieces are missing, which not surprisingly
are somewhat correlated with the ends of the exonerate ranges. So far
genblast has given the best non-NCBI mapping of this protein to the
genome. Unfortunately it is still missing chunks here and there.
Perhaps one of genblast's many parameters compensates for bad sequence
and will put them back in?
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dystrophin_pep_vs_genblast.PNG
Type: image/png
Size: 43147 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20180416/07cda21d/attachment.png>
More information about the Dev
mailing list