[ensembl-dev] Some RefSeq transcripts seem broken

Will McLaren wm2 at ebi.ac.uk
Tue Sep 27 10:01:23 BST 2016


Hi João,

RefSeq transcript sequences can differ from the underlying reference
genome sequence. Ensembl, Ensembl transcripts and therefore the VEP always
use the reference genome.

When we import RefSeq transcripts we are given coordinate mappings for the
exons that are a best match to the reference; any substitutions or indels
relative to the reference sequence can go unaccounted for. When our API
then constructs the transcript from the reference genome and these
coordinates, such differences (an unaccounted-for indel in particular,
since it shifts the reading frame) can give rise to erroneous translations
such as the one you've found.
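
To illustrate the mechanism, here is a toy Python sketch. It is not
Ensembl API code: the genome string, the exon coordinates, the "RefSeq"
CDS and the helper names (splice, translate, first_divergence) are all
made up for the example. It assumes a forward-strand transcript whose
RefSeq record has one extra base relative to the reference, so that
splicing the reference at the best-match exon coordinates shifts the
frame and introduces a premature stop.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(genome, exons):
    """Concatenate 1-based, inclusive exon intervals (forward strand only)."""
    return "".join(genome[start - 1:end] for start, end in exons)


def translate(cds):
    """Translate complete codons only; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def first_divergence(a, b):
    """Return the 0-based index where two sequences first differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Hypothetical RefSeq CDS as deposited: two exons, the second starting "AAA...".
refseq_cds = "ATGGCC" + "AAACTGATCGACTGGTAA"

# Hypothetical reference genome: the second exon has one fewer "A" here.
genome = "CC" + "ATGGCC" + "GTAAGTTTAG" + "AACTGATCGACTGGTAA" + "GG"
exons = [(3, 8), (19, 35)]  # best-match exon coordinates on the reference

protein_from_refseq = translate(refseq_cds)
protein_from_reference = translate(splice(genome, exons))

print(protein_from_refseq)     # MAKLIDW*
print(protein_from_reference)  # MAN*STG
print(first_divergence(protein_from_refseq, protein_from_reference))  # 2
```

The same first_divergence comparison could be applied to a RefSeq
translation and its matched Ensembl translation to locate where a
frameshift begins.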

There's a note to this effect in our documentation:
http://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#refseq

For our human database we also produce some flags indicating when a RefSeq
transcript differs from the reference and/or the matched Ensembl
transcript:
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#refseq_match

Hope that helps

Will McLaren
Ensembl Variation

On 27 September 2016 at 03:48, João Eiras <joao.eiras at gmail.com> wrote:

> Hi.
>
> I wrote a small VEP plugin that outputs the wild type protein sequence
> from the database together with its annotations, so that I can then
> extract some k-mers around annotations.
>
> I got a bit confused to see the amino-acid sequence for some RefSeq
> transcripts containing many stop codons. One such example is the pair
> of transcripts ENSMUST00000114099 and NM_172709.3, affected by variant
> rs223913170.
>
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
> chr5 38300289 rs223913170 TG T 5755.73 . . .
>
> The correct sequence is:
> MPGGPGAPSSPAASSGSSRAAPSGIAACPLSPPPLARGSPQASGPRRGASVPQKLAETLSSQYGLNVFVA
> GLLFLLAWAVHATGVGKSDLLCVLTALMLLQLLWMLWYVGRSYMQRRLIRPKDTHAGARWLRGSITLFAF
> ITVVLGCLKVAYFIGFSECLSATEGVFPVTHAVHTLLQVYFLWGHAKDIIMSFKTLERFGVIHSVFTNLL
> LWANSVLNESKHQLNEHKERLITLGFGNITIVLDDHTPQCNCTPPALCSALSHGIYYLYPFNIEYQILAS
> TMLYVLWKNIGRRVDSSQHQKMQCRFDGVLVGSVLGLTVLAATIAVVVVYMIHIGRSKSKSESALIMFYL
> YAITVLLLMGAAGLVGSWIYRVDEKSLDESKNPARKLDVDLLVATGSGSWLLSWGSILAIACAETRPPYT
> WYNLPYSVLVIVEKYVQNIFIIESVHLEPEGVPEDVRTLRVVTVCSSEAAALAASTLGSQGMAQDGSPAV
> NGNLCLQQRCGKEDQESGWEGATGTTRCLDFLQGGMKRRLLRNITAFLFLCNISLWIPPAFGCRPEYDNG
> LEEIVFGFEPWIIVVNLAMPFSIFYRMHAAAALFEVYCKI
>
> while VEP returns (differences in lower case):
> MPGGPGAPSSPAASSGSSRAAPSGIAACPLSPPPLARGSPQASGPRRGASVPQKLAETLSSQYGLNVFVA
> GLLFLLAWAVHATGVGKSDLLCVLTALMLLQLLWMLWYVGRSYMQRRLIRPKDTHAGARWLRGSITLFAF
> ITVVLGCLKVAYFIGFSECLSATEGVFPVTHAVHTLLQVYFLWGHAKDIIMSFKTLERFGVIHSVFTNLL
> LWANSVLNESKHQLNEHKERLITLGFGNITIVLDDHTPQCNCTPPALCSALSHGIYYLYPFNIEYQILAS
> TMLYVLWKNIGRRVDSSQHQKMQCRFDGVLVGSVLGLTVLAATIAVVVVYMIHIGRSKSKSESALIMFYL
> YAITVLLLMGAAGLVGSWIYRVDEKSLDESKNPARKLDVDLLVATGSGSWLLSWGSILAIACAETRPPYT
> WYNLPYSVLVIVEKYVQNIFIIESVHLEPEGVPEDVRTLRVVTV lqqrgcrtgcihsrepgdgpgwvtcc
> qwksvsaaevwergpgvwlgrsygdnpmsglpsgrheeeasqkhhglsvslqhlaldspclwlpsrv*qr
> iggnclwl*tldncgqpghalfhflpdarsccpl*gll*dl
>
> This was not the only case I saw, but I didn't gather any other
> examples. It shouldn't be too hard to write a script that finds RefSeq
> transcripts that start at the same index as some Ensembl transcripts
> and compares the AA sequences, but my perl-fu is weak.
>
> What's up with this?
>
> Thank you.
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info:
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>