[ensembl-dev] VEP: reporting HGVS identifiers with RefSeq accessions

Wed Feb 15 13:53:47 GMT 2012

Hi Reece

CCDS matches only the CDS structure, not the complete transcript
structure. Your case 2 below does in fact match in both CDS sequence
and CDS structure. Your script contains two bugs that stop it from
identifying the CDS as matching:

1) You compare the complete transcript sequence ($t->seq->seq) rather
than the CDS sequence ($t->translateable_seq).
2) Your CDS exon structure comparison does not trim the exons at both
ends of the transcript to the CDS start and end; only one end is
trimmed, depending on strand. You can use
$t->get_all_translateable_Exons to get the trimmed exons, so you don't
need to do the trimming yourself.
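To illustrate the second point, here is a minimal sketch of trimming at
both ends, written in plain Python rather than against the Ensembl API.
It assumes exons are (start, end) tuples in 1-based inclusive genomic
coordinates with start <= end on either strand; the function names and
the coordinates in the example are hypothetical. Because the clipping is
done in genomic coordinates, both ends are trimmed the same way and the
strand never enters into it.

```python
def trim_exons_to_cds(exons, cds_start, cds_end):
    """Clip exons to the genomic CDS interval [cds_start, cds_end].

    exons: iterable of (start, end) tuples, 1-based inclusive, start <= end.
    Exons lying entirely in a UTR are dropped; exons that straddle the
    CDS start or end are clipped at that boundary.
    """
    trimmed = []
    for start, end in sorted(exons):
        if end < cds_start or start > cds_end:
            continue  # exon is entirely 5' or 3' UTR
        trimmed.append((max(start, cds_start), min(end, cds_end)))
    return trimmed


def cds_structure_matches(exons_a, cds_a, exons_b, cds_b):
    """True if two transcripts have identical CDS-trimmed exon boundaries."""
    return trim_exons_to_cds(exons_a, *cds_a) == trim_exons_to_cds(exons_b, *cds_b)


# Hypothetical coordinates: B has an extra 5' UTR exon and a longer 3' UTR,
# so the full exon lists differ while the CDS-trimmed structures agree.
exons_a = [(100, 200), (300, 400), (500, 600)]
exons_b = [(50, 80), (120, 200), (300, 400), (500, 700)]
cds = (150, 550)

trimmed_a = trim_exons_to_cds(exons_a, *cds)
trimmed_b = trim_exons_to_cds(exons_b, *cds)
same_cds_structure = cds_structure_matches(exons_a, cds, exons_b, cds)
```

Here the two transcripts differ in exon count (3 versus 4) but have the
same CDS-trimmed exons, which is the situation where a comparison that
trims only one end would wrongly report a mismatch.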

Regards

Steve

On 15 Feb 2012, at 06:39, Reece Hart wrote:

> Will-
>
> Fast Forward 10 hours...
> I wrote a reprehensible hack to loop over otherfeatures NMs, then find
> overlapping ENSTs on the same slice. The code is at
> http://goo.gl/drlJX . Results look like this (chr Y):
>
> # 141 transcripts
> *         NM_006883.2     6473   Y  1  1951     541633     569564  6  5
>   NnLCEeS ENST00000334060 ENSG00000185960    CCDS14106.1,CCDS14107.1 NM_006883.2,NM_000451.3   Y  1  1951     541633     569564  6  5
> *         NM_018390.3     55344   Y  1  5305     150855     166002  7  6
>    n    S ENST00000399012 ENSG00000182378    CCDS14103.1 NM_018390.3       Y  1  5287     150855     166002  8  6
> *         NM_001006120.2  378949   Y -1  1881   24026501   24038660 12  1
>   NnLC  S ENST00000382673 ENSG00000242389    CCDS35481.1 NM_001006118.2    Y -1  1881   24026501   24062201 12  1
>
> * lines indicate the NMs from otherfeatures. Beneath that are 0 or
> more overlapping ENSTs. The first part of the line is a 7-character
> summary: N=exon number matches, n=cds-trimmed exon numbers match,
> L=cds length matches, C=cds sequence matches, E=exon boundaries match,
> e=cds-trimmed exon boundaries match, S=strand matches. Columns are
> display id, gene_id, ccds, nm, chr, cds start, cds end, transcript
> start, end, exon count, cds-trimmed exon count (e.g., cds in second
> exon). Not shown are the exon arrays, which you'll get if you run the
> script.
>
> In the above I excerpted 3 prominent cases.
>
> 1) NM_006883.2 matches ENST00000334060 in all respects: exon number,
> length, cds, exon structure, etc.
> 2) NM_018390.3 overlaps ENST00000399012, but is not the same
> translation *even though that ENST shows CCDS and NM_018390.3 as
> xrefs*.
> 3) NM_001006120.2 overlaps ENST00000382673 and has an identical
> translation *but has a different exon structure*. This is the case I
> alluded to in my previous email that might cause a coding variant to
> appear as non-coding or vice versa.
>
> Caveat: The code probably contains bugs or abuses of the API.
>
> So, does your comment about using --ccds and --xref_refseq still hold?
>
> -Reece