[ensembl-dev] VEP: reporting HGVS identifiers with RefSeq accessions

Reece Hart reece at harts.net
Wed Feb 15 06:39:09 GMT 2012


Will-

Fast forward 10 hours...
I wrote a reprehensible hack that loops over the otherfeatures NMs and,
for each one, finds the overlapping ENSTs on the same slice. The code is at
http://goo.gl/drlJX . Results look like this (chr Y):

# 141 transcripts
*         NM_006883.2     6473                                                               Y  1  1951     541633     569564  6  5
  NnLCEeS ENST00000334060 ENSG00000185960    CCDS14106.1,CCDS14107.1 NM_006883.2,NM_000451.3 Y  1  1951     541633     569564  6  5
*         NM_018390.3     55344                                                              Y  1  5305     150855     166002  7  6
   n    S ENST00000399012 ENSG00000182378    CCDS14103.1             NM_018390.3             Y  1  5287     150855     166002  8  6
*         NM_001006120.2  378949                                                             Y -1  1881   24026501   24038660 12  1
  NnLC  S ENST00000382673 ENSG00000242389    CCDS35481.1             NM_001006118.2          Y -1  1881   24026501   24062201 12  1

Lines beginning with * are the NMs from otherfeatures. Beneath each are
0 or more overlapping ENSTs. The first part of each ENST line is a 7-character
summary: N=exon number matches, n=cds-trimmed exon numbers match,
L=cds length matches, C=cds sequence matches, E=exon boundaries match,
e=cds-trimmed exon boundaries match, S=strand matches. Columns are
display id, gene_id, ccds, nm, chr, cds start, cds end, transcript
start, end, exon count, cds-trimmed exon count (e.g., cds in second
exon). Not shown are the exon arrays, which you'll get if you run the
script.
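
To make the flag logic concrete, here is a minimal Python sketch of how such a 7-character summary could be computed for one NM/ENST pair. It is not the linked script (which works against the Ensembl API); the Tx record, its field names, and the cds_trimmed and summary helpers are hypothetical, and "cds-trimmed" is assumed to mean exons clipped to the CDS range with wholly non-coding exons dropped.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates


@dataclass
class Tx:
    """Hypothetical transcript record; not an Ensembl API object."""
    strand: int
    exons: List[Exon]
    cds_start: int  # lowest genomic coordinate of the CDS
    cds_end: int    # highest genomic coordinate of the CDS
    cds_seq: str


def cds_trimmed(tx: Tx) -> List[Exon]:
    """Clip each exon to the CDS range; drop exons with no coding bases."""
    trimmed = []
    for start, end in sorted(tx.exons):
        s, e = max(start, tx.cds_start), min(end, tx.cds_end)
        if s <= e:
            trimmed.append((s, e))
    return trimmed


def summary(a: Tx, b: Tx) -> str:
    """Return the 7-character match summary; a blank marks a mismatch."""
    exons_a, exons_b = sorted(a.exons), sorted(b.exons)
    cds_a, cds_b = cds_trimmed(a), cds_trimmed(b)
    checks = [
        ("N", len(exons_a) == len(exons_b)),      # exon number
        ("n", len(cds_a) == len(cds_b)),          # cds-trimmed exon number
        ("L", len(a.cds_seq) == len(b.cds_seq)),  # cds length
        ("C", a.cds_seq == b.cds_seq),            # cds sequence
        ("E", exons_a == exons_b),                # exon boundaries
        ("e", cds_a == cds_b),                    # cds-trimmed exon boundaries
        ("S", a.strand == b.strand),              # strand
    ]
    return "".join(flag if ok else " " for flag, ok in checks)


# Toy coordinates and sequences, invented for illustration only.
nm = Tx(1, [(100, 200), (300, 400), (500, 600)], 150, 550, "ATGAAATAA")
enst = Tx(1, [(50, 80), (100, 200), (300, 410), (500, 600)], 150, 550,
          "ATGAAAAAATAA")

print(repr(summary(nm, nm)))
print(repr(summary(nm, enst)))
```

In this toy pair the ENST has an extra 5' non-coding exon and a longer second exon, so only the cds-trimmed exon count and the strand agree, the same pattern as the second excerpted case.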

The excerpt above shows 3 notable cases:

1) NM_006883.2 matches ENST00000334060 in all respects: exon number,
length, cds, exon structure, etc.
2) NM_018390.3 overlaps ENST00000399012, but is not the same
translation *even though that ENST shows CCDS and NM_018390.3 as
xrefs*.
3) NM_001006120.2 overlaps ENST00000382673 and has an identical
translation *but has a different exon structure*. This is the case I
alluded to in my previous email that might cause a coding variant to
appear as non-coding or vice versa.

Caveat: The code probably contains bugs or abuses of the API.

So, does your comment about using --ccds and --xref_refseq still hold?

-Reece



