[ensembl-dev] VEP: reporting HGVS identifiers with RefSeq accessions
Steve Searle
searle at sanger.ac.uk
Wed Feb 15 13:53:47 GMT 2012
Hi Reece
CCDS only matches CDS structure, not complete transcript structure.
The type 2 case below does in fact match in both CDS sequence and
structure. Your script contains a couple of bugs that prevent it from
identifying the CDS as matching:
1) You compare the complete transcript sequence ($t->seq->seq) rather
than the CDS sequence ($t->translateable_seq).
2) Your CDS exon structure comparison does not trim the exons at both
ends of the transcript to the CDS start and end; only one end is
trimmed, depending on strand. You can use
$t->get_all_translateable_Exons to get the trimmed exons, so you
don't need to do the trimming yourself.
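To illustrate why both ends need trimming, here is a minimal sketch in plain Python. It does not use the Ensembl API: the function names, the tuple representation of exons, and all coordinates are hypothetical, chosen only for illustration. It clips exons to the CDS in genomic coordinates, so the result does not depend on strand, and then compares the trimmed structures. Sequence comparison is not covered.

```python
from typing import List, Tuple

# An exon as (start, end): 1-based, inclusive genomic coordinates (assumed
# representation for this sketch, not an Ensembl object).
Exon = Tuple[int, int]


def trim_exons_to_cds(exons: List[Exon], cds_start: int, cds_end: int) -> List[Exon]:
    """Clip exons to [cds_start, cds_end] at BOTH ends; drop exons lying wholly in UTR."""
    trimmed = []
    for start, end in sorted(exons):
        if end < cds_start or start > cds_end:
            continue  # exon is entirely outside the CDS
        trimmed.append((max(start, cds_start), min(end, cds_end)))
    return trimmed


def same_cds_structure(exons_a: List[Exon], cds_a: Tuple[int, int],
                       exons_b: List[Exon], cds_b: Tuple[int, int]) -> bool:
    """True if the two transcripts have identical CDS-trimmed exon boundaries."""
    return trim_exons_to_cds(exons_a, *cds_a) == trim_exons_to_cds(exons_b, *cds_b)


# Hypothetical transcripts: b has an extra 5' UTR exon and different UTR ends,
# but the same CDS (150..550) as a.
a = [(100, 200), (300, 400), (500, 600)]
b = [(10, 50), (120, 200), (300, 400), (500, 650)]

print(a == b)                                        # full exon structures differ
print(same_cds_structure(a, (150, 550), b, (150, 550)))  # CDS structures compared
```

Working in genomic coordinates with cds_start <= cds_end avoids a separate code path per strand, which is where a one-ended trim is easy to introduce.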
Regards
Steve
On 15 Feb 2012, at 06:39, Reece Hart wrote:
> Will-
>
> Fast Forward 10 hours...
> I wrote a reprehensible hack to loop over otherfeatures NMs, then find
> overlapping ENSTs on the same slice. The code is at
> http://goo.gl/drlJX . Results look like this (chr Y):
>
> # 141 transcripts
> * NM_006883.2 6473 Y 1 1951 541633 569564 6 5
> NnLCEeS ENST00000334060 ENSG00000185960 CCDS14106.1,CCDS14107.1 NM_006883.2,NM_000451.3 Y 1 1951 541633 569564 6 5
> * NM_018390.3 55344 Y 1 5305 150855 166002 7 6
> n S ENST00000399012 ENSG00000182378 CCDS14103.1 NM_018390.3 Y 1 5287 150855 166002 8 6
> * NM_001006120.2 378949 Y -1 1881 24026501 24038660 12 1
> NnLC S ENST00000382673 ENSG00000242389 CCDS35481.1 NM_001006118.2 Y -1 1881 24026501 24062201 12 1
>
> * lines indicate the NMs from otherfeatures. Beneath that are 0 or
> more overlapping ENSTs. The first part of the line is a 7-character
> summary: N=exon number matches, n=cds-trimmed exon numbers match,
> L=cds length matches, C=cds sequence matches, E=exon boundaries match,
> e=cds-trimmed exon boundaries match, S=strand matches. Columns are
> display id, gene_id, ccds, nm, chr, cds start, cds end, transcript
> start, end, exon count, cds-trimmed exon count (e.g., cds in second
> exon). Not shown are the exon arrays, which you'll get if you run the
> script.
>
> In the above I excerpted 3 prominent cases.
>
> 1) NM_006883.2 matches ENST00000334060 in all respects: exon number,
> length, cds, exon structure, etc.
> 2) NM_018390.3 overlaps ENST00000399012, but is not the same
> translation *even though that ENST shows CCDS and NM_018390.3 as
> xrefs*.
> 3) NM_001006120.2 overlaps ENST00000382673 and has an identical
> translation *but has a different exon structure*. This is the case I
> alluded to in my previous email that might cause a coding variant to
> appear as non-coding or vice versa.
>
> Caveat: The code probably contains bugs or abuses of the API.
>
> So, does your comment about using --ccds and --xref_refseq still hold?
>
> -Reece