[ensembl-dev] exon coordinate discrepancy between NCBI and Ensembl

Kiran Mukhyala mukhyala at gmail.com
Wed May 25 05:29:09 BST 2011

On Tue, May 24, 2011 at 12:46 PM, Reece Hart <reece at harts.net> wrote:

> Because it's so convenient to code for Ensembl, I'd still like to see if
> there's a way to accomplish what I want with Ensembl. The goal is to convert
> HGVS variants specified using NCBI accessions between genomic, raw
> transcript (i.e., 'r.' variants), CDS, and protein coordinate systems. To
> achieve accurate conversion in the general case, it is necessary to have a
> single, shared understanding of the exon structure, accurate to nucleotide
> level, as implied by the named transcript. Exon-level similarity, even when
> the CDS is unchanged, doesn't cut it in this case.
> Does anyone know whether it would work to load NCBI exons directly into
> Ensembl? I'm hoping that populating the transcript, transcript_stable_id,
> exon, and exon_transcript tables with original NCBI data would suffice. Is
> that too naive?

In order to map genomic to transcript coordinates using the Ensembl API, one
requirement is that the transcript be derived from the reference genome.
Unfortunately, this is not true for a small percentage of RefSeqs. RefSeq
UTRs especially do not match the reference genome well.
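
To illustrate why the requirement matters, here is a minimal Python sketch of exon-based coordinate mapping. It is not the Ensembl API; the function name `genomic_to_transcript` and the toy exon coordinates are made up for illustration, and it assumes a forward-strand transcript with 1-based inclusive coordinates. The mapping uses only exon boundaries, so it is correct only if the transcript is exactly the concatenation of the genomic exons.

```python
from typing import List, Optional, Tuple


def genomic_to_transcript(exons: List[Tuple[int, int]], gpos: int) -> Optional[int]:
    """Map a 1-based genomic position to a 1-based transcript position.

    exons: (start, end) pairs, 1-based inclusive, forward strand,
    listed in transcript order. Returns None for positions outside exons.
    """
    offset = 0  # transcript bases contributed by the exons already passed
    for start, end in exons:
        if start <= gpos <= end:
            return offset + (gpos - start) + 1
        offset += end - start + 1
    return None


# Hypothetical two-exon transcript: 10 bases, then 5 bases.
exons = [(101, 110), (201, 205)]
print(genomic_to_transcript(exons, 201))  # first base of the second exon
print(genomic_to_transcript(exons, 150))  # intronic position
```

If the RefSeq transcript carries an insertion or deletion relative to the genome, every position downstream of it would be shifted relative to what this kind of mapping returns.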

This means that if you load NCBI exons directly into Ensembl, the API will
still construct the transcript sequence from the genome. For those RefSeqs,
the genome-derived transcript will not match the RefSeq sequence, so you will
not be able to convert accurately between genomic and RefSeq coordinates.
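
One way to check whether a given transcript is affected is to splice the exon sequences out of the genome and compare the result with the RefSeq sequence. The following Python sketch shows the idea on toy data. The helper names `splice` and `first_mismatch` and all sequences are hypothetical, and it assumes forward-strand exons with 1-based inclusive coordinates; a real check would fetch the sequences from Ensembl and NCBI.

```python
from typing import List, Optional, Tuple


def splice(genome: str, exons: List[Tuple[int, int]]) -> str:
    """Concatenate exon sequences (1-based inclusive coordinates) from the genome."""
    return "".join(genome[start - 1:end] for start, end in exons)


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Return the 1-based position of the first difference, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b)) + 1
    return None


# Toy data: the "RefSeq" lacks one T that is present in the genome-derived transcript.
genome = "AAACCCGGGTTTAAACCC"
exons = [(1, 6), (10, 15)]
refseq = "AAACCCTTAAA"

derived = splice(genome, exons)
pos = first_mismatch(derived, refseq)
if pos is None:
    print("genome-derived transcript matches the RefSeq sequence")
else:
    print(f"sequences diverge at transcript position {pos}")
```

A transcript that passes this check can be mapped through its exon coordinates alone; one that fails would need an explicit transcript-to-genome alignment to convert coordinates reliably.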

This theoretically should not happen with the CCDS genes but I haven't
tested it. By the way, Ensembl does import RefSeq and CCDS genes into the
otherfeatures database.
