[ensembl-dev] exon coordinate discrepancy between NCBI and Ensembl

Mon May 23 18:48:15 BST 2011

On Mon, 2011-05-23 at 09:43 -0700, Kiran Mukhyala wrote:
> As far as I understand the Ensembl model, a refseq_dna (an external
> reference) linked to an ensembl transcript, does not mean their exons
> are identical. NCBI's mapping of NM_023035.2 to the genome could be
> different from Ensembl's mapping of ENST00000360228 because the
> mapping methods are different and the sequences could also be
> different.
> 
> Is there any other reason you expect the transcript models to be
> identical?

On Mon, 2011-05-23 at 17:47 +0100, Susan Fairley wrote:
> You note that both Ensembl and NCBI map rs58729888 to the same genomic
> position. As the two transcript structures you are looking at differ, 
> then the positions of rs58729888 in the two transcripts also differ
> when viewed at the transcript level, although it is the same genomic
> location.

I understand the desire to group transcripts based on various criteria,
and such grouping is extremely valuable. There are many criteria on
which we might call transcripts identical, similar, or different, such
as exon structure, processed transcript, or translation product.

There are two reasons that I expected the exon structure to be
*identical* at least over the CDS. First and most important is that exon
structure is at the heart of mapping genome, transcript, and protein
variants consistently within the community. If exon structure isn't
preserved exactly for a given reference sequence, we won't be able to
exchange variants reliably. The second reason is more subjective: as a
design tenet, I believe that objects from a primary database should be
represented as-is in any recapitulation. If I look up NM_023035.2, I
want exactly that; anything else is something else.

Over the weekend, I undertook a much lengthier comparison of the exon
structures in genomic coordinates for ~25K transcripts as fetched from
Ensembl 61 and NCBI by NM accession. I'm still comparing, but here's
what I know so far:

      * Of the 25K, 3190 (~12%) have exon structures that are identical
        between the two sources.
      * Another ~70% appear to differ in the 3' and 5' UTR exons only
        (i.e., no CDS change)
      * I've looked at ~50 of the remaining transcripts and found a
        difference in only one (NM_000256.3), in which a pair of
        3 + 18 nt exons is substituted for a 21 nt exon. (This is in
        addition to NM_023035.2, in the original post.)
      * An Amazon-based instance with Ensembl on EBS storage is >100x
        faster than remote lookups to NCBI (comparing single threaded
        runs in both cases).
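
A minimal sketch of the three-way classification in the list above
(identical, UTR-only difference, CDS difference). This is not the code I
ran; it is an illustration that assumes exons are (start, end) pairs in
1-based inclusive genomic coordinates and that both sources agree on the
genomic CDS bounds. The names clip_to_cds and classify, and all
coordinates in the usage note, are hypothetical.

```python
from typing import List, Tuple

# Assumption: an exon is (start, end), 1-based inclusive, genomic coordinates.
Exon = Tuple[int, int]


def clip_to_cds(exons: List[Exon], cds_start: int, cds_end: int) -> List[Exon]:
    """Return the parts of each exon that fall inside [cds_start, cds_end]."""
    clipped = []
    for start, end in sorted(exons):
        s, e = max(start, cds_start), min(end, cds_end)
        if s <= e:  # keep only exons that overlap the CDS
            clipped.append((s, e))
    return clipped


def classify(exons_a: List[Exon], exons_b: List[Exon],
             cds_start: int, cds_end: int) -> str:
    """Compare two exon structures for the same transcript accession.

    Assumption: both sources share the same genomic CDS bounds.
    """
    if sorted(exons_a) == sorted(exons_b):
        return "identical"
    if clip_to_cds(exons_a, cds_start, cds_end) == clip_to_cds(exons_b, cds_start, cds_end):
        return "utr_only"      # boundaries differ only outside the CDS
    return "cds_differs"       # at least one coding exon boundary differs
```

For example, with made-up coordinates, a 21 nt exon at (300, 320) in one
source and a 3 + 18 nt pair at (300, 302), (310, 327) in the other would
be reported as "cds_differs" whenever those exons lie inside the CDS.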

For the purposes of general and reliable mapping of variants between
genome, transcript, and protein coordinates, exact exon boundaries are
essential. Similar CDS is necessary but not sufficient.
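
To make that concrete, here is a rough sketch of genome-to-transcript
mapping. It assumes a plus-strand transcript and exons given as
(start, end) pairs in 1-based inclusive genomic coordinates; the
function name genomic_to_transcript and the coordinates below are made
up for illustration and do not come from either database.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # assumption: (start, end), 1-based inclusive


def genomic_to_transcript(exons: List[Exon], pos: int) -> Optional[int]:
    """Map a genomic position to a 1-based transcript position.

    Assumes a plus-strand transcript. Returns None if pos is not in any exon.
    """
    offset = 0  # number of transcript bases in the exons already passed
    for start, end in sorted(exons):
        if start <= pos <= end:
            return offset + (pos - start) + 1
        offset += end - start + 1
    return None


# Two hypothetical exon structures for the same accession; they differ
# only in where the first (5' UTR) exon starts.
structure_a = [(100, 199), (300, 399)]
structure_b = [(90, 199), (300, 399)]

pos_a = genomic_to_transcript(structure_a, 350)
pos_b = genomic_to_transcript(structure_b, 350)
```

Here the same genomic position maps to transcript position 151 under
structure_a and 161 under structure_b, so a variant reported in
transcript coordinates against one structure cannot be interpreted
against the other without knowing the exact exon boundaries.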

At this point, I would really like to understand the ways (and
frequencies) in which Ensembl and NCBI transcripts differ. If somebody
has this data, I'd love to see it. If not, I need to generate it.

For the record, this experience has only bolstered my affection for
Ensembl. This is difficult stuff and they do a great job -- responsive
help desk and community, a functional API, and sane schema are all very
much appreciated. It took me ~5 minutes to write the code to fetch exon
data from Ensembl. Thank you!

-Reece