[ensembl-dev] exon coordinate discrepancy between NCBI and Ensembl
sf7 at sanger.ac.uk
Tue May 24 10:13:48 BST 2011
I'm glad you found the supporting evidence view useful. As Paul said,
during the genebuild process many pieces of evidence are aligned to the
genome and used to build transcripts. Not all of the evidence will
exactly match the whole structure that is included in the final set of
annotation. This is one of the reasons that we provide the view. It also
allows you to see the evidence that does support the UTRs, which would
appear to include NM_001127222.1. UTRs are not explicitly marked in the
evidence (i.e. no clear box). The view shows which parts of the
transcript are supported by certain pieces of evidence.
NM_023035.2 is associated with the transcript in addition to, not instead
of, NM_001127222.1. Both are listed as external identifiers along with
other accessions and links to many other databases. The full list is
For further information on the views available in Ensembl you might like
to look at the information here:
Helpdesk are happy to assist with questions regarding the website and
can be reached from here:
For more information about CCDS, please see:
Additional documentation about the genebuild process can also be found
on the website.
Andrea Edwards wrote:
> This isn't my question on the list but I was reading it and found it
> interesting. I hadn't seen that supporting evidence view before and
> found it very useful. However it does look from that view that
> NM_001127222.1 <http://www.ncbi.nlm.nih.gov/nuccore/NM_001127222.1>
> doesn't have the same 5' and 3' UTRs as the ensembl transcript - from
> the view it looks as if all of the 2 terminal exons of NM... are
> translated. How can there be a CCDS based on these 2 transcripts as they
> are not 'consistently annotated' (if I am reading the diagram correctly
> that is)? Also, why would the ensembl transcript be associated with
> NM_023035.2 rather than NM_001127222.1
> <http://www.ncbi.nlm.nih.gov/nuccore/NM_001127222.1> which it is linked
> to via CCDS. Is it purely because overall it has the highest sequence
> similarity to that transcript?
> thank you very much
> On 23/05/11 17:47, Susan Fairley wrote:
>> Hi Reece,
>> Looking at the Ensembl transcript, ENST00000360228, here:
>> I noticed that it is one of two CCDS transcripts for the gene.
>> Checking the exon coordinates at NCBI for the CCDS, the structures
>> seem to be the same.
>> Looking further at the Ensembl transcript, it is possible to see that
>> NM_023035.2 is one of the pieces of evidence that were aligned to the
>> genome and used to construct the transcript ENST00000360228 during the
>> Ensembl genebuild process (with the three points where the evidence
>> extends beyond the structure highlighted).
>> It is at the end of the genebuild process that external identifiers
>> are associated with genes, on the basis of sequence similarity. It
>> would be at this stage that the ID NM_023035.2 became associated with
>> the transcript (ENST00000360228) that it contributed to building. As I
>> understand things, it is not necessary for there to be an exact match
>> for an external ID to be associated with an Ensembl transcript.
>> You note that both Ensembl and NCBI map rs58729888 to the same genomic
>> position. As the two transcript structures you are looking at differ,
>> the position of rs58729888 also differs between the two transcripts
>> when viewed at the transcript level, although it is the same genomic
>> position.
>> I'm not sure that this directly answers your question but I hope it
>> may be of some assistance.
>> Kind regards,
>> Reece Hart wrote:
>>> Dear devs-
>>> NCBI and Ensembl return different genomic exon coordinates for
>>> NM_023035.2. These differences lead to discrepancies when mapping
>>> variants in my own code and at the Ensembl and NCBI web sites. I'd
>>> appreciate some help understanding the origin of these differences.
>>> The following is a diff of exon start,stop,length between e61 and NCBI.
>>> < Ensembl 61 (NM_023035.2; ENST00000360228)
>>> > NCBI (NM_023035.2)
>>> < 13441058 13441147 90
>>> > 13441058 13441150 93
>>> < 13414360 13414427 68
>>> > 13414351 13414427 77
>>> > 13352335 13352340 6
>>> e61 and e62 give identical results for this transcript. There is a
>>> net loss of 12 nt in two exons, and the complete absence of the
>>> terminal exon.
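The arithmetic behind that summary can be checked with a short Python sketch. The coordinates are copied from the diff above; the variable and function names are illustrative only, and exon lengths are assumed to be inclusive (stop - start + 1), which matches the lengths given in the diff.

```python
# Exon (start, stop) pairs copied from the diff above.
# Only the exons that differ between the two sources are listed.
ensembl_exons = [(13441058, 13441147), (13414360, 13414427)]
ncbi_exons = [(13441058, 13441150), (13414351, 13414427), (13352335, 13352340)]


def exon_length(exon):
    """Length of an exon given inclusive (start, stop) coordinates."""
    start, stop = exon
    return stop - start + 1


# Pair the exons in the order they appear in the diff and sum the
# per-exon length differences (NCBI minus Ensembl).
shared_deficit = sum(
    exon_length(ncbi) - exon_length(ens)
    for ens, ncbi in zip(ensembl_exons, ncbi_exons)
)

# Any NCBI exon without an Ensembl counterpart in the diff.
unmatched_lengths = [exon_length(e) for e in ncbi_exons[len(ensembl_exons):]]

print(shared_deficit)     # 12 (3 nt + 9 nt)
print(unmatched_lengths)  # [6] (the terminal exon)
```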
>>> This discrepancy between Ensembl and NCBI is also apparent in
>>> differences at the Ensembl and NCBI web sites. For example, both
>>> concur that rs58729888 is located at chr19:g.13368278, but NCBI maps
>>> it to NM_023035.2:r.4724, NP_075461.2:p.1496V>V whereas Ensembl
>>> 62 maps it to ENST00000360228:r.4712, p.1492. ENST..228 is the
>>> transcript retrieved from Ensembl using NM_023035.2 as an external
>>> reference, so I presume that they're intended to be identical. Note
>>> the mapping difference of 12nt is the same as the sum of the length
>>> differences in the exon diffs.
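A minimal sketch of why one genomic position can yield two transcript positions that differ by 12 nt. It assumes the transcript lies on the reverse strand, as the descending coordinates in the diff suggest. The two differing exons are taken from the diff; the exon containing the SNP is hypothetical (its bounds are made up), and `transcript_position` is an illustrative helper, not part of any Ensembl or NCBI API. Only the difference between the two results is meaningful, not the absolute positions.

```python
def transcript_position(exons, genomic_pos):
    """1-based transcript position of genomic_pos on a reverse-strand transcript.

    exons: inclusive (start, stop) genomic coordinates in transcript order,
    i.e. descending genomic order for the reverse strand.
    """
    upstream = 0  # exonic bases already passed, 5' of the current exon
    for start, stop in exons:
        if start <= genomic_pos <= stop:
            # On the reverse strand the exon is read from stop down to start.
            return upstream + (stop - genomic_pos) + 1
        upstream += stop - start + 1
    raise ValueError("position does not fall in any exon")


SNP = 13368278

# Hypothetical exon containing the SNP; these bounds are NOT real annotation.
snp_exon = (13368200, 13368300)

# Toy transcripts: the two differing exons from the diff plus the made-up exon.
ensembl_like = [(13441058, 13441147), (13414360, 13414427), snp_exon]
ncbi_like = [(13441058, 13441150), (13414351, 13414427), snp_exon]

shift = transcript_position(ncbi_like, SNP) - transcript_position(ensembl_like, SNP)
print(shift)  # 12, matching r.4724 - r.4712
```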
>>> Thanks for any help in understanding the origin of this difference
>>> between Ensembl and NCBI.
>>> The code I used to extract exon coordinates from NCBI and Ensembl
>>> is attached; if the attachments fail, it's also at
>>> http://pastebin.com/Vuf55x2t and http://pastebin.com/G9sqgZqg.
>>>  http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs58729888
>>> Dev mailing list Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> Ensembl Blog: http://www.ensembl.info/