[ensembl-dev] exon coordinate discrepancy between NCBI and Ensembl

Paul Flicek flicek at ebi.ac.uk
Wed May 25 10:00:07 BST 2011


There is no such thing as the RefSeq version of the assembly.  The issue is that the RefSeq mRNAs are sequence objects that are independent of the reference human assembly.  

The reason that they are different from the reference assembly comes down to a combination of errors (in either the RefSeq or the assembly) and polymorphisms.  
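
As a rough illustration of the point (a toy Python sketch, not Ensembl or NCBI code), the following splices a transcript out of a genome string from exon coordinates and lists where it disagrees with an independently supplied mRNA sequence. The function names, the coordinate convention (1-based, inclusive) and the toy sequences are all assumptions made for the example. It handles substitutions only; a length difference would need a real alignment.

```python
from typing import List, Tuple

# Complement table for reverse-strand transcripts (toy alphabet only).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def spliced_from_genome(genome: str,
                        exons: List[Tuple[int, int]],
                        strand: int = 1) -> str:
    """Concatenate exon slices taken from the genome.

    Exons are (start, end) pairs, 1-based and inclusive (an assumed
    convention for this sketch). For strand -1 the result is
    reverse-complemented so it reads in transcript orientation.
    """
    seq = "".join(genome[start - 1:end] for start, end in sorted(exons))
    if strand == -1:
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


def mismatches(derived: str, mrna: str) -> List[Tuple[int, str, str]]:
    """List (1-based transcript position, genome base, mRNA base) differences.

    Only substitutions are handled; unequal lengths imply an indel.
    """
    if len(derived) != len(mrna):
        raise ValueError("lengths differ: indel present, alignment needed")
    return [(i + 1, g, r)
            for i, (g, r) in enumerate(zip(derived, mrna))
            if g != r]


# Hypothetical toy data: a 16-base "genome", two exons, and an mRNA
# record that differs from the genome at one base.
toy_genome = "AAACCCGGGTTTACGT"
toy_exons = [(1, 3), (7, 9)]
toy_mrna = "AAAGTG"

derived = spliced_from_genome(toy_genome, toy_exons)
diffs = mismatches(derived, toy_mrna)
```

Any position reported in `diffs` is one where coordinates computed on the genome-derived transcript cannot be assumed to carry the same base as the mRNA record.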

One legitimate question is why it is so important to use the RefSeq sequence specifically when it differs from the reference. Those positions are likely to be enriched for problems, whether errors or polymorphisms, that could affect any diagnostic test.


Paul

On 25 May 2011, at 09:29, Daniel Hughes wrote:

> could you load the ncbi exons and the refseq version of the assembly as an
> alternative assembly?
> 
> dan.
> 
> Daniel S. T. Hughes M.Biochem (Hons; Oxford), Ph.D (Cambridge)
> -------------------------------------------------------------------------------------
> dsth at cantab.net
> dsth at cpan.org
> 
> 
> 2011/5/25 Kiran Mukhyala <mukhyala at gmail.com>
> 
>> 
>> 
>> On Tue, May 24, 2011 at 12:46 PM, Reece Hart <reece at harts.net> wrote:
>> 
>>> Because it's so convenient to code for Ensembl, I'd still like to see if
>>> there's a way to accomplish what I want with Ensembl. The goal is to convert
>>> HGVS variants specified using NCBI accessions between genomic, raw
>>> transcript (i.e., 'r.' variants), CDS, and protein coordinate systems. To
>>> achieve accurate conversion in the general case, it is necessary to have a
>>> single, shared understanding of the exon structure, accurate to nucleotide
>>> level, as implied by the named transcript. Exon-level similarity, even when
>>> the CDS is unchanged, doesn't cut it in this case.
>>> 
>>> Does anyone know whether it would work to load NCBI exons directly into
>>> Ensembl? I'm hoping that populating the transcript, transcript_stable_id,
>>> exon, and exon_transcript tables with original NCBI data would suffice. Is
>>> that too naive?
>>> 
>>> 
>> In order to map genomic to transcript coordinates using the Ensembl API,
>> one requirement is that the transcript be derived from the reference genome.
>> Unfortunately, this is not true for a small percentage of RefSeqs. RefSeq
>> UTRs especially do not match the reference genome well.
>> 
>> What that means is that if you load NCBI exons directly into Ensembl, the
>> API will still construct the transcript sequence from the genome. The
>> genome-derived transcript will then not match the RefSeq sequence, so you
>> will not be able to accurately convert genomic to RefSeq coordinates.
>> 
>> This theoretically should not happen with the CCDS genes but I haven't
>> tested it. By the way, Ensembl does import RefSeq and CCDS genes into the
>> otherfeatures database.
>> 
>> -Kiran
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>> 
>> 
