[ensembl-dev] transcript coordinates via REST?

Mon Aug 10 14:30:08 BST 2015

Hi Reece,
> On 7 Aug 2015, at 17:00, Reece Hart <reece at harts.net> wrote:
> 
> Hi Andy-
> 
> Thanks for the reply.
> 
> I'm aware that the annotation is performed on the assembly and I'm a fan. However, in my view there's always an implicit alignment -- even if a trivial one -- whenever one has a change of a coordinate system e.g., from genome to transcript.

It is fair to say that there is an implicit alignment of primary data to the genome in order to construct a transcript. However, please see my later responses, where I elaborate on this.
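To make the "trivial alignment" concrete, here is a minimal sketch of how a transcript position maps to a genomic position when the transcript is simply a set of ungapped exon blocks on the genome. The function name and the exon coordinates are hypothetical and chosen only for illustration; this is not Ensembl code and does not reflect any real transcript.

```python
from typing import List, Tuple


def transcript_to_genome(pos: int, exons: List[Tuple[int, int]], strand: int = 1) -> int:
    """Map a 1-based transcript position to a 1-based genomic position.

    exons: inclusive (start, end) genomic intervals, listed in transcript
           order (5' to 3'), so descending on the genome for strand -1.
    Assumes the transcript matches the genome exactly within each exon,
    i.e. the only "alignment" is the exon structure itself.
    """
    if pos < 1:
        raise ValueError("transcript positions are 1-based")
    remaining = pos
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            # The position falls inside this exon.
            if strand == 1:
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical three-exon transcript on the forward strand.
EXONS = [(100, 149), (200, 229), (300, 399)]

if __name__ == "__main__":
    for p in (1, 50, 51, 81):
        print(p, "->", transcript_to_genome(p, EXONS))
```

A transcript that carries an indel relative to the assembly cannot be described by exon blocks alone, which is the case an explicit alignment file would have to cover.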

> 
> The upshot of this is that GENCODE annotation reflects what occurs in the genome. Providing an alignment file would match the transcript models.
> 
> Is this *always* true? More correctly, my understanding was that the above typical case had exceptions, which caused ENSTs to differ from the underlying genome. Or, perhaps that used to be the case and I'm not up-to-date with modern Ensembl practice?
> 
> Let's use a specific example: ALMS1, ENST00000264448, GRCh37 (for which I have some data handy). This transcript contains both an indel and a substitution with respect to the assembly. The ENST sequence and exon structure (in e-75, at least) differ from the genome. How does this happen? (Apologies, I don't have examples for GRCh38 handy.)

I’ve done a bit of digging into this from my end. For reference, the closest transcript in GRCh38 is ENST00000613296, which differs in length by 3 bp (1 residue). I’ve not found the substitution relative to the assembly, but I believe there is an insertion of CCT in the GRCh38 genome at 73,448,100. The GRCh37 transcript lacked this codon and so lacks the proline at position 525 of the amino acid sequence.
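As a toy illustration of why a 3 bp difference corresponds to exactly one residue, the sketch below inserts a CCT codon in-frame into a made-up coding sequence and translates both versions. The sequence, the insertion offset, and the deliberately tiny codon table are assumptions for the example only; they are not taken from ALMS1.

```python
# Partial codon table covering only the codons used in this toy example.
CODONS = {"ATG": "M", "GCT": "A", "CCT": "P", "GAA": "E", "TAA": "*"}


def translate(cds: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return "".join(CODONS[cds[i:i + 3]] for i in range(0, len(cds), 3))


def insert_at(seq: str, offset: int, inserted: str) -> str:
    """Return seq with `inserted` placed before 0-based index `offset`."""
    return seq[:offset] + inserted + seq[offset:]


# Made-up CDS standing in for the shorter (GRCh37-like) model.
short_cds = "ATGGCTGAATAA"
# In-frame insertion of CCT at a codon boundary (offset 6).
long_cds = insert_at(short_cds, 6, "CCT")

if __name__ == "__main__":
    print(translate(short_cds))
    print(translate(long_cds))
```

Because the insertion is a whole codon on a codon boundary, the downstream reading frame is unchanged and the two proteins differ only by the extra proline.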

As mentioned before, the annotation reflects what occurs in the genome. The introduction of mini-contig KF573641.1 into GRCh38, and its positioning in contig AC074008.5, fixes the issue, as noted in the GRC’s issue tracker [1] under issue HG-324. The issue also cites the alignment of NM_015120.4 as the reason for the patch.

> 
> When bridging to the translational genetics community, it's imperative that we use transcripts that are familiar in both structure and sequence. Since one or both characteristics will differ from a reference genome for some genes, it's unclear to me that we can rely solely on genome-based annotations. In other words, it seems that some alignment file will always be necessary to handle corner cases.
> 
> Do you agree? Am I missing something?

What we are really getting at here is the difference between using the genome as a scaffold on which to build annotation from primary data alignments, and projecting known sequence annotation onto the genome. As you have suggested, there will be cases where the genome does not represent the RefSeq annotation accurately, and so alignment files are necessary. The GENCODE transcripts represent a best attempt at constructing a valid model on the genome, and so transcript sequence and genome are in agreement.

So whilst I agree that mismatches between primary evidence and a particular assembly exist, there is no mismatch to represent between a GENCODE transcript and its genomic location. This means a much cleaner mapping from alignment to downstream effect, as transcripts and variation have been annotated against the same reference. Since read mapping and variant calling are performed without an agreed deterministic protocol, there are advantages to using an annotation set that matches the genome sequence over one with the additional problems you have described in your earlier emails.

Cheers,

Andy

1 - http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/display_issues.cgi?org=human
> 
> Thanks,
> Reece
