[ensembl-dev] Single nucleotide exons

Thu Feb 24 10:33:03 GMT 2011

Dear Miklos,

Most, if not all, of these single base pair exons are not biologically
real. There are, however, cases of very short exons spanning just
6 bp that are sufficiently substantiated by cDNAs and ESTs.

A quick analysis of the upcoming human gene set (not publicly
available yet) showed that the set contains 83 single base pair exons
out of 615132 total exons. Of those, the Havana group at the WTSI has
manually annotated 41 cases, and all of those exons are either the first
or last exon in a transcript. Generally, the Havana group will extend
transcript annotation for as long as they find support from
protein, cDNA or EST alignments. In those cases it is clear that the
transcript structure is not complete, but there is not enough
support for a longer exon model.
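
As a rough illustration of the kind of check described above (83 of
615132 exons is roughly 0.013%), the following Python sketch flags
exons whose start and end coordinates are identical and reports
whether each one is a terminal (first or last) or internal exon of its
transcript. The function name, the tuple layout and the example rows
are all hypothetical; this is not the analysis code that was actually
used.

```python
def single_bp_exons(exons):
    """Find exons that are exactly one base pair long.

    `exons` is an iterable of (transcript_id, rank, start, end) tuples
    with 1-based inclusive coordinates, so start == end means 1 bp.
    This layout is an assumption made for the example.
    Returns a list of (transcript_id, rank, "terminal" | "internal").
    """
    exons = list(exons)

    # Highest exon rank seen per transcript, i.e. the last exon.
    last_rank = {}
    for transcript_id, rank, _start, _end in exons:
        last_rank[transcript_id] = max(rank, last_rank.get(transcript_id, 0))

    hits = []
    for transcript_id, rank, start, end in exons:
        if start != end:
            continue
        if rank == 1 or rank == last_rank[transcript_id]:
            where = "terminal"
        else:
            where = "internal"
        hits.append((transcript_id, rank, where))
    return hits


# Made-up rows, for illustration only.
rows = [
    ("TX_A", 1, 1000, 1000),  # 1 bp first exon
    ("TX_A", 2, 2000, 2150),
    ("TX_B", 1, 500, 620),
    ("TX_B", 2, 700, 700),    # 1 bp internal exon
    ("TX_B", 3, 900, 1010),
]
print(single_bp_exons(rows))
```

Terminal hits correspond to the incomplete-transcript situation
described above; internal hits are the ones that more likely point to
alignment problems.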

This also has implications for the assignment of translations, which
sometimes do not start or end at codon boundaries. I explained this in
more detail in a previous post:

http://lists.ensembl.org/pipermail/dev/2011-January/000765.html
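
The codon-boundary point comes down to simple arithmetic. A minimal
Python sketch follows, assuming we know the coding length of an
incomplete transcript and how many bases at its start belong to a codon
that began upstream of the annotated model. The function and parameter
names are hypothetical and do not mirror the Ensembl API.

```python
def codon_boundary_report(coding_length, leading_partial=0):
    """Split a coding region into complete codons and leftover bases.

    coding_length   -- number of coding bases in the transcript model
    leading_partial -- bases (0-2) at the start that belong to a codon
                       begun upstream of the model (an assumption of
                       this sketch, not an Ensembl attribute)
    Returns (complete_codons, trailing_partial).
    """
    if not 0 <= leading_partial <= 2:
        raise ValueError("leading_partial must be 0, 1 or 2")
    if coding_length < leading_partial:
        raise ValueError("coding_length is shorter than leading_partial")
    return divmod(coding_length - leading_partial, 3)


# 11 coding bases starting on a codon boundary: 3 codons, 2 bases left over.
print(codon_boundary_report(11))
# 10 coding bases, the first of which completes an upstream codon.
print(codon_boundary_report(10, leading_partial=1))
```

A non-zero leading or trailing remainder means the translation does
not start or end on a codon boundary.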

The upcoming gene set also includes 39 single base pair exons that
have been annotated by the automated Ensembl genome analysis and
annotation pipeline. These cases seem to be caused by problematic
supporting evidence (cDNAs or proteins). I'll give two examples for
illustration:

Transcript PIK3C2B-201 (ENST00000391949) has a single-base pair exon  
and is based on cDNA CR749201.1, which can be seen on the "Supporting  
Evidence" panel.

http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000133056;r=1:204391770-204403586;t=ENST00000391949

http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/SupportingEvidence?db=core;g=ENSG00000133056;r=1:204391770-204403586;t=ENST00000391949

http://www.ebi.ac.uk/ena/data/view/CR749201.1

Aligning the cDNA to the genomic sequence with exonerate shows
that the cDNA has an insertion near the splice site of intron 3:

    243 : GCCTCAGGATACAGAGGCCAATGCCACTACCTACTTCACTAG  >>>> Target Intr :   285
          ||||||||||||||||||||||||||||||||||||||||||++         8587 bp
  85315 : GCCTCAGGATACAGAGGCCAATGCCACTACCTACTTCACTAGgt................ : 85271

    286 : on 3 >>>>  GTTGATTCTGGCCTCTGTGGTAGGAGGCAGGGAGAGTAAGACATGCTCT :   333
                   ++| |||||||||||||||||||||||||||||||||||||||||||||||
  85270 : .........agG-TGATTCTGGCCTCTGTGGTAGGAGGCAGGGAGAGTAAGACATGCTCT : 76639
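
The extra base can also be located programmatically. The following
Python sketch takes the start of the second exonerate alignment block
(cDNA on top, genome below, with "-" marking a gap) and reports the
columns where the cDNA carries a base that the genome lacks. The
function name is hypothetical, and the sketch only inspects two
already-aligned strings; it does not perform any alignment itself.

```python
def query_insertions(query_aln, target_aln, gap="-"):
    """Return (column, base) pairs where the query has a base opposite
    a gap in the target, i.e. an insertion in the query sequence.
    Columns are 0-based offsets into the aligned strings."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have the same length")
    return [
        (column, q)
        for column, (q, t) in enumerate(zip(query_aln, target_aln))
        if t == gap and q != gap
    ]


# First 17 columns after the intron in the alignment above.
cdna   = "GTTGATTCTGGCCTCTG"
genome = "G-TGATTCTGGCCTCTG"
print(query_insertions(cdna, genome))  # [(1, 'T')]
```

The single hit sits one base after the acceptor splice site, which is
the insertion discussed here.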

When the corresponding protein Q68E11 is aligned with Genewise, the
algorithm introduces a single-base pair exon to cope with the
insertion in the cDNA, and that alignment is also the basis for our
transcript model.

Q68E11.1          81 RPQDTEANATTYFT
                      RPQDTEANATTYFT
                      RPQDTEANATTYFT           R:R[aga]
AL606489.26   -85317 accgaggagaattaAGGTAAAAT  Intron 3       TAAA
                      gcaacacacccatc  <2-----[85273  :  80410]-2>
                      gtgtagctctccct

Q68E11.1          96                            LILASVVGGRESKTCSAVSSSG
                                                 +ILASVVGGRESKTCSAVSSSG
                                                 VILASVVGGRESKTCSAVSSSG
AL606489.26   -80408 GTAACTA  Intron 4       CAGgacgtggggagaaattggtttg
                      <0-----[80408  :  76687]-0>tttccttgggagacgcctcccg
                                                 gtgctgaacggtgacttctcta

Because this insertion is right next to a splice site, it can be
problematic for alignment algorithms to model, and they can sometimes
get this wrong.

Transcript MIER1-204 (ENST00000371018) is another case and, according
to the supporting evidence panel, is based on the manually annotated
UniProtKB/Swiss-Prot sequence Q8N108-8.

http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000198160;r=1:67390642-67454302;t=ENST00000371018

http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/SupportingEvidence?db=core;g=ENSG00000198160;r=1:67390642-67454302;t=ENST00000371018

http://www.uniprot.org/uniprot/Q8N108

Checking the alignments in this particular case shows that
neither algorithm can align the protein to the genomic region
successfully. In fact, the Genome Reference Consortium (GRC) has
already flagged the BAC clone that forms the basis of the genome
assembly in this region as problematic. See GRC case number HG-379.

http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-379

Hence it looks like in this case we had good supporting evidence from
a manually curated UniProtKB/Swiss-Prot sequence, but the underlying
genome sequence is problematic, which again caused the alignment
algorithm to insert a single-base pair exon to cope with sequence
imperfections and mismatches.

The upshot is that we are aware of this and that we are actively
working to resolve these (few) remaining issues. We do this via the
GRC, by fixing, re-sequencing and updating the genome sequence. You
may have noticed that several patches to the GRCh37 assembly have
already gone in. Another approach we take is reporting problematic
cDNAs and proteins to upstream data resources to help clean up these
databases.

I hope this explains a bit of background to these mini-exons and also  
mini-introns, for that matter.

Best regards,
Michael Schuster

On 22 Feb 2011, at 10:56, Miklos Cserzo wrote:

>
> Hi Folks,
>
> when I dump the exons of the Human genome via the MART interface in  
> number of cases the reported start and end coordinates of the exons  
> are identical, i.e. there are single nucleotide exons. Are those  
> exons real?
>
> Cheers,
>
> miklos

--
Michael Schuster
Ensembl Genome Browser
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridgeshire CB10 1SD
United Kingdom

http://www.ensembl.org/