[ensembl-dev] Single nucleotide exons
Michael Schuster
michaels at ebi.ac.uk
Thu Feb 24 10:33:03 GMT 2011
Dear Miklos,
Most, if not all, of these single-base-pair exons are not biologically
real. There are, however, some very short exons spanning just 6 bp
that are sufficiently substantiated by cDNAs and ESTs.
A quick analysis of the upcoming human gene set (not publicly
available yet) showed that it contains 83 single-base-pair exons out
of 615132 exons in total. Of those, the Havana group at the WTSI has
manually annotated 41 cases, and all of those exons are either the
first or the last exon in a transcript. Generally, the Havana group
extends transcript annotation only as far as it is supported by
protein, cDNA or EST alignments. In those cases it is clear that the
transcript structure is not complete; nevertheless, there is not
enough support for a longer exon model.
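For anyone who wants to repeat this kind of check on their own exon
dump, here is a minimal Python sketch. It assumes each exon record
carries a transcript ID, a 1-based exon rank, and inclusive start and
end coordinates. The Exon record, the single_base_exons helper and the
example rows are made up for illustration and are not part of any
Ensembl tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Exon(NamedTuple):
    transcript_id: str
    rank: int      # 1-based position of the exon within its transcript
    start: int     # inclusive genomic start
    end: int       # inclusive genomic end


def single_base_exons(exons):
    """Return (exon, is_terminal) pairs for exons whose start equals end."""
    # Find the highest rank per transcript so the last exon can be recognised.
    last_rank = defaultdict(int)
    for exon in exons:
        last_rank[exon.transcript_id] = max(last_rank[exon.transcript_id], exon.rank)

    hits = []
    for exon in exons:
        if exon.start == exon.end:
            is_terminal = exon.rank in (1, last_rank[exon.transcript_id])
            hits.append((exon, is_terminal))
    return hits


# Made-up records for illustration only; not real Ensembl coordinates.
EXAMPLE = [
    Exon("T1", 1, 100, 100),    # single base, first exon
    Exon("T1", 2, 500, 650),
    Exon("T2", 1, 1000, 1200),
    Exon("T2", 2, 2000, 2000),  # single base, internal exon
    Exon("T2", 3, 3000, 3100),
]

for exon, is_terminal in single_base_exons(EXAMPLE):
    position = "first/last" if is_terminal else "internal"
    print(exon.transcript_id, exon.rank, exon.start, position)
```

Single-base exons flagged as first/last are the kind described above
for the manual annotation; internal ones are more likely to point at
an alignment problem.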
This also has implications for the assignment of translations, which
sometimes do not start or end at codon boundaries. I explained this in
more detail in a previous post:
http://lists.ensembl.org/pipermail/dev/2011-January/000765.html
The upcoming gene set also includes 39 such exons that have been
annotated by the automated Ensembl genome analysis and annotation
pipeline. These cases seem to be caused by problematic supporting
evidence (cDNAs or proteins). I'll give two examples for illustration:
Transcript PIK3C2B-201 (ENST00000391949) has a single-base pair exon
and is based on cDNA CR749201.1, which can be seen on the "Supporting
Evidence" panel.
http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000133056;r=1:204391770-204403586;t=ENST00000391949
http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/SupportingEvidence?db=core;g=ENSG00000133056;r=1:204391770-204403586;t=ENST00000391949
http://www.ebi.ac.uk/ena/data/view/CR749201.1
Aligning the cDNA to the genomic sequence with exonerate shows that
the cDNA has an insertion near the splice site of intron 3:
   243 : GCCTCAGGATACAGAGGCCAATGCCACTACCTACTTCACTAG  >>>> Target Intr :   285
         ||||||||||||||||||||||||||||||||||||||||||++     8587 bp
 85315 : GCCTCAGGATACAGAGGCCAATGCCACTACCTACTTCACTAGgt................ : 85271

   286 : on 3 >>>>  GTTGATTCTGGCCTCTGTGGTAGGAGGCAGGGAGAGTAAGACATGCTCT :   333
                  ++| |||||||||||||||||||||||||||||||||||||||||||||||
 85270 : .........agG-TGATTCTGGCCTCTGTGGTAGGAGGCAGGGAGAGTAAGACATGCTCT : 76639
When the corresponding protein Q68E11 is aligned with Genewise, the
algorithm introduces a single-base-pair exon to cope with the
insertion in the cDNA, and that alignment is also the basis for our
transcript model.
Q68E11.1         81 RPQDTEANATTYFT
                    RPQDTEANATTYFT
                    RPQDTEANATTYFT  R:R[aga]
AL606489.26  -85317 accgaggagaattaAGGTAAAAT   Intron 3    TAAA
                    gcaacacacccatc  <2-----[85273 : 80410]-2>
                    gtgtagctctccct

Q68E11.1         96                          LILASVVGGRESKTCSAVSSSG
                                             +ILASVVGGRESKTCSAVSSSG
                                             VILASVVGGRESKTCSAVSSSG
AL606489.26  -80408 GTAACTA   Intron 4    CAGgacgtggggagaaattggtttg
                    <0-----[80408 : 76687]-0>tttccttgggagacgcctcccg
                                             gtgctgaacggtgacttctcta
Because this insertion is right next to a splice site, it is
difficult for alignment algorithms to model, and they sometimes get
it wrong.
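To make the coordinates concrete, the following Python sketch derives
the exon that lies between the two introns reported by Genewise above.
It assumes that the bracketed numbers are the inclusive first and last
bases of each intron; exon_between is a made-up helper name for
illustration.

```python
def exon_between(intron_a, intron_b):
    """Return (start, end, length) of the exon between two consecutive introns.

    Each intron is a (first_base, last_base) pair in transcript order, with
    both boundaries assumed to be inclusive intron positions.
    """
    a_last = intron_a[1]
    b_first = intron_b[0]
    # Coordinates decrease along the transcript on the reverse strand.
    step = 1 if b_first > a_last else -1
    exon_start = a_last + step
    exon_end = b_first - step
    length = abs(exon_end - exon_start) + 1
    return exon_start, exon_end, length


# Intron coordinates as printed in the Genewise output above.
intron_3 = (85273, 80410)
intron_4 = (80408, 76687)
print(exon_between(intron_3, intron_4))  # (80409, 80409, 1)
```

Under that assumption the only base left between intron 3 and intron 4
is position 80409, which is the single-base-pair exon in the
transcript model.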
Transcript MIER1-204 (ENST00000371018) is another case. According to
the supporting evidence panel, it is based on the manually annotated
UniProtKB/Swiss-Prot sequence Q8N108-8.
http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000198160;r=1:67390642-67454302;t=ENST00000371018
http://feb2011.archive.ensembl.org/Homo_sapiens/Transcript/SupportingEvidence?db=core;g=ENSG00000198160;r=1:67390642-67454302;t=ENST00000371018
http://www.uniprot.org/uniprot/Q8N108
Checking the alignments in this particular case shows that neither
algorithm can align the protein to the genomic region successfully.
In fact, the Genome Reference Consortium (GRC) has already flagged the
BAC clone that forms the basis of the genome assembly in this region
as problematic. See GRC case number HG-379.
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-379
Hence it looks as though, in this case, we had good supporting
evidence from a manually curated UniProtKB/Swiss-Prot sequence, but
the underlying genome sequence is problematic, which again caused the
alignment algorithm to insert a single-base-pair exon to cope with
sequence imperfections and mismatches.
The upshot is that we are aware of this and are actively working to
resolve these (few) remaining issues. We do this via the GRC, by
fixing, re-sequencing and updating the genome sequence. You may have
noticed that several patches to the GRCh37 assembly have already gone
in. Another approach we take is reporting problematic cDNAs and
proteins to upstream data resources to help clean up these databases.
I hope this explains a bit of background to these mini-exons and also
mini-introns, for that matter.
Best regards,
Michael Schuster
On 22 Feb 2011, at 10:56, Miklos Cserzo wrote:
>
> Hi Folks,
>
> when I dump the exons of the Human genome via the MART interface in
> number of cases the reported start and end coordinates of the exons
> are identical, i.e. there are single nucleotide exons. Are those
> exons real?
>
> Cheers,
>
> miklos
--
Michael Schuster
Ensembl Genome Browser
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridgeshire CB10 1SD
United Kingdom
http://www.ensembl.org/