[ensembl-dev] cDNA and CDS lack of total matching for some genes

Kieron Taylor ktaylor at ebi.ac.uk
Tue Dec 16 16:18:58 GMT 2014


Dear Manuel,

The N's at the start of the CDS are standard procedure for Ensembl. They 
pad the sequence so that translation stays in the correct phase when 
there is ambiguity. The transcript you highlight begins with a phase of 2, 
hence two N's are required to keep the protein codons correct. It has also 
been manually annotated as having an incomplete CDS both upstream and 
downstream, which more or less tells us the same thing.

http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000007350;r=X:154295671-154330350;t=ENST00000426203
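
In case it helps with your script, here is a minimal Python sketch of the padding described above. It is not Ensembl code: the function names are made up for illustration, and it assumes that the padding is at most two leading N's (since phase is 0, 1 or 2) and that stripping them is all that is needed for the comparison against the cDNA.

```python
def strip_phase_padding(cds):
    """Remove up to two leading N's (assumed phase padding).

    Returns the unpadded sequence and the number of N's removed.
    """
    n = 0
    while n < 2 and n < len(cds) and cds[n] in "Nn":
        n += 1
    return cds[n:], n


def cds_matches_cdna(cds, cdna):
    """True if the CDS, once the padding is removed, occurs in the cDNA."""
    core, _ = strip_phase_padding(cds)
    return core.upper() in cdna.upper()


def codons(seq):
    """Split a sequence into consecutive triplets, starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


# First bases of ENST00000426203, copied from the quoted message below.
cdna_start = "AGAGGCACAAAGGAAACTTGCC"
cds_start = "NN" + cdna_start

print(strip_phase_padding(cds_start)[1])    # number of padding N's removed
print(cds_matches_cdna(cds_start, cdna_start))
print(codons(cds_start)[:4])                # reading frame with the padding kept
```

With the two N's kept, the first complete codon is GAG, starting at the third real base, which is the frame the padding is there to preserve. Without them, a naive translation from the first base would be out of frame.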

It might be useful for you to tell us what you're aiming for, as we 
might have better tools for your task.


Regards,


Kieron Taylor
Ensembl Core


On 16/12/2014 13:20, Manuel Tardáguila Sancho wrote:
> Hello Ensembl Dev team,
>
> I am currently working with two files from Ensembl release
> 75; Homo_sapiens.GRCh37.75.cds.all.fa, with all the CDS from the human
> release, and Homo_sapiens.GRCh37.75.cdna.all.fa with all the cDNA.
>
> As part of one of my scripts I was checking that each CDS is contained
> in its cDNA, and it routinely is, except for some genes for which the
> CDS begins with one or two N's (see example below).
>
> Once these N's are removed the CDS matches the cDNA.
>
> All of the CDS that I have checked lack a proper ATG as start codon.
>
> I don't know if these N's are a code to denote transcripts with no
> canonical start codon; I have checked the accompanying README files and
> they don't mention them.
>
> Best,
>
> Manuel Tardaguila
>
>  >ENST00000426203 *cdna:*putative
> chromosome:GRCh37:X:153533396:153539497:1 gene:ENSG00000007350
> gene_biotype:protein_coding transcript_biotype:protein_coding
>
> AGAGGCACAAAGGAAACTTGCCCCGAGTCCACGGTGCTCTGCGGTTAGGAGCTGGCCTCA
> CTGTGCACAGGGGGAGGGGTGCCACCCTACATCATGTAGCAGTTCTTCTGAGATCATGTC
> TGTGCTGTTCTTCTACATCATGAGGTACAAGCAGTCAGATCCAGAGAATCCGGACAACGA
> CCGATTTGTCCTCGCAAAGAGACTGTCGTTTGTGGATGTGGCAACAGGATGGCTCGGACA
> AGGACTGGGAGTTGCATGTGGAATGGCATATACTGGCAAGTACTTCGACAGGGCCAGCTA
> CCGGGTGTTCTGCCTCATGAGTGATGGCGAGTCCTCAGAAGGCTCTGTCTGGGAGGCAAT
> GGCCTTTGCTTCCTACTACAGTCTGGACAATCTTGTGGCAATCTTTGATGTGAACCGCCT
> GGGACACAGTGGTGCATTGCCCGCCGAGCACTGCATAAACATCTATCAGAGGCGCTGCGA
> AGCCTTTGGGTGGAACACTTATGTGGTGGACGGCCGGGACGTGGA
>
>  >ENST00000426203 *cds*:putative
> chromosome:GRCh37:X:153533396:153539497:1 gene:ENSG00000007350
> gene_biotype:protein_coding transcript_biotype:protein_coding
> *NN*AGAGGCACAAAGGAAACTTGCCCCGAGTCCACGGTGCTCTGCGGTTAGGAGCTGGCCT
> CACTGTGCACAGGGGGAGGGGTGCCACCCTACATCATGTAGCAGTTCTTCTGAGATCATG
> TCTGTGCTGTTCTTCTACATCATGAGGTACAAGCAGTCAGATCCAGAGAATCCGGACAAC
> GACCGATTTGTCCTCGCAAAGAGACTGTCGTTTGTGGATGTGGCAACAGGATGGCTCGGA
> CAAGGACTGGGAGTTGCATGTGGAATGGCATATACTGGCAAGTACTTCGACAGGGCCAGC
> TACCGGGTGTTCTGCCTCATGAGTGATGGCGAGTCCTCAGAAGGCTCTGTCTGGGAGGCA
> ATGGCCTTTGCTTCCTACTACAGTCTGGACAATCTTGTGGCAATCTTTGATGTGAACCGC
> CTGGGACACAGTGGTGCATTGCCCGCCGAGCACTGCATAAACATCTATCAGAGGCGCTGC
> GAAGCCTTTGGGTGGAACACTTATGTGGTGGACGGCCGGGACGTGGA
>
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>




