[ensembl-dev] Biotypes in GTF release 65
Arno Velds
a.velds at nki.nl
Mon Jan 23 12:19:21 GMT 2012
Dear EnsDev,
To my surprise, the composition of the biotype column (column 2) in the Homo
sapiens GTF file has changed considerably between releases 64 and 65. Most
entries are now listed as protein_coding, while previously there were high
counts of other, non-coding types as well. Below is the tail of the most
frequent types in each release:
$ zcat Homo_sapiens.GRCh37.64.gtf.gz | cut -f2 | sort | uniq -c | sort -n | tail -20
657 IG_V_gene
787 scRNA_pseudogene
821 non_coding
891 unitary_pseudogene
939 sense_intronic
1190 misc_RNA
1434 polymorphic_pseudogene
1523 snoRNA
1809 miRNA
1951 snRNA
2209 pseudogene
2775 transcribed_unprocessed_pseudogene
9411 unprocessed_pseudogene
10675 processed_pseudogene
15690 antisense
20880 lincRNA
85066 retained_intron
139916 nonsense_mediated_decay
140543 processed_transcript
1644144 protein_coding
$ zcat Homo_sapiens.GRCh37.65.gtf.gz | cut -f2 | sort | uniq -c | sort -n | tail -20
179 rRNA_pseudogene
190 ncrna_host
223 IG_V_pseudogene
485 TR_V_gene
535 rRNA
580 Mt_tRNA_pseudogene
657 IG_V_gene
700 non_coding
787 scRNA_pseudogene
987 sense_intronic
1190 misc_RNA
1523 snoRNA
1809 miRNA
1951 snRNA
2795 polymorphic_pseudogene
14201 processed_transcript
14518 antisense
29715 lincRNA
35461 pseudogene
2005431 protein_coding
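For anyone who wants to line the two tallies up per biotype rather than compare the tails by eye, here is a rough sketch. The function names biotype_counts and compare_biotypes are made up for illustration, and it assumes both files are gzip-compressed, tab-separated GTFs without header lines:

```shell
# Tally column 2 of a gzip-compressed GTF as "biotype<TAB>count", sorted by name.
biotype_counts() {
    gzip -dc < "$1" | cut -f2 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Print one line per biotype: name, count in the first file, count in the
# second file. A biotype missing from one file gets a count of 0.
compare_biotypes() {
    old=$(mktemp) && new=$(mktemp) || return 1
    biotype_counts "$1" > "$old"
    biotype_counts "$2" > "$new"
    join -t "$(printf '\t')" -a 1 -a 2 -e 0 -o 0,1.2,2.2 "$old" "$new"
    rm -f "$old" "$new"
}
```

It would be called as compare_biotypes Homo_sapiens.GRCh37.64.gtf.gz Homo_sapiens.GRCh37.65.gtf.gz, which makes types that disappeared entirely (such as retained_intron) show up with a 0 in the last column.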
A perhaps unnecessary example: ENSG00000187758 has 3 transcripts
according to the Ensembl website, only 1 of which is coding. This is nicely
represented in the release 64 GTF:
$ zgrep ENSG00000187758 Homo_sapiens.GRCh37.64.gtf.gz |cut -f 2 |sort|uniq
processed_transcript
protein_coding
retained_intron
The release 65 GTF has only one:
$ zgrep ENSG00000187758 Homo_sapiens.GRCh37.65.gtf.gz |cut -f 2 |sort|uniq
protein_coding
Does column 2 now represent the gene biotype instead of the transcript
biotype? Or is this a mistake?
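One rough way to test this across the whole file instead of a single gene is to count how many genes carry more than one distinct value in column 2. The sketch below is only that, a sketch: the function name multi_biotype_genes is made up, and it assumes the attribute column (column 9) contains the gene ID in the form gene_id "...":

```shell
# Print two numbers: how many genes have more than one distinct column-2
# value, and how many genes were seen in total.
multi_biotype_genes() {
    # $1: path to a gzip-compressed GTF file
    gzip -dc < "$1" | awk -F'\t' '
        {
            # extract the gene ID from the attribute column (assumed format)
            if (match($9, /gene_id "[^"]+"/)) {
                gene = substr($9, RSTART + 9, RLENGTH - 10)
                key = gene SUBSEP $2
                # count each (gene, column 2) combination only once
                if (!(key in seen)) { seen[key] = 1; n[gene]++ }
            }
        }
        END {
            multi = 0; total = 0
            for (g in n) { total++; if (n[g] > 1) multi++ }
            print multi, total
        }'
}
```

If column 2 holds the gene biotype in release 65, the first number should be 0 for that file, while the release 64 file should give a clearly non-zero count.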
Thanks for any insights!
Arno Velds
The Netherlands