[ensembl-dev] Biotypes in GTF release 65

Arno Velds a.velds at nki.nl
Mon Jan 23 12:19:21 GMT 2012


Dear EnsDev,

To my surprise, the composition of the biotype column (col 2) in the Homo 
sapiens GTF file has changed considerably between releases 64 and 65. Most 
entries are now listed as protein_coding, whereas previously there were 
high counts of other non-coding types as well. Below is the tail of the 
most frequent types in each release:

$ zcat Homo_sapiens.GRCh37.64.gtf.gz | cut -f2 | sort | uniq -c | sort -n | tail -20
     657 IG_V_gene
     787 scRNA_pseudogene
     821 non_coding
     891 unitary_pseudogene
     939 sense_intronic
    1190 misc_RNA
    1434 polymorphic_pseudogene
    1523 snoRNA
    1809 miRNA
    1951 snRNA
    2209 pseudogene
    2775 transcribed_unprocessed_pseudogene
    9411 unprocessed_pseudogene
   10675 processed_pseudogene
   15690 antisense
   20880 lincRNA
   85066 retained_intron
  139916 nonsense_mediated_decay
  140543 processed_transcript
1644144 protein_coding

$ zcat Homo_sapiens.GRCh37.65.gtf.gz | cut -f2 | sort | uniq -c | sort -n | tail -20
     179 rRNA_pseudogene
     190 ncrna_host
     223 IG_V_pseudogene
     485 TR_V_gene
     535 rRNA
     580 Mt_tRNA_pseudogene
     657 IG_V_gene
     700 non_coding
     787 scRNA_pseudogene
     987 sense_intronic
    1190 misc_RNA
    1523 snoRNA
    1809 miRNA
    1951 snRNA
    2795 polymorphic_pseudogene
   14201 processed_transcript
   14518 antisense
   29715 lincRNA
   35461 pseudogene
2005431 protein_coding

A perhaps unnecessary example: ENSG00000187758 has 3 transcripts 
according to the Ensembl website, but only 1 is coding. This is nicely 
represented in the 64 GTF:
$ zgrep ENSG00000187758 Homo_sapiens.GRCh37.64.gtf.gz |cut -f 2 |sort|uniq
processed_transcript
protein_coding
retained_intron

The 65 GTF has only 1:
$ zgrep ENSG00000187758 Homo_sapiens.GRCh37.65.gtf.gz |cut -f 2 |sort|uniq
protein_coding
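
To check whether this holds beyond a single gene, something like the 
following could count the genes whose lines carry more than one distinct 
column 2 value. This is only a sketch: the function name 
count_multi_biotype_genes is made up for illustration, and it assumes the 
gene ID appears in column 9 as gene_id "..." as it does in these GTF files.

```shell
# Hypothetical helper: count genes whose GTF lines carry more than one
# distinct column-2 value. Reads uncompressed GTF on stdin.
# Assumption: column 9 contains an attribute of the form gene_id "ID".
count_multi_biotype_genes() {
    awk -F'\t' '
        /^#/ { next }                        # skip comment/header lines
        match($9, /gene_id "[^"]+"/) {
            # drop the 9-character prefix (gene_id + space + quote) and the closing quote
            gene = substr($9, RSTART + 9, RLENGTH - 10)
            key = gene SUBSEP $2
            if (!(key in seen)) {            # first time this gene/column-2 pair is seen
                seen[key] = 1
                n[gene]++
            }
        }
        END {
            multi = 0
            for (g in n) if (n[g] > 1) multi++
            print multi
        }
    '
}
```

Usage would be along the lines of 
zcat Homo_sapiens.GRCh37.64.gtf.gz | count_multi_biotype_genes, and the 
same for the 65 file. If column 2 in release 65 really follows the gene 
rather than the transcript, the count for that file should be 0.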

Does column 2 now represent the gene biotype instead of the transcript 
biotype? Is this a mistake?

Thanks for any insights!

Arno Velds
The Netherlands







