[ensembl-dev] Biotypes in GTF release 65

Andy Yates ayates at ebi.ac.uk
Tue Jan 24 17:41:34 GMT 2012


Dear Arno,

The web team has regenerated the GTF dumps and put them on the FTP site. I hope this solves your problem.
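
For anyone wanting to confirm that column 2 of the regenerated file carries per-transcript values again, a check along the following lines may help. This is only a sketch: the helper name `col2_for_gene` is made up for illustration, and it assumes the usual tab-separated GTF layout with a `gene_id "..."` attribute in column 9.

```shell
# Hypothetical helper: tally the column-2 values seen for one gene.
# Reads uncompressed GTF on stdin. Assumes tab-separated fields and a
# gene_id "..." attribute in column 9 (assumptions, not checked here).
col2_for_gene() {
    awk -F'\t' -v id="$1" '
        # Match the quoted gene ID exactly, so "G1" does not match "G10".
        index($9, "gene_id \"" id "\"") { seen[$2]++ }
        END { for (b in seen) printf "%d\t%s\n", seen[b], b }
    ' | sort -k2,2
}

# Usage, with the file name from the original report:
#   zcat Homo_sapiens.GRCh37.65.gtf.gz | col2_for_gene ENSG00000187758
```

If the dump is per-transcript, a gene with several transcript biotypes should yield more than one line of output; a single `protein_coding` line for ENSG00000187758 would suggest the problem is still present.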

Andy

Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensembl.org/

On 23 Jan 2012, at 12:19, Arno Velds wrote:

> Dear EnsDev,
> 
> To my surprise, the composition of the biotype column (col 2) in the Homo sapiens GTF file has changed considerably. Most entries are now listed as protein coding, while previously there were high counts of non-coding types as well. Below is the tail of the most frequent types in each release:
> 
> $ zcat Homo_sapiens.GRCh37.64.gtf.gz |cut -f2 |sort|uniq -c |sort -n|tail -20
>    657 IG_V_gene
>    787 scRNA_pseudogene
>    821 non_coding
>    891 unitary_pseudogene
>    939 sense_intronic
>   1190 misc_RNA
>   1434 polymorphic_pseudogene
>   1523 snoRNA
>   1809 miRNA
>   1951 snRNA
>   2209 pseudogene
>   2775 transcribed_unprocessed_pseudogene
>   9411 unprocessed_pseudogene
>  10675 processed_pseudogene
>  15690 antisense
>  20880 lincRNA
>  85066 retained_intron
> 139916 nonsense_mediated_decay
> 140543 processed_transcript
> 1644144 protein_coding
> 
> $ zcat Homo_sapiens.GRCh37.65.gtf.gz |cut -f2 |sort|uniq -c |sort -n|tail -20
>    179 rRNA_pseudogene
>    190 ncrna_host
>    223 IG_V_pseudogene
>    485 TR_V_gene
>    535 rRNA
>    580 Mt_tRNA_pseudogene
>    657 IG_V_gene
>    700 non_coding
>    787 scRNA_pseudogene
>    987 sense_intronic
>   1190 misc_RNA
>   1523 snoRNA
>   1809 miRNA
>   1951 snRNA
>   2795 polymorphic_pseudogene
>  14201 processed_transcript
>  14518 antisense
>  29715 lincRNA
>  35461 pseudogene
> 2005431 protein_coding
> 
> A perhaps unnecessary example: ENSG00000187758 has 3 transcripts according to the Ensembl website, but only 1 is coding. This is nicely represented in the 64 GTF:
> $ zgrep ENSG00000187758 Homo_sapiens.GRCh37.64.gtf.gz |cut -f 2 |sort|uniq
> processed_transcript
> protein_coding
> retained_intron
> 
> The 65 GTF has only 1:
> $ zgrep ENSG00000187758 Homo_sapiens.GRCh37.65.gtf.gz |cut -f 2 |sort|uniq
> protein_coding
> 
> Does column 2 now represent the gene biotype instead of the transcript biotype? Is this a mistake?
> 
> Thanks for any insights!
> 
> Arno Velds
> The Netherlands
> 
> 
> 
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/




