[ensembl-dev] GFF3 peculiarity in Homo_sapiens.GRCh37.82.gff3

Mon Oct 31 10:01:43 GMT 2016

Dear all,

I work in an annotation service where GFF3 is the central format for everything we do, so we have scripts that check GFF3 files and correct them if needed.
While working with the file Homo_sapiens.GRCh37.82.gff3 from your release, I came across some peculiarities that I had never seen before.

First, let me quickly explain how we parse our data:

We parse with close attention to the “type” (3rd column) and sort features into a three-level structure:
Level1 => Features that do not have Parent: gene, pseudogene, lincrna_gene, mirna_gene etc.
Level2 => Features that have Parent and Children: mrna, trna, snorna, transcript, processed_pseudogene, etc.
Level3 => Features that have Parent but no Children: cds, exon, utr, tts, stop_codon, sig_peptide, etc.
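
The three levels described above can be sketched in Python as follows. This is only a minimal illustration, not the scripts mentioned in this message: the function names parse_attributes and assign_levels are hypothetical, and the level of each feature is derived from its ID/Parent links, so that the types found at each level can then be compared with the expected ones.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=gene:X;Parent=gene:Y' into a dict (hypothetical helper)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def assign_levels(gff3_lines):
    """Return {level: set of feature types} based on Parent/ID relationships."""
    records = []               # (type, ID or None, Parent or None) per feature
    referenced_as_parent = set()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue           # skip blank lines, comments and '###' separators
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue           # not a feature line
        attrs = parse_attributes(cols[8])
        records.append((cols[2], attrs.get("ID"), attrs.get("Parent")))
        if "Parent" in attrs:
            referenced_as_parent.update(attrs["Parent"].split(","))

    levels = defaultdict(set)
    for ftype, fid, parent in records:
        if parent is None:
            levels[1].add(ftype)   # no Parent: gene-like feature
        elif fid in referenced_as_parent:
            levels[2].add(ftype)   # has a Parent and at least one child
        else:
            levels[3].add(ftype)   # has a Parent but no children
    return dict(levels)
```

Run on the snRNA example further down, this sketch would place snRNA_gene at level 1, snRNA at level 2 and exon at level 3.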

This works well with GFF3 files from many different sources, but it does not work properly on your data.
The 3rd column is used inconsistently, and we cannot fall back on the biotype attribute in the 9th column either, because it can differ between a gene and its own transcript.
Could you homogenise these things in future data releases?

To illustrate and support the point, here are some examples:
========================================================

************************************************
    FIRST, examples where things are fine:
************************************************
###
1	ensembl	snRNA_gene	13384735	13384841	.	-	.	ID=gene:ENSG00000207511;Name=RNU6-771P;biotype=snRNA;description=RNA%2C U6 small nuclear 771%2C pseudogene [Source:HGNC Symbol%3BAcc:47734];gene_id=ENSG00000207511;logic_name=ncrna;version=1
1	ensembl	snRNA	13384735	13384841	.	-	.	ID=transcript:ENST00000384780;Parent=gene:ENSG00000207511;Name=RNU6-771P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000384780;version=1
1	ensembl	exon	13384735	13384841	.	-	.	Parent=transcript:ENST00000384780;Name=ENSE00001807562;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001807562;rank=1;version=1
###
OK:	Based on 3rd column: snRNA_gene <= snRNA <= exon
OK:	biotype attribute: 		      snRNA <= snRNA

###
1	havana	lincRNA_gene	32814795	32816264	.	-	.	ID=gene:ENSG00000233775;Name=RP4-811H24.9;biotype=lincRNA;gene_id=ENSG00000233775;logic_name=havana;version=1
1	havana	lincRNA	32814795	32816264	.	-	.	ID=transcript:ENST00000448134;Parent=gene:ENSG00000233775;Name=RP4-811H24.9-001;biotype=lincRNA;havana_transcript=OTTHUMT00000020212;havana_version=3;tag=basic;transcript_id=ENST00000448134;version=1
1	havana	exon	32814795	32815422	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001776478;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001776478;rank=2;version=1
1	havana	exon	32816206	32816264	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001624254;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001624254;rank=1;version=1
###
OK:	Based on 3rd column:  lincRNA_gene <= lincRNA <= exon
OK:	biotype attribute: 		       lincRNA <= lincRNA

###
1	ensembl_havana	gene	13474689	13477522	.	-	.	ID=gene:ENSG00000204491;Name=PRAMEF18;biotype=protein_coding;description=PRAME family member 18 [Source:HGNC Symbol%3BAcc:30693];gene_id=ENSG00000204491;logic_name=ensembl_havana_gene;version=2
1	ensembl_havana	transcript	13474689	13477522	.	-	.	ID=transcript:ENST00000376126;Parent=gene:ENSG00000204491;Name=PRAMEF18-001;biotype=protein_coding;ccdsid=CCDS41258.1;havana_transcript=OTTHUMT00000008177;havana_version=2;tag=basic;transcript_id=ENST00000376126;version=2
1	ensembl_havana	exon	13474689	13475262	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001592884;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSE00001592884;rank=3;version=2
1	ensembl_havana	CDS	13474689	13475262	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
1	ensembl_havana	exon	13476271	13476849	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001620306;constitutive=1;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00001620306;rank=2;version=1
1	ensembl_havana	CDS	13476271	13476849	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
1	ensembl_havana	exon	13477236	13477522	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00003445415;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSE00003445415;rank=1;version=1
1	ensembl_havana	CDS	13477236	13477522	.	-	0	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
###
OK:	Based on 3rd column:        gene <= transcript <= exon,CDS
OK:	biotype attribute: protein_coding <= protein_coding

************************************************
    SECOND, 3rd column OK but biotype changes:
************************************************

###
10	ensembl_havana	pseudogene	112696380	112696991	.	-	.	ID=gene:ENSG00000234118;Name=RPL13AP6;biotype=pseudogene;description=ribosomal protein L13a pseudogene 6 [Source:HGNC Symbol%3BAcc:23737];gene_id=ENSG00000234118;logic_name=ensembl_havana_gene;version=1
10	ensembl_havana	processed_pseudogene	112696380	112696991	.	-	.	ID=transcript:ENST00000430133;Parent=gene:ENSG00000234118;Name=RPL13AP6-001;biotype=processed_pseudogene;havana_transcript=OTTHUMT00000050371;havana_version=1;tag=basic;transcript_id=ENST00000430133;version=1
10	ensembl_havana	exon	112696380	112696991	.	-	.	Parent=transcript:ENST00000430133;Name=ENSE00002511651;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002511651;rank=1;version=1
###
OK:	Based on 3rd column : pseudogene <= processed_pseudogene <= exon
??:	biotype attribute : 	     pseudogene <= processed_pseudogene

Question1: Why does the biotype change? Would it not be more coherent to give the pseudogene feature the processed_pseudogene biotype too?
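
Cases like this one can be listed with a short check that compares the biotype attribute of each feature with that of its Parent. The sketch below is illustrative only: the names parse_attributes and biotype_mismatches are hypothetical, and it assumes that every feature carrying a biotype also carries an ID, as in the examples shown here.

```python
def parse_attributes(column9):
    """Turn 'ID=gene:X;biotype=Y' into a dict (hypothetical helper)."""
    return dict(f.split("=", 1) for f in column9.strip().split(";") if "=" in f)


def biotype_mismatches(gff3_lines):
    """Return (parent_id, parent_biotype, child_id, child_biotype) tuples
    for every child whose biotype differs from its parent's biotype."""
    biotype_by_id = {}
    children = []              # (child ID, raw Parent field, child biotype)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue           # skip blank lines, comments and '###' separators
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue           # not a feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "biotype" in attrs:
            biotype_by_id[attrs["ID"]] = attrs["biotype"]
            if "Parent" in attrs:
                children.append((attrs["ID"], attrs["Parent"], attrs["biotype"]))

    mismatches = []
    for child_id, parent_field, child_biotype in children:
        for parent_id in parent_field.split(","):
            parent_biotype = biotype_by_id.get(parent_id)
            # Exons carry no biotype here, so only gene/transcript pairs are compared.
            if parent_biotype is not None and parent_biotype != child_biotype:
                mismatches.append((parent_id, parent_biotype, child_id, child_biotype))
    return mismatches
```

On the RPL13AP6 locus above it would report the pair pseudogene / processed_pseudogene, while the snRNA, lincRNA and protein_coding examples would produce no report.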

************************************************
    THIRD, 3rd column does not change but biotype changes:
************************************************

###
1	havana	pseudogene	176241619	176242538	.	+	.	ID=gene:ENSG00000227815;Name=RP11-195C7.3;biotype=pseudogene;gene_id=ENSG00000227815;logic_name=havana;version=2
1	havana	pseudogene	176241619	176242538	.	+	.	ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;Name=RP11-195C7.3-001;biotype=unprocessed_pseudogene;havana_transcript=OTTHUMT00000084685;havana_version=2;tag=basic;transcript_id=ENST00000440296;version=2
1	havana	exon	176241619	176241675	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001660785;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001660785;rank=1;version=2
1	havana	exon	176241743	176242168	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001739151;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001739151;rank=2;version=2
1	havana	exon	176242227	176242538	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001773509;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001773509;rank=3;version=2
###
??:	Based on 3rd column : pseudogene <= pseudogene <= exon
??:	biotype attribute :	      pseudogene <= unprocessed_pseudogene

Question2: Why is the second feature also typed pseudogene? Would it not be more coherent to use a subclass of pseudogene, such as unprocessed_pseudogene, as you already do for the biotype in this case?
		    As in Question1, why does the biotype change? Would it not be better to have unprocessed_pseudogene for the top feature too?

###
1	havana	pseudogene	13411551	13414482	.	+	.	ID=gene:ENSG00000237700;Name=RP11-219C24.6;biotype=pseudogene;gene_id=ENSG00000237700;logic_name=havana;version=1
1	havana	pseudogene	13411551	13414482	.	+	.	ID=transcript:ENST00000437300;Parent=gene:ENSG00000237700;Name=RP11-219C24.6-001;biotype=unitary_pseudogene;havana_transcript=OTTHUMT00000022042;havana_version=1;tag=basic;transcript_id=ENST00000437300;version=1
1	havana	exon	13411551	13411837	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001677077;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001677077;rank=1;version=1
1	havana	exon	13412234	13412812	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001715540;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001715540;rank=2;version=1
1	havana	exon	13413924	13414482	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001784031;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001784031;rank=3;version=1
###
??:	Based on 3rd column : pseudogene <= pseudogene <= exon
??:	biotype attribute :           pseudogene <= unitary_pseudogene

The same questions as in Question2 apply here.

*************************************************
    FOURTH, same feature used differently :
*************************************************

### Locus1
1	havana	processed_transcript	141474	173862	.	-	.	ID=gene:ENSG00000241860;Name=RP11-34P13.13;biotype=processed_transcript;gene_id=ENSG00000241860;logic_name=havana;version=2
1	havana	transcript	141474	149707	.	-	.	ID=transcript:ENST00000484859;Parent=gene:ENSG00000241860;Name=RP11-34P13.13-004;biotype=antisense;havana_transcript=OTTHUMT00000007035;havana_version=1;tag=basic;transcript_id=ENST00000484859;version=1
1	havana	exon	141474	143011	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001911218;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001911218;rank=2;version=1
1	havana	exon	146386	149707	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001860404;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001860404;rank=1;version=1
### Locus2
1	ensembl_havana	pseudogene	11869	14412	.	+	.	ID=gene:ENSG00000223972;Name=DDX11L1;biotype=pseudogene;description=DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:37102];gene_id=ENSG00000223972;logic_name=ensembl_havana_gene;version=4
1	ensembl_havana	processed_transcript	11869	14409	.	+	.	ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-002;biotype=processed_transcript;havana_transcript=OTTHUMT00000362751;havana_version=1;tag=basic;transcript_id=ENST00000456328;version=2
1	havana	exon	11869	12227	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
1	havana	exon	12613	12721	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
1	havana	exon	13221	14409	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
###
?? => In the first locus, processed_transcript is a top-level feature (with a transcript child), whereas in the second locus processed_transcript is a child of the pseudogene feature.

Question3: Why does the same feature type (processed_transcript) not follow the same schema? In my view, it should always be either a top-level feature or a child of a top-level feature, not both.
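
Feature types that occur at more than one level can be found with a check along the following lines. This is a minimal sketch under the same assumptions as the three-level scheme described at the top of this message; the function name types_at_multiple_levels is hypothetical.

```python
from collections import defaultdict


def types_at_multiple_levels(gff3_lines):
    """Return {feature type: sorted list of levels} for types seen at 2+ levels."""
    records = []               # (type, ID or None, Parent or None) per feature
    referenced_as_parent = set()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue           # skip blank lines, comments and '###' separators
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue           # not a feature line
        attrs = dict(f.split("=", 1) for f in cols[8].strip().split(";") if "=" in f)
        records.append((cols[2], attrs.get("ID"), attrs.get("Parent")))
        if "Parent" in attrs:
            referenced_as_parent.update(attrs["Parent"].split(","))

    levels_by_type = defaultdict(set)
    for ftype, fid, parent in records:
        if parent is None:
            level = 1          # no Parent
        elif fid in referenced_as_parent:
            level = 2          # has a Parent and at least one child
        else:
            level = 3          # has a Parent but no children
        levels_by_type[ftype].add(level)
    return {t: sorted(lv) for t, lv in levels_by_type.items() if len(lv) > 1}
```

On the two loci above it would flag processed_transcript at levels 1 and 2. The same check would also flag pseudogene in the THIRD group of examples, where it is used as both the gene type and the transcript type.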

********************************
    FIFTH, semantic choice :
********************************

###
1	ensembl	RNA	1340841	1341132	.	-	.	ID=gene:ENSG00000264293;Name=RN7SL657P;biotype=misc_RNA;description=RNA%2C 7SL%2C cytoplasmic 657%2C pseudogene [Source:HGNC Symbol%3BAcc:46673];gene_id=ENSG00000264293;logic_name=ncrna;version=1
1	ensembl	transcript	1340841	1341132	.	-	.	ID=transcript:ENST00000582431;Parent=gene:ENSG00000264293;Name=RN7SL657P-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000582431;version=1
1	ensembl	exon	1340841	1341132	.	-	.	Parent=transcript:ENST00000582431;Name=ENSE00002720632;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002720632;rank=1;version=1
###
??: Just a semantic choice, but an RNA that has a transcript as its child is a bit strange.

Remark: It would be great to replace transcript with RNA here and to use something like RNA_gene as the top-level feature, as you do for all other RNA feature types.

So I would like some clarification about the choices you made when creating your GFF3 annotation file. I would also like to know whether you could make it more consistent; I think that would make things easier for everybody.
Thanks in advance,

Best regards,

Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service

Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25