[ensembl-dev] GFF3 peculiarity in Homo_sapiens.GRCh37.82.gff3

Anne Lyle annelyle at ebi.ac.uk
Mon Oct 31 10:33:19 GMT 2016

Hi Jacques

Thanks for your input. We’re currently reworking our own parsers and exporters for all the common formats, including GFF3, so we’ll take your comments into consideration.



> On 31 Oct 2016, at 10:01, Jacques Dainat <jacques.dainat at bils.se> wrote:
> Dear all,
> I work in an annotation service where GFF3 is the central format for everything we do, so we have scripts that check GFF3 files and correct them when needed.
> While working with the GFF3 file Homo_sapiens.GRCh37.82.gff3 that you produce, I came across some peculiarities that I had never seen before.
> First, let me quickly explain how we parse our data:
> We pay close attention to the “type” (3rd column) and sort features into a three-level structure:
> Level1 => Features that do not have Parent: gene, pseudogene, lincrna_gene, mirna_gene etc.
> Level2 => Features that have Parent and Children: mrna, trna, snorna, transcript, processed_pseudogene, etc.
> Level3 => Features that have Parent but no Children: cds, exon, utr, tts, stop_codon, sig_peptide, etc.
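
A minimal Python sketch of the type-based three-level sort described above. The names `LEVEL_BY_TYPE` and `level_of` are hypothetical and chosen for illustration; this is not the actual parsing script, and the table holds only the types listed in the message.

```python
# Hypothetical lookup table: lower-cased GFF3 type (column 3) -> level.
# Only the types named in the message are included; a real table would be longer.
LEVEL_BY_TYPE = {
    # Level 1: features without a Parent
    "gene": 1, "pseudogene": 1, "lincrna_gene": 1, "mirna_gene": 1,
    # Level 2: features with a Parent and children
    "mrna": 2, "trna": 2, "snorna": 2, "transcript": 2, "processed_pseudogene": 2,
    # Level 3: features with a Parent but no children
    "cds": 3, "exon": 3, "utr": 3, "tts": 3, "stop_codon": 3, "sig_peptide": 3,
}


def level_of(gff3_line):
    """Return the level (1, 2 or 3) of one GFF3 feature line, or None for an unknown type."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF3 feature line must have 9 tab-separated columns")
    return LEVEL_BY_TYPE.get(columns[2].lower())
```

Because the level comes from the type alone, a transcript row whose type is `pseudogene` would be placed in level 1 next to its own gene, which is the kind of problem reported in the examples that follow.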
> This works well with GFF3 files from many different sources, but it does not work properly with your data.
> The 3rd column is used inconsistently, and we cannot fall back on the biotype attribute of the 9th column either, because it seems to vary within the same gene model (between a gene and its transcript).
> I would like to know whether you could homogenise these things in future data releases.
> To explain the problem and support my point, here are some examples:
> ========================================================
> ************************************************
>     FIRST, examples where things are fine:
> ************************************************
> ###
> 1	ensembl	snRNA_gene	13384735	13384841	.	-	.	ID=gene:ENSG00000207511;Name=RNU6-771P;biotype=snRNA;description=RNA%2C U6 small nuclear 771%2C pseudogene [Source:HGNC Symbol%3BAcc:47734];gene_id=ENSG00000207511;logic_name=ncrna;version=1
> 1	ensembl	snRNA	13384735	13384841	.	-	.	ID=transcript:ENST00000384780;Parent=gene:ENSG00000207511;Name=RNU6-771P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000384780;version=1
> 1	ensembl	exon	13384735	13384841	.	-	.	Parent=transcript:ENST00000384780;Name=ENSE00001807562;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001807562;rank=1;version=1
> ###
> OK:	Based on 3rd column: snRNA_gene <= snRNA <= exon
> OK:	biotype attribute: 		      snRNA <= snRNA
> ###
> 1	havana	lincRNA_gene	32814795	32816264	.	-	.	ID=gene:ENSG00000233775;Name=RP4-811H24.9;biotype=lincRNA;gene_id=ENSG00000233775;logic_name=havana;version=1
> 1	havana	lincRNA	32814795	32816264	.	-	.	ID=transcript:ENST00000448134;Parent=gene:ENSG00000233775;Name=RP4-811H24.9-001;biotype=lincRNA;havana_transcript=OTTHUMT00000020212;havana_version=3;tag=basic;transcript_id=ENST00000448134;version=1
> 1	havana	exon	32814795	32815422	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001776478;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001776478;rank=2;version=1
> 1	havana	exon	32816206	32816264	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001624254;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001624254;rank=1;version=1
> ###
> OK:	Based on 3rd column:  lincRNA_gene <= lincRNA <= exon
> OK:	biotype attribute: 		       lincRNA <= lincRNA
> ###
> 1	ensembl_havana	gene	13474689	13477522	.	-	.	ID=gene:ENSG00000204491;Name=PRAMEF18;biotype=protein_coding;description=PRAME family member 18 [Source:HGNC Symbol%3BAcc:30693];gene_id=ENSG00000204491;logic_name=ensembl_havana_gene;version=2
> 1	ensembl_havana	transcript	13474689	13477522	.	-	.	ID=transcript:ENST00000376126;Parent=gene:ENSG00000204491;Name=PRAMEF18-001;biotype=protein_coding;ccdsid=CCDS41258.1;havana_transcript=OTTHUMT00000008177;havana_version=2;tag=basic;transcript_id=ENST00000376126;version=2
> 1	ensembl_havana	exon	13474689	13475262	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001592884;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSE00001592884;rank=3;version=2
> 1	ensembl_havana	CDS	13474689	13475262	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
> 1	ensembl_havana	exon	13476271	13476849	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001620306;constitutive=1;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00001620306;rank=2;version=1
> 1	ensembl_havana	CDS	13476271	13476849	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
> 1	ensembl_havana	exon	13477236	13477522	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00003445415;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSE00003445415;rank=1;version=1
> 1	ensembl_havana	CDS	13477236	13477522	.	-	0	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
> ###
> OK:	Based on 3rd column:        gene <= transcript <= exon,CDS
> OK:	biotype attribute: protein_coding <= protein_coding
> ************************************************
>     SECOND, 3rd column OK but biotype changes:
> ************************************************
> ###
> 10	ensembl_havana	pseudogene	112696380	112696991	.	-	.	ID=gene:ENSG00000234118;Name=RPL13AP6;biotype=pseudogene;description=ribosomal protein L13a pseudogene 6 [Source:HGNC Symbol%3BAcc:23737];gene_id=ENSG00000234118;logic_name=ensembl_havana_gene;version=1
> 10	ensembl_havana	processed_pseudogene	112696380	112696991	.	-	.	ID=transcript:ENST00000430133;Parent=gene:ENSG00000234118;Name=RPL13AP6-001;biotype=processed_pseudogene;havana_transcript=OTTHUMT00000050371;havana_version=1;tag=basic;transcript_id=ENST00000430133;version=1
> 10	ensembl_havana	exon	112696380	112696991	.	-	.	Parent=transcript:ENST00000430133;Name=ENSE00002511651;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002511651;rank=1;version=1
> ###
> OK:	Based on 3rd column : pseudogene <= processed_pseudogene <= exon
> ??:	biotype attribute : 	     pseudogene <= processed_pseudogene
> Question1: Why does the biotype change? Would it not be more coherent to give the pseudogene feature the processed_pseudogene biotype too?
> ************************************************
>     THIRD, 3rd column does not change but biotype changes:
> ************************************************
> ###
> 1	havana	pseudogene	176241619	176242538	.	+	.	ID=gene:ENSG00000227815;Name=RP11-195C7.3;biotype=pseudogene;gene_id=ENSG00000227815;logic_name=havana;version=2
> 1	havana	pseudogene	176241619	176242538	.	+	.	ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;Name=RP11-195C7.3-001;biotype=unprocessed_pseudogene;havana_transcript=OTTHUMT00000084685;havana_version=2;tag=basic;transcript_id=ENST00000440296;version=2
> 1	havana	exon	176241619	176241675	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001660785;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001660785;rank=1;version=2
> 1	havana	exon	176241743	176242168	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001739151;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001739151;rank=2;version=2
> 1	havana	exon	176242227	176242538	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001773509;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001773509;rank=3;version=2
> ###
> ??:	Based on 3rd column : pseudogene <= pseudogene <= exon
> ??:	biotype attribute :	      pseudogene <= unprocessed_pseudogene
> Question2: Why is the second feature also a pseudogene? Would it not be more coherent to use a subclass of pseudogene as its type, such as unprocessed_pseudogene, as you already do for the biotype in this case?
> 		    As in Question1, why does the biotype change? Would it not be better to have unprocessed_pseudogene for the top feature too?
> ###
> 1	havana	pseudogene	13411551	13414482	.	+	.	ID=gene:ENSG00000237700;Name=RP11-219C24.6;biotype=pseudogene;gene_id=ENSG00000237700;logic_name=havana;version=1
> 1	havana	pseudogene	13411551	13414482	.	+	.	ID=transcript:ENST00000437300;Parent=gene:ENSG00000237700;Name=RP11-219C24.6-001;biotype=unitary_pseudogene;havana_transcript=OTTHUMT00000022042;havana_version=1;tag=basic;transcript_id=ENST00000437300;version=1
> 1	havana	exon	13411551	13411837	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001677077;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001677077;rank=1;version=1
> 1	havana	exon	13412234	13412812	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001715540;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001715540;rank=2;version=1
> 1	havana	exon	13413924	13414482	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001784031;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001784031;rank=3;version=1
> ###
> ??:	Based on 3rd column : pseudogene <= pseudogene <= exon
> ??:	biotype attribute :           pseudogene <= unitary_pseudogene
> The same as question2 ...
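
The biotype differences in the SECOND and THIRD groups can be found automatically. The following is a rough Python sketch, assuming the whole file fits in memory; `parse_attributes` and `biotype_mismatches` are hypothetical helper names, and the two example rows are shortened copies of the ENSG00000227815 records above.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split GFF3 column 9 into a dict, decoding percent-escapes such as %2C."""
    attributes = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[key] = unquote(value)
    return attributes


def biotype_mismatches(gff3_lines):
    """Return (parent_id, parent_biotype, child_id, child_biotype) tuples
    for every child whose biotype differs from that of its parent."""
    biotype_by_id = {}
    children = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        attrs = parse_attributes(line.rstrip("\n").split("\t")[8])
        if "ID" not in attrs or "biotype" not in attrs:
            continue  # e.g. exon rows, which carry no biotype here
        biotype_by_id[attrs["ID"]] = attrs["biotype"]
        if "Parent" in attrs:
            # GFF3 allows several comma-separated parents.
            for parent in attrs["Parent"].split(","):
                children.append((parent, attrs["ID"], attrs["biotype"]))
    return [
        (parent, biotype_by_id[parent], child, biotype)
        for parent, child, biotype in children
        if parent in biotype_by_id and biotype_by_id[parent] != biotype
    ]


# Shortened copies of two rows from the THIRD group (most attributes omitted).
rows = [
    "1\thavana\tpseudogene\t176241619\t176242538\t.\t+\t.\t"
    "ID=gene:ENSG00000227815;biotype=pseudogene",
    "1\thavana\tpseudogene\t176241619\t176242538\t.\t+\t.\t"
    "ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;biotype=unprocessed_pseudogene",
]
for mismatch in biotype_mismatches(rows):
    print(mismatch)
```

For these two rows the sketch reports the gene with biotype `pseudogene` and its transcript with biotype `unprocessed_pseudogene`.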
> *************************************************
>     FOURTH, same feature used differently :
> *************************************************
> ### Locus1
> 1	havana	processed_transcript	141474	173862	.	-	.	ID=gene:ENSG00000241860;Name=RP11-34P13.13;biotype=processed_transcript;gene_id=ENSG00000241860;logic_name=havana;version=2
> 1	havana	transcript	141474	149707	.	-	.	ID=transcript:ENST00000484859;Parent=gene:ENSG00000241860;Name=RP11-34P13.13-004;biotype=antisense;havana_transcript=OTTHUMT00000007035;havana_version=1;tag=basic;transcript_id=ENST00000484859;version=1
> 1	havana	exon	141474	143011	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001911218;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001911218;rank=2;version=1
> 1	havana	exon	146386	149707	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001860404;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001860404;rank=1;version=1
> ### Locus2
> 1	ensembl_havana	pseudogene	11869	14412	.	+	.	ID=gene:ENSG00000223972;Name=DDX11L1;biotype=pseudogene;description=DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:37102];gene_id=ENSG00000223972;logic_name=ensembl_havana_gene;version=4
> 1	ensembl_havana	processed_transcript	11869	14409	.	+	.	ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-002;biotype=processed_transcript;havana_transcript=OTTHUMT00000362751;havana_version=1;tag=basic;transcript_id=ENST00000456328;version=2
> 1	havana	exon	11869	12227	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
> 1	havana	exon	12613	12721	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
> 1	havana	exon	13221	14409	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
> ###
> ?? => At the first locus, processed_transcript is a top-level feature (it has a transcript child); at the second locus, processed_transcript is a child of the pseudogene feature.
> Question3: Why does the same feature type (processed_transcript) not follow the same schema? In my view, it should always be either a top-level feature or a child of a top-level feature.
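
One way to list every type that is used at more than one level is to derive the level from the Parent/child structure instead of from the type, then group by type. This is a sketch under that assumption; `structural_levels` and `ambiguous_types` are hypothetical names, and the example rows are shortened copies of Locus1 and Locus2 above.

```python
from collections import defaultdict


def structural_levels(gff3_lines):
    """Map each type (column 3) to the set of structural levels it occupies.

    Level 1: no Parent. Level 2: has a Parent and children.
    Level 3: has a Parent but no children.
    """
    records = []           # (type, ID or None, list of parent IDs)
    has_children = set()   # every ID that is named as a Parent somewhere
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        attrs = dict(f.split("=", 1) for f in columns[8].split(";") if "=" in f)
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        has_children.update(parents)
        records.append((columns[2], attrs.get("ID"), parents))

    levels = defaultdict(set)
    for feature_type, feature_id, parents in records:
        if not parents:
            levels[feature_type].add(1)
        elif feature_id in has_children:
            levels[feature_type].add(2)
        else:
            levels[feature_type].add(3)
    return levels


def ambiguous_types(gff3_lines):
    """Return {type: sorted levels} for the types found at more than one level."""
    return {
        feature_type: sorted(found)
        for feature_type, found in structural_levels(gff3_lines).items()
        if len(found) > 1
    }


# Shortened copies of Locus1 and Locus2 (coordinates and most attributes omitted).
rows = [
    "1\thavana\tprocessed_transcript\t141474\t173862\t.\t-\t.\tID=gene:ENSG00000241860",
    "1\thavana\ttranscript\t141474\t149707\t.\t-\t.\tID=transcript:ENST00000484859;Parent=gene:ENSG00000241860",
    "1\thavana\texon\t141474\t143011\t.\t-\t.\tParent=transcript:ENST00000484859",
    "1\tensembl_havana\tpseudogene\t11869\t14412\t.\t+\t.\tID=gene:ENSG00000223972",
    "1\tensembl_havana\tprocessed_transcript\t11869\t14409\t.\t+\t.\tID=transcript:ENST00000456328;Parent=gene:ENSG00000223972",
    "1\thavana\texon\t11869\t12227\t.\t+\t.\tParent=transcript:ENST00000456328",
]
print(ambiguous_types(rows))  # {'processed_transcript': [1, 2]}
```

Run over a whole file, the same report would also flag `pseudogene`, which appears both as a gene and as a transcript in the THIRD group.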
> ********************************
>     FIFTH, semantic choice :
> ********************************
> ###
> 1	ensembl	RNA	1340841	1341132	.	-	.	ID=gene:ENSG00000264293;Name=RN7SL657P;biotype=misc_RNA;description=RNA%2C 7SL%2C cytoplasmic 657%2C pseudogene [Source:HGNC Symbol%3BAcc:46673];gene_id=ENSG00000264293;logic_name=ncrna;version=1
> 1	ensembl	transcript	1340841	1341132	.	-	.	ID=transcript:ENST00000582431;Parent=gene:ENSG00000264293;Name=RN7SL657P-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000582431;version=1
> 1	ensembl	exon	1340841	1341132	.	-	.	Parent=transcript:ENST00000582431;Name=ENSE00002720632;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002720632;rank=1;version=1
> ###
> ??: Just a semantic choice, but an RNA that has a transcript as a child is a bit strange.
> Remark: It would be great to replace transcript with RNA, and to use something like RNA_gene as the top-level feature, as you do for all other RNA feature types.
> I would therefore like some clarification of the choices you made when creating your GFF3 annotation file. I would also like to know whether you could make it more consistent; I think that would make things easier for everybody.
> Thanks in advance,
> Best regards,
> Jacques Dainat, PhD
> NBIS (National Bioinformatics Infrastructure Sweden)
> Genome Annotation Service
> Address: (room E10:4204 - last floor)
> Uppsala University, BMC
> Department of Medical Biochemistry Microbiology, Genomics
> Husargatan 3, box 582
> S-75123 Uppsala Sweden
> Phone: 01 84 71 46 25
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
