[ensembl-dev] GFF3 peculiarity in Homo_sapiens.GRCh37.82.gff3

Mon Oct 31 13:17:30 GMT 2016

Hi Jacques,
The GFF files reflect the data in our databases. The human gene set is a merge between the Ensembl set and the Havana set. Because the Havana set is manually curated, it takes precedence over the Ensembl set, and we do not modify it. The data in the GRCh37 databases will not be updated, as GRCh38 is an updated and improved version of GRCh37.

In your third example, the gene is annotated by Havana: the gene biotype is pseudogene, while the transcript biotype is unprocessed_pseudogene. If you look at our latest release on GRCh38, you will see that the discrepancy has been corrected and the gene now has the biotype unprocessed_pseudogene.
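
For anyone who wants to list such gene/transcript biotype discrepancies in a file, here is a minimal sketch. It is not part of the Ensembl code base: the function names are hypothetical, and it assumes that each gene line appears before its transcripts and that IDs carry the "gene:" and "transcript:" prefixes seen in the examples quoted below.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (9th column) into a dict.

    Values are kept as-is (no URL-decoding), which is enough for biotype checks.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def biotype_mismatches(lines):
    """Return (gene_id, gene_biotype, transcript_id, transcript_biotype)
    for every transcript whose biotype differs from its parent gene's.

    Assumption: a gene line always precedes the lines of its transcripts.
    """
    gene_biotype = {}
    mismatches = []
    for line in lines:
        # Skip blank lines, directives ("###") and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", "")
        if feature_id.startswith("gene:"):
            gene_biotype[feature_id] = attrs.get("biotype")
        elif feature_id.startswith("transcript:"):
            parent = attrs.get("Parent")
            biotype = attrs.get("biotype")
            if parent in gene_biotype and gene_biotype[parent] != biotype:
                mismatches.append(
                    (parent, gene_biotype[parent], feature_id, biotype)
                )
    return mismatches
```

Calling biotype_mismatches() on an open file handle for the GFF3 file should report pairs such as pseudogene / unprocessed_pseudogene from the third example.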

We use SO (Sequence Ontology) terms to assign the 3rd column. In your fifth example, we used the misc_RNA biotype, which is not an SO term. The correct SO term in this case would be ncRNA. This is something that our new exporters will be able to fix.
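
As an illustration only (this is not the exporter code), the kind of biotype-to-SO-term lookup described above could be sketched as follows. The table and the function name are hypothetical: the misc_RNA to ncRNA entry is the correction mentioned above, the other entries are biotypes from the quoted examples that are already SO terms, and the "transcript" fallback is an assumption.

```python
# Illustrative lookup, not Ensembl's actual mapping.
# misc_RNA -> ncRNA is the correction discussed in this thread; the remaining
# entries are biotypes from the quoted examples that are already SO terms.
BIOTYPE_TO_SO = {
    "misc_RNA": "ncRNA",
    "snRNA": "snRNA",
    "lincRNA": "lincRNA",
    "processed_pseudogene": "processed_pseudogene",
}


def so_term_for(biotype, default="transcript"):
    """Return the SO term to write in column 3 for a transcript biotype.

    Biotypes missing from the table fall back to `default` (an assumption
    made for this sketch; a real exporter may handle them differently).
    """
    return BIOTYPE_TO_SO.get(biotype, default)
```

With this table, so_term_for("misc_RNA") gives "ncRNA", so the transcript in the fifth example would no longer carry a generic type.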

Thanks
Thibaut

> On 31 Oct 2016, at 10:33, Anne Lyle <annelyle at ebi.ac.uk> wrote:
> 
> Hi Jacques
> 
> Thanks for your input. We’re currently reworking our own parsers and exporters for all the common formats, including GFF3, so we’ll take your comments into consideration.
> 
> Cheers
> 
> Anne
> 
> 
> 
>> On 31 Oct 2016, at 10:01, Jacques Dainat <jacques.dainat at bils.se> wrote:
>> 
>> Dear all,
>> 
>> I work in an annotation service, and we use the GFF3 format as the central format for everything we do. We therefore have scripts to check GFF3 files and correct them if needed.
>> While working with the GFF3 file Homo_sapiens.GRCh37.82.gff3 from your release, I ran into some peculiarities that I had never seen before.
>> 
>> First of all, I would like to explain briefly how we parse our data:
>> 
>> We usually parse our data paying close attention to the “type” (3rd column), sorting features into a three-level structure:
>> Level1 => Features that do not have a Parent: gene, pseudogene, lincrna_gene, mirna_gene, etc.
>> Level2 => Features that have a Parent and Children: mrna, trna, snorna, transcript, processed_pseudogene, etc.
>> Level3 => Features that have a Parent but no Children: cds, exon, utr, tts, stop_codon, sig_peptide, etc.
>> 
>> This works well with GFF3 files from many different sources, but it does not work properly on your data.
>> There are inconsistencies in the 3rd column, and we cannot fall back on the biotype attribute of the 9th column either, because it seems to vary within the same feature (between a gene and its transcript).
>> I would like to know whether you could homogenise these things in future data releases.
>> 
>> To illustrate the problem, here are some examples:
>> ========================================================
>> 
>> ************************************************
>>     FIRST, examples where things are fine:
>> ************************************************
>> ###
>> 1	ensembl	snRNA_gene	13384735	13384841	.	-	.	ID=gene:ENSG00000207511;Name=RNU6-771P;biotype=snRNA;description=RNA%2C U6 small nuclear 771%2C pseudogene [Source:HGNC Symbol%3BAcc:47734];gene_id=ENSG00000207511;logic_name=ncrna;version=1
>> 1	ensembl	snRNA	13384735	13384841	.	-	.	ID=transcript:ENST00000384780;Parent=gene:ENSG00000207511;Name=RNU6-771P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000384780;version=1
>> 1	ensembl	exon	13384735	13384841	.	-	.	Parent=transcript:ENST00000384780;Name=ENSE00001807562;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001807562;rank=1;version=1
>> ###
>> OK:	Based on 3rd column: snRNA_gene <= snRNA <= exon
>> OK:	biotype attribute: 		      snRNA <= snRNA
>> 
>> ###
>> 1	havana	lincRNA_gene	32814795	32816264	.	-	.	ID=gene:ENSG00000233775;Name=RP4-811H24.9;biotype=lincRNA;gene_id=ENSG00000233775;logic_name=havana;version=1
>> 1	havana	lincRNA	32814795	32816264	.	-	.	ID=transcript:ENST00000448134;Parent=gene:ENSG00000233775;Name=RP4-811H24.9-001;biotype=lincRNA;havana_transcript=OTTHUMT00000020212;havana_version=3;tag=basic;transcript_id=ENST00000448134;version=1
>> 1	havana	exon	32814795	32815422	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001776478;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001776478;rank=2;version=1
>> 1	havana	exon	32816206	32816264	.	-	.	Parent=transcript:ENST00000448134;Name=ENSE00001624254;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001624254;rank=1;version=1
>> ###
>> OK:	Based on 3rd column:  lincRNA_gene <= lincRNA <= exon
>> OK:	biotype attribute: 		       lincRNA <= lincRNA
>> 
>> ###
>> 1	ensembl_havana	gene	13474689	13477522	.	-	.	ID=gene:ENSG00000204491;Name=PRAMEF18;biotype=protein_coding;description=PRAME family member 18 [Source:HGNC Symbol%3BAcc:30693];gene_id=ENSG00000204491;logic_name=ensembl_havana_gene;version=2
>> 1	ensembl_havana	transcript	13474689	13477522	.	-	.	ID=transcript:ENST00000376126;Parent=gene:ENSG00000204491;Name=PRAMEF18-001;biotype=protein_coding;ccdsid=CCDS41258.1;havana_transcript=OTTHUMT00000008177;havana_version=2;tag=basic;transcript_id=ENST00000376126;version=2
>> 1	ensembl_havana	exon	13474689	13475262	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001592884;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSE00001592884;rank=3;version=2
>> 1	ensembl_havana	CDS	13474689	13475262	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> 1	ensembl_havana	exon	13476271	13476849	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00001620306;constitutive=1;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00001620306;rank=2;version=1
>> 1	ensembl_havana	CDS	13476271	13476849	.	-	1	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> 1	ensembl_havana	exon	13477236	13477522	.	-	.	Parent=transcript:ENST00000376126;Name=ENSE00003445415;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSE00003445415;rank=1;version=1
>> 1	ensembl_havana	CDS	13477236	13477522	.	-	0	ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> ###
>> OK:	Based on 3rd column:        gene <= transcript <= exon,CDS
>> OK:	biotype attribute: protein_coding <= protein_coding
>> 
>> ************************************************
>>     SECOND, 3rd column OK but biotype changes:
>> ************************************************
>> 
>> ###
>> 10	ensembl_havana	pseudogene	112696380	112696991	.	-	.	ID=gene:ENSG00000234118;Name=RPL13AP6;biotype=pseudogene;description=ribosomal protein L13a pseudogene 6 [Source:HGNC Symbol%3BAcc:23737];gene_id=ENSG00000234118;logic_name=ensembl_havana_gene;version=1
>> 10	ensembl_havana	processed_pseudogene	112696380	112696991	.	-	.	ID=transcript:ENST00000430133;Parent=gene:ENSG00000234118;Name=RPL13AP6-001;biotype=processed_pseudogene;havana_transcript=OTTHUMT00000050371;havana_version=1;tag=basic;transcript_id=ENST00000430133;version=1
>> 10	ensembl_havana	exon	112696380	112696991	.	-	.	Parent=transcript:ENST00000430133;Name=ENSE00002511651;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002511651;rank=1;version=1
>> ###
>> OK:	Based on 3rd column : pseudogene <= processed_pseudogene <= exon
>> ??:	biotype attribute : 	     pseudogene <= processed_pseudogene
>> 
>> Question1: Why does the biotype change? Would it not be more coherent to have the processed_pseudogene biotype for the pseudogene feature too?
>> 
>> ************************************************
>>     THIRD, 3rd column does not change but biotype changes:
>> ************************************************
>> 
>> ###
>> 1	havana	pseudogene	176241619	176242538	.	+	.	ID=gene:ENSG00000227815;Name=RP11-195C7.3;biotype=pseudogene;gene_id=ENSG00000227815;logic_name=havana;version=2
>> 1	havana	pseudogene	176241619	176242538	.	+	.	ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;Name=RP11-195C7.3-001;biotype=unprocessed_pseudogene;havana_transcript=OTTHUMT00000084685;havana_version=2;tag=basic;transcript_id=ENST00000440296;version=2
>> 1	havana	exon	176241619	176241675	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001660785;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001660785;rank=1;version=2
>> 1	havana	exon	176241743	176242168	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001739151;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001739151;rank=2;version=2
>> 1	havana	exon	176242227	176242538	.	+	.	Parent=transcript:ENST00000440296;Name=ENSE00001773509;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001773509;rank=3;version=2
>> ###
>> ??:	Based on 3rd column : pseudogene <= pseudogene <= exon
>> ??:	biotype attribute :	      pseudogene <= unprocessed_pseudogene
>> 
>> Question2: Why is the second feature also a pseudogene? Would it not be more coherent to use a subclass of pseudogene, such as unprocessed_pseudogene, as you do for the biotype in that case?
>> 		    As in Question1, why does the biotype change? Would it not be better to have unprocessed_pseudogene for the top feature too?
>> 
>> ###
>> 1	havana	pseudogene	13411551	13414482	.	+	.	ID=gene:ENSG00000237700;Name=RP11-219C24.6;biotype=pseudogene;gene_id=ENSG00000237700;logic_name=havana;version=1
>> 1	havana	pseudogene	13411551	13414482	.	+	.	ID=transcript:ENST00000437300;Parent=gene:ENSG00000237700;Name=RP11-219C24.6-001;biotype=unitary_pseudogene;havana_transcript=OTTHUMT00000022042;havana_version=1;tag=basic;transcript_id=ENST00000437300;version=1
>> 1	havana	exon	13411551	13411837	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001677077;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001677077;rank=1;version=1
>> 1	havana	exon	13412234	13412812	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001715540;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001715540;rank=2;version=1
>> 1	havana	exon	13413924	13414482	.	+	.	Parent=transcript:ENST00000437300;Name=ENSE00001784031;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001784031;rank=3;version=1
>> ###
>> ??:	Based on 3rd column : pseudogene <= pseudogene <= exon
>> ??:	biotype attribute :           pseudogene <= unitary_pseudogene
>> 
>> The same as question2 ...
>> 
>> *************************************************
>>     FOURTH, same feature used differently :
>> *************************************************
>> 
>> ### Locus1
>> 1	havana	processed_transcript	141474	173862	.	-	.	ID=gene:ENSG00000241860;Name=RP11-34P13.13;biotype=processed_transcript;gene_id=ENSG00000241860;logic_name=havana;version=2
>> 1	havana	transcript	141474	149707	.	-	.	ID=transcript:ENST00000484859;Parent=gene:ENSG00000241860;Name=RP11-34P13.13-004;biotype=antisense;havana_transcript=OTTHUMT00000007035;havana_version=1;tag=basic;transcript_id=ENST00000484859;version=1
>> 1	havana	exon	141474	143011	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001911218;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001911218;rank=2;version=1
>> 1	havana	exon	146386	149707	.	-	.	Parent=transcript:ENST00000484859;Name=ENSE00001860404;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001860404;rank=1;version=1
>> ### Locus2
>> 1	ensembl_havana	pseudogene	11869	14412	.	+	.	ID=gene:ENSG00000223972;Name=DDX11L1;biotype=pseudogene;description=DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:37102];gene_id=ENSG00000223972;logic_name=ensembl_havana_gene;version=4
>> 1	ensembl_havana	processed_transcript	11869	14409	.	+	.	ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-002;biotype=processed_transcript;havana_transcript=OTTHUMT00000362751;havana_version=1;tag=basic;transcript_id=ENST00000456328;version=2
>> 1	havana	exon	11869	12227	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
>> 1	havana	exon	12613	12721	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
>> 1	havana	exon	13221	14409	.	+	.	Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
>> ###
>> ?? => In the first locus, processed_transcript is a top-level feature (it has a transcript child), whereas in the second locus processed_transcript is a child of the pseudogene feature.
>> 
>> Question3: Why does the same feature type (processed_transcript) not follow the same schema? In my view, it should always be either a top-level feature or a child of a top-level feature.
>> 
>> ********************************
>>     FIFTH, semantic choice :
>> ********************************
>> 
>> ###
>> 1	ensembl	RNA	1340841	1341132	.	-	.	ID=gene:ENSG00000264293;Name=RN7SL657P;biotype=misc_RNA;description=RNA%2C 7SL%2C cytoplasmic 657%2C pseudogene [Source:HGNC Symbol%3BAcc:46673];gene_id=ENSG00000264293;logic_name=ncrna;version=1
>> 1	ensembl	transcript	1340841	1341132	.	-	.	ID=transcript:ENST00000582431;Parent=gene:ENSG00000264293;Name=RN7SL657P-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000582431;version=1
>> 1	ensembl	exon	1340841	1341132	.	-	.	Parent=transcript:ENST00000582431;Name=ENSE00002720632;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002720632;rank=1;version=1
>> ###
>> ??: This is just a semantic choice, but an RNA that has a transcript as a child is a bit strange.
>> 
>> Remark: Changing transcript to RNA, and using something like RNA_gene as the top-level feature as you do for all the other RNA feature types, would be great.
>> 
>> 
>> So I would like some clarification about the choices you made when creating your GFF3 annotation file. I would also like to know whether you could make the file more consistent; I think it would make things easier for everybody.
>> Thanks in advance,
>> 
>> Best regards,
>> 
>> Jacques Dainat, PhD
>> NBIS (National Bioinformatics Infrastructure Sweden)
>> Genome Annotation Service
>> 
>> Address: (room E10:4204 - last floor)
>> Uppsala University, BMC
>> Department of Medical Biochemistry Microbiology, Genomics
>> Husargatan 3, box 582
>> S-75123 Uppsala Sweden
>> Phone: 01 84 71 46 25
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
