[ensembl-dev] GFF3 peculiarity in Homo_sapiens.GRCh37.82.gff3
Jacques Dainat
jacques.dainat at bils.se
Mon Oct 31 10:01:43 GMT 2016
Dear all,
I work in an annotation service, and we use GFF3 as the central format for everything we do. We therefore have scripts that check GFF3 files and correct them if needed.
While working with the file Homo_sapiens.GRCh37.82.gff3 that you produce, I came across some peculiarities that I had never seen before.
First of all, I would like to explain briefly how we parse our data:
We usually parse our data paying close attention to the “type” (3rd column) and sorting the features into a three-level structure:
Level1 => Features that do not have Parent: gene, pseudogene, lincrna_gene, mirna_gene etc.
Level2 => Features that have Parent and Children: mrna, trna, snorna, transcript, processed_pseudogene, etc.
Level3 => Features that have Parent but no Children: cds, exon, utr, tts, stop_codon, sig_peptide, etc.
This approach works well with GFF3 files from many different sources, but it does not work properly when parsing your data.
There are inconsistencies in the 3rd column, and we cannot fall back on the biotype attribute of the 9th column either, because it seems to vary within the same feature.
I would like to know whether you could homogenise these things in future data releases.
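To make the three-level sorting described above concrete, here is a minimal Python sketch of a type-based lookup. It is not our actual script: the table holds only the feature types named in this message, the names LEVEL_BY_TYPE and level_of are made up for illustration, and it assumes tab-delimited GFF3 lines.

```python
# Hypothetical type-to-level table, built only from the feature types named
# in this message; a real table would be longer (the lists end with "etc.").
LEVEL_BY_TYPE = {
    **dict.fromkeys(["gene", "pseudogene", "lincrna_gene", "mirna_gene"], 1),
    **dict.fromkeys(["mrna", "trna", "snorna", "transcript", "processed_pseudogene"], 2),
    **dict.fromkeys(["cds", "exon", "utr", "tts", "stop_codon", "sig_peptide"], 3),
}


def level_of(gff3_line):
    """Return 1, 2 or 3 for a tab-delimited GFF3 feature line.

    Returns None for comment/separator lines, malformed lines and
    feature types that are not in the table.
    """
    columns = gff3_line.rstrip("\n").split("\t")
    if gff3_line.startswith("#") or len(columns) != 9:
        return None
    # The decision is based on the 3rd column only, compared case-insensitively.
    return LEVEL_BY_TYPE.get(columns[2].lower())
```

With a table like this, a line whose type is pseudogene is always treated as level 1, even when it carries a Parent attribute, because only the 3rd column is consulted.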
To explain the problem and back it up, here are some examples:
========================================================
************************************************
FIRST, examples where things are fine:
************************************************
###
1 ensembl snRNA_gene 13384735 13384841 . - . ID=gene:ENSG00000207511;Name=RNU6-771P;biotype=snRNA;description=RNA%2C U6 small nuclear 771%2C pseudogene [Source:HGNC Symbol%3BAcc:47734];gene_id=ENSG00000207511;logic_name=ncrna;version=1
1 ensembl snRNA 13384735 13384841 . - . ID=transcript:ENST00000384780;Parent=gene:ENSG00000207511;Name=RNU6-771P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000384780;version=1
1 ensembl exon 13384735 13384841 . - . Parent=transcript:ENST00000384780;Name=ENSE00001807562;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001807562;rank=1;version=1
###
OK: Based on 3rd column: snRNA_gene <= snRNA <= exon
OK: biotype attribute: snRNA <= snRNA
###
1 havana lincRNA_gene 32814795 32816264 . - . ID=gene:ENSG00000233775;Name=RP4-811H24.9;biotype=lincRNA;gene_id=ENSG00000233775;logic_name=havana;version=1
1 havana lincRNA 32814795 32816264 . - . ID=transcript:ENST00000448134;Parent=gene:ENSG00000233775;Name=RP4-811H24.9-001;biotype=lincRNA;havana_transcript=OTTHUMT00000020212;havana_version=3;tag=basic;transcript_id=ENST00000448134;version=1
1 havana exon 32814795 32815422 . - . Parent=transcript:ENST00000448134;Name=ENSE00001776478;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001776478;rank=2;version=1
1 havana exon 32816206 32816264 . - . Parent=transcript:ENST00000448134;Name=ENSE00001624254;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001624254;rank=1;version=1
###
OK: Based on 3rd column: lincRNA_gene <= lincRNA <= exon
OK: biotype attribute: lincRNA <= lincRNA
###
1 ensembl_havana gene 13474689 13477522 . - . ID=gene:ENSG00000204491;Name=PRAMEF18;biotype=protein_coding;description=PRAME family member 18 [Source:HGNC Symbol%3BAcc:30693];gene_id=ENSG00000204491;logic_name=ensembl_havana_gene;version=2
1 ensembl_havana transcript 13474689 13477522 . - . ID=transcript:ENST00000376126;Parent=gene:ENSG00000204491;Name=PRAMEF18-001;biotype=protein_coding;ccdsid=CCDS41258.1;havana_transcript=OTTHUMT00000008177;havana_version=2;tag=basic;transcript_id=ENST00000376126;version=2
1 ensembl_havana exon 13474689 13475262 . - . Parent=transcript:ENST00000376126;Name=ENSE00001592884;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSE00001592884;rank=3;version=2
1 ensembl_havana CDS 13474689 13475262 . - 1 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
1 ensembl_havana exon 13476271 13476849 . - . Parent=transcript:ENST00000376126;Name=ENSE00001620306;constitutive=1;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00001620306;rank=2;version=1
1 ensembl_havana CDS 13476271 13476849 . - 1 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
1 ensembl_havana exon 13477236 13477522 . - . Parent=transcript:ENST00000376126;Name=ENSE00003445415;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSE00003445415;rank=1;version=1
1 ensembl_havana CDS 13477236 13477522 . - 0 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
###
OK: Based on 3rd column: gene <= transcript <= exon,CDS
OK: biotype attribute: protein_coding <= protein_coding
************************************************
SECOND, 3rd column OK but biotype changes:
************************************************
###
10 ensembl_havana pseudogene 112696380 112696991 . - . ID=gene:ENSG00000234118;Name=RPL13AP6;biotype=pseudogene;description=ribosomal protein L13a pseudogene 6 [Source:HGNC Symbol%3BAcc:23737];gene_id=ENSG00000234118;logic_name=ensembl_havana_gene;version=1
10 ensembl_havana processed_pseudogene 112696380 112696991 . - . ID=transcript:ENST00000430133;Parent=gene:ENSG00000234118;Name=RPL13AP6-001;biotype=processed_pseudogene;havana_transcript=OTTHUMT00000050371;havana_version=1;tag=basic;transcript_id=ENST00000430133;version=1
10 ensembl_havana exon 112696380 112696991 . - . Parent=transcript:ENST00000430133;Name=ENSE00002511651;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002511651;rank=1;version=1
###
OK: Based on 3rd column : pseudogene <= processed_pseudogene <= exon
??: biotype attribute : pseudogene <= processed_pseudogene
Question 1: Why does the biotype change? Would it not be more coherent for the pseudogene feature to have the processed_pseudogene biotype too?
************************************************
THIRD, 3rd column does not change but biotype changes:
************************************************
###
1 havana pseudogene 176241619 176242538 . + . ID=gene:ENSG00000227815;Name=RP11-195C7.3;biotype=pseudogene;gene_id=ENSG00000227815;logic_name=havana;version=2
1 havana pseudogene 176241619 176242538 . + . ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;Name=RP11-195C7.3-001;biotype=unprocessed_pseudogene;havana_transcript=OTTHUMT00000084685;havana_version=2;tag=basic;transcript_id=ENST00000440296;version=2
1 havana exon 176241619 176241675 . + . Parent=transcript:ENST00000440296;Name=ENSE00001660785;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001660785;rank=1;version=2
1 havana exon 176241743 176242168 . + . Parent=transcript:ENST00000440296;Name=ENSE00001739151;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001739151;rank=2;version=2
1 havana exon 176242227 176242538 . + . Parent=transcript:ENST00000440296;Name=ENSE00001773509;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001773509;rank=3;version=2
###
??: Based on 3rd column : pseudogene <= pseudogene <= exon
??: biotype attribute : pseudogene <= unprocessed_pseudogene
Question 2: Why is the second feature also a pseudogene? Would it not be more coherent to use a subclass of pseudogene such as unprocessed_pseudogene for its type, as you do for its biotype?
As in Question 1, why does the biotype change? Would it not be better to have unprocessed_pseudogene for the top feature too?
###
1 havana pseudogene 13411551 13414482 . + . ID=gene:ENSG00000237700;Name=RP11-219C24.6;biotype=pseudogene;gene_id=ENSG00000237700;logic_name=havana;version=1
1 havana pseudogene 13411551 13414482 . + . ID=transcript:ENST00000437300;Parent=gene:ENSG00000237700;Name=RP11-219C24.6-001;biotype=unitary_pseudogene;havana_transcript=OTTHUMT00000022042;havana_version=1;tag=basic;transcript_id=ENST00000437300;version=1
1 havana exon 13411551 13411837 . + . Parent=transcript:ENST00000437300;Name=ENSE00001677077;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001677077;rank=1;version=1
1 havana exon 13412234 13412812 . + . Parent=transcript:ENST00000437300;Name=ENSE00001715540;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001715540;rank=2;version=1
1 havana exon 13413924 13414482 . + . Parent=transcript:ENST00000437300;Name=ENSE00001784031;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001784031;rank=3;version=1
###
??: Based on 3rd column : pseudogene <= pseudogene <= exon
??: biotype attribute : pseudogene <= unitary_pseudogene
The same as Question 2.
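The biotype changes shown in the SECOND and THIRD sections can be listed automatically. The following Python sketch is only an illustration (the function names are made up, and it assumes tab-delimited lines with ID, Parent and biotype attributes written as in the excerpts above); it reports every child whose biotype differs from that of its parent.

```python
def parse_attributes(column9):
    """Turn the 9th GFF3 column ("key=value;key=value") into a dict."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if "=" in pair)


def biotype_mismatches(gff3_lines):
    """Return (parent_id, parent_biotype, child_id, child_biotype) tuples
    for every child feature whose biotype differs from its parent's."""
    biotype_by_id = {}
    children = []
    for line in gff3_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9:
            continue  # skip "###" separators, comments and malformed lines
        attributes = parse_attributes(columns[8])
        feature_id = attributes.get("ID")
        biotype = attributes.get("biotype")
        if feature_id and biotype:
            biotype_by_id[feature_id] = biotype
        # Features without a biotype (for example exons) are not compared.
        if "Parent" in attributes and biotype:
            for parent_id in attributes["Parent"].split(","):
                children.append((parent_id, feature_id, biotype))
    # Compare only after reading everything, so that the order of parent
    # and child lines in the file does not matter.
    return [
        (parent_id, biotype_by_id[parent_id], child_id, child_biotype)
        for parent_id, child_id, child_biotype in children
        if parent_id in biotype_by_id and biotype_by_id[parent_id] != child_biotype
    ]
```

Run on the RP11-195C7.3 locus above, this reports the pseudogene / unprocessed_pseudogene pair, while the snRNA locus from the FIRST section produces no report.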
*************************************************
FOURTH, same feature used differently :
*************************************************
### Locus1
1 havana processed_transcript 141474 173862 . - . ID=gene:ENSG00000241860;Name=RP11-34P13.13;biotype=processed_transcript;gene_id=ENSG00000241860;logic_name=havana;version=2
1 havana transcript 141474 149707 . - . ID=transcript:ENST00000484859;Parent=gene:ENSG00000241860;Name=RP11-34P13.13-004;biotype=antisense;havana_transcript=OTTHUMT00000007035;havana_version=1;tag=basic;transcript_id=ENST00000484859;version=1
1 havana exon 141474 143011 . - . Parent=transcript:ENST00000484859;Name=ENSE00001911218;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001911218;rank=2;version=1
1 havana exon 146386 149707 . - . Parent=transcript:ENST00000484859;Name=ENSE00001860404;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001860404;rank=1;version=1
### Locus2
1 ensembl_havana pseudogene 11869 14412 . + . ID=gene:ENSG00000223972;Name=DDX11L1;biotype=pseudogene;description=DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:37102];gene_id=ENSG00000223972;logic_name=ensembl_havana_gene;version=4
1 ensembl_havana processed_transcript 11869 14409 . + . ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-002;biotype=processed_transcript;havana_transcript=OTTHUMT00000362751;havana_version=1;tag=basic;transcript_id=ENST00000456328;version=2
1 havana exon 11869 12227 . + . Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
1 havana exon 12613 12721 . + . Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
1 havana exon 13221 14409 . + . Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
###
?? => In the first locus, processed_transcript is a top-level feature (it has a transcript child), whereas in the second locus processed_transcript is a child of the pseudogene feature.
Question 3: Why does the same feature type (processed_transcript) not follow the same schema? In my view, it should always be either a top-level feature or a child of a top-level feature.
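Cases like the one in Question 3 can be found by deriving each feature's level from the ID/Parent relationships rather than from the type, and then listing the types that end up at more than one level. The Python sketch below is only an illustration under the same assumptions as before: tab-delimited lines, Parent values that match ID values, and a made-up function name.

```python
from collections import defaultdict


def types_at_several_levels(gff3_lines):
    """Return {feature_type: sorted levels} for types seen at more than one level.

    Level 1: no Parent. Level 2: has a Parent and is itself a Parent.
    Level 3: has a Parent and no children.
    """
    features = []
    parent_ids = set()
    for line in gff3_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9:
            continue  # skip "###" separators, comments and malformed lines
        attributes = dict(
            pair.split("=", 1) for pair in columns[8].split(";") if "=" in pair
        )
        parents = attributes["Parent"].split(",") if "Parent" in attributes else []
        parent_ids.update(parents)
        features.append((columns[2], attributes.get("ID"), bool(parents)))

    levels = defaultdict(set)
    for feature_type, feature_id, has_parent in features:
        if not has_parent:
            level = 1
        elif feature_id in parent_ids:
            level = 2
        else:
            level = 3
        levels[feature_type].add(level)
    return {t: sorted(found) for t, found in levels.items() if len(found) > 1}
```

On the two loci above, only processed_transcript is reported, at levels 1 and 2.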
********************************
FIFTH, semantic choice :
********************************
###
1 ensembl RNA 1340841 1341132 . - . ID=gene:ENSG00000264293;Name=RN7SL657P;biotype=misc_RNA;description=RNA%2C 7SL%2C cytoplasmic 657%2C pseudogene [Source:HGNC Symbol%3BAcc:46673];gene_id=ENSG00000264293;logic_name=ncrna;version=1
1 ensembl transcript 1340841 1341132 . - . ID=transcript:ENST00000582431;Parent=gene:ENSG00000264293;Name=RN7SL657P-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000582431;version=1
1 ensembl exon 1340841 1341132 . - . Parent=transcript:ENST00000582431;Name=ENSE00002720632;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002720632;rank=1;version=1
###
??: Just a semantic choice, but an RNA feature that has a transcript as its child is a bit strange.
Remark: Replacing transcript with RNA, and using something like RNA_gene as the top-level feature as you do for all other RNA feature types, would be great.
So I would like some clarification about the choices you made when creating your GFF3 annotation file. I would also like to know whether you could make things more consistent; I guess it would be easier for everybody.
Thanks in advance,
Best regards,
Jacques Dainat, PhD
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
Address: (room E10:4204 - last floor)
Uppsala University, BMC
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: 01 84 71 46 25