[ensembl-dev] GFF3 peculiarity in Homo_sapiens.GRCh37.82.gff3
Thibaut Hourlier
thibaut at ebi.ac.uk
Mon Oct 31 13:17:30 GMT 2016
Hi Jacques,
The GFF files reflect the data in our databases. The human gene set is a merge of the Ensembl set and the Havana set. Because the Havana set is manually curated, it takes precedence over the Ensembl set, and we do not modify it. The data in the GRCh37 databases will not be updated, as GRCh38 is an updated and improved version of GRCh37.
Your third example is a gene annotated by Havana: the gene biotype is pseudogene, while the transcript biotype is unprocessed_pseudogene. If you look at our latest release, which uses GRCh38, you will see that the discrepancy has been corrected and the gene now has the biotype unprocessed_pseudogene.
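A minimal sketch of how such gene/transcript biotype discrepancies could be listed from a GFF3 file. The function names are hypothetical and not part of any Ensembl tool; the sketch assumes tab-separated columns and that each parent line appears before its children, as in the examples quoted below.

```python
def parse_attributes(column9):
    """Split the 9th GFF3 column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def biotype_mismatches(lines):
    """Return (parent_id, parent_biotype, child_id, child_biotype) tuples
    for every feature whose biotype differs from its parent's biotype."""
    biotypes = {}      # feature ID -> biotype, filled as lines are read
    mismatches = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue   # skip directives, comments and malformed rows
        attrs = parse_attributes(cols[8])
        biotype = attrs.get("biotype")
        if biotype is None:
            continue   # e.g. exon and CDS rows carry no biotype
        if "ID" in attrs:
            biotypes[attrs["ID"]] = biotype
        # Parent may hold several comma-separated IDs
        for parent in attrs.get("Parent", "").split(","):
            parent_biotype = biotypes.get(parent)
            if parent_biotype is not None and parent_biotype != biotype:
                mismatches.append((parent, parent_biotype, attrs.get("ID"), biotype))
    return mismatches
```

Run over the GRCh37 file, a check like this would be expected to report pairs such as pseudogene / unprocessed_pseudogene from the third example.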
We use SO (Sequence Ontology) terms to assign the 3rd column. In your fifth example, the biotype is misc_RNA, which is not an SO term. The correct SO term in this case would be ncRNA. This is something that our new exporters will be able to fix.
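As a rough illustration of the idea, a biotype-to-SO-term lookup for the 3rd column could look like the sketch below. The table is hypothetical: it only contains transcript-level pairs visible in this thread plus the misc_RNA to ncRNA correction mentioned above, and it is not taken from the Ensembl exporter code. Applying ncRNA to the transcript row, and falling back to "transcript" for unknown biotypes, are assumptions.

```python
# Illustrative mapping only: entries come from the examples in this thread,
# not from any Ensembl exporter.
TRANSCRIPT_BIOTYPE_TO_SO = {
    "snRNA": "snRNA",
    "lincRNA": "lincRNA",
    "processed_pseudogene": "processed_pseudogene",
    "misc_RNA": "ncRNA",  # misc_RNA itself is not an SO term
}


def transcript_so_term(biotype, default="transcript"):
    """Return the 3rd-column term for a transcript biotype.

    Unknown biotypes fall back to the generic term given by `default`
    (an assumption made for this sketch).
    """
    return TRANSCRIPT_BIOTYPE_TO_SO.get(biotype, default)
```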
Thanks
Thibaut
> On 31 Oct 2016, at 10:33, Anne Lyle <annelyle at ebi.ac.uk> wrote:
>
> Hi Jacques
>
> Thanks for your input. We’re currently reworking our own parsers and exporters for all the common formats, including GFF3, so we’ll take your comments into consideration.
>
> Cheers
>
> Anne
>
>
>
>> On 31 Oct 2016, at 10:01, Jacques Dainat <jacques.dainat at bils.se> wrote:
>>
>> Hi,
>>
>> Dear all,
>>
>> I work in an annotation service where the GFF3 format is the central format for everything we do, so we have scripts that check GFF3 files and correct them if needed.
>> While working with the GFF3 file Homo_sapiens.GRCh37.82.gff3 that you provide, I came across some peculiarities that I had never seen before.
>>
>> First of all, I would like to quickly explain how we parse our data:
>>
>> We usually parse our data paying close attention to the “type” (3rd column) and sorting features into a three-level structure:
>> Level1 => Features that do not have Parent: gene, pseudogene, lincrna_gene, mirna_gene etc.
>> Level2 => Features that have Parent and Children: mrna, trna, snorna, transcript, processed_pseudogene, etc.
>> Level3 => Features that have Parent but no Children: cds, exon, utr, tts, stop_codon, sig_peptide, etc.
>>
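The three levels described above can be sketched as follows. This is not the NBIS parser, only a minimal illustration under stated assumptions: tab-separated GFF3 rows, relationships taken solely from the ID and Parent attributes, and hypothetical function names.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split the 9th GFF3 column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def levels_by_type(lines):
    """Map each feature type (3rd column) to the set of levels it occurs at.

    Level 1: no Parent; level 2: has a Parent and is itself a Parent;
    level 3: has a Parent but no children.
    """
    records = []        # (feature type, ID or None, list of parent IDs)
    parent_ids = set()  # every ID that some feature names as its Parent
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue    # skip directives, comments and malformed rows
        attrs = parse_attributes(cols[8])
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        parent_ids.update(parents)
        records.append((cols[2], attrs.get("ID"), parents))

    levels = defaultdict(set)
    for feature_type, feature_id, parents in records:
        if not parents:
            level = 1
        elif feature_id in parent_ids:
            level = 2
        else:
            level = 3
        levels[feature_type].add(level)
    return dict(levels)
```

A feature type that maps to more than one level, for example `{1, 2}`, is the kind of inconsistency that a type-driven parser trips over.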
>> This works quite well with GFF3 files from many different sources, but it does not work properly with your data.
>> There are inconsistencies within the 3rd column, and we cannot rely on the biotype attribute of the 9th column either, since it seems to vary within the same feature (between a gene and its transcript).
>> I would like to know whether you could homogenise these things in future data releases.
>>
>> To illustrate and support the point, here are some examples:
>> ========================================================
>>
>> ************************************************
>> FIRST, examples where things are fine:
>> ************************************************
>> ###
>> 1 ensembl snRNA_gene 13384735 13384841 . - . ID=gene:ENSG00000207511;Name=RNU6-771P;biotype=snRNA;description=RNA%2C U6 small nuclear 771%2C pseudogene [Source:HGNC Symbol%3BAcc:47734];gene_id=ENSG00000207511;logic_name=ncrna;version=1
>> 1 ensembl snRNA 13384735 13384841 . - . ID=transcript:ENST00000384780;Parent=gene:ENSG00000207511;Name=RNU6-771P-201;biotype=snRNA;tag=basic;transcript_id=ENST00000384780;version=1
>> 1 ensembl exon 13384735 13384841 . - . Parent=transcript:ENST00000384780;Name=ENSE00001807562;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001807562;rank=1;version=1
>> ###
>> OK: Based on 3rd column: snRNA_gene <= snRNA <= exon
>> OK: biotype attribute: snRNA <= snRNA
>>
>> ###
>> 1 havana lincRNA_gene 32814795 32816264 . - . ID=gene:ENSG00000233775;Name=RP4-811H24.9;biotype=lincRNA;gene_id=ENSG00000233775;logic_name=havana;version=1
>> 1 havana lincRNA 32814795 32816264 . - . ID=transcript:ENST00000448134;Parent=gene:ENSG00000233775;Name=RP4-811H24.9-001;biotype=lincRNA;havana_transcript=OTTHUMT00000020212;havana_version=3;tag=basic;transcript_id=ENST00000448134;version=1
>> 1 havana exon 32814795 32815422 . - . Parent=transcript:ENST00000448134;Name=ENSE00001776478;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001776478;rank=2;version=1
>> 1 havana exon 32816206 32816264 . - . Parent=transcript:ENST00000448134;Name=ENSE00001624254;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001624254;rank=1;version=1
>> ###
>> OK: Based on 3rd column: lincRNA_gene <= lincRNA <= exon
>> OK: biotype attribute: lincRNA <= lincRNA
>>
>> ###
>> 1 ensembl_havana gene 13474689 13477522 . - . ID=gene:ENSG00000204491;Name=PRAMEF18;biotype=protein_coding;description=PRAME family member 18 [Source:HGNC Symbol%3BAcc:30693];gene_id=ENSG00000204491;logic_name=ensembl_havana_gene;version=2
>> 1 ensembl_havana transcript 13474689 13477522 . - . ID=transcript:ENST00000376126;Parent=gene:ENSG00000204491;Name=PRAMEF18-001;biotype=protein_coding;ccdsid=CCDS41258.1;havana_transcript=OTTHUMT00000008177;havana_version=2;tag=basic;transcript_id=ENST00000376126;version=2
>> 1 ensembl_havana exon 13474689 13475262 . - . Parent=transcript:ENST00000376126;Name=ENSE00001592884;constitutive=1;ensembl_end_phase=-1;ensembl_phase=2;exon_id=ENSE00001592884;rank=3;version=2
>> 1 ensembl_havana CDS 13474689 13475262 . - 1 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> 1 ensembl_havana exon 13476271 13476849 . - . Parent=transcript:ENST00000376126;Name=ENSE00001620306;constitutive=1;ensembl_end_phase=2;ensembl_phase=2;exon_id=ENSE00001620306;rank=2;version=1
>> 1 ensembl_havana CDS 13476271 13476849 . - 1 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> 1 ensembl_havana exon 13477236 13477522 . - . Parent=transcript:ENST00000376126;Name=ENSE00003445415;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=ENSE00003445415;rank=1;version=1
>> 1 ensembl_havana CDS 13477236 13477522 . - 0 ID=CDS:ENSP00000365294;Parent=transcript:ENST00000376126;protein_id=ENSP00000365294
>> ###
>> OK: Based on 3rd column: gene <= transcript <= exon,CDS
>> OK: biotype attribute: protein_coding <= protein_coding
>>
>> ************************************************
>> SECOND, 3rd column OK but biotype changes:
>> ************************************************
>>
>> ###
>> 10 ensembl_havana pseudogene 112696380 112696991 . - . ID=gene:ENSG00000234118;Name=RPL13AP6;biotype=pseudogene;description=ribosomal protein L13a pseudogene 6 [Source:HGNC Symbol%3BAcc:23737];gene_id=ENSG00000234118;logic_name=ensembl_havana_gene;version=1
>> 10 ensembl_havana processed_pseudogene 112696380 112696991 . - . ID=transcript:ENST00000430133;Parent=gene:ENSG00000234118;Name=RPL13AP6-001;biotype=processed_pseudogene;havana_transcript=OTTHUMT00000050371;havana_version=1;tag=basic;transcript_id=ENST00000430133;version=1
>> 10 ensembl_havana exon 112696380 112696991 . - . Parent=transcript:ENST00000430133;Name=ENSE00002511651;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002511651;rank=1;version=1
>> ###
>> OK: Based on 3rd column : pseudogene <= processed_pseudogene <= exon
>> ??: biotype attribute : pseudogene <= processed_pseudogene
>>
>> Question1: Why does the biotype change? Would it not be more coherent to have the processed_pseudogene biotype for the pseudogene feature too?
>>
>> ************************************************
>> THIRD, 3rd column does not change but biotype changes:
>> ************************************************
>>
>> ###
>> 1 havana pseudogene 176241619 176242538 . + . ID=gene:ENSG00000227815;Name=RP11-195C7.3;biotype=pseudogene;gene_id=ENSG00000227815;logic_name=havana;version=2
>> 1 havana pseudogene 176241619 176242538 . + . ID=transcript:ENST00000440296;Parent=gene:ENSG00000227815;Name=RP11-195C7.3-001;biotype=unprocessed_pseudogene;havana_transcript=OTTHUMT00000084685;havana_version=2;tag=basic;transcript_id=ENST00000440296;version=2
>> 1 havana exon 176241619 176241675 . + . Parent=transcript:ENST00000440296;Name=ENSE00001660785;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001660785;rank=1;version=2
>> 1 havana exon 176241743 176242168 . + . Parent=transcript:ENST00000440296;Name=ENSE00001739151;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001739151;rank=2;version=2
>> 1 havana exon 176242227 176242538 . + . Parent=transcript:ENST00000440296;Name=ENSE00001773509;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001773509;rank=3;version=2
>> ###
>> ??: Based on 3rd column : pseudogene <= pseudogene <= exon
>> ??: biotype attribute : pseudogene <= unprocessed_pseudogene
>>
>> Question2: Why is the second feature also a pseudogene? Would it not be more coherent to use a subclass of pseudogene, such as unprocessed_pseudogene, as you do for the biotype in that case?
>> As for Question1, why does the biotype change? Would it not be better to have unprocessed_pseudogene for the top feature too?
>>
>> ###
>> 1 havana pseudogene 13411551 13414482 . + . ID=gene:ENSG00000237700;Name=RP11-219C24.6;biotype=pseudogene;gene_id=ENSG00000237700;logic_name=havana;version=1
>> 1 havana pseudogene 13411551 13414482 . + . ID=transcript:ENST00000437300;Parent=gene:ENSG00000237700;Name=RP11-219C24.6-001;biotype=unitary_pseudogene;havana_transcript=OTTHUMT00000022042;havana_version=1;tag=basic;transcript_id=ENST00000437300;version=1
>> 1 havana exon 13411551 13411837 . + . Parent=transcript:ENST00000437300;Name=ENSE00001677077;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001677077;rank=1;version=1
>> 1 havana exon 13412234 13412812 . + . Parent=transcript:ENST00000437300;Name=ENSE00001715540;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001715540;rank=2;version=1
>> 1 havana exon 13413924 13414482 . + . Parent=transcript:ENST00000437300;Name=ENSE00001784031;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001784031;rank=3;version=1
>> ###
>> ??: Based on 3rd column : pseudogene <= pseudogene <= exon
>> ??: biotype attribute : pseudogene <= unitary_pseudogene
>>
>> The same questions as in Question2 apply here.
>>
>> *************************************************
>> FOURTH, same feature used differently :
>> *************************************************
>>
>> ### Locus1
>> 1 havana processed_transcript 141474 173862 . - . ID=gene:ENSG00000241860;Name=RP11-34P13.13;biotype=processed_transcript;gene_id=ENSG00000241860;logic_name=havana;version=2
>> 1 havana transcript 141474 149707 . - . ID=transcript:ENST00000484859;Parent=gene:ENSG00000241860;Name=RP11-34P13.13-004;biotype=antisense;havana_transcript=OTTHUMT00000007035;havana_version=1;tag=basic;transcript_id=ENST00000484859;version=1
>> 1 havana exon 141474 143011 . - . Parent=transcript:ENST00000484859;Name=ENSE00001911218;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001911218;rank=2;version=1
>> 1 havana exon 146386 149707 . - . Parent=transcript:ENST00000484859;Name=ENSE00001860404;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001860404;rank=1;version=1
>> ### Locus2
>> 1 ensembl_havana pseudogene 11869 14412 . + . ID=gene:ENSG00000223972;Name=DDX11L1;biotype=pseudogene;description=DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC Symbol%3BAcc:37102];gene_id=ENSG00000223972;logic_name=ensembl_havana_gene;version=4
>> 1 ensembl_havana processed_transcript 11869 14409 . + . ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-002;biotype=processed_transcript;havana_transcript=OTTHUMT00000362751;havana_version=1;tag=basic;transcript_id=ENST00000456328;version=2
>> 1 havana exon 11869 12227 . + . Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
>> 1 havana exon 12613 12721 . + . Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
>> 1 havana exon 13221 14409 . + . Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
>> ###
>> ?? => In the first locus, processed_transcript is a top-level feature (it has a transcript child), whereas in the second locus processed_transcript is a child of the pseudogene feature.
>>
>> Question3: Why does the same feature type (processed_transcript) not follow the same schema? In my view, it should always be either a top-level feature or a child of a top-level feature.
>>
>> ********************************
>> FIFTH, semantic choice :
>> ********************************
>>
>> ###
>> 1 ensembl RNA 1340841 1341132 . - . ID=gene:ENSG00000264293;Name=RN7SL657P;biotype=misc_RNA;description=RNA%2C 7SL%2C cytoplasmic 657%2C pseudogene [Source:HGNC Symbol%3BAcc:46673];gene_id=ENSG00000264293;logic_name=ncrna;version=1
>> 1 ensembl transcript 1340841 1341132 . - . ID=transcript:ENST00000582431;Parent=gene:ENSG00000264293;Name=RN7SL657P-201;biotype=misc_RNA;tag=basic;transcript_id=ENST00000582431;version=1
>> 1 ensembl exon 1340841 1341132 . - . Parent=transcript:ENST00000582431;Name=ENSE00002720632;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002720632;rank=1;version=1
>> ###
>> ??: Just a semantic choice, but an RNA that has a transcript as a child is a bit strange.
>>
>> Remark: It would be great to change transcript to RNA and to use something like RNA_gene as the top-level feature, as you do for all other RNA feature types.
>>
>>
>> So I would like some clarification about the choices you made when creating your GFF3 annotation file. I would also like to know whether you could make things more consistent; I guess it would be easier for everybody.
>> Thanks in advance,
>>
>> Best regards,
>>
>> Jacques Dainat, PhD
>> NBIS (National Bioinformatics Infrastructure Sweden)
>> Genome Annotation Service
>>
>> Address: (room E10:4204 - last floor)
>> Uppsala University, BMC
>> Department of Medical Biochemistry Microbiology, Genomics
>> Husargatan 3, box 582
>> S-75123 Uppsala Sweden
>> Phone: 01 84 71 46 25
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/