[ensembl-dev] Triticum aestivum invalid GFF3
Hans Vasquez-Gross
havasquezgross at ucdavis.edu
Sat Feb 22 01:34:38 GMT 2014
Hello,
I recently downloaded the MIPS GFF3 annotation provided on your FTP site for
Triticum_aestivum.
ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
I tried loading this file into a genome browser for visualization, but it
does not validate. The problem appears to be in how the ID= attribute in
the 9th column is set up. According to the SO specification (
http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th column MUST
be unique. Currently, all transcript/CDS/exon relationships have the same
ID collision, which I'll explain below using the first occurrence of the
problem.
If you take a look at lines 133-138 in the gff3 file, you should see this:
##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
IWGSC_CSS_3AS_scaff_369935	ensembl	protein_coding_gene	1	200	.	-	.	ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935	ensembl	transcript	1	200	.	-	.	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935	.	CDS	1	198	.	-	0	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
IWGSC_CSS_3AS_scaff_369935	.	exon	1	200	.	-	.	ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
IWGSC_CSS_3AS_scaff_369935	.	five_prime_UTR	199	200	.	-	.	Parent=Traes_3AS_775C097A2.1;
The transcript and CDS lines both define the exact same ID,
"Traes_3AS_775C097A2.1", which is causing the naming collision. You will
also notice that on the CDS line the ID= and Parent= values are identical.
The Parent in this case is meant to refer to the transcript ID, but the CDS
carries that same ID.
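The collision described above can also be checked for mechanically. Here is
a minimal Python sketch, assuming tab-separated columns and ";"-separated
key=value attributes in the 9th column. The function name find_shared_ids is
made up for illustration. It only reports IDs used by more than one feature
type, since GFF3 allows a single feature that spans several lines (for
example a multi-exon CDS) to repeat its ID.

```python
from collections import defaultdict


def find_shared_ids(lines):
    """Map each ID used by more than one feature type to its (line number, type) uses."""
    uses = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        cols = line.rstrip("\n").split("\t")
        # Skip directives, comments, and anything that is not a 9-column feature line.
        if line.startswith("#") or len(cols) != 9:
            continue
        for attr in cols[8].split(";"):
            key, sep, value = attr.partition("=")
            if sep and key == "ID":
                uses[value].append((lineno, cols[2]))
    # Keep only IDs that appear under at least two different feature types.
    return {
        fid: hits
        for fid, hits in uses.items()
        if len({ftype for _, ftype in hits}) > 1
    }
```

The function accepts any iterable of lines, so an open file handle (for
instance from gzip.open(path, "rt")) can be passed in directly.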
Proposed solution:
Any CDS ID could have a "C" inserted after the period. For example,
Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1. This is similar
to what you are already doing for the exons. The Parent= string on the exon
line would then have to be updated to this new ID. The new GFF3 block for
this transcript definition would then be:
##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
IWGSC_CSS_3AS_scaff_369935	ensembl	protein_coding_gene	1	200	.	-	.	ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935	ensembl	transcript	1	200	.	-	.	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935	.	CDS	1	198	.	-	0	ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
IWGSC_CSS_3AS_scaff_369935	.	exon	1	200	.	-	.	ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
IWGSC_CSS_3AS_scaff_369935	.	five_prime_UTR	199	200	.	-	.	Parent=Traes_3AS_775C097A2.1;
Would it be a quick fix on your side to regenerate the data so that it is
valid? If not, I'll write my own script next week to fix the errors in the
GFF3 file.
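A rough Python sketch of what such a script might look like, assuming
tab-separated columns and IDs that end in a period followed by digits. The
names fix_line and retarget_exons are hypothetical. The retarget_exons switch
controls whether exon Parent= values are rewritten to the new CDS ID as in the
block above, or left pointing at the transcript.

```python
import re


def _insert_c(value):
    # "Traes_3AS_775C097A2.1" -> "Traes_3AS_775C097A2.C1"; other values pass through.
    return re.sub(r"\.(\d+)$", r".C\1", value)


def fix_line(line, retarget_exons=True):
    """Return one GFF3 line with the CDS ID (and optionally the exon Parent) renamed."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    # Leave directives, comments, and malformed lines untouched.
    if line.startswith("#") or len(cols) != 9:
        return line
    ftype = cols[2]
    attrs = []
    for attr in cols[8].split(";"):
        key, sep, value = attr.partition("=")
        if ftype == "CDS" and key == "ID":
            value = _insert_c(value)
        elif ftype == "exon" and key == "Parent" and retarget_exons:
            # Parent may hold several comma-separated IDs.
            value = ",".join(_insert_c(v) for v in value.split(","))
        attrs.append(key + sep + value)
    cols[8] = ";".join(attrs)
    return "\t".join(cols)
```

Applied line by line, this leaves gene, transcript, and UTR lines unchanged
and only touches the CDS ID= and, if requested, the exon Parent=.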
Cheers,
-Hans