[ensembl-dev] Triticum aestivum invalid GFF3

Hans Vasquez-Gross havasquezgross at ucdavis.edu
Sat Feb 22 01:34:38 GMT 2014


Hello,

I recently downloaded the MIPS GFF3 annotation for Triticum_aestivum from
your FTP site:

ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz

I tried loading this file into a genome browser for visualization, but it
does not validate.  The problem seems to be in how the ID= attribute in the
9th column is set up.  According to the SO GFF3 specification
(http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th column MUST
be unique.  Currently, however, every transcript/CDS/exon group has the same
ID collision, which I'll explain below using the first occurrence in the
file.

If you take a look at lines 133-138 in the gff3 file, you should see this:
##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
IWGSC_CSS_3AS_scaff_369935      ensembl protein_coding_gene     1       200     .       -       .       ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935      ensembl transcript      1       200     .       -       .       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935      .       CDS     1       198     .       -       0       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
IWGSC_CSS_3AS_scaff_369935      .       exon    1       200     .       -       .       ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
IWGSC_CSS_3AS_scaff_369935      .       five_prime_UTR  199     200     .       -       .       Parent=Traes_3AS_775C097A2.1;

The transcript and CDS lines both define the exact same ID,
"Traes_3AS_775C097A2.1", which causes the naming collision.  You will
also notice that on the CDS line, ID= and Parent= are exactly the same.
The Parent here is meant to refer to the transcript ID, but the CDS has
that same ID.
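
A minimal sketch of how this kind of collision could be detected, assuming
the file is tab-separated with nine columns per feature line.  The function
names (parse_attributes, find_id_collisions) are only illustrative and are
not part of any Ensembl tool.  Because GFF3 lets a discontinuous feature
(such as a multi-segment CDS) repeat one ID across several lines, the sketch
only reports an ID when it is shared by more than one feature type:

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column (column 9) into a key -> value dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def find_id_collisions(lines):
    """Return {ID: [feature types]} for IDs shared by different feature types."""
    seen = defaultdict(list)
    for line in lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            seen[feature_id].append(cols[2])
    return {fid: types for fid, types in seen.items() if len(set(types)) > 1}


# The transcript and CDS lines from the example above (columns joined by tabs).
transcript = "\t".join([
    "IWGSC_CSS_3AS_scaff_369935", "ensembl", "transcript", "1", "200",
    ".", "-", ".", "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2",
])
cds = "\t".join([
    "IWGSC_CSS_3AS_scaff_369935", ".", "CDS", "1", "198",
    ".", "-", "0", "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1",
])
print(find_id_collisions([transcript, cds]))
```

For the two lines above this reports Traes_3AS_775C097A2.1 as shared by a
transcript and a CDS.  To scan the whole download, the lines could be read
with gzip.open(path, "rt") instead of a list.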

Proposed solution:
Any CDS ID could have a "C" inserted after the period.  For example,
Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1.  This is similar
to what you are already doing for the exons.  The Parent= string on the exon
line would then have to be updated with this new ID.  The new GFF3 block
for this transcript definition would then be:

##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
IWGSC_CSS_3AS_scaff_369935      ensembl protein_coding_gene     1       200     .       -       .       ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935      ensembl transcript      1       200     .       -       .       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
IWGSC_CSS_3AS_scaff_369935      .       CDS     1       198     .       -       0       ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
IWGSC_CSS_3AS_scaff_369935      .       exon    1       200     .       -       .       ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
IWGSC_CSS_3AS_scaff_369935      .       five_prime_UTR  199     200     .       -       .       Parent=Traes_3AS_775C097A2.1;
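
The renaming above could be scripted roughly as follows.  This is only a
sketch under some assumptions: columns are tab-separated, CDS IDs always end
in ".<number>", and the helper names (cds_id, fix_line) and the repoint_exons
switch are hypothetical, made up for this illustration:

```python
import re

# IDs of the form <stem>.<number>, e.g. Traes_3AS_775C097A2.1
TRANSCRIPT_STYLE_ID = re.compile(r"^(?P<stem>.+)\.(?P<num>\d+)$")


def cds_id(feature_id):
    """Turn 'X.1' into 'X.C1'; IDs of any other shape are returned unchanged."""
    match = TRANSCRIPT_STYLE_ID.match(feature_id)
    if match is None:
        return feature_id
    return f"{match.group('stem')}.C{match.group('num')}"


def fix_line(line, repoint_exons=True):
    """Rename the ID on CDS lines and, optionally, the Parent on exon lines."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 9:
        return line  # directives and malformed lines pass through untouched
    feature_type = cols[2]
    fields = cols[8].split(";")
    for i, field in enumerate(fields):
        key, _, value = field.partition("=")
        if feature_type == "CDS" and key == "ID":
            fields[i] = "ID=" + cds_id(value)
        elif feature_type == "exon" and key == "Parent" and repoint_exons:
            # Parent may hold several comma-separated IDs.
            parents = [cds_id(parent) for parent in value.split(",")]
            fields[i] = "Parent=" + ",".join(parents)
    cols[8] = ";".join(fields)
    return "\t".join(cols)
```

An ID that already has the ".C1" form does not match the pattern, so running
the script twice should leave the file unchanged.  Passing
repoint_exons=False keeps each exon's Parent on the transcript, which is how
the canonical gene example in the GFF3 specification arranges exons and CDS.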

Would it be a quick fix on your side to regenerate the file so that it
validates?  If not, I'll write my own script next week to fix the errors in
the GFF3 file.

Cheers,
-Hans