[ensembl-dev] Triticum aestivum invalid GFF3

Arnaud Kerhornou arnaud at ebi.ac.uk
Mon Feb 24 10:10:10 GMT 2014


Hello Hans,

Sorry about that; it's something we missed. This will be corrected in 
the coming release of Ensembl Genomes, which will be out around 
mid-March, so feel free to correct it on your side in the meantime.
Note that the next release of bread wheat will include an updated gene set.

Best regards,
Arnaud

On 22/02/2014 01:34, Hans Vasquez-Gross wrote:
> Hello,
>
> I recently downloaded the MIPS GFF3 annotation provided on your FTP 
> site for Triticum_aestivum.
>
> ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
>
> I tried loading this file for visualization in a genome browser, but 
> it does not validate.  There seems to be a problem in the way the 
> ID= attribute in the 9th column is set up.  According to the SO GFF3 
> specification (http://www.sequenceontology.org/gff3.shtml), the ID= 
> in the 9th column MUST be unique.  But currently, every 
> transcript/CDS/exon group has the same ID collision issue, which I'll 
> explain below using the first example.
>
> If you take a look at lines 133-138 in the gff3 file, you should see this:
> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935	ensembl	protein_coding_gene	1	200	.	-	.	ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935	ensembl	transcript	1	200	.	-	.	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935	.	CDS	1	198	.	-	0	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935	.	exon	1	200	.	-	.	ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935	.	five_prime_UTR	199	200	.	-	.	Parent=Traes_3AS_775C097A2.1;
>
> The transcript and CDS lines have exactly the same ID, 
> "Traes_3AS_775C097A2.1", which causes the naming collision.  You 
> will also notice that in the CDS line, the ID= and Parent= are 
> exactly the same.  The Parent in this case is meant to refer to the 
> transcript ID, but the CDS has the same ID.
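
The collision described above can be checked mechanically. The following is a minimal sketch, not a full GFF3 validator: it assumes tab-separated 9-column records and reports any ID that appears on more than one feature type. It flags only cross-type reuse because GFF3 lets one discontinuous feature (for example a multi-segment CDS) repeat its ID across lines. The function name `find_id_collisions` and the trimmed `SAMPLE` records are illustrative, not part of any Ensembl tool.

```python
import re
from collections import defaultdict

# Matches the ID attribute in column 9 without also matching e.g. "XID=".
ID_ATTR = re.compile(r"(?:^|;)ID=([^;]+)")


def find_id_collisions(lines):
    """Return {ID: sorted feature types} for IDs used by more than one type."""
    types_by_id = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip directives, comments and anything that is not a 9-column record.
        if line.startswith("#") or len(cols) != 9:
            continue
        match = ID_ATTR.search(cols[8])
        if match:
            types_by_id[match.group(1)].add(cols[2])
    return {fid: sorted(t) for fid, t in types_by_id.items() if len(t) > 1}


# Records from the example above, with attributes trimmed for brevity.
SCAFFOLD = "IWGSC_CSS_3AS_scaff_369935"
SAMPLE = [
    "\t".join([SCAFFOLD, "ensembl", "transcript", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2"]),
    "\t".join([SCAFFOLD, ".", "CDS", "1", "198", ".", "-", "0",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
    "\t".join([SCAFFOLD, ".", "exon", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
]

if __name__ == "__main__":
    for fid, types in find_id_collisions(SAMPLE).items():
        print(fid, "is shared by:", ", ".join(types))
```

To scan the downloaded file, the same function can be fed from `gzip.open(path, "rt")`, since it only needs an iterable of text lines.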
>
> Proposed solution:
> Any CDS ID could have a "C" inserted after the period.  For example, 
> Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1.  This is 
> similar to what you are doing for the exons. The exon line would then 
> have to be updated with this new ID for the Parent= string.  Then, the 
> new GFF3 block for this transcript definition would be:
>
> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935	ensembl	protein_coding_gene	1	200	.	-	.	ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935	ensembl	transcript	1	200	.	-	.	ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935	.	CDS	1	198	.	-	0	ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935	.	exon	1	200	.	-	.	ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935	.	five_prime_UTR	199	200	.	-	.	Parent=Traes_3AS_775C097A2.1;
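
The renaming proposed above could be scripted roughly as follows. This is a sketch under stated assumptions, not the fix Ensembl Genomes applied: it assumes tab-separated records, that a colliding CDS is recognisable by ID= being equal to Parent=, and that IDs end in a period followed by digits. The helper names (`to_cds_id`, `get_attr`, `set_attr`, `fix_line`) and the `reparent_exons` flag are hypothetical. The flag is off by default because many GFF3 consumers expect exons to remain children of the transcript; passing `True` reproduces the exon Parent= change shown in the block above.

```python
import re

SUFFIX = re.compile(r"\.(\d+)$")


def to_cds_id(transcript_id):
    """Traes_3AS_775C097A2.1 -> Traes_3AS_775C097A2.C1 (unchanged if no numeric suffix)."""
    return SUFFIX.sub(r".C\1", transcript_id)


def get_attr(attrs, key):
    """Return the value of `key` in a column-9 attribute string, or None."""
    match = re.search(r"(?:^|;)%s=([^;]*)" % key, attrs)
    return match.group(1) if match else None


def set_attr(attrs, key, value):
    """Replace the value of `key`, leaving all other attributes untouched."""
    return re.sub(r"(^|;)%s=[^;]*" % key,
                  lambda m: "%s%s=%s" % (m.group(1), key, value),
                  attrs, count=1)


def fix_line(line, reparent_exons=False):
    """Return one GFF3 line with colliding CDS IDs renamed."""
    cols = line.rstrip("\n").split("\t")
    # Pass directives, comments and malformed lines through unchanged.
    if line.startswith("#") or len(cols) != 9:
        return line.rstrip("\n")
    ftype, attrs = cols[2], cols[8]
    if ftype == "CDS":
        fid, parent = get_attr(attrs, "ID"), get_attr(attrs, "Parent")
        # Only touch CDS lines whose ID collides with their transcript's ID.
        if fid is not None and fid == parent:
            cols[8] = set_attr(attrs, "ID", to_cds_id(fid))
    elif ftype == "exon" and reparent_exons:
        # Optional: point the exon at the renamed CDS, as in the proposal.
        parent = get_attr(attrs, "Parent")
        if parent is not None:
            cols[8] = set_attr(attrs, "Parent", to_cds_id(parent))
    return "\t".join(cols)


if __name__ == "__main__":
    cds = "\t".join(["IWGSC_CSS_3AS_scaff_369935", ".", "CDS", "1", "198", ".", "-", "0",
                     "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"])
    print(fix_line(cds))
```

Applied line by line to the decompressed file, this leaves gene, transcript and UTR records exactly as they were and rewrites only column 9 of the affected CDS (and, optionally, exon) records.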
>
> Would this be a fast fix on your side to regenerate the data to be 
> valid?  If not, I'll write my own script next week to fix the errors 
> in the GFF3 file.
>
> Cheers,
> -Hans
>
>
>


