[ensembl-dev] Triticum aestivum invalid GFF3
Arnaud Kerhornou
arnaud at ebi.ac.uk
Mon Feb 24 10:10:10 GMT 2014
Hello Hans,
Sorry about that, it's something we missed. This will be corrected with
the coming release of Ensembl Genomes, which will be out around
mid-March, so feel free to correct it on your side in the meantime.
Note that the next release of bread wheat will include an updated gene set.
Best regards,
Arnaud
On 22/02/2014 01:34, Hans Vasquez-Gross wrote:
> Hello,
>
> I recently downloaded the MIPs GFF3 annotation provided on your FTP
> for Triticum_aestivum.
>
> ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
>
> I tried loading this file into a genome browser for visualization, but
> it does not validate. The problem seems to be in how the ID= attribute
> in the 9th column is set up. According to the SO specification
> (http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th
> column MUST be unique. But currently, all transcript/CDS/exon
> relationships have the same ID collision issue, which I'll explain
> below using the first example.
>
> If you take a look at lines 133-138 in the gff3 file, you should see this:
> ##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935 ensembl protein_coding_gene 1 200 . - . ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935 ensembl transcript 1 200 . - . ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935 . CDS 1 198 . - 0 ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935 . exon 1 200 . - . ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935 . five_prime_UTR 199 200 . - . Parent=Traes_3AS_775C097A2.1;
>
> The transcript and CDS lines define the exact same ID,
> "Traes_3AS_775C097A2.1", which causes the naming collision. You
> will also notice that in the CDS line, ID= and Parent= are
> identical. The Parent in this case is meant to refer to the
> transcript ID, but the CDS carries that same ID.
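
For anyone wanting to check their own copy of the file for this kind of collision, here is a minimal sketch in Python. It assumes the usual 9-column tab-separated GFF3 layout and only flags an ID that is shared by more than one feature type (for example transcript and CDS), since the GFF3 specification lets the several lines of one discontinuous feature, such as a multi-exon CDS, share a single ID. The function name find_id_collisions and the shortened sequence name in the sample are made up for illustration.

```python
from collections import defaultdict


def find_id_collisions(lines):
    """Return {ID: sorted feature types} for IDs used by more than one feature type."""
    types_by_id = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key == "ID" and value:
                types_by_id[value].add(cols[2])
    return {fid: sorted(types) for fid, types in types_by_id.items() if len(types) > 1}


# Hypothetical, shortened sample mirroring the block above (tabs made explicit).
SAMPLE = [
    "\t".join(["scaff", "ensembl", "transcript", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2"]),
    "\t".join(["scaff", ".", "CDS", "1", "198", ".", "-", "0",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
    "\t".join(["scaff", ".", "exon", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
]

if __name__ == "__main__":
    for feature_id, types in find_id_collisions(SAMPLE).items():
        print(feature_id, "is used by:", ", ".join(types))
```

On the real file, the lines could be read with gzip.open(path, "rt") and passed to the same function.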
>
> Proposed solution:
> Any CDS ID could have a "C" inserted after the period. For example,
> Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1. This is
> similar to what you are doing for the exons. The exon line would then
> have to be updated with this new ID for the Parent= string. Then, the
> new GFF3 block for this transcript definition would be:
>
> ##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935 ensembl protein_coding_gene 1 200 . - . ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935 ensembl transcript 1 200 . - . ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935 . CDS 1 198 . - 0 ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935 . exon 1 200 . - . ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935 . five_prime_UTR 199 200 . - . Parent=Traes_3AS_775C097A2.1;
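
As a stopgap until the corrected release, the CDS renaming could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name rename_cds_ids and the shortened sample lines are hypothetical, the input is taken to be tab-separated, and CDS IDs are assumed to end in ".<number>" as in the example. It rewrites only the ID= attribute on CDS lines and leaves every Parent= value untouched, so exons keep pointing at the transcript ID; re-parenting the exons to the new CDS ID, as in the block above, would need an extra step.

```python
import re

# Matches an ID attribute whose value ends in ".<digits>", e.g. ID=Traes_3AS_775C097A2.1
CDS_ID_PATTERN = re.compile(r"^ID=(.+)\.(\d+)$")


def rename_cds_ids(lines):
    """Yield GFF3 lines with CDS IDs rewritten from X.N to X.CN; other lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            attrs = cols[8].split(";")
            # Only the ID= attribute matches the pattern; Parent= is left as-is.
            attrs = [CDS_ID_PATTERN.sub(r"ID=\1.C\2", attr) for attr in attrs]
            cols[8] = ";".join(attrs)
            yield "\t".join(cols)
        else:
            yield line


# Hypothetical, shortened sample lines (tabs made explicit).
BEFORE = [
    "\t".join(["scaff", "ensembl", "transcript", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2"]),
    "\t".join(["scaff", ".", "CDS", "1", "198", ".", "-", "0",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
]

if __name__ == "__main__":
    for fixed_line in rename_cds_ids(BEFORE):
        print(fixed_line)
```

The transcript line passes through unchanged, and the CDS line ends up with ID=Traes_3AS_775C097A2.C1 while its Parent= still names the transcript.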
>
> Would this be a fast fix on your side to regenerate the data to be
> valid? If not, I'll write my own script next week to fix the errors
> in the GFF3 file.
>
> Cheers,
> -Hans