[ensembl-dev] Triticum aestivum invalid GFF3
Hans Vasquez-Gross
havasquezgross at ucdavis.edu
Mon Feb 24 19:16:03 GMT 2014
Thank you all for the suggestions. I look forward to the new release in
mid-March and will fix the version I have here in the meantime.
Cheers,
-Hans
On Mon, Feb 24, 2014 at 2:35 AM, Daniel Lawson <lawson at ebi.ac.uk> wrote:
> Arnaud,
>
> There have been threads on this on the SO-devel list. Many people use the
> same ID for discontinuous features such as CDS, and my feeling is that this
> is tacitly accepted, but there should never be shared IDs between different
> feature types (i.e. transcript and CDS).
>
> See http://gmod.org/wiki/GFF3#Discontinuous_Features
> and http://www.sequenceontology.org/gff3.shtml
>
> Note in the 2nd link that the CDS are marked up as discontinuous features
> with shared IDs in the example.
>
> Are you planning to resolve just the clash of IDs between features or to
> add suffixes to the CDS lines? I'm assuming that the latter will break some
> browser visualizations where features are linked based on their ID and not
> the parent ID. Of course, that is not necessarily the driver of GFF3
> formatting, but it is useful to remember.
>
> cheers
> D
>
>
>
>
> On 24 February 2014 10:10, Arnaud Kerhornou <arnaud at ebi.ac.uk> wrote:
>
>>
>> Hello Hans,
>>
>> Sorry about that, it's something we missed. This will be corrected in
>> the coming release of Ensembl Genomes, which will be out around mid-March,
>> so feel free to correct it on your side in the meantime.
>> Note that the next release of bread wheat will include an updated gene
>> set.
>>
>> Best regards,
>> Arnaud
>>
>>
>> On 22/02/2014 01:34, Hans Vasquez-Gross wrote:
>>
>> Hello,
>>
>> I recently downloaded the MIPS GFF3 annotation provided on your FTP site
>> for Triticum_aestivum.
>>
>>
>> ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
>>
>> I tried loading this file for visualization in a genome browser, but it
>> does not validate. There seems to be a problem with the way the ID=
>> attribute in the 9th column is set up. According to SO
>> (http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th column
>> MUST be unique. Currently, every transcript/CDS/exon group has the same
>> ID collision, which I'll explain below using the first occurrence in the
>> file.
>>
>> If you take a look at lines 133-138 in the gff3 file, you should see
>> this:
>> ##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
>> IWGSC_CSS_3AS_scaff_369935 ensembl protein_coding_gene 1 200 . - . ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935 ensembl transcript 1 200 . - . ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935 . CDS 1 198 . - 0 ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
>> IWGSC_CSS_3AS_scaff_369935 . exon 1 200 . - . ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
>> IWGSC_CSS_3AS_scaff_369935 . five_prime_UTR 199 200 . - . Parent=Traes_3AS_775C097A2.1;
>>
>> The transcript and CDS lines have exactly the same ID,
>> "Traes_3AS_775C097A2.1", which causes the naming collision. You will
>> also notice that on the CDS line the ID= and Parent= values are identical.
>> The Parent here is meant to refer to the transcript ID, but the CDS has
>> the same ID.
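
A minimal sketch of how such collisions could be detected, assuming a
tab-separated GFF3 file. It reports only IDs shared by different feature
types, so a discontinuous CDS that reuses one ID across several lines is not
flagged. The function names are hypothetical and not part of any Ensembl
tool.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, ignoring empty fields."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_cross_type_id_collisions(lines):
    """Return {ID: sorted feature types} for IDs used by more than one type."""
    types_by_id = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        feature_id = parse_attributes(columns[8]).get("ID")
        if feature_id is not None:
            types_by_id[feature_id].add(columns[2])
    return {fid: sorted(types)
            for fid, types in types_by_id.items() if len(types) > 1}


# The transcript and CDS lines from the example above, rebuilt with tabs.
scaffold = "IWGSC_CSS_3AS_scaff_369935"
example = [
    "\t".join([scaffold, "ensembl", "transcript", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2"]),
    "\t".join([scaffold, ".", "CDS", "1", "198", ".", "-", "0",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
    "\t".join([scaffold, ".", "exon", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1"]),
]
print(find_cross_type_id_collisions(example))
# {'Traes_3AS_775C097A2.1': ['CDS', 'transcript']}
```

For a downloaded file, the same function could be given the lines of the
decompressed file, for instance through gzip.open(path, "rt").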
>>
>> Proposed solution:
>> Any CDS ID could have a "C" added after the period. For example,
>> Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1. This is similar
>> to what you are doing for the exons. The exon line would then have to be
>> updated with this new ID for the Parent= string. The new GFF3 block
>> for this transcript definition would then be:
>>
>> ##sequence-region IWGSC_CSS_3AS_scaff_369935 1 200
>> IWGSC_CSS_3AS_scaff_369935 ensembl protein_coding_gene 1 200 . - . ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935 ensembl transcript 1 200 . - . ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935 . CDS 1 198 . - 0 ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
>> IWGSC_CSS_3AS_scaff_369935 . exon 1 200 . - . ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
>> IWGSC_CSS_3AS_scaff_369935 . five_prime_UTR 199 200 . - . Parent=Traes_3AS_775C097A2.1;
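
A rough sketch of the CDS renaming in the proposal above, again assuming
tab-separated input. The function name is hypothetical. It changes only the
ID= value on CDS lines and leaves every Parent= value alone, so the exon
keeps pointing at the transcript instead of at the renamed CDS; in the
canonical gene model of the GFF3 specification, exon and CDS are both
children of the transcript.

```python
def rename_cds_ids(lines):
    """Yield GFF3 lines with CDS IDs changed from '<name>.<n>' to '<name>.C<n>'."""
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9 or columns[2] != "CDS":
            yield line  # directives, comments and non-CDS features pass through
            continue
        fields = []
        for field in columns[8].split(";"):
            if field.startswith("ID="):
                prefix, _, number = field[3:].rpartition(".")
                # Skip IDs without a period and IDs that were already renamed.
                if prefix and not number.startswith("C"):
                    field = f"ID={prefix}.C{number}"
            fields.append(field)
        columns[8] = ";".join(fields)
        ending = "\n" if line.endswith("\n") else ""
        yield "\t".join(columns) + ending


# The colliding CDS line from the example, rebuilt with tabs.
cds_line = "\t".join([
    "IWGSC_CSS_3AS_scaff_369935", ".", "CDS", "1", "198", ".", "-", "0",
    "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1",
]) + "\n"
fixed = next(rename_cds_ids([cds_line]))
print(fixed.split("\t")[8], end="")
# ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
```

Because a CDS split over several lines gets the same new ID on each line,
the shared-ID convention for discontinuous features is preserved.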
>>
>> Would it be a quick fix on your side to regenerate the file so that it
>> is valid? If not, I'll write my own script next week to fix the errors in
>> the GFF3 file.
>>
>> Cheers,
>> -Hans
>>
>>
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>
>
> --
> Ensembl Genomes | VectorBase | i5K insect genome initiative
>
>
--
Hans Vasquez-Gross
Programmer
TreeGenes Database - http://dendrome.ucdavis.edu/treegenes/
Dubcovsky and Neale Lab
Department of Plant Science
University of California at Davis
Email: havasquezgross at ucdavis.edu
Phone: (530) 752-0609
Skype: hansvg.ucd