[ensembl-dev] Triticum aestivum invalid GFF3

Hans Vasquez-Gross havasquezgross at ucdavis.edu
Mon Feb 24 19:16:03 GMT 2014


Thank you all for the suggestions.  I look forward to getting the new
release in mid-March and will be fixing the version I have here.

Cheers,
-Hans


On Mon, Feb 24, 2014 at 2:35 AM, Daniel Lawson <lawson at ebi.ac.uk> wrote:

> Arnaud,
>
> There have been threads on this in the SO-devel list. Many people use the
> same ID for discontinuous features such as CDS, and my feeling is that this
> is tacitly accepted, but there should never be a case where IDs are shared
> between different feature types (i.e. transcript and CDS).
>
> See http://gmod.org/wiki/GFF3#Discontinuous_Features
> and http://www.sequenceontology.org/gff3.shtml
>
> Note in the 2nd link that the CDS are marked up as discontinuous features
> with shared IDs in the example.
>
> Are you planning to resolve just the clash of IDs between features or to
> add suffixes to the CDS lines? I'm assuming that the latter will break some
> browser visualizations where features are linked based on their ID and not
> the parent ID. Of course that is not necessarily the driver of GFF3
> formatting, but it is useful to remember.
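
To make the linking concern concrete, here is a minimal Python sketch. It is
not any browser's actual logic. The feature rows, the IDs (cds1, cds1.1,
cds1.2, mRNA1) and the helper group_segments are all hypothetical and exist
only for illustration. It groups CDS segments first by ID and then by Parent.

```python
from collections import defaultdict


def group_segments(rows, key):
    """Group CDS (start, end) segments by the chosen attribute key."""
    groups = defaultdict(list)
    for feature_type, start, end, attrs in rows:
        if feature_type == "CDS":
            groups[attrs[key]].append((start, end))
    return dict(groups)


# Hypothetical two-segment CDS written with one shared ID.
shared = [
    ("CDS", 1, 100, {"ID": "cds1", "Parent": "mRNA1"}),
    ("CDS", 200, 300, {"ID": "cds1", "Parent": "mRNA1"}),
]

# The same CDS after a distinct suffix is added to each line.
suffixed = [
    ("CDS", 1, 100, {"ID": "cds1.1", "Parent": "mRNA1"}),
    ("CDS", 200, 300, {"ID": "cds1.2", "Parent": "mRNA1"}),
]

print(group_segments(shared, "ID"))        # one feature with two segments
print(group_segments(suffixed, "ID"))      # two separate one-segment features
print(group_segments(suffixed, "Parent"))  # one group again
```

A tool that links segments by ID sees a single CDS only in the shared-ID
form. A tool that links by Parent sees one group in both forms.
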
>
> cheers
> D
>
>
>
>
> On 24 February 2014 10:10, Arnaud Kerhornou <arnaud at ebi.ac.uk> wrote:
>
>>
>>  Hello Hans,
>>
>> Sorry about that, it's something we missed. This will be corrected with
>> the coming release of Ensembl Genomes, which will be out around mid-March,
>> so feel free to correct it on your side in the meantime.
>> Note that the next release of bread wheat will include an updated gene
>> set.
>>
>> Best regards,
>> Arnaud
>>
>>
>> On 22/02/2014 01:34, Hans Vasquez-Gross wrote:
>>
>>  Hello,
>>
>>  I recently downloaded the MIPS GFF3 annotation provided on your FTP for
>> Triticum_aestivum.
>>
>>
>> ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
>>
>>  I tried loading this file for visualization in a genome browser, but it
>> does not validate.  There seems to be a problem in the way the ID= field
>> in the 9th column is set up.  According to SO (
>> http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th column
>> MUST be unique.  But currently, all transcript/CDS/exon relationships have
>> the same ID collision issue, which I'll explain below with the first example
>> problem.
>>
>>  If you take a look at lines 133-138 in the gff3 file, you should see
>> this:
>> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
>> IWGSC_CSS_3AS_scaff_369935   ensembl   protein_coding_gene   1   200   .   -   .   ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935   ensembl   transcript   1   200   .   -   .   ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935   .   CDS   1   198   .   -   0   ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
>> IWGSC_CSS_3AS_scaff_369935   .   exon   1   200   .   -   .   ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
>> IWGSC_CSS_3AS_scaff_369935   .   five_prime_UTR   199   200   .   -   .   Parent=Traes_3AS_775C097A2.1;
>>
>>
>>  The transcript and CDS lines define exactly the same ID,
>> "Traes_3AS_775C097A2.1", which causes the naming collision.  You will
>> also notice that on the CDS line the ID= and Parent= values are identical.
>> The Parent in this case is meant to refer to the transcript ID, but
>> the CDS has the same ID.
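
The collision described above can be detected mechanically. The following is
a minimal Python sketch, assuming tab-delimited nine-column GFF3 lines. The
function names parse_attributes and find_cross_type_id_clashes are my own,
and the sample lines are shortened copies of the excerpt above. It reports
only IDs shared between different feature types, so a multi-line CDS that
reuses one ID across its own segments is not flagged.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_cross_type_id_clashes(lines):
    """Return {ID: sorted feature types} for IDs used by more than one type."""
    types_by_id = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip lines that are not nine-column feature lines
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            types_by_id[feature_id].add(cols[2])
    return {fid: sorted(t) for fid, t in types_by_id.items() if len(t) > 1}


scaffold = "IWGSC_CSS_3AS_scaff_369935"
example = [
    "##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200",
    "\t".join([scaffold, "ensembl", "transcript", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2"]),
    "\t".join([scaffold, ".", "CDS", "1", "198", ".", "-", "0",
               "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
    "\t".join([scaffold, ".", "exon", "1", "200", ".", "-", ".",
               "ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;rank=1"]),
]

print(find_cross_type_id_clashes(example))
```

For the real file, the same function could be fed an open gzip text handle
(gzip.open(path, "rt")) instead of the in-memory list.
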
>>
>>  Proposed solution:
>> Any CDS ID could have a "C" inserted after the period.  For example,
>> Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1.  This is similar
>> to what you are doing for the exons. The exon line would then have to be
>> updated with this new ID for the Parent= string.  Then, the new GFF3 block
>> for this transcript definition would be:
>>
>> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
>> IWGSC_CSS_3AS_scaff_369935   ensembl   protein_coding_gene   1   200   .   -   .   ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935   ensembl   transcript   1   200   .   -   .   ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
>> IWGSC_CSS_3AS_scaff_369935   .   CDS   1   198   .   -   0   ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
>> IWGSC_CSS_3AS_scaff_369935   .   exon   1   200   .   -   .   ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
>> IWGSC_CSS_3AS_scaff_369935   .   five_prime_UTR   199   200   .   -   .   Parent=Traes_3AS_775C097A2.1;
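
The CDS renaming proposed above could be scripted roughly as follows. This is
a minimal Python sketch under stated assumptions, not the fix Ensembl Genomes
will ship. It assumes tab-delimited nine-column lines and CDS IDs of the form
<stem>.<n>, and the function name rename_cds_id is hypothetical. It rewrites
only the ID= value on CDS lines and leaves every Parent= value untouched, so
repointing the exon's Parent at the new CDS ID, as in the block above, would
be a separate step.

```python
def rename_cds_id(line):
    """Insert 'C' after the last period of the ID on a CDS line.

    Directives, comments, malformed lines and non-CDS features are
    returned unchanged.
    """
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
        return line
    fields = cols[8].split(";")
    for i, field in enumerate(fields):
        if field.startswith("ID="):
            stem, dot, suffix = field[3:].rpartition(".")
            # Rewrite only <stem>.<n> IDs, and never one already rewritten.
            if dot and not suffix.startswith("C"):
                fields[i] = "ID=" + stem + ".C" + suffix
    cols[8] = ";".join(fields)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


scaffold = "IWGSC_CSS_3AS_scaff_369935"
cds_line = "\t".join([scaffold, ".", "CDS", "1", "198", ".", "-", "0",
                      "ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1"])
exon_line = "\t".join([scaffold, ".", "exon", "1", "200", ".", "-", ".",
                       "ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;rank=1"])

print(rename_cds_id(cds_line))
print(rename_cds_id(exon_line))
```

Because the Parent= value on the CDS line is not rewritten, it keeps pointing
at the transcript ID, which is what removes the self-reference.
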
>>
>>  Would this be a fast fix on your side to regenerate the data to be
>> valid?  If not, I'll write my own script next week to fix the errors in the
>> GFF3 file.
>>
>>  Cheers,
>> -Hans
>>
>>
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>>
>
>
> --
> Ensembl Genomes | VectorBase | i5K insect genome initiative
>
>


-- 
Hans Vasquez-Gross
Programmer
TreeGenes Database - http://dendrome.ucdavis.edu/treegenes/
Dubcovsky and Neale Lab
Department of Plant Science
University of California at Davis
Email: havasquezgross at ucdavis.edu
Phone: (530) 752-0609
Skype: hansvg.ucd