[ensembl-dev] Triticum aestivum invalid GFF3

Daniel Lawson lawson at ebi.ac.uk
Mon Feb 24 10:35:56 GMT 2014


Arnaud,

There have been threads on this on the SO-devel list. Many people use the
same ID for discontinuous features such as CDS, and my feeling is that this
is tacitly accepted, but there should never be a case where IDs are shared
between different feature types (e.g. transcript and CDS).

See http://gmod.org/wiki/GFF3#Discontinuous_Features
and http://www.sequenceontology.org/gff3.shtml

Note in the 2nd link that the CDS are marked up as discontinuous features
with shared IDs in the example.
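
To make the distinction concrete, here is a minimal Python sketch (not part
of any Ensembl tooling; the function names are my own, and it assumes
tab-separated, nine-column feature lines). It reports IDs that are shared
between different feature types and ignores same-type repeats such as a
multi-segment CDS:

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cross_type_id_clashes(lines):
    """Return {ID: sorted feature types} for IDs used by more than one type."""
    types_by_id = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            types_by_id[feature_id].add(cols[2])
    # An ID repeated within a single type (discontinuous feature) is allowed.
    return {
        fid: sorted(types)
        for fid, types in types_by_id.items()
        if len(types) > 1
    }
```

Passing an open file handle works, since the function only iterates over
lines. On the block quoted below it should report Traes_3AS_775C097A2.1 as
used by both CDS and transcript.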

Are you planning to resolve just the clash of IDs between features or to
add suffixes to the CDS lines? I'm assuming that the latter will break some
browser visualizations where features are linked based on their ID and not
the parent ID. Of course that is not necessarily the driver of GFF3
formatting, but it is useful to remember.
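
For reference, the suffix approach could be sketched roughly as follows.
This is only an illustration under my own assumptions, not what the
release pipeline does: the function name is hypothetical, the input is
assumed to be tab-separated, and only the ID= of CDS lines is rewritten.
Every Parent= is left untouched, so exons keep pointing at the transcript.

```python
def suffix_cds_ids(lines):
    """Yield GFF3 lines with CDS IDs rewritten from <stem>.<n> to <stem>.C<n>."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            fields = cols[8].split(";")
            for i, field in enumerate(fields):
                if field.startswith("ID="):
                    stem, dot, version = field[3:].rpartition(".")
                    if dot:  # only rewrite IDs that contain a period
                        fields[i] = f"ID={stem}.C{version}"
            cols[8] = ";".join(fields)
            yield "\t".join(cols)
        else:
            # Directives, comments and non-CDS features pass through as-is.
            yield line.rstrip("\n")
```

All CDS segments of one transcript end up with the same new ID, so they
remain a single discontinuous feature, while the clash with the transcript
ID goes away.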

cheers
D




On 24 February 2014 10:10, Arnaud Kerhornou <arnaud at ebi.ac.uk> wrote:

>
>  Hello Hans,
>
> Sorry about that, it's something we missed. This will be corrected with
> the coming release of Ensembl Genomes, which will be out around mid-March,
> so feel free to correct it on your side in the meantime.
> Note that the next release of bread wheat will include an updated gene set.
>
> Best regards,
> Arnaud
>
>
> On 22/02/2014 01:34, Hans Vasquez-Gross wrote:
>
>  Hello,
>
>  I recently downloaded the MIPS GFF3 annotation provided on your FTP for
> Triticum_aestivum.
>
>
> ftp://ftp.ensemblgenomes.org/pub/plants/release-21/gff3/triticum_aestivum/Triticum_aestivum.IWGSP1.21.gff3.gz
>
>  I tried running this file for visualization in a genome browser, but it
> does not validate.  There seems to be a problem in the way the ID= field
> in the 9th column is set up.  According to SO (
> http://www.sequenceontology.org/gff3.shtml), the ID= in the 9th column
> MUST be unique.  But currently, all transcript/CDS/exon relationships have
> the same ID collision issue, which I'll explain below using the first
> example.
>
>  If you take a look at lines 133-138 in the gff3 file, you should see
> this:
> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935      ensembl protein_coding_gene     1       200     .       -       .       ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935      ensembl transcript      1       200     .       -       .       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935      .       CDS     1       198     .       -       0       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935      .       exon    1       200     .       -       .       ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935      .       five_prime_UTR  199     200     .       -       .       Parent=Traes_3AS_775C097A2.1;
>
>  The transcript and CDS lines have exactly the same ID,
> "Traes_3AS_775C097A2.1", which is causing the naming collision.  You will
> also notice that in the CDS line the ID= and Parent= are exactly the
> same.  The parent in this case is trying to refer to the transcript ID, but
> the CDS has the same ID.
>
>  Proposed solution:
> Any CDS ID could have a "C" inserted after the period.  For example,
> Traes_3AS_775C097A2.1 would become Traes_3AS_775C097A2.C1.  This is similar
> to what you are doing for the exons. The exon line would then have to be
> updated with this new ID for the Parent= string.  Then, the new GFF3 block
> for this transcript definition would be:
>
> ##sequence-region   IWGSC_CSS_3AS_scaff_369935 1 200
> IWGSC_CSS_3AS_scaff_369935      ensembl protein_coding_gene     1       200     .       -       .       ID=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935      ensembl transcript      1       200     .       -       .       ID=Traes_3AS_775C097A2.1;Parent=Traes_3AS_775C097A2;biotype=protein_coding;logic_name=mips_taestivum
> IWGSC_CSS_3AS_scaff_369935      .       CDS     1       198     .       -       0       ID=Traes_3AS_775C097A2.C1;Parent=Traes_3AS_775C097A2.1;rank=1
> IWGSC_CSS_3AS_scaff_369935      .       exon    1       200     .       -       .       ID=Traes_3AS_775C097A2.E1;Parent=Traes_3AS_775C097A2.C1;constitutive=1;ensembl_phase=-1;rank=1
> IWGSC_CSS_3AS_scaff_369935      .       five_prime_UTR  199     200     .       -       .       Parent=Traes_3AS_775C097A2.1;
>
>  Would this be a fast fix on your side to regenerate the data to be
> valid?  If not, I'll write my own script next week to fix the errors in the
> GFF3 file.
>
>  Cheers,
> -Hans
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/


-- 
Ensembl Genomes | VectorBase | i5K insect genome initiative