[ensembl-dev] Error on using parse_ncbi_gff3.pl

Thibaut Hourlier thibaut at ebi.ac.uk
Wed Sep 21 17:16:26 BST 2016


Hi David,
I tested with the latest commits for release/85; there were some later fixes, but I got the same result.
The three species that loaded fine were all dumped after the pig file. I suspect that NCBI changed their dumper and that we then updated our parser to match. Because we had already loaded pig into our databases, we didn't test the pig file.

If you are trying to load another species whose file was dumped in 2015, it would be interesting to know whether it loads without problems.

I also tried earlier versions of the script, but I got the same error.

I had a look at the file and found a gene (gene32440) that appears to stand alone, although it should probably be linked to the transcript (rna56687), which has no Parent attribute:
NC_010459.4     BestRefSeq      mRNA    15800576        15805994        .       +       .       ID=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Name=NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
NC_010459.4     BestRefSeq      exon    15800576        15800725        .       +       .       ID=id489083;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
NC_010459.4     BestRefSeq      exon    15805408        15805454        .       +       .       ID=id489084;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
NC_010459.4     BestRefSeq      exon    15805895        15805994        .       +       .       ID=id489085;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
NC_010459.4     BestRefSeq      gene    15800576        15805992        .       +       .       ID=gene32440;Dbxref=GeneID:397154;Name=CHGB;description=chromogranin B (secretogranin 1);end_range=15805992,.;gbkey=Gene;gene=CHGB;gene_biotype=protein_coding;gene_synonym=CGB;partial=true
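
To check whether other records in the file have the same problem, a small script along the following lines could help. This is only a sketch in Python, separate from parse_ncbi_gff3.pl. The function names (parse_attributes, find_unlinked), the _row helper and the shortened SAMPLE records are made up for illustration, and it assumes tab-separated GFF3 with ID and Parent attributes in column 9. It lists gene features that no other feature names as Parent, and mRNA features that have no Parent attribute at all.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def find_unlinked(lines):
    """Return (gene IDs with no children, mRNA IDs with no Parent)."""
    gene_ids = []
    orphan_rnas = []
    referenced = set()  # every ID that appears in some Parent attribute
    for line in lines:
        # Skip blank lines, comments and directives such as '###'.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                referenced.add(parent)
        if cols[2] == "gene":
            gene_ids.append(attrs.get("ID"))
        elif cols[2] == "mRNA" and "Parent" not in attrs:
            orphan_rnas.append(attrs.get("ID"))
    childless = [g for g in gene_ids if g not in referenced]
    return childless, orphan_rnas


def _row(ftype, start, end, attrs):
    """Build one tab-separated feature line (hypothetical helper for the sample)."""
    return "\t".join(
        ["NC_010459.4", "BestRefSeq", ftype, str(start), str(end), ".", "+", ".", attrs]
    )


# Shortened versions of the records above (most attributes dropped).
SAMPLE = [
    "##gff-version 3",
    _row("mRNA", 15800576, 15805994, "ID=rna56687;gene=CHGB"),
    _row("exon", 15800576, 15800725, "ID=id489083;Parent=rna56687;gene=CHGB"),
    _row("gene", 15800576, 15805992, "ID=gene32440;Name=CHGB"),
    "###",
]

if __name__ == "__main__":
    genes, rnas = find_unlinked(SAMPLE)
    print("genes without children:", genes)
    print("mRNAs without Parent:", rnas)
    # For the real file, pass an open file handle instead of SAMPLE:
    # with open("ref_Sscrofa10.2_top_level.gff3") as fh:
    #     genes, rnas = find_unlinked(fh)
```

A gene without children is not necessarily an error, so the output is only a list of candidates to inspect by hand.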

You may want to contact NCBI, as their file could be wrong and they should probably dump it again.

Hope this helps.

Thanks
Thibaut

> On 21 Sep 2016, at 15:36, Herzig, David <david.herzig at roche.com> wrote:
> 
> Hi
> 
> ensembl-pipeline: master / b1ca6be3ad6dfc4e960bce7fce5733f745102710
> ensembl-io: release/85 / 9eacea74ccaf8480aafc82c6b0ffc626a1537b29
> 
> thx.
> David
> 
> On Wed, Sep 21, 2016 at 3:18 PM, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
> Hi David,
> Unfortunately, NCBI does not always write their GFF the same way for all their species, so a fix for one species can introduce a bug for another.
> Could you please tell us which branch and latest commit you are using for your ensembl-pipeline and ensembl-io repositories?
> 
> Thanks
> Thibaut
> 
>> On 20 Sep 2016, at 13:26, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>> 
>> Odd, D. rerio should have also failed then if my suspicions were correct. Guess there must be something else going on here.
>> 
>> Dan
>> 
>> 
>> 
>> 
>> 
>> Daniel Barrell
>> Platform Specialist
>> <E_Email_Sig.jpg>
>> eaglediscover Best of Show Winner at Bio-IT World 2016
>> 
>> Eagle Genomics Ltd
>> T: +44 (0)1223 654481
>> http://www.eaglegenomics.com  
>> Disclaimer: http://www.eaglegenomics.com/about/privacy-statement/ 
>> 
>> https://youtu.be/rPdgFTo0FZM 
>> 
>> On 20 September 2016 at 12:14, Herzig, David <david.herzig at roche.com> wrote:
>> Hi Daniel
>> 
>> Thx for the feedback.
>> 
>> I was able to use it for:
>> - d rerio
>> - m musculus
>> - r norvegicus
>> 
>> regards,
>> David
>> 
>> 
>> On Tue, Sep 20, 2016 at 1:11 PM, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>> Hi David,
>> 
>> Line 1184334 is the last line of the GFF3 file and contains '###'. There used to be code to ignore lines like these:
>> 
>> +      next if $line =~ /^#/;
>> 
>> When the script moved to use ensembl-io I think it may have lost this check, however I would expect ensembl-io to handle the '###'. Which species files worked? I checked on NCBI and other species (e.g. horse) would also fail in the same way.
>> 
>> Dan
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 20 September 2016 at 11:16, Herzig, David <david.herzig at roche.com> wrote:
>> Hi Ensembl Users
>> 
>> I have set up the Ensembl environment for several species. Everything is OK.
>> After that I imported data from NCBI using the parse_ncbi_gff3.pl script. It works fine for almost all species, but for the species Sus scrofa I have the following issue:
>> 
>> I downloaded the file from NCBI:
>> /ftp.ncbi.nlm.nih.gov/genomes/Sus_scrofa/GFF/ref_Sscrofa10.2_top_level.gff3
>> 
>> I used the parse_ncbi_gff3.pl script to import it.
>> 
>> The process starts successfully but after a while I get the following error message and the process stops:
>> 
>> Can't call method "phase" on an undefined value at /home/ensembl/release-85/ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl line 882, <__ANONIO__> line 1184334.
>> 
>> Any ideas?
>> 
>> regards,
>> David
>> 
>> 
>> -- 
>> David Herzig
>> Scientist, pRED Informatics
>> Roche Pharma Research and Early Development
>> 
>> Roche Innovation Center Basel
>> 
>> F. Hoffmann-La Roche Ltd
>> Grenzacherstrasse 124
>> 4070 Basel
>> Switzerland
>> Phone +41 61 687 31 70
>> 
>> Learn more about pRED Informatics at http://go.roche.com/pREDi
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 




