[ensembl-dev] Error on using parse_ncbi_gff3.pl

Thibaut Hourlier thibaut at ebi.ac.uk
Thu Sep 22 09:46:03 BST 2016


Hi David,
The problematic feature is NM_001001532.2, ID=rna42218. The transcript has four exons, but the last two are separated by a three-base intron. The CDS has three parts, and the last part corresponds to exon number three but also includes the three-base intron.
NC_010454.3 BestRefSeq  mRNA  22140218  22152033  . + . ID=rna42218;Parent=gene24394;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Name=NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq  exon  22140218  22140275  . + . ID=id356623;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq  exon  22146953  22147005  . + . ID=id356624;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq  exon  22150106  22151182  . + . ID=id356625;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq  exon  22151185  22152033  . + . ID=id356626;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq  CDS 22140266  22140275  . + 0 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
NC_010454.3 BestRefSeq  CDS 22146953  22147005  . + 2 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
NC_010454.3 BestRefSeq  CDS 22150106  22151185  . + 0 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
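
To find other transcripts with the same kind of problem before loading, one option is to check that every CDS segment lies inside an exon of the same parent. The following is only a minimal sketch in Python, run as a separate step outside parse_ncbi_gff3.pl; the function names are made up for illustration and it assumes the standard nine tab-separated GFF3 columns with ID/Parent attributes in the last column.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_outside_exons(gff_lines):
    """Return {parent_id: [(start, end), ...]} for CDS segments that are
    not fully contained in any exon of the same parent."""
    exons = defaultdict(list)
    cds = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines, directives and comments (including '###').
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("exon", "CDS"):
            continue
        start, end = int(cols[3]), int(cols[4])
        target = exons if cols[2] == "exon" else cds
        # A feature may list several parents separated by commas.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                target[parent].append((start, end))

    problems = {}
    for parent, segments in cds.items():
        parent_exons = exons.get(parent, [])
        bad = [
            (s, e)
            for (s, e) in segments
            if not any(xs <= s and e <= xe for (xs, xe) in parent_exons)
        ]
        if bad:
            problems[parent] = bad
    return problems
```

Calling cds_outside_exons(open(path)) on the GFF3 file gives the parent IDs to inspect. For the lines above, the last CDS segment (22150106-22151185) ends three bases past the end of the third exon (22151182), so rna42218 would be reported.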

As I said previously, the GFF does not seem correct to me, so if NCBI cannot redump the file, the easiest option would be to remove this transcript. There is a Gnomon transcript for the same gene, so you shouldn't lose information.
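
If you go for removing the transcript, a small filter over the GFF3 file is enough. This is a rough sketch in Python rather than a change to the import script; drop_transcript is a hypothetical helper name, and it assumes tab-separated GFF3 where the exon and CDS lines point to the transcript through their Parent attribute.

```python
def drop_transcript(gff_lines, transcript_id):
    """Yield every line except the feature whose ID is transcript_id
    and the features that list it as a Parent (its exons and CDS)."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Only feature lines have nine columns; comments pass through.
        if len(cols) == 9 and not line.startswith("#"):
            attrs = dict(
                field.split("=", 1)
                for field in cols[8].split(";")
                if "=" in field
            )
            parents = attrs.get("Parent", "").split(",")
            if attrs.get("ID") == transcript_id or transcript_id in parents:
                continue
        yield line
```

For example, writing the output of drop_transcript(open(original_path), "rna42218") to a new file gives a copy without that transcript. The gene line itself is left in place, so it is worth checking that the gene still has at least one transcript afterwards.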

Hope this helps
Thibaut

> On 21 Sep 2016, at 17:16, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
> 
> Hi David,
> I tested with the latest commits for release/85. There were some later fixes, but I got the same results.
> The three species which loaded fine were all dumped after the pig file. I suspect that NCBI changed their dumper and that we would then have updated our parser to match. Because we had already loaded pig into our databases, we didn't test the pig file.
> 
> If you have another species that you are trying to load and its file was dumped in 2015, it would be interesting to know whether it loads without problems.
> 
> I have also tried with earlier versions of the script but I have the same error.
> 
> I had a look at the file and found a gene (gene32440) which seems to stand alone, but it should probably be linked to the transcript (rna56687):
> NC_010459.4     BestRefSeq      mRNA    15800576        15805994        .       +       .       ID=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Name=NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4     BestRefSeq      exon    15800576        15800725        .       +       .       ID=id489083;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4     BestRefSeq      exon    15805408        15805454        .       +       .       ID=id489084;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4     BestRefSeq      exon    15805895        15805994        .       +       .       ID=id489085;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4     BestRefSeq      gene    15800576        15805992        .       +       .       ID=gene32440;Dbxref=GeneID:397154;Name=CHGB;description=chromogranin B (secretogranin 1);end_range=15805992,.;gbkey=Gene;gene=CHGB;gene_biotype=protein_coding;gene_synonym=CGB;partial=true
> 
> You may want to contact NCBI as their file could be wrong and they probably should dump it again.
> 
> Hope this helps.
> 
> Thanks
> Thibaut
> 
>> On 21 Sep 2016, at 15:36, Herzig, David <david.herzig at roche.com> wrote:
>> 
>> Hi
>> 
>> ensembl-pipeline: master / b1ca6be3ad6dfc4e960bce7fce5733f745102710
>> ensembl-io: release/85 / 9eacea74ccaf8480aafc82c6b0ffc626a1537b29
>> 
>> thx.
>> David
>> 
>> On Wed, Sep 21, 2016 at 3:18 PM, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
>> Hi David,
>> Unfortunately, NCBI does not always write their GFF the same way for all their species, so a fix for one species could introduce a bug for another.
>> Could you please tell us which branch and last commit you are using for your ensembl-pipeline and ensembl-io repositories?
>> 
>> Thanks
>> Thibaut
>> 
>>> On 20 Sep 2016, at 13:26, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>>> 
>>> Odd, D. rerio should have also failed then if my suspicions were correct. Guess there must be something else going on here.
>>> 
>>> Dan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Daniel Barrell
>>> Platform Specialist
>>> eaglediscover Best of Show Winner at Bio-IT World 2016
>>> 
>>> Eagle Genomics Ltd
>>> T: +44 (0)1223 654481
>>> http://www.eaglegenomics.com  
>>> Disclaimer: http://www.eaglegenomics.com/about/privacy-statement/ 
>>> 
>>> https://youtu.be/rPdgFTo0FZM 
>>> 
>>> On 20 September 2016 at 12:14, Herzig, David <david.herzig at roche.com> wrote:
>>> Hi Daniel
>>> 
>>> Thx for the feedback.
>>> 
>>> I was able to use it for:
>>> - d rerio
>>> - m musculus
>>> - r norvegicus
>>> 
>>> regards,
>>> David
>>> 
>>> 
>>> On Tue, Sep 20, 2016 at 1:11 PM, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>>> Hi David,
>>> 
>>> Line 1184334 is the last line of the GFF3 file and contains '###'. There used to be code to ignore lines like these:
>>> 
>>> +      next if $line =~ /^#/;
>>> 
>>> When the script moved to use ensembl-io, I think it may have lost this check; however, I would expect ensembl-io to handle the '###'. Which species files worked? I checked on NCBI and other species (e.g. horse) would also fail in the same way.
>>> 
>>> Dan
>>> 
>>> On 20 September 2016 at 11:16, Herzig, David <david.herzig at roche.com> wrote:
>>> Hi Ensembl Users
>>> 
>>> I have set up the Ensembl environment for several species. Everything is OK.
>>> After that I imported data from NCBI using the parse_ncbi_gff3.pl script. It works fine for almost all species, but for the species Sus scrofa I have the following issue:
>>> 
>>> I downloaded the file from NCBI:
>>> /ftp.ncbi.nlm.nih.gov/genomes/Sus_scrofa/GFF/ref_Sscrofa10.2_top_level.gff3
>>> 
>>> I used the parse_ncbi_gff3.pl script to import it.
>>> 
>>> The process starts successfully but after a while I get the following error message and the process stops:
>>> 
>>> Can't call method "phase" on an undefined value at /home/ensembl/release-85/ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl line 882, <__ANONIO__> line 1184334.
>>> 
>>> Any ideas?
>>> 
>>> regards,
>>> David
>>> 
>>> 
>>> -- 
>>> David Herzig
>>> Scientist, pRED Informatics
>>> Roche Pharma Research and Early Development
>>> 
>>> Roche Innovation Center Basel
>>> 
>>> F. Hoffmann-La Roche Ltd
>>> Grenzacherstrasse 124
>>> 4070 Basel
>>> Switzerland
>>> Phone +41 61 687 31 70
>>> 
>>> Learn more about pRED Informatics at http://go.roche.com/pREDi
>>> 
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/