[ensembl-dev] Error on using parse_ncbi_gff3.pl
Thibaut Hourlier
thibaut at ebi.ac.uk
Thu Sep 22 09:46:03 BST 2016
Hi David,
The problematic feature is NM_001001532.2, ID=rna42218. The transcript has four exons, but the last two are separated by a tiny intron: exon three ends at 22151182 and exon four starts at 22151185. The CDS has three parts, and the last part is exon three extended by three bases, so it runs across that intron.
NC_010454.3 BestRefSeq mRNA 22140218 22152033 . + . ID=rna42218;Parent=gene24394;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Name=NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq exon 22140218 22140275 . + . ID=id356623;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq exon 22146953 22147005 . + . ID=id356624;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq exon 22150106 22151182 . + . ID=id356625;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq exon 22151185 22152033 . + . ID=id356626;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NM_001001532.2;Note=The RefSeq transcript has 7 substitutions%2C 2 non-frameshifting indels compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CCR7;product=chemokine (C-C motif) receptor 7;transcript_id=NM_001001532.2
NC_010454.3 BestRefSeq CDS 22140266 22140275 . + 0 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
NC_010454.3 BestRefSeq CDS 22146953 22147005 . + 2 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
NC_010454.3 BestRefSeq CDS 22150106 22151185 . + 0 ID=cds30293;Parent=rna42218;Dbxref=GeneID:396663,Genbank:NP_001001532.1;Name=NP_001001532.1;gbkey=CDS;gene=CCR7;partial=true;product=C-C chemokine receptor type 7 precursor;protein_id=NP_001001532.1
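To make the mismatch concrete, here is a small Python sketch (not part of parse_ncbi_gff3.pl; the function name uncontained_cds is made up for illustration). It takes the coordinates from the lines above and reports any CDS segment that does not fit inside a single exon:

```python
# Standalone sketch: check that each CDS segment of a transcript lies inside
# one of its exons. Coordinates are copied from the rna42218 lines above
# (1-based, inclusive, as in GFF3).
exons = [
    (22140218, 22140275),
    (22146953, 22147005),
    (22150106, 22151182),
    (22151185, 22152033),
]
cds_segments = [
    (22140266, 22140275),
    (22146953, 22147005),
    (22150106, 22151185),
]


def uncontained_cds(exons, cds_segments):
    """Return the CDS segments that do not fit inside any single exon."""
    return [
        (start, end)
        for start, end in cds_segments
        if not any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)
    ]


bad = uncontained_cds(exons, cds_segments)
for start, end in bad:
    print(f"CDS {start}-{end} is not contained in any exon")
```

The only segment it reports is the last one, 22150106-22151185, which runs three bases past the end of exon three at 22151182.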
As I said previously, the GFF does not seem correct to me. If NCBI cannot redump the file, the easiest option would be to remove this transcript. There is a Gnomon transcript for the same gene, so you shouldn't lose information.
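If it helps, removing the transcript could be done with something like the Python sketch below. This is only an illustration, not an Ensembl tool: the function name drop_feature_tree is hypothetical. It assumes the real file is tab-separated with nine columns, and that a parent line always appears before its children, as in the excerpt above.

```python
def drop_feature_tree(lines, root_id):
    """Yield GFF3 lines, skipping the feature `root_id` and its descendants.

    Assumptions (not checked): parents are listed before their children,
    and a child of a dropped feature has no other parent worth keeping.
    """
    dropped = {root_id}
    for line in lines:
        # Pass comments, directives such as '###', and blank lines through.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            # Not a feature line we understand; leave it untouched.
            yield line
            continue
        attrs = dict(
            pair.split("=", 1) for pair in columns[8].split(";") if "=" in pair
        )
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") == root_id or any(p in dropped for p in parents):
            # Remember this ID so that its own children are dropped too.
            if "ID" in attrs:
                dropped.add(attrs["ID"])
            continue
        yield line
```

Calling drop_feature_tree on an open file handle with "rna42218" as root_id and writing the yielded lines to a new file would leave out the mRNA line together with its exon and CDS lines. The gene line (gene24394) stays, so the Gnomon transcript still has its parent.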
Hope this helps
Thibaut
> On 21 Sep 2016, at 17:16, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
>
> Hi David,
> I tested with the latest commits for release/85. There have been some fixes, but I got the same results.
> The three species which loaded fine were all dumped after the pig file. I suspect that NCBI changed their dumper and that we then updated our parser to match. Because we had already loaded pig into our databases, we didn't test the pig file again.
>
> If you have another species that you are trying to load and the file has been dumped in 2015, it would be interesting to know if it loads without problem.
>
> I have also tried earlier versions of the script, but I get the same error.
>
> I had a look at the file and found a gene (gene32440) which seems to be on its own, but it should probably be linked to the transcript (rna56687):
> NC_010459.4 BestRefSeq mRNA 15800576 15805994 . + . ID=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Name=NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4 BestRefSeq exon 15800576 15800725 . + . ID=id489083;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4 BestRefSeq exon 15805408 15805454 . + . ID=id489084;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4 BestRefSeq exon 15805895 15805994 . + . ID=id489085;Parent=rna56687;Dbxref=GeneID:397154,Genbank:NM_214081.2;Note=The RefSeq transcript has 1 frameshift and aligns at 12%25 coverage compared to this genomic sequence;end_range=15805994,.;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=CHGB;partial=true;product=chromogranin B;transcript_id=NM_214081.2
> NC_010459.4 BestRefSeq gene 15800576 15805992 . + . ID=gene32440;Dbxref=GeneID:397154;Name=CHGB;description=chromogranin B (secretogranin 1);end_range=15805992,.;gbkey=Gene;gene=CHGB;gene_biotype=protein_coding;gene_synonym=CGB;partial=true
>
> You may want to contact NCBI as their file could be wrong and they probably should dump it again.
>
> Hope this helps.
>
> Thanks
> Thibaut
>
>> On 21 Sep 2016, at 15:36, Herzig, David <david.herzig at roche.com> wrote:
>>
>> Hi
>>
>> ensembl-pipeline: master / b1ca6be3ad6dfc4e960bce7fce5733f745102710
>> ensembl-io: release/85 / 9eacea74ccaf8480aafc82c6b0ffc626a1537b29
>>
>> thx.
>> David
>>
>> On Wed, Sep 21, 2016 at 3:18 PM, Thibaut Hourlier <thibaut at ebi.ac.uk> wrote:
>> Hi David,
>> Unfortunately NCBI does not always write their GFF the same way for all their species, so a fix for one species can introduce a bug for another.
>> Could you please tell us which branch and last commit you are using for your ensembl-pipeline and ensembl-io repositories?
>>
>> Thanks
>> Thibaut
>>
>>> On 20 Sep 2016, at 13:26, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>>>
>>> Odd, D. rerio should have also failed then if my suspicions were correct. Guess there must be something else going on here.
>>>
>>> Dan
>>>
>>>
>>>
>>>
>>>
>>> Daniel Barrell
>>> Platform Specialist
>>> eaglediscover Best of Show Winner at Bio-IT World 2016
>>>
>>> Eagle Genomics Ltd
>>> T: +44 (0)1223 654481
>>> http://www.eaglegenomics.com
>>> Disclaimer: http://www.eaglegenomics.com/about/privacy-statement/
>>>
>>> https://youtu.be/rPdgFTo0FZM
>>>
>>> On 20 September 2016 at 12:14, Herzig, David <david.herzig at roche.com> wrote:
>>> Hi Daniel
>>>
>>> Thx for the feedback.
>>>
>>> I was able to use it for:
>>> - D. rerio
>>> - M. musculus
>>> - R. norvegicus
>>>
>>> regards,
>>> David
>>>
>>>
>>> On Tue, Sep 20, 2016 at 1:11 PM, Daniel Barrell <daniel.barrell at eaglegenomics.com> wrote:
>>> Hi David,
>>>
>>> Line 1184334 is the last line of the GFF3 file and contains '###'. There used to be code to ignore lines like these:
>>>
>>> + next if $line =~ /^#/;
>>>
>>> When the script moved to use ensembl-io I think it may have lost this check, however I would expect ensembl-io to handle the '###'. Which species files worked? I checked on NCBI and other species (e.g. horse) would also fail in the same way.
>>>
>>> Dan
>>>
>>> On 20 September 2016 at 11:16, Herzig, David <david.herzig at roche.com> wrote:
>>> Hi Ensembl Users
>>>
>>> I have set up the Ensembl environment for several species. Everything is OK.
>>> After that I imported data from NCBI using the parse_ncbi_gff3.pl script. It works fine for almost all species, but for the species Sus scrofa I have the following issue:
>>>
>>> I downloaded the file from NCBI:
>>> /ftp.ncbi.nlm.nih.gov/genomes/Sus_scrofa/GFF/ref_Sscrofa10.2_top_level.gff3
>>>
>>> I used the parse_ncbi_gff3.pl script to import it.
>>>
>>> The process starts successfully but after a while I get the following error message and the process stops:
>>>
>>> Can't call method "phase" on an undefined value at /home/ensembl/release-85/ensembl-pipeline/scripts/refseq_import/parse_ncbi_gff3.pl line 882, <__ANONIO__> line 1184334.
>>>
>>> Any ideas?
>>>
>>> regards,
>>> David
>>>
>>>
>>> --
>>> David Herzig
>>> Scientist, pRED Informatics
>>> Roche Pharma Research and Early Development
>>>
>>> Roche Innovation Center Basel
>>>
>>> F. Hoffmann-La Roche Ltd
>>> Grenzacherstrasse 124
>>> 4070 Basel
>>> Switzerland
>>> Phone +41 61 687 31 70
>>>
>>> Learn more about pRED Informatics at http://go.roche.com/pREDi
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/