[ensembl-dev] NCBI formatdb failing

Anne Parker ap5 at sanger.ac.uk
Mon May 28 16:01:01 BST 2012


Hi Venkata

Our software team is looking at the script that generates these files, and we hope to have a working file uploaded soon.

Regards

Anne



On 23 May 2012, at 10:50, Venkata Satagopam wrote:

> Hi
> 
> I just downloaded the Ensembl v67 peptide FASTA files and ran NCBI formatdb on them. All succeeded except four species, which failed; see the formatdb logs below.
> 
> ###################
> ========================[ May 22, 2012  8:51 PM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Homo_sapiens.GRCh37.67.pep.all.fa"
> WARNING: [000.000] Cannot add sequence number 4365 (lcl|4365_Homo_sapiens.GRCh37.67.pep.all.f) because it has zero-length.
> 
> Removed single-volume database Homo_sapiens.GRCh37.67.pep.all.fa
> FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.
> 
> ========================[ May 22, 2012  8:52 PM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Mus_musculus.NCBIM37.67.pep.all.fa"
> WARNING: [000.000] Cannot add sequence number 19278 (lcl|19278_Mus_musculus.NCBIM37.67.pep.all.) because it has zero-length.
> 
> Removed single-volume database Mus_musculus.NCBIM37.67.pep.all.fa
> FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.
> 
> ========================[ May 22, 2012  8:51 PM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa"
> WARNING: [000.000] Cannot add sequence number 14764 (lcl|14764_Callithrix_jacchus.C_jacchus3.2.) because it has zero-length.
> 
> Removed single-volume database Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
> FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.
> 
> ========================[ May 22, 2012  8:51 PM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Danio_rerio.Zv9.67.pep.all.fa"
> WARNING: [000.000] Cannot add sequence number 18274 (lcl|18274_Danio_rerio.Zv9.67.pep.all.fa) because it has zero-length.
> 
> Removed single-volume database Danio_rerio.Zv9.67.pep.all.fa
> FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.
> 
> ###################
> 
> I then looked a bit deeper into why only these four FASTA files were failing, and found that the sequences were missing for some entries. The corresponding Ensembl protein IDs are listed below.
> 
> 1. Danio_rerio.Zv9.67.pep.all.fa
> ENSDARP00000124078
> 
> 2. Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
> ENSCJAP00000015257
> 
> 3. Mus_musculus.NCBIM37.67.pep.all.fa
> ENSMUSP00000118372
> ENSMUSP00000120375
> ENSMUSP00000124076
> ENSMUSP00000134515
> ENSMUSP00000133928
> 
> 4. Homo_sapiens.GRCh37.67.pep.all.fa
> ENSP00000433535
> ENSP00000454527
> ENSP00000426696
> ENSP00000427330
> ENSP00000398318
> ENSP00000427025
> ENSP00000414758
> ENSP00000405652
> ENSP00000432174
> ENSP00000453420
> ENSP00000428295
> ENSP00000436303
> ENSP00000432344
> 
> After removing these entries from the FASTA files, formatdb runs through successfully:
> 
> bash-3.2$ more formatdb.log 
> 
> ========================[ May 23, 2012 11:08 AM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Danio_rerio.Zv9.67.pep.all.fa"
> Formatted 42170 sequences in volume 0
> SUCCESS: formatted database Danio_rerio.Zv9.67.pep.all.fa
> 
> ========================[ May 23, 2012 11:18 AM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Homo_sapiens.GRCh37.67.pep.all.fa"
> Formatted 100341 sequences in volume 0
> SUCCESS: formatted database Homo_sapiens.GRCh37.67.pep.all.fa
> 
> ========================[ May 23, 2012 11:23 AM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Mus_musculus.NCBIM37.67.pep.all.fa"
> Formatted 56785 sequences in volume 0
> SUCCESS: formatted database Mus_musculus.NCBIM37.67.pep.all.fa
> 
> ========================[ May 23, 2012 11:27 AM ]========================
> Version 2.2.24 [Aug-08-2010]
> Started database file "Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa"
> Formatted 43791 sequences in volume 0
> SUCCESS: formatted database Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
> bash-3.2$
> 
> 
> I then looked at the Ensembl web interface for more details about the proteins with missing sequences, for example ENSP00000433535:
> 
> http://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000116337;r=1:110158726-110174673;t=ENST00000474459
> 
> or another example ENSDARP00000124078
> 
> http://www.ensembl.org/Danio_rerio/Transcript/ProteinSummary?db=core;g=ENSDARG00000041217;r=3:14689988-14736032;t=ENSDART00000131720
> 
> Both of these web pages say the amino acid (aa) length is 1 for these two manually checked entries. I haven't checked the other proteins with missing sequences.
> 
> I guess other people might also have problems running BLAST with these sequences, so it may be worth fixing the FASTA files on the FTP site.
> 
> Best Regards
> Venkata
> 
> 
> Venkata P. Satagopam
> Schneider Group
> Structural and Computational Biology
> EMBL
> Meyerhofstr. 1
> 69117 Heidelberg
> 
> phone: +49-(0)-6221-387-140
> fax:  +49-(0)-6221-387-517
> venkata.satagopam at embl.de
> http://www.embl-heidelberg.de/~satagopa/ 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

Anne Parker
Ensembl Web Production Manager
http://www.ensembl.org
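
Until a corrected file is available, the workaround described in the quoted message (finding the entries with no sequence and removing them before running formatdb) could be scripted roughly as below. This is only a minimal sketch, not the script Venkata or the Ensembl team used: the function names `split_fasta` and `drop_empty_records` are made up for illustration, and it assumes an "empty" entry is a header line followed by no residues, either a blank line or nothing at all before the next header.

```python
def split_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream.

    The header is returned without the leading '>'; wrapped sequence
    lines are joined into a single string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_empty_records(src, dst, width=60):
    """Copy records with a non-empty sequence from src to dst.

    Returns the IDs (first word of the header) of the records that
    were skipped because their sequence had zero length.
    """
    dropped = []
    for header, seq in split_fasta(src):
        if not seq:
            words = header.split()
            dropped.append(words[0] if words else "")
            continue
        dst.write(">" + header + "\n")
        # Re-wrap the sequence at a fixed width (60 is an arbitrary choice).
        for start in range(0, len(seq), width):
            dst.write(seq[start:start + width] + "\n")
    return dropped
```

To use it, open the downloaded file for reading and a new file for writing, pass both to `drop_empty_records`, and run formatdb on the new file. The returned list gives the protein IDs that were dropped, which can be compared against the IDs listed in the quoted message.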






