[ensembl-dev] NCBI formatdb failing

Venkata Satagopam satagopa at embl.de
Wed May 23 10:50:42 BST 2012


Hi

I just downloaded the Ensembl v67 peptide FASTA files and ran NCBI formatdb on them. All species formatted successfully except four, which failed; see the formatdb logs below.

###################
========================[ May 22, 2012  8:51 PM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Homo_sapiens.GRCh37.67.pep.all.fa"
WARNING: [000.000] Cannot add sequence number 4365 (lcl|4365_Homo_sapiens.GRCh37.67.pep.all.f) because it has zero-length.

Removed single-volume database Homo_sapiens.GRCh37.67.pep.all.fa
FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.

========================[ May 22, 2012  8:52 PM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Mus_musculus.NCBIM37.67.pep.all.fa"
WARNING: [000.000] Cannot add sequence number 19278 (lcl|19278_Mus_musculus.NCBIM37.67.pep.all.) because it has zero-length.

Removed single-volume database Mus_musculus.NCBIM37.67.pep.all.fa
FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.

========================[ May 22, 2012  8:51 PM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa"
WARNING: [000.000] Cannot add sequence number 14764 (lcl|14764_Callithrix_jacchus.C_jacchus3.2.) because it has zero-length.

Removed single-volume database Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.

========================[ May 22, 2012  8:51 PM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Danio_rerio.Zv9.67.pep.all.fa"
WARNING: [000.000] Cannot add sequence number 18274 (lcl|18274_Danio_rerio.Zv9.67.pep.all.fa) because it has zero-length.

Removed single-volume database Danio_rerio.Zv9.67.pep.all.fa
FATAL ERROR: [001.000] Fatal error when adding sequence to BLAST database.

###################

I then looked a bit deeper into why only these four FASTA files were failing, and found that the sequences are missing for some entries. The corresponding Ensembl protein IDs are listed below.

1. Danio_rerio.Zv9.67.pep.all.fa
ENSDARP00000124078

2. Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
ENSCJAP00000015257

3. Mus_musculus.NCBIM37.67.pep.all.fa
ENSMUSP00000118372
ENSMUSP00000120375
ENSMUSP00000124076
ENSMUSP00000134515
ENSMUSP00000133928

4. Homo_sapiens.GRCh37.67.pep.all.fa
ENSP00000433535
ENSP00000454527
ENSP00000426696
ENSP00000427330
ENSP00000398318
ENSP00000427025
ENSP00000414758
ENSP00000405652
ENSP00000432174
ENSP00000453420
ENSP00000428295
ENSP00000436303
ENSP00000432344
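
A check like this for missing sequences can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the list above: the function name find_empty_records is made up, and it assumes plain (uncompressed) FASTA in which each record starts with a ">" header line and the ID is the first whitespace-delimited token of that header.

```python
import sys


def find_empty_records(handle):
    """Return the IDs of FASTA records that have no sequence residues.

    `handle` is any iterable of lines (an open file, a list of strings).
    The name and the ID convention are assumptions for this sketch.
    """
    empty = []
    current_id = None   # ID of the record being read, None before the first header
    seq_len = 0         # residues seen so far for the current record
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record; report it if it was empty.
            if current_id is not None and seq_len == 0:
                empty.append(current_id)
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            seq_len = 0
        else:
            seq_len += len(line)
    # The last record has no following header, so check it here.
    if current_id is not None and seq_len == 0:
        empty.append(current_id)
    return empty


if __name__ == "__main__":
    # Usage (file name taken from the logs above):
    #   python find_empty.py Homo_sapiens.GRCh37.67.pep.all.fa
    for path in sys.argv[1:]:
        with open(path) as fh:
            for record_id in find_empty_records(fh):
                print(path, record_id, sep="\t")
```

Run against the four failing files, a script of this kind should print one line per zero-length entry.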

After removing these entries from the FASTA files, formatdb runs through successfully:

bash-3.2$ more formatdb.log 

========================[ May 23, 2012 11:08 AM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Danio_rerio.Zv9.67.pep.all.fa"
Formatted 42170 sequences in volume 0
SUCCESS: formatted database Danio_rerio.Zv9.67.pep.all.fa

========================[ May 23, 2012 11:18 AM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Homo_sapiens.GRCh37.67.pep.all.fa"
Formatted 100341 sequences in volume 0
SUCCESS: formatted database Homo_sapiens.GRCh37.67.pep.all.fa

========================[ May 23, 2012 11:23 AM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Mus_musculus.NCBIM37.67.pep.all.fa"
Formatted 56785 sequences in volume 0
SUCCESS: formatted database Mus_musculus.NCBIM37.67.pep.all.fa

========================[ May 23, 2012 11:27 AM ]========================
Version 2.2.24 [Aug-08-2010]
Started database file "Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa"
Formatted 43791 sequences in volume 0
SUCCESS: formatted database Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa
bash-3.2$
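
For anyone who wants to apply the same workaround, removing the zero-length entries could look roughly like the sketch below. It is a hedged example, not the procedure actually used for the run above: drop_empty_records is a hypothetical helper, and it assumes uncompressed FASTA with ">" header lines. It writes a cleaned copy rather than editing the downloaded file in place.

```python
import sys


def drop_empty_records(lines):
    """Yield the lines of every FASTA record that has at least one residue.

    `lines` is any iterable of lines. Blank lines inside a record are
    dropped; records with no sequence at all are skipped entirely.
    """
    header = None   # header line of the record being collected
    body = []       # non-blank sequence lines of that record
    for line in lines:
        if line.startswith(">"):
            # Flush the previous record only if it carried sequence.
            if header is not None and body:
                yield header
                yield from body
            header, body = line, []
        elif header is not None and line.strip():
            body.append(line)
    # Flush the final record.
    if header is not None and body:
        yield header
        yield from body


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python drop_empty.py input.fa cleaned.fa
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        dst.writelines(drop_empty_records(src))
```

formatdb can then be rerun on the cleaned file.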


I then looked in the Ensembl web interface for more details about the proteins with missing sequences, for example ENSP00000433535:

http://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000116337;r=1:110158726-110174673;t=ENST00000474459

or another example ENSDARP00000124078

http://www.ensembl.org/Danio_rerio/Transcript/ProteinSummary?db=core;g=ENSDARG00000041217;r=3:14689988-14736032;t=ENSDART00000131720

Both of these web pages report an amino acid (aa) length of 1 for the two entries I checked manually. I haven't checked the other proteins with missing sequences.

I guess other people may also have problems running BLAST with these sequences, so it may be worth fixing the FASTA files on the FTP site.

Best Regards
Venkata


Venkata P. Satagopam
Schneider Group
Structural and Computational Biology
EMBL
Meyerhofstr. 1
69117 Heidelberg

phone: +49-(0)-6221-387-140
fax:  +49-(0)-6221-387-517
venkata.satagopam at embl.de
http://www.embl-heidelberg.de/~satagopa/ 
