[ensembl-dev] Problem with formatdb and several pep FASTA files

Andy Yates ayates at ebi.ac.uk
Thu Jun 14 11:44:05 BST 2012


Hi Toni,

I think I can see the pattern. The species I previously quoted were those whose pep.all files were affected. However, there are a lot more species whose abinitio protein models have a length of 1.

Replacing these dumps looks like the right way to go. However, can I suggest an alternative for you? You could switch to using makeblastdb from the NCBI BLAST+ package. When both are the same version, BLAST's formatdb and BLAST+'s makeblastdb create compatible indexes. The advantage is that makeblastdb skips empty sequences, e.g.

my-machine andy$ ncbi-blast-2.2.25+/bin/makeblastdb -in Danio_rerio.Zv9.67.pep.all.fa  -dbtype prot -out Danio_rerio.Zv9.67.pep.all.fa

Building a new DB, current time: 06/14/2012 09:56:19
New DB name:   Danio_rerio.Zv9.67.pep.all.fa
New DB title:  Danio_rerio.Zv9.67.pep.all.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Ignoring sequence 'lcl|18274' as it has no sequence data
Adding sequences from FASTA; added 42170 sequences in 2.44899 seconds.

blastall and blastp both produce the same results when given the makeblastdb and formatdb databases (the formatdb one built with ENSDARP00000124078 taken out).
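
If switching tools is not convenient, another option would be to drop the empty records before running formatdb. Below is a minimal sketch in Python, assuming plain multi-line FASTA input. The function name drop_empty_records is made up for illustration and is not part of any Ensembl or NCBI tool. Note that this discards the affected proteins rather than restoring their single residue, so the regenerated dumps remain the proper fix.

```python
import sys
from itertools import chain


def drop_empty_records(src, dst):
    """Copy FASTA records from src to dst, skipping records with no residues.

    src and dst are text file objects. Returns the IDs of the skipped records.
    (Hypothetical helper, written for illustration only.)
    """
    skipped = []
    header, body = None, []
    # The trailing ">" is a sentinel so the last record is flushed too.
    for line in chain(src, [">"]):
        if line.startswith(">"):
            if header is not None:
                if any(part.strip() for part in body):
                    # Record has sequence data: keep it unchanged.
                    dst.write(header)
                    dst.writelines(body)
                else:
                    # Header with no sequence lines: remember its ID and drop it.
                    fields = header[1:].split()
                    skipped.append(fields[0] if fields else "")
            header, body = line, []
        elif header is not None:
            body.append(line)
    return skipped


# Usage (file names are placeholders): python drop_empty.py in.fa out.fa
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        for record_id in drop_empty_records(fin, fout):
            print("skipped empty record:", record_id, file=sys.stderr)
```

The filtered file can then be passed to formatdb as before, and the IDs printed to stderr give the number of empty sequences per file.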

Andy

On 13 Jun 2012, at 17:37, Toni Hermoso Pulido wrote:

> Hi Andy,
> 
> if I rely on my pipeline, I would dare to say that there are more
> files affected (I don't know how many empty seqs per file, though)
> 
> One excerpt of error messages when FASTA files cannot be formatted by
> NCBI Blast below:
> 
> mv: cannot stat
> `/db/ensembl/release-67/callithrix_jacchus/proteome/Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/choloepus_hoffmanni/proteome/Choloepus_hoffmanni.choHof1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/danio_rerio/proteome/Danio_rerio.Zv9.67.pep.all.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/echinops_telfairi/proteome/Echinops_telfairi.TENREC.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/erinaceus_europaeus/proteome/Erinaceus_europaeus.HEDGEHOG.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/felis_catus/proteome/Felis_catus.CAT.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/gadus_morhua/proteome/Gadus_morhua.gadMor1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/homo_sapiens/proteome/Homo_sapiens.GRCh37.67.pep.all.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/macropus_eugenii/proteome/Macropus_eugenii.Meug_1.0.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/mus_musculus/proteome/Mus_musculus.NCBIM37.67.pep.all.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/ochotona_princeps/proteome/Ochotona_princeps.pika.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/ornithorhynchus_anatinus/proteome/Ornithorhynchus_anatinus.OANA5.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/oryzias_latipes/proteome/Oryzias_latipes.MEDAKA1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/procavia_capensis/proteome/Procavia_capensis.proCap1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/sorex_araneus/proteome/Sorex_araneus.COMMON_SHREW1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/tarsius_syrichta/proteome/Tarsius_syrichta.tarSyr1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/tetraodon_nigroviridis/proteome/Tetraodon_nigroviridis.TETRAODON8.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/vicugna_pacos/proteome/Vicugna_pacos.vicPac1.67.pep.abinitio.fa.*':
> No such file or directory
> mv: cannot stat
> `/db/ensembl/release-67/xenopus_tropicalis/proteome/Xenopus_tropicalis.JGI_4.2.67.pep.abinitio.fa.*':
> No such file or directory
> 
> So I understand you plan to replace the files in the FTP site, don't you?
> 
> Thanks for all,
> 
> 2012/6/13 Andy Yates <ayates at ebi.ac.uk>:
>> Hi Toni,
>> 
>> We are currently aware of this issue. These zero-length sequences appeared because of a bug in our FASTA serialiser, which is unable to handle sequences of length 1. This was not picked up during our dumping process as we do not generate NCBI BLAST indexes. The files are now being regenerated. The current list of known affected species and their protein counts is:
>> 
>> callithrix_jacchus      1
>> danio_rerio     1
>> homo_sapiens    13
>> mus_musculus    5
>> 
>> Does this correspond to your own list?
>> 
>> All the best,
>> 
>> Andy
>> 
>> Andrew Yates                   Ensembl Core Software Project Leader
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensembl.org/
>> 
>> On 13 Jun 2012, at 15:25, Toni Hermoso Pulido wrote:
>> 
>>> Hello,
>>> 
>>> there seems to be a problem with a few FASTA pep files of some
>>> organisms when performing a formatdb (2.2.25 and 2.2.26 tested):
>>> 
>>> $ blast/blast-2.2.26/bin/formatdb -i Mus_musculus.NCBIM37.67.pep.all.fa
>>> [formatdb] WARNING: Cannot add sequence number 19278
>>> (lcl|19278_Mus_musculus.NCBIM37.67.pep.all.) because it has
>>> zero-length.
>>> 
>>> [formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
>>> 
>>> This happens with empty FASTA records, in this case:
>>> >ENSMUSP00000118372 pep:known chromosome:NCBIM37:4:117507600:117515714:1 gene:ENSMUSG00000028542 transcript:ENSMUST00000151316 gene_biotype:protein_coding transcript_biotype:protein_coding
>>> 
>>> I haven't experienced a similar issue in the past.
>>> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/




