[ensembl-dev] Problem with formatdb and several pep FASTA files

Toni Hermoso Pulido toni.hermoso at crg.cat
Fri Jun 15 09:55:35 BST 2012


Hi Andy,

Thanks for this solution!
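
For pipelines that have to stay on formatdb until the regenerated dumps are available, one workaround is to drop the zero-length records before indexing. Below is a minimal Python sketch, assuming plain (uncompressed) FASTA input. The function names are hypothetical and not part of any Ensembl or NCBI tool.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_empty(lines):
    """Return (kept_records, empty_headers).

    A record counts as empty when it has a header but no residues,
    which is the case formatdb reports as zero-length.
    """
    kept, empty = [], []
    for header, seq in read_fasta(lines):
        if seq:
            kept.append((header, seq))
        else:
            empty.append(header)
    return kept, empty


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs, wrapping sequences at `width` columns."""
    for header, seq in records:
        handle.write(header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Calling split_empty on an open file handle and passing the kept records to write_fasta should give a file without the zero-length entries; the length of the returned list of empty headers is the number of affected sequences in that file.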

2012/6/14 Andy Yates <ayates at ebi.ac.uk>:
> Hi Toni,
>
> I think I can see the pattern. The species I previously quoted were those whose pep.all files were affected. However, there are a lot more species whose abinitio protein models have a length of 1.
>
> It looks like replacing these dumps is the right way to go; however, can I suggest an alternative? You could switch to makeblastdb from the NCBI BLAST+ package. When working on the same version, BLAST's formatdb and BLAST+'s makeblastdb create compatible indexes. The advantage is that makeblastdb skips empty sequences, e.g.
>
> my-machine andy$ ncbi-blast-2.2.25+/bin/makeblastdb -in Danio_rerio.Zv9.67.pep.all.fa  -dbtype prot -out Danio_rerio.Zv9.67.pep.all.fa
>
> Building a new DB, current time: 06/14/2012 09:56:19
> New DB name:   Danio_rerio.Zv9.67.pep.all.fa
> New DB title:  Danio_rerio.Zv9.67.pep.all.fa
> Sequence type: Protein
> Keep Linkouts: T
> Keep MBits: T
> Maximum file size: 1073741824B
> Ignoring sequence 'lcl|18274' as it has no sequence data
> Adding sequences from FASTA; added 42170 sequences in 2.44899 seconds.
>
> blastall and blastp both produce the same results when given the makeblastdb and formatdb indexes (the formatdb one built with ENSDARP00000124078 taken out).
>
> Andy
>
> On 13 Jun 2012, at 17:37, Toni Hermoso Pulido wrote:
>
>> Hi Andy,
>>
>> Going by my pipeline, I would say that more files are affected
>> (I don't know how many empty sequences there are per file, though).
>>
>> Below is an excerpt of the error messages produced when FASTA files
>> cannot be formatted by NCBI BLAST:
>>
>> mv: cannot stat
>> `/db/ensembl/release-67/callithrix_jacchus/proteome/Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/choloepus_hoffmanni/proteome/Choloepus_hoffmanni.choHof1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/danio_rerio/proteome/Danio_rerio.Zv9.67.pep.all.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/echinops_telfairi/proteome/Echinops_telfairi.TENREC.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/erinaceus_europaeus/proteome/Erinaceus_europaeus.HEDGEHOG.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/felis_catus/proteome/Felis_catus.CAT.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/gadus_morhua/proteome/Gadus_morhua.gadMor1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/homo_sapiens/proteome/Homo_sapiens.GRCh37.67.pep.all.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/macropus_eugenii/proteome/Macropus_eugenii.Meug_1.0.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/mus_musculus/proteome/Mus_musculus.NCBIM37.67.pep.all.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/ochotona_princeps/proteome/Ochotona_princeps.pika.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/ornithorhynchus_anatinus/proteome/Ornithorhynchus_anatinus.OANA5.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/oryzias_latipes/proteome/Oryzias_latipes.MEDAKA1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/procavia_capensis/proteome/Procavia_capensis.proCap1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/sorex_araneus/proteome/Sorex_araneus.COMMON_SHREW1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/tarsius_syrichta/proteome/Tarsius_syrichta.tarSyr1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/tetraodon_nigroviridis/proteome/Tetraodon_nigroviridis.TETRAODON8.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/vicugna_pacos/proteome/Vicugna_pacos.vicPac1.67.pep.abinitio.fa.*':
>> No such file or directory
>> mv: cannot stat
>> `/db/ensembl/release-67/xenopus_tropicalis/proteome/Xenopus_tropicalis.JGI_4.2.67.pep.abinitio.fa.*':
>> No such file or directory
>>
>> So I understand you plan to replace the files on the FTP site, don't you?
>>
>> Thanks for everything,
>>
>> 2012/6/13 Andy Yates <ayates at ebi.ac.uk>:
>>> Hi Toni,
>>>
>>> We are aware of this issue. These zero-length sequences appeared because of a bug in our FASTA serialiser, which was unable to handle sequences of length 1. This was not picked up during our dumping process as we do not generate NCBI BLAST indexes. The files are now being regenerated. The currently known affected species and their protein counts are:
>>>
>>> callithrix_jacchus      1
>>> danio_rerio     1
>>> homo_sapiens    13
>>> mus_musculus    5
>>>
>>> Does this correspond to your own list?
>>>
>>> All the best,
>>>
>>> Andy
>>>
>>> Andrew Yates                   Ensembl Core Software Project Leader
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensembl.org/
>>>
>>> On 13 Jun 2012, at 15:25, Toni Hermoso Pulido wrote:
>>>
>>>> Hello,
>>>>
>>>> There seems to be a problem with the pep FASTA files of some
>>>> organisms when running formatdb (2.2.25 and 2.2.26 tested):
>>>>
>>>> $ blast/blast-2.2.26/bin/formatdb -i Mus_musculus.NCBIM37.67.pep.all.fa
>>>> [formatdb] WARNING: Cannot add sequence number 19278
>>>> (lcl|19278_Mus_musculus.NCBIM37.67.pep.all.) because it has
>>>> zero-length.
>>>>
>>>> [formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
>>>>
>>>> This happens with empty FASTA records, in this case:
>>>> >ENSMUSP00000118372 pep:known chromosome:NCBIM37:4:117507600:117515714:1 gene:ENSMUSG00000028542 transcript:ENSMUST00000151316 gene_biotype:protein_coding transcript_biotype:protein_coding
>>>>
>>>> I haven't experienced a similar issue in the past.
>>>>



