[ensembl-dev] Problem with formatdb and several pep FASTA files

Toni Hermoso Pulido toni.hermoso at crg.cat
Wed Jun 13 17:37:25 BST 2012


Hi Andy,

Judging by my pipeline, I would say that more files are affected
(though I don't know how many empty sequences each file contains).

Below is an excerpt of the error messages produced when the FASTA files
cannot be formatted by NCBI BLAST:

mv: cannot stat
`/db/ensembl/release-67/callithrix_jacchus/proteome/Callithrix_jacchus.C_jacchus3.2.1.67.pep.all.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/choloepus_hoffmanni/proteome/Choloepus_hoffmanni.choHof1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/danio_rerio/proteome/Danio_rerio.Zv9.67.pep.all.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/echinops_telfairi/proteome/Echinops_telfairi.TENREC.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/erinaceus_europaeus/proteome/Erinaceus_europaeus.HEDGEHOG.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/felis_catus/proteome/Felis_catus.CAT.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/gadus_morhua/proteome/Gadus_morhua.gadMor1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/homo_sapiens/proteome/Homo_sapiens.GRCh37.67.pep.all.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/macropus_eugenii/proteome/Macropus_eugenii.Meug_1.0.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/mus_musculus/proteome/Mus_musculus.NCBIM37.67.pep.all.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/ochotona_princeps/proteome/Ochotona_princeps.pika.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/ornithorhynchus_anatinus/proteome/Ornithorhynchus_anatinus.OANA5.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/oryzias_latipes/proteome/Oryzias_latipes.MEDAKA1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/procavia_capensis/proteome/Procavia_capensis.proCap1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/sorex_araneus/proteome/Sorex_araneus.COMMON_SHREW1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/tarsius_syrichta/proteome/Tarsius_syrichta.tarSyr1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/tetraodon_nigroviridis/proteome/Tetraodon_nigroviridis.TETRAODON8.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/vicugna_pacos/proteome/Vicugna_pacos.vicPac1.67.pep.abinitio.fa.*':
No such file or directory
mv: cannot stat
`/db/ensembl/release-67/xenopus_tropicalis/proteome/Xenopus_tropicalis.JGI_4.2.67.pep.abinitio.fa.*':
No such file or directory
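
To find out how many empty sequences each file contains, something like
the following Python sketch could be used. It is only an illustration,
not part of the pipeline above: the function name count_empty_records is
hypothetical, and it assumes plain (uncompressed) FASTA files in which
every record starts with a ">" header line.

```python
import sys


def count_empty_records(lines):
    """Return (total_records, ids_of_empty_records) for an iterable of FASTA lines.

    A record counts as empty when no sequence characters follow its header.
    """
    total = 0
    empty_ids = []
    current_id = None   # identifier of the record being read
    current_len = 0     # residues seen so far for that record
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, so check it first.
            if current_id is not None and current_len == 0:
                empty_ids.append(current_id)
            total += 1
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            current_len = 0
        elif current_id is not None:
            current_len += len(line.strip())
    # The last record has no following header, so check it here.
    if current_id is not None and current_len == 0:
        empty_ids.append(current_id)
    return total, empty_ids


if __name__ == "__main__":
    # Usage (assumed script name): python count_empty.py file1.fa file2.fa ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            total, empty_ids = count_empty_records(handle)
        print(f"{path}\t{total}\t{len(empty_ids)}\t{','.join(empty_ids)}")
```

For each file given on the command line it prints the path, the number
of records, the number of empty records, and their identifiers, which
could then be compared against the per-species counts quoted below.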

So I understand that you plan to replace the files on the FTP site, is that right?

Thanks for everything,

2012/6/13 Andy Yates <ayates at ebi.ac.uk>:
> Hi Toni,
>
> We are currently aware of this issue. These zero-length sequences have appeared due to a bug in our FASTA serialiser, which is unable to handle sequences of length 1. This was not picked up during our dumping process as we do not generate NCBI BLAST indexes. The files are now being regenerated. The current list of known affected species and their protein counts is:
>
> callithrix_jacchus      1
> danio_rerio     1
> homo_sapiens    13
> mus_musculus    5
>
> Does this correspond to your own list?
>
> All the best,
>
> Andy
>
> Andrew Yates                   Ensembl Core Software Project Leader
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensembl.org/
>
> On 13 Jun 2012, at 15:25, Toni Hermoso Pulido wrote:
>
>> Hello,
>>
>> there seems to be a problem with a few FASTA pep files of some
>> organisms when performing a formatdb (2.2.25 and 2.2.26 tested):
>>
>> $ blast/blast-2.2.26/bin/formatdb -i Mus_musculus.NCBIM37.67.pep.all.fa
>> [formatdb] WARNING: Cannot add sequence number 19278
>> (lcl|19278_Mus_musculus.NCBIM37.67.pep.all.) because it has
>> zero-length.
>>
>> [formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
>>
>> This happens with an empty FASTA record, in this case:
>> >ENSMUSP00000118372 pep:known chromosome:NCBIM37:4:117507600:117515714:1 gene:ENSMUSG00000028542 transcript:ENSMUST00000151316 gene_biotype:protein_coding transcript_biotype:protein_coding
>>
>> I haven't experienced a similar issue in the past.
>>



