[ensembl-dev] Gene build pipeline - WU-Blast versus NCBI Blast

Thu Aug 29 08:09:03 BST 2013

Hi Thibaut et al,

sorry to bother you guys again - last question(s), I promise.

1) Using WU-Blast

I have been trying to switch to wublast to remove a bit of overhead from 
setting up the pipeline with NCBI. However, when following the 
documentation, I get this as a command line call:

wublastp /references/databases/uniprot_swissprot/uniprot.fasta 
/tmp/seq.9741.42380.fa -gi -cpus=1

The thing is that my version of Wu-blast (BLASTP 2.0a19MP-WashU 
[14-Jul-1998] [Build linux-x86 18:51:39 30-Jul-1998]) does not have a 
'-cpus' flag and I didn't set it. Hence, it fails.

The flag is set within within the Blast module(s) automatically, like in 
Analysis::Runnable::Blast.pm:

$self->options('-cpus=1') if(!$self->options);

So I am guessing you guys must be using some other version of Wu-Blast? 
I can't find any other than the one I got, since the original code from 
http://blast.wustl.edu/ is now hosted elsewhere and the only legacy 
version of their blast suite (Wu-blast) is the one I downloaded.

Any idea where I can get the version used by EnsEMBL? Or is it the same 
and the error lies elsewhere?

2) Having to specify '-p blastn' as a parameter for DNA-Peptide blasts 
when running the NCBI suite.
(This is from a conversation with Thibaut directly - the pipeline fails 
to determine the right blastall program to use when running DNA/Peptide 
blasts)

So I had a look at the code, since I still didn't understand why 
BlastGenscanPep.pm would automatically figure out which blastall 
programm (blastall -p blastp ) to use and BlastGenscanDna.pm wouldn't . 
Seems like this descision process is coded into the BlastGenscanPep.pm 
module, but was left out of BlastGenscanDna.pm. I wonder why that is? 
Makes using NCBI for the pipeline a little awkward (which I wouldn't 
mind if I could get the stupid Wu-Blast to work ;))

 From BlastGenscanPep:

if ( $blast{-type}=~m/ncbi/ ) {
     if ( $options{-query_type}=~m/pep/ ) {
       if ( $options{-database_type}=~m/pep/ ) {
            $options_string = '-p blastp' ;
       } elsif ( $options{-database_type}=~m/dna/ ) {
          $options_string = '-p tblastn' ;
       }
     }

     if ( $options{-query_type}=~m/dna/ ) {
       if ( $options{-database_type}=~m/dna/ ) {
            $options_string = '-p blastn' ;
       }elsif ( $options{-database_type}=~m/pep/ ) {
            $options_string = '-p blastx' ;
       }
     }
   }

Cheers,

Marc
> Hi Marc,
>
> On 28 Aug 2013, at 14:35, Marc Hoeppner <mphoeppner at gmail.com 
> <mailto:mphoeppner at gmail.com>> wrote:
>
>> Hi Thibaut,
>>
>> thanks for the help, been able to fix my problems.
>>
>> Turns out that yes, the coordinate system rank seems to matter - but 
>> I think the problem was that Pmatch was configured to write to the 
>> GENEWISE_DB, which existed and had tables and everything, but wasn't 
>> loaded with any data (such as a coordinate system).
> Whenever you write in a database using the Genebuild pipeline you need 
> to check that some tables are in sync with your reference. Here are 
> the tables:
> analysis
> assembly
> assembly_exception
> coord_system
> seq_region
> seq_region_attrib
> meta
> attrib_type
> seq_region_synonym
>
>
>>
>> For the unigene blast, adding the blastall program option manually 
>> into the parameter string solved that problem, but I still wonder if 
>> there is a bug in the code (it should do this automatically).
> As it is an blastall option and can be set in the analysis 
> configuration I don't see it as a bug
>
> Regards
> Thibaut
>
>>
>> Anyway, thanks again!
>>
>> /Marc
>>> Hi Marc,
>>>
>>> On 27 Aug 2013, at 13:53, Marc Hoeppner <mphoeppner at gmail.com 
>>> <mailto:mphoeppner at gmail.com>> wrote:
>>>
>>>> Hi Thibaut,
>>>>
>>>> thanks for the feedback. Answers to your comments in line:
>>>>
>>>>
>>>>> Hi Marc,
>>>>>
>>>>> On 26 Aug 2013, at 10:59, Marc Hoeppner <mphoeppner at gmail.com 
>>>>> <mailto:mphoeppner at gmail.com>> wrote:
>>>>>
>>>>>> Hi EnsEMBL team,
>>>>>>
>>>>>> been playing with the pipeline again, but am having problems 
>>>>>> (again). Please see below for details - am happy about any 
>>>>>> suggestions.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Marc
>>>>>>
>>>>>> ########
>>>>>> 1) Pmatch
>>>>>> ########
>>>>>>
>>>>>> I set up a pmatch analysis as by the documentation and it runs 
>>>>>> fine on my test dataset (small chicken chromosome) when I try it 
>>>>>> with test_RunnableDB. However, when I run the pipeline, I get this:
>>>>>>
>>>>>> TARGET  0.064u 0.008s 0+0k 0pf 0sw
>>>>>> BUILD   0.116u 0.040s 0+0k 0pf 0sw
>>>>>> SEARCH  22.949u 0.172s 0+0k 0pf 0sw
>>>>>> WARN: For multiple species use species attribute in DBAdaptor->new()
>>>>>> WRITING: Lost the will to live Error
>>>>>> Job 1198 failed: [
>>>>>> -------------------- EXCEPTION --------------------
>>>>>> MSG: Problems for Pmatch writing output for 
>>>>>> chromosome:vchicken_test:10:1:19911089:1 [Can't call method 
>>>>>> "version" on an undefined value at 
>>>>>> /opt/bioinformatics/ensembl-70/ensembl/modules/Bio/EnsEMBL/DBSQL/MetaContainer.pm 
>>>>>> line 218.
>>>>>> ]
>>>>>> STACK Bio::EnsEMBL::Pipeline::Job::run_module 
>>>>>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Job.pm:720
>>>>>> STACK (eval) 
>>>>>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:219
>>>>>> STACK main::run_jobs_with_lsfcopy 
>>>>>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:218
>>>>>> STACK toplevel 
>>>>>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:128
>>>>>> Date (localtime)    = Fri Aug 23 14:53:27 2013
>>>>>> Ensembl API version = 70
>>>>>>
>>>>> We would need to see how your coord_system and meta tables are 
>>>>> populated.
>>>>> The API complains that it can't find the version of your assembly. 
>>>>> Your coord_system table should look like this one:
>>>>> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
>>>>> | coord_system_id | species_id | name               | version   | 
>>>>> rank | attrib                         |
>>>>> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
>>>>> |                             1 |               1 | contig         
>>>>>      | NULL      |       3 | default_version,sequence_level |
>>>>> |                             2 |               1 | scaffold       
>>>>>     | oryCun2 |       2 | default_version            |
>>>>> |                             3 |               1 | chromosome | 
>>>>> oryCun2 |       1 | default_version  |
>>>>>>
>>>> This is my coord_system table:
>>>>
>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>> | coord_system_id | species_id | name       | version       | rank 
>>>> | attrib                         |
>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>> |               1 |          1 | chromosome | vchicken_test |    1 
>>>> | default_version                |
>>>> |               2 |          1 | contig     | vchicken_test |    3 
>>>> | default_version,sequence_level |
>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>>
>>>>
>>>> I don't have a supercontig layer, since I am faking contigs from 
>>>> assembled sequences for testing purposes. I think I had that 
>>>> discussed over this mailing list as well and was told that the API 
>>>> code should be able to deal with a contig-chromosome setup. 
>>>> Anything suspicious here?
>>> I'm not sure that this is the problem but you should change the rank 
>>> of your contig, set it to 2. The API might be looking for the 
>>> coordinate system with the rank 2 and fails to find it at the moment.
>>>
>>>>>> ##########
>>>>>> 2) Unigene
>>>>>> ##########
>>>>>>
>>>>>> This one really bothers me I think everything is set up correctly 
>>>>>> (downloaded the unigene file, header seems to comply with the 
>>>>>> reference formatting in Blast.pm etc), bit I cannot for the life 
>>>>>> of me get it to work. Specifically, I am trying to use ncbi blast 
>>>>>> and the command just looks off - seems like it tries to do a mix 
>>>>>> of Wublast and Ncbi blast (works fine with Uniprot though - so 
>>>>>> perhaps something with the BlastGenscanDna module?).
>>>>>>
>>>>>> Running job 1791
>>>>>> Module is BlastGenscanDNA
>>>>>> Input id is contig:vchicken_test:10_68:1:50000:1
>>>>>> Analysis is unigene
>>>>>> Files are 
>>>>>> /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unigene.55.retry2.out 
>>>>>> /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unige$ 
>>>>>>
>>>>>>
>>>>>> -------------------- WARNING ----------------------
>>>>>> MSG: Error running Blast cmd </usr/bin/blastall -d 
>>>>>> /data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa -i 
>>>>>> /tmp/seq.22305.24863.fa -cpus=1 2>&1 > 
>>>>>> /tmp/unigene.fa.22305.5651.blast.out>. Returned error 256 BLAST 
>>>>>> EXIT: '1', SIGNA$
>>>>>> FILE: Analysis/Runnable/Blast.pm LINE: 380
>>>>>> CALLED BY: EnsEMBL/Analysis/Runnable.pm  LINE: 729
>>>>>> Date (localtime)    = Fri Aug 23 14:54:47 2013
>>>>>> Ensembl API version = 70
>>>>> Have you tried to run the command by itself to see if it works? 
>>>>> The error message you have seems to be from the ncbi blast program.
>>>>> As the module dies the temporary file containing your chicken 
>>>>> sequence should still exists. If not, you will need to comment a 
>>>>> line in the run method of 
>>>>> ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Runnable.pm:
>>>>>
>>>>>   #$self->delete_files;
>>>>>
>>>>> You probably need to change your parameters in the analysis table 
>>>>> of your reference database. We use WU blast at the moment.
>>>>>
>>>>> Also, the parameters for blast should be "-cpus 1 -hitdist 40" 
>>>>> instead of "-cpus => 1, -hitdist => 40"
>>>>>
>>>>> Regards
>>>>> Thibaut
>>>>>
>>>> I think the problem is that the blastall string is mal-formatted. 
>>>> It should be
>>>>
>>>> blastall -i input.fasta -d database -p blastn
>>>>
>>>> So it failed to determine which blast program to use. 
>>>> Interestingly, it works fine for protein-protein blast, but fails 
>>>> in this protein-dna configuration. Hence my question whether this 
>>>> may be a problem in the BlastGenscanDna module. I can try wublast 
>>>> also, but I think I had serious trouble getting that to work. Are 
>>>> you guys calling your executables wublastp, wublastn etc? Because 
>>>> the only thing I could find was blastp, blastn etc. I assume this 
>>>> would still work if I specify these binary names in the configs..? 
>>>> Gave up at some point because it keep whining about something, so 
>>>> went the ncbi route..
>>> Maybe blast look at the database to know which type of search it 
>>> will do by default.
>>> For the moment you need to change the parameters and add '-p blastn' 
>>> and it should work.
>>>
>>> If your blastn is similar to blastall -p blastn then you can change 
>>> your program to be blastn and you don't need to add '-p blastn' to 
>>> your parameters
>>>>
>>>> Oh and thanks for pointing out the parameter issue, I actually took 
>>>> those from the documentation, sooo... ;) But will update my scripts.
>>> Unfortunately the documentation available is a bit old and updating 
>>> it take a lot of time.
>>>
>>> Regards
>>> Thibaut
>>>
>>>>
>>>> All the best,
>>>>
>>>> Marc
>>>>
>>>>
>>>>>>
>>>>>> And here the config for the unigene search:
>>>>>>
>>>>>> [unigene]
>>>>>> db=unigene
>>>>>> db_file=/data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa 
>>>>>>
>>>>>> program=blastall
>>>>>> program_file=blastall
>>>>>> parameters=-cpus => 1, -hitdist => 40
>>>>>> module=BlastGenscanDNA
>>>>>> input_id_type=CONTIG
>>>>>>
>>>>>> (Blast.pm is configured to use 'ncbi' as default type, so unigene 
>>>>>> should inherit that, no?)
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Marc P. Hoeppner, PhD
>>>>>> Department of Medical Biochemistry and Microbiology
>>>>>> Uppsala University, Sweden
>>>>>> marc.hoeppner at imbim.uu.se
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>>>> Posting guidelines and subscribe/unsubscribe info: 
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing listDev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info:http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog:http://www.ensembl.info/
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>>>> Posting guidelines and subscribe/unsubscribe info: 
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing listDev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info:http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog:http://www.ensembl.info/
>>
>
>
> -- The Wellcome Trust Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130829/9cad19b0/attachment.html>