[ensembl-dev] Gene build pipeline - WU-Blast versus NCBI Blast

Thibaut Hourlier th3 at sanger.ac.uk
Thu Aug 29 11:11:40 BST 2013


Hi Marc,

On 29 Aug 2013, at 08:09, Marc Hoeppner <mphoeppner at gmail.com> wrote:

> Hi Thibaut et al, 
> 
> sorry to bother you guys again - last question(s), I promise. 
> 
> 1) Using WU-Blast
> 
> I have been trying to switch to wublast to remove a bit of overhead from setting up the pipeline with NCBI. However, when following the documentation, I get this as a command line call:
> 
> wublastp /references/databases/uniprot_swissprot/uniprot.fasta /tmp/seq.9741.42380.fa -gi -cpus=1
> 
> The thing is that my version of Wu-blast (BLASTP 2.0a19MP-WashU [14-Jul-1998] [Build linux-x86 18:51:39 30-Jul-1998]) does not have a '-cpus' flag and I didn't set it. Hence, it fails.
> 
> The flag is set within within the Blast module(s) automatically, like in Analysis::Runnable::Blast.pm:
> 
> $self->options('-cpus=1') if(!$self->options);
This is probably a bug, You can simply delete/comment this line. I will have a look.
> 
> So I am guessing you guys must be using some other version of Wu-Blast? I can't find any other than the one I got, since the original code from http://blast.wustl.edu/ is now hosted elsewhere and the only legacy version of their blast suite (Wu-blast) is the one I downloaded. 
> 
> Any idea where I can get the version used by EnsEMBL? Or is it the same and the error lies elsewhere?
> 
> 2) Having to specify '-p blastn' as a parameter for DNA-Peptide blasts when running the NCBI suite.
> (This is from a conversation with Thibaut directly - the pipeline fails to determine the right blastall program to use when running DNA/Peptide blasts)
I don't see it as problem, when you prepare your analysis you just need to specify it in the config file.

[unigene] 
db=unigene 
db_file=/data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa 
program=blastall 
program_file=blastall 
parameters=-p blastn -A 40 -a 1
module=BlastGenscanDNA 
input_id_type=CONTIG 

The line parameters contains program specific parameters. I'm not sure of the -A 40 but it seems to me to be similar to -hitdist 40

I suppose that we were using NCBI blast for the protein alignment and WU blast for the dna alignment but I cannot say for sure.

You can modify our code to suit your needs.

I will look into it later.

Regards
Thibaut

> 
> So I had a look at the code, since I still didn't understand why BlastGenscanPep.pm would automatically figure out which blastall programm (blastall -p blastp ) to use and BlastGenscanDna.pm wouldn't . Seems like this descision process is coded into the BlastGenscanPep.pm module, but was left out of BlastGenscanDna.pm. I wonder why that is? Makes using NCBI for the pipeline a little awkward (which I wouldn't mind if I could get the stupid Wu-Blast to work ;))
> 
> From BlastGenscanPep:
> 
> if ( $blast{-type}=~m/ncbi/ ) {
>     if ( $options{-query_type}=~m/pep/ ) {
>       if ( $options{-database_type}=~m/pep/ ) {
>            $options_string = '-p blastp' ;  
>       } elsif ( $options{-database_type}=~m/dna/ ) {
>          $options_string = '-p tblastn' ;
>       }
>     }
> 
>     if ( $options{-query_type}=~m/dna/ ) {
>       if ( $options{-database_type}=~m/dna/ ) {
>            $options_string = '-p blastn' ;
>       }elsif ( $options{-database_type}=~m/pep/ ) {
>            $options_string = '-p blastx' ;
>       }
>     }
>   }
> 
> 
> 
> Cheers,
> 
> Marc
>> Hi Marc,
>> 
>> On 28 Aug 2013, at 14:35, Marc Hoeppner <mphoeppner at gmail.com> wrote:
>> 
>>> Hi Thibaut,
>>> 
>>> thanks for the help, been able to fix my problems. 
>>> 
>>> Turns out that yes, the coordinate system rank seems to matter - but I think the problem was that Pmatch was configured to write to the GENEWISE_DB, which existed and had tables and everything, but wasn't loaded with any data (such as a coordinate system). 
>> Whenever you write in a database using the Genebuild pipeline you need to check that some tables are in sync with your reference. Here are the tables:
>> analysis
>> assembly
>> assembly_exception
>> coord_system
>> seq_region
>> seq_region_attrib
>> meta
>> attrib_type
>> seq_region_synonym
>> 
>> 
>>> 
>>> For the unigene blast, adding the blastall program option manually into the parameter string solved that problem, but I still wonder if there is a bug in the code (it should do this automatically).
>> As it is an blastall option and can be set in the analysis configuration I don't see it as a bug
>> 
>> Regards
>> Thibaut
>>  
>>> 
>>> Anyway, thanks again!
>>> 
>>> /Marc
>>>> Hi Marc,
>>>> 
>>>> On 27 Aug 2013, at 13:53, Marc Hoeppner <mphoeppner at gmail.com> wrote:
>>>> 
>>>>> Hi Thibaut,
>>>>> 
>>>>> thanks for the feedback. Answers to your comments in line:
>>>>> 
>>>>> 
>>>>>> Hi Marc,
>>>>>> 
>>>>>> On 26 Aug 2013, at 10:59, Marc Hoeppner <mphoeppner at gmail.com> wrote:
>>>>>> 
>>>>>>> Hi EnsEMBL team, 
>>>>>>> 
>>>>>>> been playing with the pipeline again, but am having problems (again). Please see below for details - am happy about any suggestions. 
>>>>>>> 
>>>>>>> Cheers, 
>>>>>>> 
>>>>>>> Marc 
>>>>>>> 
>>>>>>> ######## 
>>>>>>> 1) Pmatch 
>>>>>>> ######## 
>>>>>>> 
>>>>>>> I set up a pmatch analysis as by the documentation and it runs fine on my test dataset (small chicken chromosome) when I try it with test_RunnableDB. However, when I run the pipeline, I get this: 
>>>>>>> 
>>>>>>> TARGET  0.064u 0.008s 0+0k 0pf 0sw 
>>>>>>> BUILD   0.116u 0.040s 0+0k 0pf 0sw 
>>>>>>> SEARCH  22.949u 0.172s 0+0k 0pf 0sw 
>>>>>>> WARN: For multiple species use species attribute in DBAdaptor->new() 
>>>>>>> WRITING: Lost the will to live Error 
>>>>>>> Job 1198 failed: [ 
>>>>>>> -------------------- EXCEPTION -------------------- 
>>>>>>> MSG: Problems for Pmatch writing output for chromosome:vchicken_test:10:1:19911089:1 [Can't call method "version" on an undefined value at /opt/bioinformatics/ensembl-70/ensembl/modules/Bio/EnsEMBL/DBSQL/MetaContainer.pm line 218. 
>>>>>>> ] 
>>>>>>> STACK Bio::EnsEMBL::Pipeline::Job::run_module /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Job.pm:720
>>>>>>> STACK (eval) /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:219
>>>>>>> STACK main::run_jobs_with_lsfcopy /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:218
>>>>>>> STACK toplevel /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:128
>>>>>>> Date (localtime)    = Fri Aug 23 14:53:27 2013 
>>>>>>> Ensembl API version = 70 
>>>>>>> 
>>>>>> We would need to see how your coord_system and meta tables are populated.
>>>>>> The API complains that it can't find the version of your assembly. Your coord_system table should look like this one:
>>>>>> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
>>>>>> | coord_system_id | species_id | name               | version   | rank | attrib                         |
>>>>>> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
>>>>>> |                             1 |                   1 | contig              | NULL      |       3 | default_version,sequence_level |
>>>>>> |                             2 |                   1 | scaffold           | oryCun2 |       2 | default_version                |
>>>>>> |                             3 |                   1 | chromosome | oryCun2 |       1 | default_version                |
>>>>>>> 
>>>>> This is my coord_system table:
>>>>> 
>>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>>> | coord_system_id | species_id | name       | version       | rank | attrib                         |
>>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>>> |               1 |          1 | chromosome | vchicken_test |    1 | default_version                |
>>>>> |               2 |          1 | contig     | vchicken_test |    3 | default_version,sequence_level |
>>>>> +-----------------+------------+------------+---------------+------+--------------------------------+
>>>>> 
>>>>> 
>>>>> I don't have a supercontig layer, since I am faking contigs from assembled sequences for testing purposes. I think I had that discussed over this mailing list as well and was told that the API code should be able to deal with a contig-chromosome setup. Anything suspicious here?
>>>> I'm not sure that this is the problem but you should change the rank of your contig, set it to 2. The API might be looking for the coordinate system with the rank 2 and fails to find it at the moment.
>>>> 
>>>>>>> ########## 
>>>>>>> 2) Unigene 
>>>>>>> ########## 
>>>>>>> 
>>>>>>> This one really bothers me I think everything is set up correctly (downloaded the unigene file, header seems to comply with the reference formatting in Blast.pm etc), bit I cannot for the life of me get it to work. Specifically, I am trying to use ncbi blast and the command just looks off - seems like it tries to do a mix of Wublast and Ncbi blast (works fine with Uniprot though - so perhaps something with the BlastGenscanDna module?). 
>>>>>>> 
>>>>>>> Running job 1791 
>>>>>>> Module is BlastGenscanDNA 
>>>>>>> Input id is contig:vchicken_test:10_68:1:50000:1 
>>>>>>> Analysis is unigene 
>>>>>>> Files are /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unigene.55.retry2.out /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unige$ 
>>>>>>> 
>>>>>>> -------------------- WARNING ---------------------- 
>>>>>>> MSG: Error running Blast cmd </usr/bin/blastall -d /data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa -i /tmp/seq.22305.24863.fa -cpus=1 2>&1 > /tmp/unigene.fa.22305.5651.blast.out>. Returned error 256 BLAST EXIT: '1', SIGNA$ 
>>>>>>> FILE: Analysis/Runnable/Blast.pm LINE: 380 
>>>>>>> CALLED BY: EnsEMBL/Analysis/Runnable.pm  LINE: 729 
>>>>>>> Date (localtime)    = Fri Aug 23 14:54:47 2013 
>>>>>>> Ensembl API version = 70 
>>>>>> Have you tried to run the command by itself to see if it works? The error message you have seems to be from the ncbi blast program.
>>>>>> As the module dies the temporary file containing your chicken sequence should still exists. If not, you will need to comment a line in the run method of ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Runnable.pm:
>>>>>> 
>>>>>>   #$self->delete_files;
>>>>>> 
>>>>>> You probably need to change your parameters in the analysis table of your reference database. We use WU blast at the moment.
>>>>>> 
>>>>>> Also, the parameters for blast should be "-cpus 1 -hitdist 40" instead of "-cpus => 1, -hitdist => 40"
>>>>>> 
>>>>>> Regards
>>>>>> Thibaut
>>>>>> 
>>>>> I think the problem is that the blastall string is mal-formatted. It should be
>>>>> 
>>>>> blastall -i input.fasta -d database -p blastn 
>>>>> 
>>>>> So it failed to determine which blast program to use. Interestingly, it works fine for protein-protein blast, but fails in this protein-dna configuration. Hence my question whether this may be a problem in the BlastGenscanDna module. I can try wublast also, but I think I had serious trouble getting that to work. Are you guys calling your executables wublastp, wublastn etc? Because the only thing I could find was blastp, blastn etc. I assume this would still work if I specify these binary names in the configs..? Gave up at some point because it keep whining about something, so went the ncbi route..
>>>> Maybe blast look at the database to know which type of search it will do by default.
>>>> For the moment you need to change the parameters and add '-p blastn' and it should work.
>>>> 
>>>> If your blastn is similar to blastall -p blastn then you can change your program to be blastn and you don't need to add '-p blastn' to your parameters
>>>>> 
>>>>> Oh and thanks for pointing out the parameter issue, I actually took those from the documentation, sooo... ;) But will update my scripts. 
>>>> Unfortunately the documentation available is a bit old and updating it take a lot of time.
>>>> 
>>>> Regards
>>>> Thibaut
>>>> 
>>>>> 
>>>>> All the best,
>>>>> 
>>>>> Marc
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> And here the config for the unigene search: 
>>>>>>> 
>>>>>>> [unigene] 
>>>>>>> db=unigene 
>>>>>>> db_file=/data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa 
>>>>>>> program=blastall 
>>>>>>> program_file=blastall 
>>>>>>> parameters=-cpus => 1, -hitdist => 40 
>>>>>>> module=BlastGenscanDNA 
>>>>>>> input_id_type=CONTIG 
>>>>>>> 
>>>>>>> (Blast.pm is configured to use 'ncbi' as default type, so unigene should inherit that, no?)
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Marc P. Hoeppner, PhD 
>>>>>>> Department of Medical Biochemistry and Microbiology 
>>>>>>> Uppsala University, Sweden 
>>>>>>> marc.hoeppner at imbim.uu.se 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>> 
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>> 
>> 
>> -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130829/b74a3ca1/attachment.html>


More information about the Dev mailing list