[ensembl-dev] Gene build pipeline - 2 issues

Tue Aug 27 13:53:17 BST 2013

Hi Thibaut,

thanks for the feedback. Answers to your comments in line:

> Hi Marc,
>
> On 26 Aug 2013, at 10:59, Marc Hoeppner <mphoeppner at gmail.com 
> <mailto:mphoeppner at gmail.com>> wrote:
>
>> Hi EnsEMBL team,
>>
>> been playing with the pipeline again, but am having problems (again). 
>> Please see below for details - am happy about any suggestions.
>>
>> Cheers,
>>
>> Marc
>>
>> ########
>> 1) Pmatch
>> ########
>>
>> I set up a pmatch analysis as by the documentation and it runs fine 
>> on my test dataset (small chicken chromosome) when I try it with 
>> test_RunnableDB. However, when I run the pipeline, I get this:
>>
>> TARGET  0.064u 0.008s 0+0k 0pf 0sw
>> BUILD   0.116u 0.040s 0+0k 0pf 0sw
>> SEARCH  22.949u 0.172s 0+0k 0pf 0sw
>> WARN: For multiple species use species attribute in DBAdaptor->new()
>> WRITING: Lost the will to live Error
>> Job 1198 failed: [
>> -------------------- EXCEPTION --------------------
>> MSG: Problems for Pmatch writing output for 
>> chromosome:vchicken_test:10:1:19911089:1 [Can't call method "version" 
>> on an undefined value at 
>> /opt/bioinformatics/ensembl-70/ensembl/modules/Bio/EnsEMBL/DBSQL/MetaContainer.pm 
>> line 218.
>> ]
>> STACK Bio::EnsEMBL::Pipeline::Job::run_module 
>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Job.pm:720
>> STACK (eval) 
>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:219
>> STACK main::run_jobs_with_lsfcopy 
>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:218
>> STACK toplevel 
>> /opt/bioinformatics/ensembl-70/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/runner.pl:128
>> Date (localtime)    = Fri Aug 23 14:53:27 2013
>> Ensembl API version = 70
>>
> We would need to see how your coord_system and meta tables are populated.
> The API complains that it can't find the version of your assembly. 
> Your coord_system table should look like this one:
> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
> | coord_system_id | species_id | name               | version   | rank 
> | attrib                         |
> +-----------------------+----------------+------------------+------------+-------+--------------------------------+
> |                             1 |                   1 | contig         
>      | NULL      |       3 | default_version,sequence_level |
> |                             2 |                   1 | scaffold       
>     | oryCun2 |       2 | default_version            |
> |                             3 |                   1 | chromosome | 
> oryCun2 |       1 | default_version    |
>>
This is my coord_system table:

+-----------------+------------+------------+---------------+------+--------------------------------+
| coord_system_id | species_id | name       | version       | rank | 
attrib                         |
+-----------------+------------+------------+---------------+------+--------------------------------+
|               1 |          1 | chromosome | vchicken_test |    1 | 
default_version                |
|               2 |          1 | contig     | vchicken_test |    3 | 
default_version,sequence_level |
+-----------------+------------+------------+---------------+------+--------------------------------+

I don't have a supercontig layer, since I am faking contigs from 
assembled sequences for testing purposes. I think I had that discussed 
over this mailing list as well and was told that the API code should be 
able to deal with a contig-chromosome setup. Anything suspicious here?
>> ##########
>> 2) Unigene
>> ##########
>>
>> This one really bothers me I think everything is set up correctly 
>> (downloaded the unigene file, header seems to comply with the 
>> reference formatting in Blast.pm etc), bit I cannot for the life of 
>> me get it to work. Specifically, I am trying to use ncbi blast and 
>> the command just looks off - seems like it tries to do a mix of 
>> Wublast and Ncbi blast (works fine with Uniprot though - so perhaps 
>> something with the BlastGenscanDna module?).
>>
>> Running job 1791
>> Module is BlastGenscanDNA
>> Input id is contig:vchicken_test:10_68:1:50000:1
>> Analysis is unigene
>> Files are 
>> /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unigene.55.retry2.out 
>> /data2/projects/annotation/EnsEMBL/chicken/output//unigene/0/contig:vchicken_test:10_114:1:50000:1.unige$ 
>>
>>
>> -------------------- WARNING ----------------------
>> MSG: Error running Blast cmd </usr/bin/blastall -d 
>> /data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa -i 
>> /tmp/seq.22305.24863.fa -cpus=1 2>&1 > 
>> /tmp/unigene.fa.22305.5651.blast.out>. Returned error 256 BLAST EXIT: 
>> '1', SIGNA$
>> FILE: Analysis/Runnable/Blast.pm LINE: 380
>> CALLED BY: EnsEMBL/Analysis/Runnable.pm  LINE: 729
>> Date (localtime)    = Fri Aug 23 14:54:47 2013
>> Ensembl API version = 70
> Have you tried to run the command by itself to see if it works? The 
> error message you have seems to be from the ncbi blast program.
> As the module dies the temporary file containing your chicken sequence 
> should still exists. If not, you will need to comment a line in the 
> run method of ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Runnable.pm:
>
>   #$self->delete_files;
>
> You probably need to change your parameters in the analysis table of 
> your reference database. We use WU blast at the moment.
>
> Also, the parameters for blast should be "-cpus 1 -hitdist 40" instead 
> of "-cpus => 1, -hitdist => 40"
>
> Regards
> Thibaut
>
I think the problem is that the blastall string is mal-formatted. It 
should be

blastall -i input.fasta -d database -p blastn

So it failed to determine which blast program to use. Interestingly, it 
works fine for protein-protein blast, but fails in this protein-dna 
configuration. Hence my question whether this may be a problem in the 
BlastGenscanDna module. I can try wublast also, but I think I had 
serious trouble getting that to work. Are you guys calling your 
executables wublastp, wublastn etc? Because the only thing I could find 
was blastp, blastn etc. I assume this would still work if I specify 
these binary names in the configs..? Gave up at some point because it 
keep whining about something, so went the ncbi route..

Oh and thanks for pointing out the parameter issue, I actually took 
those from the documentation, sooo... ;) But will update my scripts.

All the best,

Marc

>>
>> And here the config for the unigene search:
>>
>> [unigene]
>> db=unigene
>> db_file=/data2/projects/annotation/EnsEMBL/chicken/refseqs/unigene.fa
>> program=blastall
>> program_file=blastall
>> parameters=-cpus => 1, -hitdist => 40
>> module=BlastGenscanDNA
>> input_id_type=CONTIG
>>
>> (Blast.pm is configured to use 'ncbi' as default type, so unigene 
>> should inherit that, no?)
>>
>>
>> -- 
>> Marc P. Hoeppner, PhD
>> Department of Medical Biochemistry and Microbiology
>> Uppsala University, Sweden
>> marc.hoeppner at imbim.uu.se
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org <mailto:Dev at ensembl.org>
>> Posting guidelines and subscribe/unsubscribe info: 
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20130827/cedb8009/attachment.html>