[ensembl-dev] ensembl-funcgen SubmitPeaks error

Wed May 21 12:50:19 BST 2014

Hi Nathan,

Sorry, I was not clear enough. What I mean is that in my case the value 
for the replicate column is missing, even though the replicate number is 
present in the fastq directory structure, as discussed below.
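
A minimal sketch of how replicate numbers could be read back from that 
layout, assuming the convention described below (numbered subdirectories, 
each holding the fastq files for one replicate). The function name 
find_replicates is hypothetical and this is not the pipeline's own code; 
it only helps confirm that the directory structure is as expected.

```python
from pathlib import Path


def find_replicates(experiment_dir):
    """Map replicate number -> sorted fastq file names.

    Assumes (hypothetically) that each replicate lives in a subdirectory
    whose name is a plain integer, e.g. experiment_dir/1/reads.fastq.gz.
    """
    replicates = {}
    for sub in Path(experiment_dir).iterdir():
        # Skip files and any directory that is not named with a number.
        if sub.is_dir() and sub.name.isdecimal():
            files = sorted(p.name for p in sub.glob("*.fastq.gz"))
            if files:
                replicates[int(sub.name)] = files
    return replicates
```

For the single-replicate layout quoted below 
($WD/fastq/sus_scrofa/Xiao/piPSC_H2AZ/1/SRR414964.fastq.gz), calling 
find_replicates on the piPSC_H2AZ directory would return 
{1: ["SRR414964.fastq.gz"]}.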

The error message I got via beekeeper suggests that the new 
InputSet for the given peak analysis cannot be added to my funcgen DB 
because of this missing piece of information. See the error message:

~~~~~~~~~~~~~~~~~~~~~~~~~~
Storing new InputSet:   piPSC_H3K27me3_Xiao
DBD::mysql::st execute failed: Column 'replicate' cannot be null at 
/groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm 
line 380.

job 1 : died in status 'RUN' for the following reason: DBD::mysql::st 
execute failed: Column 'replicate' cannot be null at 
/groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm 
line 380.
~~~~~~~~~~~~~~~~~~~~~~~~~~

When I query the content of lel_sus_scrofa_funcgen_72_102.input_set, it 
returns an empty set. Should the replicate information be fetched 
by the AddPeakDataSets call, or in the following step, when beekeeper is 
called?
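
One way to see why the table would stay empty: a NOT NULL column rejects 
the whole INSERT, so nothing is stored. The sketch below uses Python's 
sqlite3 as a stand-in for MySQL; the table input_set_demo and its two 
columns are a simplified, hypothetical stand-in, not the real funcgen 
schema. The value 0 in the second call follows the "merged" shorthand 
mentioned in the quoted reply below.

```python
import sqlite3


def try_store(conn, name, replicate):
    """Try to insert one row; return True if stored, False if rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO input_set_demo (name, replicate) VALUES (?, ?)",
                (name, replicate),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the NOT NULL constraint on 'replicate' is violated.
        return False


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM input_set_demo").fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified table: only the constraint matters here.
    conn.execute(
        "CREATE TABLE input_set_demo ("
        "name TEXT NOT NULL, replicate INTEGER NOT NULL)"
    )
    stored_null = try_store(conn, "piPSC_H3K27me3_Xiao", None)
    rows_after_null = count_rows(conn)
    stored_zero = try_store(conn, "piPSC_H3K27me3_Xiao", 0)
    rows_after_zero = count_rows(conn)
    conn.close()
    return stored_null, rows_after_null, stored_zero, rows_after_zero
```

Calling demo() returns (False, 0, True, 1): the insert with a missing 
replicate is rejected and leaves the table empty, while an explicit 0 is 
accepted.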

Thanks,
Lel

On 05/21/2014 11:57 AM, njohnson wrote:
> Hi Lel
>
> The pipeline is getting the replicate numbers; it's just not doing an awful lot with them. They are expected in the old hive RunnableDB/SetUpAlignments.pm module, but we moved away from this before we started doing anything useful with the replicate specifications. Hence you will find that all the outputs are merged, and the input_subset records in the DB will reference merged sam/bam files with a replicate value of 0, which is our shorthand for merged.
>
>
> Nathan Johnson
>
> Ensembl Regulation
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
>
> http://www.ensembl.info/
> http://twitter.com/#!/ensembl
> https://www.facebook.com/Ensembl.org
>
> On 21 May 2014, at 11:27, Lel Eory <lel.eory at roslin.ed.ac.uk> wrote:
>
>> Hi Nathan,
>>
>> So why is the pipeline not getting the rep numbers? Can you give me any pointers on what I should check? (I know it is a pain as v72 was released such a long time ago, sorry about that.)
>>
>> (Re alignment file names: I mistyped. I have two sets of alignment files, sam and bam, ending in .samse.sam.gz and .samse.bam.)
>>
>> Thanks,
>> Lel
>>
>> On 05/21/2014 09:42 AM, njohnson wrote:
>>> Hi Lel
>>>
>>> You don't need to append the rep number to the file name; that was just my example file name. With respect to the alignments, this also looks good and you are right: the old pipeline only uses the replicate numbers during the alignment step and merges all replicates, so you get a single alignment file named after the experiment. I would, however, expect the alignment file to have a bam or sam suffix.
>>>
>>> Nathan Johnson
>>>
>>> On 20 May 2014, at 16:37, Lel Eory <lel.eory at roslin.ed.ac.uk> wrote:
>>>
>>>> Hi Nathan,
>>>>
>>>> In this case I have the numbered directories (.../1/...) similar to the example you gave e.g.:
>>>>
>>>> $WD/fastq/sus_scrofa/Xiao/piPSC_H2AZ/1/SRR414964.fastq.gz
>>>>
>>>>
>>>> The only difference is that there is no '_1' appended to the name of the fastq files. E.g. the name I have is SRR414964.fastq.gz instead of SRR414964_1.fastq.gz. Do I need to append '_1'?
>>>>
>>>> After I run the raw read alignment steps (AddAlignmentDataSets & SubmitAlignments), the alignments are stored by default in the alignments folder, but without the replicate numbers in the folder names, e.g.
>>>> $WD/alignments/sus_scrofa/Sscrofa10.2/Xiao/piPSC_H2AZ_Xiao.samse.gz. Is this correct? Does the code fetch the replicate numbers from the $WD/fastq folder names or from the $WD/alignments folder names?
>>>> (I assume that the numbers are missing from $WD/alignments because the mapped datasets are already merged in the alignments folder?)
>>>>
>>>> (Looking forward to trying the new funcgen pipeline with release 76!)
>>>>
>>>> Thanks,
>>>> Lel
>>>>
>>>> On 05/20/2014 03:46 PM, njohnson wrote:
>>>>> Hi Lel
>>>>>
>>>>> In this version of the pipeline, the replicate definition was done by the subdirectory naming. In the experiment directory, you would need to create numbered directories, each containing the relevant fastq file, e.g.
>>>>>
>>>>> 	experiment_dir/1/replicate_1.fastq.gz
>>>>> 	experiment_dir/2/replicate_2.fastq.gz
>>>>>
>>>>> This might be your problem?
>>>>>
>>>>> We have since moved away from this in favour of a tracking DB, which is much richer in metadata. We are currently finishing development of this in parallel with an entirely new analysis pipeline, which will be used for release 76. (I know you know this, Lel, but for others out there.)
>>>>>
>>>>>
>>>>> Nathan Johnson
>>>>>
>>>>> On 20 May 2014, at 15:03, Lel Eory <lel.eory at roslin.ed.ac.uk> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I am trying to run some ChIP-seq analyses with the ensembl-functgenomics pipeline (version 72) using ensembl-hive (version lg4_pre_rel72_20130423).
>>>>>> In the efg sequencing environment I have added the peak datasets with AddPeakDataSets and am trying to run beekeeper.pl to set up the peak analysis pipelines.
>>>>>> However, the setup_pipeline step fails with the following (the full beekeeper.pl output is at the end of this e-mail):
>>>>>>
>>>>>> beekeeper.pl -url $DBURL/leory2_peaks_lel_sus_scrofa_funcgen_72_102 -hive_log_dir hive_log_dir -run
>>>>>>
>>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> Storing new InputSet:   piPSC_H3K27me3_Xiao
>>>>>> DBD::mysql::st execute failed: Column 'replicate' cannot be null at /groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm line 380.
>>>>>>
>>>>>> job 1 : died in status 'RUN' for the following reason: DBD::mysql::st execute failed: Column 'replicate' cannot be null at /groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm line 380.
>>>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>
>>>>>> Where should I specify the number of replicates (in this case it is just 1) for setup_pipeline within the peak analysis step, or where does setup_pipeline get this value from?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Kind regards,
>>>>>> Lel
>>>>>>
>>>>>>
>>>>>>
>>>>>> beekeeper.pl -url $DBURL/leory2_peaks_lel_sus_scrofa_funcgen_72_102 -hive_log_dir hive_log_dir -run
>>>>>>
>>>>>> ~~~~~~~~~~~~~ beekeeper.pl output ~~~~~~~~~~~~~
>>>>>>
>>>>>>        ======= beekeeper loop ** 1 **==========
>>>>>>       GarbageCollector:       Checking for lost Workers...
>>>>>>       GarbageCollector:       [Queen:] we have 0 Workers alive.
>>>>>>       setup_pipeline             ( 1)     LOADING jobs(Sem:0, Rdy:3, InProg:0, Done+Pass:0, Fail:0)=3 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:1  a.cap:-  (sync'd 1400493652 sec ago)
>>>>>>       run_peaks                  ( 2)       EMPTY jobs(Sem:0, Rdy:0, InProg:0, Done+Pass:0, Fail:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 1400493652 sec ago)
>>>>>>       run_peaks_wide             ( 3)       EMPTY jobs(Sem:0, Rdy:0, InProg:0, Done+Pass:0, Fail:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 1400493652 sec ago)
>>>>>>       run_macs                   ( 4)       EMPTY jobs(Sem:0, Rdy:0, InProg:0, Done+Pass:0, Fail:0)=0 Ave_msec:0, workers(Running:0, Reqired:0)   h.cap:10  a.cap:-  (sync'd 1400493652 sec ago)
>>>>>>
>>>>>>         ===== Stats of live Workers according to the Queen: ======
>>>>>>                ======= TOTAL ======= : 0 workers
>>>>>>
>>>>>>       setup_pipeline             ( 1)       READY jobs(Sem:0, Rdy:3, InProg:0, Done+Pass:0, Fail:0)=3 Ave_msec:0, workers(Running:0, Reqired:3)   h.cap:1  a.cap:-  (sync'd 0 sec ago)
>>>>>>       Before checking the Valley for pending jobs, Scheduler allocated 1 x LOCAL:default extra workers for 'setup_pipeline' [0.0000 hive_load remaining]
>>>>>>       Scheduler is going to submit 1 x LOCAL:default workers
>>>>>>       Submitting 1 workers (rc_name=default) to LOCAL/ris-lx10
>>>>>>       SUBMITTING_CMD:         runWorker.pl -url '$DBURL/leory2_peaks_lel_sus_scrofa_funcgen_72_102' -rc_name default &
>>>>>>       hive 0.000% complete (< 0.000 CPU_hrs) (3 todo + 0 done + 0 failed = 3 total)
>>>>>>       The Beekeeper has stopped because the number of loops was limited by 1 and this limit expired
>>>>>>       dbc 0 disconnect cycles
>>>>>>       Queen picked analysis with dbID=1 for the worker
>>>>>>       Worker: meadow=LOCAL/ris-lx10, process=35314 at ris-lx10.roslin.ed.ac.uk, resource_class_id=1, last_check_in=2014-05-19 11:00:52, analysis=setup_pipeline(1)
>>>>>>               batch_size = 1
>>>>>>               life_span  = 3600
>>>>>>               worker_log_dir = STDOUT/STDERR
>>>>>>       Setting name at /groups2/avian_genomes/software/src/ensembl/ens72/ensembl/modules/Bio/EnsEMBL/Utils/ConfigRegistry.pm line 344.
>>>>>>       :: Auto-selecting build 102 core DB as: anonymous at sus_scrofa_core_75_102:ensembldb.ensembl.org:5306
>>>>>>       ParamWarning: value for param('set_name') is used before having been initialized!
>>>>>>       ParamWarning: value for param('group') is used before having been initialized!
>>>>>>       ParamWarning: value for param('input_dir') is used before having been initialized!
>>>>>>       ParamWarning: value for param('data_file') is used before having been initialized!
>>>>>>
>>>>>>       ------------------ DEPRECATED ---------------------
>>>>>>       Deprecated method call in file /groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/RunnableDB/SetupPeaksPipeline.pm line 41.
>>>>>>       Method Bio::EnsEMBL::Funcgen::DBSQL::DBAdaptor::fetch_group_details is deprecated.
>>>>>>       Please use ExperimentalGroupAdaptor
>>>>>>       Ensembl API version = 72
>>>>>>       ---------------------------------------------------
>>>>>>       Preprocess cmd: gzip -dc /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/alignments/sus_scrofa/Sscrofa10.2/Xiao/piPSC_Pig-IgG_Xiao.samse.sam.gz | grep -vE '^[^[:space:]]+[[:blank:]][^[:space:]]+[[:blank:]][^[:space:]]+:[^[:space:]]+:MT:' | grep -v '^MT' | grep -v '^chrM' | /groups2/avian_genomes/software/bin/ensembl-funcgen/samtools view -uSh -t /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/sam_header/sus_scrofa/sus_scrofa_male_Sscrofa10.2_unmasked.fasta.fai -F 4 - | /groups2/avian_genomes/software/bin/ensembl-funcgen/samtools sort - /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/alignments/sus_scrofa/Sscrofa10.2/Xiao/piPSC_Pig-IgG_Xiao.samse.sam.gz_tmp ; /groups2/avian_genomes/software/bin/ensembl-funcgen/samtools rmdup -s /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/alignments/sus_scrofa/Sscrofa10.2/Xiao/piPSC_Pig-IgG_Xiao.samse.sam.gz_tmp.bam - | /groups2/avian_genomes/software/bin/ensembl-funcgen/samtools view -h - | gzip -c > /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/output/lel_sus_scrofa_funcgen_72_102/peaks/results/Xiao/piPSC_Pig-IgG_Xiao.samse.sam.gz ; rm -f /groups2/pig_project/ensembl_funcgen/xiao_chip_seq/alignments/sus_scrofa/Sscrofa10.2/Xiao/piPSC_Pig-IgG_Xiao.samse.sam.gz_tmp.bam
>>>>>>       Storing new InputSet:   piPSC_H3K27me3_Xiao
>>>>>>       DBD::mysql::st execute failed: Column 'replicate' cannot be null at /groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm line 380.
>>>>>>
>>>>>>       job 1 : died in status 'RUN' for the following reason: DBD::mysql::st execute failed: Column 'replicate' cannot be null at /groups2/avian_genomes/software/src/ensembl/ens72/ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/InputSetAdaptor.pm line 380.
>>>>>>
>>>>>> -- 
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
