[ensembl-dev] too many jobs for my PairAlignement...

Tue Nov 1 15:05:00 GMT 2011

Hi

On Tue, 1 Nov 2011 20:41:30 +0800, Zhang Di <aureliano.jz at gmail.com> wrote:
> Thank you, Thibaut.
> 
> I have 75k scaffolds because I don't discard any contig from the
> ABySS-SSPACE pipeline. Now I realize that it is a problem.
> You mean zebrafish is a better projection reference for a fish genome?
> I'll try it.

It depends. If your fish is really closer to stickleback than to zebrafish
in the taxonomic tree, using stickleback is more relevant. Otherwise, as
there is more data for zebrafish (proteins, cDNA, ...), zebrafish might be
an option.

> 
> stats of my genome:
> 
> all:
>   n:200   n:N50   min    median   mean     N50       max       sum
>   75084   164     200    256      7573     1010578   5780275   568.6e6
> 
> scaffolds longer than 1000bp:
>   n:1000  n:N50   min    median   mean     N50       max       sum
>   2602    154     1000   16652    210525   1052979   5780275   547.7e6
> 
> I think it is ok.

We do not assemble genomes, so I can't really tell. The numbers you have
seem a bit odd to me (for example, the median over all scaffolds compared
to the N50), but I don't have enough experience to give a good judgement.
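
For anyone who wants to recompute the summary columns quoted above from a
list of scaffold lengths, here is a minimal Python sketch. It assumes the
usual N50 definition (the length of the scaffold at which the running sum
of lengths, taken from longest to shortest, first reaches half of the
total); individual tools may differ in detail. The function name
`assembly_stats` and the toy lengths are illustrative only and do not come
from any assembler or from the Ensembl pipeline.

```python
from statistics import median


def assembly_stats(lengths, min_len=200):
    """Summarise scaffold lengths that are at least min_len bp long.

    Returns a dict with the scaffold count, the number of scaffolds
    needed to reach the N50, min, median, mean, N50, max and sum.
    """
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    if not kept:
        raise ValueError("no scaffold passes the length cutoff")

    total = sum(kept)
    running = 0
    for rank, length in enumerate(kept, start=1):
        running += length
        # First scaffold at which the cumulative length reaches half the total.
        if 2 * running >= total:
            n50, n_at_n50 = length, rank
            break

    return {
        "n": len(kept),
        "n_at_N50": n_at_n50,
        "min": kept[-1],
        "median": median(kept),
        "mean": total / len(kept),
        "N50": n50,
        "max": kept[0],
        "sum": total,
    }


# Toy data (hypothetical): many very short scaffolds and a few long ones.
toy_lengths = [200] * 8 + [1000, 5000]
print(assembly_stats(toy_lengths, min_len=200))
print(assembly_stats(toy_lengths, min_len=1000))
```

Because the N50 is weighted by length while the median is not, a set with
many very short scaffolds and a few long ones gives a small median next to
a large N50, as the toy data shows.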

> 
> Another question: should I run RepeatMasker on my core database before
> the 2X alignment pipeline?

Yes, it is a good idea to run RepeatMasker on your core database.

Regards
Thibaut

> 
> 
> On Tue, Nov 1, 2011 at 7:48 PM, Thibaut Hourlier <th3 at sanger.ac.uk> wrote:
> 
>> Hi,
>> Can you check the numbers you have for the assembly? It seems a little
>> bit odd that you have an N50 around 1M and only 2600 scaffolds longer
>> than 1000bp while you have around 75k scaffolds.
>> There are different things you can do:
>>  - you can use the zebrafish assembly. It might be far away in the
>>    taxonomy, but the assembly and the annotation are more comprehensive:
>>    they include models from RNA-Seq data and manual annotations.
>>  - as your scaffolds are small, they shouldn't take a long time to run,
>>    so you can use a high batch_size. Just test it with a few sequences:
>>    see how long it takes to run the analysis with 1000 scaffolds, for
>>    example, and adjust the batch size accordingly. In our environment we
>>    try to have jobs running for 1h when we have a lot of jobs.
>>
>> You should use the 2X alignment documentation for this step.
>>
>> Also, if your assembly is too fragmented, the pipeline will take a
>> really long time for a result which probably won't be good. In that case
>> it might be better to wait until you have more data.
>>
>> Regards
>> Thibaut
>>
>>
>> On Mon, 31 Oct 2011 14:29:43 +0800, Zhang Di <aureliano.jz at gmail.com> wrote:
>> > Hi,
>> >   As described previously, I'm trying to run the low coverage annotation
>> > pipeline for our Illumina GAII sequenced fish genome (~800 Mb).
>> >   The doc low_coverage_gene_build.txt tells me to prepare my own compara
>> > db, so I went to ensembl-compara.
>> >   For my fish genome, I have ~75k scaffolds (length >= 200bp, N50 ~1M),
>> > among which 2600 scaffolds are longer than 1000bp. My ref genome is
>> > stickleback, and I followed the README-pairaligner doc.
>> >   As the ref genome has ~2000 chunks (size 1M), there will be 2000 x 75000
>> > = 150M pairaligner jobs, too many to run in my institute.
>> >   Here are my questions:
>> >   1. Should I only use the scaffolds longer than 1000bp?
>> >   2. Am I following the right doc? Which doc should I read to produce
>> > an alignment in which 'each bp in the target genome should be represented
>> >   at most once' (cited from low_coverage_gene_build.txt)? I don't quite
>> > understand the README-2xalignment and README-low-coverage-genome-aligner.
>> >
>> > Thank you
>> >
>> > Best regards
>>
>>
>>
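
The job-count arithmetic and the batch-size advice in the thread above can
be sketched in a few lines of Python. This is only an illustration under
stated assumptions: it assumes one job aligns one reference chunk against
one batch of scaffolds, and that run time scales linearly with the number
of scaffolds in a batch. The function names and the 600-second test timing
are hypothetical; they are not part of the Ensembl pipeline or its
configuration.

```python
import math


def pairwise_jobs(ref_chunks, n_scaffolds, batch_size=1):
    """Number of jobs if each job aligns one reference chunk
    against one batch of scaffolds (assumed job model)."""
    return ref_chunks * math.ceil(n_scaffolds / batch_size)


def batch_size_for_target(test_scaffolds, test_seconds, target_seconds=3600):
    """Scale a timed test run linearly so one job lasts about target_seconds."""
    return max(1, int(target_seconds * test_scaffolds // test_seconds))


# Figures from the thread: ~2000 reference chunks, ~75000 scaffolds.
print(pairwise_jobs(2000, 75000))  # 150000000, the 150M quoted above
print(pairwise_jobs(2000, 2602))   # 5204000, scaffolds >= 1000bp only

# Hypothetical timing: a test run of 1000 scaffolds took 600 seconds.
batch = batch_size_for_target(1000, 600)
print(batch, pairwise_jobs(2000, 75000, batch_size=batch))
```

The measured time for a real test batch should replace the hypothetical
timing before choosing a batch size.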