[ensembl-dev] too many jobs for my PairAlignement...

Zhang Di aureliano.jz at gmail.com
Tue Nov 1 12:41:30 GMT 2011


Thank you, Thibaut.

I have 75k scaffolds because I didn't discard any contigs from the
ABySS-SSPACE pipeline. Now I realize that this is a problem.
Do you mean that zebrafish is a better projection reference for a fish
genome? I'll try it.
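
A rough sketch of the arithmetic behind the job count and the batch-size
advice quoted below. The chunk and scaffold counts come from this thread; the
function names and the 600-second test timing are hypothetical, and the batch
estimate assumes runtime grows linearly with the number of sequences, which
may not hold for a real pipeline.

```python
import math


def pair_count(ref_chunks: int, scaffolds: int) -> int:
    """Chunk-vs-scaffold pairs if every pair becomes its own job."""
    return ref_chunks * scaffolds


def batch_size_for_target(test_sequences: int, test_seconds: float,
                          target_seconds: float = 3600.0) -> int:
    """Sequences per job so that one job runs for about target_seconds.

    Assumption: runtime scales linearly with the number of sequences.
    """
    if test_sequences <= 0 or test_seconds <= 0:
        raise ValueError("test run must have positive size and duration")
    return max(1, math.floor(target_seconds * test_sequences / test_seconds))


REF_CHUNKS = 2000  # ~2000 reference chunks of 1M, as stated in the thread

print(pair_count(REF_CHUNKS, 75084))  # all scaffolds >= 200bp
print(pair_count(REF_CHUNKS, 2602))   # only scaffolds >= 1000bp

# Hypothetical timing: a test run of 1000 scaffolds took 600 seconds.
print(batch_size_for_target(1000, 600.0))
```

With these inputs, keeping only scaffolds of at least 1000bp cuts the pair
count from about 150M to about 5.2M before any batching is applied.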

stats of my genome:

all:
  n:200   n:N50  min   median  mean    N50      max      sum
  75084   164    200   256     7573    1010578  5780275  568.6e6

scaffolds longer than 1000bp:
  n:1000  n:N50  min   median  mean    N50      max      sum
  2602    154    1000  16652   210525  1052979  5780275  547.7e6

I think it is ok.
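
To re-check figures like these independently, here is a minimal sketch that
computes N50 from a list of scaffold lengths, using the common definition (the
length at which the running total of lengths, sorted longest first, reaches
half of the assembly). The function name and the toy lengths are illustrative
only, and the tool that produced the table above may define N50 slightly
differently.

```python
def n50(lengths):
    """Return (N50, count) for a list of sequence lengths.

    N50 is the length of the sequence at which the running total, taken
    from longest to shortest, first reaches half of the summed length.
    count is how many sequences were needed to get there.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("no lengths given")


# Toy data, not the real assembly.
lengths = [100, 200, 300, 400, 500]
print(n50(lengths))

# Same statistic after applying a minimum-length cutoff of 300.
print(n50([x for x in lengths if x >= 300]))
```

Dropping short scaffolds removes many sequences but little total length, which
is why N50 barely changes between the two rows of the table.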

Another question: should I run RepeatMasker on my core database before
the 2X alignment pipeline?


On Tue, Nov 1, 2011 at 7:48 PM, Thibaut Hourlier <th3 at sanger.ac.uk> wrote:

> Hi,
> Can you check the numbers you have for the assembly? It seems a
> little bit odd that you have an N50 around 1M and only 2600 scaffolds
> longer than 1000bp while you have around 75k scaffolds.
> There are different things you can do:
>  - you can use the zebrafish assembly. It might be far away in the
> taxonomy, but the assembly and the annotation are more comprehensive; the
> annotation includes models from RNA-Seq data and manual annotations.
>  - as your scaffolds are small they shouldn't take a long time to run, so
> you can use a high batch_size. Just test it with a few sequences: see how
> long it takes to run the analysis with 1000 scaffolds, for example, and
> adjust the batch size accordingly. In our environment we try to have jobs
> running for 1h when we have a lot of jobs.
>
> You should use the 2X alignment documentation for this step.
>
> Also, if your assembly is too fragmented, the pipeline will take a really
> long time for a result which probably won't be good. In that case it might
> be better to wait until you have more data.
>
> Regards
> Thibaut
>
>
> On Mon, 31 Oct 2011 14:29:43 +0800, Zhang Di <aureliano.jz at gmail.com>
> wrote:
> > Hi,
> >   As described previously, I'm trying to run the low coverage annotation
> > pipeline for our Illumina GAII sequenced fish genome (~800m).
> >   The doc low_coverage_gene_build.txt tells me to prepare my own compara
> > db, so I went to ensembl-compara.
> >   For my fish genome, I have ~75k scaffolds (length >= 200bp, N50 ~1M),
> > among which 2600 scaffolds are longer than 1000bp. My ref genome is
> > stickleback, and I followed the README-pairaligner doc.
> >   As the ref genome has ~2000 chunks (size 1M), there will be 2000 x 75000
> > = 150M pairaligner jobs, too many to run in my institute.
> >   Here are my questions:
> >   1. Should I only use the scaffolds longer than 1000bp?
> >   2. Am I following the right doc? Which doc should I read to produce an
> > alignment such that 'each bp in the target genome should be represented
> >   at most once' (cited from low_coverage_gene_build.txt)? I don't quite
> > understand the README-2xalignment and README-low-coverage-genome-aligner.
> >
> > Thank you
> >
> > Best regards
>
>



-- 
Zhang Di