[ensembl-dev] too many jobs for my PairAlignement...
Zhang Di
aureliano.jz at gmail.com
Tue Nov 1 12:41:30 GMT 2011
Thank you, Thibaut.
I have 75k scaffolds because I didn't discard any contigs from the
ABySS-SSPACE pipeline. Now I realize that this is a problem.
Do you mean that zebrafish is a better projection reference for a fish
genome? I'll try it.
Stats of my genome:
all:
n:200   n:N50  min   median  mean    N50      max      sum
75084   164    200   256     7573    1010578  5780275  568.6e6
scaffolds longer than 1000bp:
n:1000  n:N50  min   median  mean    N50      max      sum
2602    154    1000  16652   210525  1052979  5780275  547.7e6
I think it is ok.
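To show why these numbers are consistent (the scaffolds longer than 1000bp hold 547.7e6 of the 568.6e6 bases, about 96%), here is a minimal Python sketch of the usual N50 definition. The function name n50 and the toy lengths are made up for illustration, and the assembly tool may compute its columns slightly differently.

```python
def n50(lengths):
    """Return (N50, number of sequences needed to reach half the total length).

    Usual definition: sort lengths from longest to shortest and walk down
    until the running sum reaches half of the total.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must be non-empty")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy data (hypothetical): a few long scaffolds plus many tiny ones.
toy = [10, 8, 5, 2, 1, 1, 1]
print(n50(toy))            # (8, 2)
# Adding more tiny sequences barely moves the N50.
print(n50(toy + [1] * 4))  # (8, 2)
```

Thousands of very short scaffolds add little to the total length, so they hardly change the N50 even though they dominate the scaffold count.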
Another question: should I run RepeatMasker for my core database before
the 2X alignment pipeline?
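
For reference, the job count from my first mail and the batch size advice quoted below can be sketched in Python. This is a simplified model that assumes one job pairs one reference chunk with one batch of scaffolds; it is not the actual ensembl-compara grouping logic, and the function names and the 900 s test timing are hypothetical.

```python
import math


def job_count(n_ref_chunks, n_scaffolds, batch_size=1):
    """Jobs needed if each job aligns one reference chunk against one
    batch of scaffolds (simplified assumption, not the pipeline's code)."""
    return n_ref_chunks * math.ceil(n_scaffolds / batch_size)


def suggested_batch_size(test_n_scaffolds, test_seconds, target_seconds=3600):
    """Scale a timed test run so that one job lasts about target_seconds."""
    return max(1, (test_n_scaffolds * target_seconds) // test_seconds)


# Numbers from this thread: ~2000 reference chunks, ~75000 scaffolds.
print(job_count(2000, 75000))         # 150000000
# Only the scaffolds longer than 1000bp.
print(job_count(2000, 2602))          # 5204000

# Hypothetical timing: a test with 1000 scaffolds took 900 seconds.
batch = suggested_batch_size(1000, 900)
print(batch)                          # 4000
print(job_count(2000, 75000, batch))  # 38000
```

The real timing has to come from a test run on the cluster, as suggested in the quoted reply.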
On Tue, Nov 1, 2011 at 7:48 PM, Thibaut Hourlier <th3 at sanger.ac.uk> wrote:
> Hi,
> Can you check the numbers you have for the assembly, because it seems a
> little bit odd that you have an N50 around 1M and only 2600 scaffolds
> longer than 1000bp while you have around 75k scaffolds.
> There are different things you can do:
> - you can use the zebrafish assembly, it might be far away in the
> taxonomy but the assembly and the annotation are more comprehensive, and
> they include models from RNA-Seq data and manual annotations.
> - as your scaffolds are small they shouldn't take a long time to run so
> you can use a high batch_size. Just test it with a few sequences, see how
> long it takes to run the analysis with 1000 scaffolds for example and
> adjust the batch size accordingly. In our environment we try to have jobs
> running for 1h when we have a lot of jobs.
>
> You should use the 2X alignment documentation for this step.
>
> Also, if your assembly is too fragmented, the pipeline will take a really
> long time for a result which probably won't be good. In that case it might
> be better to wait until you have more data.
>
> Regards
> Thibaut
>
>
> On Mon, 31 Oct 2011 14:29:43 +0800, Zhang Di <aureliano.jz at gmail.com>
> wrote:
> > Hi,
> > As described previously, I'm trying to run the low coverage annotation
> > pipeline for our Illumina GAII sequenced fish genome (~800 Mb).
> > The doc low_coverage_gene_build.txt tells me to prepare my own compara db,
> > so I went to ensembl-compara.
> > For my fish genome, I have ~75k scaffolds (length >= 200bp, N50 ~1M),
> > among which 2600 scaffolds are longer than 1000bp. My ref genome is
> > stickleback, and I followed the README-pairaligner doc.
> > As the ref genome has ~2000 chunks (size 1M), there will be 2000 x 75000
> > = 150M pairaligner jobs, too many to run at my institute.
> > Here are my questions:
> > 1. Should I use only the scaffolds longer than 1000bp?
> > 2. Am I following the right doc? Which doc should I read to produce an
> > alignment in which 'each bp in the target genome should be represented
> > at most once' (cited from low_coverage_gene_build.txt)? I don't quite
> > understand README-2xalignment and README-low-coverage-genome-aligner.
> >
> > Thank you
> >
> > Best regards
>
>
--
Zhang Di