[ensembl-dev] too many jobs for my PairAlignement...

Thibaut Hourlier th3 at sanger.ac.uk
Tue Nov 1 11:48:23 GMT 2011


Hi,
Can you check the numbers you have for the assembly? It seems a
little bit odd that you have an N50 around 1M and only 2600 scaffolds longer
than 1000bp while you have around 75k scaffolds.
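
One way to double-check those figures is to recompute them directly from the list of scaffold lengths. The sketch below is only an illustration: the function names are made up for this example, the lengths are toy values rather than real data, and extracting the lengths from your FASTA file is left out.

```python
def n50(lengths):
    """Return the N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def count_longer_than(lengths, threshold):
    """Count scaffolds strictly longer than `threshold` bp."""
    return sum(1 for length in lengths if length > threshold)


# Toy lengths (hypothetical, not taken from any real assembly).
lengths = [10, 8, 6, 4, 2]
print(n50(lengths))                   # 8
print(count_longer_than(lengths, 5))  # 3
```

If the N50 really is around 1M, a fairly small number of long scaffolds must hold half of the assembly, so the counts at 1000bp and above are worth recomputing alongside it.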
There are different things you can do:
 - you can use the zebrafish assembly. It might be far away in the
taxonomy, but the assembly and the annotation are more comprehensive: they
include models from RNA-Seq data and manual annotations.
 - as your scaffolds are small, they shouldn't take a long time to run, so
you can use a high batch_size. Just test it with a few sequences: see how
long it takes to run the analysis with 1000 scaffolds, for example, and
adjust the batch size accordingly. In our environment we try to have jobs
running for 1h when we have a lot of jobs.
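
The batch-size adjustment is simple arithmetic, which can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the timing (1000 scaffolds in 600 seconds) is an invented example and not a measurement, and run time is assumed to scale linearly with the number of scaffolds, which will not hold exactly if scaffold lengths vary a lot.

```python
import math


def suggest_batch_size(test_scaffolds, test_seconds, target_seconds=3600):
    """Scale a timed test run up to a batch that should take roughly
    `target_seconds`, assuming run time grows linearly with the number
    of scaffolds."""
    if test_scaffolds <= 0 or test_seconds <= 0:
        raise ValueError("test run must have positive size and duration")
    return max(1, int(target_seconds * test_scaffolds // test_seconds))


def batches_needed(total_scaffolds, batch_size):
    """Number of batches needed to cover all scaffolds."""
    return math.ceil(total_scaffolds / batch_size)


# Invented timing: a test run of 1000 scaffolds took 600 seconds.
batch = suggest_batch_size(test_scaffolds=1000, test_seconds=600)
print(batch)                         # 6000
print(batches_needed(75000, batch))  # 13
```

It is worth re-timing one full batch at the suggested size before submitting everything, since a few long scaffolds can dominate the run time.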

You should use the 2X alignment documentation for this step.

Also, if your assembly is too fragmented, the pipeline will take a really
long time for a result which probably won't be good. In that case it might
be better to wait until you have more data.

Regards
Thibaut


On Mon, 31 Oct 2011 14:29:43 +0800, Zhang Di <aureliano.jz at gmail.com>
wrote:
> Hi,
>   As described previously, I'm trying to run the low coverage annotation
> pipeline for our Illumina GAII sequenced fish genome (~800m).
>   The doc low_coverage_gene_build.txt tells me to prepare my own compara
> db, so I go to ensembl-compara.
>   For my fish genome, I have ~75k scaffolds (length >= 200bp, N50 ~1M),
> among which 2600 scaffolds are longer than 1000bp. my ref genome is
> stickleback, and I followed the README-pairaligner doc.
>   As the ref genome has ~2000 chunks (size 1M), there will be 2000 X
> 75000 = 150M pairaligner jobs. too many to run in my institute.
>   here are my questions:
>   1. should I only use these scaffolds longer than 1000bp?
>   2. am I following the right doc? Which doc should I read to produce
> such an alignment that: 'each bp in the target genome should be
> represented at most once' (cited from low_coverage_gene_build.txt). I
> don't quite understand the README-2xalignment and
> README-low-coverage-genome-aligner.
> 
> Thank you
> 
> Best regards



