[ensembl-dev] whole chromosome alignment with LastZ

Matthieu Muffato muffato at ebi.ac.uk
Fri Mar 27 17:05:05 GMT 2020


Dear Alice,

This post-processing is required to handle alignment blocks that are 
present twice in the output file because the input regions are split 
with some overlap. It simply consists of getting the coordinates of 
every block and removing the extra copies of blocks that have the same 
coordinates.
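
As an illustration only (this is not the Ensembl code), the idea could be 
sketched in Python as below. It assumes each block has already been parsed 
into target/query names, start/end positions and a strand. The names Block, 
to_full_coords and dedupe_blocks are hypothetical. It also assumes that 
coordinates reported relative to a chunk are first shifted back to 
full-sequence coordinates, so that the two copies of a block compare equal.

```python
from typing import Iterable, List, NamedTuple


class Block(NamedTuple):
    """Coordinates of one alignment block (hypothetical layout)."""
    target: str
    t_start: int
    t_end: int
    query: str
    q_start: int
    q_end: int
    strand: str


def to_full_coords(block: Block, t_offset: int, q_offset: int) -> Block:
    """Shift chunk-relative coordinates back to full-sequence coordinates.

    Assumption: coordinates are counted along the forward strand, so a
    plain offset is enough. Reverse-strand conventions would need extra care.
    """
    return block._replace(
        t_start=block.t_start + t_offset,
        t_end=block.t_end + t_offset,
        q_start=block.q_start + q_offset,
        q_end=block.q_end + q_offset,
    )


def dedupe_blocks(blocks: Iterable[Block]) -> List[Block]:
    """Keep the first copy of each block, drop later ones with the same coordinates."""
    seen = set()
    unique = []
    for block in blocks:
        if block not in seen:
            seen.add(block)
            unique.append(block)
    return unique


if __name__ == "__main__":
    # Made-up example: chunk B starts at position 1000 of the target, so the
    # block at 1200-1400 in chunk A shows up as 200-400 in chunk B.
    chunk_a = [
        Block("T", 100, 250, "Q", 120, 270, "+"),
        Block("T", 1200, 1400, "Q", 1250, 1450, "+"),
    ]
    chunk_b = [
        Block("T", 200, 400, "Q", 1250, 1450, "+"),
        Block("T", 900, 1000, "Q", 1900, 2000, "+"),
    ]
    merged = [to_full_coords(b, 0, 0) for b in chunk_a]
    merged += [to_full_coords(b, 1000, 0) for b in chunk_b]
    for block in dedupe_blocks(merged):
        print(block)
```

This only removes exact duplicates. Blocks that are truncated at a chunk 
boundary and therefore differ between the two chunks are not handled by 
this sketch.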

We have computed an alignment of /Triticum turgidum subsp. durum/ vs 
/Triticum dicoccoides/ and it will be released in the next version of 
Ensembl (due in a couple of weeks I believe). I have attached a 
screenshot of the alignment statistics.

Best regards,
Matthieu

On 27/03/2020 10:29, Alice Iob wrote:
> Dear Matthieu,
>
> Thank you very much for your swift reply.
>
> By post-processing, do you mean generating a consensus alignment from 
> SAM files (if I split the query, align it and then set the output to 
> SAM), or something else?
>
> Also, I was wondering if you are planning to release a whole genome 
> alignment of /Triticum durum/ and /T. dicoccoides/ in the future.
>
> Thank you for your help.
>
> Alice Iob
> PhD student
> Plant and Animal Genomics Program
> CRAG, Centre for Research in Agricultural Genomics
> Campus UAB - CRAG Building | 08193 Cerdanyola | BARCELONA
> Office: 3.01
> Tel. +34 935636600 ext 3351
> ------------------------------------------------------------------------
> *From:* Matthieu Muffato [muffato at ebi.ac.uk]
> *Sent:* Thursday 26 March 2020 10:10
> *To:* Ensembl developers list; Alice Iob
> *Subject:* Re: [ensembl-dev] whole chromosome alignment with LastZ
>
> Dear Alice,
>
> I've seen that you have sent this question to the lastz-users list and 
> Bob replied with some tips.
>
> We don't get this error because we have engineered a workflow 
> around LastZ that does what Bob suggests, i.e. the masking, the 
> splitting with overlap and its post-processing, and some chaining.
>
> It's not easily usable by external people, though, because it heavily 
> relies on an infrastructure that's built in Ensembl. But Bob gave some 
> options to run those steps by yourself, which should address the issue.
>
> Best,
> Matthieu
>
> On 25/03/2020 11:36, Alice Iob wrote:
>> Good morning,
>>
>> I am a PhD student working on plant genomics and I am struggling with 
>> an issue regarding LastZ.
>> I chose to use LastZ because it was used for plant genome alignments 
>> in Ensembl Plants, so I hope that someone who has successfully used it 
>> before can help me with this.
>> I am trying to align two reference genomes from very closely related 
>> species: I have two FASTA files, 
>> representing the same chromosome in the two species, each around 
>> 800Mb long, with at least one long repetitive region.
>>
>> The command I am using:
>>
>> lastz target.fasta query.fasta --notransition --step=20 
>> --maxwordcount=70 --exact=20 --chain --gapped --ambiguous=iupac 
>> --rdotplot=plot --format=differences > alignment.differences
>>
>> I always get the same error:
>>
>> FAILURE: in add_segment()
>> table size (4,869,542,152 for 101,448,794 segments) exceeds 
>> allocation limit of 4,294,967,279;
>> consider raising scoring threshold (--hspthresh or --exact) or 
>> breaking your target sequence into smaller pieces.
>>
>> I tried several strategies to overcome this issue:
>> increasing the value of --exact up to 100
>> using values of --hspthresh up to 10 000
>> adding --seed=match12
>> dividing my target sequence in two (one multiFASTA file)
>> working with just half of the chromosome (400Mb)
>> setting the parameters as they were set to align /T. aestivum/ and /A. 
>> tauschii/ (https://plants.ensembl.org/mlss.html?mlss=9814)
>>
>> Still, I get the same error.
>>
>> Only a few times was I able to get an output (e.g. with --exact=100), 
>> but it is always larger than 700 GB, so even when the file is 
>> generated, I run out of memory and cannot work with it.
>>
>> I also used LastZ_32 but the process gets killed without giving me 
>> any info.
>>
>> I was wondering if you could help me with this issue (maybe I am not 
>> using some of the options properly) or give me some advice on how to 
>> properly deal with this alignment.
>>
>> Thank you.
>>
>> Alice Iob
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: https://lists.ensembl.org/mailman/listinfo/dev_ensembl.org
>> Ensembl Blog: http://www.ensembl.info/
> -- 
> Matthieu Muffato, Ph.D.
> Ensembl Compara Principal Developer
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus, Hinxton
> Cambridge, CB10 1SD, United Kingdom
> Room  A3-123
> Phone + 44 (0) 1223 49 4631
> Fax   + 44 (0) 1223 49 4468

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lastz_tdic_ttur.png
Type: image/png
Size: 103347 bytes
Desc: not available
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20200327/b1f4a3a0/attachment.png>

