[ensembl-dev] Compara: Inconsistent results in LastZ alignments?

Matthieu Muffato muffato at ebi.ac.uk
Thu Dec 22 08:50:37 GMT 2016

Hi Marc

I actually made a mistake: I can see 175 GenomicAlignBlocks when using 
mouse as the query. One of them is the original region, another is the 
same region 4 Mb off, etc.

When using AlignSlice, the blocks are sorted first by their position on 
the query genome and then by their position on the other genome, and only 
one block is returned. In this case, it happens to be 
3:60856379:60856605:1, but it could have been on another human chromosome.
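
As a rough illustration of that sort-then-keep-one behaviour, here is a toy 
model in Python. It is not the Compara API: the Block type, the pick_block 
helper and the exact sort key (query start, then target name, then target 
start) are assumptions made for the sketch. The human coordinates are the 
two quoted in this thread; the query start and the strand of the first 
block are placeholders.

```python
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """Hypothetical stand-in for one pairwise alignment block."""
    query_start: int    # start on the query (mouse) sequence
    target_name: str    # name of the human sequence
    target_start: int
    target_end: int
    target_strand: int


def pick_block(blocks: Sequence[Block]) -> Optional[Block]:
    """Sort by query position, then target position, and keep the first."""
    ordered = sorted(
        blocks,
        key=lambda b: (b.query_start, b.target_name, b.target_start),
    )
    return ordered[0] if ordered else None


# Both blocks cover the same mouse interval, so the query position ties
# and the target position alone decides which one is kept.
blocks = [
    Block(4078298, "3", 64071783, 64072021, 1),  # original region (strand assumed)
    Block(4078298, "3", 60856379, 60856605, 1),  # block reported by AlignSlice
]

chosen = pick_block(blocks)
print(f"{chosen.target_name}:{chosen.target_start}:"
      f"{chosen.target_end}:{chosen.target_strand}")
```

Under this model the block with the lower human start wins the tie, whether 
or not it is the orthologous region.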

This happens because of the way we've built the alignment: the mouse patch 
is in a 1-to-many relationship with human, and the API can't tell which 
human region is the true ortholog. I agree that we need to process the 
mouse patches differently.
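
A query region in such a 1-to-many relationship can at least be flagged 
before trusting a single returned block. The Python sketch below is only 
an illustration: the one_to_many helper and the region strings are 
hypothetical, and the third pair is an invented one-to-one control. It 
says nothing about which target is the ortholog.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def one_to_many(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return the query regions that align to more than one target region."""
    targets = defaultdict(set)
    for query, target in pairs:
        targets[query].add(target)
    return {q: sorted(t) for q, t in targets.items() if len(t) > 1}


# (query region, target region) pairs, one per alignment block.
pairs = [
    ("CHR_MG153_PATCH:4078298-4078512", "3:64071783-64072021"),
    ("CHR_MG153_PATCH:4078298-4078512", "3:60856379-60856605"),
    ("1:1000-2000", "5:7000-8000"),  # invented one-to-one control
]

ambiguous = one_to_many(pairs)
for query, hits in ambiguous.items():
    print(query, "->", ", ".join(hits))
```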


On 21/12/16 13:05, Marc P Hoeppner wrote:
> Dear Matthieu,
> I actually used the website to track down this issue:
> Human-> Mouse:
> http://www.ensembl.org/Homo_sapiens/Location/Compara_Alignments/Image?align=677&db=core&g=ENSG00000200222&r=3%3A64071807-64072019&t=ENST00000363352
> And then the same in reverse, hitting a different human snoRNA:
> http://www.ensembl.org/Mus_musculus/Location/Compara_Alignments/Image?align=677&db=core&g=ENSMUSG00000097846&r=CHR_MG153_PATCH%3A4078298-4078512&t=ENSMUST00000181407
> I am using AlignSlice and AlignSlice:Slice objects to scan each locus;
> so maybe that is hiding some of the underlying details (as compared to
> genomic_align_blocks)?
> Snippet, using the human-mouse LASTZ_NET MLSS:
> my $gene_adaptor = $registry->get_adaptor("human", "Core", "Gene");
> my $gene = $gene_adaptor->fetch_by_stable_id("ENSG00000200222");
> my $slice = $gene->feature_Slice;
> my $align_slice =
> $align_slice_adaptor->fetch_by_Slice_MethodLinkSpeciesSet(
>           $slice,
>           $method_link_species_set,
>           "expanded"
>       );
> # The slices making up this AlignSlice
> my $sub_slices = $align_slice->get_all_Slices;
> foreach my $slice (@$sub_slices) {
>     print $slice->genome_db->name . "\n";
>     foreach my $sg (@{$slice->get_all_Genes_by_source("RFAM")}) {
>         print "\t" . $sg->stable_id . "\n";
>     }
> }
> On 21.12.2016 13:03, Matthieu Muffato wrote:
>> Hi Marc,
>> Can you please clarify something? When I try to get the human
>> alignments for the mouse region CHR_MG153_PATCH: 4,078,298-4,078,512 I
>> get 175 alignment blocks in total. Only 1 of them is on human chr 3,
>> but at the correct position: 3:64071783-64072021
>> (I've used this script
>> http://www.ebi.ac.uk/~muffato/workshops/2016_06_Cambridge/solutions_compara/gab1.pl
>> )
>> In theory, the (initial) pairwise alignment of any two species only
>> includes the primary assembly. Then, when patches are released, we
>> top-up the alignments with the patches. This process isn't great for
>> mouse patches on human vs mouse because human is the reference genome
>> for our lastz pipeline. If all the sequences were considered at the
>> same time, you'd indeed expect the chaining and netting steps to
>> filter out such small alignments to only keep the principal ones, but
>> here, because the pairwise alignment is done first without any mouse
>> patches, and then only with the mouse patches, it can't filter out the
>> secondary alignments properly.
>> Perhaps in these cases, we should redo the human->mouse alignment
>> entirely
>> Matthieu
>> On 21/12/16 07:23, Marc P Hoeppner wrote:
>>> Dear EnsEMBL team,
>>> I have been using LastZ alignments to check for locus conservation
>>> between human and mouse. However, I have come across an issue that seems
>>> to be related to the inclusion of ALT loci in the alignments.
>>> Specifically, when comparing human<->mouse for the human snoRNA
>>> ENSG00000200222, I get the mouse U3 snoRNA ENSMUSG00000097846 as the
>>> matching locus. However, this annotation sits on an ALT assembly. When I
>>> do the comparison in reverse (mouse<->human), the mouse U3 snoRNA aligns
>>> to the human locus ENSG00000212211 (same chromosome as the original
>>> human U3 query, but 4 Mbp off).
>>> I suppose that shouldn't happen and may be related to these snoRNAs
>>> being repetitive sequences. Still, I would have guessed that the genomic
>>> context (i.e. neighboring coding genes) should provide some guidance to
>>> how these loci ought to be aligned? Is this a LastZ problem? Wouldn't it
>>> perhaps be more sensible to exclude ALT assemblies until these
>>> alignments can be represented as graphs rather than flattened pairwise
>>> comparisons?
>>> Kind regards,
>>> Marc
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info:
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/

Matthieu Muffato, Ph.D.
Ensembl Compara and TreeFam Project Leader
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus, Hinxton
Cambridge, CB10 1SD, United Kingdom
Room  A3-145
Phone + 44 (0) 1223 49 4631
Fax   + 44 (0) 1223 49 4468
