[ensembl-dev] API question about Multi Alignments (technical!)

Mon Jan 28 16:31:33 GMT 2013

Hi Marc, in this case you will need to retrieve all the 
genomic_align_block objects which make up the reference slice region you 
are interested in, for example, using human chr19:9825113-9825154 which 
spans two genomic_align_blocks :

###########

my $method_link_species_set_adaptor = Bio::EnsEMBL::Registry->get_adaptor(
     'Multi', 'compara', 'MethodLinkSpeciesSet');

my $method_link_species_set =
     $method_link_species_set_adaptor->fetch_by_method_link_type_species_set_name($alignment_type,$set_of_species);

my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor($species, 'core', 'Slice');

my $query_slice = $slice_adaptor->fetch_by_region('chromosome','19', 9825113, 9825154);

my $genomic_align_block_adaptor = 
Bio::EnsEMBL::Registry->get_adaptor("Multi","compara","GenomicAlignBlock");

foreach my $gab (@{ $genomic_align_block_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice( 
$method_link_species_set, $query_slice ) }){
  my $restricted_gab = $gab->restrict_between_reference_positions($query_slice->start, $query_slice->end);
  my $ref_ga = $restricted_gab->reference_genomic_align;
  my $dnafrag = $ref_ga->dnafrag;
  print join(":", "REF:",$dnafrag->genome_db->name, $dnafrag->name, $ref_ga->dnafrag_start,
   $ref_ga->dnafrag_end, $ref_ga->dnafrag_strand, $ref_ga->cigar_line), "\n";
  foreach my $non_ref_ga(@{ $restricted_gab->get_all_non_reference_genomic_aligns }){
   $dnafrag = $non_ref_ga->dnafrag;
   my $cigar_line = $non_ref_ga->cigar_line;
   next unless($cigar_line=~/M/); # only show alignments with sequence
   print join(":", $dnafrag->genome_db->name, $dnafrag->name, $non_ref_ga->dnafrag_start, $non_ref_ga->dnafrag_end,
    $non_ref_ga->dnafrag_strand, $cigar_line), "\n";
  }
  print "########\n";
}

output:
REF::homo_sapiens:19:9825113:9825133:1:21M
pan_troglodytes:19:9902944:9902964:1:21M
gorilla_gorilla:19:9883391:9883411:1:21M
pongo_abelii:19:9912188:9912208:1:21M
########
REF::homo_sapiens:19:9825134:9825154:1:21M
pan_troglodytes:19:9902965:9902985:1:21M
gorilla_gorilla:19:9883412:9883432:1:21M
pongo_abelii:19:9912209:9912229:1:21M
########

Cheers,
Stephen.

On Fri, 25 Jan 2013, Marc Hoeppner wrote:

> HI Steven,
>  
> thanks for the explanation! I think what this still doesn’t address is how exactly individual slices are aligned
> against the reference in case of partial alignments. Getting genomic coordinates for aligned species is very
> straight-forward, but the challenge seems to be in determining the specific regions for which pairwise alignments
> exist.
>  
> Say we have the human region x:1000-10000 (our genomic_align_block, or AlignSlice) and 18 aligned species - it could
> be that Platypus has 2 aligned regions; one from x:1500-50000 and Another from x:60000-80000  (human genomic
> coordinates).
>  
> So my desired output would be:
>  
> X 15000 5000 platypus_contig platypus_genomic_start platypus_genomic_end score strand
> X 60000 80000 platypus_contig platypus_genomic_start platypus_genomic_end score strand
>  
> Does that make sense? It’s essentially about breaking the multi-alignment down into pairwise alignments, then into
> the underlying genomic slices and translating that into a kind of synteny map.
>  
> Incedentally, after looking at the output from my attached script some more, it seems to be doing that after all, but
> I am just not 100% sure I am using the different objects in the correct way. hm. A bit difficult to verify the
> output.
>  
> Anyway, thanks again for giving it a go!
>  
> Marc
>  
> Skickat från Windows E-post
>  
> Från: Stephen Fitzgerald <stephenf at ebi.ac.uk>
> Skickat: den 25 januari 2013 14:06
> Till: Ensembl developers list <dev at ensembl.org>
> Ämne: Re: [ensembl-dev] API question about Multi Alignments (technical!)
>  
> Hi Marc, you should be able to get that informion directly from the slice object (obtained from the
> $align_slice_slice object, below).
> 
> for example (I'm looking at a small region on human chr7)
> 
> #######
> 
> my $method_link_species_set =
>  $method_link_species_set_adaptor->fetch_by_method_link_type_species_set_name($alignment_type,$set_of_species);
> 
> my $slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor($species, 'core', 'Slice');
> 
> my $align_slice_adaptor = Bio::EnsEMBL::Registry->get_adaptor("Multi","compara","AlignSlice");
> 
> my $query_slice = $slice_adaptor->fetch_by_region('chromosome','7', 113726353, 113726692 );
> 
> my $alignSlice = $align_slice_adaptor->fetch_by_Slice_MethodLinkSpeciesSet( $query_slice, $method_link_species_set );
> 
> foreach my $align_slice_slice(@{ $alignSlice->get_all_Slices }){
>  my $seq = $align_slice_slice->seq;
>  next unless $seq=~/\w/;
>  my($slice, $position) = $align_slice_slice->get_original_seq_region_position();
>  print $slice->name, "\n";
> }
> 
> ########
> 
> output:
> chromosome:GRCh37:7:113726353:113726692:1
> chromosome:GRCm38:6:14901447:14901787:1
> chromosome:oryCun2:7:31354211:31354521:-1
> chromosome:CHIMP2.1.4:7:115561357:115561698:1
> chromosome:gorGor3.1:7:111957332:111957674:1
> chromosome:PPYG2:7:110575906:110576246:1
> chromosome:MMUL_1:3:151447053:151447375:1
> chromosome:C_jacchus3.2.1:8:82905928:82906267:1
> chromosome:CanFam3.1:14:53311047:53311423:1
> chromosome:EquCab2:4:71503687:71504047:1
> chromosome:BROADO5:8:168670777:168671257:1
> 
> Hope that helps,
> Stephen.
> 
> 
> 
> On Fri, 25 Jan 2013, Marc Hoeppner wrote:
>
>       Dear EnsEMBL team,
>
>        I am currently trying to convert the 19 vertebrate PECAN alignment into something resembling a pairwise
>       synteny map of sorts (I know there are ways to get the actual
>       pairwise synteny, but I have something else I wish to test on this data). To elaborate:
>
>       The goal would be to get information on which regions are aligned between human and any of the other 18
>       species – with base-precision for the start and stop of these
>       alignments in gemomic coordinates. The output should look like:
>
>       Human_chr human_chr_start human_chr_end target_chr target_chr_start target_chr_end score
>       target_chr_strand
>
>        However, as it turns out this is apparently not as straight-forward as I had hoped.  The main steps, I
>       suspect, would be:
>
>       Deconstruct the aligned sequences (AlignSlice) into their underlying AlignSlice::Slices
>
>       Determine the genomic coordinates of the underlying genomic slices (many genomic slices can make up one
>       AlignSlice::Slice, possibly separated by larger GAP slices)
>
>       Convert these genomic coordinates back to the AlignSlice::Slice (to get their position in the fake
>       alignment coordinate system)
>
>       Translate these alignment coordinates to human genomic coordinates via the human AlignSlice::Slice
>
>       However, turns out that for each genomic slice, I bascially seem to end up with the global coordinates of
>       the AlignSlice::Slice I started from. This would make sense if
>       each AlignSlice::Slice would consist of exactly one genomic_align and this genomic_align would align
>       full-length to the human query. Doubtful, I think.
>
>       Clearly, I am missing something, somewhere ;) Perhaps there is some padding going on that I can’t seem to
>       get rid of?
>
>       I have attached the script if anyone has an idea of how to do this correctly.
>
>        
>
>       Cheers,
>
>        
>
>       Marc
> 
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
> 
> 
>