[ensembl-dev] genomic_align_blocks sort order

Thu Jan 30 13:37:36 GMT 2014

Hi Daniel, the lastz-nets are identified by a "group_id" column in the 
genomic_align_block table in the compara db. Each net can be made from 
several blocks (genomic_align_blocks).
Looking at the web page at this high resolution it is impossible to 
distinguish gaps within the blocks and the individual blocks which comprise a net.

So, if we look at the underlying data in the database we can see that 
within the region in mouse (X:105500254-105500638) defined by your url 
there are 5 nets made from 2 blocks each.

> select gab.group_id, group_concat(ga.genomic_align_block_id) from genomic_align 
ga inner join genomic_align_block gab on gab.genomic_align_block_id = 
ga.genomic_align_block_id where ga.method_link_species_set_id =601 and 
dnafrag_id = 13705548 and dnafrag_start < 105500638 and dnafrag_end > 
105500254 group by gab.group_id;
+---------------+-----------------------------------------+
| group_id      | group_concat(ga.genomic_align_block_id) |
+---------------+-----------------------------------------+
| 6010004581617 | 6010000658807,6010000658802             |
| 6010004581620 | 6010000658796,6010000658798             |
| 6010004587353 | 6010000659724,6010000659722             |
| 6010007433644 | 6010001490015,6010001490021             |
| 6010007433647 | 6010001490065,6010001490059             |
+---------------+-----------------------------------------+

Hope this helps,
Stephen.

On Thu, 30 Jan 2014, Caffrey, Daniel wrote:

> Hi Javier,
> Thanks for the quick and helpful response. Part of the reason I want the "best" alignment is because I will need to choose one region when there
> are duplications. For my test example, there are 6 blocks that belong to level 2 and 4 that belong to level3 (pasted below).  Here, I could discard
> the level 3 alignments and  develop my own metrics to identify the "best" level2 alignment.
> 
> Interestingly, the region in detail web page (version 73) appears to show 5 lastz alignments (tracks with clear and pink regions) whereas my code
> reports 10 blocks. It is hard to tell how these 5 tracks relate to the 10 blocks that I get with v73 of the API.  Can you point me to the code that
> generates the region in detail page?
> 
> http://sep2013.archive.ensembl.org/Mus_musculus/Location/Multi?db=core;r=X:105500254-105500638;r1=X:150684356-15068474
> 
> Thanks,
> Daniel 
> 
> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 levelId:2
> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 levelId:2
> 
> 
> 
> On Jan 30, 2014, at 2:32 AM, "Javier Herrero (TGAC)" <Javier.Herrero at tgac.ac.uk>
>  wrote:
>
>       Hi Daniel
> Indeed, the alignment scores are not very helpful for your purpose. It is (I believe) the simple addition of the individual match/mismatch
> scores and gap penalties.
> 
> As a general rule, data coming from the Ensembl API are not sorted (there are only a few exceptions to this rule).
> 
> I would probably consider one additional bit of information: the level_id. These alignments are down with LastZ, then chained and finally
> netted. The chaining process links all blocks that are compatible in order and orientation into a virtual structure called chain. The netting
> process consists on the selection of the best chain for every location in one of the two genomes (called the reference genome). This genome
> is typically the human genome in the Ensembl pairwise alignments. The netting process selects the best chain first (level_id = 1) and then
> look for chains (or bits of chains) that fill in the gaps left by the first chain (level_d = 2). This nesting of chains has no limit although
> typically you don’t find any alignment with a level_id higher than 3 or 4.
> 
> In other words, the blocks with a level_id = 1 usually represent the most likely orthologous region and the blocks with a higher level_id
> represent small rearrangements or maybe transposable elements. Note that there will be case where this is not as simple. Imagine a region
> that is duplicated in mouse, it is possible that the level_id =1 net points at bits of one of the two copies and the level_id = 2 points at
> bits of the other copy.
> 
> Alternatively, you may want to consider using the EPO alignments, where paralogous copies are together in the same regions and small
> rearrangements are not considered (as these are global alignments). Although you will find many more species in the EPO alignments, you can
> always ignore them and focus on human and mouse only.
> 
> I hope this helps
> 
> Javier 
> 
> On 29 Jan 2014, at 18:16, Caffrey, Daniel <Daniel.Caffrey at umassmed.edu> wrote:
>
>       Dear Ensembl developers,
> I'm interested in identifying the best pairwise LASTZ alignment within a list of genomic_align_blocks (mouse v human).  I'm using a
> modified version of the example provided at :
> 
> http://useast.ensembl.org/info/docs/api/compara/compara_tutorial.html
> 
> 1) The score() method is not particularly useful as it does not appear to take the alignment length into account. Is there another
> method that returns something similar to an E-value? 
> 
> 2) Based on crude metrics (% nucleotides aligned and eyeballing), the alignments appear to be ordered in the list from best to worst .
> Can anyone confirm how/if the alignments are ordered?
> 
> 
> Thanks,
> Daniel 
> 
> command line options (API V73):
> --seq_region X --seq_region_start 105500254 --seq_region_end 105500638 --output_file out.aln --species mouse
> 
> 
> 
> Relevant code snippet:
> 
> foreach my $this_genomic_align_block (@$genomic_align_blocks) {
>     my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);    
>    my  $this_align = $restricted_gab->get_SimpleAlign;
>     print $alignIO  $this_align;
>     #my $pid= $this_align->percentage_identity; #non-gap columns
>     my $overallPid= $this_align->overall_percentage_identity; #all columns
>     my $seq1=$this_align->get_seq_by_pos(1);
>     my $seq2=$this_align->get_seq_by_pos(2);
>     my $seqName2=$seq2->display_id();
>     my $start=$seq2->start();
>     my $end=$seq2->end();
> 
>     my $seqStr1=$seq1->seq();
>     my $seqStr2=$seq2->seq();
>     my $length=length($seqStr1);
>     my $aligned=0;
>     for (my $i=0;$i<$length;$i++){
>         my $nuc1=substr($seqStr1,$i,1);
>         my $nuc2=substr($seqStr2,$i,1);
>         if ($nuc1 ne '-' && $nuc2 ne '-'){
>         $aligned++;
>         }
>     }
>     my $score=$this_genomic_align_block->score();
>     my $rscore=$restricted_gab->score();
>     my $percentAligned=$aligned/$sliceLength;
>     print STDERR "$seqName2 $start $end $aligned $sliceLength perceAligned:$percentAligned score:$score restricted score:$rscore\n";
> }
> 
> STDERR:
> 
> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 restricted score:
> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 restricted score:
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
> 
> 
> --
> Javier Herrero, PhD
> Comparative Genomics Project Leader
> TGAC, Norwich Research Park
> Norwich, NR4 7UH, UK
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
> 
> 
> 
>