[ensembl-dev] genomic_align_blocks sort order

Caffrey, Daniel Daniel.Caffrey at umassmed.edu
Thu Jan 30 15:31:01 GMT 2014


Hi Stephen, Javier,
This makes more sense now and presumably explains why I was seeing the same scores for blocks in the same net. 

On the web page, it looks like it has chosen to display green shading between the 5th net  X:150684357-150684739 and the other genome. Do you know the websites criteria for choosing the 5th net for green shading?  This appears to be the "best" alignment and the same criteria would probably help me solve my original problem - identifying the best alignment.

Thanks again,

Daniel 




On Jan 30, 2014, at 8:37 AM, Stephen Fitzgerald <stephenf at ebi.ac.uk>
 wrote:

> Hi Daniel, the lastz-nets are identified by a "group_id" column in the genomic_align_block table in the compara db. Each net can be made from several blocks (genomic_align_blocks).
> Looking at the web page at this high resolution it is impossible to distinguish gaps within the blocks and the individual blocks which comprise a net.
> 
> So, if we look at the underlying data in the database we can see that within the region in mouse (X:105500254-105500638) defined by your url there are 5 nets made from 2 blocks each.
> 
>> select gab.group_id, group_concat(ga.genomic_align_block_id) from genomic_align 
> ga inner join genomic_align_block gab on gab.genomic_align_block_id = ga.genomic_align_block_id where ga.method_link_species_set_id =601 and dnafrag_id = 13705548 and dnafrag_start < 105500638 and dnafrag_end > 105500254 group by gab.group_id;
> +---------------+-----------------------------------------+
> | group_id      | group_concat(ga.genomic_align_block_id) |
> +---------------+-----------------------------------------+
> | 6010004581617 | 6010000658807,6010000658802             |
> | 6010004581620 | 6010000658796,6010000658798             |
> | 6010004587353 | 6010000659724,6010000659722             |
> | 6010007433644 | 6010001490015,6010001490021             |
> | 6010007433647 | 6010001490065,6010001490059             |
> +---------------+-----------------------------------------+
> 
> Hope this helps,
> Stephen.
> 
> 
> On Thu, 30 Jan 2014, Caffrey, Daniel wrote:
> 
>> Hi Javier,
>> Thanks for the quick and helpful response. Part of the reason I want the "best" alignment is because I will need to choose one region when there
>> are duplications. For my test example, there are 6 blocks that belong to level 2 and 4 that belong to level3 (pasted below).  Here, I could discard
>> the level 3 alignments and  develop my own metrics to identify the "best" level2 alignment.
>> Interestingly, the region in detail web page (version 73) appears to show 5 lastz alignments (tracks with clear and pink regions) whereas my code
>> reports 10 blocks. It is hard to tell how these 5 tracks relate to the 10 blocks that I get with v73 of the API.  Can you point me to the code that
>> generates the region in detail page?
>> http://sep2013.archive.ensembl.org/Mus_musculus/Location/Multi?db=core;r=X:105500254-105500638;r1=X:150684356-15068474
>> Thanks,
>> Daniel 
>> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 levelId:2
>> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
>> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
>> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
>> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
>> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
>> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
>> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
>> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
>> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 levelId:2
>> On Jan 30, 2014, at 2:32 AM, "Javier Herrero (TGAC)" <Javier.Herrero at tgac.ac.uk>
>>  wrote:
>> 
>>      Hi Daniel
>> Indeed, the alignment scores are not very helpful for your purpose. It is (I believe) the simple addition of the individual match/mismatch
>> scores and gap penalties.
>> As a general rule, data coming from the Ensembl API are not sorted (there are only a few exceptions to this rule).
>> I would probably consider one additional bit of information: the level_id. These alignments are down with LastZ, then chained and finally
>> netted. The chaining process links all blocks that are compatible in order and orientation into a virtual structure called chain. The netting
>> process consists on the selection of the best chain for every location in one of the two genomes (called the reference genome). This genome
>> is typically the human genome in the Ensembl pairwise alignments. The netting process selects the best chain first (level_id = 1) and then
>> look for chains (or bits of chains) that fill in the gaps left by the first chain (level_d = 2). This nesting of chains has no limit although
>> typically you don’t find any alignment with a level_id higher than 3 or 4.
>> In other words, the blocks with a level_id = 1 usually represent the most likely orthologous region and the blocks with a higher level_id
>> represent small rearrangements or maybe transposable elements. Note that there will be case where this is not as simple. Imagine a region
>> that is duplicated in mouse, it is possible that the level_id =1 net points at bits of one of the two copies and the level_id = 2 points at
>> bits of the other copy.
>> Alternatively, you may want to consider using the EPO alignments, where paralogous copies are together in the same regions and small
>> rearrangements are not considered (as these are global alignments). Although you will find many more species in the EPO alignments, you can
>> always ignore them and focus on human and mouse only.
>> I hope this helps
>> Javier 
>> On 29 Jan 2014, at 18:16, Caffrey, Daniel <Daniel.Caffrey at umassmed.edu> wrote:
>> 
>>      Dear Ensembl developers,
>> I'm interested in identifying the best pairwise LASTZ alignment within a list of genomic_align_blocks (mouse v human).  I'm using a
>> modified version of the example provided at :
>> http://useast.ensembl.org/info/docs/api/compara/compara_tutorial.html
>> 1) The score() method is not particularly useful as it does not appear to take the alignment length into account. Is there another
>> method that returns something similar to an E-value? 
>> 2) Based on crude metrics (% nucleotides aligned and eyeballing), the alignments appear to be ordered in the list from best to worst .
>> Can anyone confirm how/if the alignments are ordered?
>> Thanks,
>> Daniel 
>> command line options (API V73):
>> --seq_region X --seq_region_start 105500254 --seq_region_end 105500638 --output_file out.aln --species mouse
>> Relevant code snippet:
>> foreach my $this_genomic_align_block (@$genomic_align_blocks) {
>>     my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);    
>>    my  $this_align = $restricted_gab->get_SimpleAlign;
>>     print $alignIO  $this_align;
>>     #my $pid= $this_align->percentage_identity; #non-gap columns
>>     my $overallPid= $this_align->overall_percentage_identity; #all columns
>>     my $seq1=$this_align->get_seq_by_pos(1);
>>     my $seq2=$this_align->get_seq_by_pos(2);
>>     my $seqName2=$seq2->display_id();
>>     my $start=$seq2->start();
>>     my $end=$seq2->end();
>>     my $seqStr1=$seq1->seq();
>>     my $seqStr2=$seq2->seq();
>>     my $length=length($seqStr1);
>>     my $aligned=0;
>>     for (my $i=0;$i<$length;$i++){
>>         my $nuc1=substr($seqStr1,$i,1);
>>         my $nuc2=substr($seqStr2,$i,1);
>>         if ($nuc1 ne '-' && $nuc2 ne '-'){
>>         $aligned++;
>>         }
>>     }
>>     my $score=$this_genomic_align_block->score();
>>     my $rscore=$restricted_gab->score();
>>     my $percentAligned=$aligned/$sliceLength;
>>     print STDERR "$seqName2 $start $end $aligned $sliceLength perceAligned:$percentAligned score:$score restricted score:$rscore\n";
>> }
>> STDERR:
>> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 restricted score:
>> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
>> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
>> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
>> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
>> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
>> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
>> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
>> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
>> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 restricted score:
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>> --
>> Javier Herrero, PhD
>> Comparative Genomics Project Leader
>> TGAC, Norwich Research Park
>> Norwich, NR4 7UH, UK
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/





More information about the Dev mailing list