[ensembl-dev] genomic_align_blocks sort order
Stephen Fitzgerald
stephenf at ebi.ac.uk
Thu Jan 30 13:37:36 GMT 2014
Hi Daniel, the lastz-nets are identified by a "group_id" column in the
genomic_align_block table in the compara db. Each net can be made from
several blocks (genomic_align_blocks).
Looking at the web page at this high resolution it is impossible to
distinguish gaps within the blocks and the individual blocks which comprise a net.
So, if we look at the underlying data in the database we can see that
within the region in mouse (X:105500254-105500638) defined by your url
there are 5 nets made from 2 blocks each.
> select gab.group_id, group_concat(ga.genomic_align_block_id) from genomic_align
ga inner join genomic_align_block gab on gab.genomic_align_block_id =
ga.genomic_align_block_id where ga.method_link_species_set_id =601 and
dnafrag_id = 13705548 and dnafrag_start < 105500638 and dnafrag_end >
105500254 group by gab.group_id;
+---------------+-----------------------------------------+
| group_id | group_concat(ga.genomic_align_block_id) |
+---------------+-----------------------------------------+
| 6010004581617 | 6010000658807,6010000658802 |
| 6010004581620 | 6010000658796,6010000658798 |
| 6010004587353 | 6010000659724,6010000659722 |
| 6010007433644 | 6010001490015,6010001490021 |
| 6010007433647 | 6010001490065,6010001490059 |
+---------------+-----------------------------------------+
Hope this helps,
Stephen.
On Thu, 30 Jan 2014, Caffrey, Daniel wrote:
> Hi Javier,
> Thanks for the quick and helpful response. Part of the reason I want the "best" alignment is because I will need to choose one region when there
> are duplications. For my test example, there are 6 blocks that belong to level 2 and 4 that belong to level3 (pasted below). Here, I could discard
> the level 3 alignments and develop my own metrics to identify the "best" level2 alignment.
>
> Interestingly, the region in detail web page (version 73) appears to show 5 lastz alignments (tracks with clear and pink regions) whereas my code
> reports 10 blocks. It is hard to tell how these 5 tracks relate to the 10 blocks that I get with v73 of the API. Can you point me to the code that
> generates the region in detail page?
>
> http://sep2013.archive.ensembl.org/Mus_musculus/Location/Multi?db=core;r=X:105500254-105500638;r1=X:150684356-15068474
>
> Thanks,
> Daniel
>
> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 levelId:2
> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 levelId:3
> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 levelId:2
> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 levelId:3
> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 levelId:2
> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 levelId:2
>
>
>
> On Jan 30, 2014, at 2:32 AM, "Javier Herrero (TGAC)" <Javier.Herrero at tgac.ac.uk>
> wrote:
>
> Hi Daniel
> Indeed, the alignment scores are not very helpful for your purpose. It is (I believe) the simple addition of the individual match/mismatch
> scores and gap penalties.
>
> As a general rule, data coming from the Ensembl API are not sorted (there are only a few exceptions to this rule).
>
> I would probably consider one additional bit of information: the level_id. These alignments are down with LastZ, then chained and finally
> netted. The chaining process links all blocks that are compatible in order and orientation into a virtual structure called chain. The netting
> process consists on the selection of the best chain for every location in one of the two genomes (called the reference genome). This genome
> is typically the human genome in the Ensembl pairwise alignments. The netting process selects the best chain first (level_id = 1) and then
> look for chains (or bits of chains) that fill in the gaps left by the first chain (level_d = 2). This nesting of chains has no limit although
> typically you don’t find any alignment with a level_id higher than 3 or 4.
>
> In other words, the blocks with a level_id = 1 usually represent the most likely orthologous region and the blocks with a higher level_id
> represent small rearrangements or maybe transposable elements. Note that there will be case where this is not as simple. Imagine a region
> that is duplicated in mouse, it is possible that the level_id =1 net points at bits of one of the two copies and the level_id = 2 points at
> bits of the other copy.
>
> Alternatively, you may want to consider using the EPO alignments, where paralogous copies are together in the same regions and small
> rearrangements are not considered (as these are global alignments). Although you will find many more species in the EPO alignments, you can
> always ignore them and focus on human and mouse only.
>
> I hope this helps
>
> Javier
>
> On 29 Jan 2014, at 18:16, Caffrey, Daniel <Daniel.Caffrey at umassmed.edu> wrote:
>
> Dear Ensembl developers,
> I'm interested in identifying the best pairwise LASTZ alignment within a list of genomic_align_blocks (mouse v human). I'm using a
> modified version of the example provided at :
>
> http://useast.ensembl.org/info/docs/api/compara/compara_tutorial.html
>
> 1) The score() method is not particularly useful as it does not appear to take the alignment length into account. Is there another
> method that returns something similar to an E-value?
>
> 2) Based on crude metrics (% nucleotides aligned and eyeballing), the alignments appear to be ordered in the list from best to worst .
> Can anyone confirm how/if the alignments are ordered?
>
>
> Thanks,
> Daniel
>
> command line options (API V73):
> --seq_region X --seq_region_start 105500254 --seq_region_end 105500638 --output_file out.aln --species mouse
>
>
>
> Relevant code snippet:
>
> foreach my $this_genomic_align_block (@$genomic_align_blocks) {
> my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
> my $this_align = $restricted_gab->get_SimpleAlign;
> print $alignIO $this_align;
> #my $pid= $this_align->percentage_identity; #non-gap columns
> my $overallPid= $this_align->overall_percentage_identity; #all columns
> my $seq1=$this_align->get_seq_by_pos(1);
> my $seq2=$this_align->get_seq_by_pos(2);
> my $seqName2=$seq2->display_id();
> my $start=$seq2->start();
> my $end=$seq2->end();
>
> my $seqStr1=$seq1->seq();
> my $seqStr2=$seq2->seq();
> my $length=length($seqStr1);
> my $aligned=0;
> for (my $i=0;$i<$length;$i++){
> my $nuc1=substr($seqStr1,$i,1);
> my $nuc2=substr($seqStr2,$i,1);
> if ($nuc1 ne '-' && $nuc2 ne '-'){
> $aligned++;
> }
> }
> my $score=$this_genomic_align_block->score();
> my $rscore=$restricted_gab->score();
> my $percentAligned=$aligned/$sliceLength;
> print STDERR "$seqName2 $start $end $aligned $sliceLength perceAligned:$percentAligned score:$score restricted score:$rscore\n";
> }
>
> STDERR:
>
> homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 restricted score:
> homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
> homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
> homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 restricted score:
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
> --
> Javier Herrero, PhD
> Comparative Genomics Project Leader
> TGAC, Norwich Research Park
> Norwich, NR4 7UH, UK
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>
>
>
More information about the Dev
mailing list