[ensembl-dev] genomic_align_blocks sort order

Thu Jan 30 07:32:44 GMT 2014

Hi Daniel

Indeed, the alignment scores are not very helpful for your purpose. It is (I believe) the simple addition of the individual match/mismatch scores and gap penalties.

As a general rule, data coming from the Ensembl API are not sorted (there are only a few exceptions to this rule).

I would probably consider one additional bit of information: the level_id. These alignments are down with LastZ, then chained and finally netted. The chaining process links all blocks that are compatible in order and orientation into a virtual structure called chain. The netting process consists on the selection of the best chain for every location in one of the two genomes (called the reference genome). This genome is typically the human genome in the Ensembl pairwise alignments. The netting process selects the best chain first (level_id = 1) and then look for chains (or bits of chains) that fill in the gaps left by the first chain (level_d = 2). This nesting of chains has no limit although typically you don’t find any alignment with a level_id higher than 3 or 4.

In other words, the blocks with a level_id = 1 usually represent the most likely orthologous region and the blocks with a higher level_id represent small rearrangements or maybe transposable elements. Note that there will be case where this is not as simple. Imagine a region that is duplicated in mouse, it is possible that the level_id =1 net points at bits of one of the two copies and the level_id = 2 points at bits of the other copy.

Alternatively, you may want to consider using the EPO alignments, where paralogous copies are together in the same regions and small rearrangements are not considered (as these are global alignments). Although you will find many more species in the EPO alignments, you can always ignore them and focus on human and mouse only.

I hope this helps

Javier

On 29 Jan 2014, at 18:16, Caffrey, Daniel <Daniel.Caffrey at umassmed.edu<mailto:Daniel.Caffrey at umassmed.edu>> wrote:

Dear Ensembl developers,

I'm interested in identifying the best pairwise LASTZ alignment within a list of genomic_align_blocks (mouse v human).  I'm using a modified version of the example provided at :

http://useast.ensembl.org/info/docs/api/compara/compara_tutorial.html

1) The score() method is not particularly useful as it does not appear to take the alignment length into account. Is there another method that returns something similar to an E-value?

2) Based on crude metrics (% nucleotides aligned and eyeballing), the alignments appear to be ordered in the list from best to worst . Can anyone confirm how/if the alignments are ordered?

Thanks,
Daniel

command line options (API V73):
--seq_region X --seq_region_start 105500254 --seq_region_end 105500638 --output_file out.aln --species mouse

Relevant code snippet:

foreach my $this_genomic_align_block (@$genomic_align_blocks) {
    my $restricted_gab = $this_genomic_align_block->restrict_between_reference_positions($seq_region_start, $seq_region_end);
   my  $this_align = $restricted_gab->get_SimpleAlign;
    print $alignIO  $this_align;
    #my $pid= $this_align->percentage_identity; #non-gap columns
    my $overallPid= $this_align->overall_percentage_identity; #all columns
    my $seq1=$this_align->get_seq_by_pos(1);
    my $seq2=$this_align->get_seq_by_pos(2);
    my $seqName2=$seq2->display_id();
    my $start=$seq2->start();
    my $end=$seq2->end();

    my $seqStr1=$seq1->seq();
    my $seqStr2=$seq2->seq();
    my $length=length($seqStr1);
    my $aligned=0;
    for (my $i=0;$i<$length;$i++){
        my $nuc1=substr($seqStr1,$i,1);
        my $nuc2=substr($seqStr2,$i,1);
        if ($nuc1 ne '-' && $nuc2 ne '-'){
        $aligned++;
        }
    }
    my $score=$this_genomic_align_block->score();
    my $rscore=$restricted_gab->score();
    my $percentAligned=$aligned/$sliceLength;
    print STDERR "$seqName2 $start $end $aligned $sliceLength perceAligned:$percentAligned score:$score restricted score:$rscore\n";
}

STDERR:

homo_sapiens/X 150684357 150684616 236 385 perceAligned:0.612987012987013 score:13262 restricted score:
homo_sapiens/X 145891632 145891730 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
homo_sapiens/X 145895821 145895919 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
homo_sapiens/HG1459_PATCH 145896018 145896116 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
homo_sapiens/HG1459_PATCH 145891829 145891927 99 385 perceAligned:0.257142857142857 score:29716 restricted score:
homo_sapiens/X 145891846 145891927 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
homo_sapiens/X 145895624 145895705 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
homo_sapiens/HG1459_PATCH 145895821 145895902 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
homo_sapiens/HG1459_PATCH 145892043 145892124 80 385 perceAligned:0.207792207792208 score:29716 restricted score:
homo_sapiens/X 150684688 150684739 52 385 perceAligned:0.135064935064935 score:13262 restricted score:

_______________________________________________
Dev mailing list    Dev at ensembl.org<mailto:Dev at ensembl.org>
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/

--
Javier Herrero, PhD
Comparative Genomics Project Leader
TGAC, Norwich Research Park
Norwich, NR4 7UH, UK

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ensembl.org/pipermail/dev_ensembl.org/attachments/20140130/05356ea6/attachment.html>