[ensembl-dev] possible 'off by one error' in ensembl-functgenomics/scripts/miscellaneous/sam2bed.pl ?

Hans-Rudolf Hotz hrh at fmi.ch
Thu Aug 30 14:05:10 BST 2012


Hi

I am struggling with the sam2bed.pl script, and I wonder whether it has 
one of those famous 'off by one error' bugs?


SAM files (like GFF files) use a 1-based coordinate system and are end 
inclusive. BED files use a 0-based coordinate system and are end 
exclusive (see the SAM spec http://samtools.sourceforge.net/SAM1.pdf 
or http://genome.ucsc.edu/FAQ/FAQformat.html).


Now, I look at the following script:

~/ensembl-67/ensembl-functgenomics/scripts/miscellaneous/sam2bed.pl


I get the position in line 120, ie:

my ($name, $flag, $slice_name, $pos, $mapq, undef, undef, undef, undef, 
$read) = split("\t");


The $pos variable is not modified and is used directly in line 130:

push @cache, join("\t", ($seq_region_name, $pos, ($pos +length($read) 
-1), $name, $mapq, $strand));


Shouldn't this rather be written like:

push @cache, join("\t", ($seq_region_name, ($pos -1), ($pos 
+length($read) -1), $name, $mapq, $strand));


For the end coordinate, ($pos + length($read) - 1) is correct (ie the 
half-open interval, or end-exclusive region, used in BED files).
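
To make the arithmetic concrete, here is a minimal sketch of the conversion 
I have in mind. The helper name sam_to_bed_coords and the example values 
are hypothetical and not part of sam2bed.pl; like the script, it assumes 
that the read length equals the aligned span on the reference (ie an 
ungapped, unclipped alignment).

```perl
use strict;
use warnings;

# Hypothetical helper (not in sam2bed.pl): turn a 1-based, end-inclusive
# SAM start position into 0-based, end-exclusive BED coordinates.
# Assumes length($read) equals the aligned span on the reference.
sub sam_to_bed_coords {
    my ($pos, $read) = @_;
    my $bed_start = $pos - 1;                  # 1-based start -> 0-based start
    my $bed_end   = $pos + length($read) - 1;  # last 1-based base == exclusive 0-based end
    return ($bed_start, $bed_end);
}

# A 10 bp read starting at SAM position 100 covers bases 100..109 (1-based).
my ($start, $end) = sam_to_bed_coords(100, 'ACGTACGTAC');
print join("\t", 'chr1', $start, $end), "\n";  # prints: chr1  99  109 (tab separated)
```

With these coordinates, end minus start equals the read length (10), which 
is what one expects from a BED interval.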


Is this an oversight in the script?


Thank you very much for any clarification

Regards, Hans



-- 



Hans-Rudolf Hotz, PhD
Bioinformatics Support

Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66
4058 Basel/Switzerland



