[ensembl-dev] possible 'off by one error' in ensembl-functgenomics/scripts/miscellaneous/sam2bed.pl ?
Hans-Rudolf Hotz
hrh at fmi.ch
Thu Aug 30 14:05:10 BST 2012
Hi
I am struggling with the sam2bed.pl script, and I wonder whether it has
one of those famous 'off by one error' bugs?
SAM files (like GFF files) use a 1-based coordinate system and are
end-inclusive. BED files use a 0-based coordinate system and are
end-exclusive (see the SAM spec, http://samtools.sourceforge.net/SAM1.pdf,
or http://genome.ucsc.edu/FAQ/FAQformat.html).
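To make the two conventions concrete, here is a minimal Perl sketch of the conversion. The sub name sam_to_bed_interval is hypothetical (it is not part of sam2bed.pl), and the sketch assumes an ungapped alignment, so that the read covers exactly as many reference bases as it has characters.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of sam2bed.pl.
# Assumes an ungapped alignment: the read covers exactly $len reference bases.
sub sam_to_bed_interval {
    my ($pos, $len) = @_;        # $pos: 1-based leftmost position from SAM
    my $start = $pos - 1;        # BED start is 0-based
    my $end   = $start + $len;   # BED end is exclusive; same as $pos + $len - 1
    return ($start, $end);
}

# A 36 bp read at SAM position 100 covers bases 100..135 (1-based, inclusive).
my ($start, $end) = sam_to_bed_interval(100, 36);
print join("\t", $start, $end), "\n";   # prints 99 and 135, tab-separated
```

Note that only the start shifts by one; the 1-based inclusive end and the 0-based exclusive end are the same number.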
Now, I look at the following script:
~/ensembl-67/ensembl-functgenomics/scripts/miscellaneous/sam2bed.pl
The position is read in line 120, i.e.:
my ($name, $flag, $slice_name, $pos, $mapq, undef, undef, undef, undef,
$read) = split("\t");
The $pos variable is not modified and is used directly in line 130:
push @cache, join("\t", ($seq_region_name, $pos, ($pos +length($read)
-1), $name, $mapq, $strand));
Shouldn't this rather be written like:
push @cache, join("\t", ($seq_region_name, ($pos -1), ($pos
+length($read) -1), $name, $mapq, $strand));
For the end coordinate, ($pos + length($read) - 1) is already correct
(i.e. the half-open interval, or end-exclusive region, used in BED files).
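One way to check the proposed change: in BED, end minus start must equal the number of bases covered. The sketch below compares the expression currently in line 130 with the proposed one. The sub names and the position and read values are made up for illustration.

```perl
use strict;
use warnings;

# Interval as computed by the current line 130 (start left unmodified).
sub current_interval {
    my ($pos, $read) = @_;
    return ($pos, $pos + length($read) - 1);
}

# Interval with the proposed fix (start shifted to 0-based).
sub proposed_interval {
    my ($pos, $read) = @_;
    return ($pos - 1, $pos + length($read) - 1);
}

my ($pos, $read) = (100, 'ACGTACGTAC');   # made-up values, 10 bp read

for my $case ([current => \&current_interval], [proposed => \&proposed_interval]) {
    my ($label, $code) = @$case;
    my ($start, $end) = $code->($pos, $read);
    # BED length is end - start, because the end is exclusive.
    printf "%-8s start=%d end=%d bed_length=%d read_length=%d\n",
        $label, $start, $end, $end - $start, length($read);
}
```

With the current expression the BED feature comes out one base shorter than the read (its first base is dropped). With the proposed expression the two lengths match.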
Is this an oversight in the script?
Thank you very much for any clarification
Regards, Hans
--
Hans-Rudolf Hotz, PhD
Bioinformatics Support
Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66
4058 Basel/Switzerland