[ensembl-dev] Retrieve sequence coverage data for strain (resequencing results)

Tue Oct 9 20:48:09 BST 2012

Hello,

I'm attempting to build alignments of resequencing results from the 1000 
genomes project.  The end result I desire should closely resemble this 
view from the genome browser 
http://browser.1000genomes.org/Homo_sapiens/Location/SequenceAlignment?db=core;g=ENSG00000099984;gene=ENSG00000099984;r=22:24322339-24326106 
I am using Perl API version 65 to match the database.

How do I obtain information on whether sequence data exists for a 
particular strain?  Using code like this

my $sa = $registry->get_adaptor("human", "core", "slice");
my $slice = $sa->fetch_by_region('chromosome', $chrom, $start, $end);
my $str_slice = $slice->get_by_strain("AK1");
my $indiv_seq = $str_slice -> seq ();

simply returns the reference sequence for the region even though the 
online genome browser shows missing data for that particular 
individual.  It also doesn't complement variations on the -1 strand, so 
I'm assuming the seq() method is just a bulk application of 
SeqEdit::apply_edit() or AlleleFeature::apply_edit() methods to the 
reference sequence rather than a function to actually retrieve strain 
sequence data.

So my question is twofold.  First, is going through the variations API 
and adding allele features to the reference the correct approach for 
reconstructing strain data?  Or is there a pre-existing method to do 
this that I am not aware of?  I am relatively new to the API and the 
presence of a view identical to my desired result in the online genome 
browser suggests that this could be the case.  Second, if this is the 
correct approach and I have to remove missing sequence data from my 
alignment myself, how do I determine where the missing data is for a 
given strain?

Thanks in advance,
Jason Caravas