[ensembl-dev] Affymetrix Probesets

Sat Jul 2 23:00:42 BST 2011

Hi,

I'm running into a number of transcripts that have so many probe_features
that my mysql service stalls while "Copying to tmp table".  Here is a
specific example:

my @probe_features = @{$probe_feature_adaptor->fetch_all_by_external_name('ENSMUST00000102049')};

This one line of code can take 5 minutes to run. Multiply that by thousands
and you can see why it takes over a week to gather probe information for all
genes in the mouse genome.
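
To find out which transcripts are the slow ones, each fetch can be timed
individually. The following is only a minimal sketch: time_fetch and the
stub $fetch are hypothetical names, and the stub stands in for the real
$probe_feature_adaptor->fetch_all_by_external_name call so that the snippet
runs without a database.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a fetch callback for one transcript ID and
# return the number of features plus the elapsed wall-clock seconds.
sub time_fetch {
    my ($fetch, $stable_id) = @_;
    my $t0       = [gettimeofday];
    my $features = $fetch->($stable_id);
    my $elapsed  = tv_interval($t0);
    return (scalar @{$features}, $elapsed);
}

# Stub standing in for the real adaptor call (an assumption, so the
# sketch runs without a database). In a real script this would be:
#   sub { $probe_feature_adaptor->fetch_all_by_external_name($_[0]) }
my $fetch = sub { return [ 'feature_a', 'feature_b' ] };

for my $stable_id ('ENSMUST00000102049') {
    my ($count, $elapsed) = time_fetch($fetch, $stable_id);
    printf "%s\t%d features\t%.3f s\n", $stable_id, $count, $elapsed;
}
```

Logging the feature count next to the elapsed time shows whether the slow
calls are the ones returning the most probe_features.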

I'm running my script and mysql on a powerful cluster and we've tried
cranking a few parameters to avoid the 'Copying to tmp table' step, but no
luck. Specifically, we tried increasing tmp_table_size and
max_heap_table_size to 4GB each.
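
For reference, the MySQL 5.x documentation states that the in-memory limit
for internal temporary tables is the smaller of tmp_table_size and
max_heap_table_size, which is why both were raised together, and that a
temporary table containing BLOB or TEXT columns is written to disk
regardless of either setting. Whether the second point applies to this
query is not known. The sketch below only illustrates the first rule;
parse_size and effective_tmp_limit are hypothetical helper names, not part
of MySQL or the Ensembl API.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Convert a my.cnf-style size such as "4G" or "512M" into bytes.
sub parse_size {
    my ($value) = @_;
    my %factor = (K => 1024, M => 1024**2, G => 1024**3);
    my ($number, $suffix) = $value =~ /^(\d+)([KMG]?)$/i
        or die "Unrecognised size: $value\n";
    return $number * ($suffix ? $factor{ uc $suffix } : 1);
}

# The in-memory temporary table limit is the smaller of the two settings.
sub effective_tmp_limit {
    my ($tmp_table_size, $max_heap_table_size) = @_;
    return min(parse_size($tmp_table_size), parse_size($max_heap_table_size));
}

# Raising only one setting leaves the lower one as the effective limit.
printf "%d bytes\n", effective_tmp_limit('4G', '16M');
```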

Any tips on the mysql parameters you run at Ensembl?

Any alternative suggestions for how to get the probe/probeset IDs per gene
(like we used to be able to from core using get_all_DBEntries)?

 - Alex

On 6/27/11 3:11 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:

> Hi Alex
> 
> This code looks like it is trying to get the probe/set xref info from the core
> DB. These data were moved to the funcgen DB quite some time ago.
> 
> For some examples of how to retrieve ProbeSet-level annotation, see this doc:
> 
> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/examples/microarray_annotation_example.pl?revision=1.3&root=ensembl&view=markup
> 
> Thanks
> 
> Nath
> 
> On 14 Jun 2011, at 15:32, Alex Kalderimis wrote:
> 
>> Dear Listizens, 
>> 
>> In trying to debug why code for getting Affymetrix Probeset
>> information had stopped working, I added some debug statements and it
>> seems that the data is no longer modelled as we expected it to be. The
>> code is below:
>> 
>> for my $slice (@slices) {
>>     my @genes = @{ $slice->get_all_Genes };
>>     $self->debug("Processing " . scalar(@genes) . " genes");
>>     my $processed_genes = 0;
>>     for my $gene (@genes) {
>>         my @transcripts = @{ $gene->get_all_Transcripts };
>>         for my $transcript (@transcripts) {
>>             my @xrefs = @{ $transcript->get_all_DBEntries };
>>             for my $xref (@xrefs) {
>>                 # Record every xref database name seen, for debugging.
>>                 $xref_types{$xref->dbname} = 1;
>>                 if ( $xref->dbname eq $db_name ) {
>>                     my @probe_features = @{
>>                         $self->get_feature_adaptor->fetch_all_by_probeset( $xref->display_id ) };
>>                     # One tab-separated output line per probe feature.
>>                     for my $probe_feature (@probe_features) {
>>                         my $line = join("\t",
>>                             $gene->stable_id,
>>                             $transcript->stable_id,
>>                             $xref->display_id,
>>                             $probe_feature->seq_region_name,
>>                             $probe_feature->seq_region_start,
>>                             $probe_feature->seq_region_end);
>>                         $self->debug($line);
>>                         print $out $line, "\n";
>>                     }
>>                 }
>>             }
>>         }
>>         $processed_genes++;
>>         if ($processed_genes % 100 == 0) {
>>             $self->debug("Processed $processed_genes genes, with the following XREF types: "
>>                 . join(", ", sort keys %xref_types));
>>         }
>>     }
>> }
>> 
>> The dbnames "AFFY_Drosophila_1" and "AFFY_Drosphila_2" (which are what I
>> am looking for) never appear. How can I better structure my code to
>> get the information I am after?
>> 
>> Alex.
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> Nathan Johnson
> Senior Scientific Programmer
> Ensembl Regulation
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> 
> http://www.ensembl.info/
> http://twitter.com/#!/ensembl