[ensembl-dev] Affymetrix Probesets

Alexander Pico apico at gladstone.ucsf.edu
Sun Jul 3 03:47:10 BST 2011


Hello again,

I flipped the problem around and am now retrieving all gene IDs per
probe/probeset. This turns out to be much faster: ~3 minutes per array for
the mouse genome (versus 5 minutes to retrieve the probe_features for a
single mouse gene, in some cases).

Here's a code example that others may find helpful. It's different from the
other examples found here:
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/exa
mples/microarray_annotation_example.pl?revision=1.1&root=ensembl&view=markup

#fetch example array
my $array_name = 'MOE430B';
my $array = $array_adaptor->fetch_by_name_vendor($array_name);
my $vendor = $array->vendor();

#get all probes/probesets
my @plist = ();
  if ($vendor  =~ /^\'AFFY/i) {
    @plist = @{$array->get_all_ProbeSets()};
  } else {
    @plist = @{$array->get_all_Probes()};
  }

#get probe/probeset name
foreach my $p (@plist) {
  my $p_name = '';
  if ($vendor  =~ /^\'AFFY/i) {
     $p_name = $p->name();
  } else {
     $p_name = $p->get_probename($array_name());
  }

  #extract gene from dbentries
  my @dbeList = @{$p->get_all_DBEntries()};
  my $dbe_dbname = $dbe->dbname();
  my $gene = '';
  if ($dbe_dbname =~ /core_Transcript$/){
     $gene = 
$gene_adaptor->fetch_by_transcript_stable_id($dbe->primary_id());
  } else {
    next;
  }

Works with both probes and probesets, and you end up with a Gene object.
Not exactly straightforward, but it seems to work. Any advice on
improvements?

It'd also be great if there was a method that returned an array of all the
"Arrays", either by vendor or a complete list.  Without this method, I'm
left with hardcoding a list of arrays to collect and updating each release,
rather than just collecting everything you guys have put into the release.

 - Alex


On 7/2/11 3:00 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:

> Hi,
> 
> I'm running into a number of transcripts that have so many probe_features
> that my mysql service stalls while "Copying to tmp table".  Here is a
> specific example:
> 
> my @probe_features = @{$probe_feature_adaptor->fetch_all_by_external_name('
> ENSMUST00000102049')};
> 
> This one line of code can take 5 minutes to run. Multiply that by thousands
> and you can see why it takes over a week to gather probe information for all
> genes in the mouse genome.
> 
> I'm running my script and mysql on a powerful cluster and we've tried
> cranking a few parameters to avoid the 'Copying to tmp table' step, but no
> luck. Specifically, we tried increasing tmp_table_size and
> max_heap_table_size to 4GB each.
> 
> Any tips on the mysql parameters you run at Ensembl?
> 
> Any alternative suggestions for how to get the probe/probeset IDs per gene
> (like we used to be able to from core using get_all_DBEntries)?
> 
>  - Alex
> 
> 
> On 6/27/11 3:11 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:
> 
>> Hi Alex
>> 
>> This code looks like is is trying to get the probe/set xref info from the
>> core
>> DB. These data were moved to the funcgen DB quite some time ago.
>> 
>> For some example on how to retrieve PorbeSet level annotation see this doc:
>> 
>> 
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/exam>>
p
>> les/microarray_annotation_example.pl?revision=1.3&root=ensembl&view=markup
>> 
>> Thanks
>> 
>> Nath
>> 
>> On 14 Jun 2011, at 15:32, Alex Kalderimis wrote:
>> 
>>> Dear Listizens,
>>> 
>>> In trying to debug why code for getting Affymetrix Probeset
>>> information had stopped working, I added some debug statements and it
>>> seems that the data is no longer modelled as we expected it to be. The
>>> code is below:
>>> 
>>> 69  for my $slice (@slices) {
>>> 70  my @genes = @{ $slice->get_all_Genes };
>>> 71  $self->debug("Processing " . scalar(@genes) . " genes");
>>> 72  my $processed_genes = 0;
>>> 73  for my $gene (@genes) {
>>> 74      my @transcripts = @{ $gene->get_all_Transcripts };
>>> 75      for my $transcript (@transcripts) {
>>> 76          my @xrefs = @{ $transcript->get_all_DBEntries };
>>> 77          for my $xref (@xrefs) {
>>> 78              $xref_types{$xref->dbname} = 1;
>>> 79              if ( $xref->dbname eq $db_name ) {
>>> 80                  my @probe_features = @{
>>> $self->get_feature_adaptor->fetch_all_by_probeset( $xref->display_id ) };
>>> 81                  for my $probe_feature (@probe_features) {
>>> 82                      my $line = join("\t",
>>> 83                          $gene->stable_id,
>>> 84                          $transcript->stable_id,
>>> 85                          $xref->display_id,
>>> 86                          $probe_feature->seq_region_name,
>>> 87                          $probe_feature->seq_region_start,
>>> 88                          $probe_feature->seq_region_end);
>>> 89                      $self->debug($line);
>>> 90                      print $out $line, "\n";
>>> 91                  }
>>> 92              }
>>> 93          }
>>> 94      }
>>> 95      $processed_genes++;
>>> 96      if ($processed_genes % 100 == 0) {
>>> 97          $self->debug("Processed $processed_genes genes, with the
>>> following XREF types: " . join(", ", sort keys %xref_types));
>>> 98      }
>>> 99  }
>>> 
>>> The dbnames "AFFY_Drosophila_1" and "AFFY_Drosphila_2" (which are what I
>>> am looking for) never appear. How can I better structure my code to
>>> get the information I am after?
>>> 
>>> Alex.
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>> 
>> Nathan Johnson
>> Senior Scientific Programmer
>> Ensembl Regulation
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> 
>> http://www.ensembl.info/
>> http://twitter.com/#!/ensembl
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/






More information about the Dev mailing list