[ensembl-dev] Affymetrix Probesets
Alexander Pico
apico at gladstone.ucsf.edu
Sat Jul 2 23:00:42 BST 2011
Hi,
I'm running into a number of transcripts that have so many probe_features
that my mysql service stalls while "Copying to tmp table". Here is a
specific example:
my @probe_features = @{$probe_feature_adaptor->fetch_all_by_external_name('
ENSMUST00000102049')};
This one line of code can take 5 minutes to run. Multiply that by thousands
and you can see why it takes over a week to gather probe information for all
genes in the mouse genome.
I'm running my script and mysql on a powerful cluster and we've tried
cranking a few parameters to avoid the 'Copying to tmp table' step, but no
luck. Specifically, we tried increasing tmp_table_size and
max_heap_table_size to 4GB each.
Any tips on the mysql parameters you run at Ensembl?
Any alternative suggestions for how to get the probe/probeset IDs per gene
(like we used to be able to from core using get_all_DBEntries)?
- Alex
On 6/27/11 3:11 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:
> Hi Alex
>
> This code looks like is is trying to get the probe/set xref info from the core
> DB. These data were moved to the funcgen DB quite some time ago.
>
> For some example on how to retrieve PorbeSet level annotation see this doc:
>
> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/examp
> les/microarray_annotation_example.pl?revision=1.3&root=ensembl&view=markup
>
> Thanks
>
> Nath
>
> On 14 Jun 2011, at 15:32, Alex Kalderimis wrote:
>
>> Dear Listizens,
>>
>> In trying to debug why code for getting Affymetrix Probeset
>> information had stopped working, I added some debug statements and it
>> seems that the data is no longer modelled as we expected it to be. The
>> code is below:
>>
>> 69 for my $slice (@slices) {
>> 70 my @genes = @{ $slice->get_all_Genes };
>> 71 $self->debug("Processing " . scalar(@genes) . " genes");
>> 72 my $processed_genes = 0;
>> 73 for my $gene (@genes) {
>> 74 my @transcripts = @{ $gene->get_all_Transcripts };
>> 75 for my $transcript (@transcripts) {
>> 76 my @xrefs = @{ $transcript->get_all_DBEntries };
>> 77 for my $xref (@xrefs) {
>> 78 $xref_types{$xref->dbname} = 1;
>> 79 if ( $xref->dbname eq $db_name ) {
>> 80 my @probe_features = @{
>> $self->get_feature_adaptor->fetch_all_by_probeset( $xref->display_id ) };
>> 81 for my $probe_feature (@probe_features) {
>> 82 my $line = join("\t",
>> 83 $gene->stable_id,
>> 84 $transcript->stable_id,
>> 85 $xref->display_id,
>> 86 $probe_feature->seq_region_name,
>> 87 $probe_feature->seq_region_start,
>> 88 $probe_feature->seq_region_end);
>> 89 $self->debug($line);
>> 90 print $out $line, "\n";
>> 91 }
>> 92 }
>> 93 }
>> 94 }
>> 95 $processed_genes++;
>> 96 if ($processed_genes % 100 == 0) {
>> 97 $self->debug("Processed $processed_genes genes, with the
>> following XREF types: " . join(", ", sort keys %xref_types));
>> 98 }
>> 99 }
>>
>> The dbnames "AFFY_Drosophila_1" and "AFFY_Drosphila_2" (which are what I
>> am looking for) never appear. How can I better structure my code to
>> get the information I am after?
>>
>> Alex.
>>
>>
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
> Nathan Johnson
> Senior Scientific Programmer
> Ensembl Regulation
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
>
> http://www.ensembl.info/
> http://twitter.com/#!/ensembl
>
>
>
>
>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list