[ensembl-dev] Affymetrix Probesets
Alexander Pico
apico at gladstone.ucsf.edu
Sun Jul 3 03:47:10 BST 2011
Hello again,
I flipped the problem around and am now retrieving all gene IDs per
probe/probeset. This turns out to be much faster: ~3 minutes per array for
the mouse genome (versus 5 minutes to retrieve the probe_features for a
single mouse gene, in some cases).
Here's a code example that others may find helpful. It's different from the
other examples found here:
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/exa
mples/microarray_annotation_example.pl?revision=1.1&root=ensembl&view=markup
#fetch example array
my $array_name = 'MOE430B';
my $array = $array_adaptor->fetch_by_name_vendor($array_name);
my $vendor = $array->vendor();
#get all probes/probesets
my @plist = ();
if ($vendor =~ /^\'AFFY/i) {
@plist = @{$array->get_all_ProbeSets()};
} else {
@plist = @{$array->get_all_Probes()};
}
#get probe/probeset name
foreach my $p (@plist) {
my $p_name = '';
if ($vendor =~ /^\'AFFY/i) {
$p_name = $p->name();
} else {
$p_name = $p->get_probename($array_name());
}
#extract gene from dbentries
my @dbeList = @{$p->get_all_DBEntries()};
my $dbe_dbname = $dbe->dbname();
my $gene = '';
if ($dbe_dbname =~ /core_Transcript$/){
$gene =
$gene_adaptor->fetch_by_transcript_stable_id($dbe->primary_id());
} else {
next;
}
Works with both probes and probesets, and you end up with a Gene object.
Not exactly straightforward, but it seems to work. Any advice on
improvements?
It'd also be great if there was a method that returned an array of all the
"Arrays", either by vendor or a complete list. Without this method, I'm
left with hardcoding a list of arrays to collect and updating each release,
rather than just collecting everything you guys have put into the release.
- Alex
On 7/2/11 3:00 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
> Hi,
>
> I'm running into a number of transcripts that have so many probe_features
> that my mysql service stalls while "Copying to tmp table". Here is a
> specific example:
>
> my @probe_features = @{$probe_feature_adaptor->fetch_all_by_external_name('
> ENSMUST00000102049')};
>
> This one line of code can take 5 minutes to run. Multiply that by thousands
> and you can see why it takes over a week to gather probe information for all
> genes in the mouse genome.
>
> I'm running my script and mysql on a powerful cluster and we've tried
> cranking a few parameters to avoid the 'Copying to tmp table' step, but no
> luck. Specifically, we tried increasing tmp_table_size and
> max_heap_table_size to 4GB each.
>
> Any tips on the mysql parameters you run at Ensembl?
>
> Any alternative suggestions for how to get the probe/probeset IDs per gene
> (like we used to be able to from core using get_all_DBEntries)?
>
> - Alex
>
>
> On 6/27/11 3:11 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:
>
>> Hi Alex
>>
>> This code looks like is is trying to get the probe/set xref info from the
>> core
>> DB. These data were moved to the funcgen DB quite some time ago.
>>
>> For some example on how to retrieve PorbeSet level annotation see this doc:
>>
>>
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-functgenomics/scripts/exam>>
p
>> les/microarray_annotation_example.pl?revision=1.3&root=ensembl&view=markup
>>
>> Thanks
>>
>> Nath
>>
>> On 14 Jun 2011, at 15:32, Alex Kalderimis wrote:
>>
>>> Dear Listizens,
>>>
>>> In trying to debug why code for getting Affymetrix Probeset
>>> information had stopped working, I added some debug statements and it
>>> seems that the data is no longer modelled as we expected it to be. The
>>> code is below:
>>>
>>> 69 for my $slice (@slices) {
>>> 70 my @genes = @{ $slice->get_all_Genes };
>>> 71 $self->debug("Processing " . scalar(@genes) . " genes");
>>> 72 my $processed_genes = 0;
>>> 73 for my $gene (@genes) {
>>> 74 my @transcripts = @{ $gene->get_all_Transcripts };
>>> 75 for my $transcript (@transcripts) {
>>> 76 my @xrefs = @{ $transcript->get_all_DBEntries };
>>> 77 for my $xref (@xrefs) {
>>> 78 $xref_types{$xref->dbname} = 1;
>>> 79 if ( $xref->dbname eq $db_name ) {
>>> 80 my @probe_features = @{
>>> $self->get_feature_adaptor->fetch_all_by_probeset( $xref->display_id ) };
>>> 81 for my $probe_feature (@probe_features) {
>>> 82 my $line = join("\t",
>>> 83 $gene->stable_id,
>>> 84 $transcript->stable_id,
>>> 85 $xref->display_id,
>>> 86 $probe_feature->seq_region_name,
>>> 87 $probe_feature->seq_region_start,
>>> 88 $probe_feature->seq_region_end);
>>> 89 $self->debug($line);
>>> 90 print $out $line, "\n";
>>> 91 }
>>> 92 }
>>> 93 }
>>> 94 }
>>> 95 $processed_genes++;
>>> 96 if ($processed_genes % 100 == 0) {
>>> 97 $self->debug("Processed $processed_genes genes, with the
>>> following XREF types: " . join(", ", sort keys %xref_types));
>>> 98 }
>>> 99 }
>>>
>>> The dbnames "AFFY_Drosophila_1" and "AFFY_Drosphila_2" (which are what I
>>> am looking for) never appear. How can I better structure my code to
>>> get the information I am after?
>>>
>>> Alex.
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> Nathan Johnson
>> Senior Scientific Programmer
>> Ensembl Regulation
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>>
>> http://www.ensembl.info/
>> http://twitter.com/#!/ensembl
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Dev mailing list Dev at ensembl.org
>> List admin (including subscribe/unsubscribe):
>> http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list