[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found

Fri Jun 17 03:53:26 BST 2011

And here is my original strategy, which uses ProbeFeature and leads to the
tmp table issue:
my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
my $probe_feature_adaptor =
    $registry->get_adaptor($species, "funcgen", "ProbeFeature");

my $genes = $gene_adaptor->fetch_all();
while (my $gene = pop(@$genes)) {
    my @probe_features =
        @{$probe_feature_adaptor->fetch_all_by_linked_transcript_Gene($gene)};

    foreach my $pf (@probe_features) {
        my $probe      = $pf->probe();
        my $array_list = $probe->get_all_Arrays();
        foreach my $array (@$array_list) {
            # ...collect probe (or probeset) name
            # ...collect array name
        }
    }
}
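
For what it's worth, the "collect" step inside the inner loop could be
sketched roughly as below. This is only an illustration on plain Perl data
rather than live API objects: collect_rows, the exon-array name pattern and
the second sample array name are all made up for the example, and real code
would take the probe and array names from the objects returned by the API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): build unique
# "gene | probe id | chip" rows from [probe_name, array_name] pairs,
# skipping any array whose name matches $skip_re.
sub collect_rows {
    my ($gene_id, $pairs, $skip_re) = @_;
    my (%seen, @rows);
    foreach my $pair (@$pairs) {
        my ($probe_name, $array_name) = @$pair;
        next if defined $skip_re && $array_name =~ $skip_re;
        my $row = join(' | ', $gene_id, $probe_name, $array_name);
        push @rows, $row unless $seen{$row}++;    # drop duplicate rows
    }
    return \@rows;
}

# Sample data: the first pair is the example row from this thread;
# "SomeExon-1_0-st" is an invented array name for illustration only.
my @pairs = (
    ['Msa.4366.0_s_at', 'Mu11ksubB'],
    ['Msa.4366.0_s_at', 'Mu11ksubB'],          # duplicate, e.g. via two transcripts
    ['1234567',         'SomeExon-1_0-st'],    # filtered out below
);

# Assumed naming pattern for exon arrays; adjust to the real array names.
my $rows = collect_rows('ENSMUSG00000029484', \@pairs, qr/Exon/i);
print "| $_ |\n" for @$rows;
```

Keying the %seen hash on the whole row means a probe reached through several
transcripts of the same gene is only reported once per array.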

 - Alex

On 6/16/11 6:34 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:

> Hi Nath,
> 
> Thanks for the follow-up. It turns out that my strategy of commenting out
> those 'from' and 'where' lines has led to incomplete extraction. I'm only
> seeing ~1/4 of the probesets per Affy chip as I loop through "all probes"
> (see code snippets below).
> 
> So, I'm back to the drawing board for how to extract the names of all the
> probes and probesets per array for each gene. Though I don't need any
> feature details, I can't seem to get around ProbeFeature and the slow tmp
> table creation.
> 
> I'm having trouble following your suggestions below, though the performance
> sounds amazing. Are you suggesting I add LOAD INDEX INTO CACHE to my Perl
> program or adapt it in the API somewhere?
> 
> I also want to make sure I've communicated the overall goal clearly because I
> think this should be dead simple.
> 
> I want to generate the following table:
> 
> | gene               | probe id        | chip      |
> |--------------------|-----------------|-----------|
> | ENSMUSG00000029484 | Msa.4366.0_s_at | Mu11ksubB |
> 
> That's it. No other details are needed.  I want to fill the table with all
> probes (or probesets) for each gene in a given genome.
> 
> Here is my current strategy:
> my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
> my $probe_adaptor = $registry->get_adaptor($species, "funcgen", "Probe");
> my $probe_set_adaptor =
>     $registry->get_adaptor($species, "funcgen", "Probeset");
> 
> my $genes = $gene_adaptor->fetch_all();
> while (my $gene = pop(@$genes)) {
>     my @probes =
>         @{$probe_adaptor->fetch_all_by_linked_transcript_Gene($gene)};
>     my @probe_sets =
>         @{$probe_set_adaptor->fetch_all_by_linked_transcript_Gene($gene)};
>     my @all_probes = (@probes, @probe_sets);
> 
>     foreach my $probe (@all_probes) {
>         my $array_list = $probe->get_all_Arrays();
>         foreach my $array (@$array_list) {
>             # ...collect probe (or probeset) name
>             # ...collect array name
>         }
>     }
> }
> 
> Thanks!
>  - Alex
> 
> 
> 
> On 6/9/11 9:22 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:
> 
>> Hi Alexander
>> 
>> I just wanted to come back to you on this as I've been playing around with a
>> few different ideas.
>> 
>> Firstly, your suggestion of removing the join to the ensembl_object tables
>> does work, but it doesn't seem to have anywhere near the impact I was
>> expecting.  It roughly halves the time it takes to run the DBEntry query,
>> 30ms -> 15ms on my machine. Not bad, but the majority of the time is
>> actually taken up by querying for the ProbeFeatures themselves (I am still
>> testing on your original code), which clock in at ~600ms/query.  I have
>> implemented a few bits of streamlining in the API which should remove some
>> redundancy issues. These will appear under the hood in v63, out later this
>> month.
>> 
>> However, by far the biggest performance jump I saw was by pre loading the
>> indexes you want to use.  We have the key_buffer_size set to 4GB which will
>> easily take all of the following:
>> 
>> my $sql = 'LOAD INDEX INTO CACHE object_xref, xref, external_db,
>> external_synonym, probe_feature, probe, homo_sapiens_core_62_37g.transcript,
>> homo_sapiens_core_62_37g.transcript_stable_id';
>> 
>> $efgdb->dbc->do($sql);
>> 
>> Assuming you are now only using probe xrefs, you can remove the probe_feature
>> table from this. You might want to add the core gene and gene_stable_id
>> tables to this, or any other tables you are using in your dump script, so
>> long as they fit in your key_buffer_size. Use 'show table status' to get the
>> index sizes.
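
To make the "fits in key_buffer_size" check concrete, here is a rough sketch.
It assumes the rows come from SHOW TABLE STATUS (which reports an Index_length
column in bytes), e.g. fetched as hashrefs via DBI. The helper name and the
byte counts in the sample data are invented for illustration, not measured
from any Ensembl database.

```perl
use strict;
use warnings;

# Hypothetical helper: sum Index_length (bytes) over the wanted tables and
# report whether the total fits in the given key buffer size.
sub indexes_fit {
    my ($status_rows, $wanted, $key_buffer_size) = @_;
    my %want  = map { $_ => 1 } @$wanted;
    my $total = 0;
    foreach my $row (@$status_rows) {
        next unless $want{ $row->{Name} };
        $total += $row->{Index_length} // 0;
    }
    return ($total, $total <= $key_buffer_size ? 1 : 0);
}

# Made-up index sizes, for illustration only.
my @status = (
    { Name => 'object_xref',   Index_length => 1_500_000_000 },
    { Name => 'xref',          Index_length =>   200_000_000 },
    { Name => 'probe',         Index_length =>   900_000_000 },
    { Name => 'probe_feature', Index_length => 3_000_000_000 },
);
my $four_gb = 4 * 1024**3;    # the 4GB key_buffer_size mentioned above

my ($total, $fits) =
    indexes_fit(\@status, [qw(object_xref xref probe)], $four_gb);
printf "%d bytes of index, fits: %s\n", $total, $fits ? 'yes' : 'no';
```

With these sample numbers, leaving probe_feature out of the list is what
keeps the total under the buffer size.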
>> 
>> For me this index load takes roughly 40 seconds, but reduces the ProbeFeature
>> query time to ~80ms, and also improves all the other queries used in the
>> method. For the whole of chromosome 5, with ~300000 ProbeFeatures returned,
>> I get a drop from ~1500s to just ~70s. That's a ~95% reduction in run time!
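
As a quick check of that figure, using only the approximate timings quoted
above (variable names are just for this example):

```perl
use strict;
use warnings;

my ($before_s, $after_s) = (1500, 70);    # approximate timings quoted above
my $reduction_pct = 100 * ($before_s - $after_s) / $before_s;
my $speedup       = $before_s / $after_s;
printf "%.1f%% less run time, about %.1fx faster\n", $reduction_pct, $speedup;
```

So the drop corresponds to roughly a 21-fold speedup, not counting the
~40 seconds spent on the index load itself.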
>> 
>> Is that fast enough? ;)
>> 
>> Nath
>>  
>> 
>> 
>> 
>> 
>> 
>> On 26 May 2011, at 17:04, Nathan Johnson wrote:
>> 
>>> Hi Alexander
>>> 
>>> Many apologies for the tardy reply, I've been having trouble with my dev
>>> list mail filter.
>>> 
>>> This is an issue I have been pondering for a while, but have not made much
>>> progress with as some of the solutions involve a major overhaul of some of
>>> the underlying code.  However, it looks like I may be able to update the
>>> code underlying fetch_all_by_linked_transcript_Gene. But this is targeting
>>> the inter DB and DBEntry queries, not the apparently problematic probe
>>> feature query.
>>> 
>>> wrt commenting out the 'from' and 'where' statements in the
>>> DBEntryAdaptor: this appears to be a valid fix.  However, it does make an
>>> assumption that there aren't any old object_xref records, i.e. ones that no
>>> longer have a valid probe feature link. The funcgen release process clears
>>> away all old data, and also performs an orphaned object_xref check, so this
>>> assumption is fairly safe.
>>> 
>>> wrt the ResultFeatureAdaptor comments, this is not related to the code
>>> snippet you specified. In fact, the support for probe_feature based
>>> result features is being removed for v63 as it is not used anymore.
>>> 
>>> I'm a bit confused as to where the slowdown is coming from exactly, so I
>>> will run the script on our local DBs to see where the bottleneck is.
>>> It may be that we can turn it on its head and start with an array-restricted
>>> probe feature query.  This will allow us to filter out the Affy ST arrays up
>>> front, with the caveat that there will be probe features with no xrefs.
>>> 
>>> Can you also send me the full SQL which is causing the tmp table to be
>>> created?
>>> 
>>> One thing I'd also like to point out is that you are fetching ProbeFeature
>>> xref data; for Affy arrays the associated probe set may actually fail our
>>> transcript mapping pipeline.  The transcript xrefs are actually stored at
>>> the Probe or ProbeSet level, not the feature level.
>>> 
>>> Thanks
>>> 
>>> Nath
>>> 
>>> 
>>> 
>>> On 26 May 2011, at 03:37, Alexander Pico wrote:
>>> 
>>>> I found a workaround that dramatically improves performance that others
>>>> might find useful. It involves commenting out lines that add unnecessary
>>>> (and slow) table query parameters for Probe and ProbeSet queries based on
>>>> transcript.  I had to skip querying ProbeFeature altogether due to the
>>>> massive table size in human and mouse. With these minor edits, I can
>>>> retrieve all probe and probeset annotations (though not feature details)
>>>> for every human gene in a few hours, rather than weeks!
>>>> 
>>>> Comment out the following lines:
>>>> ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/DBEntryAdaptor.pm
>>>> 359,360c359,360
>>>> <       $from_sql  = 'probe p, ';
>>>> <       $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>>> ---
>>>> > #     $from_sql  = 'probe p, ';
>>>> > #     $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>>> 363,364c363,364
>>>> <       $from_sql  = 'probe_set ps, ';
>>>> <       $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>>> ---
>>>> > #     $from_sql  = 'probe_set ps, ';
>>>> > #     $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>>> 
>>>> 
>>>> On 5/24/11 10:14 AM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>>> 
>>>>> Looks like a known problem. The API code has the following comment notes:
>>>>> 
>>>>> Funcgen/DBSQL/ResultFeatureAdaptor.pm: line 1259
>>>>> #Not straight forward without creating tmp table
>>>>> 
>>>>> In version 60, the note in this area stated:
>>>>> #This join between sr and pf is causing the slow down.  Need to select
>>>>> right join for this.
>>>>> #just do two separate queries for now.
>>>>> 
>>>>> 
>>>>> Indeed, the tmp table triggered by the join is still causing a slow down.
>>>>> Let us know if you come up with any workarounds or solutions to this tmp
>>>>> table issue. Thanks!
>>>>> - Alex
>>>>> 
>>>>> 
>>>>> On 5/23/11 6:28 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I'm looking for a better way to get probe features. I'm currently using
>>>>>> 'fetch_all_by_linked_transcript_Gene()', but for species with all exon
>>>>>> arrays, this can take days...
>>>>>> 
>>>>>> Other than going in and deleting probesets from the funcgen databases
>>>>>> (local copies), how can I get around processing certain arrays, like the
>>>>>> all exon arrays, and just collect everything else?
>>>>>> 
>>>>>> 
>>>>>> Here's my current code snippet:
>>>>>> 
>>>>>> my $probe_adaptor =
>>>>>>     $registry->get_adaptor($species, "funcgen", "ProbeFeature");
>>>>>> 
>>>>>> my $probe_features =
>>>>>>     $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene);
>>>>>> 
>>>>>> foreach my $pf (@$probe_features) {
>>>>>>     # do stuff
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Dev mailing list    Dev at ensembl.org
>>>>>> List admin (including subscribe/unsubscribe):
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>>> Nathan Johnson
>>> Senior Scientific Programmer
>>> Ensembl Regulation
>>> European Bioinformatics Institute
>>> Wellcome Trust Genome Campus
>>> Hinxton
>>> Cambridge CB10 1SD
>>> 
>>> http://www.ensembl.info/
>>> http://twitter.com/#!/ensembl