[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found

Alexander Pico apico at gladstone.ucsf.edu
Fri Jun 17 02:34:44 BST 2011


Hi Nath,

Thanks for the follow-up. It turns out that my strategy of commenting out
those 'from' and 'where' lines has led to incomplete extraction. I'm only
seeing ~1/4 of the probesets per Affy chip as I loop through "all probes"
(see code snippet below).

So, I'm back to the drawing board for how to extract the names of all the
probes and probesets per array for each gene. Though I don't need any
feature details, I can't seem to get around ProbeFeature and its slow tmp
table creation.

I'm having trouble following your suggestions below, though the performance
sounds amazing. Are you suggesting I add LOAD INDEX INTO CACHE to my Perl
program, or adapt it somewhere in the API?

I also want to make sure I've communicated the overall goal clearly, because
I think this should be dead simple.

I want to generate the following table:

| gene               | probe id        | chip      |
----------------------------------------------------
| ENSMUSG00000029484 | Msa.4366.0_s_at | Mu11ksubB |

That's it. No other details are needed.  I want to fill the table with all
probes (or probesets) for each gene in a given genome.

Here is my current strategy:
my $gene_adaptor      = $registry->get_adaptor($species, "core",    "Gene");
my $probe_adaptor     = $registry->get_adaptor($species, "funcgen", "Probe");
my $probe_set_adaptor = $registry->get_adaptor($species, "funcgen", "ProbeSet");

my $genes = $gene_adaptor->fetch_all();

while (my $gene = pop(@$genes)) {
    my @probes     = @{ $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene) };
    my @probe_sets = @{ $probe_set_adaptor->fetch_all_by_linked_transcript_Gene($gene) };
    my @all_probes = (@probes, @probe_sets);

    foreach my $probe (@all_probes) {
        my $array_list = $probe->get_all_Arrays();
        foreach my $array (@$array_list) {
            # ...collect the probe (or probeset) name
            # ...collect the array name
        }
    }
}
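
Once the inner loop above has the probe/probeset and array names in hand,
building the table is just a matter of deduping (gene, probe, chip) triples
and printing TSV. Here is a minimal pure-Perl sketch of that step; the
hard-coded triples stand in for whatever the API calls actually return:

```perl
use strict;
use warnings;

my %seen;   # dedupe identical (gene, probe, chip) rows
my @rows;

# Stand-in values for illustration; in the real script these
# come from the gene, probe, and array objects in the loop.
for my $triple (
    [ 'ENSMUSG00000029484', 'Msa.4366.0_s_at', 'Mu11ksubB' ],
    [ 'ENSMUSG00000029484', 'Msa.4366.0_s_at', 'Mu11ksubB' ],  # duplicate
    [ 'ENSMUSG00000029484', 'Msa.4366.0_s_at', 'Mu11ksubA' ],
) {
    my $key = join "\t", @$triple;
    push @rows, $key unless $seen{$key}++;
}

# One TSV line per unique (gene, probe, chip) row
print "$_\n" for @rows;
```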

Thanks!
 - Alex



On 6/9/11 9:22 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:

> Hi Alexander
> 
> I just wanted to come back to you on this, as I've been playing around with
> a few different ideas.
> 
> Firstly, your suggestion of removing the join to the ensembl_object tables
> does work, but it doesn't seem to have anywhere near the impact I was
> expecting. It roughly halves the time it takes to run the DBEntry query,
> 30 ms -> 15 ms on my machine. Not bad, but the majority of the time is
> actually taken up by querying for the ProbeFeatures themselves (I am still
> testing on your original code), which clock in at ~600 ms/query. I have
> implemented a few bits of streamlining in the API which should remove some
> redundancy issues. These will appear under the hood in v63, out later this
> month.
> 
> However, by far the biggest performance jump I saw came from pre-loading
> the indexes you want to use. We have key_buffer_size set to 4GB, which will
> easily take all of the following:
> 
> my $sql = 'LOAD INDEX INTO CACHE object_xref, xref, external_db,
> external_synonym, probe_feature, probe, homo_sapiens_core_62_37g.transcript,
> homo_sapiens_core_62_37g.transcript_stable_id';
> 
> $efgdb->dbc->do($sql);
> 
> Assuming you are now only using probe xrefs, you can remove the probe_feature
> table from this. You might want to add the core gene and gene_stable_id
> tables, or any other tables you are using in your dump script, so long as
> they fit in your key_buffer_size. Use 'SHOW TABLE STATUS' to get the index
> sizes.
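> 
> For example, something like the following would show the index sizes to sum
> against key_buffer_size (the database name here is illustrative):
> 
> ```sql
> -- Index_length is reported in bytes
> SHOW TABLE STATUS FROM homo_sapiens_funcgen_62_37g
>     WHERE Name IN ('object_xref', 'xref', 'external_db',
>                    'external_synonym', 'probe');
> ```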
> 
> For me this index load takes roughly 40 seconds, but it reduces the
> ProbeFeature query time to ~80 ms and also improves all the other queries
> used in the method. For the whole of chromosome 5, with ~300,000
> ProbeFeatures returned, the run drops from ~1500 s to just ~70 s. That's
> 95% faster!
> 
> Is that fast enough? ;)
> 
> Nath
>  
> 
> 
> 
> 
> 
> On 26 May 2011, at 17:04, Nathan Johnson wrote:
> 
>> Hi Alexander
>> 
>> My apologies for the tardy reply; I've been having trouble with my dev
>> list mail filter.
>> 
>> This is an issue I have been pondering for a while, but have not made much
>> progress with as some of the solutions involve a major overhaul of some of
>> the underlying code.  However, it looks like I may be able to update the code
>> underlying fetch_all_by_linked_transcript_Gene. But this is targeting the
>> inter DB and DBEntry queries, not the apparently problematic probe feature
>> query.
>> 
>> wrt commenting out the 'from' and 'where' statements in the
>> DBEntryAdaptor: this appears to be a valid fix. However, it does assume
>> that there aren't any stale object_xref entries, i.e. ones that no longer
>> have a valid probe feature link. The funcgen release process clears away
>> all old data, and also performs an orphaned object_xref check, so this
>> assumption is fairly safe.
>> 
>> wrt the ResultFeatureAdaptor comments, this is not related to the code
>> snippet you specified. And in fact the support for probe_feature based result
>> features is being removed for v63 as it is not used anymore.
>> 
>> I'm a bit confused as to where the slowdown is coming from exactly, so I
>> will run the script on our local DBs to see where the bottleneck is. It
>> may be that we can turn it on its head and start with an array-restricted
>> probe feature query. This would allow us to filter out the Affy ST arrays
>> up front, with the caveat that there will be probe features with no xrefs.
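>> 
>> As a rough sketch, the up-front filter might look something like this
>> (the class() value 'AFFY_ST' for the all-exon arrays is an assumption
>> here, as is using the funcgen ArrayAdaptor's fetch_all):
>> 
>> ```perl
>> use strict;
>> use warnings;
>> use Bio::EnsEMBL::Registry;
>> 
>> my $registry = 'Bio::EnsEMBL::Registry';
>> $registry->load_registry_from_db(
>>     -host => 'ensembldb.ensembl.org',
>>     -user => 'anonymous',
>> );
>> 
>> my $array_adaptor = $registry->get_adaptor('human', 'funcgen', 'Array');
>> 
>> # Keep everything except the all-exon (Affy ST) arrays
>> my @kept_arrays =
>>     grep { ($_->class || '') ne 'AFFY_ST' } @{ $array_adaptor->fetch_all };
>> 
>> # @kept_arrays could then seed an array-restricted ProbeFeature query
>> ```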
>> 
>> Can you also send me the full sql which is causing the tmp table to be
>> created?
>> 
>> One thing I'd also like to point out is that you are fetching ProbeFeature
>> xref data; for Affy arrays, the associated probe set may actually fail our
>> transcript mapping pipeline. The transcript xrefs are actually stored at
>> the Probe or ProbeSet level, not at the feature level.
>> 
>> Thanks
>> 
>> Nath
>> 
>> 
>> 
>> On 26 May 2011, at 03:37, Alexander Pico wrote:
>> 
>>> I found a workaround that dramatically improves performance, which others
>>> might find useful. It involves commenting out lines that add unnecessary
>>> (and slow) table query parameters for Probe and ProbeSet queries based on
>>> transcript. I had to skip querying ProbeFeature altogether due to the
>>> massive table size in human and mouse. With these minor edits, I can
>>> retrieve all probe and probeset annotations (though not feature details)
>>> for every human gene in a few hours, rather than weeks!
>>> 
>>> Comment out the following lines:
>>> ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/DBEntryAdaptor.pm
>>> 359,360c359,360
>>> <       $from_sql  = 'probe p, ';
>>> <       $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>> ---
>>>> #     $from_sql  = 'probe p, ';
>>>> #     $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>> 363,364c363,364
>>> <       $from_sql  = 'probe_set ps, ';
>>> <       $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>> ---
>>>> #     $from_sql  = 'probe_set ps, ';
>>>> #     $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>> 
>>> 
>>> On 5/24/11 10:14 AM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>> 
>>>> Looks like a known problem. The API code has the following comment notes:
>>>> 
>>>> Funcgen/DBSQL/ResultFeatureAdaptor.pm: line 1259
>>>> #Not straight forward without creating tmp table
>>>> 
>>>> In version 60, the note in this area stated:
>>>> #This join between sr and pf is causing the slow down.  Need to select
>>>> right join for this.
>>>> #just do two separate queries for now.
>>>> 
>>>> 
>>>> Indeed, the tmp table triggered by the join is still causing a slowdown.
>>>> Let us know if you come up with any workarounds or solutions to this tmp
>>>> table issue. Thanks!
>>>> - Alex
>>>> 
>>>> 
>>>> On 5/23/11 6:28 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I'm looking for a better way to get probe features. I'm currently using
>>>>> 'fetch_all_by_linked_transcript_Gene()', but for species with all exon
>>>>> arrays, this can take days...
>>>>> 
>>>>> Other than going in and deleting probesets from the funcgen databases
>>>>> (local
>>>>> copies), how can I get around processing certain arrays, like the all exon
>>>>> arrays, and just collect everything else?
>>>>> 
>>>>> 
>>>>> Here's my current code snippet:
>>>>> 
>>>>> my $probe_adaptor = $registry->get_adaptor($species, "funcgen",
>>>>> "ProbeFeature");
>>>>> 
>>>>> my $probe_features =
>>>>> $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene);
>>>>> 
>>>>> foreach my $pf (@$probe_features) {
>>>>>     # do stuff
>>>>> }
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> List admin (including subscribe/unsubscribe):
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> Nathan Johnson
>> Senior Scientific Programmer
>> Ensembl Regulation
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> 
>> http://www.ensembl.info/
>> http://twitter.com/#!/ensembl
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 





