[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found

Thu Jun 9 17:22:13 BST 2011

Hi Alexander

I just wanting to come back to you on this as I've been playing around with a few different ideas.

Firstly, your suggestion of removing the join to the ensembl_object tables does work, but it doesn't seem to have anywhere near the impact I was expecting.  It does roughly half the time it takes to run the DBEntry query, 30ms -> 15 ms on my machine. Not bad, but the majority of the time is actually taken up by querying for the ProbeFeatures themselves (I am still testing on your original code), which clock in at ~600ms/query.  I have implemented a few bits of streamlining in the API which should remove some redundancy issues.  These will appear under the hood in v63, out later this month. 

However, by far the biggest performance jump I saw was by pre loading the indexes you want to use.  We have the key_buffer_size set to 4GB which will easily take all of the following:

my $sql = 'LOAD INDEX INTO CACHE object_xref, xref, external_db, external_synonym, probe_feature, probe, homo_sapiens_core_62_37g.transcript, homo_sapiens_core_62_37g.transcript_stable_id';

$efgdb->dbc->do($sql);

Assuming you are now only using probe xrefs, you can remove the probe_feature table from this. You might want to add the core gene and gene_stable_id table to this, or any other tables you are using in you dump script, so long as they fit in your key_buffer_size. Use 'show table status' to get the index sizes.

For me this index load takes roughly 40 seconds, but reduces the ProbeFeature query time to ~80ms, and also improves all the other querys used in the method. The result for the whole of chromosome 5 for ~300000 ProbeFeatures returned, I get a drop from ~1500s to just ~70s. That's 95% faster!

Is that fast enough? ;)

Nath

On 26 May 2011, at 17:04, Nathan Johnson wrote:

> Hi Alexander
> 
> Much apologies for the tardy reply, I've been having trouble with my dev list mail filter.  
> 
> This is an issue I have been pondering for a while, but have not made much progress with as some of the solutions involve a major overhaul of some of the underlying code.  However, it looks like I may be able to update the code underlying fetch_all_by_linked_transcript_Gene. But this is targeting the inter DB and DBEntry queries, not the apparently problematic probe feature query.
> 
> wrt to commenting out the 'from' and 'where' statements in the DBEntryAdaptor. This appears to be a valid fix.  However it does make an assumption that there aren't any old object_xref table i.e. that no longer have a probe feature valid link. The funcgen release process clears away all old data, and also performs an orphaned object_xref check, so this assumption is fairly safe.
> 
> wrt the ResultFeatureAdaptor comments, this is not related to the code snippet you specified. And in fact the support for probe_feature based result features is being removed for v63 as it is not used anymore.
> 
> I'm a but confused as to where the slow down is coming from exactly so I will run the script on our local DBs to see where the bottle neck is coming from. It maybe that we can turn it on it's head and start with an array restricted probe feature query.  This will allow us to filter out the affy ST arrays up front, with the caveat that there will be probe features with no xrefs.
> 
> Can you also send me the full sql which is causing the tmp table to be created?
> 
> One thing I also like to point out is that you are fetching ProbeFeature xref data, for Affy arrays the associated probe set may actually fail our transcript mapping pipeline.  The transcript xrefs are actually stored at the Probe or ProbeSet level, not the feature level.
> 
> Thanks
> 
> Nath
> 
> 
> 
> On 26 May 2011, at 03:37, Alexander Pico wrote:
> 
>> I found a workaround that dramatically improves performance that others
>> might find useful. It involves commenting out lines that add unnecessary
>> (and slow) table query parameters for Probe and ProbeSet queries based on
>> transcript.  I had to skip querying ProbeFeature altogether due to the
>> massive table size in human and mouse. With these minor edits, I can
>> retrieve all probe and probeset annotations (though not feature details) for
>> every human gene in a few hours, rather than weeks!
>> 
>> Comment out the following lines:
>> ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/DBEntryAdaptor.pm
>> 359,360c359,360
>> <       $from_sql  = 'probe p, ';
>> <       $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>> ---
>>> #     $from_sql  = 'probe p, ';
>>> #     $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>> 363,364c363,364
>> <       $from_sql  = 'probe_set ps, ';
>> <       $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>> ---
>>> #     $from_sql  = 'probe_set ps, ';
>>> #     $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>> 
>> 
>> On 5/24/11 10:14 AM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>> 
>>> Looks like a known problem. The API code has the following comment notes:
>>> 
>>> Funcgen/DBSQL/ResultFeatureAdaptor.pm: line 1259
>>> #Not straight forward without creating tmp table
>>> 
>>> In version 60, the note in this area stated:
>>> #This join between sr and pf is causing the slow down.  Need to select
>>> right join for this.
>>> #just do two separate queries for now.
>>> 
>>> 
>>> Indeed, the tmp table triggered by the join is still causing a slow down.
>>> Let us know if you come up with any workarounds or solutions to this tmp
>>> table issue. Thanks!
>>> - Alex
>>> 
>>> 
>>> On 5/23/11 6:28 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I'm looking for a better way to get probe features. I'm currently using
>>>> 'fetch_all_by_linked_transcript_Gene()', but for species with all exon
>>>> arrays, this can take days...
>>>> 
>>>> Other than going in and deleting probesets from the funcgen databases (local
>>>> copies), how can I get around processing certain arrays, like the all exon
>>>> arrays, and just collect everything else?
>>>> 
>>>> 
>>>> Here's my current code snippet:
>>>> 
>>>> my $probe_adaptor = $registry->get_adaptor($species, "funcgen",
>>>> "ProbeFeature");
>>>> 
>>>> my $probe_features =
>>>> $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene);
>>>> 
>>>> foreach my $pf (@$probe_features) {
>>>>   // do stuff
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list    Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe):
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>> 
>> 
>> 
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> Nathan Johnson
> Senior Scientific Programmer
> Ensembl Regulation
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> 
> http://www.ensembl.info/
> http://twitter.com/#!/ensembl
> 
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

Nathan Johnson
Senior Scientific Programmer
Ensembl Regulation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD

http://www.ensembl.info/
http://twitter.com/#!/ensembl