[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found

Thu May 26 17:04:19 BST 2011

Hi Alexander

Many apologies for the tardy reply; I've been having trouble with my dev list mail filter.

This is an issue I have been pondering for a while, but I have not made much progress because some of the solutions involve a major overhaul of the underlying code.  It does look like I may be able to update the code underlying fetch_all_by_linked_transcript_Gene, but that would target the inter-DB and DBEntry queries, not the apparently problematic probe feature query.

wrt commenting out the 'from' and 'where' statements in the DBEntryAdaptor: this appears to be a valid fix.  However, it does assume that there are no old object_xref records, i.e. records that no longer have a valid probe feature link.  The funcgen release process clears away all old data and also performs an orphaned object_xref check, so this assumption is fairly safe.
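
For illustration, the orphaned object_xref check mentioned above amounts to something like the following.  This is only a sketch on in-memory data, not the actual release code; the hash layout, the example IDs and the find_orphaned_xrefs name are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for rows in the probe and object_xref tables.
my %probe_ids    = map { $_ => 1 } (101, 102, 103);
my @object_xrefs = (
  { object_xref_id => 1, ensembl_object_type => 'Probe', ensembl_id => 101 },
  { object_xref_id => 2, ensembl_object_type => 'Probe', ensembl_id => 999 },  # no such probe
  { object_xref_id => 3, ensembl_object_type => 'Probe', ensembl_id => 103 },
);

# Return the object_xref rows whose ensembl_id has no matching probe.
sub find_orphaned_xrefs {
  my ($xrefs, $valid_ids) = @_;
  return [ grep { !exists $valid_ids->{ $_->{ensembl_id} } } @$xrefs ];
}

my $orphans = find_orphaned_xrefs(\@object_xrefs, \%probe_ids);
print "Orphaned object_xref_id: $_->{object_xref_id}\n" for @$orphans;
```

If a check like this comes back empty, dropping the join to the probe or probe_set table does not change which xrefs are returned.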

wrt the ResultFeatureAdaptor comments: these are not related to the code snippet you specified.  In fact, support for probe_feature based result features is being removed for v63 as it is no longer used.

I'm a bit confused as to where exactly the slowdown is coming from, so I will run the script on our local DBs to find the bottleneck.  It may be that we can turn it on its head and start with an array-restricted probe feature query.  This would allow us to filter out the Affy ST arrays up front, with the caveat that there will be probe features with no xrefs.
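
As a rough sketch of the "filter up front" idea: select the arrays of interest first, then query only those.  The records below are plain hashes rather than API objects, the class values are illustrative, and filter_arrays is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical array records; names and class values are illustrative only.
my @arrays = (
  { name => 'HG-U133A',       class => 'AFFY_UTR'    },
  { name => 'HuEx-1_0-st-v2', class => 'AFFY_ST'     },
  { name => 'HumanWG_6_V3',   class => 'ILLUMINA_WG' },
);

# Keep every array except those belonging to one of the excluded classes.
sub filter_arrays {
  my ($arrays, @excluded_classes) = @_;
  my %skip = map { $_ => 1 } @excluded_classes;
  return [ grep { !$skip{ $_->{class} } } @$arrays ];
}

my $wanted = filter_arrays(\@arrays, 'AFFY_ST');
print join(', ', map { $_->{name} } @$wanted), "\n";
```

Against a live funcgen DB the same grep would run over the real array list before any probe feature query is issued; I have left that part out as it depends on the DB in question.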

Can you also send me the full SQL which is causing the tmp table to be created?

One thing I would also like to point out is that you are fetching ProbeFeature xref data; for Affy arrays, the associated probe set may actually fail our transcript mapping pipeline.  The transcript xrefs are actually stored at the Probe or ProbeSet level, not the feature level.

Thanks

Nath

On 26 May 2011, at 03:37, Alexander Pico wrote:

> I found a workaround that dramatically improves performance that others
> might find useful. It involves commenting out lines that add unnecessary
> (and slow) table query parameters for Probe and ProbeSet queries based on
> transcript.  I had to skip querying ProbeFeature altogether due to the
> massive table size in human and mouse. With these minor edits, I can
> retrieve all probe and probeset annotations (though not feature details) for
> every human gene in a few hours, rather than weeks!
> 
> Comment out the following lines:
> ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/DBEntryAdaptor.pm
> 359,360c359,360
> <       $from_sql  = 'probe p, ';
> <       $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
> ---
>> #     $from_sql  = 'probe p, ';
>> #     $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
> 363,364c363,364
> <       $from_sql  = 'probe_set ps, ';
> <       $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
> ---
>> #     $from_sql  = 'probe_set ps, ';
>> #     $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
> 
> 
> On 5/24/11 10:14 AM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
> 
>> Looks like a known problem. The API code has the following comment notes:
>> 
>> Funcgen/DBSQL/ResultFeatureAdaptor.pm: line 1259
>> #Not straight forward without creating tmp table
>> 
>> In version 60, the note in this area stated:
>> #This join between sr and pf is causing the slow down.  Need to select
>> right join for this.
>> #just do two separate queries for now.
>> 
>> 
>> Indeed, the tmp table triggered by the join is still causing a slow down.
>> Let us know if you come up with any workarounds or solutions to this tmp
>> table issue. Thanks!
>> - Alex
>> 
>> 
>> On 5/23/11 6:28 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>> 
>>> Hi,
>>> 
>>> I'm looking for a better way to get probe features. I'm currently using
>>> 'fetch_all_by_linked_transcript_Gene()', but for species with all exon
>>> arrays, this can take days...
>>> 
>>> Other than going in and deleting probesets from the funcgen databases (local
>>> copies), how can I get around processing certain arrays, like the all exon
>>> arrays, and just collect everything else?
>>> 
>>> 
>>> Here's my current code snippet:
>>> 
>>> my $probe_adaptor = $registry->get_adaptor($species, "funcgen",
>>> "ProbeFeature");
>>> 
>>> my $probe_features =
>>> $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene);
>>> 
>>> foreach my $pf (@$probe_features) {
>>>    # do stuff
>>> }
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/

Nathan Johnson
Senior Scientific Programmer
Ensembl Regulation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD

http://www.ensembl.info/
http://twitter.com/#!/ensembl