[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found

Tue May 31 15:20:19 BST 2011

Hi Alexander

It sounds like we are making a little headway here.

You can get non-truncated output from the MySQL server by specifying:

	show full processlist;

If you don't need probe feature info, then we won't need the
array-restricted probe feature method, but FYI this already
exists, just in case you need it:

	ProbeFeatureAdaptor::fetch_all_by_Slice_Arrays

If you are now coming at this from a Probe/Set perspective, you will
probably want to use:

	Array->get_all_Probes/Sets

And then perform the DBEntry queries on each Probe/Set returned.  If  
you haven't already seen it, there are also some good examples of the  
Array/Probe/Set API available here:

	ensembl-functgenomics/scripts/examples/microarray_annotation_example.pl
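
As a rough, untested sketch of the Array -> Probe/Set -> DBEntry route described above: the sub name collect_probeset_xrefs is made up for illustration, the 'AFFY_ST' class value used to skip the exon arrays is an assumption about how those arrays are labelled, and get_all_Transcript_DBEntries is assumed to be available on ProbeSet objects in your API version.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a list of Array objects, skip any whose class
# matches $skip_re (assumed to identify the Affy ST/exon arrays), and collect
# one [array name, probe set name, xref primary ID] row per transcript xref.
sub collect_probeset_xrefs {
    my ($arrays, $skip_re) = @_;
    my @rows;

    for my $array (@$arrays) {
        my $class = $array->class;
        next if defined($class) && $class =~ $skip_re;

        # Arrays without probe sets return an empty list here; for those,
        # get_all_Probes would be the equivalent starting point.
        for my $pset (@{ $array->get_all_ProbeSets }) {
            for my $dbe (@{ $pset->get_all_Transcript_DBEntries }) {
                push @rows, [ $array->name, $pset->name, $dbe->primary_id ];
            }
        }
    }
    return \@rows;
}
```

The array list would typically come from the funcgen ArrayAdaptor (for example its fetch_all method), called as collect_probeset_xrefs($arrays, qr/^AFFY_ST$/). Because the filter is applied per array up front, no ProbeFeature query is involved at all.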

Am currently off tending to sick child, but will take a closer look at  
the underlying methods when I'm back in.

Nath

On 27 May 2011, at 23:34, Alexander Pico wrote:

> Thanks Nath!
>
>> I'm a bit confused as to where the slowdown is coming from exactly, so I
>> will run the script on our local DBs to see where the bottleneck is.
>> It may be that we can turn it on its head and start with an
>> array-restricted probe feature query.  This will allow us to filter out
>> the Affy ST arrays up front, with the caveat that there will be probe
>> features with no xrefs.
>
> An array-restricted probe feature query would be nice.
>
>> Can you also send me the full SQL which is causing the tmp table to be
>> created?
>
> Here is what is listed in 'show processlist', but it's truncated:
> | Copying to tmp table | SELECT  pf.probe_feature_id, pf.seq_region_id, pf.seq_region_start, pf.seq_region_end, pf.seq_region |
>
> I think it's triggered by this query from Funcgen DBEntryAdaptor.pm
> (line 399):
> SELECT oxr.ensembl_id
>   FROM probe_feature pf, external_db xdb, xref x, object_xref oxr,
>        external_synonym syn
>  WHERE pf.probe_feature_id = oxr.ensembl_id
>    AND xdb.db_name LIKE 'homo_sapiens_core_Transcript%'
>    AND xdb.external_db_id = x.external_db_id
>    AND syn.synonym = ?
>    AND x.xref_id = oxr.xref_id
>    AND oxr.ensembl_object_type = ?
>    AND syn.xref_id = oxr.xref_id
>
>> One thing I'd also like to point out is that you are fetching
>> ProbeFeature xref data.  For Affy arrays the associated probe set may
>> actually fail our transcript mapping pipeline.  The transcript xrefs are
>> actually stored at the Probe or ProbeSet level, not the feature level.
>
> Well, Affy probesets were coming through just fine via ProbeFeatures.  But
> since the feature route was inefficient for just getting basic probe info
> (I don't need feature info), I'm now querying Probes and ProbeSets instead
> of ProbeFeatures.  This makes things a lot faster (as long as I comment out
> the joins with the probe and probe_set tables).
>
> - Alex
>