[ensembl-dev] Best way to collect probesets (excluding all exon arrays)? -> workaround found
Alexander Pico
apico at gladstone.ucsf.edu
Fri Jun 17 03:53:26 BST 2011
And here is my original strategy, which uses ProbeFeature and leads to the
tmp table issue:
my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
my $probe_feature_adaptor = $registry->get_adaptor($species, "funcgen",
"ProbeFeature");
my $genes = $gene_adaptor->fetch_all();
while (my $gene = pop(@$genes)) {
my @probe_features =
@{$probe_feature_adaptor->fetch_all_by_linked_transcript_Gene($gene)};
foreach my $pf (@probe_features){
my $probe = $pf->probe();
my $array_list = $probe->get_all_Arrays();
foreach my $array (@$array_list){
...collect probe (or probeset) name
...collect array name
}
}
}
- Alex
On 6/16/11 6:34 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
> Hi Nath,
>
> Thanks for the follow-up. In turns out that my strategy of commenting out
> those 'from' and 'where' lines has led to incomplete extraction. I'm only
> seeing ~1/4 of the probesets per Affy chip as I loop through "all probes"
> (see code snippets below)
>
> So, I'm back to the drawing board for how to extract the names of all the
> probes and probesets per array for each gene. Thought I don't need any
> feature details, I can't seem to get around ProbeFeature and the slow tmp
> table creation.
>
> I'm having trouble following your suggestions below though the performance
> sounds amazing. Are you suggesting I add LOAD INDEX INTO CACHE into my perl
> program or adapt it in the API somewhere?
>
> I also want to make sure I've communicated to overall goal clearly because I
> think this should be dead simple.
>
> I want to generate the following table:
>
> | gene | probe id | chip |
> ----------------------------
> | ENSMUSG00000029484 | Msa.4366.0_s_at | Mu11ksubB |
>
> That's it. No other details are needed. I want to fill the table with all
> probes (or probesets) for each gene in a given genome.
>
> Here is my current strategy:
> my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
> my $probe_adaptor = $registry->get_adaptor($species, "funcgen", "Probe");
> my $probe_set_adaptor = $registry->get_adaptor($species, "funcgen",
> "Probeset");
>
> my $genes = $gene_adaptor->fetch_all();
> while (my $gene = pop(@$genes)) {
> my @probes =
> @{$probe_adaptor->fetch_all_by_linked_transcript_Gene($gene)};
> my @probe_sets =
> @{$probe_set_adaptor->fetch_all_by_linked_transcript_Gene($gene)};
> my @all_probes = (@probes, @probe_sets);
>
> foreach my $probe (@all_probes) {
> my $array_list = $probe->get_all_Arrays();
> foreach my $array (@$array_list){
> ...collect probe (or probeset) name
> ...collect array name
> }
> }
> }
>
> Thanks!
> - Alex
>
>
>
> On 6/9/11 9:22 AM, "Nathan Johnson" <njohnson at ebi.ac.uk> wrote:
>
>> Hi Alexander
>>
>> I just wanting to come back to you on this as I've been playing around with a
>> few different ideas.
>>
>> Firstly, your suggestion of removing the join to the ensembl_object tables
>> does work, but it doesn't seem to have anywhere near the impact I was
>> expecting. It does roughly half the time it takes to run the DBEntry query,
>> 30ms -> 15 ms on my machine. Not bad, but the majority of the time is
>> actually
>> taken up by querying for the ProbeFeatures themselves (I am still testing on
>> your original code), which clock in at ~600ms/query. I have implemented a
>> few
>> bits of streamlining in the API which should remove some redundancy issues.
>> These will appear under the hood in v63, out later this month.
>>
>> However, by far the biggest performance jump I saw was by pre loading the
>> indexes you want to use. We have the key_buffer_size set to 4GB which will
>> easily take all of the following:
>>
>> my $sql = 'LOAD INDEX INTO CACHE object_xref, xref, external_db,
>> external_synonym, probe_feature, probe, homo_sapiens_core_62_37g.transcript,
>> homo_sapiens_core_62_37g.transcript_stable_id';
>>
>> $efgdb->dbc->do($sql);
>>
>> Assuming you are now only using probe xrefs, you can remove the probe_feature
>> table from this. You might want to add the core gene and gene_stable_id table
>> to this, or any other tables you are using in you dump script, so long as
>> they
>> fit in your key_buffer_size. Use 'show table status' to get the index sizes.
>>
>> For me this index load takes roughly 40 seconds, but reduces the ProbeFeature
>> query time to ~80ms, and also improves all the other querys used in the
>> method. The result for the whole of chromosome 5 for ~300000 ProbeFeatures
>> returned, I get a drop from ~1500s to just ~70s. That's 95% faster!
>>
>> Is that fast enough? ;)
>>
>> Nath
>>
>>
>>
>>
>>
>>
>> On 26 May 2011, at 17:04, Nathan Johnson wrote:
>>
>>> Hi Alexander
>>>
>>> Much apologies for the tardy reply, I've been having trouble with my dev
>>> list
>>> mail filter.
>>>
>>> This is an issue I have been pondering for a while, but have not made much
>>> progress with as some of the solutions involve a major overhaul of some of
>>> the underlying code. However, it looks like I may be able to update the
>>> code
>>> underlying fetch_all_by_linked_transcript_Gene. But this is targeting the
>>> inter DB and DBEntry queries, not the apparently problematic probe feature
>>> query.
>>>
>>> wrt to commenting out the 'from' and 'where' statements in the
>>> DBEntryAdaptor. This appears to be a valid fix. However it does make an
>>> assumption that there aren't any old object_xref table i.e. that no longer
>>> have a probe feature valid link. The funcgen release process clears away all
>>> old data, and also performs an orphaned object_xref check, so this
>>> assumption
>>> is fairly safe.
>>>
>>> wrt the ResultFeatureAdaptor comments, this is not related to the code
>>> snippet you specified. And in fact the support for probe_feature based
>>> result
>>> features is being removed for v63 as it is not used anymore.
>>>
>>> I'm a but confused as to where the slow down is coming from exactly so I
>>> will
>>> run the script on our local DBs to see where the bottle neck is coming from.
>>> It maybe that we can turn it on it's head and start with an array restricted
>>> probe feature query. This will allow us to filter out the affy ST arrays up
>>> front, with the caveat that there will be probe features with no xrefs.
>>>
>>> Can you also send me the full sql which is causing the tmp table to be
>>> created?
>>>
>>> One thing I also like to point out is that you are fetching ProbeFeature
>>> xref
>>> data, for Affy arrays the associated probe set may actually fail our
>>> transcript mapping pipeline. The transcript xrefs are actually stored at
>>> the
>>> Probe or ProbeSet level, not the feature level.
>>>
>>> Thanks
>>>
>>> Nath
>>>
>>>
>>>
>>> On 26 May 2011, at 03:37, Alexander Pico wrote:
>>>
>>>> I found a workaround that dramatically improves performance that others
>>>> might find useful. It involves commenting out lines that add unnecessary
>>>> (and slow) table query parameters for Probe and ProbeSet queries based on
>>>> transcript. I had to skip querying ProbeFeature altogether due to the
>>>> massive table size in human and mouse. With these minor edits, I can
>>>> retrieve all probe and probeset annotations (though not feature details)
>>>> for
>>>> every human gene in a few hours, rather than weeks!
>>>>
>>>> Comment out the following lines:
>>>> ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/DBSQL/DBEntryAdaptor.pm
>>>> 359,360c359,360
>>>> < $from_sql = 'probe p, ';
>>>> < $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>>> ---
>>>>> # $from_sql = 'probe p, ';
>>>>> # $where_sql = qq( p.probe_id = oxr.ensembl_id AND );
>>>> 363,364c363,364
>>>> < $from_sql = 'probe_set ps, ';
>>>> < $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>>> ---
>>>>> # $from_sql = 'probe_set ps, ';
>>>>> # $where_sql = qq( ps.probe_set_id = oxr.ensembl_id AND );
>>>>
>>>>
>>>> On 5/24/11 10:14 AM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>>>
>>>>> Looks like a known problem. The API code has the following comment notes:
>>>>>
>>>>> Funcgen/DBSQL/ResultFeatureAdaptor.pm: line 1259
>>>>> #Not straight forward without creating tmp table
>>>>>
>>>>> In version 60, the note in this area stated:
>>>>> #This join between sr and pf is causing the slow down. Need to select
>>>>> right join for this.
>>>>> #just do two separate queries for now.
>>>>>
>>>>>
>>>>> Indeed, the tmp table triggered by the join is still causing a slow down.
>>>>> Let us know if you come up with any workarounds or solutions to this tmp
>>>>> table issue. Thanks!
>>>>> - Alex
>>>>>
>>>>>
>>>>> On 5/23/11 6:28 PM, "Alexander Pico" <apico at gladstone.ucsf.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm looking for a better way to get probe features. I'm currently using
>>>>>> 'fetch_all_by_linked_transcript_Gene()', but for species with all exon
>>>>>> arrays, this can take days...
>>>>>>
>>>>>> Other than going in and deleting probesets from the funcgen databases
>>>>>> (local
>>>>>> copies), how can I get around processing certain arrays, like the all
>>>>>> exon
>>>>>> arrays, and just collect everything else?
>>>>>>
>>>>>>
>>>>>> Here's my current code snippet:
>>>>>>
>>>>>> my $probe_adaptor = $registry->get_adaptor($species, "funcgen",
>>>>>> "ProbeFeature");
>>>>>>
>>>>>> my $probe_features =
>>>>>> $probe_adaptor->fetch_all_by_linked_transcript_Gene($gene);
>>>>>>
>>>>>> foreach my $pf (@$probe_features) {
>>>>>> // do stuff
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list Dev at ensembl.org
>>>>>> List admin (including subscribe/unsubscribe):
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list Dev at ensembl.org
>>>>> List admin (including subscribe/unsubscribe):
>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list Dev at ensembl.org
>>>> List admin (including subscribe/unsubscribe):
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>> Nathan Johnson
>>> Senior Scientific Programmer
>>> Ensembl Regulation
>>> European Bioinformatics Institute
>>> Wellcome Trust Genome Campus
>>> Hinxton
>>> Cambridge CB10 1SD
>>>
>>> http://www.ensembl.info/
>>> http://twitter.com/#!/ensembl
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list Dev at ensembl.org
>>> List admin (including subscribe/unsubscribe):
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>> Nathan Johnson
>> Senior Scientific Programmer
>> Ensembl Regulation
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>>
>> http://www.ensembl.info/
>> http://twitter.com/#!/ensembl
>>
>>
>>
>>
>>
>>
>
>
>
> _______________________________________________
> Dev mailing list Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
More information about the Dev
mailing list