[ensembl-dev] Bulk download of Microarray Probe mapping via MySQL

Alex Holman aholman at jimmy.harvard.edu
Tue Jul 9 21:07:13 BST 2013


Hello,

I would like to bulk download the microarray probe annotations described in BMC Genomics 2010, "Consistent annotation of gene expression arrays".  I initially explored using the API interface to access the data, but found that this method would likely be too slow to be reasonably useful. Below is the main section of my test Perl script to pull only the human arrays, and to only retrieve the probes, not yet worrying about what they mapped to.

API version 72

use Bio::EnsEMBL::Registry;

# Standard registry setup (the host here is an assumption; my actual connection details are not shown)
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' );

my $array_adaptor = $registry->get_adaptor('homo_sapiens', 'funcgen', 'array');
my $probe_adaptor = $registry->get_adaptor('homo_sapiens', 'funcgen', 'probe');

# Loop over every human array and time the probe retrieval
foreach my $array ( @{ $array_adaptor->fetch_all() } ){
	print "\nArray:\t" . $array->name . "\tVendor:\t" . $array->vendor
	    . "\tProbeCount: " . $array->probe_count() . "\ttime: " . time . "\n";

	print "Getting all probes...\n";
	my @probes = @{ $probe_adaptor->fetch_all_by_Array($array) };
	print "Gotten!\t" . time . "\n";
	print scalar(@probes) . "\n";
}

When I run the script, it successfully retrieves the probes for the first two arrays (HumanWG_6_V2 and HumanWG_6_V3), each with ~48,000 probes, in approximately 12.5 minutes per array.  However, the next array, HuEx-1_0-st-v2, contains 5,431,924 probes, and I became frustrated and canceled the retrieval after several hours.

Output:
Array:	HumanWG_6_V2	Vendor:	ILLUMINA	ProbeCount: 48701	time: 1373385951
Getting all probes...
Gotten!	1373386705
48701

Array:	HumanWG_6_V3	Vendor:	ILLUMINA	ProbeCount: 48802	time: 1373386708
Getting all probes...
Gotten!	1373387464
48802

Array:	HuEx-1_0-st-v2	Vendor:	AFFY	ProbeCount: 5431924	time: 1373387502
Getting all probes...
^C

Based on the above, I think the best way to accomplish a bulk download would be to run direct MySQL queries against the database, probably a private instance spun up from an Amazon machine image.  I am confident about spinning up an Amazon instance, based on the instructions here: http://useast.ensembl.org/info/data/amazon_aws.html
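
Once I have a private instance, I assume I could dump whatever query I settle on straight to a file rather than paging results through a client; something like the following untested sketch (INTO OUTFILE needs the FILE privilege, so it would only work on my own instance, not the public servers):

# Untested sketch: write a query result as tab-separated text on the server host.
USE homo_sapiens_funcgen_72_37;
SELECT probe.name, probe_feature.seq_region_id,
       probe_feature.seq_region_start, probe_feature.seq_region_end
INTO OUTFILE '/tmp/probe_features.tsv'
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
FROM probe
JOIN probe_feature ON probe_feature.probe_id = probe.probe_id;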

However, I need some help working out where the probe annotations are stored within the MySQL database.  I have identified the array, array_chip, probe and probe_set tables, and joined them to the probe_feature, analysis and seq_region tables, but I'm not clear on exactly how the next jump to transcripts/genes should be made.  Additionally, the original paper describes logic for mapping Affy probe sets to genes based on their individual probes; how is that encoded within the database?  My MySQL code so far is below, followed by my untested guess at where the transcript mapping might live.

USE homo_sapiens_funcgen_72_37;

SELECT
  array.name,
  # array.*, array_chip.*,
  COALESCE(probe_set.name, probe.name) AS probe_or_probeSet_name,
  probe_feature.*
FROM array
JOIN array_chip           ON array_chip.array_id = array.array_id
JOIN probe                ON probe.array_chip_id = array_chip.array_chip_id
LEFT JOIN probe_set       ON probe_set.probe_set_id = probe.probe_set_id
JOIN probe_feature        ON probe_feature.probe_id = probe.probe_id
JOIN analysis             ON analysis.analysis_id = probe_feature.analysis_id
JOIN analysis_description ON analysis_description.analysis_id = analysis.analysis_id
JOIN seq_region           ON seq_region.seq_region_id = probe_feature.seq_region_id
WHERE analysis.module = 'ProbeAlign'
  AND array.array_id = 28
LIMIT 30;
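
For the transcript/gene step, my untested guess from browsing the schema is that the probe2transcript annotations are stored as xrefs within the funcgen database itself (object_xref / xref / external_db), keyed on probe_set_id for Affymetrix arrays and probe_id otherwise, with the per-probe counting logic from the paper summarised in linkage_annotation.  Is something along these lines correct?  The table and column choices below are assumptions on my part:

# Untested guess: probe-set-to-transcript mappings stored as funcgen xrefs.
SELECT
  ps.name               AS probe_set,
  x.dbprimary_acc       AS transcript_stable_id,
  ox.linkage_annotation AS mapping_description
FROM probe_set ps
JOIN object_xref ox ON ox.ensembl_id = ps.probe_set_id
                   AND ox.ensembl_object_type = 'ProbeSet'
JOIN xref x         ON x.xref_id = ox.xref_id
JOIN external_db ed ON ed.external_db_id = x.external_db_id
WHERE ed.db_name LIKE '%core_Transcript%'
LIMIT 30;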

Alternatively, I have seen references suggesting that the probe mapping information is available through BioMart, and that the BioMart SQL tables may hold the data in a denormalised format.  Would I be better off accessing martdb.ensembl.org, and if so, which tables should I be looking in?
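
If the mart route is the better one, my unverified guess is that the denormalised probe-to-gene tables live in the release-specific mart database and could be located by name pattern, along these lines:

# Unverified guess: the database name and table-name patterns below are assumptions.
USE ensembl_mart_72;
SHOW TABLES LIKE 'hsapiens_gene_ensembl%probe%';
SHOW TABLES LIKE 'hsapiens_gene_ensembl%efg%';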

Thank you,
Alex Holman
