[ensembl-dev] Bulk download of Microarray Probe mapping via MySQL

Thu Jul 11 12:38:22 BST 2013

Hi Alex

I think I need to understand exactly what you want before we can cater the appropriate solution for your needs.

Do you want:

a) All the alignments of each probe for a given array.

or

b) Just the results from our probe/set -> transcript annotation pipeline (i.e. the xrefs)

or 

c) both of the above?

The problems you are seeing with the affy ST arrays is due to the very large amount of probes they contain, hence you are right in thinking the APi might not be the right choice here. Mart may be useful to you, but the interface is not generally designed for downloading whole data sets, and you may encounter problems there.

Thanks

Nathan Johnson
Ensembl Regulation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD

http://www.ensembl.info/
http://twitter.com/#!/ensembl

On 9 Jul 2013, at 21:07, Alex Holman <aholman at jimmy.harvard.edu> wrote:

> Hello,
> 
> I would like to bulk download the microarray probe annotations described in BMC Genomics 2010, "Consistent annotation of gene expression arrays".  I initially explored using the API interface to access the data, but found that this method would likely be too slow to be reasonably useful. Below is the main section of my test Perl script to pull only the human arrays, and to only retrieve the probes, not yet worrying about what they mapped to.
> 
> API version 72
> 
> my $array_adaptor = $registry->get_adaptor('homo_sapiens','funcgen','array');
> my $array_adaptor = $registry->get_adaptor('homo_sapiens','funcgen','array');
> my $probe_adaptor = $registry->get_adaptor('homo_sapiens','funcgen','probe');
> foreach my $array ( @array ){
> 	print "\nArray:\t".$array->name ."\t" . "Vendor:\t".$array->vendor ."\t" . "ProbeCount: ".$array->probe_count(). "\t" . "time: ". time . "\n";
> 
> 	print "Getting all probes...\n"; 
> 	my @probes = @{$probe_adaptor->fetch_all_by_Array($array)};
> 	print "Gotten!\t". time . "\n";
> 	print scalar(@probes) . "\n";
> }
> 
> When I run the script, it successfully retrieves the probes for the first 2 arrays (HumanWG_6_V2 and HumanWG_6_V3), each with ~48,000 probes in approximately 12.5 minutes.  However, the next array HuEx-1_0-st-v2 contains 5,431,924 probes, and I became frustrated and canceled it after several hours.  
> 
> Output:
> Array:	HumanWG_6_V2	Vendor:	ILLUMINA	ProbeCount: 48701	time: 1373385951
> Getting all probes...
> Gotten!	1373386705
> 48701
> 
> Array:	HumanWG_6_V3	Vendor:	ILLUMINA	ProbeCount: 48802	time: 1373386708
> Getting all probes...
> Gotten!	1373387464
> 48802
> 
> Array:	HuEx-1_0-st-v2	Vendor:	AFFY	ProbeCount: 5431924	time: 1373387502
> Getting all probes...
> ^C
> 
> Based on the above, I think that the best way to accomplish a bulk download would be to do a direct MySQL call against the database, probably a private instance of the database spun up on an Amazon machine image.  I am confidant in spinning up an Amazon image, based on the instructions here: http://useast.ensembl.org/info/data/amazon_aws.html
> 
> However, I need some help parsing the where the probe annotations are stored within the MySQL database.  I have identified the array, array_chip, probe, probe_set tables, and joined on the probe_feature, analysis and seq_region tables.  However, I'm not clear on exactly how the next jump to genes should be made.  Additionally, within the original paper, there was logic on mapping Affy probesets to genes based on the individual probes. How would this be encoded within the database?  My MySQL code so far is below.  
> 
> USE homo_sapiens_funcgen_72_37;
> 
> SELECT 
> array.name,
> #array.*, array_chip.*,
> IF ( !ISNULL(probe_set.name), probe_set.name, probe.name ) AS probe_or_probeSet_name,
> probe_feature.*
> 
> FROM array
> JOIN array_chip ON array_chip.array_id = array.array_id
> JOIN probe ON probe.array_chip_id = array_chip.array_chip_id
> LEFT JOIN probe_set ON probe.probe_set_id = probe_set.probe_set_id
> JOIN probe_feature ON probe_feature.probe_id = probe.probe_id
> JOIN analysis ON probe_feature.analysis_id = analysis.analysis_id
> JOIN analysis_description ON analysis_description.analysis_id = analysis.analysis_id
> JOIN seq_region ON seq_region.seq_region_id = probe_feature.seq_region_id
> 
> WHERE analysis.module = "ProbeAlign"
> AND array.array_id=28  LIMIT 30;
> 
> Alternatively, I saw reference that the probe mapping information is available through biomart and that the biomart SQL tables might contain the data in a de-normalized format.  Would I be better off accessing martdb.ensembl.org, and if so, which tables should I be looking in?
> 
> Thank you,
> Alex Holman
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/