[ensembl-dev] Bulk download of Microarray Probe mapping via MySQL

Wed Jul 17 10:20:03 BST 2013

Hi Alex

I have now checked in the following script and supporting module to the cvs HEAD:

	ensembl-functgenomics/scripts/export/dump_array_annotations.pl
	ensembl-functgenomics/modules/Bio/EnsEMBL/Funcgen/Utils/DBAdaptorHelper.pm

You should be able to drop these into your checkout of the code (unless you are using an archaic version of the API).

For HuGene-1_0-st-v1 the features dump took about 3 mins, and the xrefs dump took about 7 mins.

The -features dump is in BED format, although it does not use the score field conventionally: here the score is actually the number of mismatches (hence usescore=0 in the track line). This could be changed with a MySQL function to parse the cigar_line hash, but I thought that was overkill for now (and thought you might appreciate access to this now).
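
As a rough illustration of reading such a dump downstream (this is not part of the checked-in script), here is a minimal Python sketch. It assumes the file carries the six standard BED columns (chrom, start, end, name, score, strand), which is not confirmed above; the names ProbeFeature and parse_bed_line are made up for the example.

```python
from typing import NamedTuple, Optional


class ProbeFeature(NamedTuple):
    chrom: str
    start: int        # BED starts are 0-based
    end: int          # BED ends are exclusive
    name: str
    mismatches: int   # BED column 5 (the "score" slot) holds the mismatch count here
    strand: str


def parse_bed_line(line: str) -> Optional[ProbeFeature]:
    """Parse one line of the features dump; return None for non-data lines.

    Assumption: six tab-separated standard BED columns per data line.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Track and browser lines are display directives, not features.
    if stripped.split(None, 1)[0] in ("track", "browser"):
        return None
    fields = stripped.split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 BED columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    return ProbeFeature(chrom, int(start), int(end), name, int(score), strand)
```

The point of the named field is simply to stop the mismatch count being treated as a display score by whatever consumes the file.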

The -xrefs dump is a simple tab delimited file of:

	probe/set name
	array name(s)
	dbprimary_acc (i.e. the Ensembl stable ID)
	display_label (transcript display label e.g. HUGO etc)
	linkage_annotation (free text describing the quality of the annotation)
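
To show how the five columns above might be consumed, here is a small Python sketch. It assumes the file has no header line and exactly five tab-separated fields per row in the order listed; ProbeXref and read_xrefs are names invented for this example, and the sample values in the usage note are illustrative only.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class ProbeXref(NamedTuple):
    name: str                # probe or probe set name
    arrays: str              # array name(s)
    stable_id: str           # dbprimary_acc, i.e. the Ensembl stable ID
    display_label: str       # transcript display label
    linkage_annotation: str  # free text describing annotation quality


def read_xrefs(handle: TextIO) -> Iterator[ProbeXref]:
    """Yield one ProbeXref per row of a tab-delimited xrefs dump.

    Assumption: no header row, five columns in the order listed above.
    """
    # QUOTE_NONE keeps any quote characters in the free-text column literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row!r}")
        yield ProbeXref(*row)
```

For example, list(read_xrefs(io.StringIO("probe_a\tHuGene-1_0-st-v1\tENST00000000001\tLABEL-001\tfree text\n"))) gives a single record whose stable_id is "ENST00000000001" (a placeholder ID, not a real mapping).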

If the -merged option is specified, the dumps will be non-redundant, and the (array/probe) name fields will contain a comma-separated list where appropriate, with one entry for each array. This is necessary due to the differing names some probes have across AFFY arrays and the redundancy of probesets across array designs within the same class.
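
A sketch of how the merged name fields could be unpacked in Python follows. The exact pairing rule is an assumption on the reader's side, not something stated above: the example treats a single probe name as shared by every listed array, and otherwise pairs names and arrays by position. split_merged and pair_names are hypothetical helper names.

```python
from typing import List, Tuple


def split_merged(field: str) -> List[str]:
    """Split a comma-separated merged field, dropping empty entries."""
    return [part.strip() for part in field.split(",") if part.strip()]


def pair_names(probe_field: str, array_field: str) -> List[Tuple[str, str]]:
    """Return (probe name, array name) pairs for one merged record.

    Assumptions (not confirmed by the dump documentation):
      * one probe name with several arrays means the name is shared;
      * equal-length lists correspond position by position.
    """
    probes = split_merged(probe_field)
    arrays = split_merged(array_field)
    if len(probes) == 1:
        return [(probes[0], array) for array in arrays]
    if len(probes) == len(arrays):
        return list(zip(probes, arrays))
    raise ValueError(
        f"cannot pair {len(probes)} probe name(s) with {len(arrays)} array(s)"
    )
```

Raising on a length mismatch, rather than guessing, keeps any record that breaks the assumed layout visible instead of silently mis-assigning names.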

Please get in touch if you have any questions.

Nathan Johnson
Ensembl Regulation
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD

http://www.ensembl.info/
http://twitter.com/#!/ensembl

On 15 Jul 2013, at 10:31, njohnson <njohnson at ebi.ac.uk> wrote:

> Hi Alex
> 
> I'd like to preface my response by stating that (as I'm sure you are aware) the safest way to access the data is via the API, as this will protect you against any schema changes that might occur in the future. However, the probe and xref tables are probably the most stable in the whole schema, so for your purpose I think direct MySQL access is probably the right choice.
> 
> I do get these specific questions occasionally, so I will write a script which does the necessary direct SQL queries and check this into our scripts folder. I'll let you know asap. 
> 
> 
> Nathan Johnson
> Ensembl Regulation
> European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> 
> http://www.ensembl.info/
> http://twitter.com/#!/ensembl
> 
> On 11 Jul 2013, at 21:56, Alex Holman <aholman at jimmy.harvard.edu> wrote:
> 
>> Hi Nathan, 
>> Thanks for the reply.
>> I'd like to get both the alignment coordinates (which I *think* I have by joining on the probe_feature table) as well as the probe/set -> transcript annotations.  
>> 
>> I'm part of the team working on an updated version of the MeV (Multiple Experiment Viewer) software for microarray analysis (http://mev-tm4.sourceforge.net/).  We incorporate the ability to display probe gene annotations as well as basic probe IDs, and I'd like to include the Ensembl mappings as well as those supplied directly from Affy, Illumina, etc.  To do this I'm hoping to mirror a local copy of the Ensembl mappings for all chips to incorporate into the tool as an annotation set.
>> 
>> Thanks,
>> Alex
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
> 
> 