[ensembl-dev] Gene Synonyms through REST or Biomart

Thomas Danhorn danhornt at njhealth.org
Fri May 25 17:47:43 BST 2018


Hi Beat,

I do this regularly and have attached my Perl script -- feel free to adapt 
it for your purposes.  It uses the Ensembl Perl API, so you need to have 
the appropriate version installed. (I cloned the git repo and check out 
whichever version I need from there; if you run into issues, feel free to 
ask me.)  The script takes a list of Ensembl gene IDs (with a header line) 
and prints a table with the original IDs and the synonyms.  Be sure to 
specify the species with the -s option, unless you want the default, 'Mouse'.

One thing to note is that sometime between releases 84 and 90, the 
location of the synonyms in the databases switched from the "EntrezGene" 
DB to elsewhere, so if you use a newer release with the default parameters, 
you will likely find no synonyms -- use the option '-b all' to search 
through every DB (or specify one that you know has what you need).  This 
takes a while (days for an entire genome annotation), so if you have long 
lists, you may want to split them up and parallelize the process.
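
Splitting a long ID list could look roughly like the sketch below.  It is 
only an illustration: the sub name chunk_ids, the header text, the gene IDs 
and the ids_partNN.txt file names are all made up for the example.  Each 
chunk is shown with its own copy of the header, because the script treats 
the first input line as a header.

```perl
use strict;
use warnings;

# Split a list of gene IDs into chunks of at most $size IDs each.
# (chunk_ids is a hypothetical helper, not part of the attached script.)
sub chunk_ids {
    my ($size, @ids) = @_;
    die "Chunk size must be a positive integer\n" unless $size && $size > 0;
    my @chunks;
    push @chunks, [ splice(@ids, 0, $size) ] while @ids;
    return @chunks;
}

# Made-up input: a header line followed by invented mouse gene IDs.
my $header = 'GeneID';
my @ids    = map { sprintf('ENSMUSG%011d', $_) } 1 .. 5;

my @chunks = chunk_ids(2, @ids);
for my $i (0 .. $#chunks) {
    # Show what each chunk file would contain; the file names are
    # placeholders.  Every chunk repeats the header line.
    printf "== ids_part%02d.txt ==\n", $i + 1;
    print join("\n", $header, @{ $chunks[$i] }), "\n";
}
```

Each chunk, once written to its own file, can be passed to a separate run 
of the script, and the resulting tables concatenated afterwards (dropping 
the repeated header lines).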

Hope this helps,

Thomas


On Tue, 22 May 2018, Premanand Achuthan wrote:

> Hi Beat Wolf,
>
> The synonyms attribute is not always empty. If the external source has 
> synonyms, they should be available; for example, look at HGNC:
>
> {"display_id": "BRCA2","primary_id": "HGNC:1101","version": 
> "0","description": "BRCA2, DNA repair associated","dbname": 
> "HGNC","synonyms": 
> ["BRCC2","FACD","FAD","FAD1","FANCD","FANCD1","XRCC11"],"info_text": 
> "Generated via ensembl_manual","info_type": "DIRECT","db_display_name": "HGNC 
> Symbol"}
>
> I am afraid that at the moment you can do it only gene by gene via the REST 
> endpoint or via the core API.
>
> Thanks
> Prem
>
>
> On 22/05/2018 10:23, Wolf Beat wrote:
>> Thank you for the quick answer.
>> 
>> 
>> I did not know about this approach, but I have two issues with it:
>> 
>> 1) The synonyms attribute seems to be always empty, so something seems 
>> wrong.
>> 
>> 2) Sadly I need all synonyms for all genes for a specific species. 
>> Downloading them through that endpoint would be too slow. That's why I 
>> initially looked at BioMart, because all I need is a list of Ensembl gene 
>> IDs plus all synonyms.
>> 
>> 
>> Kind regards
>> 
>> 
>> Beat Wolf
>> 
>> ________________________________
>> From: Dev <dev-bounces at ensembl.org> on behalf of Premanand Achuthan 
>> <prem at ebi.ac.uk>
>> Sent: Tuesday, May 22, 2018 11:20:09 AM
>> To: dev at ensembl.org
>> Subject: Re: [ensembl-dev] Gene Synonyms through REST or Biomart
>> 
>> Hi Beat Wolf,
>> 
>> Please have a look at the /xrefs endpoint under "Cross References" and
>> look for "synonyms" in the attribute list.
>> 
>> http://rest.ensembl.org/xrefs/name/human/BRCA2?content-type=application/json
>> 
>> http://rest.ensembl.org/xrefs/id/ENSG00000157764?content-type=application/json
>> 
>> Hope it helps,
>> 
>> Best Regards
>> Prem
>> 
>> On 22/05/2018 10:09, Wolf Beat wrote:
>>> Hello,
>>> 
>>> 
>>> I'm looking for a complete list of all synonyms of a gene or a set of 
>>> genes. I can not find a way to get this information through biomart or the 
>>> REST interface. Is there a way to do it?
>>> 
>>> 
>>> Kind regards
>>> 
>>> 
>>> Beat Wolf
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: 
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>
>
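
For the gene-by-gene REST route described in the quoted thread, pulling 
the synonyms out of an /xrefs response might look like the sketch below.  
It assumes the endpoint returns a JSON array of records shaped like the 
HGNC example quoted above; the sub name synonyms_from_xrefs is invented for 
the illustration, and the sample record is a shortened copy of the quoted 
one rather than a live response.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Collect the unique synonyms from a decoded /xrefs response (assumed to
# be an array of xref records), keeping them in first-seen order.
# (synonyms_from_xrefs is a hypothetical helper name.)
sub synonyms_from_xrefs {
    my ($xrefs) = @_;
    my (%seen, @synonyms);
    for my $xref (@$xrefs) {
        # Records without a "synonyms" key are treated as having none.
        for my $syn (@{ $xref->{synonyms} || [] }) {
            push @synonyms, $syn unless $seen{$syn}++;
        }
    }
    return @synonyms;
}

# Shortened copy of the HGNC record from the thread, wrapped in an array.
my $json = '[{"display_id":"BRCA2","primary_id":"HGNC:1101","dbname":"HGNC",'
         . '"synonyms":["BRCC2","FACD","FAD","FAD1","FANCD","FANCD1","XRCC11"]}]';

print join(', ', synonyms_from_xrefs(decode_json($json))), "\n";
```

The same sub works on the concatenated records of several sources, since 
duplicates across records are dropped.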

-------------- next part --------------
#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;
use Bio::EnsEMBL::Registry;

my $host = 'ensembldb.ensembl.org';
my $debug = 0;
my $species = 'Mouse';
my $dbname = 'EntrezGene';			# this used to have the most synonyms until v87
my $sep = ', ';
my $help = 0;
my $infile = '-';	# use STDIN by default
my $outfile = '-';	# use STDOUT by default

my $usage = "Usage: $0 [<options>] [inputfile [outputfile]]
Gzipped input and output files can be used as long as they have a .gz extension.
\nOptions:
\t-s: Species matching the input Ensembl IDs (default: Mouse)
\t-h: Specify server to query (default: ensembldb.ensembl.org)
\t-d: Debug database connection with verbose option on
\t-b: Databases to search for synonyms ('all' for all; default: EntrezGene)
\t-?: Print this message and exit\n
If no input or output file is specified, STDIN or STDOUT is used.
";

GetOptions(
	"s|species=s"	=> \$species,
	"h|host=s"	=> \$host,
	"d|debug"	=> \$debug,
	"b|dbname=s"	=> \$dbname,
	"help|?"	=> \$help
);

if ($help) {
	print "Prints gene name synonyms for given Ensembl genes.\n\n$usage";
	exit;
}

$host .= '.ensembl.org' unless $host =~ /\./;
$dbname = undef if $dbname eq 'all';

$infile = $ARGV[0] if @ARGV;
$outfile = $ARGV[1] if @ARGV >= 2;
die("Too many arguments.\n\n$usage") if @ARGV > 2;

my $infilestr = ($infile =~ /\.gz$/i) ? "zcat '$infile' |"
	: "<$infile";
my $outfilestr = ($outfile =~ /\.gz$/i) ? "| gzip >'$outfile'"
	: ">$outfile";
open(my $if, $infilestr) or die "Couldn't open input file $infile\n$!\n";
open(my $of, $outfilestr) or die "Couldn't open output file $outfile\n$!\n";

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
	-host => $host,
	-user => 'anonymous',
	-VERBOSE => $debug,
);
## May or may not help with dropped connections during long queries:
$registry->set_disconnect_when_inactive(1);

my $gene_adaptor
	= Bio::EnsEMBL::Registry->get_adaptor($species, 'Core', 'Gene' );
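my $header = <$if>;
die "Input file $infile is empty.\n" unless defined $header;
chomp($header);
print $of "$header\tSynonyms\n";
while (my $geneid = <$if>) {
	chomp($geneid);
	next unless length $geneid;	# skip blank lines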
	my $gene = $gene_adaptor->fetch_by_stable_id($geneid);
	unless ($gene) {
		warn("Gene $geneid not found in database - ignoring.\n");
		next;
	}
	my @allsyn;
	my $dber = $gene->get_all_DBEntries($dbname);
	for my $db (@$dber) {
		my $synr = $db->get_all_synonyms();
		push(@allsyn, @$synr);
	}
	## Find unique synonyms (while maintaining order):
	my %seen;
	my @unique = grep { !$seen{$_}++ } @allsyn;
	print $of $geneid, "\t", join($sep, @unique), "\n";
}
close($if);
close($of) or die("Error: could not close output file '$outfile'.");

