[ensembl-dev] xref mapping

Genomeo Dev genomeodev at gmail.com
Thu Feb 27 16:15:55 GMT 2014


OK thanks.

Another question please:

Going back to my perl subroutine for getting xref for ensembl IDs:

use warnings;

sub print_DBEntries
{
    my $db_entries = shift;

    foreach my $dbe ( @{$db_entries} ) {
        printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
            $dbe->dbname(), $dbe->display_id(), $dbe->description(),
            $dbe->db_display_name(), $dbe->info_text(), $dbe->info_type(),
            $dbe->primary_id(), join( ", ", @{ $dbe->get_all_synonyms() } ),
            $dbe->version();
    }
}

I found that some values are not returned for some databases. For example
using ENSG00000151067:

#Use of uninitialized value in printf at ./fetch_ensembl_genes_v2.pl line 47.
OTTG OTTHUMG00000150243 Havana gene NONE OTTHUMG00000150243 7
#Use of uninitialized value in printf at ./fetch_ensembl_genes_v2.pl line 47.
ENS_LRG_gene LRG_334 LRG display in Ensembl NONE LRG_334 0
LRG LRG_334 Locus Reference Genomic record for CACNA1C Locus Reference Genomic DIRECT LRG_334 0
ArrayExpress ENSG00000151067 ArrayExpress DIRECT ENSG00000151067 0
EntrezGene CACNA1C calcium channel, voltage-dependent, L type, alpha 1C subunit EntrezGene DEPENDENT 775 CACH2, CACN2, CACNL1A1, CaV1.2, CCHL1A1, LQT8, TS 0
HGNC CACNA1C calcium channel, voltage-dependent, L type, alpha 1C subunit HGNC Symbol Generated via ensembl_manual DIRECT 1390 CACH2, CACN2, CACNL1A1, Cav1.2, CCHL1A1, LQT8, TS 0
MIM_GENE CALCIUM CHANNEL, VOLTAGE-DEPENDENT [*114205] CALCIUM CHANNEL, VOLTAGE-DEPENDENT, L TYPE, ALPHA-1C SUBUNIT; CACNA1C MIM gene DEPENDENT 114205 0
MIM_MORBID TIMOTHY SYNDROME [#601005] TIMOTHY SYNDROME; TS MIM disease DEPENDENT 601005 0
MIM_MORBID BRUGADA SYNDROME 3 [#611875] BRUGADA SYNDROME 3; BRGDA3 MIM disease DEPENDENT 611875 0
UniGene Hs.690010 Voltage-dependent L-type Ca2+ channel alpha 1 subunit (CACNA1C) mRNA, exon 1a and partial cds UniGene SEQUENCE_MATCH Hs.690010 0
UniGene Hs.697137 Transcribed locus, weakly similar to XP_416388.3 PREDICTED: voltage-dependent L-type calcium channel subunit alpha-1C [Gallus gallus] UniGene SEQUENCE_MATCH Hs.697137 0
Uniprot_gn CACNA1C UniProtKB Gene Name DEPENDENT CACNA1C CACH2, CACN2, CACNL1A1, CCHL1A1 0
WikiGene CACNA1C calcium channel, voltage-dependent, L type, alpha 1C subunit WikiGene DEPENDENT 775 0

I would expect a missing value to be returned as an empty string. Am I missing anything?
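
For what it is worth, the REST output quoted further down in this thread shows "description":null for the OTTG entry, so the value may simply be undefined rather than an empty string. Below is a minimal Python sketch of the workaround I have in mind for the REST side, assuming the xref records have already been decoded from JSON into dictionaries. The function name xref_to_row and the COLUMNS list are my own illustrative choices, not part of any Ensembl API; the column order just mirrors the printf call in print_DBEntries above.

```python
# Column order mirrors the printf call in print_DBEntries.
# COLUMNS and xref_to_row are illustrative names, not Ensembl API.
COLUMNS = [
    "dbname", "display_id", "description", "db_display_name",
    "info_text", "info_type", "primary_id", "synonyms", "version",
]


def xref_to_row(xref):
    """Format one decoded xref record as a tab-separated line.

    Missing keys and null (None) values become empty strings;
    a list of synonyms is joined with ", ".
    """
    fields = []
    for name in COLUMNS:
        value = xref.get(name)
        if value is None:
            value = ""
        elif isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        fields.append(str(value))
    return "\t".join(fields)


if __name__ == "__main__":
    # Record copied from the REST output quoted in this thread (OTTG entry).
    example = {
        "display_id": "OTTHUMG00000000961",
        "primary_id": "OTTHUMG00000000961",
        "version": "2",
        "description": None,
        "dbname": "OTTG",
        "synonyms": [],
        "info_type": "NONE",
        "info_text": "",
        "db_display_name": "Havana gene",
    }
    print(xref_to_row(example))
```

The same idea should carry over to the Perl subroutine by substituting an empty string for each undefined value before it reaches printf, which would also silence the warnings.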

On a related note, this code runs very slowly (it takes 6 minutes for 100 genes).
From an earlier post, it seems that connecting to the database is the
bottleneck. Connecting to useastdb.ensembl.org instead of
ensembldb.ensembl.org is not much better. So I was wondering whether there
is a way to turn off lazy loading for the purpose of debugging?
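
To check where the time actually goes, each stage could be timed separately. Here is a rough Python sketch of that idea. The connect and fetch_xrefs arguments are hypothetical placeholders for whatever performs the connection and the per-gene lookup; they are not Ensembl API calls. If lazy loading is the cause, I would expect the per-gene times rather than the connection time to dominate.

```python
import time


def time_stages(gene_ids, connect, fetch_xrefs):
    """Return (connect_seconds, per_gene_seconds) for one run.

    connect and fetch_xrefs are caller-supplied callables
    (hypothetical stand-ins, not part of any Ensembl API).
    """
    start = time.perf_counter()
    handle = connect()
    connect_seconds = time.perf_counter() - start

    per_gene_seconds = {}
    for gene_id in gene_ids:
        start = time.perf_counter()
        fetch_xrefs(handle, gene_id)
        per_gene_seconds[gene_id] = time.perf_counter() - start
    return connect_seconds, per_gene_seconds


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs without any database.
    connect_s, per_gene = time_stages(
        ["ENSG00000151067"],
        connect=lambda: "fake-handle",
        fetch_xrefs=lambda handle, gene_id: time.sleep(0.01),
    )
    print("connect: %.3fs" % connect_s)
    for gene_id, seconds in per_gene.items():
        print("%s\t%.3fs" % (gene_id, seconds))
```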




On 27 February 2014 14:43, mag <mr6 at ebi.ac.uk> wrote:

>  Hi Genomeo,
>
> Taking the last entry as example:
> Hs.743764 refers to
> http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=5947187&TAXID=9606&SEARCH=Hs.743764
> This is a human locus.
> The description means that this locus is similar to a gene in rat, as it
> has not been fully annotated in human.
>
> I agree the description can be misleading, but it is imported directly
> from NCBI as is, so there is not much we can do about it.
>
>
> Regards,
> Magali
>
>
> On 27/02/2014 14:37, Genomeo Dev wrote:
>
> Thanks very much Magali for pointing that out.
>
>  If I understand you correctly, db_type and species are therefore attributes
> of the query gene IDs, not of the returned cross-reference IDs. For my query
> IDs I see that I define them here:
>
>  my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor( "human", "core",
> "gene" );
>
>  I have tried to look up about 5000 human Ensembl IDs and found that for
> 256 of them I get cross mappings to other organisms. It only happens for
> UniGene. For example, for ENSG00000010244:
>
> display_id dbname ensembl_start xref_start display_id score db_display_name xref_end evalue info_text info_type ensembl_end primary_id ensembl_identity synonyms version cigar_line xref_identity dbname description
> ENSG00000010244 ensembl 1 1 Hs.500775 23313 UniGene 4672 SEQUENCE_MATCH 4672 Hs.500775 99 0 1631M1D1592M1I1448M 99 UniGene Zinc finger protein 207
> ENSG00000010244 ensembl 5667 1 Hs.612377 1200 UniGene 249 SEQUENCE_MATCH 5417 Hs.612377 1 0 249M 97 UniGene Transcribed locus
> ENSG00000010244 ensembl 12853 1 Hs.636112 2260 UniGene 452 SEQUENCE_MATCH 12400 Hs.636112 3 0 452M 100 UniGene Transcribed locus
> ENSG00000010244 ensembl 5213 1 Hs.658344 3230 UniGene 684 SEQUENCE_MATCH 4526 Hs.658344 4 0 615M1I8M1D3M1D34M1D13M1D5M1I4M 91 UniGene Transcribed locus
> ENSG00000010244 ensembl 3014 23 Hs.670238 1995 UniGene 427 SEQUENCE_MATCH 2607 Hs.670238 2 0 399M1D6M 94 UniGene Transcribed locus
> ENSG00000010244 ensembl 11505 2 Hs.694378 3063 UniGene 628 SEQUENCE_MATCH 12131 Hs.694378 4 0 627M 98 UniGene Transcribed locus
> ENSG00000010244 ensembl 1 1 Hs.716993 1971 UniGene 396 SEQUENCE_MATCH 396 Hs.716993 99 0 396M 99 UniGene Transcribed locus, strongly similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]
> ENSG00000010244 ensembl 50 1 Hs.743764 3472 UniGene 791 SEQUENCE_MATCH 841 Hs.743764 5 0 716M1D75M 91 UniGene Transcribed locus, moderately similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]
>
>  (obtained from Rest)
>
>
>
> On 27 February 2014 14:00, mag <mr6 at ebi.ac.uk> wrote:
>
>>  Hi Genomeo,
>>
>> To find which attributes are available, the Ensembl Doxygen documentation
>> usually covers everything you need.
>> Looking at
>> http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html
>> will tell you that you can obtain the following from a gene:
>>
>> $gene->source()
>> $gene->analysis->logic_name()
>> $gene->description()
>> $gene->external_name()
>> $gene->biotype()
>> $gene->seq_region_start()
>> $gene->seq_region_end()
>> $gene->seq_region_name()
>> $gene->seq_region_strand()
>> $gene->display_id()
>>
>> When using the API, you should always know what object_type you are
>> using, as it allows you to use the correct attributes.
>> In this example, if you are using a Bio::EnsEMBL::Gene, object_type is
>> 'gene'
>>
>> Likewise, you need to know species and db_type beforehand when
>> using the Perl API directly.
>> They are the ones which will allow you to connect to the correct database
>> based on the data you are looking for.
>>
>> Regarding cross references to other organisms, do you have any examples?
>> Generally, we should be only mapping to other resources for the same
>> organism.
>> For example, for pig, we will only assign cross references to Uniprot pig
>> proteins.
>>
>> The main exceptions I can think of are:
>> - HGNC names
>> Typically, if the coverage for a species is low (i.e., not all 20-odd
>> thousand proteins have been submitted to Uniprot or RefSeq), we will use
>> HGNC names to fill in the gaps.
>> Where no name can be found and there is a homolog in human, we use the
>> same name as in human.
>> - Ensembl translations
>> For some low coverage species, annotation was provided by projecting
>> human annotation via a whole genome alignment.
>> For these models, we add an external reference to the human translation
>> which was used to build the model.
>>
>>
>> Hope this helps,
>> Magali
>>
>>
>>  On 27/02/2014 13:41, Genomeo Dev wrote:
>>
>> Thanks very much for the useful answer.
>>
>>  I noticed that cross ref also maps to genes from organisms other than
>> that of the query gene ID. Any comment on that?
>>
>>  Related to the previous question, I use the following Rest python code
>> to do id lookup for particular Ensembl IDs:
>>
>> pref = "/lookup/id/"
>> ext = "?"
>>
>> for line in inputfile1:
>>     geneid = line.rstrip('\n')
>>
>>     resp, content = http.request(server + pref + geneid + ext,
>>                                  method="GET",
>>                                  headers={"Content-Type": "application/json"})
>>
>>     if not resp.status == 200:
>>         print "%s\t%s\t%s" % (geneid, "Invalid response:", resp.status)
>>         continue
>>         #sys.exit()
>>     print "%s\t%s" % (geneid, content)
>>
>>
>>  And I get this output:
>>
>>  ENSG00000223972 {"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","species":"homo_sapiens","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC
>> Symbol;Acc:37102]","display_name":"DDX11L1","biotype":"pseudogene","end":14412,"seq_region_name":"1","db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}
>>
>>  What would be the classes/attributes to use under the Perl API to get
>> that? i.e:
>>
>>  source
>> object_type
>> logic_name
>> species
>> description
>> display_name
>> biotype
>> end
>> seq_region_name
>>  db_type
>> strand
>> id
>> start
>>
>>  Thanks,
>>
>>  G.
>>
>>
>> On 27 February 2014 11:39, mag <mr6 at ebi.ac.uk> wrote:
>>
>>>  Hi Genomeo,
>>>
>>> The REST server only displays the current/latest release.
>>> The release version can be found with this endpoint:
>>> http://beta.rest.ensembl.org/documentation/info/software
>>>
>>> To get more details with the Ensembl API, you only need to update the
>>> print_DBEntries method to display all the attributes you are looking for.
>>> Compared to the output from REST, we have the following:
>>> - display_id is $dbe->display_id()
>>> - primary_id is $dbe->primary_id()
>>> - version is $dbe->version()
>>> - description is $dbe->description()
>>> - dbname is $dbe->dbname()
>>> - synonyms is $dbe->get_all_synonyms()
>>> - info_type is $dbe->info_type()
>>> - info_text is $dbe->info_text()
>>> - db_display_name is $dbe->db_display_name()
>>>
>>> You can choose the format that the REST server outputs.
>>> Details of all formats can be found in our user guide:
>>> http://beta.rest.ensembl.org/documentation/user_guide
>>> For tab-delimited output, content_type=text/x-gff3 is used, but it is
>>> only available for the /feature endpoint.
>>>
>>> There is no file in the Ensembl ftp dumps that contains all the external
>>> references produced.
>>>
>>>
>>> Regards,
>>> Magali
>>>
>>>
>>> On 27/02/2014 11:20, Genomeo Dev wrote:
>>>
>>>   Hi,
>>>
>>>  I am interested in getting a wide range of cross references for Ensembl
>>> gene IDs. I found two programmatic ways to do that, which give consistent
>>> results but different amounts of detail. Using ENSG00000223972 as an example:
>>>  (1)
>>> Using this rest API Endpoint python code (
>>> http://beta.rest.ensembl.org/documentation/info/xref_id)
>>>
>>>
>>>    import httplib2, sys
>>>
>>>    http = httplib2.Http(".cache")
>>>
>>>    server = "http://beta.rest.ensembl.org"
>>>    ext = "/xrefs/id/ENSG00000157764?"
>>>    resp, content = http.request(server+ext, method="GET", headers={
>>>        "Content-Type":"application/json"})
>>>
>>>    if not resp.status == 200:
>>>        print "Invalid response: ", resp.status
>>>        sys.exit()
>>>    import json
>>>
>>>    decoded = json.loads(content)
>>>    print repr(decoded)
>>>
>>>
>>>  I get:
>>>
>>>  {"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961","version":"2","description":null,"dbname":"OTTG","synonyms":[],"info_type":"NONE","info_text":"","db_display_name":"Havana
>>> gene"}
>>>
>>>  {"primary_id":"Hs.714157","dbname":"UniGene","ensembl_identity":98,"synonyms":[],"ensembl_start":6,"xref_start":1,"xref_end":1639,"db_display_name":"UniGene","display_id":"Hs.714157","ensembl_end":1657,"version":"0","score":8055,"cigar_line":"1200M1D299M12D140M","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 1","xref_identity":97,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>
>>>  {"primary_id":"Hs.618434","dbname":"UniGene","ensembl_identity":58,"synonyms":[],"ensembl_start":669,"xref_start":1,"xref_end":974,"db_display_name":"UniGene","display_id":"Hs.618434","ensembl_end":1655,"version":"0","score":4757,"cigar_line":"537M1D299M12D138M","description":"Similar
>>> to DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 isoform 1, mRNA (cDNA
>>> clone
>>> IMAGE:6103207)","xref_identity":96,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>
>>>  {"display_id":"DDX11L1","primary_id":"37102","version":"0","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 1","dbname":"HGNC","synonyms":[],"info_type":"DIRECT","info_text":"Generated
>>> via ensembl_manual","db_display_name":"HGNC Symbol"}
>>>
>>>  {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 5","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>
>>>  {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 1","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>
>>>
>>> {"display_id":"ENSG00000223972","primary_id":"ENSG00000223972","version":"0","description":"","dbname":"ArrayExpress","synonyms":[],"info_type":"DIRECT","info_text":"","db_display_name":"ArrayExpress"}
>>>
>>>  {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 5","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}
>>>
>>>  {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>> 1","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}]
>>>
>>>  (2)
>>>
>>>  Using this perl API code (based on
>>> http://www.ensembl.org/info/docs/api/core/core_tutorial.html):
>>>
>>>  # Define a helper subroutine to print DBEntries
>>> sub print_DBEntries
>>> {
>>>     my $db_entries = shift;
>>>
>>>     foreach my $dbe ( @{$db_entries} ) {
>>>         printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
>>>     }
>>> }
>>>
>>> my $genes = $gene_adaptor->fetch_all_by_stable_id_list([@gene_list]);
>>>
>>>
>>> ...
>>>
>>>
>>> print "GENE ", $gene->stable_id(), "\n";
>>> print_DBEntries( $gene->get_all_DBEntries() );
>>>
>>>  I get:
>>>  XREF OTTHUMG00000000961 (OTTG)
>>> XREF ENSG00000223972 (ArrayExpress)
>>> XREF DDX11L1 (EntrezGene)
>>> XREF DDX11L5 (EntrezGene)
>>> XREF DDX11L1 (HGNC)
>>> XREF Hs.618434 (UniGene)
>>> XREF Hs.714157 (UniGene)
>>>  XREF DDX11L1 (WikiGene)
>>> XREF DDX11L5 (WikiGene)
>>>
>>>
>>>  Questions:
>>>
>>>  1. Am I correct in saying that the REST code uses the latest Ensembl
>>> release while the API code uses the Ensembl release currently installed as
>>> part of the VM (I am using release 74)?
>>>
>>>  2. The REST code gives more extensive details (which I like) compared to
>>> the Perl API code. Could you suggest a simple way to use the API to get the
>>> same details?
>>>
>>>  3. Regarding the REST output format: is tab-separated text supported?
>>>
>>>  4. Is there a file in the Ensembl FTP area which contains pre-generated,
>>> detailed cross-reference mappings for all current Ensembl genes?
>>> --
>>>
>>>  Thanks,
>>>
>>>  G.
>>>
>>>
>>>  _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>>
>>
>>
>>  --
>> G.
>>
>>
>
>
>  --
> G.
>
>


-- 
G.