[ensembl-dev] xref mapping

Thu Feb 27 14:43:49 GMT 2014

Hi Genomeo,

Taking the last entry as an example:
Hs.743764 refers to 
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=5947187&TAXID=9606&SEARCH=Hs.743764
This is a human locus.
The description means that this locus is similar to a gene in rat, as it 
has not been fully annotated in human.

I agree the description can be misleading, but it is imported directly 
from NCBI as is, so there is not much we can do about it.
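
If the aim is just to spot these entries in a result set, the following is a minimal sketch rather than anything Ensembl provides. It assumes the cross-species hint always appears as a trailing "[Genus species]" tag, as in the two UniGene descriptions quoted below; the function names are made up for illustration.

```python
import re

# Assumption: the hint is a trailing "[Genus species]" tag, as in the
# UniGene descriptions quoted in this thread. Other sources may differ.
SPECIES_TAG = re.compile(r"\[([A-Z][a-z]+ [a-z]+)\]\s*$")


def species_hint(description):
    """Return the bracketed species name ending a description, or None."""
    if not description:
        return None
    match = SPECIES_TAG.search(description)
    return match.group(1) if match else None


def mentions_other_species(description, query_species="Homo sapiens"):
    """True if the description names a species other than the query species."""
    hint = species_hint(description)
    return hint is not None and hint != query_species


descriptions = [
    "Zinc finger protein 207",
    "Transcribed locus",
    "Transcribed locus, moderately similar to NP_001034109.1 "
    "Zfp207 gene product [Rattus norvegicus]",
]

# Keep only the descriptions that mention a non-human species.
flagged = [d for d in descriptions if mentions_other_species(d)]
print(flagged)
```

A flagged entry is still a human UniGene cluster; the tag only says which annotated gene product the locus resembles.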

Regards,
Magali

On 27/02/2014 14:37, Genomeo Dev wrote:
> Thanks very much Magali for pointing that out.
>
> If I understand you correctly, db_type and species are therefore
> attributes of the query gene IDs, not of the returned cross-reference IDs.
> For my query IDs, I see that I define them here:
>
> my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor( "human", 
> "core", "gene" );
>
> I have tried to look up about 5000 human Ensembl IDs and found that for
> 256 I get cross-mappings to other organisms. It only happens for
> UniGene. For example, for ENSG00000010244:
>
> display_id | dbname | ensembl_start | xref_start | display_id | score | db_display_name | xref_end | evalue | info_text | info_type | ensembl_end | primary_id | ensembl_identity | synonyms | version | cigar_line | xref_identity | dbname | description
> ENSG00000010244 | ensembl | 1 | 1 | Hs.500775 | 23313 | UniGene | 4672 |  |  | SEQUENCE_MATCH | 4672 | Hs.500775 | 99 |  | 0 | 1631M1D1592M1I1448M | 99 | UniGene | Zinc finger protein 207
> ENSG00000010244 | ensembl | 5667 | 1 | Hs.612377 | 1200 | UniGene | 249 |  |  | SEQUENCE_MATCH | 5417 | Hs.612377 | 1 |  | 0 | 249M | 97 | UniGene | Transcribed locus
> ENSG00000010244 | ensembl | 12853 | 1 | Hs.636112 | 2260 | UniGene | 452 |  |  | SEQUENCE_MATCH | 12400 | Hs.636112 | 3 |  | 0 | 452M | 100 | UniGene | Transcribed locus
> ENSG00000010244 | ensembl | 5213 | 1 | Hs.658344 | 3230 | UniGene | 684 |  |  | SEQUENCE_MATCH | 4526 | Hs.658344 | 4 |  | 0 | 615M1I8M1D3M1D34M1D13M1D5M1I4M | 91 | UniGene | Transcribed locus
> ENSG00000010244 | ensembl | 3014 | 23 | Hs.670238 | 1995 | UniGene | 427 |  |  | SEQUENCE_MATCH | 2607 | Hs.670238 | 2 |  | 0 | 399M1D6M | 94 | UniGene | Transcribed locus
> ENSG00000010244 | ensembl | 11505 | 2 | Hs.694378 | 3063 | UniGene | 628 |  |  | SEQUENCE_MATCH | 12131 | Hs.694378 | 4 |  | 0 | 627M | 98 | UniGene | Transcribed locus
> ENSG00000010244 | ensembl | 1 | 1 | Hs.716993 | 1971 | UniGene | 396 |  |  | SEQUENCE_MATCH | 396 | Hs.716993 | 99 |  | 0 | 396M | 99 | UniGene | Transcribed locus, strongly similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]
> ENSG00000010244 | ensembl | 50 | 1 | Hs.743764 | 3472 | UniGene | 791 |  |  | SEQUENCE_MATCH | 841 | Hs.743764 | 5 |  | 0 | 716M1D75M | 91 | UniGene | Transcribed locus, moderately similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]
>
> (obtained from Rest)
>
>
>
> On 27 February 2014 14:00, mag <mr6 at ebi.ac.uk> wrote:
>
>     Hi Genomeo,
>
>     To find which attributes are available, the Ensembl Doxygen
>     documentation usually covers everything you need.
>     Looking at
>     http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html
>     will tell you that you can obtain the following from a gene:
>
>     $gene->source()
>     $gene->analysis->logic_name()
>     $gene->description()
>     $gene->external_name()
>     $gene->biotype()
>     $gene->seq_region_start()
>     $gene->seq_region_end()
>     $gene->seq_region_name()
>     $gene->seq_region_strand()
>     $gene->display_id()
>
>     When using the API, you should always know what object_type you
>     are using, as it allows you to use the correct attributes.
>     In this example, if you are using a Bio::EnsEMBL::Gene,
>     object_type is 'gene'
>
>     For species and db_type as well, you need to know those beforehand
>     when using the Perl API directly.
>     They are the ones which will allow you to connect to the correct
>     database based on the data you are looking for.
>
>     Regarding cross references to other organisms, do you have any
>     examples?
>     Generally, we should be only mapping to other resources for the
>     same organism.
>     For example, for pig, we will only assign cross references to
>     Uniprot pig proteins.
>
>     The main exceptions I can think of are:
>     - HGNC names
>     Typically, if the coverage for a species is low (i.e. not all 20
>     odd thousand proteins have been submitted to UniProt or RefSeq),
>     we will use HGNC names to fill in the gaps.
>     Where no name can be found and there is a homolog in human, we use
>     the same name as in human.
>     - Ensembl translations
>     For some low coverage species, annotation was provided by
>     projecting human annotation via a whole genome alignment.
>     For these models, we add an external reference to the human
>     translation which was used to build the model.
>
>
>     Hope this helps,
>     Magali
>
>
>     On 27/02/2014 13:41, Genomeo Dev wrote:
>>     Thanks very much for the useful answer.
>>
>>     I noticed that cross ref also maps to genes from organisms other
>>     than that of the query gene ID. Any comment on that?
>>
>>     Related to the previous question, I use the following Rest python
>>     code to do id lookup for particular Ensembl IDs:
>>
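>>     import httplib2
>>
>>     http = httplib2.Http(".cache")
>>     server = "http://beta.rest.ensembl.org"
>>     pref = "/lookup/id/"
>>     ext = "?"
>>
>>     # inputfile1 is an open file with one Ensembl gene ID per line
>>     for line in inputfile1:
>>         geneid = line.rstrip('\n')
>>
>>         resp, content = http.request(server + pref + geneid + ext,
>>             method="GET", headers={"Content-Type": "application/json"})
>>
>>         # Report a failed lookup and move on to the next ID
>>         if resp.status != 200:
>>             print("%s\t%s\t%s" % (geneid, "Invalid response:", resp.status))
>>             continue
>>         print("%s\t%s" % (geneid, content))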
>>
>>
>>     And I get this output:
>>
>>     ENSG00000223972{"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","species":"homo_sapiens","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC
>>     Symbol;Acc:37102]","display_name":"DDX11L1","biotype":"pseudogene","end":14412,"seq_region_name":"1","db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}
>>
>>     What would be the classes/attributes to use under the Perl API to
>>     get that? i.e:
>>
>>     source
>>     object_type
>>     logic_name
>>     species
>>     description
>>     display_name
>>     biotype
>>     end
>>     seq_region_name
>>     db_type
>>     strand
>>     id
>>     start
>>
>>     Thanks,
>>
>>     G.
>>
>>
>>     On 27 February 2014 11:39, mag <mr6 at ebi.ac.uk> wrote:
>>
>>         Hi Genomeo,
>>
>>         The REST server only displays the current/latest release.
>>         The release version can be found with this endpoint:
>>         http://beta.rest.ensembl.org/documentation/info/software
>>
>>         To get more details with the Ensembl API, you only need to
>>         update the print_DBEntries method to display all the
>>         attributes you are looking for.
>>         Compared to the output from REST, we have the following:
>>         - display_id is $dbe->display_id()
>>         - primary_id is $dbe->primary_id()
>>         - version is $dbe->version()
>>         - description is $dbe->description()
>>         - dbname is $dbe->dbname()
>>         - synonyms is $dbe->get_all_synonyms()
>>         - info_type is $dbe->info_type()
>>         - info_text is $dbe->info_text()
>>         - db_display_name is $dbe->db_display_name()
>>
>>         You can choose what format the REST server will output.
>>         Details of all formats can be found in our user guide:
>>         http://beta.rest.ensembl.org/documentation/user_guide
>>         For tab-delimited output, content_type=text/x-gff3 is used,
>>         but it is only available for the /feature endpoint.
>>
>>         There is no file in the Ensembl ftp dumps that contains all
>>         the external references produced.
>>
>>
>>         Regards,
>>         Magali
>>
>>
>>         On 27/02/2014 11:20, Genomeo Dev wrote:
>>>         Hi,
>>>
>>>         I am interested in getting wide cross references to ensembl
>>>         gene IDs. I found two programmatic ways to do that which
>>>         give consistent results but different amount of details.
>>>         Using ENSG00000223972 as an example:
>>>         (1)
>>>         Using this rest API Endpoint python code
>>>         (http://beta.rest.ensembl.org/documentation/info/xref_id)
>>>
>>>         import httplib2, sys
>>>         import json
>>>
>>>         http = httplib2.Http(".cache")
>>>
>>>         server = "http://beta.rest.ensembl.org"
>>>         ext = "/xrefs/id/ENSG00000157764?"
>>>         resp, content = http.request(server + ext, method="GET", headers={"Content-Type": "application/json"})
>>>
>>>         if not resp.status == 200:
>>>             print("Invalid response: %s" % resp.status)
>>>             sys.exit()
>>>
>>>         decoded = json.loads(content)
>>>         print(repr(decoded))
>>>
>>>
>>>         I get:
>>>
>>>         {"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961","version":"2","description":null,"dbname":"OTTG","synonyms":[],"info_type":"NONE","info_text":"","db_display_name":"Havana
>>>         gene"}
>>>
>>>         {"primary_id":"Hs.714157","dbname":"UniGene","ensembl_identity":98,"synonyms":[],"ensembl_start":6,"xref_start":1,"xref_end":1639,"db_display_name":"UniGene","display_id":"Hs.714157","ensembl_end":1657,"version":"0","score":8055,"cigar_line":"1200M1D299M12D140M","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         1","xref_identity":97,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>
>>>         {"primary_id":"Hs.618434","dbname":"UniGene","ensembl_identity":58,"synonyms":[],"ensembl_start":669,"xref_start":1,"xref_end":974,"db_display_name":"UniGene","display_id":"Hs.618434","ensembl_end":1655,"version":"0","score":4757,"cigar_line":"537M1D299M12D138M","description":"Similar
>>>         to DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 isoform
>>>         1, mRNA (cDNA clone
>>>         IMAGE:6103207)","xref_identity":96,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>
>>>         {"display_id":"DDX11L1","primary_id":"37102","version":"0","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         1","dbname":"HGNC","synonyms":[],"info_type":"DIRECT","info_text":"Generated
>>>         via ensembl_manual","db_display_name":"HGNC Symbol"}
>>>
>>>         {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         5","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>
>>>         {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         1","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>
>>>         {"display_id":"ENSG00000223972","primary_id":"ENSG00000223972","version":"0","description":"","dbname":"ArrayExpress","synonyms":[],"info_type":"DIRECT","info_text":"","db_display_name":"ArrayExpress"}
>>>
>>>         {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         5","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}
>>>
>>>         {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>         1","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}]
>>>
>>>         (2)
>>>
>>>         Using this perl API code (based on
>>>         http://www.ensembl.org/info/docs/api/core/core_tutorial.html):
>>>
>>>         # Define a helper subroutine to print DBEntries
>>>         sub print_DBEntries
>>>         {
>>>              my $db_entries = shift;
>>>
>>>              foreach my $dbe ( @{$db_entries} ) {
>>>                  printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
>>>              }
>>>         }
>>>
>>>         my $genes = $gene_adaptor->fetch_all_by_stable_id_list([@gene_list]);
>>>
>>>         ...
>>>
>>>         print "GENE ", $gene->stable_id(), "\n";
>>>         print_DBEntries( $gene->get_all_DBEntries() );
>>>         I get:
>>>         XREF OTTHUMG00000000961 (OTTG)
>>>         XREF ENSG00000223972 (ArrayExpress)
>>>         XREF DDX11L1 (EntrezGene)
>>>         XREF DDX11L5 (EntrezGene)
>>>         XREF DDX11L1 (HGNC)
>>>         XREF Hs.618434 (UniGene)
>>>         XREF Hs.714157 (UniGene)
>>>         XREF DDX11L1 (WikiGene)
>>>         XREF DDX11L5 (WikiGene)
>>>
>>>
>>>         Questions:
>>>
>>>         1. am I correct in saying that the Rest code uses the latest
>>>         Ensembl release while the API code uses the Ensembl release
>>>         currently installed as part of the VM (I am using release 74)?
>>>
>>>         2. Rest code gives more extensive details (which I like)
>>>         compared to the perl API code. Could you suggest a simple
>>>         way to use the API to get the same details?
>>>
>>>         3. The Rest code output format. Is tab separated text supported?
>>>
>>>         4. Is there a file in the Ensembl FTP area which contains
>>>         pre-generated detailed cross-reference mappings for all current
>>>         Ensembl genes?
>>>         -- 
>>>
>>>         Thanks,
>>>
>>>         G.
>>>
>>>
>>>         _______________________________________________
>>>         Dev mailing list    Dev at ensembl.org
>>>         Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>         Ensembl Blog: http://www.ensembl.info/
>>
>>
>>     -- 
>>     G.
>>
>
>
> -- 
> G.
>
