[ensembl-dev] xref mapping

mag mr6 at ebi.ac.uk
Thu Feb 27 14:00:09 GMT 2014


Hi Genomeo,

To find which attributes are available, the Ensembl Doxygen 
documentation usually covers everything you need.
Looking at 
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html
will tell you that you can obtain the following from a gene:

$gene->source()
$gene->analysis->logic_name()
$gene->description()
$gene->external_name()
$gene->biotype()
$gene->seq_region_start()
$gene->seq_region_end()
$gene->seq_region_name()
$gene->seq_region_strand()
$gene->display_id()

When using the API, you should always know which object type you are 
working with, as that determines which methods you can call.
In this example, you are using a Bio::EnsEMBL::Gene, so object_type is 
'gene'.

Likewise, you need to know the species and db_type beforehand when 
using the Perl API directly.
These are what allow you to connect to the correct database for the 
data you are looking for.
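
A rough sketch of how the fields in the /lookup/id output quoted further
down line up with the getters listed above. The pairing is inferred from
the names and is an assumption, not a documented mapping; GETTERS,
KNOWN_BEFOREHAND and describe() are illustrative names only. The sample
record is copied from that quoted output.

```python
import json

# Sample /lookup/id response, copied from the output quoted later in this thread.
SAMPLE = (
    '{"source":"ensembl_havana","object_type":"Gene",'
    '"logic_name":"ensembl_havana_gene","species":"homo_sapiens",'
    '"description":"DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 '
    '[Source:HGNC Symbol;Acc:37102]","display_name":"DDX11L1",'
    '"biotype":"pseudogene","end":14412,"seq_region_name":"1",'
    '"db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}'
)

# Assumed pairing: REST field -> Perl getter on a Bio::EnsEMBL::Gene.
GETTERS = {
    "source": "$gene->source()",
    "logic_name": "$gene->analysis->logic_name()",
    "description": "$gene->description()",
    "display_name": "$gene->external_name()",
    "biotype": "$gene->biotype()",
    "start": "$gene->seq_region_start()",
    "end": "$gene->seq_region_end()",
    "seq_region_name": "$gene->seq_region_name()",
    "strand": "$gene->seq_region_strand()",
    "id": "$gene->display_id()",
}

# Fields you supply yourself when connecting, rather than read from the gene.
KNOWN_BEFOREHAND = ("object_type", "species", "db_type")


def describe(record):
    """Return (field, value, origin) rows for one decoded lookup record."""
    rows = []
    for field in sorted(record):
        if field in GETTERS:
            origin = GETTERS[field]
        elif field in KNOWN_BEFOREHAND:
            origin = "known beforehand"
        else:
            origin = "unmapped"
        rows.append((field, record[field], origin))
    return rows


record = json.loads(SAMPLE)
rows = describe(record)
for field, value, origin in rows:
    print("%s\t%s\t%s" % (field, value, origin))
```

Each printed row shows a REST field, its value, and either the getter
assumed to supply it or a note that you have to know it beforehand.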

Regarding cross references to other organisms, do you have any examples?
Generally, we should only be mapping to other resources for the same 
organism.
For example, for pig, we will only assign cross references to UniProt 
pig proteins.

The main exceptions I can think of are:
- HGNC names
Typically, if the coverage for a species is low (i.e., not all of the 
20,000-odd proteins have been submitted to UniProt or RefSeq), we will 
use HGNC names to fill in the gaps.
Where no name can be found and there is a homolog in human, we use the 
same name as in human.
- Ensembl translations
For some low-coverage species, annotation was provided by projecting 
human annotation via a whole-genome alignment.
For these models, we add an external reference to the human translation 
which was used to build the model.


Hope this helps,
Magali

On 27/02/2014 13:41, Genomeo Dev wrote:
> Thanks very much for the useful answer.
>
> I noticed that cross ref also maps to genes from organisms other than 
> that of the query gene ID. Any comment on that?
>
> Related to the previous question, I use the following Rest python code 
> to do id lookup for particular Ensembl IDs:
>
> pref= "/lookup/id/"
> ext = "?"
>
> for line in inputfile1:
>         geneid= line.rstrip('\n')
>
>         resp, content = http.request(server+pref+geneid+ext, 
> method="GET", headers={"Content-Type":"application/json"})
>
>         if not resp.status == 200:
>                 print "%s\t%s\t%s" %  (geneid, "Invalid response:", 
> resp.status)
>                 continue
>                 #sys.exit()
>         print "%s\t%s" % (geneid,content)
>
>
> And I get this output:
>
> ENSG00000223972{"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","species":"homo_sapiens","description":"DEAD/H 
> (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC 
> Symbol;Acc:37102]","display_name":"DDX11L1","biotype":"pseudogene","end":14412,"seq_region_name":"1","db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}
>
> What would be the classes/attributes to use under the Perl API to get 
> that? i.e:
>
> source
> object_type
> logic_name
> species
> description
> display_name
> biotype
> end
> seq_region_name
> db_type
> strand
> id
> start
>
> Thanks,
>
> G.
>
>
> On 27 February 2014 11:39, mag <mr6 at ebi.ac.uk> wrote:
>
>     Hi Genomeo,
>
>     The REST server only displays the current/latest release.
>     The release version can be found with this endpoint:
>     http://beta.rest.ensembl.org/documentation/info/software
>
>     To get more details with the Ensembl API, you only need to update
>     the print_DBEntries method to display all the attributes you are
>     looking for.
>     Compared to the output from REST, we have the following:
>     - display_id is $dbe->display_id()
>     - primary_id is $dbe->primary_id()
>     - version is $dbe->version()
>     - description is $dbe->description()
>     - dbname is $dbe->dbname()
>     - synonyms is $dbe->get_all_synonyms()
>     - info_type is $dbe->info_type()
>     - info_text is $dbe->info_text()
>     - db_display_name is $dbe->db_display_name()
>
>     You can choose what format the REST server will output.
>     Details of all formats can be found in our user guide:
>     http://beta.rest.ensembl.org/documentation/user_guide
>     For tab-delimited output, content_type=text/x-gff3 is used, but it
>     is only available for the /feature endpoint.
>
>     There is no file in the Ensembl ftp dumps that contains all the
>     external references produced.
>
>
>     Regards,
>     Magali
>
>
>     On 27/02/2014 11:20, Genomeo Dev wrote:
>>     Hi,
>>
>>     I am interested in getting wide cross references to Ensembl gene
>>     IDs. I found two programmatic ways to do that which give
>>     consistent results but different amounts of detail. Using
>>     ENSG00000223972 as an example:
>>     (1)
>>     Using this rest API Endpoint python code
>>     (http://beta.rest.ensembl.org/documentation/info/xref_id)
>>
>>     import httplib2, sys
>>
>>     http = httplib2.Http(".cache")
>>
>>     server = "http://beta.rest.ensembl.org"
>>     ext = "/xrefs/id/ENSG00000157764?"
>>     resp, content = http.request(server+ext, method="GET", headers={"Content-Type":"application/json"})
>>
>>     if not resp.status == 200:
>>         print "Invalid response: ", resp.status
>>         sys.exit()
>>     import json
>>
>>     decoded = json.loads(content)
>>     print repr(decoded)
>>
>>
>>     I get:
>>
>>     {"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961","version":"2","description":null,"dbname":"OTTG","synonyms":[],"info_type":"NONE","info_text":"","db_display_name":"Havana
>>     gene"}
>>
>>     {"primary_id":"Hs.714157","dbname":"UniGene","ensembl_identity":98,"synonyms":[],"ensembl_start":6,"xref_start":1,"xref_end":1639,"db_display_name":"UniGene","display_id":"Hs.714157","ensembl_end":1657,"version":"0","score":8055,"cigar_line":"1200M1D299M12D140M","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     1","xref_identity":97,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>
>>     {"primary_id":"Hs.618434","dbname":"UniGene","ensembl_identity":58,"synonyms":[],"ensembl_start":669,"xref_start":1,"xref_end":974,"db_display_name":"UniGene","display_id":"Hs.618434","ensembl_end":1655,"version":"0","score":4757,"cigar_line":"537M1D299M12D138M","description":"Similar
>>     to DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 isoform 1,
>>     mRNA (cDNA clone
>>     IMAGE:6103207)","xref_identity":96,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>
>>     {"display_id":"DDX11L1","primary_id":"37102","version":"0","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     1","dbname":"HGNC","synonyms":[],"info_type":"DIRECT","info_text":"Generated
>>     via ensembl_manual","db_display_name":"HGNC Symbol"}
>>
>>     {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     5","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>
>>     {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     1","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>
>>     {"display_id":"ENSG00000223972","primary_id":"ENSG00000223972","version":"0","description":"","dbname":"ArrayExpress","synonyms":[],"info_type":"DIRECT","info_text":"","db_display_name":"ArrayExpress"}
>>
>>     {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     5","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}
>>
>>     {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>     (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>     1","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}]
>>
>>     (2)
>>
>>     Using this perl API code (based on
>>     http://www.ensembl.org/info/docs/api/core/core_tutorial.html):
>>
>>     # Define a helper subroutine to print DBEntries
>>     sub print_DBEntries
>>     {
>>          my $db_entries = shift;
>>
>>          foreach my $dbe ( @{$db_entries} ) {
>>              printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
>>          }
>>     }
>>
>>     my $genes = $gene_adaptor->fetch_all_by_stable_id_list([@gene_list]);
>>
>>     ...
>>
>>     print "GENE ", $gene->stable_id(), "\n";
>>     print_DBEntries( $gene->get_all_DBEntries() );
>>     I get:
>>     XREF OTTHUMG00000000961 (OTTG)
>>     XREF ENSG00000223972 (ArrayExpress)
>>     XREF DDX11L1 (EntrezGene)
>>     XREF DDX11L5 (EntrezGene)
>>     XREF DDX11L1 (HGNC)
>>     XREF Hs.618434 (UniGene)
>>     XREF Hs.714157 (UniGene)
>>     XREF DDX11L1 (WikiGene)
>>     XREF DDX11L5 (WikiGene)
>>
>>
>>     Questions:
>>
>>     1. Am I correct in saying that the REST code uses the latest
>>     Ensembl release while the API code uses the Ensembl release
>>     currently installed as part of the VM (I am using release 74)?
>>
>>     2. Rest code gives more extensive details (which I like) compared
>>     to the perl API code. Could you suggest a simple way to use the
>>     API to get the same details?
>>
>>     3. The Rest code output format. Is tab separated text supported?
>>
>>     4. Is there a file in the Ensembl FTP area which contains
>>     pre-generated detailed cross ref mappings for all current Ensembl genes?
>>     -- 
>>
>>     Thanks,
>>
>>     G.
>>
>>
>>     _______________________________________________
>>     Dev mailing list    Dev at ensembl.org
>>     Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>     Ensembl Blog: http://www.ensembl.info/
>
>
>
> -- 
> G.
