[ensembl-dev] xref mapping

Genomeo Dev genomeodev at gmail.com
Thu Feb 27 14:37:33 GMT 2014


Thanks very much Magali for pointing that out.

If I understand you correctly, db_type and species are therefore attributes
of the query gene IDs, not of the returned cross-reference IDs. For my query
IDs, I see that I define them here:

my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor( "human", "core",
"gene" );
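
To convince myself of this, here is a minimal Python sketch (not the Perl API)
that reads species, db_type and object_type from a /lookup/id record. The JSON
is the lookup output quoted further down in this thread; adaptor_arguments is
a hypothetical helper name of mine, and I am assuming that the REST name
homo_sapiens and the registry alias "human" point at the same database.

```python
import json

# /lookup/id response for ENSG00000223972, copied from the output quoted
# later in this thread (split across lines only for readability).
LOOKUP_JSON = (
    '{"source":"ensembl_havana","object_type":"Gene",'
    '"logic_name":"ensembl_havana_gene","species":"homo_sapiens",'
    '"description":"DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 '
    '[Source:HGNC Symbol;Acc:37102]","display_name":"DDX11L1",'
    '"biotype":"pseudogene","end":14412,"seq_region_name":"1",'
    '"db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}'
)


def adaptor_arguments(lookup_record):
    """Return (species, db_type, object_type) for one /lookup/id record.

    Hypothetical helper: the tuple mirrors the three arguments passed to
    get_adaptor() above. object_type is lower-cased to match "gene".
    """
    return (
        lookup_record["species"],
        lookup_record["db_type"],
        lookup_record["object_type"].lower(),
    )


record = json.loads(LOOKUP_JSON)
print("%s\t%s\t%s" % adaptor_arguments(record))
```

All three values describe the query ID itself, which matches what Magali
said: none of them is a property of the cross references that come back.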

I have tried to look up about 5000 human Ensembl IDs and found that for 256 of
them I get cross mappings to other organisms. This only happens for UniGene.
For example, for ENSG00000010244:

Query display_id: ENSG00000010244 (dbname: ensembl). In every row below,
dbname and db_display_name are UniGene, info_type is SEQUENCE_MATCH, version
is 0, primary_id is the same as display_id, and evalue, info_text and
synonyms are empty.

display_id  ensembl_start  ensembl_end  xref_start  xref_end  score  ensembl_identity  xref_identity  cigar_line                      description
Hs.500775   1              4672         1           4672      23313  99                99             1631M1D1592M1I1448M             Zinc finger protein 207
Hs.612377   5667           5417         1           249       1200   1                 97             249M                            Transcribed locus
Hs.636112   12853          12400        1           452       2260   3                 100            452M                            Transcribed locus
Hs.658344   5213           4526         1           684       3230   4                 91             615M1I8M1D3M1D34M1D13M1D5M1I4M  Transcribed locus
Hs.670238   3014           2607         23          427       1995   2                 94             399M1D6M                        Transcribed locus
Hs.694378   11505          12131        2           628       3063   4                 98             627M                            Transcribed locus
Hs.716993   1              396          1           396       1971   99                99             396M                            Transcribed locus, strongly similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]
Hs.743764   50             841          1           791       3472   5                 91             716M1D75M                       Transcribed locus, moderately similar to NP_001034109.1 Zfp207 gene product [Rattus norvegicus]

(obtained from the REST API)
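
The hits that mention another organism can be picked out mechanically. A
minimal sketch, assuming that a trailing "[Genus species]" tag in the
description is what marks such a hit; foreign_organism and ORGANISM_TAG are
hypothetical names of mine, and the three records are cut down from the
output above.

```python
import re

# Matches a trailing "[Genus species]" tag at the end of a description.
ORGANISM_TAG = re.compile(r"\[([A-Z][a-z]+ [a-z]+)\]\s*$")


def foreign_organism(description, expected="Homo sapiens"):
    """Return the organism named in the description if it differs from
    `expected`; return None when there is no tag or the tag matches."""
    match = ORGANISM_TAG.search(description or "")
    if match and match.group(1) != expected:
        return match.group(1)
    return None


# Cut-down records taken from the ENSG00000010244 output above.
xrefs = [
    {"display_id": "Hs.500775", "dbname": "UniGene",
     "description": "Zinc finger protein 207"},
    {"display_id": "Hs.716993", "dbname": "UniGene",
     "description": "Transcribed locus, strongly similar to "
                    "NP_001034109.1 Zfp207 gene product [Rattus norvegicus]"},
    {"display_id": "Hs.743764", "dbname": "UniGene",
     "description": "Transcribed locus, moderately similar to "
                    "NP_001034109.1 Zfp207 gene product [Rattus norvegicus]"},
]

flagged = []
for xref in xrefs:
    organism = foreign_organism(xref["description"])
    if organism is not None:
        flagged.append((xref["display_id"], organism))

for display_id, organism in flagged:
    print("%s\t%s" % (display_id, organism))
```

This only inspects the description text, so it says nothing about how the
mapping itself was made.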



On 27 February 2014 14:00, mag <mr6 at ebi.ac.uk> wrote:

>  Hi Genomeo,
>
> To find which attributes are available, the Ensembl Doxygen documentation
> usually covers everything you need.
> Looking at
> http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html
> will tell you that you can obtain the following from a gene:
>
> $gene->source()
> $gene->analysis->logic_name()
> $gene->description()
> $gene->external_name()
> $gene->biotype()
> $gene->seq_region_start()
> $gene->seq_region_end()
> $gene->seq_region_name()
> $gene->seq_region_strand()
> $gene->display_id()
>
> When using the API, you should always know which object_type you are
> working with, as that tells you which attributes are available.
> In this example, if you are using a Bio::EnsEMBL::Gene, object_type is
> 'gene'.
>
> Likewise, you need to know species and db_type beforehand when using the
> Perl API directly.
> They are what allows you to connect to the correct database for the data
> you are looking for.
>
> Regarding cross references to other organisms, do you have any examples?
> Generally, we should be only mapping to other resources for the same
> organism.
> For example, for pig, we will only assign cross references to Uniprot pig
> proteins.
>
> The main exceptions I can think of are:
> - HGNC names
> Typically, if the coverage for a species is low (i.e., not all 20-odd
> thousand proteins have been submitted to Uniprot or RefSeq), we will use
> HGNC names to fill in the gaps.
> Where no name can be found and there is a homolog in human, we use the
> same name as in human.
> - Ensembl translations
> For some low coverage species, annotation was provided by projecting
> human annotation via a whole genome alignment.
> For these models, we add an external reference to the human translation
> which was used to build the model.
>
>
> Hope this helps,
> Magali
>
>
>  On 27/02/2014 13:41, Genomeo Dev wrote:
>
> Thanks very much for the useful answer.
>
>  I noticed that the cross references also map to genes from organisms
> other than that of the query gene ID. Any comment on that?
>
>  Related to the previous question, I use the following REST Python code
> to do ID lookup for particular Ensembl IDs:
>
> # Assumes http, server and inputfile1 are already defined.
> pref = "/lookup/id/"
> ext = "?"
>
> for line in inputfile1:
>     geneid = line.rstrip('\n')
>
>     resp, content = http.request(server + pref + geneid + ext,
>         method="GET", headers={"Content-Type": "application/json"})
>
>     # Report failed lookups and move on to the next ID.
>     if resp.status != 200:
>         print("%s\t%s\t%s" % (geneid, "Invalid response:", resp.status))
>         continue
>     print("%s\t%s" % (geneid, content.decode("utf-8")))
>
>
>  And I get this output:
>
>  ENSG00000223972 {"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","species":"homo_sapiens","description":"DEAD/H
> (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC
> Symbol;Acc:37102]","display_name":"DDX11L1","biotype":"pseudogene","end":14412,"seq_region_name":"1","db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}
>
>  Which classes/attributes would I use in the Perl API to get the same
> fields, i.e.:
>
>  source
> object_type
> logic_name
> species
> description
> display_name
> biotype
> end
> seq_region_name
>  db_type
> strand
> id
> start
>
>  Thanks,
>
>  G.
>
>
> On 27 February 2014 11:39, mag <mr6 at ebi.ac.uk> wrote:
>
>>  Hi Genomeo,
>>
>> The REST server only displays the current/latest release.
>> The release version can be found with this endpoint:
>> http://beta.rest.ensembl.org/documentation/info/software
>>
>> To get more details with the Ensembl API, you only need to update the
>> print_DBEntries method to display all the attributes you are looking for.
>> Compared to the output from REST, we have the following:
>> - display_id is $dbe->display_id()
>> - primary_id is $dbe->primary_id()
>> - version is $dbe->version()
>> - description is $dbe->description()
>> - dbname is $dbe->dbname()
>> - synonyms is $dbe->get_all_synonyms()
>> - info_type is $dbe->info_type()
>> - info_text is $dbe->info_text()
>> - db_display_name is $dbe->db_display_name()
>>
>> You can choose which format the REST server outputs.
>> Details of all formats can be found in our user guide:
>> http://beta.rest.ensembl.org/documentation/user_guide
>> For tab-delimited output, content_type=text/x-gff3 is used, but it is
>> only available for the /feature endpoint.
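
As tab-delimited output is not offered for /xrefs/id, the JSON can be
flattened on the client side instead. A minimal sketch under my own
assumptions: xrefs_to_tsv and COLUMNS are hypothetical names, the column list
follows the fields listed above, and the sample record is the first one from
the /xrefs/id output quoted below.

```python
import csv
import io
import json

# Column order follows the DBEntry fields listed above.
COLUMNS = [
    "display_id", "primary_id", "version", "description", "dbname",
    "synonyms", "info_type", "info_text", "db_display_name",
]


def xrefs_to_tsv(xrefs, columns=COLUMNS):
    """Flatten a list of xref dicts into tab-separated text with a header."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for xref in xrefs:
        row = []
        for name in columns:
            value = xref.get(name)
            if isinstance(value, list):
                value = ",".join(value)  # synonyms arrives as a list
            elif value is None:
                value = ""               # JSON null becomes an empty cell
            row.append(value)
        writer.writerow(row)
    return out.getvalue()


# First record of the /xrefs/id output for ENSG00000223972.
sample = json.loads(
    '[{"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961",'
    '"version":"2","description":null,"dbname":"OTTG","synonyms":[],'
    '"info_type":"NONE","info_text":"","db_display_name":"Havana gene"}]'
)

tsv = xrefs_to_tsv(sample)
print(tsv)
```

Alignment fields such as cigar_line or score could be added to COLUMNS in the
same way; records that lack a column simply get an empty cell.
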
>>
>> There is no file in the Ensembl ftp dumps that contains all the external
>> references produced.
>>
>>
>> Regards,
>> Magali
>>
>>
>> On 27/02/2014 11:20, Genomeo Dev wrote:
>>
>>   Hi,
>>
>>  I am interested in getting comprehensive cross references for Ensembl
>> gene IDs. I found two programmatic ways to do that, which give consistent
>> results but different amounts of detail. Using ENSG00000223972 as an example:
>>  (1)
>> Using this REST API endpoint Python code (
>> http://beta.rest.ensembl.org/documentation/info/xref_id)
>>
>>
>>    import json
>>    import sys
>>
>>    import httplib2
>>
>>    http = httplib2.Http(".cache")
>>
>>    server = "http://beta.rest.ensembl.org"
>>    ext = "/xrefs/id/ENSG00000223972?"
>>    resp, content = http.request(server + ext, method="GET",
>>        headers={"Content-Type": "application/json"})
>>
>>    # Stop if the server did not return the cross references.
>>    if resp.status != 200:
>>        print("Invalid response: %s" % resp.status)
>>        sys.exit()
>>
>>    decoded = json.loads(content)
>>    print(repr(decoded))
>>
>>
>>  I get:
>>
>>  {"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961","version":"2","description":null,"dbname":"OTTG","synonyms":[],"info_type":"NONE","info_text":"","db_display_name":"Havana
>> gene"}
>>
>>  {"primary_id":"Hs.714157","dbname":"UniGene","ensembl_identity":98,"synonyms":[],"ensembl_start":6,"xref_start":1,"xref_end":1639,"db_display_name":"UniGene","display_id":"Hs.714157","ensembl_end":1657,"version":"0","score":8055,"cigar_line":"1200M1D299M12D140M","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 1","xref_identity":97,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>
>>  {"primary_id":"Hs.618434","dbname":"UniGene","ensembl_identity":58,"synonyms":[],"ensembl_start":669,"xref_start":1,"xref_end":974,"db_display_name":"UniGene","display_id":"Hs.618434","ensembl_end":1655,"version":"0","score":4757,"cigar_line":"537M1D299M12D138M","description":"Similar
>> to DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 isoform 1, mRNA (cDNA
>> clone
>> IMAGE:6103207)","xref_identity":96,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>
>>  {"display_id":"DDX11L1","primary_id":"37102","version":"0","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 1","dbname":"HGNC","synonyms":[],"info_type":"DIRECT","info_text":"Generated
>> via ensembl_manual","db_display_name":"HGNC Symbol"}
>>
>>  {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 5","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>
>>  {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 1","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>
>>
>> {"display_id":"ENSG00000223972","primary_id":"ENSG00000223972","version":"0","description":"","dbname":"ArrayExpress","synonyms":[],"info_type":"DIRECT","info_text":"","db_display_name":"ArrayExpress"}
>>
>>  {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 5","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}
>>
>>  {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>> (Asp-Glu-Ala-Asp/His) box helicase 11 like
>> 1","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}]
>>
>>  (2)
>>
>>  Using this Perl API code (based on
>> http://www.ensembl.org/info/docs/api/core/core_tutorial.html):
>>
>>  # Define a helper subroutine to print DBEntries
>> sub print_DBEntries
>> {
>>     my $db_entries = shift;
>>
>>     foreach my $dbe ( @{$db_entries} ) {
>>         printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
>>     }
>> }
>>
>> my $genes = $gene_adaptor->fetch_all_by_stable_id_list([@gene_list]);
>>
>>
>> ...
>>
>>
>> print "GENE ", $gene->stable_id(), "\n";
>> print_DBEntries( $gene->get_all_DBEntries() );
>>
>>  I get:
>>  XREF OTTHUMG00000000961 (OTTG)
>> XREF ENSG00000223972 (ArrayExpress)
>> XREF DDX11L1 (EntrezGene)
>> XREF DDX11L5 (EntrezGene)
>> XREF DDX11L1 (HGNC)
>> XREF Hs.618434 (UniGene)
>> XREF Hs.714157 (UniGene)
>>  XREF DDX11L1 (WikiGene)
>> XREF DDX11L5 (WikiGene)
>>
>>
>>  Questions:
>>
>>  1. Am I correct in saying that the REST code uses the latest Ensembl
>> release, while the API code uses the Ensembl release currently installed as
>> part of the VM (I am using release 74)?
>>
>>  2. The REST code gives more extensive details (which I like) compared to
>> the Perl API code. Could you suggest a simple way to use the API to get the
>> same details?
>>
>>  3. The REST code output format: is tab-separated text supported?
>>
>>  4. Is there a file in the Ensembl FTP area which contains pre-generated,
>> detailed cross-reference mappings for all current Ensembl genes?
>> --
>>
>>  Thanks,
>>
>>  G.
>>
>>
>>  _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>>
>
>
>  --
> G.
>
>


-- 
G.