[ensembl-dev] xref mapping

Thu Feb 27 16:56:34 GMT 2014

Hi Genomeo,

If no value is available, the attribute is null (undef in Perl).
An empty string is still a value, whereas null is undefined.
We tend to test that a value is defined before using any attribute that 
can be null.
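
As an illustration of that kind of test, here is a minimal Python sketch 
that prints one xref record (as returned by the REST /xrefs/id endpoint) as 
a tab-separated line, turning null values into empty strings. The function 
name format_xref_row and the FIELDS list are made up for this example; the 
field order follows the printf call in the quoted subroutine, and the sample 
record is the OTTG entry quoted further down in this thread. In Perl 5.10 or 
later, the same guard can be written with the defined-or operator, for 
example $dbe->description() // ''.

```python
# Field order mirrors the printf call in the quoted print_DBEntries subroutine.
FIELDS = ["dbname", "display_id", "description", "db_display_name",
          "info_text", "info_type", "primary_id", "synonyms", "version"]


def format_xref_row(xref, fields=FIELDS):
    """Return one tab-separated line; missing or null fields become ''."""
    cells = []
    for name in fields:
        value = xref.get(name)  # None if the key is absent or the value is null
        if value is None:
            cells.append("")
        elif isinstance(value, (list, tuple)):
            # synonyms come back as a list; join them like the Perl code does
            cells.append(", ".join(str(v) for v in value))
        else:
            cells.append(str(value))
    return "\t".join(cells)


# Sample record copied from the REST output quoted in this thread.
example = {
    "display_id": "OTTHUMG00000000961",
    "primary_id": "OTTHUMG00000000961",
    "version": "2",
    "description": None,  # JSON null
    "dbname": "OTTG",
    "synonyms": [],
    "info_type": "NONE",
    "info_text": "",
    "db_display_name": "Havana gene",
}

print(format_xref_row(example))
```

With this guard, a null description and an empty info_text both end up as 
an empty column, which is the behaviour the question was expecting.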

For the latency, I get the same result as you.
Connecting to the database and fetching all the genes took about 5 and a 
half minutes.

Going through all those genes and printing out information is not really 
going to be the bottleneck; the initial fetching of the data is.

Unfortunately, our public server is being used heavily at the moment, 
maybe because release 75 just came out.

If you are going to need the data regularly, our advice is to create 
your own local server.
Running your script on our local server, I was able to fetch all the 
genes in 6s.
Looping through these 63677 genes and printing all the results took 
around 6 minutes.
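
To see which phase dominates on a given server, the fetch and the loop can 
be timed separately. The following is only a sketch in Python: timed, 
fetch_genes and print_genes are hypothetical names, and the two stand-in 
functions would have to be replaced by the real fetching and printing code. 
If attributes are lazy-loaded, as the quoted message suggests, the loop 
phase may itself include database round trips.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print the elapsed wall-clock time, return both."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s: %.2f s" % (label, elapsed))
    return result, elapsed


# Hypothetical stand-ins for the real work (database fetch, xref printing).
def fetch_genes():
    return ["ENSG00000151067", "ENSG00000223972"]


def print_genes(genes):
    for gene in genes:
        print(gene)
    return len(genes)


genes, fetch_time = timed("fetch", fetch_genes)
count, loop_time = timed("loop", print_genes, genes)
```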

Regards,
Magali

On 27/02/2014 16:15, Genomeo Dev wrote:
> OK thanks.
>
> Another question please:
>
> Going back to my perl subroutine for getting xref for ensembl IDs:
>
> use warnings;
> sub print_DBEntries
> {
>     my $db_entries = shift;
>     foreach my $dbe ( @{$db_entries} ) {
>         printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
>             $dbe->dbname(), $dbe->display_id(), $dbe->description(),
>             $dbe->db_display_name(), $dbe->info_text(), $dbe->info_type(),
>             $dbe->primary_id(), join( ", ", @{ $dbe->get_all_synonyms() } ),
>             $dbe->version();
>     }
> }
>
> I found that some values are not returned for some databases. For 
> example using ENSG00000151067:
>
> #Use of uninitialized value in printf at ./fetch_ensembl_genes_v2.pl line 47.
> OTTG	OTTHUMG00000150243		Havana gene		NONE	OTTHUMG00000150243		7
> #Use of uninitialized value in printf at ./fetch_ensembl_genes_v2.pl line 47.
> ENS_LRG_gene	LRG_334		LRG display in Ensembl		NONE	LRG_334		0
> LRG	LRG_334	Locus Reference Genomic record for CACNA1C	Locus Reference Genomic		DIRECT	LRG_334		0
> ArrayExpress	ENSG00000151067		ArrayExpress		DIRECT	ENSG00000151067		0
> EntrezGene	CACNA1C	calcium channel, voltage-dependent, L type, alpha 1C subunit	EntrezGene		DEPENDENT	775	CACH2, CACN2, CACNL1A1, CaV1.2, CCHL1A1, LQT8, TS	0
> HGNC	CACNA1C	calcium channel, voltage-dependent, L type, alpha 1C subunit	HGNC Symbol	Generated via ensembl_manual	DIRECT	1390	CACH2, CACN2, CACNL1A1, Cav1.2, CCHL1A1, LQT8, TS	0
> MIM_GENE	CALCIUM CHANNEL, VOLTAGE-DEPENDENT [*114205]	CALCIUM CHANNEL, VOLTAGE-DEPENDENT, L TYPE, ALPHA-1C SUBUNIT; CACNA1C	MIM gene		DEPENDENT	114205		0
> MIM_MORBID	TIMOTHY SYNDROME [#601005]	TIMOTHY SYNDROME; TS	MIM disease		DEPENDENT	601005		0
> MIM_MORBID	BRUGADA SYNDROME 3 [#611875]	BRUGADA SYNDROME 3; BRGDA3	MIM disease		DEPENDENT	611875		0
> UniGene	Hs.690010	Voltage-dependent L-type Ca2+ channel alpha 1 subunit (CACNA1C) mRNA, exon 1a and partial cds	UniGene		SEQUENCE_MATCH	Hs.690010		0
> UniGene	Hs.697137	Transcribed locus, weakly similar to XP_416388.3 PREDICTED: voltage-dependent L-type calcium channel subunit alpha-1C [Gallus gallus]	UniGene		SEQUENCE_MATCH	Hs.697137		0
> Uniprot_gn	CACNA1C		UniProtKB Gene Name		DEPENDENT	CACNA1C	CACH2, CACN2, CACNL1A1, CCHL1A1	0
> WikiGene	CACNA1C	calcium channel, voltage-dependent, L type, alpha 1C subunit	WikiGene		DEPENDENT	775		0
>
> I would expect a missing value to be an empty string. Am I missing 
> anything?
>
> On a related note, this code runs very slowly (it takes 6 minutes for 
> 100 genes). From an earlier post, it seems that connecting to the 
> database is the bottleneck. Connecting to useastdb.ensembl.org instead 
> of ensembldb.ensembl.org is not much better. So I was wondering whether 
> there is a way to turn off lazy loading for the purpose of debugging?
>
>
>
>
> On 27 February 2014 14:43, mag <mr6 at ebi.ac.uk> wrote:
>
>     Hi Genomeo,
>
>     Taking the last entry as example:
>     Hs.743764 refers to
>     http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=5947187&TAXID=9606&SEARCH=Hs.743764
>     This is a human locus.
>     The description means that this locus is similar to a gene in rat,
>     as it has not been fully annotated in human.
>
>     I agree the description can be misleading, but it is imported
>     directly from NCBI as is, so there is not much we can do about it.
>
>
>     Regards,
>     Magali
>
>
>     On 27/02/2014 14:37, Genomeo Dev wrote:
>>     Thanks very much Magali for pointing that out.
>>
>>     If I understand you correctly, db_type and species are therefore
>>     attributes of the query gene IDs, not of the returned
>>     cross-reference IDs. For my query IDs, I see that I define them here:
>>
>>     my $gene_adaptor = Bio::EnsEMBL::Registry->get_adaptor( "human",
>>     "core", "gene" );
>>
>>     I have tried to look up about 5000 human Ensembl IDs and found
>>     that 256 of them are cross-mapped to other organisms. It only
>>     happens for UniGene. For example, for ENSG00000010244:
>>
>>     display_iddbnameensembl_startxref_startdisplay_idscoredb_display_namexref_endevalueinfo_textinfo_typeensembl_endprimary_idensembl_identitysynonymsversioncigar_linexref_identitydbnamedescription
>>     ENSG00000010244ensembl11Hs.50077523313UniGene4672SEQUENCE_MATCH4672Hs.5007759901631M1D1592M1I1448M99UniGeneZinc
>>     finger protein 207
>>     ENSG00000010244ensembl56671Hs.6123771200UniGene249SEQUENCE_MATCH5417Hs.61237710249M97UniGeneTranscribed
>>     locus
>>     ENSG00000010244ensembl128531Hs.6361122260UniGene452SEQUENCE_MATCH12400Hs.63611230452M100UniGeneTranscribed
>>     locus
>>     ENSG00000010244ensembl52131Hs.6583443230UniGene684SEQUENCE_MATCH4526Hs.65834440615M1I8M1D3M1D34M1D13M1D5M1I4M91UniGeneTranscribed
>>     locus
>>     ENSG00000010244ensembl301423Hs.6702381995UniGene427SEQUENCE_MATCH2607Hs.67023820399M1D6M94UniGeneTranscribed
>>     locus
>>     ENSG00000010244ensembl115052Hs.6943783063UniGene628SEQUENCE_MATCH12131Hs.69437840627M98UniGeneTranscribed
>>     locus
>>     ENSG00000010244ensembl11Hs.7169931971UniGene396SEQUENCE_MATCH396Hs.716993990396M99UniGeneTranscribed
>>     locus, strongly similar to NP_001034109.1 Zfp207 gene product
>>     [Rattus norvegicus]
>>     ENSG00000010244ensembl501Hs.7437643472UniGene791SEQUENCE_MATCH841Hs.74376450716M1D75M91UniGeneTranscribed
>>     locus, moderately similar to NP_001034109.1 Zfp207 gene product
>>     [Rattus norvegicus]
>>
>>     (obtained from Rest)
>>
>>
>>
>>     On 27 February 2014 14:00, mag <mr6 at ebi.ac.uk> wrote:
>>
>>         Hi Genomeo,
>>
>>         To find which attributes are available, the Ensembl Doxygen
>>         documentation usually covers everything you need.
>>         Looking at
>>         http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Gene.html
>>         will tell you that you can obtain the following from a gene:
>>
>>         $gene->source()
>>         $gene->analysis->logic_name()
>>         $gene->description()
>>         $gene->external_name()
>>         $gene->biotype()
>>         $gene->seq_region_start()
>>         $gene->seq_region_end()
>>         $gene->seq_region_name()
>>         $gene->seq_region_strand()
>>         $gene->display_id()
>>
>>         When using the API, you should always know what object_type
>>         you are using, as it tells you which attributes are available.
>>         In this example, if you are using a Bio::EnsEMBL::Gene,
>>         object_type is 'gene'.
>>
>>         Likewise, you need to know species and db_type beforehand
>>         when using the Perl API directly.
>>         They are the ones which will allow you to connect to the
>>         correct database based on the data you are looking for.
>>
>>         Regarding cross references to other organisms, do you have
>>         any examples?
>>         Generally, we should be only mapping to other resources for
>>         the same organism.
>>         For example, for pig, we will only assign cross references to
>>         Uniprot pig proteins.
>>
>>         The main exceptions I can think of are:
>>         - HGNC names
>>         Typically, if the coverage for a species is low (ie, not all
>>         20 odd thousand proteins have been submitted to Uniprot or
>>         RefSeq), we will use HGNC names to fill in the gaps.
>>         Where no name can be found and there is a homolog in human,
>>         we use the same name as in human.
>>         - Ensembl translations
>>         For some low-coverage species, annotation was provided by
>>         projecting human annotation via a whole genome alignment.
>>         For these models, we add an external reference to the human
>>         translation which was used to build the model.
>>
>>
>>         Hope this helps,
>>         Magali
>>
>>
>>         On 27/02/2014 13:41, Genomeo Dev wrote:
>>>         Thanks very much for the useful answer.
>>>
>>>         I noticed that cross ref also maps to genes from organisms
>>>         other than that of the query gene ID. Any comment on that?
>>>
>>>         Related to the previous question, I use the following Rest
>>>         python code to do id lookup for particular Ensembl IDs:
>>>
>>>         pref= "/lookup/id/"
>>>         ext = "?"
>>>
>>>         for line in inputfile1:
>>>             geneid = line.rstrip('\n')
>>>
>>>             resp, content = http.request(server + pref + geneid + ext,
>>>                 method="GET", headers={"Content-Type": "application/json"})
>>>
>>>             if not resp.status == 200:
>>>                 print "%s\t%s\t%s" % (geneid, "Invalid response:", resp.status)
>>>                 continue
>>>                 # sys.exit()
>>>             print "%s\t%s" % (geneid, content)
>>>
>>>
>>>         And I get this output:
>>>
>>>         ENSG00000223972{"source":"ensembl_havana","object_type":"Gene","logic_name":"ensembl_havana_gene","species":"homo_sapiens","description":"DEAD/H
>>>         (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [Source:HGNC
>>>         Symbol;Acc:37102]","display_name":"DDX11L1","biotype":"pseudogene","end":14412,"seq_region_name":"1","db_type":"core","strand":1,"id":"ENSG00000223972","start":11869}
>>>
>>>         What would be the classes/attributes to use under the Perl
>>>         API to get that? i.e:
>>>
>>>         source
>>>         object_type
>>>         logic_name
>>>         species
>>>         description
>>>         display_name
>>>         biotype
>>>         end
>>>         seq_region_name
>>>         db_type
>>>         strand
>>>         id
>>>         start
>>>
>>>         Thanks,
>>>
>>>         G.
>>>
>>>
>>>         On 27 February 2014 11:39, mag <mr6 at ebi.ac.uk> wrote:
>>>
>>>             Hi Genomeo,
>>>
>>>             The REST server only displays the current/latest release.
>>>             The release version can be found with this endpoint:
>>>             http://beta.rest.ensembl.org/documentation/info/software
>>>
>>>             To get more details with the Ensembl API, you only need
>>>             to update the print_DBEntries method to display all the
>>>             attributes you are looking for.
>>>             Compared to the output from REST, we have the following:
>>>             - display_id is $dbe->display_id()
>>>             - primary_id is $dbe->primary_id()
>>>             - version is $dbe->version()
>>>             - description is $dbe->description()
>>>             - dbname is $dbe->dbname()
>>>             - synonyms is $dbe->get_all_synonyms()
>>>             - info_type is $dbe->info_type()
>>>             - info_text is $dbe->info_text()
>>>             - db_display_name is $dbe->db_display_name()
>>>
>>>             You can choose which format the REST server outputs.
>>>             Details of all formats can be found in our user guide:
>>>             http://beta.rest.ensembl.org/documentation/user_guide
>>>             For tab-delimited output, content_type=text/x-gff3 is
>>>             used, but it is only available for the /feature endpoint.
>>>
>>>             There is no file in the Ensembl ftp dumps that contains
>>>             all the external references produced.
>>>
>>>
>>>             Regards,
>>>             Magali
>>>
>>>
>>>             On 27/02/2014 11:20, Genomeo Dev wrote:
>>>>             Hi,
>>>>
>>>>             I am interested in getting a wide range of cross references
>>>>             for Ensembl gene IDs. I found two programmatic ways to do
>>>>             that, which give consistent results but different amounts
>>>>             of detail. Using ENSG00000223972 as an example:
>>>>             (1)
>>>>             Using this rest API Endpoint python code
>>>>             (http://beta.rest.ensembl.org/documentation/info/xref_id)
>>>>
>>>>             import httplib2, sys
>>>>
>>>>             http = httplib2.Http(".cache")
>>>>
>>>>             server = "http://beta.rest.ensembl.org"
>>>>             ext = "/xrefs/id/ENSG00000157764?"
>>>>             resp, content = http.request(server+ext, method="GET", headers={"Content-Type":"application/json"})
>>>>
>>>>             if not resp.status == 200:
>>>>                 print "Invalid response: ", resp.status
>>>>                 sys.exit()
>>>>             import json
>>>>
>>>>             decoded = json.loads(content)
>>>>             print repr(decoded)
>>>>
>>>>
>>>>             I get:
>>>>
>>>>             {"display_id":"OTTHUMG00000000961","primary_id":"OTTHUMG00000000961","version":"2","description":null,"dbname":"OTTG","synonyms":[],"info_type":"NONE","info_text":"","db_display_name":"Havana
>>>>             gene"}
>>>>
>>>>             {"primary_id":"Hs.714157","dbname":"UniGene","ensembl_identity":98,"synonyms":[],"ensembl_start":6,"xref_start":1,"xref_end":1639,"db_display_name":"UniGene","display_id":"Hs.714157","ensembl_end":1657,"version":"0","score":8055,"cigar_line":"1200M1D299M12D140M","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             1","xref_identity":97,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>>
>>>>             {"primary_id":"Hs.618434","dbname":"UniGene","ensembl_identity":58,"synonyms":[],"ensembl_start":669,"xref_start":1,"xref_end":974,"db_display_name":"UniGene","display_id":"Hs.618434","ensembl_end":1655,"version":"0","score":4757,"cigar_line":"537M1D299M12D138M","description":"Similar
>>>>             to DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11
>>>>             isoform 1, mRNA (cDNA clone
>>>>             IMAGE:6103207)","xref_identity":96,"evalue":null,"info_text":"","info_type":"SEQUENCE_MATCH"}
>>>>
>>>>             {"display_id":"DDX11L1","primary_id":"37102","version":"0","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             1","dbname":"HGNC","synonyms":[],"info_type":"DIRECT","info_text":"Generated
>>>>             via ensembl_manual","db_display_name":"HGNC Symbol"}
>>>>
>>>>             {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             5","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>>
>>>>             {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             1","dbname":"EntrezGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"EntrezGene"}
>>>>
>>>>             {"display_id":"ENSG00000223972","primary_id":"ENSG00000223972","version":"0","description":"","dbname":"ArrayExpress","synonyms":[],"info_type":"DIRECT","info_text":"","db_display_name":"ArrayExpress"}
>>>>
>>>>             {"display_id":"DDX11L5","primary_id":"100287596","version":"0","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             5","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}
>>>>
>>>>             {"display_id":"DDX11L1","primary_id":"100287102","version":"0","description":"DEAD/H
>>>>             (Asp-Glu-Ala-Asp/His) box helicase 11 like
>>>>             1","dbname":"WikiGene","synonyms":[],"info_type":"DEPENDENT","info_text":"","db_display_name":"WikiGene"}]
>>>>
>>>>             (2)
>>>>
>>>>             Using this perl API code (based on
>>>>             http://www.ensembl.org/info/docs/api/core/core_tutorial.html):
>>>>
>>>>             # Define a helper subroutine to print DBEntries
>>>>             sub print_DBEntries
>>>>             {
>>>>                  my $db_entries = shift;
>>>>
>>>>                  foreach my $dbe ( @{$db_entries} ) {
>>>>                      printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
>>>>                  }
>>>>             }
>>>>
>>>>             my $genes = $gene_adaptor->fetch_all_by_stable_id_list([@gene_list]);
>>>>
>>>>             ...
>>>>
>>>>             print "GENE ", $gene->stable_id(), "\n";
>>>>             print_DBEntries( $gene->get_all_DBEntries() );
>>>>             I get:
>>>>             XREF OTTHUMG00000000961 (OTTG)
>>>>             XREF ENSG00000223972 (ArrayExpress)
>>>>             XREF DDX11L1 (EntrezGene)
>>>>             XREF DDX11L5 (EntrezGene)
>>>>             XREF DDX11L1 (HGNC)
>>>>             XREF Hs.618434 (UniGene)
>>>>             XREF Hs.714157 (UniGene)
>>>>             XREF DDX11L1 (WikiGene)
>>>>             XREF DDX11L5 (WikiGene)
>>>>
>>>>
>>>>             Questions:
>>>>
>>>>             1. Am I correct in saying that the Rest code uses the
>>>>             latest Ensembl release while the API code uses the
>>>>             Ensembl release currently installed as part of the VM
>>>>             (I am using release 74)?
>>>>
>>>>             2. The Rest code gives more extensive details (which I
>>>>             like) than the perl API code. Could you suggest
>>>>             a simple way to use the API to get the same details?
>>>>
>>>>             3. Is tab-separated text supported as an output format
>>>>             for the Rest code?
>>>>
>>>>             4. Is there a file in the Ensembl ftp area which
>>>>             contains pre-generated detailed cross-reference mappings
>>>>             for all current Ensembl genes?
>>>>             -- 
>>>>
>>>>             Thanks,
>>>>
>>>>             G.
>>>>
>>>>
>>>>             _______________________________________________
>>>>             Dev mailing list    Dev at ensembl.org
>>>>             Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>             Ensembl Blog: http://www.ensembl.info/
>>>
>>>
>>>
>>>
>>>
>>>
>>>         -- 
>>>         G.
>>>
>>>
>>
>>
>>
>>
>>
>>
>>     -- 
>>     G.
>>
>>
>
>
>
>
>
>
> -- 
> G.
>
>
