[ensembl-dev] Bug or User error with filtering?

Andy Yates ayates at ebi.ac.uk
Fri Aug 26 12:09:45 BST 2011


Hi Phillipe,

So the short answer to all of this is that we get our Protein IDs from the UniProtKB record a protein is mapped to; this is why you are seeing low coverage of the Protein IDs. We take these IDs from UniProt because it is convenient and reduces the amount of work we need to perform in our Xref runs.

In one of your examples where the mapping looks suspect, BAA05928.1 is linked to an Ensembl peptide, but that peptide is ENSP00000336792. This is because MASP1_HUMAN (P48740) has been linked using a DIRECT mapping and the Protein ID has been brought in as a DEPENDENT xref. UniProt mappings can come from a sequence match or from a lookup generated from a UniProt-provided file (ftp://ftp.ebi.ac.uk/pub/contrib/xrefs/ens-sp.map). If a UniProtKB record has been mapped using a direct mapping then that takes preference over sequence matches.
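
To make the precedence rule concrete, here is a minimal Python sketch. It is only an illustration and not the Xref pipeline code: the function name choose_mapping, the two dictionaries and the placeholder ID ENSP_HYPOTHETICAL_ISOFORM are assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical inputs for illustration only (not real Ensembl data structures):
#   direct   : UniProt accession -> Ensembl protein IDs from the UniProt-provided lookup
#   sequence : UniProt accession -> Ensembl protein IDs found by sequence matching
Mapping = Dict[str, List[str]]


def choose_mapping(accession: str, direct: Mapping, sequence: Mapping) -> Tuple[str, List[str]]:
    """Pick the Ensembl proteins for one UniProt accession.

    A direct mapping, when present, takes preference over sequence matches.
    """
    if direct.get(accession):
        return "DIRECT", list(direct[accession])
    if sequence.get(accession):
        return "SEQUENCE_MATCH", list(sequence[accession])
    return "NONE", []


# P48740 -> ENSP00000336792 is the direct mapping described in this thread.
direct = {"P48740": ["ENSP00000336792"]}
# The sequence-match entry below is invented to show what gets overridden.
sequence = {"P48740": ["ENSP00000336792", "ENSP_HYPOTHETICAL_ISOFORM"]}

mapping_type, proteins = choose_mapping("P48740", direct, sequence)
print(mapping_type, proteins)
```

With these inputs only ENSP00000336792 is kept, even though the invented sequence match would also have linked a second isoform.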

So really the issues are:

1) UniProt have told us that P48740 maps only to ENSP00000336792
2) The UniProt record lists the Protein IDs of all the various isoforms in the one record
3) We do not disentangle this, so one Ensembl protein gets all the Protein ID records
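
The three points above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not our actual parser: the function names, the HYPOTHETICAL entries and the placeholder ENSP_HYPOTHETICAL_ISOFORM are made up, the DR line layout follows the simplified "DR EMBL; accession; protein_id;" form Ian described, and treating "-" or "NOT_ANNOTATED_CDS" as "no protein_id" is an assumption.

```python
from typing import Dict, List


def protein_ids_from_dr_lines(record_lines: List[str]) -> List[str]:
    """Collect protein_ids from 'DR EMBL; <accession>; <protein_id>; ...' lines."""
    protein_ids = []
    for line in record_lines:
        if not line.startswith("DR"):
            continue
        fields = [field.strip() for field in line[2:].split(";")]
        if len(fields) < 3 or fields[0] != "EMBL":
            continue
        protein_id = fields[2]
        # Assumption: entries without an annotated CDS carry no usable protein_id.
        if protein_id and protein_id not in ("-", "NOT_ANNOTATED_CDS"):
            protein_ids.append(protein_id)
    return protein_ids


def attach_dependent_xrefs(record_lines: List[str], mapped_proteins: List[str]) -> Dict[str, List[str]]:
    """Give every protein_id in the record to each Ensembl protein the record maps to."""
    protein_ids = protein_ids_from_dr_lines(record_lines)
    return {ensp: list(protein_ids) for ensp in mapped_proteins}


# Illustrative record: only D28593 / BAA05928.1 come from this thread,
# the HYPOTHETICAL entries are invented.
record = [
    "ID   MASP1_HUMAN",
    "DR EMBL; D28593; BAA05928.1;",
    "DR EMBL; HYPOTHETICAL1; HYP00001.1;",
    "DR EMBL; HYPOTHETICAL2; -; NOT_ANNOTATED_CDS;",
]

# Point 1: the record is mapped only to ENSP00000336792.
xrefs = attach_dependent_xrefs(record, ["ENSP00000336792"])


def has_protein_id(ensembl_protein: str) -> bool:
    """Mimic the effect of a 'with protein ID(s): Only' filter on one protein."""
    return bool(xrefs.get(ensembl_protein))


print(xrefs)
print(has_protein_id("ENSP00000336792"), has_protein_id("ENSP_HYPOTHETICAL_ISOFORM"))
```

In this sketch the single mapped protein collects every protein_id in the record (points 2 and 3), while any other isoform has none and would be dropped by a "with protein ID" filter.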

I hope this helps.

Best regards,

Andy
---
Andrew Yates                   Ensembl Core Software Project Leader
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

On 25 Aug 2011, at 15:35, pip pipster wrote:

> Hi Ian,
> Thanks for helping.  If you don't mind, I'm still trying to understand the process.  Here are my comments.
> 
> 
> >>> For this particular Transcript ENST00000169293 we have two UniProtSPTREMBL entries (C9JLU5_HUMAN and E7EQ37_HUMAN, well mapped to its translations) of which neither has any EMBL references with a valid protein_id and in fact it lists this as "NOT_ANNOTATED_CDS" for this part.
> a.  ENST00000337774 appears to map to this as well (via Ensembl's External References->General Identifiers link).  Is the reason that ENST00000337774 shows up correctly as protein coding that, in addition to this, it also maps to the UniProt SWISSPROT MASP1_HUMAN, and the latter is what flagged it correctly?
> 
> b.  As you can see here (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293 ) the transcript shows up as protein coding and also maps to the same Gene/Protein as ENST00000337774 does.  Is it a bug that UniProt is not showing the proper link information for ENST00000169293, or am I interpreting the following wrong: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293 ?  From the NCBI link, I would have assumed that UniProt would have shown the same for that transcript.
> 
> c.  Why does Ensembl use UniProt files as an intermediate lookup for GenBank?  My fear here is that it is possible that thousands (over 60,000) of transcripts are showing up incorrectly.
> 
> Thank you,
> Phillipe
> 
> 
> 
> From: ian Longden <ianl at ebi.ac.uk>
> To: pip pipster <pipsterpip at yahoo.com>
> Cc: "Dev at ensembl.org" <Dev at ensembl.org>
> Sent: Thursday, August 25, 2011 5:57 AM
> Subject: Re: [ensembl-dev] Bug or User error with filtering?
> 
> Hi Phillipe,
> 
> We match GenBank entries to EnsEMBL via UniProt entries.
> The UniProt files we parse have DR lines which we use to get the
> GenBank entries (EMBL in the file).
> 
> i.e.
> 
> ID  EXAMPLE_HUMAN
> ...
> DR EMBL; AAAA; BBBB;
> ....
> 
> Where AAAA would be the EMBL accession and BBBB the protein_id
> 
> For this particular Transcript ENST00000169293 we have two UniProt
> SPTREMBL entries
> (C9JLU5_HUMAN and E7EQ37_HUMAN, well mapped to its translations) of
> which neither has any EMBL references with a valid protein_id and in
> fact it lists this as  "NOT_ANNOTATED_CDS" for this part.
> 
> Protein GenBank ID: BAA05928 is mapped via the UniProt SWISSPROT
> MASP1_HUMAN which maps to another Transcript (ENST00000337774) in the
> same gene.
> 
> Does this help explain the process and explain why we have no
> protein_id for some Transcripts?
> 
> 
> -Ian.
> 
> On Wed, Aug 24, 2011 at 3:46 PM, pip pipster <pipsterpip at yahoo.com> wrote:
> > Sending this thread to the Ensembl mailing list now as it appears it may be
> > Ensembl data related.  Any ideas why the ENSEMBL transcripts aren't mapped
> > correctly to the GenBank Protein Accessions?
> > Thank you for the help.
> > Best regards,
> > Phillipe
> >
> > ----- Forwarded Message -----
> > From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> > To: pip pipster <pipsterpip at yahoo.com>
> > Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> > Sent: Tuesday, August 23, 2011 1:11 PM
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Hi Phillipe,
> > From the data you provided, as you said, it looks like these Ensembl
> > transcripts (ENST00000169293) (and many others in similar categories) are
> > not mapped to the GenBank Protein Accessions, and therefore are not
> > retrieved via queries to BioMart.
> > Unfortunately, I don't know why that is. I recommend forwarding your
> > question to Ensembl helpdesk, and they might be able to assist you in this
> > matter.
> > Thank you.
> > Elena Rivkin, PhD
> > Outreach and Training Coordinator, Informatics and Bio-computing
> >
> > Ontario Institute for Cancer Research
> > MaRS Centre, South Tower
> > 101 College Street, Suite 800
> > Toronto, Ontario, Canada M5G 0A3
> > Tel: 647-258-4316
> > Toll-free: 1-866-678-6427
> > www.oicr.on.ca
> > This message and any attachments may contain confidential and/or privileged
> > information for the sole use of the intended recipient. Any review or
> > distribution by anyone other than the person for whom it was originally
> > intended is strictly prohibited. If you have received this message in error,
> > please contact the sender and delete all copies. Opinions, conclusions or
> > other information contained in this message may not be that of the
> > organization.
> >
> > From: pip pipster <pipsterpip at yahoo.com>
> > Reply-To: pip pipster <pipsterpip at yahoo.com>
> > Date: Tue, 23 Aug 2011 11:50:00 -0400
> > To: Microsoft Office User <Elena.Rivkin at oicr.on.ca>, Junjun Zhang
> > <Junjun.Zhang at oicr.on.ca>, "users at biomart.org" <users at biomart.org>
> > Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Elena,
> > You should be able to follow this up the chain in getting accession numbers.
> > a.  From Transcript
> > http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
> >
> > b.  To Gene (link to this Gene URL is located on Transcript link above)
> > http://www.ncbi.nlm.nih.gov/nuccore/D28593.1
> >
> > c.  To Protein (link to this Protein URL is located on Gene link above)
> > http://www.ncbi.nlm.nih.gov/protein/471128
> >
> > From this stand-point, I am led to believe that the Transcript maps to
> > a Genbank protein accession and should not be filtered out with
> > the $query->addFilter("with_protein_id", ["Only"]) filter.  But in either
> > case I would like to understand why it's being filtered out since I have to
> > trust the data I get back and deal with it accordingly.
> > Likewise, the following URL also appears to chain the Gene to the proper
> > transcripts.
> > http://www.ebi.ac.uk/ena/data/view/D28593
> >
> > It appears that for some reason the data in Ensembl is not mapping
> > transcript ENST00000169293 (and many others in similar categories) to the
> > proper Protein Accession.  But that's just my theory and would love to
> > understand it better.  Thoughts?
> > Best regards,
> > Phillipe
> >
> >
> >
> >
> > ________________________________
> > From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> > To: pip pipster <pipsterpip at yahoo.com>; Junjun Zhang
> > <Junjun.Zhang at oicr.on.ca>; "users at biomart.org" <users at biomart.org>
> > Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> > Sent: Monday, August 22, 2011 2:04 PM
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Hi Phillipe,
> > When entering Protein GenBank ID: BAA05928, and retrieving Ensembl gene id
> > and transcript id, I get the following:
> > ENSG00000127241 ENST00000337774
> > When entering Protein GenBank ID: CAC17726, and retrieving Ensembl gene id
> > and transcript id, I get the following:
> > ENSG00000127152, ENST00000357195
> > It appears that in the Ensembl mart that you are querying, these GenBank
> > IDs correspond to different transcripts (although to the same gene ID).
> > Regards,
> > Elena Rivkin, PhD
> > Outreach and Training Coordinator, Informatics and Bio-computing
> >
> >
> > From: pip pipster <pipsterpip at yahoo.com>
> > Reply-To: pip pipster <pipsterpip at yahoo.com>
> > Date: Mon, 22 Aug 2011 13:51:56 -0400
> > To: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, "users at biomart.org"
> > <users at biomart.org>
> > Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Thank you Junjun.
> > Elena, to answer your question, I believe the ncbi links in the below thread
> > include a link to the protein where you can get the protein accession
> > number.  For example, for the 2 transcripts below you will find links to the
> > following proteins.  You will also see that the transcripts are correctly
> > showing up on the URL's as being protein coding.
> > http://www.ncbi.nlm.nih.gov/protein/471128 (accession BAA05928)
> > and
> > http://www.ncbi.nlm.nih.gov/protein/11558488 (accession CAC17726)
> > Thank you,
> > Phillipe
> >
> > ________________________________
> > From: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> > To: pip pipster <pipsterpip at yahoo.com>; "users at biomart.org"
> > <users at biomart.org>
> > Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> > Sent: Monday, August 22, 2011 12:59 PM
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Hi Phillipe,
> > I am forwarding your questions to the Ensembl Helpdesk. Ensembl team is the
> > best to answer questions about data contents in Ensembl databases.
> > Cheers,
> > Junjun
> >
> > From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> > Date: Mon, 22 Aug 2011 10:46:35 -0400
> > To: pip pipster <pipsterpip at yahoo.com>, Rhoda Kinsella <rhoda at ebi.ac.uk>,
> > "users at biomart.org" <users at biomart.org>
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Hi Phillipe,
> > Can you let me know, for these two transcripts, what their GenBank
> > protein accessions are? I can't find them.
> > Thank you.
> > Elena Rivkin, PhD
> > Outreach and Training Coordinator, Informatics and Bio-computing
> >
> >
> > From: pip pipster <pipsterpip at yahoo.com>
> > Reply-To: pip pipster <pipsterpip at yahoo.com>
> > Date: Mon, 22 Aug 2011 10:32:43 -0400
> > To: Rhoda Kinsella <rhoda at ebi.ac.uk>, "users at biomart.org"
> > <users at biomart.org>
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > After doing more investigation, something definitely isn't adding up.  As it
> > turns out, filtering by Genbank protein accession is what we want and we
> > need the ability to exclude.  The 2 transcripts below are examples (they
> > show up as protein coding in GenBank as well as Ensembl) but there are
> > thousands more like this.  The filter below is taking them out despite them
> > having a Genbank protein accession.  What may be causing this?
> >
> > ENST00000169293
> > http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
> > http://www.ncbi.nlm.nih.gov/nuccore/D28593?
> > http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127241;r=3:186964149-187009745;t=ENST00000169293
> >
> > ENST00000345514
> > http://www.ncbi.nlm.nih.gov/gene?term=ENST00000345514
> > http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127152;r=14:99635624-99737822;t=ENST00000345514
> >
> > Filter used:
> > Manual (non-Perl)
> >     Homo sapiens genes (GRCh37.p3)
> >     Filters
> >         with protein ID(s): Only
> >     Attributes
> >         Ensembl Gene ID
> >         Ensembl Transcript ID
> > Same problem occurs using Perl filter as well
> >     $query->addFilter("with_protein_id", ["Only"]);
> > ________________________________
> > From: pip pipster <pipsterpip at yahoo.com>
> > To: Rhoda Kinsella <rhoda at ebi.ac.uk>
> > Cc: "users at biomart.org" <users at biomart.org>
> > Sent: Monday, August 22, 2011 8:07 AM
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Rhoda,
> > Thank you for the feedback, very helpful.  The Gene Type filter,
> > 'protein_coding' will likely work, however it doesn't allow me to do an
> > 'exclude' type filter (i.e. give me everything except for the non
> > protein-coding genes).  Do you know if you can still do an exclude using the
> > method you described?
> > Thank you!
> > Phillipe
> > ________________________________
> > From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> > To: pip pipster <pipsterpip at yahoo.com>
> > Cc: "users at biomart.org" <users at biomart.org>
> > Sent: Monday, August 22, 2011 5:04 AM
> > Subject: Re: [BioMart Users] Bug or User error with filtering?
> >
> > Hi Phillipe
> > You are filtering using the protein ID (Genbank protein accession) and as
> > this Ensembl protein ID does not have a corresponding Genbank protein
> > accession, you will not get this ENSP. Please filter using the Gene type
> > filter and select protein_coding. That way you will get the ENSP data you
> > require.
> > Regards
> > Rhoda
> >
> > On 21 Aug 2011, at 22:54, pip pipster wrote:
> >
> > We are seeing strange things occur with the protein ID filter.  For example,
> > transcript ENST00000345514 is being filtered out by the following search
> > below.  However, you can see that it indeed has a Protein ID shown here:
> > http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000127152;r=14:99635624-99737861;t=ENST00000345514
> > .  Any idea why this is being filtered?  Could this be a bug in Biomart/Data
> > or User Error?
> >
> > Manual (non-Perl)
> >     Homo sapiens genes (GRCh37.p3)
> >     Filters
> >         with protein ID(s): Only
> >     Attributes
> >         Ensembl Gene ID
> >         Ensembl Transcript ID
> > Same problem occurs using Perl filter as well
> >     $query->addFilter("with_protein_id", ["Only"]);
> > Thank you,
> > Phillipe
> > _______________________________________________
> > Users mailing list
> > Users at biomart.org
> > https://lists.biomart.org/mailman/listinfo/users
> >
> > Rhoda Kinsella Ph.D.
> > Ensembl Bioinformatician,
> > European Bioinformatics Institute (EMBL-EBI),
> > Wellcome Trust Genome Campus,
> > Hinxton
> > Cambridge CB10 1SD,
> > UK.
> >
> >
> > _______________________________________________
> > Dev mailing list    Dev at ensembl.org
> > List admin (including subscribe/unsubscribe):
> > http://lists.ensembl.org/mailman/listinfo/dev
> > Ensembl Blog: http://www.ensembl.info/
> >
> >
> 








