[ensembl-dev] Bug or User error with filtering?

Southan, Christopher Christopher.Southan at astrazeneca.com
Fri Aug 26 08:30:09 BST 2011


Phillipe,  

1) I think this is for someone from the teams to pick up. I am just an observer (and an Ensembl fan, of course), although it is part of my job to try to understand protein database relationships.
2) As comments to help you: UniProt (especially the Swiss-Prot component) is certainly the best choice for the "is-a-(real)-protein" filter against the gene models.
3) Its imperfections are some circularity (but this is low, I think) and an element of redundancy between Swiss-Prot and TrEMBL.
4) The flow of coding annotation from GenBank mRNAs is complex but, as you know, it starts with the CDS given by the submitter and a primary protein accession number; these then get merged into RefSeq and UniProt. There are also some predicted CDSs in genomic records.
5) As mentioned below, UniProt entries will have a pointer to the mRNAs (except for the few circular cases).
6) But remember that many possible Ensembl splice-variant transcripts will not have an exact UniProt match, only scored matches.
 

Yours,  Chris  

From: pip pipster [mailto:pipsterpip at yahoo.com] 
Sent: 25 August 2011 23:21
To: Southan, Christopher; ian Longden
Cc: Dev at ensembl.org
Subject: Re: [ensembl-dev] Bug or User error with filtering?

Thank you Christopher.  Do you know why TrEMBL/UniProt are used at all for the GenBank is-protein filter?  Also, do you know whether the logic/predictions behind "TrEMBL taking Ensembl predictions" are documented anywhere?

If there is a way to filter by which transcripts GenBank believes code for proteins, that would be ideal.

Thank you,
Phillipe

________________________________________
From: "Southan, Christopher" <Christopher.Southan at astrazeneca.com>
To: pip pipster <pipsterpip at yahoo.com>; ian Longden <ianl at ebi.ac.uk>
Cc: Dev at ensembl.org
Sent: Thursday, August 25, 2011 2:04 PM
Subject: RE: [ensembl-dev] Bug or User error with filtering?
Pip/Ian 
 
In this case some circularity has been introduced via TrEMBL taking Ensembl predictions (which I thought they did not do; I thought only UniParc did).

So http://www.uniprot.org/uniprot/C9JLU5 “is” ENSP00000409047, but http://www.uniprot.org/uniprot/E7EQ37 was removed last month.

If you walk through the General Identifiers view, e.g. http://www.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000127241;r=3:186935942-187009810;t=ENST00000296280

you will see that the ranked UniProt matches change according to the scores against that particular Ensembl splice variant
(but you will only see the Swiss-Prot splice variant matches in the Supporting evidence view).

Thus, while ENST00000337774 “top hits” Swiss-Prot MASP1_HUMAN in the General Identifiers view,

ENST00000296280 and some of the others “top hit” E7EQ37 and C9JLU5.

And ENSP00000296280 “top hits” Q9NSY8_HUMAN (TrEMBL), which does have an mRNA entry and a GenBank protein ID but, arguably, should also have been removed from UniProt by being subsumed into the Swiss-Prot entry.
 
Yours,  Chris 
 
Christopher Southan, B.Sc., M.Sc., Ph.D.
Consultant, Chemistry Intelligence and Knowledge Engineering Programme
_____________________________
AstraZeneca  R&D, R&D Information
KD2 BO7,  Mölndal, S-43183, Sweden
christopher.southan at astrazenca.com
Tel +44-31-7065288,  mob+46-702530710
http://www.linkedin.com/in/cdsouthan
 
________________________________________
Confidentiality Notice: This message is private and may contain confidential and proprietary information. If you have received this message in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this message is not permitted and may be unlawful.
 
From: dev-bounces at ensembl.org [mailto:dev-bounces at ensembl.org] On Behalf Of pip pipster
Sent: 25 August 2011 16:35
To: ian Longden
Cc: Dev at ensembl.org
Subject: Re: [ensembl-dev] Bug or User error with filtering?
 
Hi Ian,
Thanks for helping.  If you don't mind, I'm still trying to understand the process.  Here are my comments.
 

>>> For this particular Transcript ENST00000169293 we have two UniProtSPTREMBL entries (C9JLU5_HUMAN and E7EQ37_HUMAN, well mapped to its translations) of which neither has any EMBL references with a valid protein_id and in fact it lists this as "NOT_ANNOTATED_CDS" for this part.
a.  ENST00000337774 appears to map to this as well (via Ensembl's External References->General Identifiers link).  Is the reason that ENST00000337774 shows up correctly as protein coding that, in addition to this, it also maps to the UniProt Swiss-Prot entry MASP1_HUMAN, and the latter is what flagged it correctly?
 
b.  As you can see here (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293 ) the transcript shows up as protein coding and also maps to the same Gene/Protein as ENST00000337774 does.  Is it a bug that UniProt is not showing the proper link information for ENST00000169293, or am I interpreting the following wrong: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293 ?  From the NCBI link, I would have assumed that UniProt would have shown the same for that transcript.
 
c.  Why does Ensembl use UniProt files as an intermediate lookup for GenBank?  My fear here is that it is possible that thousands (over 60,000) of transcripts are showing up incorrectly.
 
Thank you,
Phillipe
 
 

________________________________________
From: ian Longden <ianl at ebi.ac.uk>
To: pip pipster <pipsterpip at yahoo.com>
Cc: "Dev at ensembl.org" <Dev at ensembl.org>
Sent: Thursday, August 25, 2011 5:57 AM
Subject: Re: [ensembl-dev] Bug or User error with filtering?

Hi Phillipe,

We match GenBank entries to Ensembl via UniProt entries.
The UniProt files we parse have DR lines, which we use to get the
GenBank entries (EMBL in the file).

i.e.

ID  EXAMPLE_HUMAN
...
DR EMBL; AAAA; BBBB;
....

Where AAAA would be the EMBL accession and BBBB the protein_id
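
As an illustration of the DR-line lookup described above, here is a minimal Python sketch. It is not the Ensembl xref pipeline's actual code: the function name parse_embl_dr and the EmblXref type are hypothetical, and treating "-" or a NOT_ANNOTATED_CDS status as "no protein_id" is an assumption about how such entries appear in the flat file.

```python
from typing import NamedTuple, Optional


class EmblXref(NamedTuple):
    accession: str              # EMBL/GenBank accession (AAAA in the example)
    protein_id: Optional[str]   # protein_id (BBBB in the example), or None


def parse_embl_dr(line: str) -> Optional[EmblXref]:
    """Parse one 'DR EMBL; ...' line; return None for any other line."""
    parts = line.split(None, 1)
    if len(parts) != 2 or parts[0] != "DR":
        return None
    # Fields are separated by ';'; drop a trailing '.' and empty fields.
    fields = [f.strip() for f in parts[1].rstrip().rstrip(".").split(";")]
    fields = [f for f in fields if f]
    if len(fields) < 3 or fields[0] != "EMBL":
        return None
    protein_id: Optional[str] = fields[2]
    # Assumption: "-" or a NOT_ANNOTATED_CDS status means no usable protein_id.
    if protein_id == "-" or "NOT_ANNOTATED_CDS" in fields:
        protein_id = None
    return EmblXref(accession=fields[1], protein_id=protein_id)


if __name__ == "__main__":
    print(parse_embl_dr("DR EMBL; AAAA; BBBB;"))
    print(parse_embl_dr("DR EMBL; CCCC; -; NOT_ANNOTATED_CDS;"))
```

An entry whose EMBL cross-references all come back with protein_id set to None contributes no GenBank protein accession to any transcript it maps to.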

For this particular Transcript ENST00000169293 we have two UniProt
SPTREMBL entries
(C9JLU5_HUMAN and E7EQ37_HUMAN, well mapped to its translations) of
which neither has any EMBL references with a valid protein_id and in
fact it lists this as  "NOT_ANNOTATED_CDS" for this part.

GenBank protein ID BAA05928 is mapped via the UniProt SWISSPROT entry
MASP1_HUMAN, which maps to another transcript (ENST00000337774) in the
same gene.

Does this help explain the process, and why we have no
protein_id for some transcripts?
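
To make the consequence for the "with protein ID(s): Only" filter concrete, the following Python sketch replays the two transcripts discussed in this thread. The dictionaries simply restate the mappings given above; the function names are hypothetical and this only models the behaviour, it is not how BioMart or Ensembl implement the filter.

```python
from typing import Dict, List

# Transcript -> UniProt entries mapped to it (as described in this thread).
TRANSCRIPT_TO_UNIPROT: Dict[str, List[str]] = {
    "ENST00000169293": ["C9JLU5_HUMAN", "E7EQ37_HUMAN"],
    "ENST00000337774": ["MASP1_HUMAN"],
}

# UniProt entry -> protein_ids taken from its DR EMBL lines.
UNIPROT_TO_PROTEIN_IDS: Dict[str, List[str]] = {
    "C9JLU5_HUMAN": [],            # only NOT_ANNOTATED_CDS references
    "E7EQ37_HUMAN": [],            # only NOT_ANNOTATED_CDS references
    "MASP1_HUMAN": ["BAA05928"],
}


def protein_ids_for(transcript: str) -> List[str]:
    """Collect protein_ids reachable from a transcript via UniProt."""
    ids: List[str] = []
    for entry in TRANSCRIPT_TO_UNIPROT.get(transcript, []):
        ids.extend(UNIPROT_TO_PROTEIN_IDS.get(entry, []))
    return ids


def with_protein_id_only(transcripts: List[str]) -> List[str]:
    """Keep only transcripts that have at least one protein_id."""
    return [t for t in transcripts if protein_ids_for(t)]


if __name__ == "__main__":
    kept = with_protein_id_only(["ENST00000169293", "ENST00000337774"])
    print(kept)  # ['ENST00000337774']
```

Under this model ENST00000169293 is dropped even though it is protein coding, because neither of its UniProt entries carries a protein_id.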


-Ian.

On Wed, Aug 24, 2011 at 3:46 PM, pip pipster <pipsterpip at yahoo.com> wrote:
> Sending this thread to the Ensembl mailing list now as it appears it may be
> Ensembl data related.  Any ideas why the ENSEMBL transcripts aren't mapped
> correctly to the GenBank Protein Accessions?
> Thank you for the help.
> Best regards,
> Phillipe
>
> ----- Forwarded Message -----
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>
> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> Sent: Tuesday, August 23, 2011 1:11 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> From the data you provided, as you said, it looks like these Ensembl
> transcripts (ENST00000169293) (and many others in similar categories) are
> not mapped to the GenBank Protein Accessions, and therefore are not
> retrieved via queries to BioMart.
> Unfortunately, I don't know why that is. I recommend forwarding your
> question to the Ensembl helpdesk, and they might be able to assist you in this
> matter.
> Thank you.
> Elena Rivkin, PhD
> Outreach and Training Coordinator, Informatics and Bio-computing
>
> Ontario Institute for Cancer Research
> MaRS Centre, South Tower
> 101 College Street, Suite 800
> Toronto, Ontario, Canada M5G 0A3
> Tel: 647-258-4316
> Toll-free: 1-866-678-6427
> www.oicr.on.ca
> This message and any attachments may contain confidential and/or privileged
> information for the sole use of the intended recipient. Any review or
> distribution by anyone other than the person for whom it was originally
> intended is strictly prohibited. If you have received this message in error,
> please contact the sender and delete all copies. Opinions, conclusions or
> other information contained in this message may not be that of the
> organization.
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Tue, 23 Aug 2011 11:50:00 -0400
> To: Microsoft Office User <Elena.Rivkin at oicr.on.ca>, Junjun Zhang
> <Junjun.Zhang at oicr.on.ca>, "users at biomart.org" <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Elena,
> You should be able to follow this up the chain in getting accession numbers.
> a.  From Transcript
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
>
> b.  To Gene (link to this Gene URL is located on Transcript link above)
> http://www.ncbi.nlm.nih.gov/nuccore/D28593.1
>
> c.  To Protein (link to this Protein URL is located on Gene link above)
> http://www.ncbi.nlm.nih.gov/protein/471128
>
> From this stand-point, I am led to believe that the Transcript maps to
> a Genbank protein accession and should not be filtered out with
> the $query->addFilter("with_protein_id", ["Only"]) filter.  But in either
> case I would like to understand why it's being filtered out since I have to
> trust the data I get back and deal with it accordingly.
> Likewise, the following URL also appears to chain the Gene to the proper
> transcripts.
> http://www.ebi.ac.uk/ena/data/view/D28593
>
> It appears that for some reason the data in Ensembl is not mapping
> transcript ENST00000169293 (and many others in similar categories) to the
> proper Protein Accession.  But that's just my theory and would love to
> understand it better.  Thoughts?
> Best regards,
> Phillipe
>
> ________________________________
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>; Junjun Zhang
> <Junjun.Zhang at oicr.on.ca>; "users at biomart.org" <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Sent: Monday, August 22, 2011 2:04 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> When entering GenBank protein ID BAA05928 and retrieving the Ensembl gene ID
> and transcript ID, I get the following:
> ENSG00000127241 ENST00000337774
> When entering GenBank protein ID CAC17726 and retrieving the Ensembl gene ID
> and transcript ID, I get the following:
> ENSG00000127152, ENST00000357195
> It appears that in the Ensembl mart that you are querying, these GenBank
> IDs correspond to different transcripts (although to the same gene IDs).
> Regards,
> Elena Rivkin, PhD
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Mon, 22 Aug 2011 13:51:56 -0400
> To: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, "users at biomart.org"
> <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Thank you Junjun.
> Elena, to answer your question, I believe the ncbi links in the below thread
> include a link to the protein where you can get the protein accession
> number.  For example, for the 2 transcripts below you will find links to the
> following proteins.  You will also see that the transcripts are correctly
> showing up on the URL's as being protein coding.
> http://www.ncbi.nlm.nih.gov/protein/471128 (accession BAA05928)
> and
> http://www.ncbi.nlm.nih.gov/protein/11558488 (accession CAC17726)
> Thank you,
> Phillipe
>
> ________________________________
> From: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>; "users at biomart.org"
> <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Sent: Monday, August 22, 2011 12:59 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> I am forwarding your questions to the Ensembl Helpdesk. Ensembl team is the
> best to answer questions about data contents in Ensembl databases.
> Cheers,
> Junjun
>
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> Date: Mon, 22 Aug 2011 10:46:35 -0400
> To: pip pipster <pipsterpip at yahoo.com>, Rhoda Kinsella <rhoda at ebi.ac.uk>,
> "users at biomart.org" <users at biomart.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> Can you let me know the GenBank protein accessions for these two
> transcripts? I can't find them.
> Thank you.
> Elena Rivkin, PhD
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Mon, 22 Aug 2011 10:32:43 -0400
> To: Rhoda Kinsella <rhoda at ebi.ac.uk>, "users at biomart.org"
> <users at biomart.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> After doing more investigation, something definitely isn't adding up.  As it
> turns out, filtering by GenBank protein accession is what we want and we
> need the ability to exclude.  The 2 transcripts below are examples (they
> show up as protein coding in GenBank as well as Ensembl) but there are
> thousands more like this.  The filter below is taking them out despite them
> having a GenBank protein accession.  What may be causing this?
>
> ENST00000169293
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
> http://www.ncbi.nlm.nih.gov/nuccore/D28593?
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127241;r=3:186964149-187009745;t=ENST00000169293
>
> ENST00000345514
> http://www.ncbi.nlm.nih.gov/gene?term=ENST00000345514
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127152;r=14:99635624-99737822;t=ENST00000345514
>
> Filter used:
> Manual (non-Perl)
>     Homo sapiens genes (GRCh37.p3)
>     Filters
>         with protein ID(s): Only
>     Attributes
>         Ensembl Gene ID
>         Ensembl Transcript ID
> Same problem occurs using Perl filter as well
>     $query->addFilter("with_protein_id", ["Only"]);
> ________________________________
> From: pip pipster <pipsterpip at yahoo.com>
> To: Rhoda Kinsella <rhoda at ebi.ac.uk>
> Cc: "users at biomart.org" <users at biomart.org>
> Sent: Monday, August 22, 2011 8:07 AM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Rhoda,
> Thank you for the feedback, very helpful.  The Gene Type filter,
> 'protein_coding' will likely work, however it doesn't allow me to do an
> 'exclude' type filter (i.e. give me everything except for the non
> protein-coding genes).  Do you know if you can still do an exclude using the
> method you described?
> Thank you!
> Phillipe
> ________________________________
> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> To: pip pipster <pipsterpip at yahoo.com>
> Cc: "users at biomart.org" <users at biomart.org>
> Sent: Monday, August 22, 2011 5:04 AM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe
> You are filtering using the protein ID (GenBank protein accession) and as
> this Ensembl protein ID does not have a corresponding GenBank protein
> accession, you will not get this ENSP. Please filter using the Gene type
> filter and select protein_coding. That way you will get the ENSP data you
> require.
> Regards
> Rhoda
>
> On 21 Aug 2011, at 22:54, pip pipster wrote:
>
> We are seeing strange things occur with the protein ID filter.  For example,
> transcript ENST00000345514 is being filtered out by the search
> below.  However, you can see that it does indeed have a Protein ID, shown here:
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000127152;r=14:99635624-99737861;t=ENST00000345514
> .  Any idea why this is being filtered?  Could this be a bug in BioMart/Data
> or User Error?
>
> Manual (non-Perl)
>     Homo sapiens genes (GRCh37.p3)
>     Filters
>         with protein ID(s): Only
>     Attributes
>         Ensembl Gene ID
>         Ensembl Transcript ID
> Same problem occurs using Perl filter as well
>     $query->addFilter("with_protein_id", ["Only"]);
> Thank you,
> Phillipe
> _______________________________________________
> Users mailing list
> Users at biomart.org
> https://lists.biomart.org/mailman/listinfo/users
>
> Rhoda Kinsella Ph.D.
> Ensembl Bioinformatician,
> European Bioinformatics Institute (EMBL-EBI),
> Wellcome Trust Genome Campus,
> Hinxton
> Cambridge CB10 1SD,
> UK.
>
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>

