[ensembl-dev] Bug or User error with filtering?

ian Longden ianl at ebi.ac.uk
Thu Aug 25 10:57:09 BST 2011


Hi Phillipe,

We match GenBank entries to Ensembl via UniProt entries.
The UniProt files we parse have DR lines, which we use to get the
GenBank entries (labelled EMBL in the file).

For example:

ID   EXAMPLE_HUMAN
...
DR EMBL; AAAA; BBBB;
....

Here AAAA would be the EMBL accession and BBBB the protein_id.
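
As a rough illustration of the parsing step described above (not the actual Ensembl xref code), a minimal Python sketch might look like this. The function name `parse_embl_xrefs`, the `EmblXref` tuple and the `MISSING_PROTEIN_IDS` set are hypothetical, and the field layout is assumed to be exactly the simplified "DR EMBL; accession; protein_id;" form shown in the example.

```python
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumption: protein_id values that should count as "no valid protein_id".
MISSING_PROTEIN_IDS = {"", "-", "NOT_ANNOTATED_CDS"}


class EmblXref(NamedTuple):
    uniprot_id: str
    embl_accession: str
    protein_id: Optional[str]  # None when the DR line has no valid protein_id


def parse_embl_xrefs(lines: Iterable[str]) -> Iterator[EmblXref]:
    """Yield one EmblXref per 'DR EMBL' line, tagged with the enclosing ID."""
    uniprot_id = None
    for line in lines:
        tag, _, rest = line.strip().partition(" ")
        if tag == "ID":
            # The entry name is the first token after the ID tag.
            tokens = rest.split()
            uniprot_id = tokens[0] if tokens else None
        elif tag == "DR" and uniprot_id is not None:
            # Split "EMBL; AAAA; BBBB;" into its semicolon-separated fields.
            fields = [f.strip() for f in rest.strip().rstrip(".").split(";")]
            if len(fields) >= 3 and fields[0] == "EMBL":
                protein_id = fields[2]
                yield EmblXref(
                    uniprot_id,
                    fields[1],
                    None if protein_id in MISSING_PROTEIN_IDS else protein_id,
                )


if __name__ == "__main__":
    example = ["ID   EXAMPLE_HUMAN", "...", "DR EMBL; AAAA; BBBB;", "...."]
    for xref in parse_embl_xrefs(example):
        print(xref)
```

A DR line whose protein_id field is NOT_ANNOTATED_CDS still yields a record, but with protein_id set to None, so it contributes no GenBank protein accession.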

For this particular transcript, ENST00000169293, we have two UniProt
SPTREMBL entries (C9JLU5_HUMAN and E7EQ37_HUMAN, both well mapped to its
translations). Neither has any EMBL reference with a valid protein_id;
in fact, both list "NOT_ANNOTATED_CDS" in that field.

The GenBank protein ID BAA05928 is mapped via the UniProt SWISSPROT
entry MASP1_HUMAN, which maps to another transcript (ENST00000337774)
in the same gene.

Does this help explain the process, and why we have no protein_id for
some transcripts?
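
To make the consequence for the "with protein ID(s): Only" filter concrete, here is a toy Python model of the mapping chain, using only the identifiers mentioned in this thread. It is a sketch of the logic as described above, not how BioMart or the Ensembl xref pipeline is implemented; the dictionaries and function names are hypothetical.

```python
from typing import Dict, List

# Transcript -> UniProt entries mapped to its translation (from this thread).
TRANSCRIPT_TO_UNIPROT: Dict[str, List[str]] = {
    "ENST00000169293": ["C9JLU5_HUMAN", "E7EQ37_HUMAN"],
    "ENST00000337774": ["MASP1_HUMAN"],
}

# UniProt entry -> valid protein_ids found on its DR EMBL lines.
# The two SPTREMBL entries only carry NOT_ANNOTATED_CDS, so they have none.
UNIPROT_TO_PROTEIN_IDS: Dict[str, List[str]] = {
    "C9JLU5_HUMAN": [],
    "E7EQ37_HUMAN": [],
    "MASP1_HUMAN": ["BAA05928"],
}


def protein_ids_for(transcript: str) -> List[str]:
    """Collect protein_ids reachable from a transcript via UniProt."""
    ids = set()
    for entry in TRANSCRIPT_TO_UNIPROT.get(transcript, []):
        ids.update(UNIPROT_TO_PROTEIN_IDS.get(entry, []))
    return sorted(ids)


def with_protein_id_only(transcripts: List[str]) -> List[str]:
    """Keep only transcripts that end up with at least one protein_id."""
    return [t for t in transcripts if protein_ids_for(t)]


if __name__ == "__main__":
    print(with_protein_id_only(["ENST00000169293", "ENST00000337774"]))
```

In this model ENST00000169293 is dropped by the filter even though BAA05928 belongs to the same gene, because the accession is attached to ENST00000337774 through MASP1_HUMAN.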


-Ian.

On Wed, Aug 24, 2011 at 3:46 PM, pip pipster <pipsterpip at yahoo.com> wrote:
> Sending this thread to the Ensembl mailing list now as it appears it may be
> Ensembl data related.  Any ideas why the ENSEMBL transcripts aren't mapped
> correctly to the GenBank Protein Accessions?
> Thank you for the help.
> Best regards,
> Phillipe
>
> ----- Forwarded Message -----
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>
> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> Sent: Tuesday, August 23, 2011 1:11 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> From the data you provided, as you said, it looks like these Ensembl
> transcripts (ENST00000169293 and many others in similar categories) are
> not mapped to GenBank protein accessions, and therefore are not
> retrieved via queries to BioMart.
> Unfortunately, I don't know why that is. I recommend forwarding your
> question to the Ensembl helpdesk, who may be able to assist you in this
> matter.
> Thank you.
> Elena Rivkin, PhD
> Outreach and Training Coordinator, Informatics and Bio-computing
>
> Ontario Institute for Cancer Research
> MaRS Centre, South Tower
> 101 College Street, Suite 800
> Toronto, Ontario, Canada M5G 0A3
> Tel: 647-258-4316
> Toll-free: 1-866-678-6427
> www.oicr.on.ca
> This message and any attachments may contain confidential and/or privileged
> information for the sole use of the intended recipient. Any review or
> distribution by anyone other than the person for whom it was originally
> intended is strictly prohibited. If you have received this message in error,
> please contact the sender and delete all copies. Opinions, conclusions or
> other information contained in this message may not be that of the
> organization.
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Tue, 23 Aug 2011 11:50:00 -0400
> To: Microsoft Office User <Elena.Rivkin at oicr.on.ca>, Junjun Zhang
> <Junjun.Zhang at oicr.on.ca>, "users at biomart.org" <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Elena,
> You should be able to follow this up the chain in getting accession numbers.
> a.  From Transcript
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
>
> b.  To Gene (link to this Gene URL is located on Transcript link above)
> http://www.ncbi.nlm.nih.gov/nuccore/D28593.1
>
> c.  To Protein (link to this Protein URL is located on Gene link above)
> http://www.ncbi.nlm.nih.gov/protein/471128
>
> From this stand-point, I am led to believe that the Transcript maps to
> a Genbank protein accession and should not be filtered out with
> the $query->addFilter("with_protein_id", ["Only"]) filter.  But in either
> case I would like to understand why it's being filtered out since I have to
> trust the data I get back and deal with it accordingly.
> Likewise, the following URL also appears to chain the Gene to the proper
> transcripts.
> http://www.ebi.ac.uk/ena/data/view/D28593
>
> It appears that for some reason the data in Ensembl is not mapping
> transcript ENST00000169293 (and many others in similar categories) to the
> proper protein accession.  But that's just my theory and I would love to
> understand it better.  Thoughts?
> Best regards,
> Phillipe
>
>
>
>
> ________________________________
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>; Junjun Zhang
> <Junjun.Zhang at oicr.on.ca>; "users at biomart.org" <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Sent: Monday, August 22, 2011 2:04 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> When entering GenBank protein ID BAA05928 and retrieving the Ensembl gene ID
> and transcript ID, I get the following:
> ENSG00000127241 ENST00000337774
> When entering GenBank protein ID CAC17726 and retrieving the Ensembl gene ID
> and transcript ID, I get the following:
> ENSG00000127152 ENST00000357195
> It appears that in the Ensembl mart that you are querying, these GenBank
> IDs correspond to different transcripts (although to the same gene IDs).
> Regards,
> Elena Rivkin, PhD
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Mon, 22 Aug 2011 13:51:56 -0400
> To: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, "users at biomart.org"
> <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Thank you Junjun.
> Elena, to answer your question, I believe the ncbi links in the below thread
> include a link to the protein where you can get the protein accession
> number.  For example, for the 2 transcripts below you will find links to the
> following proteins.  You will also see that the transcripts are correctly
> showing up on the URLs as being protein coding.
> http://www.ncbi.nlm.nih.gov/protein/471128 (accession BAA05928)
> and
> http://www.ncbi.nlm.nih.gov/protein/11558488 (accession CAC17726)
> Thank you,
> Phillipe
>
> ________________________________
> From: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> To: pip pipster <pipsterpip at yahoo.com>; "users at biomart.org"
> <users at biomart.org>
> Cc: Rhoda Kinsella via RT <helpdesk at ensembl.org>
> Sent: Monday, August 22, 2011 12:59 PM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> I am forwarding your questions to the Ensembl Helpdesk. The Ensembl team is
> best placed to answer questions about data contents in Ensembl databases.
> Cheers,
> Junjun
>
> From: Elena Rivkin <Elena.Rivkin at oicr.on.ca>
> Date: Mon, 22 Aug 2011 10:46:35 -0400
> To: pip pipster <pipsterpip at yahoo.com>, Rhoda Kinsella <rhoda at ebi.ac.uk>,
> "users at biomart.org" <users at biomart.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe,
> Can you let me know what the GenBank protein accessions are for these two
> transcripts? I can't find them.
> Thank you.
> Elena Rivkin, PhD
>
> From: pip pipster <pipsterpip at yahoo.com>
> Reply-To: pip pipster <pipsterpip at yahoo.com>
> Date: Mon, 22 Aug 2011 10:32:43 -0400
> To: Rhoda Kinsella <rhoda at ebi.ac.uk>, "users at biomart.org"
> <users at biomart.org>
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> After doing more investigation, something definitely isn't adding up.  As it
> turns out, filtering by Genbank protein accession is what we want and we
> need the ability to exclude.  The 2 transcripts below are examples (they
> show up as protein coding in GenBank as well as in Ensembl) but there are
> thousands more like this.  The filter below is taking them out despite them
> having a GenBank protein accession.  What may be causing this?
>
> ENST00000169293
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=ENST00000169293
> http://www.ncbi.nlm.nih.gov/nuccore/D28593?
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127241;r=3:186964149-187009745;t=ENST00000169293
>
> ENST00000345514
> http://www.ncbi.nlm.nih.gov/gene?term=ENST00000345514
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000127152;r=14:99635624-99737822;t=ENST00000345514
>
> Filter used:
> Manual (non-Perl)
>     Homo sapiens genes (GRCh37.p3)
>     Filters
>         with protein ID(s): Only
>     Attributes
>         Ensembl Gene ID
>         Ensembl Transcript ID
> Same problem occurs using Perl filter as well
>     $query->addFilter("with_protein_id", ["Only"]);
> ________________________________
> From: pip pipster <pipsterpip at yahoo.com>
> To: Rhoda Kinsella <rhoda at ebi.ac.uk>
> Cc: "users at biomart.org" <users at biomart.org>
> Sent: Monday, August 22, 2011 8:07 AM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Rhoda,
> Thank you for the feedback, very helpful.  The Gene Type filter,
> 'protein_coding' will likely work, however it doesn't allow me to do an
> 'exclude' type filter (i.e. give me everything except for the non
> protein-coding genes).  Do you know if you can still do an exclude using the
> method you described?
> Thank you!
> Phillipe
> ________________________________
> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> To: pip pipster <pipsterpip at yahoo.com>
> Cc: "users at biomart.org" <users at biomart.org>
> Sent: Monday, August 22, 2011 5:04 AM
> Subject: Re: [BioMart Users] Bug or User error with filtering?
>
> Hi Phillipe
> You are filtering using the protein ID (Genbank protein accession) and as
> this Ensembl protein ID does not have a corresponding Genbank protein
> accession, you will not get this ENSP. Please filter using the Gene type
> filter and select protein_coding. That way you will get the ENSP data you
> require.
> Regards
> Rhoda
>
> On 21 Aug 2011, at 22:54, pip pipster wrote:
>
> We are seeing strange things occur with the protein ID filter.  For example,
> transcript ENST00000345514 is being filtered out by the following search
> below.  However, you can see that it indeed has a Protein ID shown here:
> http://useast.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000127152;r=14:99635624-99737861;t=ENST00000345514
> .  Any idea why this is being filtered?  Could this be a bug in Biomart/Data
> or User Error?
>
> Manual (non-Perl)
>     Homo sapiens genes (GRCh37.p3)
>     Filters
>         with protein ID(s): Only
>     Attributes
>         Ensembl Gene ID
>         Ensembl Transcript ID
> Same problem occurs using Perl filter as well
>     $query->addFilter("with_protein_id", ["Only"]);
> Thank you,
> Phillipe
> _______________________________________________
> Users mailing list
> Users at biomart.org
> https://lists.biomart.org/mailman/listinfo/users
>
> Rhoda Kinsella Ph.D.
> Ensembl Bioinformatician,
> European Bioinformatics Institute (EMBL-EBI),
> Wellcome Trust Genome Campus,
> Hinxton
> Cambridge CB10 1SD,
> UK.
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe):
> http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
>
>



