[ensembl-dev] Effects predictor version 2

Tue May 17 16:26:21 BST 2011

Hi Andrea,

I'm not quite sure what you are trying to achieve with your query. It looks like you are just counting transcript_variation entries that lie on chromosome 1, which will include lots of transcript_variations that do not fall in the CDS of a transcript and are not predicted to cause a single amino acid substitution.

The polyphen_prediction and sift_prediction tables include predictions for all possible single amino acid substitutions in the Ensembl proteome, so the VEP script can use them to look up predictions for novel mutations. When we run our pipeline to populate the transcript_variation table, we use these tables to look up the predictions for any actual variants in our database that are predicted to cause an amino acid substitution, and we store the qualitative prediction in the transcript_variation table (mainly to speed up various web views). We do not store the score in the transcript_variation table, so the API also uses these tables to look up the score when you call, e.g., the sift_score method on a TranscriptVariationAllele object.

If you're interested in joining the transcript_variation table with, e.g., the polyphen_prediction table, you could use some SQL like:

SELECT tv.transcript_variation_id, pred.prediction
FROM polyphen_prediction pred, protein_info pi, protein_position pp, transcript_variation tv
-- keep only single amino acid substitutions (one residue either side of the '/')
WHERE tv.pep_allele_string LIKE '_/_'
-- the third character of the allele string is the substituted amino acid
AND substr(tv.pep_allele_string,3,1) = pred.amino_acid
AND tv.feature_stable_id = pi.transcript_stable_id
AND pp.position = tv.translation_start
AND pp.protein_info_id = pi.protein_info_id
AND pred.protein_position_id = pp.protein_position_id

and you could add in some constraint to limit this to transcript_variations on chromosome 1 if that's what you're after (though this will probably take a long time to return because of the LIKE constraint). Does that help?
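
As a rough illustration of that join with a chromosome constraint added, here is a minimal sketch that runs the same logic against a small in-memory SQLite database from Python. The table and column names are the ones used in this thread, but everything else is an assumption made for the example: the rows, the stable IDs (ENST_TOY_1, ENST_TOY_2) and the helper names build_toy_db and predictions_for_region are made up, the tables are cut down to only the columns the join needs, and the core tables sit in the same database rather than in a separate homo_sapiens_core database as they do in the real schema.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with toy versions of the tables (made-up rows)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE transcript_variation (
            transcript_variation_id INTEGER PRIMARY KEY,
            feature_stable_id TEXT,
            translation_start INTEGER,
            pep_allele_string TEXT
        );
        CREATE TABLE protein_info (
            protein_info_id INTEGER PRIMARY KEY,
            transcript_stable_id TEXT
        );
        CREATE TABLE protein_position (
            protein_position_id INTEGER PRIMARY KEY,
            protein_info_id INTEGER,
            position INTEGER
        );
        CREATE TABLE polyphen_prediction (
            protein_position_id INTEGER,
            amino_acid TEXT,
            prediction TEXT
        );
        -- Stand-ins for the core database tables used to find the chromosome.
        CREATE TABLE transcript_stable_id (transcript_id INTEGER, stable_id TEXT);
        CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, seq_region_id INTEGER);
    """)
    con.executemany("INSERT INTO transcript VALUES (?, ?)", [(1, 27511), (2, 99999)])
    con.executemany("INSERT INTO transcript_stable_id VALUES (?, ?)",
                    [(1, "ENST_TOY_1"), (2, "ENST_TOY_2")])
    con.executemany("INSERT INTO protein_info VALUES (?, ?)",
                    [(1, "ENST_TOY_1"), (2, "ENST_TOY_2")])
    con.executemany("INSERT INTO protein_position VALUES (?, ?, ?)",
                    [(10, 1, 5), (11, 1, 6), (20, 2, 7)])
    con.executemany("INSERT INTO polyphen_prediction VALUES (?, ?, ?)",
                    [(10, "V", "benign"), (10, "D", "probably damaging"),
                     (11, "V", "benign"), (20, "L", "possibly damaging")])
    con.executemany("INSERT INTO transcript_variation VALUES (?, ?, ?, ?)", [
        (100, "ENST_TOY_1", 5, "A/D"),      # substitution on the region of interest
        (101, "ENST_TOY_1", 6, "A"),        # synonymous: no '/' in the allele string
        (102, "ENST_TOY_1", None, None),    # outside the CDS: no peptide information
        (103, "ENST_TOY_2", 7, "P/L"),      # substitution on a different seq_region
    ])
    return con


def predictions_for_region(con, seq_region_id):
    """Return (transcript_variation_id, prediction) pairs for one seq_region."""
    sql = """
        SELECT tv.transcript_variation_id, pred.prediction
        FROM transcript_variation tv
        JOIN protein_info pi ON pi.transcript_stable_id = tv.feature_stable_id
        JOIN protein_position pp ON pp.protein_info_id = pi.protein_info_id
                                AND pp.position = tv.translation_start
        JOIN polyphen_prediction pred ON pred.protein_position_id = pp.protein_position_id
                                     AND pred.amino_acid = substr(tv.pep_allele_string, 3, 1)
        JOIN transcript_stable_id st ON st.stable_id = tv.feature_stable_id
        JOIN transcript t ON t.transcript_id = st.transcript_id
        WHERE tv.pep_allele_string LIKE '_/_'
          AND t.seq_region_id = ?
        ORDER BY tv.transcript_variation_id
    """
    return con.execute(sql, (seq_region_id,)).fetchall()


if __name__ == "__main__":
    con = build_toy_db()
    print(predictions_for_region(con, 27511))
```

In this toy data only the 'A/D' row on seq_region 27511 survives the join: the synonymous and non-coding rows have nothing to match in the prediction table, which is the same reason most transcript_variation rows carry no sift or polyphen value.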

Cheers,

Graham

On 17 May 2011, at 16:01, Andrea Edwards wrote:

> Hi, I appreciate you will only have sift and polyphen for non-synonymous SNPs in exons. I was just wondering if my query was correct or whether I needed to link to the sift_prediction/polyphen_prediction/protein_position tables to get more accurate data. I presumed the predictions from these tables were simply copied into the polyphen and sift fields of the transcript_variation table and that my query is OK.
> 
> On 17/05/11 15:42, Graham Ritchie wrote:
>> Hi Andrea,
>> 
>> We only have sift and polyphen predictions for variants which are predicted to result in single amino acid substitutions.
>> 
>> Cheers,
>> 
>> Graham
>> 
>> On 17 May 2011, at 15:37, Andrea Edwards wrote:
>> 
>>> Hello
>>> 
>>> Whilst looking into Stuart's question I looked at the variants on chromosome 1 out of curiosity and found that most of them don't have sift/polyphen data.
>>> Is this correct, or have I made a mistake in my understanding of the schema?
>>> 
>>> variants on chr1 (seq_region_id = 27511)
>>> ============================
>>> 
>>> mysql>  select count(*) from transcript_variation tv inner join
>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511;
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> |  9633745 |
>>> +----------+
>>> 1 row in set (3.34 sec)
>>> 
>>> 
>>> variants on chr1 without sift and polyphen
>>> ===========================
>>> 
>>> mysql>  select count(*) from transcript_variation tv inner join
>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>> tv.sift_prediction is null and tv.polyphen_prediction is null;
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> |  9562313 |
>>> +----------+
>>> 1 row in set (11.22 sec)
>>> 
>>> 
>>> variants on chr1 with sift and polyphen
>>> =========================
>>> 
>>> mysql>  select count(*) from transcript_variation tv inner join
>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>> tv.sift_prediction is not null and tv.polyphen_prediction is not null;
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> |    67919 |
>>> +----------+
>>> 1 row in set (11.19 sec)
>>> 
>>> 
>>> 
>>> thanks
>>> 
>>> 
>>> On 17/05/11 13:59, Stuart Meacham wrote:
>>>> Hello,
>>>> 
>>>> Thanks for the reply.
>>>> 
>>>> On 17/05/11 13:35, Will McLaren wrote:
>>>> 
>>>>> This is strange - are you sure you are checking out the branch and not
>>>>> the head of the API? You should be doing something like:
>>>>> 
>>>>> cvs checkout -r branch-ensembl-62 ensembl
>>>>> cvs checkout -r branch-ensembl-62 ensembl-variation
>>>> Actually I just used the links from the site here:
>>>> 
>>>> http://www.ensembl.org/info/docs/api/api_installation.html
>>>> 
>>>> the link(s) resolve to things like:
>>>> 
>>>> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl.tar.gz?root=ensembl&only_with_tag=branch-ensembl-62&view=tar
>>>> 
>>>>>> The script silently over-writes an existing output file of the same name,
>>>>>> this seems a bit brutal, perhaps the default should be to fail if the file
>>>>>> exists.
>>>>> I think this is pretty standard behaviour for command-line programs. I
>>>>> could change it to only run if an output file name is specified,
>>>>> perhaps?
>>>> Yes, probably it's standard behaviour. I was just imagining accidentally overwriting a file the script had spent 24 hours creating . . .
>>>> 
>>>>> That's also odd - any variants classified as non-synonymous coding
>>>>> should have a "SIFT=*" entry in the final column. Can you try the
>>>>> attached file as input on your system?
>>>>> 
>>>> No problem, the command I used was:
>>>> 
>>>> perl ./variant_effect_predictor_2.pl -r reg.pl -i ./test.txt -w -b 100000 --sift=p --polyphen=p --failed=0 -terms=so
>>>> 
>>>> and the output (no errors but also no predictions) is attached.
>>>> 
>>>> Cheers
>>>> 
>>>> Stuart
>>>> 
>>>> _______________________________________________
>>>> Dev mailing list
>>>> Dev at ensembl.org
>>>> 
>>>> List admin (including subscribe/unsubscribe):
>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>> 
>>>> Ensembl Blog:
>>>> http://www.ensembl.info/