[ensembl-dev] Effects predictor version 2

Wed May 18 11:24:55 BST 2011

Hi Andrea,

We are considering running SIFT for other species in the near future. At the moment PolyPhen has only been trained on human mutation data, so it might take longer to assess whether we can usefully run it on other species.

Cheers,

Graham

On 17 May 2011, at 17:13, Andrea Edwards wrote:

> I forgot to ask: are there any plans for adding PolyPhen/SIFT data for any other species? We are interested in cow data.
> 
> 
> On 17/05/11 16:41, Andrea Edwards wrote:
>> I was just using that query as a lazy way to count non synonymous snps as I assumed only those that were non synonymous would have a polyphen or sift score associated with them. I was hoping that assumption would mean I didn't have to look whether the variation fell within a CDS. It was just a quick sanity check.
>> 
>> I wasn't thinking of novel SNPs, just known SNPs. I appreciate the protein position tables are needed for novel SNPs. Your explanation of how Ensembl makes predictions on novel SNPs using these tables is very useful for future reference, though.
>> 
>> On 17/05/11 16:26, Graham Ritchie wrote:
>>> Hi Andrea,
>>> 
>>> I'm not quite sure what you are trying to achieve with your query. It looks like you are just counting transcript_variation entries that lie on chromosome 1; this will include lots of transcript_variations that do not fall in the CDS of a transcript and are not predicted to cause a single amino acid substitution.
>>> 
>>> The polyphen_prediction and sift_prediction tables include predictions for all possible single amino acid substitutions in the ensembl proteome and so can be used by the VEP script to look up predictions for novel mutations. When we run our pipeline to populate the transcript_variation table we look up the predictions for any actual variants in our database that are predicted to cause an amino acid substitution using these tables and store the qualitative prediction in the transcript_variation table (mainly to speed up various web views). We do not store the score in the transcript_variation table and so the API also uses these tables to look up the score when you call, e.g., the sift_score method on a TranscriptVariationAllele object. If you're interested in joining the transcript_variation table with, e.g., the polyphen_prediction table you could use some SQL like:
>>> 
>>> SELECT tv.transcript_variation_id, pred.prediction
>>> FROM polyphen_prediction pred, protein_info pi, protein_position pp, transcript_variation tv
>>> WHERE substr(tv.pep_allele_string,3,1) = pred.amino_acid
>>> AND tv.pep_allele_string LIKE '_/_'
>>> AND tv.feature_stable_id = pi.transcript_stable_id
>>> AND pp.position = tv.translation_start
>>> AND pp.protein_info_id = pi.protein_info_id
>>> AND pred.protein_position_id = pp.protein_position_id
>>> 
>>> and you could add in some constraint to limit this to transcript_variations on chromosome 1 if that's what you're after (though this will probably take a long time to return because of the LIKE constraint). Does that help?
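
The join above can be tried out on toy data. The following is a minimal sketch, assuming an in-memory SQLite database that contains only the columns named in the query; every row, stable ID and prediction value is invented for illustration, and the real Ensembl variation tables have more columns than shown here. It uses explicit JOIN syntax and takes the single substituted residue with substr(..., 3, 1), which should be equivalent once the '_/_' filter has been applied.

```python
import sqlite3

# Toy schema: only the columns named in the query above. All rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein_info (protein_info_id INTEGER, transcript_stable_id TEXT);
CREATE TABLE protein_position (protein_position_id INTEGER,
                               protein_info_id INTEGER, position INTEGER);
CREATE TABLE polyphen_prediction (protein_position_id INTEGER,
                                  amino_acid TEXT, prediction TEXT);
CREATE TABLE transcript_variation (transcript_variation_id INTEGER,
                                   feature_stable_id TEXT,
                                   translation_start INTEGER,
                                   pep_allele_string TEXT);

INSERT INTO protein_info VALUES (1, 'ENST_TOY_1');
INSERT INTO protein_position VALUES (10, 1, 42);
INSERT INTO polyphen_prediction VALUES (10, 'V', 'benign');
INSERT INTO polyphen_prediction VALUES (10, 'P', 'probably damaging');
INSERT INTO transcript_variation VALUES (100, 'ENST_TOY_1', 42, 'A/V');
INSERT INTO transcript_variation VALUES (101, 'ENST_TOY_1', 42, 'A/P');
INSERT INTO transcript_variation VALUES (102, 'ENST_TOY_1', 42, 'A');
INSERT INTO transcript_variation VALUES (103, 'ENST_TOY_1', NULL, NULL);
""")

# Same join as in the message, written with explicit JOIN clauses.
QUERY = """
SELECT tv.transcript_variation_id, pred.prediction
FROM transcript_variation tv
JOIN protein_info pi
  ON pi.transcript_stable_id = tv.feature_stable_id
JOIN protein_position pp
  ON pp.protein_info_id = pi.protein_info_id
 AND pp.position = tv.translation_start
JOIN polyphen_prediction pred
  ON pred.protein_position_id = pp.protein_position_id
 AND pred.amino_acid = substr(tv.pep_allele_string, 3, 1)
WHERE tv.pep_allele_string LIKE '_/_'
ORDER BY tv.transcript_variation_id
"""

rows = conn.execute(QUERY).fetchall()
for transcript_variation_id, prediction in rows:
    print(transcript_variation_id, prediction)
```

Only the two single amino acid substitutions (rows 100 and 101) come back; the synonymous row and the row outside the CDS are dropped by the '_/_' filter.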
>>> 
>>> Cheers,
>>> 
>>> Graham
>>> 
>>> 
>>> On 17 May 2011, at 16:01, Andrea Edwards wrote:
>>> 
>>>> Hi, I appreciate you will only have SIFT and PolyPhen for non-synonymous SNPs in exons. I was just wondering if my query was correct or whether I needed to link to the sift_prediction/polyphen_prediction/protein_position tables to get more accurate data. I presumed the predictions from these tables were simply copied into the polyphen and sift fields of the transcript_variation table and that my query is OK.
>>>> 
>>>> On 17/05/11 15:42, Graham Ritchie wrote:
>>>>> Hi Andrea,
>>>>> 
>>>>> We only have sift and polyphen predictions for variants which are predicted to result in single amino acid substitutions.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Graham
>>>>> 
>>>>> On 17 May 2011, at 15:37, Andrea Edwards wrote:
>>>>> 
>>>>>> Hello
>>>>>> 
>>>>>> Whilst looking into Stuart's question I looked at the variants on chromosome 1 out of curiosity and found that most of them don't have sift/polyphen data.
>>>>>> Is this correct, or have I made a mistake in my understanding of the schema?
>>>>>> 
>>>>>> variants on chr1 (seq_region_id = 27511)
>>>>>> ============================
>>>>>> 
>>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511;
>>>>>> +----------+
>>>>>> | count(*) |
>>>>>> +----------+
>>>>>> |  9633745 |
>>>>>> +----------+
>>>>>> 1 row in set (3.34 sec)
>>>>>> 
>>>>>> 
>>>>>> variants on chr1 without sift and polyphen
>>>>>> ===========================
>>>>>> 
>>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>>> tv.sift_prediction is null and tv.polyphen_prediction is null;
>>>>>> +----------+
>>>>>> | count(*) |
>>>>>> +----------+
>>>>>> |  9562313 |
>>>>>> +----------+
>>>>>> 1 row in set (11.22 sec)
>>>>>> 
>>>>>> 
>>>>>> variants on chr1 with sift and polyphen
>>>>>> =========================
>>>>>> 
>>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>>> tv.feature_stable_id inner join homo_sapiens_core_62_37g.transcript t on
>>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>>> tv.sift_prediction is not null and tv.polyphen_prediction is not null;
>>>>>> +----------+
>>>>>> | count(*) |
>>>>>> +----------+
>>>>>> |    67919 |
>>>>>> +----------+
>>>>>> 1 row in set (11.19 sec)
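
The two filtered counts above do not partition the total: 9562313 + 67919 = 9630232, which leaves 3513 of the 9633745 rows with exactly one of the two predictions set. A single pass with conditional aggregation reports all four cases at once. The sketch below assumes an in-memory SQLite table with invented rows and prediction values; only the column names sift_prediction and polyphen_prediction come from the queries above, and the join to the core database is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcript_variation ("
    "transcript_variation_id INTEGER, "
    "sift_prediction TEXT, polyphen_prediction TEXT)"
)
# Invented rows covering every combination of NULL / non-NULL predictions.
conn.executemany(
    "INSERT INTO transcript_variation VALUES (?, ?, ?)",
    [
        (1, None, None),
        (2, None, None),
        (3, "tolerated", "benign"),
        (4, "deleterious", None),
        (5, None, "benign"),
    ],
)

# One scan, four buckets, instead of one query per bucket.
row = conn.execute("""
SELECT COUNT(*),
       SUM(CASE WHEN sift_prediction IS NULL
                 AND polyphen_prediction IS NULL THEN 1 ELSE 0 END),
       SUM(CASE WHEN sift_prediction IS NOT NULL
                 AND polyphen_prediction IS NOT NULL THEN 1 ELSE 0 END),
       SUM(CASE WHEN sift_prediction IS NOT NULL
                 AND polyphen_prediction IS NULL THEN 1 ELSE 0 END),
       SUM(CASE WHEN sift_prediction IS NULL
                 AND polyphen_prediction IS NOT NULL THEN 1 ELSE 0 END)
FROM transcript_variation
""").fetchone()
total, neither, both, sift_only, polyphen_only = row
print(total, neither, both, sift_only, polyphen_only)

# The same bookkeeping applied to the counts reported in the thread.
reported_total, reported_neither, reported_both = 9633745, 9562313, 67919
exactly_one = reported_total - reported_neither - reported_both
print(exactly_one)
```

The CASE expressions are standard SQL, so the same SELECT should carry over to the MySQL join used above, although that has not been tested here.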
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> thanks
>>>>>> 
>>>>>> 
>>>>>> On 17/05/11 13:59, Stuart Meacham wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> Thanks for the reply.
>>>>>>> 
>>>>>>> On 17/05/11 13:35, Will McLaren wrote:
>>>>>>> 
>>>>>>>> This is strange - are you sure you are checking out the branch and not
>>>>>>>> the head of the API? You should be doing something like:
>>>>>>>> 
>>>>>>>> cvs checkout -r branch-ensembl-62 ensembl
>>>>>>>> cvs checkout -r branch-ensembl-62 ensembl-variation
>>>>>>> Actually I just used the links from the site here:
>>>>>>> 
>>>>>>> http://www.ensembl.org/info/docs/api/api_installation.html
>>>>>>> 
>>>>>>> the link(s) resolve to things like:
>>>>>>> 
>>>>>>> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl.tar.gz?root=ensembl&only_with_tag=branch-ensembl-62&view=tar 
>>>>>>> 
>>>>>>>>> The script silently over-writes an existing output file of the same name,
>>>>>>>>> this seems a bit brutal, perhaps the default should be to fail if the file
>>>>>>>>> exists.
>>>>>>>> I think this is pretty standard behaviour for command-line programs. I
>>>>>>>> could change it to only run if an output file name is specified
>>>>>>>> perhaps?
>>>>>>> Yes, probably it's standard behaviour. I was just imagining accidentally overwriting a file the script had spent 24 hours creating . . .
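
One way to guard against the accidental overwrite described above is to open the output file in exclusive-creation mode and only overwrite when explicitly asked. This is a sketch of the general idea, not of how the variant effect predictor script behaves; the open_output helper, its force parameter and the file name are hypothetical names introduced for illustration.

```python
import os
import tempfile


def open_output(path, force=False):
    """Open path for writing, refusing to clobber an existing file unless force is set."""
    # Mode "x" is exclusive creation: open() raises FileExistsError if path exists.
    mode = "w" if force else "x"
    return open(path, mode)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "vep_output.txt")  # hypothetical output file name

    # First run: the file does not exist yet, so it is created.
    with open_output(out) as fh:
        fh.write("first run\n")

    # Second run without force: the existing file is left untouched.
    try:
        open_output(out)
        refused = False
    except FileExistsError:
        refused = True

    # Second run with force: the file is overwritten on purpose.
    with open_output(out, force=True) as fh:
        fh.write("second run\n")

    with open(out) as fh:
        final = fh.read()

print(refused, repr(final))
```

Failing by default costs one extra flag on reruns, but it protects a result that took many hours to produce.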
>>>>>>> 
>>>>>>>> That's also odd - any variants classified as non-synonymous coding
>>>>>>>> should have a "SIFT=*" entry in the final column. Can you try the
>>>>>>>> attached file as input on your system?
>>>>>>>> 
>>>>>>> No problem, the command I used was:
>>>>>>> 
>>>>>>> perl ./variant_effect_predictor_2.pl -r reg.pl -i ./test.txt -w -b 100000 --sift=p --polyphen=p --failed=0 -terms=so
>>>>>>> 
>>>>>>> and the output (no errors but also no predictions) is attached.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> Stuart
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Dev mailing list
>>>>>>> Dev at ensembl.org
>>>>>>> 
>>>>>>> List admin (including subscribe/unsubscribe):
>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>> 
>>>>>>> Ensembl Blog:
>>>>>>> http://www.ensembl.info/