[ensembl-dev] Effects predictor version 2

Andrea Edwards edwardsa at cs.man.ac.uk
Tue May 17 17:13:08 BST 2011


I forgot to ask: are there any plans to add polyphen/sift data for 
any other species? We are interested in cow data.


On 17/05/11 16:41, Andrea Edwards wrote:
> I was just using that query as a lazy way to count non synonymous snps 
> as I assumed only those that were non synonymous would have a polyphen 
> or sift score associated with them. I was hoping that assumption would 
> mean I didn't have to look whether the variation fell within a CDS. It 
> was just a quick sanity check.
>
> I wasn't thinking of novel snps, just known snps. I appreciate the 
> protein position tables are needed for novel snps. Your explanation of 
> how ensembl makes predictions on novel snps using these tables is very 
> useful for future reference though.
>
> On 17/05/11 16:26, Graham Ritchie wrote:
>> Hi Andrea,
>>
>> I'm not quite sure what you are trying to achieve with your query. It 
>> looks like you are just counting transcript_variation entries that 
>> lie on chromosome 1, which will include lots of transcript_variations 
>> that do not fall in the CDS of a transcript and are not predicted to 
>> cause a single amino acid substitution.
>>
>> The polyphen_prediction and sift_prediction tables include 
>> predictions for all possible single amino acid substitutions in the 
>> ensembl proteome and so can be used by the VEP script to look up 
>> predictions for novel mutations. When we run our pipeline to populate 
>> the transcript_variation table we look up the predictions for any 
>> actual variants in our database that are predicted to cause an amino 
>> acid substitution using these tables and store the qualitative 
>> prediction in the transcript_variation table (mainly to speed up 
>> various web views). We do not store the score in the 
>> transcript_variation table and so the API also uses these tables to 
>> look up the score when you call, e.g., the sift_score method on a 
>> TranscriptVariationAllele object. If you're interested in joining the 
>> transcript_variation table with, e.g., the polyphen_prediction table 
>> you could use some SQL like:
>>
>> SELECT tv.transcript_variation_id, pred.prediction
>> FROM polyphen_prediction pred, protein_info pi, protein_position pp, 
>> transcript_variation tv
>> -- keep only single amino acid substitutions, e.g. 'A/V'
>> WHERE tv.pep_allele_string LIKE '_/_'
>> -- the substituted amino acid is the third character of the string
>> AND substr(tv.pep_allele_string,3,1) = pred.amino_acid
>> -- match each variation to its transcript's protein and position
>> AND tv.feature_stable_id = pi.transcript_stable_id
>> AND pp.position = tv.translation_start
>> AND pp.protein_info_id = pi.protein_info_id
>> AND pred.protein_position_id = pp.protein_position_id;
>>
>> and you could add in some constraint to limit this to 
>> transcript_variations on chromosome 1 if that's what you're after 
>> (though this will probably take a long time to return because of the 
>> LIKE constraint). Does that help?
>>
>> Cheers,
>>
>> Graham
>>
>>
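The join described above can be tried out on toy data. What follows is a minimal sketch, assuming Python's built-in sqlite3 module rather than the real MySQL variation database. The table and column names are taken from the query above, but the tables are cut down to only the columns the query touches, and every row (including the stable ID 'ENST_TOY_1') is made up for illustration.

```python
import sqlite3

# In-memory toy database; NOT the real Ensembl variation schema, only the
# columns that the query in the message above refers to.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein_info (protein_info_id INTEGER, transcript_stable_id TEXT);
CREATE TABLE protein_position (protein_position_id INTEGER,
                               protein_info_id INTEGER, position INTEGER);
CREATE TABLE polyphen_prediction (protein_position_id INTEGER,
                                  amino_acid TEXT, prediction TEXT);
CREATE TABLE transcript_variation (transcript_variation_id INTEGER,
                                   feature_stable_id TEXT,
                                   translation_start INTEGER,
                                   pep_allele_string TEXT);

-- One hypothetical protein with predictions for two substitutions at position 42.
INSERT INTO protein_info VALUES (1, 'ENST_TOY_1');
INSERT INTO protein_position VALUES (10, 1, 42);
INSERT INTO polyphen_prediction VALUES (10, 'V', 'benign');
INSERT INTO polyphen_prediction VALUES (10, 'P', 'probably damaging');

-- Two substitutions, one synonymous change, one non-coding entry.
INSERT INTO transcript_variation VALUES (100, 'ENST_TOY_1', 42, 'A/V');
INSERT INTO transcript_variation VALUES (101, 'ENST_TOY_1', 42, 'A/P');
INSERT INTO transcript_variation VALUES (102, 'ENST_TOY_1', 42, 'A');
INSERT INTO transcript_variation VALUES (103, 'ENST_TOY_1', NULL, NULL);
""")

# Same conditions as the query above, written with explicit JOINs.
QUERY = """
SELECT tv.transcript_variation_id, pred.prediction
FROM transcript_variation tv
JOIN protein_info pinfo
  ON pinfo.transcript_stable_id = tv.feature_stable_id
JOIN protein_position pp
  ON pp.protein_info_id = pinfo.protein_info_id
 AND pp.position = tv.translation_start
JOIN polyphen_prediction pred
  ON pred.protein_position_id = pp.protein_position_id
WHERE tv.pep_allele_string LIKE '_/_'
  AND substr(tv.pep_allele_string, 3, 1) = pred.amino_acid
ORDER BY tv.transcript_variation_id
"""

rows = conn.execute(QUERY).fetchall()
for transcript_variation_id, prediction in rows:
    print(transcript_variation_id, prediction)
```

Only the two single amino acid substitutions (ids 100 and 101) come back with a prediction. The synonymous and non-coding rows drop out at the LIKE '_/_' filter, which is why a count over the whole transcript_variation table is much larger than the number of rows with predictions.
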
>> On 17 May 2011, at 16:01, Andrea Edwards wrote:
>>
>>> Hi, I appreciate you will only have sift and polyphen for non 
>>> synonymous snps in exons. I was just wondering if my query was 
>>> correct or whether I needed to link to the 
>>> sift_prediction/polyphen_prediction/protein_position tables to get 
>>> more accurate data. I presumed the predictions from these tables were 
>>> simply copied into the polyphen and sift fields of the 
>>> transcript_variation table and that my query is ok.
>>>
>>> On 17/05/11 15:42, Graham Ritchie wrote:
>>>> Hi Andrea,
>>>>
>>>> We only have sift and polyphen predictions for variants which are 
>>>> predicted to result in single amino acid substitutions.
>>>>
>>>> Cheers,
>>>>
>>>> Graham
>>>>
>>>> On 17 May 2011, at 15:37, Andrea Edwards wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> Whilst looking into Stuart's question I looked at the variants on 
>>>>> chromosome 1 out of curiosity and found that most of them don't 
>>>>> have sift/polyphen data.
>>>>> Is this correct or have I made a mistake in my understanding of 
>>>>> the schema?
>>>>>
>>>>> variants on chr1 (seq_region_id = 27511)
>>>>> ============================
>>>>>
>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join 
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> |  9633745 |
>>>>> +----------+
>>>>> 1 row in set (3.34 sec)
>>>>>
>>>>>
>>>>> variants on chr1 without sift and polyphen
>>>>> ===========================
>>>>>
>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join 
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>> tv.sift_prediction is null and tv.polyphen_prediction is null;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> |  9562313 |
>>>>> +----------+
>>>>> 1 row in set (11.22 sec)
>>>>>
>>>>>
>>>>> variants on chr1 with sift and polyphen
>>>>> =========================
>>>>>
>>>>> mysql>   select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join 
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>> tv.sift_prediction is not null and tv.polyphen_prediction is not 
>>>>> null;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> |    67919 |
>>>>> +----------+
>>>>> 1 row in set (11.19 sec)
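The three counts above do not add up exactly: 9633745 - 9562313 - 67919 = 3513. Because the second query requires both fields to be null and the third requires both to be non-null, the remaining rows must have exactly one of the two predictions. A minimal sketch of getting all four numbers in one pass with conditional aggregation follows, assuming Python's built-in sqlite3 module and a made-up five-row toy table (only the two prediction column names come from the queries above).

```python
import sqlite3

# Toy stand-in for transcript_variation; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcript_variation ("
    "transcript_variation_id INTEGER, "
    "sift_prediction TEXT, "
    "polyphen_prediction TEXT)"
)
conn.executemany(
    "INSERT INTO transcript_variation VALUES (?, ?, ?)",
    [
        (1, None, None),                  # neither prediction
        (2, None, None),                  # neither prediction
        (3, "tolerated", "benign"),       # both predictions
        (4, "deleterious", None),         # sift only
        (5, None, "possibly damaging"),   # polyphen only
    ],
)

# Each comparison evaluates to 1 or 0, so SUM() counts the rows where it holds.
total, neither, both, only_one = conn.execute("""
SELECT COUNT(*),
       SUM(sift_prediction IS NULL AND polyphen_prediction IS NULL),
       SUM(sift_prediction IS NOT NULL AND polyphen_prediction IS NOT NULL),
       SUM((sift_prediction IS NULL) <> (polyphen_prediction IS NULL))
FROM transcript_variation
""").fetchone()

print(total, neither, both, only_one)
```

The four categories are exhaustive, so total equals neither + both + only_one, which gives a quick consistency check on counts like the ones above.
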
>>>>>
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> On 17/05/11 13:59, Stuart Meacham wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On 17/05/11 13:35, Will McLaren wrote:
>>>>>>
>>>>>>> This is strange - are you sure you are checking out the branch 
>>>>>>> and not
>>>>>>> the head of the API? You should be doing something like:
>>>>>>>
>>>>>>> cvs checkout -r branch-ensembl-62 ensembl
>>>>>>> cvs checkout -r branch-ensembl-62 ensembl-variation
>>>>>> Actually I just used the links from the site here:
>>>>>>
>>>>>> http://www.ensembl.org/info/docs/api/api_installation.html
>>>>>>
>>>>>> the link(s) resolve to things like:
>>>>>>
>>>>>> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl.tar.gz?root=ensembl&only_with_tag=branch-ensembl-62&view=tar 
>>>>>>
>>>>>>
>>>>>>>> The script silently over-writes an existing output file of the 
>>>>>>>> same name,
>>>>>>>> this seems a bit brutal, perhaps the default should be to fail 
>>>>>>>> if the file
>>>>>>>> exists.
>>>>>>> I think this is pretty standard behaviour for command-line 
>>>>>>> programs. I could change it to only run if an output file 
>>>>>>> name is specified, perhaps?
>>>>>> Yes, probably it's standard behaviour. I was just imagining 
>>>>>> accidentally overwriting a file the script had spent 24 hours 
>>>>>> creating . . .
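The "fail if the file exists" default suggested above can be sketched briefly. This is an illustration in Python, not the Perl script's actual behaviour; the open_output helper and its force parameter are hypothetical names and do not correspond to any option of variant_effect_predictor_2.pl.

```python
import os
import tempfile


def open_output(path, force=False):
    """Open path for writing; refuse to overwrite unless force is True."""
    # Mode "x" is exclusive creation: it raises FileExistsError if path exists.
    mode = "w" if force else "x"
    return open(path, mode)


tmpdir = tempfile.mkdtemp()
out_path = os.path.join(tmpdir, "vep_output.txt")

# First run: the file does not exist yet, so it is created normally.
with open_output(out_path) as fh:
    fh.write("first run\n")

# Second run without force: the existing output is protected.
try:
    open_output(out_path)
    clobbered = True
except FileExistsError:
    clobbered = False

# Overwriting is still possible, but only when asked for explicitly.
with open_output(out_path, force=True) as fh:
    fh.write("second run\n")

with open(out_path) as fh:
    final_contents = fh.read()

print(clobbered, final_contents.strip())
```

The design choice is that the destructive path needs an explicit opt-in, so a long-running result cannot be lost by simply re-running the same command.
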
>>>>>>
>>>>>>> That's also odd - any variants classified as non-synonymous coding
>>>>>>> should have a "SIFT=*" entry in the final column. Can you try the
>>>>>>> attached file as input on your system?
>>>>>>>
>>>>>> No problem, the command I used was:
>>>>>>
>>>>>> perl ./variant_effect_predictor_2.pl -r reg.pl -i ./test.txt -w 
>>>>>> -b 100000 --sift=p --polyphen=p --failed=0 -terms=so
>>>>>>
>>>>>> and the output (no errors but also no predictions) is attached.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list
>>>>>> Dev at ensembl.org
>>>>>>
>>>>>> List admin (including subscribe/unsubscribe):
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>
>>>>>> Ensembl Blog:
>>>>>> http://www.ensembl.info/
>
>




