[ensembl-dev] Effects predictor version 2
Andrea Edwards
edwardsa at cs.man.ac.uk
Tue May 17 17:13:08 BST 2011
I forgot to ask: are there any plans to add polyphen/sift data for
any other species? We are interested in cow data.
On 17/05/11 16:41, Andrea Edwards wrote:
> I was just using that query as a lazy way to count non synonymous snps
> as I assumed only those that were non synonymous would have a polyphen
> or sift score associated with them. I was hoping that assumption would
> mean I didn't have to look whether the variation fell within a CDS. It
> was just a quick sanity check.
>
> I wasn't thinking of novel snps, just known snps. I appreciate the
> protein position tables are needed for novel snps. Your explanation of
> how ensembl makes predictions on novel snps using these tables is very
> useful for future reference, though.
>
> On 17/05/11 16:26, Graham Ritchie wrote:
>> Hi Andrea,
>>
>> I'm not quite sure what you are trying to achieve with your query. It
>> looks like you are just counting transcript_variation entries that
>> lie on chromosome 1; this will include lots of transcript_variations
>> that do not fall in the CDS of a transcript and are not predicted to
>> cause a single amino acid substitution.
>>
>> The polyphen_prediction and sift_prediction tables include
>> predictions for all possible single amino acid substitutions in the
>> ensembl proteome and so can be used by the VEP script to look up
>> predictions for novel mutations. When we run our pipeline to populate
>> the transcript_variation table we look up the predictions for any
>> actual variants in our database that are predicted to cause an amino
>> acid substitution using these tables and store the qualitative
>> prediction in the transcript_variation table (mainly to speed up
>> various web views). We do not store the score in the
>> transcript_variation table and so the API also uses these tables to
>> look up the score when you call, e.g., the sift_score method on a
>> TranscriptVariationAllele object. If you're interested in joining the
>> transcript_variation table with, e.g., the polyphen_prediction table
>> you could use some SQL like:
>>
>> SELECT tv.transcript_variation_id, pred.prediction
>> FROM polyphen_prediction pred, protein_info pi, protein_position pp,
>> transcript_variation tv
>> WHERE substr(pep_allele_string,3,3) = pred.amino_acid
>> AND pep_allele_string LIKE '_/_'
>> AND tv.feature_stable_id = pi.transcript_stable_id
>> AND pp.position = tv.translation_start
>> AND pp.protein_info_id = pi.protein_info_id
>> AND pred.protein_position_id = pp.protein_position_id
>>
>> and you could add in some constraint to limit this to
>> transcript_variations on chromosome 1 if that's what you're after
>> (though this will probably take a long time to return because of the
>> LIKE constraint). Does that help?
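
A minimal sketch of the join described above, run against a toy in-memory SQLite database rather than the real MySQL variation schema. The table and column names are taken from the query above; the rows, the stable ID ENST_TOY_1 and the alias pinfo are made up for illustration, and explicit JOINs stand in for the comma joins. It only shows that the '_/_' filter keeps single amino acid substitutions and that the third character of pep_allele_string picks the matching prediction row.

```python
import sqlite3

# Toy schema: only the columns used in the join, with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein_info (protein_info_id INTEGER, transcript_stable_id TEXT);
CREATE TABLE protein_position (protein_position_id INTEGER,
                               protein_info_id INTEGER, position INTEGER);
CREATE TABLE polyphen_prediction (protein_position_id INTEGER,
                                  amino_acid TEXT, prediction TEXT);
CREATE TABLE transcript_variation (transcript_variation_id INTEGER,
                                   feature_stable_id TEXT,
                                   translation_start INTEGER,
                                   pep_allele_string TEXT);

INSERT INTO protein_info VALUES (1, 'ENST_TOY_1');
INSERT INTO protein_position VALUES (10, 1, 42);
-- One prediction per possible substituted amino acid at position 42.
INSERT INTO polyphen_prediction VALUES (10, 'T', 'benign');
INSERT INTO polyphen_prediction VALUES (10, 'V', 'probably damaging');
-- 100: substitution A->T, 101: synonymous, 102: outside the CDS.
INSERT INTO transcript_variation VALUES (100, 'ENST_TOY_1', 42, 'A/T');
INSERT INTO transcript_variation VALUES (101, 'ENST_TOY_1', 42, 'A');
INSERT INTO transcript_variation VALUES (102, 'ENST_TOY_1', NULL, NULL);
""")

QUERY = """
SELECT tv.transcript_variation_id, pred.prediction
FROM transcript_variation tv
JOIN protein_info pinfo
  ON tv.feature_stable_id = pinfo.transcript_stable_id
JOIN protein_position pp
  ON pp.protein_info_id = pinfo.protein_info_id
 AND pp.position = tv.translation_start
JOIN polyphen_prediction pred
  ON pred.protein_position_id = pp.protein_position_id
 AND pred.amino_acid = substr(tv.pep_allele_string, 3, 1)
WHERE tv.pep_allele_string LIKE '_/_'
ORDER BY tv.transcript_variation_id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only the substitution row survives the join; the synonymous and non-coding rows are dropped by the LIKE filter, which is why counting non-null predictions works as a rough count of non synonymous snps.
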
>>
>> Cheers,
>>
>> Graham
>>
>>
>> On 17 May 2011, at 16:01, Andrea Edwards wrote:
>>
>>> Hi, I appreciate you will only have sift and polyphen for non
>>> synonymous snps in exons. I was just wondering if my query was
>>> correct or whether I needed to link to the
>>> sift_prediction/polyphen_prediction/protein_position tables to get
>>> more accurate data. I presumed the predictions from these tables were
>>> simply copied into the polyphen and sift fields of the
>>> transcript_variation table and that my query is ok.
>>>
>>> On 17/05/11 15:42, Graham Ritchie wrote:
>>>> Hi Andrea,
>>>>
>>>> We only have sift and polyphen predictions for variants which are
>>>> predicted to result in single amino acid substitutions.
>>>>
>>>> Cheers,
>>>>
>>>> Graham
>>>>
>>>> On 17 May 2011, at 15:37, Andrea Edwards wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> Whilst looking into Stuart's question I looked at the variants on
>>>>> chromosome 1 out of curiosity and found that most of them don't
>>>>> have sift/polyphen data.
>>>>> Is this correct, or have I made a mistake in my understanding of
>>>>> the schema?
>>>>>
>>>>> variants on chr1 (seq_region_id = 27511)
>>>>> ============================
>>>>>
>>>>> mysql> select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> | 9633745 |
>>>>> +----------+
>>>>> 1 row in set (3.34 sec)
>>>>>
>>>>>
>>>>> variants on chr1 without sift and polyphen
>>>>> ===========================
>>>>>
>>>>> mysql> select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>> tv.sift_prediction is null and tv.polyphen_prediction is null;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> | 9562313 |
>>>>> +----------+
>>>>> 1 row in set (11.22 sec)
>>>>>
>>>>>
>>>>> variants on chr1 with sift and polyphen
>>>>> =========================
>>>>>
>>>>> mysql> select count(*) from transcript_variation tv inner join
>>>>> homo_sapiens_core_62_37g.transcript_stable_id st on st.stable_id =
>>>>> tv.feature_stable_id inner join
>>>>> homo_sapiens_core_62_37g.transcript t on
>>>>> t.transcript_id = st.transcript_id where t.seq_region_id = 27511 and
>>>>> tv.sift_prediction is not null and tv.polyphen_prediction is not
>>>>> null;
>>>>> +----------+
>>>>> | count(*) |
>>>>> +----------+
>>>>> | 67919 |
>>>>> +----------+
>>>>> 1 row in set (11.19 sec)
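
A quick arithmetic check on the three counts above. The second query requires both predictions to be null and the third requires both to be non-null, so neither counts rows that have exactly one of the two; the sketch below just works out how many such rows that leaves. The variable names are illustrative, and the numbers are copied from the mysql output.

```python
# Counts reported by the three queries on chr1 (seq_region_id = 27511).
total = 9633745    # all transcript_variation rows joined to chr1 transcripts
neither = 9562313  # sift_prediction IS NULL AND polyphen_prediction IS NULL
both = 67919       # sift_prediction IS NOT NULL AND polyphen_prediction IS NOT NULL

# Rows matched by neither filter must have exactly one prediction set.
exactly_one = total - neither - both

# Rows with at least one prediction, as a share of all rows.
at_least_one = both + exactly_one
percent_with_prediction = round(100 * at_least_one / total, 2)

print(exactly_one, at_least_one, percent_with_prediction)
```

So 3513 rows carry only a sift or only a polyphen prediction, and under 1% of the chr1 rows have any prediction at all.
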
>>>>>
>>>>>
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> On 17/05/11 13:59, Stuart Meacham wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On 17/05/11 13:35, Will McLaren wrote:
>>>>>>
>>>>>>> This is strange - are you sure you are checking out the branch and
>>>>>>> not the head of the API? You should be doing something like:
>>>>>>>
>>>>>>> cvs checkout -r branch-ensembl-62 ensembl
>>>>>>> cvs checkout -r branch-ensembl-62 ensembl-variation
>>>>>> Actually I just used the links from the site here:
>>>>>>
>>>>>> http://www.ensembl.org/info/docs/api/api_installation.html
>>>>>>
>>>>>> the link(s) resolve to things like:
>>>>>>
>>>>>> http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl.tar.gz?root=ensembl&only_with_tag=branch-ensembl-62&view=tar
>>>>>>
>>>>>>
>>>>>>>> The script silently over-writes an existing output file of the same
>>>>>>>> name, this seems a bit brutal, perhaps the default should be to fail
>>>>>>>> if the file exists.
>>>>>>> I think this is pretty standard behaviour for command-line programs.
>>>>>>> I could change it to only run if an output file name is specified,
>>>>>>> perhaps?
>>>>>> Yes, probably it's standard behaviour. I was just imagining
>>>>>> accidentally overwriting a file the script had spent 24 hours
>>>>>> creating . . .
>>>>>>
>>>>>>> That's also odd - any variants classified as non-synonymous coding
>>>>>>> should have a "SIFT=*" entry in the final column. Can you try the
>>>>>>> attached file as input on your system?
>>>>>>>
>>>>>> No problem, the command I used was:
>>>>>>
>>>>>> perl ./variant_effect_predictor_2.pl -r reg.pl -i ./test.txt -w
>>>>>> -b 100000 --sift=p --polyphen=p --failed=0 -terms=so
>>>>>>
>>>>>> and the output (no errors but also no predictions) is attached.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>> _______________________________________________
>>>>>> Dev mailing list
>>>>>> Dev at ensembl.org
>>>>>>
>>>>>> List admin (including subscribe/unsubscribe):
>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>
>>>>>> Ensembl Blog:
>>>>>> http://www.ensembl.info/