[ensembl-dev] Prediction of consequence type for novel variants

Sung Gong sung at bio.cc
Sat Oct 22 14:53:41 BST 2011


Hi,

I thought it better to continue this thread rather than start another.

I am wondering how to express complex variant types in terms of the
Ensembl API (esp. VariationFeature).
For example:
ATGC
A-TT

VCF format says:
POS  REF  ALT
1    AGT  AT

Cheers,
Sung
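
One way to think about the VCF record above in VariationFeature terms, as a
rough sketch rather than the API's own method: drop the shared leading base
that VCF uses to anchor indels, then apply the start/end conventions Will
describes in the quoted messages below (start = end + 1 for insertions, "-"
for the empty allele). The function name vcf_to_ensembl is hypothetical and
not part of the Ensembl API; Python is used only to keep the arithmetic
self-contained.

```python
def vcf_to_ensembl(pos, ref, alt):
    """Return (start, end, allele_string) for one VCF REF/ALT pair.

    Hypothetical helper, not an Ensembl API call. Assumes a single ALT
    allele and 1-based VCF POS pointing at the first base of REF.
    """
    start = pos
    # VCF anchors indels on the preceding base; if REF and ALT differ in
    # length and share that first base, strip it and shift start by one.
    if len(ref) != len(alt) and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # With an empty REF (pure insertion) this gives end = start - 1.
    end = start + len(ref) - 1
    return start, end, f"{ref or '-'}/{alt or '-'}"


# The complex example from this message
print(vcf_to_ensembl(1, "AGT", "AT"))
# Will's deletion of bases 2-4 from AACTG, written VCF-style
print(vcf_to_ensembl(1, "AACT", "A"))
# The 2bp insertion from the API documentation, written VCF-style
print(vcf_to_ensembl(1521, "C", "CAA"))
```

Under these assumptions the complex AGT/AT record comes out as start 2,
end 3 and allele_string GT/T. The three values would then be passed as
-start, -end and -allele_string when constructing the VariationFeature.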

On 14 December 2010 15:15, Will McLaren <wm2 at ebi.ac.uk> wrote:
> The coordinates for a deletion reflect the bases of the reference deleted:
>
> 1 2 3 4 5
> A A C T G
>
> A deletion of bases 2, 3 and 4 would have start = 2, end = 4 and an
> allele_string of ACT/- (this is the same even for the negative strand).
>
> Generally in Ensembl if a feature spans some region of DNA, start is always
> less than or equal to end (it is equal to end for features of length 1, such
> as SNPs).
>
> Start is only greater than end for insertions, since they occur _between_
> bases of the reference sequence.
> Cheers
>
> Will
>
> On 14 December 2010 15:10, Sung Gong <sung at bio.cc> wrote:
>> Is start 1 smaller than end for a deletion?
>>
>>
>> On 14 December 2010 15:03, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>> Hi Sung,
>>>
>>> The coordinates would be the same regardless of the strand.
>>>
>>> Start is _always_ 1 greater than end for an insertion, regardless of
>>> strand or the size of the insertion.
>>>
>>> Will
>>>
>>> On 14 December 2010 14:58, Sung Gong <sung at bio.cc> wrote:
>>>> Hi Will,
>>>>
>>>> One more question about start/end positions in case of indels.
>>>>
>>>> In the API document
>>>>
>>>> (http://www.ensembl.org/info/docs/Pdoc/ensembl-variation/modules/Bio/EnsEMBL/Variation/VariationFeature.html),
>>>> it says:
>>>>    # Variation feature representing a 2bp insertion
>>>>    $vf = Bio::EnsEMBL::Variation::VariationFeature->new
>>>>       (-start   => 1522,
>>>>        -end     => 1521, # end = start-1 for insert
>>>>        -strand  => -1,
>>>>        -slice   => $slice,
>>>>        -allele_string => '-/AA',
>>>>        -variation_name => 'rs12111',
>>>>        -map_weight  => 1,
>>>>        -variation => $v2);
>>>>
>>>> Is the example above only for the -1 strand?
>>>> How can I generalise it to set -start and -end?
>>>>
>>>> Cheers,
>>>> Sung
>>>>
>>>> On 10 December 2010 11:41, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>> Hi Sung
>>>>>
>>>>> The codons() method will work; it returns the codon something like:
>>>>>
>>>>> aGa/aCa
>>>>>
>>>>> where the base changed is in capital letters.
>>>>>
>>>>> Will
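
A minimal sketch of how the codon string Will describes could be turned back
into the position (0, 1 or 2) and the two bases. It assumes exactly the
format shown above (one upper-case base on each side of the slash); the
function name parse_codon_change is hypothetical, and strings for indels or
multi-base changes are not handled.

```python
def parse_codon_change(codons):
    """Split a string such as 'aGa/aCa' into (position, ref_base, alt_base).

    position is 0, 1 or 2 within the codon. Assumes a single-base change
    marked in upper case on both sides of the slash.
    """
    ref, alt = codons.split("/")
    changed = [i for i, base in enumerate(ref) if base.isupper()]
    if len(ref) != 3 or len(alt) != 3 or len(changed) != 1:
        raise ValueError(f"not a single-base codon change: {codons!r}")
    pos = changed[0]
    return pos, ref[pos], alt[pos]


print(parse_codon_change("aGa/aCa"))
```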
>>>>>
>>>>> On 10 December 2010 11:26, Sung Gong <sung at bio.cc> wrote:
>>>>>> Hi Will,
>>>>>>
>>>>>> Thanks for the paper. I appreciate your work.
>>>>>>
>>>>>> Before I was aware of your script, I used to get the corresponding
>>>>>> codon and the position (0, 1 or 2) where a single DNA variant occurs
>>>>>> using the core API.
>>>>>> Is there a work-around for this?
>>>>>>
>>>>>> I found a 'codons' method in 'TranscriptVariation', but is it a
>>>>>> method of ConsequenceType?
>>>>>>
>>>>>> Thought better to ask you before going further.
>>>>>>
>>>>>> Cheers,
>>>>>> Sung
>>>>>>
>>>>>> On 9 December 2010 14:02, Will McLaren <wm2 at ebi.ac.uk> wrote:
>>>>>>> Hi Sung,
>>>>>>>
>>>>>>> There is a publication referring to the system, but it does not go
>>>>>>> into great detail on the internal workings:
>>>>>>>
>>>>>>> http://bioinformatics.oxfordjournals.org/content/26/16/2069.abstract
>>>>>>>
>>>>>>> Here's an approximate flow of what happens in the API. The vast
>>>>>>> majority of the code used is in the Core module
>>>>>>> Bio::EnsEMBL::Utils::TranscriptAlleles.pm, mainly the methods
>>>>>>> type_variation() and apply_aa_change():
>>>>>>>
>>>>>>> - find overlapping transcripts (using $vf->feature_Slice and
>>>>>>> $slice->get_all_Transcripts), then for each transcript:
>>>>>>>
>>>>>>> - get transcript mapper and map variation's coordinates to cDNA, CDS
>>>>>>> and peptide
>>>>>>>
>>>>>>> - any variants that don't fall in the coding sequence are classified
>>>>>>> here (e.g. INTRONIC, UPSTREAM) and the flow ends
>>>>>>>
>>>>>>> - if variation falls in exon (i.e. has defined CDS coordinates),
>>>>>>> generate alternative codon(s) and resulting translation
>>>>>>>
>>>>>>> - compare translation to reference; classify as e.g.
>>>>>>> SYNONYMOUS_CODING, NON_SYNONYMOUS_CODING
>>>>>>>
>>>>>>> We are currently working on an overhaul to this system which should
>>>>>>> make it easier to comprehend by following the code.
>>>>>>>
>>>>>>> I would recommend trying to follow through the code in Perl's
>>>>>>> debugger, using the "perl -d" option.
>>>>>>>
>>>>>>> Hope this helps
>>>>>>>
>>>>>>> Will McLaren
>>>>>>> Ensembl Variation
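
The flow Will outlines above can be illustrated with a toy model. This is a
sketch under strong assumptions, not the code in TranscriptAlleles: it
handles SNPs only, on the forward strand, treats every exon base as coding
(no UTRs), and uses a plain string as the "genome". The function name
classify_snp, the sequence and the exon coordinates are all made up for
illustration, and DOWNSTREAM is added here by analogy with UPSTREAM.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(genome, exons, pos, alt):
    """Return (consequence, codons) for a SNP at 1-based position pos.

    exons is a sorted list of (start, end) pairs, 1-based inclusive.
    codons is a string such as 'aGa/aCa', or None outside the CDS.
    """
    if pos < exons[0][0]:
        return "UPSTREAM", None
    if pos > exons[-1][1]:
        return "DOWNSTREAM", None
    # Map the genomic position to a 0-based CDS index.
    offset = 0
    for start, end in exons:
        if start <= pos <= end:
            cds_index = offset + (pos - start)
            break
        offset += end - start + 1
    else:
        return "INTRONIC", None  # inside the transcript, in no exon
    # Build reference and alternative codons.
    cds = "".join(genome[s - 1:e] for s, e in exons)
    i = cds_index % 3
    ref_codon = cds[cds_index - i:cds_index - i + 3]
    alt_codon = ref_codon[:i] + alt + ref_codon[i + 1:]

    def mark(codon):
        # Lower-case the codon except for the changed base.
        return codon[:i].lower() + codon[i].upper() + codon[i + 1:].lower()

    codons = f"{mark(ref_codon)}/{mark(alt_codon)}"
    # Compare the translations of the two codons.
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return ("SYNONYMOUS_CODING" if same else "NON_SYNONYMOUS_CODING"), codons


# Toy locus: 2 flanking bases, exon ATGAGA, intron GTAAGT, exon CCTTAA, 2 flanking bases.
GENOME = "GG" + "ATGAGA" + "GTAAGT" + "CCTTAA" + "CC"
EXONS = [(3, 8), (15, 20)]

print(classify_snp(GENOME, EXONS, 7, "C"))
print(classify_snp(GENOME, EXONS, 17, "C"))
print(classify_snp(GENOME, EXONS, 10, "C"))
```

The real API additionally deals with strand, UTRs, indels and further
consequence types, none of which this sketch attempts.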
>>>>>>>
>>>>>>> On 9 December 2010 13:19, Sung Gong <sung at bio.cc> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I was thrilled to find that the Ensembl API provides a nice script
>>>>>>>> (ftp://ftp.ensembl.org/pub/misc-scripts/) which can predict the
>>>>>>>> consequence types of novel variations.
>>>>>>>> It is also good to see a clear demonstration of how to use the API
>>>>>>>> for that purpose:
>>>>>>>>
>>>>>>>> http://www.ensembl.org/info/docs/api/variation/variation_tutorial.html
>>>>>>>>
>>>>>>>> Before realising that the variation API can help predict the
>>>>>>>> consequence type of novel variants, I used only the core API to map
>>>>>>>> the position of my variants to see whether they are within a coding
>>>>>>>> region, intron, exon and so on.
>>>>>>>> Now I wonder how the variation API works for that purpose. I looked
>>>>>>>> at the source code, but found it somewhat overwhelming.
>>>>>>>>
>>>>>>>> Can anybody explain how the prediction for novel variants works
>>>>>>>> under the hood?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sung
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Dev mailing list
>>>>>>>> Dev at ensembl.org
>>>>>>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>



