[ensembl-dev] questions about variation schema

Andrea Edwards edwardsa at cs.man.ac.uk
Tue Jan 11 14:25:50 GMT 2011


Hello

Thanks, Will, for answering those questions.
So in most cases you can actually get the source of the allele and 
genotype frequency data from the ssID field rather than from the Ensembl 
source table where I was looking. I didn't make the connection that the 
subsnp id referred to dbSNP, as you don't store the ss part of the id! I 
thought it must have been an internal field. It's obvious now you've said 
it :)
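
To make that concrete, here is a rough sketch in Python of how the
numeric subsnp id could be turned back into an ssID and matched to its
dbSNP submitter handle. It runs against a toy in-memory SQLite database,
not the real Ensembl MySQL server. The column names are assumptions
based on the tables discussed in this thread, and the rows, the ssID
numbers and the EXAMPLE_LAB handle are invented for illustration.

```python
import sqlite3

# Toy stand-ins for the allele and subsnp_handle tables. The column
# names and every row below are assumptions made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE allele (
        allele_id INTEGER PRIMARY KEY,
        variation_id INTEGER,
        subsnp_id INTEGER,   -- numeric part of the dbSNP ssID only
        allele TEXT,
        frequency REAL,
        sample_id INTEGER
    );
    CREATE TABLE subsnp_handle (
        subsnp_id INTEGER PRIMARY KEY,
        handle TEXT          -- dbSNP submitter handle
    );
    INSERT INTO allele VALUES
        (1, 1, 12345, 'T', 1.0, 14),
        (2, 1, 12345, 'C', 0.0, 14),
        (3, 1, 67890, 'T', 0.5, 15),
        (4, 1, 67890, 'C', 0.5, 15);
    INSERT INTO subsnp_handle VALUES
        (12345, 'ENSEMBL'),
        (67890, 'EXAMPLE_LAB');
""")


def format_ssid(subsnp_id):
    """Put back the 'ss' prefix that is not stored in the database."""
    if subsnp_id is None:
        return None
    return "ss%d" % subsnp_id


def allele_submitters(conn, variation_id):
    """Return (ssID, handle, allele, frequency, sample_id) per allele row."""
    rows = conn.execute(
        """
        SELECT a.subsnp_id, h.handle, a.allele, a.frequency, a.sample_id
        FROM allele a
        LEFT JOIN subsnp_handle h ON h.subsnp_id = a.subsnp_id
        WHERE a.variation_id = ?
        ORDER BY a.allele_id
        """,
        (variation_id,),
    ).fetchall()
    return [(format_ssid(ss), handle, allele, freq, sample)
            for ss, handle, allele, freq, sample in rows]


for row in allele_submitters(conn, 1):
    print(row)
```

The LEFT JOIN keeps allele rows whose subsnp id has no handle entry, so
frequency data of unknown origin still shows up (with an empty handle)
rather than silently disappearing.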

To the people who asked: I have some extra information to add to my 
notes, which I will do later this week, and then I will send them.
Naturally I supply them 'without warranty' and take no responsibility if 
I've got everything totally wrong :)

On 11/01/2011 09:50, Tjaart de Beer wrote:
> Hi Andrea
>
> I saw your post below on the Ensembl mailing list and I was wondering if
> you would mind sending me a copy of your doc on the variation schema. I
> have been using the variation stuff on and off for a while now but, just
> like you, I am unclear on a few things. Hopefully your document can help
> me.
>
> Thanks!
> Tjaart de Beer
>
>> Hello
>>
>> On 10 January 2011 18:26, Andrea Edwards<edwardsa at cs.man.ac.uk>  wrote:
>>
>>>   Hello and Happy New Year
>>>
>>> I have some quick questions about the variation schema.
>>>
>>> 1. Allele table
>>>
>>> When considering population frequency data for an allele, how do you
>>> know which source it is from?
>>> For example, imagine a SNP with alleles T/C that is described in, say,
>>> dbSNP and HGMD. The source id for the variation on the variation table
>>> might be dbSNP and the variation would have a variation_synonym entry
>>> for HGMD. Let's say both dbSNP and HGMD have population frequency data
>>> for the variation, which might look something like this.
>>>
>>>   Allele id   Variation id   Allele   Frequency   SampleID
>>>   1           1              T        1           14
>>>   2           1              C        0           14
>>>   3           1              T        0.5         15
>>>   4           1              C        0.5         15
>>>
>>>   In this case the dbSNP data is for population 14 and the HGMD is for
>>> population 15 but how would you know from looking?
>>> A sample isn't linked to the source that 'created' it so you can't tell
>>> from the sample.
>>>
>> Correct, samples do not have a source. This is not usually a problem,
>> since the vast majority of our frequency data comes from dbSNP. The only
>> exceptions to this in human are the COSMIC data, the samples for which are
>> only associated with variations with source COSMIC.
>>
>>
>>> Also, what is the subsnp_id in the allele table?
>>>
>>>
>> This represents a submission of data on a variant to dbSNP. When people
>> submit data to dbSNP, each variant they submit is assigned a subsnp_id
>> (ssID). Since several groups or individuals may submit the same variant to
>> dbSNP, each of the ssIDs corresponding to the same variant is merged
>> together to form one rsID. ssIDs are normally shown as e.g. ss12345, but we
>> only store the numerical part of the identifier.
>>
>>
>>> 2. What is subsnp_handle table?
>>>
>> Each submitter as described above is assigned a handle, or name, by dbSNP.
>> For example, when Ensembl submits data to dbSNP, it gets assigned the
>> handle of ENSEMBL. This table keeps track of which ssIDs were submitted by
>> which submitters, thus allowing our users to distinguish between what they
>> may consider to be different standards of data.
>>
>> For a working example, take a look at this page:
>>
>> http://www.ensembl.org/Homo_sapiens/Variation/Population?r=9:22125003-22126003;v=rs1333049;vdb=variation;vf=18123086
>>
>> From here you can click through to the dbSNP website from both the subsnp
>> ID and the submitter handle to see more information.
>>
>>
>>> 3 Population genotype
>>> What is the subsnp_id field (might be answered by the previous question)?
>>> Am I correct in saying this table doesn't provide the source of the data
>>> (might also be answered by a previous question)?
>>>
>>>
>> Same as above.
>>
>>
>>> 4 Variation set
>>> What is the source of a variation set? I believe variation sets are
>>> defined by Ensembl, so I presume the source is implicitly Ensembl?
>>>
>>>
>> All variation sets are loaded by Ensembl, so yes, you can consider the
>> source to be Ensembl.
>>
>>
>>> I've made quite a detailed document about the variation schema which I
>>> think might help other people like me learning the schema from scratch.
>>> I'm more than happy to make it available if there is a mechanism to do so.
>>>
>> If you email it to us we can see if we can integrate your document into
>> our current documentation on the website.
>>
>> Cheers
>>
>> Will
>>
>>
>>> Thanks a lot
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list
>>> Dev at ensembl.org
>>> http://lists.ensembl.org/mailman/listinfo/dev
>>>
>>>




