[ensembl-dev] questions about variation schema

Will McLaren wm2 at ebi.ac.uk
Tue Jan 11 09:46:45 GMT 2011


Hello

On 10 January 2011 18:26, Andrea Edwards <edwardsa at cs.man.ac.uk> wrote:

>  Hello and Happy New Year
>
> I have some quick questions about the variation schema.
>
> 1. Allele table
>
> When considering population frequency data for an allele, how do you know
> which source it is from?
> For example, imagine a SNP with alleles T/C that is described in, say, dbSNP
> and HGMD. The source id for the variation on the variation table might be
> dbSNP and the variation would have a variation_synonym entry for HGMD. Let's
> say both dbSNP and HGMD have population frequency data for the variation,
> which might look something like this.
>
> Allele id  Variation id  Allele  Frequency  SampleID
> 1          1             T       1          14
> 2          1             C       0          14
> 3          1             T       0.5        15
> 4          1             C       0.5        15
>
> In this case the dbSNP data is for population 14 and the HGMD data is for
> population 15, but how would you know from looking?
> A sample isn't linked to the source that 'created' it, so you can't tell
> from the sample.
>

Correct, samples do not have a source. This is rarely a problem in practice,
since the vast majority of our frequency data comes from dbSNP. The only
exception in human is the COSMIC data, whose samples are associated only
with variations that have COSMIC as their source.
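
As a rough illustration of inferring the source through the variation rather
than the sample, here is a minimal sketch using an in-memory SQLite database.
The table layout, column names and rows are simplified assumptions made up for
this example (sample 20 and the second variation are invented); it is not a
copy of the real schema or data.

```python
import sqlite3

# Toy stand-in for three tables. Column names are assumed for illustration
# and are not taken from the real Ensembl variation schema definition.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (source_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE variation (variation_id INTEGER PRIMARY KEY, source_id INTEGER);
CREATE TABLE allele (allele_id INTEGER PRIMARY KEY, variation_id INTEGER,
                     allele TEXT, frequency REAL, sample_id INTEGER);
INSERT INTO source VALUES (1, 'dbSNP'), (2, 'COSMIC');
INSERT INTO variation VALUES (1, 1), (2, 2);
INSERT INTO allele VALUES (1, 1, 'T', 1.0, 14), (2, 1, 'C', 0.0, 14),
                          (3, 2, 'A', 0.5, 20), (4, 2, 'G', 0.5, 20);
""")


def frequency_sources(conn):
    """Return (sample_id, source name) pairs, inferred via the variation."""
    rows = conn.execute("""
        SELECT DISTINCT a.sample_id, s.name
        FROM allele a
        JOIN variation v ON v.variation_id = a.variation_id
        JOIN source s ON s.source_id = v.source_id
        ORDER BY a.sample_id
    """)
    return rows.fetchall()


print(frequency_sources(conn))
```

Note that this only works because each sample's data hangs off variations
from a single source; the sample rows themselves carry no source.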


>
> Also, what is the subsnp_id in the allele table?
>
>
This represents a submission of data on a variant to dbSNP. When people
submit data to dbSNP, each variant they submit is assigned a subsnp_id
(ssID). Since several groups or individuals may submit the same variant,
all of the ssIDs corresponding to that variant are merged to form one rsID.
ssIDs are normally shown as e.g. ss12345, but we store only the numerical
part of the identifier.
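
Converting between the displayed form and the stored number is simple. A
minimal sketch follows; the function names are hypothetical helpers written
for this example, not part of any Ensembl API.

```python
import re

# Matches the displayed form of an ssID, e.g. "ss12345".
_SS_PATTERN = re.compile(r"ss(\d+)", re.ASCII)


def ss_to_int(ss_id: str) -> int:
    """Strip the 'ss' prefix and return the numeric part as stored."""
    match = _SS_PATTERN.fullmatch(ss_id)
    if match is None:
        raise ValueError(f"not an ssID: {ss_id!r}")
    return int(match.group(1))


def int_to_ss(subsnp_id: int) -> str:
    """Rebuild the displayed ssID from the stored numeric subsnp_id."""
    return f"ss{subsnp_id}"


print(ss_to_int("ss12345"), int_to_ss(12345))
```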


>
> 2. What is the subsnp_handle table?
>

Each submitter as described above is assigned a handle, or name, by dbSNP.
For example, when Ensembl submits data to dbSNP, it gets assigned the handle
of ENSEMBL. This table keeps track of which ssIDs were submitted by which
submitters, thus allowing our users to distinguish between what they may
consider to be different standards of data.

For a working example, take a look at this page:

http://www.ensembl.org/Homo_sapiens/Variation/Population?r=9:22125003-22126003;v=rs1333049;vdb=variation;vf=18123086

From here you can click through to the dbSNP website from both the subsnp ID
and the submitter handle to see more information.
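
To show how the handle lookup might be used to filter by submitter, here is a
small sketch against a toy in-memory SQLite database. The column names, the
ssID values and the OTHER_LAB handle are assumptions invented for this
example; only the ENSEMBL handle comes from the explanation above.

```python
import sqlite3

# Toy tables: column names and all rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE allele (allele_id INTEGER PRIMARY KEY, variation_id INTEGER,
                     subsnp_id INTEGER, allele TEXT, frequency REAL,
                     sample_id INTEGER);
CREATE TABLE subsnp_handle (subsnp_id INTEGER PRIMARY KEY, handle TEXT);
INSERT INTO subsnp_handle VALUES (12345, 'ENSEMBL'), (67890, 'OTHER_LAB');
INSERT INTO allele VALUES (1, 1, 12345, 'T', 0.5, 14),
                          (2, 1, 67890, 'T', 0.4, 14);
""")


def alleles_from(conn, handle):
    """Return (allele_id, displayed ssID) for rows submitted under a handle."""
    rows = conn.execute("""
        SELECT a.allele_id, 'ss' || a.subsnp_id
        FROM allele a
        JOIN subsnp_handle h ON h.subsnp_id = a.subsnp_id
        WHERE h.handle = ?
        ORDER BY a.allele_id
    """, (handle,))
    return rows.fetchall()


print(alleles_from(conn, "ENSEMBL"))
```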


> 3. Population genotype
> What is the subsnp_id field (might be answered by the previous question)?
> Am I correct in saying this table doesn't provide the source of the data
> (might also be answered by a previous question)?
>
>
Same as above on both counts: subsnp_id is the dbSNP ssID, and no source is
recorded against the sample.


> 4. Variation set
> What is the source of a variation set? I believe variation sets are defined
> by Ensembl, so I presume the source is implicitly Ensembl?
>
>
All variation sets are loaded by Ensembl, so yes, you can consider the source
to be Ensembl.


>
> I've made quite a detailed document about the variation schema which I
> think might help other people like me learning the schema from scratch. I'm
> more than happy to make it available if there is a mechanism to do so.
>

If you email it to us we can see if we can integrate your document into our
current documentation on the website.

Cheers

Will


>
> Thanks a lot
>
>
> _______________________________________________
> Dev mailing list
> Dev at ensembl.org
> http://lists.ensembl.org/mailman/listinfo/dev
>
>