[ensembl-dev] variation schema

Will McLaren wm2 at ebi.ac.uk
Tue Jan 4 16:35:56 GMT 2011


Hi,

A variation set is not by design associated with a population; it is just a
generic way to group together variations. Variants from one set may be
observed in more than one population; by their very nature variants can of
course occur in the same position in multiple individuals and populations.

There is no fail-safe way to join up sets and populations as they were not
designed that way; it so happens that some sets denote groups of variants
observed in a population in a particular study (1000 genomes, for example),
but the way the data are constructed is different for the sample table and
the variation_set tables.
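
As a rough illustration only, the sketch below lists which populations have alleles recorded for the variations in a set, by going from the set to its variations and then through the allele table to the sample table. It uses a toy in-memory SQLite database rather than a real Ensembl MySQL instance; the column names (variation_set_id, variation_id, sample_id) are assumptions based on the tables named in this thread, and the rows, including the population "HYPOTHETICAL:other_population", are invented. It shows why the link is indirect: the same set can come back with several populations.

```python
import sqlite3

# Toy in-memory database. Table and column names are assumptions modelled on
# the tables named in this thread; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variation_set (variation_set_id INTEGER, name TEXT);
CREATE TABLE variation_set_variation (variation_id INTEGER, variation_set_id INTEGER);
CREATE TABLE sample (sample_id INTEGER, name TEXT);
CREATE TABLE allele (variation_id INTEGER, sample_id INTEGER, allele TEXT);

INSERT INTO variation_set VALUES (1, 'Ensembl Watson');
INSERT INTO variation_set_variation VALUES (101, 1), (102, 1);
INSERT INTO sample VALUES
    (10, 'ENSEMBL:ENSEMBL_Watson'),
    (20, 'HYPOTHETICAL:other_population');
INSERT INTO allele VALUES
    (101, 10, 'A'), (101, 10, 'G'),
    (102, 10, 'C'), (102, 10, 'T'),
    (102, 20, 'C');
""")


def populations_seen_for_set(conn, set_name):
    """Return (sample name, distinct variation count) pairs for one set.

    The set is never joined to a sample directly; the only route is
    set -> variations -> alleles -> samples.
    """
    query = """
        SELECT s.name, COUNT(DISTINCT a.variation_id)
        FROM variation_set vs
        JOIN variation_set_variation vsv
          ON vsv.variation_set_id = vs.variation_set_id
        JOIN allele a ON a.variation_id = vsv.variation_id
        JOIN sample s ON s.sample_id = a.sample_id
        WHERE vs.name = ?
        GROUP BY s.name
        ORDER BY s.name
    """
    return conn.execute(query, (set_name,)).fetchall()


rows = populations_seen_for_set(conn, "Ensembl Watson")
for name, n_variations in rows:
    print(name, n_variations)
```

In this toy data the Watson set comes back with two populations, because one of its variations also has an allele recorded in a second population. A query like this reports where a set's variants have been observed, not which population the set "belongs" to.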

Will

On 4 January 2011 15:35, Andrea Edwards <edwardsa at cs.man.ac.uk> wrote:

>  Hello
>
> Thanks for your reply. However I still don't quite see how you know
> *programmatically* what population a variation set is associated with:
>
>
> -if the variations in the variation set belong to only one population then
> you can *assume* that the variation set relates to that population,
> but what about when the variation belongs to multiple populations?
> There are rows in the allele table for each population the variation
> belongs to
>
> -the name of the variation set (e.g. Ensembl Watson) is not the same as the
> name of the population (ENSEMBL:ENSEMBL_Watson) so you can't do a join on
> the variation_set_name to the sample_name to filter the appropriate records
> by population from the allele table
>
> Also, is it possible that a variation set could pertain to multiple
> populations?
>
> ==============================================
> Example of difficulty finding population for a variation set
> ===============================================
>   the variation sets for 1000 genomes are:
>
> mysql> select variation_set_id, name from variation_set;
> +----+-----------------------------+
> | id | name                        |
> +----+-----------------------------+
> |  1 | 1000 genomes                |
> |  8 | 1000 genomes - Low coverage |
> |  3 | 1000 genomes - Trios - CEU  |
> |  4 | 1000 genomes - Trios - YRI  |
> +----+-----------------------------+
>
> Looking at the variation_set_structure table, the last 3 variation sets are
> subsets of the first "1000 genomes"
>
> I don't know which populations these 4 variation sets pertain to. There are
> 3 possibilities in the population table
>
> mysql> select id, name from sample where name like "%1000%";
> +-------+-----------------------------+
> | id    | name                        |
> +-------+-----------------------------+
> | 11273 | 1000GENOMES:pilot.1.CEU     |
> | 11274 | 1000GENOMES:pilot.1.CHB+JPT |
> | 11275 | 1000GENOMES:pilot.1.YRI     |
> .....
> 56 rows in set (0.06 sec)
>
> All population names contain the digit 1, suggesting they belong to the
> first low-coverage pilot, which in turn suggests variation set 8 corresponds
> to populations 11273, 11274 and 11275.
>
> But I know this can't be right: which populations do the individuals in
> variation sets 3 and 4 belong to, and which populations do the individuals
> in variation set 1 (but not 8, 3 and 4) belong to?
>
> thanks a lot
>
>
>
>
> On 22/12/2010 09:44, Will McLaren wrote:
>
> Hi Andrea,
>
>  Apologies, but the schema document has not been updated to include the
> variation set tables - your understanding of them is correct. Sets are a
> generic, catch-all way of grouping variations - they allow us to group,
> for example, all variants from the HapMap project, or all variants with
> phenotypic associations, or in this case all variants called in a particular
> individual.
>
>  Alleles are linked to populations, and there is a population representing
> Watson (the population is named "ENSEMBL:ENSEMBL_Watson" and is of size one,
> and has an individual named "Watson"). Thus if a variation belongs to the
> Watson set, it should have a pair of alleles linked to the Watson
> population.
>
>  Cheers
>
>  Will
>
> On 21 December 2010 21:06, Andrea Edwards <edwardsa at cs.man.ac.uk> wrote:
>
>> Hi
>>
>> I have been reading about the variation database schema here
>>
>> http://www.ensembl.org/info/docs/api/variation/variation_schema.html
>>
>> but there is no information in this document about the database tables
>> that, based on their names, look like they deal with variation sets, namely
>>
>> *variation_set
>> *variation_set_structure
>> *variation_set_variation
>>
>> These tables aren't on the pdf schema diagram either.
>>
>> I was hoping I could get an explanation of these tables.
>>
>> It looks as though variation_set is simply a variation set with a name and
>> description.
>>
>> It looks then as if variation_set_variation is a simple link table to
>> resolve the many to many relationship between a variation and a variation
>> set. But if that is the case I don't know how you model the alleles in a
>> variation set such as the watson set.
>>
>> For example a particular variation might be triallelic overall (e.g. in
>> every individual looked at) but variations in the watson variation set can
>> only be diploid at most. The table that normally describes the alleles of a
>> variation and their frequencies is allele. The allele table links to a
>> sample id so you know which alleles occur for a variation in a population
>> and you know the frequency of a particular allele in that population. The
>> allele table doesn't seem to have any link to a variation set.
>>
>> It looks like there should be a link somewhere between a variation set and
>> a population/sample so that the allele table can still represent the
>> alleles/frequencies of a variation set
>>
>> Or I could be guessing this all wrong. Either way, I would really benefit
>> from some documentation of the schema that models variation sets. And I
>> think I need Ensembl's definition of a variation set (the POD simply says
>> "This is a class representing a set of variations that are grouped by e.g.
>> study, method, quality measure etc.")
>>
>> Kind regards
>>