[ensembl-dev] assembly table in core database

Mon Mar 14 14:57:38 GMT 2011

Hi Andrea,

If you look at the coord_system table you will see the attribute 
sequence_level associated with the contig coord_system, indicating that 
it is on this level that the sequence data is stored.

As Bronwen mentioned, we load the data we get from the groups that 
create the assemblies. For human, the lowest level of the assembly is 
the contig (the GRC assembly is composed of contigs, 
supercontigs/scaffolds and chromosomes; clones are not, to my knowledge, 
included). Storing the sequence at the contig level enables us to store 
all of the sequence and reconstruct the higher levels of the assembly. 
We don't store sequence data on additional levels, as that would 
duplicate DNA sequence that can be reconstructed from the existing data. 
Consequently, all regions on the other coord_systems, including the 
clones, are reconstructed from the contigs.
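
As a rough illustration of what "reconstructing from the contigs" means
for coordinates, here is a minimal Python sketch. It is not Ensembl API
code: the AssemblyRow type and the asm_to_cmp function are hypothetical
names made up for this example, and the three rows are simply copied
from the assembly query output quoted further down in this message.

```python
from typing import NamedTuple, Optional


class AssemblyRow(NamedTuple):
    """One assembly-table row plus the name of the assembled region."""
    asm_name: str
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int


# Rows whose component is seq_region 22114 (the contig AL445212.9),
# copied from the query output quoted in this thread.
ROWS = [
    AssemblyRow("13", 32806050, 32972606, 101, 166657, 1),
    AssemblyRow("HSCHR13_CTG1", 13786050, 13952606, 101, 166657, 1),
    AssemblyRow("AL445212.9", 1, 166657, 1, 166657, 1),
]


def asm_to_cmp(row: AssemblyRow, pos: int) -> Optional[int]:
    """Map a 1-based position on the assembled region to the component.

    Returns None if the position lies outside the block this row maps.
    """
    if not (row.asm_start <= pos <= row.asm_end):
        return None
    offset = pos - row.asm_start
    # ori = 1: component runs in the same direction as the assembled region.
    # ori = -1: component is reversed, so walk backwards from cmp_end.
    return row.cmp_start + offset if row.ori == 1 else row.cmp_end - offset


if __name__ == "__main__":
    chr13, scaffold, clone = ROWS
    print(asm_to_cmp(chr13, 32806050))  # first chr13 base taken from the contig
    print(asm_to_cmp(clone, 1))         # first clone base taken from the contig
```

With these rows, the first base of the chromosome 13 block maps to
contig base 101, whereas the first base of the clone maps to contig
base 1. In every row the contig is the component, because the contig is
where the sequence is stored.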

I hope that helps to clarify things.

Regards,
Susan.

Andrea Edwards wrote:
> Hi
> 
> Thanks for a very detailed and clear answer. I did look at all the 
> information you have given in the answer before I posted the question, 
> and the ultimate reason why I thought I must be 'missing something' 
> was because I didn't see why the contig provided the sequence for the 
> clone. I expected it to be the other way round. I also thought the 
> assembly table was supposed to describe all overlapping areas of 
> sequence, which I now believe is not the case. The assembly table 
> provides the information needed to work out which parts of components 
> are used in higher levels and which are not because, as you say, 
> annotation is stored on the toplevel sequence. That's fair enough.
> 
> I still don't understand the last line of these results though
> 
> mysql> select name, assembly.* from assembly, seq_region where
> cmp_seq_region_id in (22114,49917) and
> asm_seq_region_id=seq_region.seq_region_id;
> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
> | name         | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end  | cmp_start | cmp_end | ori |
> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
> | 13           |             27513 |             22114 |  32806050 | 32972606 |       101 |  166657 |   1 |
> | HSCHR13_CTG1 |             27528 |             22114 |  13786050 | 13952606 |       101 |  166657 |   1 |
> | AL445212.9   |             49917 |             22114 |         1 |   166657 |         1 |  166657 |   1 |
> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
> 3 rows in set (0.11 sec)
> 
> where I expected asm_seq_region_id to be 22114 and cmp_seq_region_id to 
> be 49917
> 
> 
> On 14/03/2011 11:36, Susan Fairley wrote:
>> Hi Andrea,
>>
>>> Please could you tell me why, for example, the clone AL445212.9 (seq 
>>> region id = 22114) only has overlap from base 101 with chromosome 13 
>>> (seq region id = 27513) in the assembly table rather than from its 
>>> first base.
>>
>> This is because frequently only part of a contig contributes to the 
>> 'toplevel' sequence. In other words, when overlapping contigs form a 
>> supercontig, for the overlapping region, only sequence from one of the 
>> contigs will be used. The same thing happens when the supercontigs are 
>> assembled into chromosomes.
>>
>> For human, the way in which the assembly fits together is described by 
>> the .agp files from the Genome Reference Consortium. You will find 
>> more information about the assembly on their web pages.
>>
>> Looking for your example, AL445212.9, in 
>> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/placed_scaffolds/AGP/chr13.placed.scaf.agp.gz 
>>
>>
>> there is the following entry:
>> GL000111.1    13786050    13952606    123    F    AL445212.9    101    166657    +
>>
>> The final three columns describe the part of the component 
>> (AL445212.9) that fits into the supercontig and the orientation of 
>> that sequence in the supercontig. It shows that only the section from 
>> base 101 to base 166657 contributes to the assembled genome.
>>
>> In the database, AL445212.9 exists as both a contig and a clone. By 
>> checking the cmp_seq_region_id column in the assembly table, you can 
>> see that part of the contig (101-166657) contributes to a supercontig 
>> and, in turn, chromosome 13. You can also see that all of the contig 
>> (1-166657) provides the sequence for the clone of the same name.
>>
>> select * from seq_region where name ='AL445212.9';
>> +---------------+------------+-----------------+--------+
>> | seq_region_id | name       | coord_system_id | length |
>> +---------------+------------+-----------------+--------+
>> |         22114 | AL445212.9 |               1 | 166657 |
>> |         49917 | AL445212.9 |               4 | 166657 |
>> +---------------+------------+-----------------+--------+
>> 2 rows in set (0.03 sec)
>>
>> select name, assembly.* from assembly, seq_region where 
>> cmp_seq_region_id in (22114,49917) and 
>> asm_seq_region_id=seq_region.seq_region_id;
>> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
>> | name         | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end  | cmp_start | cmp_end | ori |
>> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
>> | 13           |             27513 |             22114 |  32806050 | 32972606 |       101 |  166657 |   1 |
>> | HSCHR13_CTG1 |             27528 |             22114 |  13786050 | 13952606 |       101 |  166657 |   1 |
>> | AL445212.9   |             49917 |             22114 |         1 |   166657 |         1 |  166657 |   1 |
>> +--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
>> 3 rows in set (2.33 sec)
>>
>>> There is nothing in this table either about its overlap with its 
>>> neighbouring clones or its supercontig HSCHR13_CTG1. 
>>
>> The seq_region table, which contains the full length of the region, 
>> and the assembly table, which essentially contains the information 
>> found in an agp file, provide the information needed to work out which 
>> parts of components are used in higher levels and which are not. This 
>> includes the relationship between the contig and the supercontig, as 
>> illustrated above.
>>
>>> If you look at the clone the annotated region of the clone is from 
>>> base 101 onwards.
>>
>> In general, annotation is stored on the toplevel sequence.
>>
>> Regards,
>> Susan.
>>
>>
>
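
For anyone who wants to compare the AGP files with the assembly table
directly, the AGP entry quoted above can be split into its nine
standard columns. The following is only a sketch: AgpComponent,
parse_agp_component and to_assembly_fields are hypothetical names
chosen for this example, and treating any orientation other than "-" as
forward is a simplifying assumption.

```python
from typing import NamedTuple, Tuple


class AgpComponent(NamedTuple):
    """The nine columns of a non-gap AGP line."""
    object_id: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str
    component_beg: int
    component_end: int
    orientation: str


def parse_agp_component(line: str) -> AgpComponent:
    """Parse one whitespace-separated AGP component line."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    if f[4] in ("N", "U"):
        # Gap lines use columns 6-9 differently and are not handled here.
        raise ValueError("gap line, not a component line")
    return AgpComponent(f[0], int(f[1]), int(f[2]), int(f[3]),
                        f[4], f[5], int(f[6]), int(f[7]), f[8])


def to_assembly_fields(rec: AgpComponent) -> Tuple[int, int, int, int, int]:
    """Return (asm_start, asm_end, cmp_start, cmp_end, ori) for a record.

    Assumption: any orientation other than "-" is treated as forward.
    """
    ori = -1 if rec.orientation == "-" else 1
    return (rec.object_beg, rec.object_end,
            rec.component_beg, rec.component_end, ori)


# The entry for AL445212.9 quoted in this thread.
LINE = "GL000111.1 13786050 13952606 123 F AL445212.9 101 166657 +"

if __name__ == "__main__":
    print(to_assembly_fields(parse_agp_component(LINE)))
```

The resulting fields match the start, end and orientation values of the
HSCHR13_CTG1 row in the query output above, which is what one would
expect if the assembly table carries the same information as the AGP
file.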