[ensembl-dev] assembly table in core database

Mon Mar 14 11:36:24 GMT 2011

Hi Andrea,

> Please could you tell me why, for example, the clone AL445212.9 (seq 
> region id = 22114) only has overlap from base 101 with chromosome 13 
> (seq region id = 27513) in the assembly table rather than from its first 
> base.

This is because often only part of a contig contributes to the 
'toplevel' sequence. In other words, where overlapping contigs form a 
supercontig, sequence from only one of the contigs is used for the 
overlapping region. The same thing happens when the supercontigs are 
assembled into chromosomes.

For human, the way in which the assembly fits together is described by 
the .agp files from the Genome Reference Consortium. You will find more 
information about the assembly on their web pages.

Looking for your example, AL445212.9, in 
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/placed_scaffolds/AGP/chr13.placed.scaf.agp.gz

there is the following entry:
GL000111.1	13786050	13952606	123	F	AL445212.9	101	166657	+

The final three columns describe the part of the component (AL445212.9) 
that fits into the supercontig and the orientation of that sequence in 
the supercontig. It shows that only the section from base 101 to base 
166657 contributes to the assembled genome.
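
As a rough illustration, the entry above can be split into its nine 
tab-separated fields with a few lines of Python. The field names follow 
the column order of the AGP format; the class and function names 
(AgpComponent, parse_agp_component_line) are made up for this sketch, 
and it only handles component lines, not gap lines.

```python
from typing import NamedTuple


class AgpComponent(NamedTuple):
    """One component line of an AGP file (hypothetical helper type)."""
    object_id: str
    object_start: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str
    component_start: int
    component_end: int
    orientation: str


def parse_agp_component_line(line: str) -> AgpComponent:
    """Split a tab-separated AGP component line into named fields.

    Assumption: the line describes a component, not a gap. Gap lines
    use the last four columns differently and are not handled here.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    obj, obj_beg, obj_end, part, ctype, comp, comp_beg, comp_end, ori = fields
    return AgpComponent(obj, int(obj_beg), int(obj_end), int(part),
                        ctype, comp, int(comp_beg), int(comp_end), ori)


entry = parse_agp_component_line(
    "GL000111.1\t13786050\t13952606\t123\tF\tAL445212.9\t101\t166657\t+"
)
# The last three fields: which part of the component is used, and its strand.
print(entry.component_id, entry.component_start,
      entry.component_end, entry.orientation)
```

The span on the supercontig (13786050-13952606) and the span on the 
component (101-166657) cover the same number of bases, which is a quick 
sanity check on any component line.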

In the database, AL445212.9 exists as both a contig and a clone. By 
checking the cmp_seq_region_id column in the assembly table, you can see 
that part of the contig (101-166657) contributes to a supercontig and, 
in turn, chromosome 13. You can also see that all of the contig 
(1-166657) provides the sequence for the clone of the same name.

select * from seq_region where name ='AL445212.9';
+---------------+------------+-----------------+--------+
| seq_region_id | name       | coord_system_id | length |
+---------------+------------+-----------------+--------+
|         22114 | AL445212.9 |               1 | 166657 |
|         49917 | AL445212.9 |               4 | 166657 |
+---------------+------------+-----------------+--------+
2 rows in set (0.03 sec)

select seq_region.name, assembly.* from assembly join seq_region on 
assembly.asm_seq_region_id = seq_region.seq_region_id where 
assembly.cmp_seq_region_id in (22114, 49917);
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
| name         | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end  | cmp_start | cmp_end | ori |
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
| 13           |             27513 |             22114 |  32806050 | 32972606 |       101 |  166657 |   1 |
| HSCHR13_CTG1 |             27528 |             22114 |  13786050 | 13952606 |       101 |  166657 |   1 |
| AL445212.9   |             49917 |             22114 |         1 |   166657 |         1 |  166657 |   1 |
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
3 rows in set (2.33 sec)
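
To make the meaning of these rows concrete, here is a minimal sketch of 
how a position on the contig could be projected onto an assembled 
region using one assembly row. The function name component_to_assembled 
is hypothetical, and the handling of ori = -1 is an assumption based on 
the usual convention for reverse-oriented components; only the ori = 1 
case is exercised by the rows above.

```python
from typing import Optional


def component_to_assembled(pos: int, asm_start: int, asm_end: int,
                           cmp_start: int, cmp_end: int,
                           ori: int) -> Optional[int]:
    """Map a 1-based component position onto the assembled region.

    Returns None when the position lies outside cmp_start..cmp_end,
    i.e. in a part of the component that this row does not use.
    """
    if not cmp_start <= pos <= cmp_end:
        return None
    offset = pos - cmp_start
    if ori == 1:
        return asm_start + offset
    # Assumed convention: a reverse-oriented component is read from asm_end.
    return asm_end - offset


# Row for chromosome 13 (asm_seq_region_id 27513) from the output above.
chr13_row = dict(asm_start=32806050, asm_end=32972606,
                 cmp_start=101, cmp_end=166657, ori=1)

print(component_to_assembled(101, **chr13_row))     # first base used
print(component_to_assembled(166657, **chr13_row))  # last base used
print(component_to_assembled(50, **chr13_row))      # not used on chr 13
```

With the chromosome 13 row, contig base 101 lands on 32806050 and base 
166657 on 32972606, while bases 1-100 have no position on the 
chromosome at all.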

> There is nothing in this table either about its overlap with its 
> neighbouring clones or its supercontig HSCHR13_CTG1.

The seq_region table, which contains the full length of the region, and 
the assembly table, which essentially contains the information found in 
an agp file, provide the information needed to work out which parts of 
components are used in higher levels and which are not. This includes 
the relationship between the contig and the supercontig, as illustrated 
above.
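
As a sketch of that calculation, the unused parts of a component can be 
found by subtracting the cmp_start..cmp_end ranges (from the assembly 
table) from the full 1..length range (from the seq_region table). The 
function name unused_ranges is made up for this example, and the values 
are those of AL445212.9 shown earlier.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # 1-based, inclusive (start, end)


def unused_ranges(length: int, used: List[Range]) -> List[Range]:
    """Return the parts of 1..length not covered by any range in `used`."""
    gaps = []
    next_free = 1  # first position not yet known to be covered
    for start, end in sorted(used):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= length:
        gaps.append((next_free, length))
    return gaps


# seq_region.length = 166657; assembly cmp_start..cmp_end = 101..166657
print(unused_ranges(166657, [(101, 166657)]))  # [(1, 100)]
```

For the contig-to-chromosome mapping this reports bases 1-100 as 
unused, whereas for the contig-to-clone mapping (1-166657) nothing is 
left over.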

> If you look at the 
> clone the annotated region of the clone is from base 101 onwards.

In general, annotation is stored on the toplevel sequence.

Regards,
Susan.