[ensembl-dev] assembly table in core database
Susan Fairley
sf7 at sanger.ac.uk
Mon Mar 14 11:36:24 GMT 2011
Hi Andrea,
> Please could you tell me why, for example, the clone AL445212.9 (seq
> region id = 22114) only has overlap from base 101 with chromosome 13
> (seq region id = 27513) in the assembly table rather than from its first
> base.
This is because frequently only part of a contig contributes to the
'toplevel' sequence. In other words, when overlapping contigs form a
supercontig, for the overlapping region, only sequence from one of the
contigs will be used. The same thing happens when the supercontigs are
assembled into chromosomes.
For human, the way in which the assembly fits together is described by
the .agp files from the Genome Reference Consortium. You will find more
information about the assembly on their web pages.
Looking for your example, AL445212.9, in
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/placed_scaffolds/AGP/chr13.placed.scaf.agp.gz
there is the following entry:
GL000111.1 13786050 13952606 123 F AL445212.9 101 166657 +
The final three columns describe the part of the component (AL445212.9)
that fits into the supercontig and the orientation of that sequence in
the supercontig. It shows that only the section from base 101 to base
166657 contributes to the assembled genome.
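If it helps, the same reading of the entry can be sketched in Python. This is only an illustration: the class and function names are made up for the example, and the column names follow the AGP convention for component lines (object, object_beg, object_end, part_number, component_type, component_id, component_beg, component_end, orientation). Gap lines are not handled.

```python
from typing import NamedTuple


class AgpComponent(NamedTuple):
    """One non-gap AGP line (hypothetical helper type for this example)."""
    object_name: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str
    component_beg: int
    component_end: int
    orientation: str


def parse_agp_component_line(line: str) -> AgpComponent:
    """Split a whitespace- or tab-separated AGP component line into fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    if fields[4] in ("N", "U"):
        raise ValueError("gap lines are not handled in this sketch")
    return AgpComponent(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4], fields[5], int(fields[6]), int(fields[7]), fields[8],
    )


entry = parse_agp_component_line(
    "GL000111.1 13786050 13952606 123 F AL445212.9 101 166657 +"
)

# Both spans are 1-based and inclusive, so they must have the same length.
object_span = entry.object_end - entry.object_beg + 1
component_span = entry.component_end - entry.component_beg + 1

print(entry.component_id, entry.component_beg, entry.component_end)
print(object_span, component_span)
```

The span on the supercontig and the span on the component have the same length, which is a quick sanity check that the first 100 bases of AL445212.9 are simply not placed.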
In the database, AL445212.9 exists as both a contig and a clone. By
checking the cmp_seq_region_id column in the assembly table, you can see
that part of the contig (101-166657) contributes to a supercontig and,
in turn, chromosome 13. You can also see that all of the contig
(1-166657) provides the sequence for the clone of the same name.
select * from seq_region where name ='AL445212.9';
+---------------+------------+-----------------+--------+
| seq_region_id | name       | coord_system_id | length |
+---------------+------------+-----------------+--------+
|         22114 | AL445212.9 |               1 | 166657 |
|         49917 | AL445212.9 |               4 | 166657 |
+---------------+------------+-----------------+--------+
2 rows in set (0.03 sec)
select seq_region.name, assembly.*
from assembly
join seq_region on assembly.asm_seq_region_id = seq_region.seq_region_id
where assembly.cmp_seq_region_id in (22114, 49917);
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
| name         | asm_seq_region_id | cmp_seq_region_id | asm_start | asm_end  | cmp_start | cmp_end | ori |
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
| 13           |             27513 |             22114 |  32806050 | 32972606 |       101 |  166657 |   1 |
| HSCHR13_CTG1 |             27528 |             22114 |  13786050 | 13952606 |       101 |  166657 |   1 |
| AL445212.9   |             49917 |             22114 |         1 |   166657 |         1 |  166657 |   1 |
+--------------+-------------------+-------------------+-----------+----------+-----------+---------+-----+
3 rows in set (2.33 sec)
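To make the meaning of a row concrete, here is a rough Python sketch of how a position on the contig could be converted to a chromosome position using one assembly row. The type and function names are invented for this example and this is not the Ensembl API's own mapper; it only assumes that coordinates are 1-based and inclusive, and that ori = 1 means forward and ori = -1 means reverse.

```python
from typing import NamedTuple, Optional


class AssemblyRow(NamedTuple):
    """One row of the assembly table (columns as in the output above)."""
    asm_seq_region_id: int
    cmp_seq_region_id: int
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int


def component_to_assembled(row: AssemblyRow, cmp_pos: int) -> Optional[int]:
    """Map a component position to the assembled sequence.

    Returns None when the position lies outside cmp_start..cmp_end,
    i.e. in a part of the component that this row does not place.
    """
    if not (row.cmp_start <= cmp_pos <= row.cmp_end):
        return None
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    # Reverse orientation: count back from the end of the assembled span.
    return row.asm_end - offset


# Contig AL445212.9 (22114) on chromosome 13 (27513), from the first result row.
chr13_row = AssemblyRow(27513, 22114, 32806050, 32972606, 101, 166657, 1)

print(component_to_assembled(chr13_row, 101))     # first placed base
print(component_to_assembled(chr13_row, 166657))  # last placed base
print(component_to_assembled(chr13_row, 50))      # not placed on chromosome 13
```

Bases 1-100 of the contig return None here because no row places them on chromosome 13.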
> There is nothing in this table either about its overlap with its
> neighbouring clones or its supercontig HSCHR13_CTG1.
The seq_region table, which contains the full length of the region, and
the assembly table, which essentially contains the information found in
an agp file, provide the information needed to work out which parts of
components are used in higher levels and which are not. This includes
the relationship between the contig and the supercontig, as illustrated
above.
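As a small illustration of that calculation, the following Python sketch takes the length from seq_region and the cmp_start/cmp_end pairs from the assembly rows for one higher-level coordinate system, and lists the parts of the component that are not used. The function name is hypothetical, and it assumes 1-based inclusive coordinates.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def unused_ranges(length: int, used: List[Range]) -> List[Range]:
    """Return the 1-based inclusive ranges of a component not covered by `used`."""
    gaps: List[Range] = []
    next_free = 1  # first base not yet known to be covered
    for start, end in sorted(used):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= length:
        gaps.append((next_free, length))
    return gaps


# AL445212.9: length 166657, with 101-166657 placed on chromosome 13.
print(unused_ranges(166657, [(101, 166657)]))

# The same contig against the clone of the same name: fully used.
print(unused_ranges(166657, [(1, 166657)]))
```

For the chromosome 13 row this reports bases 1-100 as unused, which is the region that overlaps the neighbouring component in the assembly.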
> If you look at the
> clone the annotated region of the clone is from base 101 onwards.
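In general, annotation is stored on the toplevel sequence, so only the part of
the clone that contributes to the toplevel (here, bases 101-166657) can carry
annotation when features are projected onto the clone.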
Regards,
Susan.