[ensembl-dev] length of HG79_PATCH

Susan Fairley sf7 at sanger.ac.uk
Tue Nov 15 10:33:47 GMT 2011


As you mentioned, each of the GRC assembly patches in Ensembl has two 
entries in the seq_region table. As you also noted, one of these 
corresponds to a supercontig and the other to a chromosome.

The supercontig is the patch scaffold/supercontig released by GRC. The 
length corresponds to the length of the patch sequence. For HG79_PATCH, 
this is 330,164 bases.

The chromosome is the patched version of the chromosome, i.e. what you 
would get if you incorporated the patch into the chromosome it is 
located on.

As you say, HG79_PATCH is located on chromosome 9 (this can be seen in 
the assembly_exception table as well).

The original chromosome 9 has a length of 141,213,431 bases.

Looking in the assembly exception table or this file, provided by GRC,

ensro at ens-livemirror : homo_sapiens_core_64_37 >select seq_region.name, 
seq_region_start, seq_region_end, exc_seq_region_start, 
exc_seq_region_end from seq_region, assembly_exception where 
name='HG79_PATCH' and 
| name       | seq_region_start | seq_region_end | exc_seq_region_start | exc_seq_region_end |
| HG79_PATCH |        136049442 |      136379605 |            136049442 |          136369192 |
1 row in set (0.00 sec)

we can see that the patch replaces the region of chromosome 9 from 
136,049,442 to 136,369,192.

Consequently, a region of length 319,751 bases is removed from 
chromosome 9 (136,369,192 - 136,049,441 = 319,751) and replaced with the 
330,164 bases of the patch.

So, chromosome 9 (141,213,431) has 319,751 bases removed (giving 
140,893,680 bases) and then 330,164 bases added, giving 141,223,844 as 
the length of the patched chromosome. To differentiate the patched 
chromosome from the primary version of chromosome 9 it is given the same 
name as the patch.
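
The arithmetic above can be checked with a short Python sketch. It uses 
only the figures quoted in this message; the function and variable names 
are made up for illustration, and coordinates are assumed to be 1-based 
and inclusive, as the subtraction above implies.

```python
# Figures quoted in this message for HG79_PATCH on chromosome 9.
CHR9_LENGTH = 141_213_431   # primary chromosome 9
PATCH_LENGTH = 330_164      # supercontig (patch scaffold) length
SEQ_START = 136_049_442     # seq_region_start (on the patched chromosome)
SEQ_END = 136_379_605       # seq_region_end (on the patched chromosome)
EXC_START = 136_049_442     # exc_seq_region_start (on primary chromosome 9)
EXC_END = 136_369_192       # exc_seq_region_end (on primary chromosome 9)


def replaced_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive coordinate range."""
    return end - start + 1


def patched_chromosome_length(chrom_len: int, patch_len: int,
                              start: int, end: int) -> int:
    """Remove the replaced region, then add the patch sequence."""
    return chrom_len - replaced_length(start, end) + patch_len


removed = replaced_length(EXC_START, EXC_END)
patched = patched_chromosome_length(CHR9_LENGTH, PATCH_LENGTH,
                                    EXC_START, EXC_END)

print(removed)                 # 319751
print(patched)                 # 141223844
print(patched - CHR9_LENGTH)   # 10413

# The patch spans seq_region_start..seq_region_end on the patched
# chromosome, so that range has the same length as the supercontig.
print(replaced_length(SEQ_START, SEQ_END) == PATCH_LENGTH)  # True
```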

In this case, the patched version of the chromosome is 10,413 bases 
longer because the patch is 10,413 bases longer than the region it 
replaces. Patches can, however, also be shorter than, or the same length 
as, the regions they replace.

Also, it is worth noting that not all patches have names ending in 
'_PATCH'. As you will see in the GRC file above, patches exist with 
names like HSCHR3_1_CTG2_1.

To identify the patches in the database, you could use the following query:

homo_sapiens_core_64_37 >select count(*) from attrib_type, 
seq_region_attrib, seq_region where attrib_type.code in 
('patch_fix','patch_novel') and attrib_type.attrib_type_id = 
seq_region_attrib.attrib_type_id and seq_region_attrib.seq_region_id = 
seq_region.seq_region_id ;
| count(*) |
|      105 |
1 row in set (0.00 sec)

This identifies both the novel and the fix patches, indicated by the 
attrib_types with codes 'patch_fix' and 'patch_novel'.
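
To list the patch regions by name rather than count them, the same join 
can select seq_region.name and attrib_type.code. The sketch below 
illustrates this in Python against an in-memory SQLite database, so it 
runs without access to an Ensembl server. The three tables are cut down 
to the columns used in the join, and all rows are placeholders, not real 
Ensembl content.

```python
import sqlite3

# Toy stand-in for the core database: only the columns used in the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attrib_type (attrib_type_id INTEGER, code TEXT);
CREATE TABLE seq_region (seq_region_id INTEGER, name TEXT);
CREATE TABLE seq_region_attrib (seq_region_id INTEGER, attrib_type_id INTEGER);

-- Placeholder rows for illustration only (hypothetical ids and names).
INSERT INTO attrib_type VALUES
    (1, 'patch_fix'), (2, 'patch_novel'), (3, 'some_other_code');
INSERT INTO seq_region VALUES
    (10, 'example_fix_region'),
    (11, 'example_novel_region'),
    (12, 'example_other_region');
INSERT INTO seq_region_attrib VALUES (10, 1), (11, 2), (12, 3);
""")

# Same join conditions as the count(*) query, but returning names and codes.
rows = conn.execute("""
    SELECT seq_region.name, attrib_type.code
    FROM attrib_type
    JOIN seq_region_attrib
      ON attrib_type.attrib_type_id = seq_region_attrib.attrib_type_id
    JOIN seq_region
      ON seq_region_attrib.seq_region_id = seq_region.seq_region_id
    WHERE attrib_type.code IN ('patch_fix', 'patch_novel')
    ORDER BY seq_region.name
""").fetchall()

for name, code in rows:
    print(name, code)
```

Against a real core database, the SELECT statement should work unchanged 
from the mysql prompt, assuming the table and column names used in the 
count query above.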

Gene annotation is stored on the patch regions of the patched 
chromosomes. Currently, the annotation strategy for the patches combines 
three sets of annotation. Manual annotation from the Havana group is 
given precedence and can be found in the core database. It is 
supplemented, however, first with annotation projected from the primary 
assembly and subsequently with annotation built from the alignment of 
evidence to the patches. These second and third categories have 
stable_ids starting with ASMPATCH and are stored in the otherfeatures 
database.

The annotation for HG79_PATCH is visible in the browser here:

I hope this is of help.

Kind regards,

Hervé Pagès wrote:
> Hi,
> Related to this thread from Oct 2010:
>   http://lists.ensembl.org/pipermail/dev/2010-October/000304.html
> In Ensembl release 64 (and maybe in previous releases, I didn't
> check), the 'seq_region' table for homo sapiens
> ftp://ftp.ensembl.org/pub/release-64/mysql/homo_sapiens_core_64_37/seq_region.txt.gz 
> contains entries for some of the "patch" sequences that belong to
> GRCh37.p5. Those "patch" sequences are named with the _PATCH suffix
> (e.g. HG7_PATCH, HG79_PATCH, HG506_HG1000_1_PATCH, etc...),
> and each "patch" has 2 entries in the table. For example, here are
> the 2 rows for HG79_PATCH:
> seq_region_id        name  coord_system_id     length
>     100965615  HG79_PATCH                2  141223844
>    1000157396  HG79_PATCH                3     330164
> According to the 'coord_system' table, coord_system_id 2 and 3
> correspond to "chromosome" and "supercontig", respectively.
> So one possible interpretation could be that the 2 lengths
> reported for HG79_PATCH are (1) the length of the chromosome
> that this patch belongs to, and (2) the length of the patched
> region.
> However, according to this page
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79 
> HG79_PATCH belongs to chr9 and is mapped to region
> 136049443 - 136317858. So it's mapped to a region of length
> 268416, but the 2nd length reported for the patch is 330164.
> That seems to confirm what the OP reported in the above thread
> i.e. that the patch is replacing a region in the reference genome
> by a larger region.
> Also, according to this page
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml 
> the length of chr9 is 141213431. But the first length reported
> for HG79_PATCH in the 'seq_region' table is 141223844, which is
> the length of chr9 + 10413.
> Where are those 10413 extra nucleotides coming from? Could it be
> that this first length reported for HG79_PATCH is the length of
> chr9 *after* its alteration by the patch? But that doesn't seem
> to be the case either since this alteration would add 61748 bases
> to chr9 (330164 - 268416).
> So my question is: what are those 2 lengths reported for HG79_PATCH,
> and for the "patch" sequences in general?
> Thanks in advance for any clarification.
> Cheers,
> H.
