[ensembl-dev] length of HG79_PATCH

Hervé Pagès hpages at fhcrc.org
Tue Nov 15 20:02:17 GMT 2011


Thanks Susan for your very detailed answer!

I guess my confusion came from the fact that this page at NCBI

 
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79 


suggests that HG79_PATCH is mapped to 136,049,443-136,317,858
on chr9, which doesn't seem to be correct.

So assuming that the correct information is provided by

 
ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt

then everything makes sense!

Also thanks for providing the URL for displaying HG79_PATCH in the
Ensembl browser. Your URL seems to be more appropriate than the URL
provided on

 
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79

The URL provided on that page is misleading.

Cheers,
H.


On 11-11-15 02:33 AM, Susan Fairley wrote:
> Hi,
>
> As you mentioned, for each of the GRC assembly patches in Ensembl there
> are two entries in the seq_region table. As you also noted, one of these
> corresponds to a supercontig and the other a chromosome.
>
> The supercontig is the patch scaffold/supercontig released by GRC. The
> length corresponds to the length of the patch sequence. For HG79_PATCH,
> this is 330,164 bases.
>
> The chromosome is the patched version of the chromosome, i.e. what you
> would get if you incorporated the patch into the chromosome it is
> located on.
>
> As you say, HG79_PATCH is located on chromosome 9 (this can be seen in
> the assembly_exception table as well).
>
> The original chromosome 9 has a length of 141,213,431 bases.
>
> Looking in the assembly_exception table or this file, provided by GRC,
> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>
>
> ensro at ens-livemirror : homo_sapiens_core_64_37 >select seq_region.name,
>   seq_region_start, seq_region_end, exc_seq_region_start, exc_seq_region_end
>   from seq_region, assembly_exception
>   where name='HG79_PATCH'
>   and seq_region.seq_region_id=assembly_exception.seq_region_id;
> +------------+------------------+----------------+----------------------+--------------------+
> | name       | seq_region_start | seq_region_end | exc_seq_region_start | exc_seq_region_end |
> +------------+------------------+----------------+----------------------+--------------------+
> | HG79_PATCH |        136049442 |      136379605 |            136049442 |          136369192 |
> +------------+------------------+----------------+----------------------+--------------------+
> 1 row in set (0.00 sec)
>
> we can see that the patch replaces the region of chromosome 9 from
> 136,049,442 to 136,369,192.
>
> Consequently, a region of length 319,751 bases is removed from
> chromosome 9 (136,369,192 - 136,049,441 = 319,751) and replaced with the
> patch.
>
> So, chromosome 9 (141,213,431) has 319,751 bases removed (giving
> 140,893,680 bases) and then 330,164 bases added, giving 141,223,844 as
> the length of the patched chromosome. To differentiate the patched
> chromosome from the primary version of chromosome 9 it is given the same
> name as the patch.
>
> In this case, the patched version of the chromosome is 10,413 bases
> longer because the patch is 10,413 bases longer than the region it
> replaces. Patches can, however, also be shorter or the same length as
> the regions they replace.
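
The arithmetic above can be double-checked with a few lines of Python. This is only a minimal sketch using the figures quoted in this thread; the variable and function names are my own, and it assumes the coordinates are 1-based and inclusive (which is what the subtraction above implies).

```python
# Figures quoted in this thread (1-based, inclusive coordinates assumed).
CHR9_LENGTH = 141_213_431                      # primary chromosome 9
PATCH_LENGTH = 330_164                         # HG79_PATCH supercontig
EXC_START, EXC_END = 136_049_442, 136_369_192  # region of chr9 being replaced


def inclusive_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def patched_length(chrom_len: int, replaced_len: int, patch_len: int) -> int:
    """Chromosome length after swapping the replaced region for the patch."""
    return chrom_len - replaced_len + patch_len


replaced = inclusive_length(EXC_START, EXC_END)
patched = patched_length(CHR9_LENGTH, replaced, PATCH_LENGTH)

print("replaced region:", replaced)
print("patched chromosome:", patched)
print("difference:", patched - CHR9_LENGTH)
```

The same helper applied to seq_region_start and seq_region_end (136049442 and 136379605) gives back the supercontig length, which is a handy sanity check on the inclusive-coordinate assumption.
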
>
> Also, it is worth noting that not all patches have names ending in
> '_PATCH'. As you will see in the GRC file above, patches exist with
> names like HSCHR3_1_CTG2_1.
>
> To identify the patches in the database, you could use the following query:
>
> homo_sapiens_core_64_37 >select count(*)
>   from attrib_type, seq_region_attrib, seq_region
>   where attrib_type.code in ('patch_fix','patch_novel')
>   and attrib_type.attrib_type_id = seq_region_attrib.attrib_type_id
>   and seq_region_attrib.seq_region_id = seq_region.seq_region_id ;
> +----------+
> | count(*) |
> +----------+
> |      105 |
> +----------+
> 1 row in set (0.00 sec)
>
> This identifies both the novel and the fix patches, indicated by the
> attrib_types with codes 'patch_fix' and 'patch_novel'.
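
A small variation on that query lists the patch names and their type instead of counting them. The sketch below runs it from Python against a toy in-memory SQLite database, not the real MySQL server. The table and column names are taken from the query above; the rows, IDs and the 'toy_other' code are made up purely for illustration, and the toy tables carry only the columns the query needs.

```python
import sqlite3

# Toy stand-in for the core database: hypothetical rows, minimal columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attrib_type (attrib_type_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE seq_region_attrib (seq_region_id INTEGER, attrib_type_id INTEGER);

INSERT INTO seq_region VALUES (1, 'TOY_FIX_A');
INSERT INTO seq_region VALUES (2, 'TOY_NOVEL_B');
INSERT INTO seq_region VALUES (3, 'TOY_PRIMARY_C');

INSERT INTO attrib_type VALUES (10, 'patch_fix');
INSERT INTO attrib_type VALUES (11, 'patch_novel');
INSERT INTO attrib_type VALUES (12, 'toy_other');

INSERT INTO seq_region_attrib VALUES (1, 10);
INSERT INTO seq_region_attrib VALUES (2, 11);
INSERT INTO seq_region_attrib VALUES (3, 12);
""")

# Same joins and filter as the count(*) query, but returning name and type.
PATCH_QUERY = """
SELECT seq_region.name, attrib_type.code
FROM attrib_type
JOIN seq_region_attrib
  ON attrib_type.attrib_type_id = seq_region_attrib.attrib_type_id
JOIN seq_region
  ON seq_region_attrib.seq_region_id = seq_region.seq_region_id
WHERE attrib_type.code IN ('patch_fix', 'patch_novel')
ORDER BY seq_region.name
"""

patches = conn.execute(PATCH_QUERY).fetchall()
for name, code in patches:
    print(name, code)
```

The SELECT itself should be usable as-is at the mysql prompt, since it only rewrites the implicit joins of the original query as explicit JOINs.
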
>
> Gene annotation is stored on the patch regions of the patched
> chromosomes. Currently, the annotation strategy for the patches combines
> three sets of annotation. Manual annotation from the Havana group is
> given precedence and can be found in the core database. It is
> supplemented, however, first with annotation projected from the primary
> assembly and subsequently with annotation built from the alignment
> of evidence to the patches. These second and third categories have
> stable_ids starting with ASMPATCH and are stored in the otherfeatures
> database.
>
> The annotation for HG79_PATCH is visible in the browser here:
> http://www.ensembl.org/Homo_sapiens/Location/View?db=core;h=Q4SWC8.1%20%28802-855%29;r=HG79_PATCH:136049442-136379605
>
>
> I hope this is of help.
>
> Kind regards,
> Susan.
>
> Hervé Pagès wrote:
>> Hi,
>>
>> Related to this thread from Oct 2010:
>>
>> http://lists.ensembl.org/pipermail/dev/2010-October/000304.html
>>
>> In Ensembl release 64 (and maybe in previous releases, I didn't
>> check), the 'seq_region' table for homo sapiens
>>
>>
>> ftp://ftp.ensembl.org/pub/release-64/mysql/homo_sapiens_core_64_37/seq_region.txt.gz
>>
>>
>> contains entries for some of the "patch" sequences that belong to
>> GRCh37.p5. Those "patch" sequences are named with the _PATCH suffix
>> (e.g. HG7_PATCH, HG79_PATCH, HG506_HG1000_1_PATCH, etc...),
>> and each "patch" has 2 entries in the table. For example, here are
>> the 2 rows for HG79_PATCH:
>>
>> seq_region_id  name        coord_system_id  length
>> 100965615      HG79_PATCH  2                141223844
>> 1000157396     HG79_PATCH  3                330164
>>
>> According to the 'coord_system' table, coord_system_id 2 and 3
>> correspond to "chromosome" and "supercontig", respectively.
>> So one possible interpretation could be that the 2 lengths
>> reported for HG79_PATCH are (1) the length of the chromosome
>> that this patch belongs to, and (2) the length of the patched
>> region.
>>
>> However, according to this page
>>
>>
>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>
>>
>> HG79_PATCH belongs to chr9 and is mapped to region
>> 136049443 - 136317858. So it's mapped to a region of length
>> 268416, but the 2nd length reported for the patch is 330164.
>> That seems to confirm what the OP reported in the above thread
>> i.e. that the patch is replacing a region in the reference genome
>> by a larger region.
>>
>> Also, according to this page
>>
>>
>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
>>
>>
>> the length of chr9 is 141213431. But the first length reported
>> for HG79_PATCH in the 'seq_region' table is 141223844, which is
>> the length of chr9 + 10413.
>>
>> Where are those 10413 extra nucleotides coming from? Could it be
>> that this first length reported for HG79_PATCH is the length of
>> chr9 *after* its alteration by the patch? But that doesn't seem
>> to be the case either since this alteration would add 61748 bases
>> to chr9 (330164 - 268416).
>>
>> So my question is: what are those 2 lengths reported for HG79_PATCH,
>> and for the "patch" sequences in general?
>>
>> Thanks in advance for any clarification.
>>
>> Cheers,
>> H.
>>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



