[ensembl-dev] length of HG79_PATCH

Hervé Pagès hpages at fhcrc.org
Mon Nov 21 00:21:46 GMT 2011


Hi Kerstin,

On 11-11-16 01:02 AM, Kerstin Howe wrote:
> Dear Herve,
>
> I'm writing to you as a member of the Genome Reference Consortium, who also happens to be on ensembl-dev.
>
> I think there's a misunderstanding here, since you are comparing coordinates for a genome issue report (HG-79) with coordinates for the patch that was applied to fix it (HG79_PATCH). These are two closely related, but still different, entities.
>
> Genome issue HG-79 was reported on two neighbouring clones that didn't represent an existing haplotype, and their coordinates are documented at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79. The link on this page to Ensembl takes you to the exact location of the reported issue.
>
> This issue was then investigated and a stretch of genomic sequence was released (HG79_PATCH) to fix the above issue. The coordinates of this fix patch aligning to the GRCh37 reference genome are reported in ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt.
>
> I take your point though that it would make sense to also report the details about the applied fix on the genome issue page. This is currently being implemented.
>
> I hope this helps,

Thank you very much. Making the details about the applied fix more
visible will indeed help.

Cheers,
H.

>
> Kerstin
>
>
>
> On 15 Nov 2011, at 20:02, Hervé Pagès wrote:
>
>> Thanks Susan for your very detailed answer!
>>
>> I guess my confusion came from the fact that this page at NCBI
>>
>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>
>> suggests that HG79_PATCH is mapped to 136,049,443-136,317,858
>> on chr9, which doesn't seem to be correct.
>>
>> So assuming that the correct information is provided by
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>>
>> then everything makes sense!
>>
>> Also thanks for providing the URL for displaying HG79_PATCH in the
>> Ensembl browser. Your URL seems to be more appropriate than the URL
>> provided on
>>
>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>
>> The URL provided on that page is misleading.
>>
>> Cheers,
>> H.
>>
>>
>> On 11-11-15 02:33 AM, Susan Fairley wrote:
>>> Hi,
>>>
>>> As you mentioned, for each of the GRC assembly patches in Ensembl there
>>> are two entries in the seq_region table. As you also noted, one of these
>>> corresponds to a supercontig and the other a chromosome.
>>>
>>> The supercontig is the patch scaffold/supercontig released by GRC. The
>>> length corresponds to the length of the patch sequence. For HG79_PATCH,
>>> this is 330,164 bases.
>>>
>>> The chromosome is the patched version of the chromosome, i.e. what you
>>> would get if you incorporated the patch into the chromosome it is
>>> located on.
>>>
>>> As you say, HG79_PATCH is located on chromosome 9 (this can be seen in
>>> the assembly_exception table as well).
>>>
>>> The original chromosome 9 has a length of 141,213,431 bases.
>>>
>>> Looking in the assembly exception table or this file, provided by GRC,
>>> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>>>
>>>
>>> ensro at ens-livemirror : homo_sapiens_core_64_37>select seq_region.name,
>>> seq_region_start, seq_region_end, exc_seq_region_start,
>>> exc_seq_region_end from seq_region, assembly_exception where
>>> name='HG79_PATCH' and
>>> seq_region.seq_region_id=assembly_exception.seq_region_id;
>>> +------------+------------------+----------------+----------------------+--------------------+
>>> | name       | seq_region_start | seq_region_end | exc_seq_region_start | exc_seq_region_end |
>>> +------------+------------------+----------------+----------------------+--------------------+
>>> | HG79_PATCH |        136049442 |      136379605 |            136049442 |          136369192 |
>>> +------------+------------------+----------------+----------------------+--------------------+
>>> 1 row in set (0.00 sec)
>>>
>>> we can see that the patch replaces the region of chromosome 9 from
>>> 136,049,442 to 136,369,192.
>>>
>>> Consequently, a region of length 319,751 bases is removed from
>>> chromosome 9 (136,369,192 - 136,049,441 = 319,751) and replaced with the
>>> patch.
>>>
>>> So, chromosome 9 (141,213,431) has 319,751 bases removed (giving
>>> 140,893,680 bases) and then 330,164 bases added, giving 141,223,844 as
>>> the length of the patched chromosome. To differentiate the patched
>>> chromosome from the primary version of chromosome 9 it is given the same
>>> name as the patch.
>>>
>>> In this case, the patched version of the chromosome is 10,413 bases
>>> longer because the patch is 10,413 bases longer than the region it
>>> replaces. Patches can, however, also be shorter or the same length as
>>> the regions they replace.
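
The arithmetic above can be checked with a short Python sketch. The numbers are the ones quoted in this message; the function and variable names are made up for illustration, and the coordinates are assumed to be 1-based and inclusive, which is what the subtraction in the message implies.

```python
# Values quoted in the message above (GRCh37.p5, Ensembl release 64).
CHR9_LENGTH = 141_213_431   # primary chromosome 9
PATCH_LENGTH = 330_164      # HG79_PATCH supercontig
EXC_START = 136_049_442     # exc_seq_region_start
EXC_END = 136_369_192       # exc_seq_region_end


def inclusive_length(start: int, end: int) -> int:
    """Length of a coordinate range, assuming 1-based inclusive ends."""
    return end - start + 1


def patched_length(chrom_length: int, replaced: int, patch: int) -> int:
    """Chromosome length after swapping `replaced` bases for `patch` bases."""
    return chrom_length - replaced + patch


replaced = inclusive_length(EXC_START, EXC_END)
new_length = patched_length(CHR9_LENGTH, replaced, PATCH_LENGTH)
growth = new_length - CHR9_LENGTH

print(replaced, new_length, growth)  # 319751 141223844 10413
```

A negative `growth` would correspond to a patch that is shorter than the region it replaces, and zero to one of the same length.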
>>>
>>> Also, it is worth noting that not all patches have names ending in
>>> '_PATCH'. As you will see in the GRC file above, patches exist with
>>> names like HSCHR3_1_CTG2_1.
>>>
>>> To identify the patches in the database, you could use the following query:
>>>
>>> homo_sapiens_core_64_37>select count(*) from attrib_type,
>>> seq_region_attrib, seq_region where attrib_type.code in
>>> ('patch_fix','patch_novel') and attrib_type.attrib_type_id =
>>> seq_region_attrib.attrib_type_id and seq_region_attrib.seq_region_id =
>>> seq_region.seq_region_id ;
>>> +----------+
>>> | count(*) |
>>> +----------+
>>> |      105 |
>>> +----------+
>>> 1 row in set (0.00 sec)
>>>
>>> This identifies both the novel and the fix patches, indicated by the
>>> attrib_types with codes 'patch_fix' and 'patch_novel'.
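
To list the patch names rather than count them, the same three-table join can be written with explicit JOINs. The sketch below runs it against a toy in-memory SQLite database, not the real Ensembl MySQL server: the tables are cut down to the columns the join needs, and the ids and attribute assignments in the sample rows are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the Ensembl core tables (assumption: only
-- the columns used by the join are modelled here).
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attrib_type (attrib_type_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE seq_region_attrib (seq_region_id INTEGER, attrib_type_id INTEGER);

-- Toy rows; ids and attribute assignments are made up.
INSERT INTO seq_region VALUES (1, 'HG79_PATCH'), (2, 'HSCHR3_1_CTG2_1'), (3, '9');
INSERT INTO attrib_type VALUES (10, 'patch_fix'), (11, 'patch_novel'), (12, 'other');
INSERT INTO seq_region_attrib VALUES (1, 10), (2, 11), (3, 12);
""")

PATCH_QUERY = """
SELECT sr.name, atype.code
FROM seq_region AS sr
JOIN seq_region_attrib AS sra ON sra.seq_region_id = sr.seq_region_id
JOIN attrib_type AS atype ON atype.attrib_type_id = sra.attrib_type_id
WHERE atype.code IN ('patch_fix', 'patch_novel')
ORDER BY sr.name
"""

patches = conn.execute(PATCH_QUERY).fetchall()
for name, code in patches:
    print(name, code)
```

Because the filter is on the attribute code, regions such as HSCHR3_1_CTG2_1 are picked up even though their names do not end in '_PATCH'.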
>>>
>>> Gene annotation is stored on the patch regions of the patched
>>> chromosomes. Currently, the annotation strategy for the patches combines
>>> three sets of annotation. Manual annotation from the Havana group is
>>> given precedence and can be found in the core database. It is
>>> supplemented, however, first with annotation projected from the primary
>>> assembly and subsequently with annotation built based on the alignment
>>> of evidence to the patches. These second and third categories have
>>> stable_ids starting with ASMPATCH and are stored in the otherfeatures
>>> database.
>>>
>>> The annotation for HG79_PATCH is visible in the browser here:
>>> http://www.ensembl.org/Homo_sapiens/Location/View?db=core;h=Q4SWC8.1%20%28802-855%29;r=HG79_PATCH:136049442-136379605
>>>
>>>
>>> I hope this is of help.
>>>
>>> Kind regards,
>>> Susan.
>>>
>>> Hervé Pagès wrote:
>>>> Hi,
>>>>
>>>> Related to this thread from Oct 2010:
>>>>
>>>> http://lists.ensembl.org/pipermail/dev/2010-October/000304.html
>>>>
>>>> In Ensembl release 64 (and maybe in previous releases, I didn't
>>>> check), the 'seq_region' table for homo sapiens
>>>>
>>>>
>>>> ftp://ftp.ensembl.org/pub/release-64/mysql/homo_sapiens_core_64_37/seq_region.txt.gz
>>>>
>>>>
>>>> contains entries for some of the "patch" sequences that belong to
>>>> GRCh37.p5. Those "patch" sequences are named with the _PATCH suffix
>>>> (e.g. HG7_PATCH, HG79_PATCH, HG506_HG1000_1_PATCH, etc...),
>>>> and each "patch" has 2 entries in the table. For example, here are
>>>> the 2 rows for HG79_PATCH:
>>>>
>>>> seq_region_id  name        coord_system_id  length
>>>> 100965615      HG79_PATCH  2                141223844
>>>> 1000157396     HG79_PATCH  3                330164
>>>>
>>>> According to the 'coord_system' table, coord_system_id 2 and 3
>>>> correspond to "chromosome" and "supercontig", respectively.
>>>> So one possible interpretation could be that the 2 lengths
>>>> reported for HG79_PATCH are (1) the length of the chromosome
>>>> that this patch belongs to, and (2) the length of the patched
>>>> region.
>>>>
>>>> However, according to this page
>>>>
>>>>
>>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>>>
>>>>
>>>> HG79_PATCH belongs to chr9 and is mapped to region
>>>> 136049443 - 136317858. So it's mapped to a region of length
>>>> 268416, but the 2nd length reported for the patch is 330164.
>>>> That seems to confirm what the OP reported in the above thread
>>>> i.e. that the patch is replacing a region in the reference genome
>>>> by a larger region.
>>>>
>>>> Also, according to this page
>>>>
>>>>
>>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
>>>>
>>>>
>>>> the length of chr9 is 141213431. But the first length reported
>>>> for HG79_PATCH in the 'seq_region' table is 141223844, which is
>>>> the length of chr9 + 10413.
>>>>
>>>> Where are those 10413 extra nucleotides coming from? Could it be
>>>> that this first length reported for HG79_PATCH is the length of
>>>> chr9 *after* its alteration by the patch? But that doesn't seem
>>>> to be the case either since this alteration would add 61748 bases
>>>> to chr9 (330164 - 268416).
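
The mismatch described above can be reproduced in a few lines of Python. This is only a sketch using the figures quoted in this message; the names are illustrative, and the region taken from the issue page is assumed to be 1-based and inclusive.

```python
PATCH_LENGTH = 330_164              # supercontig row in seq_region
CHR9_LENGTH = 141_213_431           # chr9 length from the GRC data page
PATCHED_CHR9_LENGTH = 141_223_844   # chromosome row named HG79_PATCH

# Region shown on the HG-79 issue page (start, end).
ISSUE_PAGE_REGION = (136_049_443, 136_317_858)


def inclusive_length(start: int, end: int) -> int:
    """Length of a coordinate range, assuming 1-based inclusive ends."""
    return end - start + 1


issue_region_length = inclusive_length(*ISSUE_PAGE_REGION)
expected_growth = PATCH_LENGTH - issue_region_length
observed_growth = PATCHED_CHR9_LENGTH - CHR9_LENGTH

print(issue_region_length, expected_growth, observed_growth)
# 268416 61748 10413
```

The two growth figures disagree, so the region on the issue page cannot be the region the patch actually replaces.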
>>>>
>>>> So my question is: what are those 2 lengths reported for HG79_PATCH,
>>>> and for the "patch" sequences in general?
>>>>
>>>> Thanks in advance for any clarification.
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/
>
> --
> Dr. Kerstin Howe
> Senior Scientific Manager
> Genome Reference Informatics (Team 135)
> kerstin at sanger.ac.uk
>
> Wellcome Trust Sanger Institute
> Hinxton, Cambridge CB10 1SA, UK
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
