[ensembl-dev] length of HG79_PATCH

Kerstin Howe kerstin at sanger.ac.uk
Wed Nov 16 09:03:28 GMT 2011


Dear Herve,

I'm writing to you as a member of the Genome Reference Consortium, who also happens to be on ensembl-dev.

I think there's a misunderstanding here, since you are comparing coordinates for a genome issue report (HG-79) with coordinates for the patch that was applied to fix it (HG79_PATCH). These are two closely related, but still different, entities.

Genome issue HG-79 was reported on two neighbouring clones that didn't represent an existing haplotype, and their coordinates are documented at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79. The link on this page to Ensembl takes you to the exact location of the reported issue.

This issue was then investigated and a stretch of genomic sequence was released (HG79_PATCH) to fix the above issue. The coordinates of this fix patch aligning to the GRCh37 reference genome are reported in ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt. 
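
To make the distinction concrete, here is a minimal Python sketch that compares the two coordinate ranges using only the numbers quoted further down this thread. The variable and function names are illustrative, and the sketch assumes 1-based, inclusive coordinates as used in the quoted arithmetic.

```python
# Values quoted in this thread for chr9 (GRCh37.p5); coordinates are
# assumed to be 1-based and inclusive.
CHR9_LENGTH = 141_213_431                      # primary chromosome 9
PATCH_LENGTH = 330_164                         # HG79_PATCH scaffold
ISSUE_REGION = (136_049_443, 136_317_858)      # from the HG-79 issue page
PATCH_PLACEMENT = (136_049_442, 136_369_192)   # from the placement file


def region_length(start, end):
    """Length of an inclusive, 1-based region."""
    return end - start + 1


issue_length = region_length(*ISSUE_REGION)
replaced_length = region_length(*PATCH_PLACEMENT)

# Remove the replaced region from chr9, then add the patch sequence.
patched_chr9_length = CHR9_LENGTH - replaced_length + PATCH_LENGTH

print("issue region length:   ", issue_length)
print("replaced region length:", replaced_length)
print("patched chr9 length:   ", patched_chr9_length)
print("net change:            ", patched_chr9_length - CHR9_LENGTH)
```

Using the placement range rather than the issue range gives a net change of 10,413 bases, which matches the length reported for the patched chromosome in the seq_region table.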

I take your point though that it would make sense to also report the details about the applied fix on the genome issue page. This is currently being implemented.

I hope this helps,

Kerstin



On 15 Nov 2011, at 20:02, Hervé Pagès wrote:

> Thanks Susan for your very detailed answer!
> 
> I guess my confusion came from the fact that this page at NCBI
> 
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79 
> 
> suggests that HG79_PATCH is mapped to 136,049,443-136,317,858
> on chr9, which doesn't seem to be correct.
> 
> So assuming that the correct information is provided by
> 
> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
> 
> then everything makes sense!
> 
> Also thanks for providing the URL for displaying HG79_PATCH in the
> Ensembl browser. Your URL seems to be more appropriate than the URL
> provided on
> 
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
> 
> The URL provided here is misleading.
> 
> Cheers,
> H.
> 
> 
> On 11-11-15 02:33 AM, Susan Fairley wrote:
>> Hi,
>> 
>> As you mentioned, for each of the GRC assembly patches in Ensembl there
>> are two entries in the seq_region table. As you also noted, one of these
>> corresponds to a supercontig and the other a chromosome.
>> 
>> The supercontig is the patch scaffold/supercontig released by GRC. The
>> length corresponds to the length of the patch sequence. For HG79_PATCH,
>> this is 330,164 bases.
>> 
>> The chromosome is the patched version of the chromosome, i.e. what you
>> would get if you incorporated the patch into the chromosome it is
>> located on.
>> 
>> As you say, HG79_PATCH is located on chromosome 9 (this can be seen in
>> the assembly_exception table as well).
>> 
>> The original chromosome 9 has a length of 141,213,431 bases.
>> 
>> Looking in the assembly exception table or this file, provided by GRC,
>> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>> 
>> 
>> ensro at ens-livemirror : homo_sapiens_core_64_37 > select seq_region.name,
>>   seq_region_start, seq_region_end, exc_seq_region_start, exc_seq_region_end
>>   from seq_region, assembly_exception
>>   where name='HG79_PATCH'
>>   and seq_region.seq_region_id=assembly_exception.seq_region_id;
>> +------------+------------------+----------------+----------------------+--------------------+
>> | name       | seq_region_start | seq_region_end | exc_seq_region_start | exc_seq_region_end |
>> +------------+------------------+----------------+----------------------+--------------------+
>> | HG79_PATCH |        136049442 |      136379605 |            136049442 |          136369192 |
>> +------------+------------------+----------------+----------------------+--------------------+
>> 1 row in set (0.00 sec)
>> 
>> we can see that the patch replaces the region of chromosome 9 from
>> 136,049,442 to 136,369,192.
>> 
>> Consequently, a region of length 319,751 bases is removed from
>> chromosome 9 (136,369,192 - 136,049,441 = 319,751) and replaced with the
>> patch.
>> 
>> So, chromosome 9 (141,213,431) has 319,751 bases removed (giving
>> 140,893,680 bases) and then 330,164 bases added, giving 141,223,844 as
>> the length of the patched chromosome. To differentiate the patched
>> chromosome from the primary version of chromosome 9 it is given the same
>> name as the patch.
>> 
>> In this case, the patched version of the chromosome is 10,413 bases
>> longer because the patch is 10,413 bases longer than the region it
>> replaces. Patches can, however, also be shorter or the same length as
>> the regions they replace.
>> 
>> Also, it is worth noting that not all patches have names ending in
>> '_PATCH'. As you will see in the GRC file above, patches exist with
>> names like HSCHR3_1_CTG2_1.
>> 
>> To identify the patches in the database, you could use the following query:
>> 
>> homo_sapiens_core_64_37 >select count(*) from attrib_type,
>> seq_region_attrib, seq_region where attrib_type.code in
>> ('patch_fix','patch_novel') and attrib_type.attrib_type_id =
>> seq_region_attrib.attrib_type_id and seq_region_attrib.seq_region_id =
>> seq_region.seq_region_id ;
>> +----------+
>> | count(*) |
>> +----------+
>> |      105 |
>> +----------+
>> 1 row in set (0.00 sec)
>> 
>> This identifies both the novel and the fix patches, indicated by the
>> attrib_types with codes 'patch_fix' and 'patch_novel'.
>> 
>> Gene annotation is stored on the patch regions of the patched
>> chromosomes. Currently, the annotation strategy for the patches combines
>> three sets of annotation. Manual annotation from the Havana group is
>> given precedence and can be found in the core database. It is
>> supplemented, however, first with annotation projected from primary
>> assembly and subsequently with annotation built based on the alignment
>> of evidence to the patches. These second and third categories have
>> stable_ids starting with ASMPATCH and are stored in the otherfeatures
>> database.
>> 
>> The annotation for HG79_PATCH is visible in the browser here:
>> http://www.ensembl.org/Homo_sapiens/Location/View?db=core;h=Q4SWC8.1%20%28802-855%29;r=HG79_PATCH:136049442-136379605
>> 
>> 
>> I hope this is of help.
>> 
>> Kind regards,
>> Susan.
>> 
>> Hervé Pagès wrote:
>>> Hi,
>>> 
>>> Related to this thread from Oct 2010:
>>> 
>>> http://lists.ensembl.org/pipermail/dev/2010-October/000304.html
>>> 
>>> In Ensembl release 64 (and maybe in previous releases, I didn't
>>> check), the 'seq_region' table for homo sapiens
>>> 
>>> 
>>> ftp://ftp.ensembl.org/pub/release-64/mysql/homo_sapiens_core_64_37/seq_region.txt.gz
>>> 
>>> 
>>> contains entries for some of the "patch" sequences that belong to
>>> GRCh37.p5. Those "patch" sequences are named with the _PATCH suffix
>>> (e.g. HG7_PATCH, HG79_PATCH, HG506_HG1000_1_PATCH, etc...),
>>> and each "patch" has 2 entries in the table. For example, here are
>>> the 2 rows for HG79_PATCH:
>>> 
>>> seq_region_id  name        coord_system_id  length
>>> 100965615      HG79_PATCH  2                141223844
>>> 1000157396     HG79_PATCH  3                330164
>>> 
>>> According to the 'coord_system' table, coord_system_id 2 and 3
>>> correspond to "chromosome" and "supercontig", respectively.
>>> So one possible interpretation could be that the 2 lengths
>>> reported for HG79_PATCH are (1) the length of the chromosome
>>> that this patch belongs to, and (2) the length of the patched
>>> region.
>>> 
>>> However, according to this page
>>> 
>>> 
>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>> 
>>> 
>>> HG79_PATCH belongs to chr9 and is mapped to region
>>> 136049443 - 136317858. So it's mapped to a region of length
>>> 268416, but the 2nd length reported for the patch is 330164.
>>> That seems to confirm what the OP reported in the above thread
>>> i.e. that the patch is replacing a region in the reference genome
>>> by a larger region.
>>> 
>>> Also, according to this page
>>> 
>>> 
>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
>>> 
>>> 
>>> the length of chr9 is 141213431. But the first length reported
>>> for HG79_PATCH in the 'seq_region' table is 141223844, which is
>>> the length of chr9 + 10413.
>>> 
>>> Where are those 10413 extra nucleotides coming from? Could it be
>>> that this first length reported for HG79_PATCH is the length of
>>> chr9 *after* its alteration by the patch? But that doesn't seem
>>> to be the case either since this alteration would add 61748 bases
>>> to chr9 (330164 - 268416).
>>> 
>>> So my question is: what are those 2 lengths reported for HG79_PATCH,
>>> and for the "patch" sequences in general?
>>> 
>>> Thanks in advance for any clarification.
>>> 
>>> Cheers,
>>> H.
>>> 
>> 
>> 
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> List admin (including subscribe/unsubscribe): http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/

--
Dr. Kerstin Howe
Senior Scientific Manager
Genome Reference Informatics (Team 135)
kerstin at sanger.ac.uk

Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK



