[ensembl-dev] length of HG79_PATCH
Kerstin Howe
kerstin at sanger.ac.uk
Wed Nov 16 09:03:28 GMT 2011
Dear Herve,
I'm writing to you as a member of the Genome Reference Consortium, who also happens to be on ensembl-dev.
I think there's a misunderstanding here, since you are comparing coordinates for a genome issue report (HG-79) with coordinates for the patch that was applied to fix it (HG79_PATCH). These are two closely related, but still different, entities.
Genome issue HG-79 was reported on two neighbouring clones that didn't represent an existing haplotype, and their coordinates are documented at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79. The link on this page to Ensembl takes you to the exact location of the reported issue.
This issue was then investigated, and a stretch of genomic sequence (HG79_PATCH) was released to fix it. The coordinates at which this fix patch aligns to the GRCh37 reference genome are reported in ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt.
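To make the distinction concrete, the numbers quoted further down in this thread can be checked with a few lines of Python. This is only a sketch: it assumes every range is 1-based and inclusive, and the function and variable names are made up for illustration rather than taken from Ensembl or GRC.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive coordinate range (assumed convention)."""
    return end - start + 1


# All values below are copied from the messages quoted in this thread.
CHR9_LENGTH = 141_213_431                             # primary GRCh37 chromosome 9
ISSUE_REGION = (136_049_443, 136_317_858)             # HG-79 issue report on chr9
REPLACED_REGION = (136_049_442, 136_369_192)          # chr9 region replaced by the patch
PATCH_ON_PATCHED_CHR = (136_049_442, 136_379_605)     # HG79_PATCH on the patched chr9

issue_len = span_length(*ISSUE_REGION)
replaced_len = span_length(*REPLACED_REGION)
patch_len = span_length(*PATCH_ON_PATCHED_CHR)

# Patched chromosome = primary chromosome minus the replaced region plus the patch.
patched_chr_len = CHR9_LENGTH - replaced_len + patch_len

print("issue report region:", issue_len)
print("region replaced on chr9:", replaced_len)
print("patch sequence:", patch_len)
print("patched chromosome 9:", patched_chr_len)
print("net change:", patch_len - replaced_len)
```

The issue report spans 268,416 bases, whereas the region the patch actually replaces is 319,751 bases long. Only the latter, together with the patch length of 330,164, accounts for the 10,413 extra bases in the patched chromosome.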
I take your point though that it would make sense to also report the details about the applied fix on the genome issue page. This is currently being implemented.
I hope this helps,
Kerstin
On 15 Nov 2011, at 20:02, Hervé Pagès wrote:
> Thanks Susan for your very detailed answer!
>
> I guess my confusion came from the fact that this page at NCBI
>
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>
> suggests that HG79_PATCH is mapped to 136,049,443-136,317,858
> on chr9, which doesn't seem to be correct.
>
> So assuming that the correct information is provided by
>
> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>
> then everything makes sense!
>
> Also thanks for providing the URL for displaying HG79_PATCH in the
> Ensembl browser. Your URL seems to be more appropriate than the URL
> provided on
>
> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>
> The URL provided here is misleading.
>
> Cheers,
> H.
>
>
> On 11-11-15 02:33 AM, Susan Fairley wrote:
>> Hi,
>>
>> As you mentioned, for each of the GRC assembly patches in Ensembl there
>> are two entries in the seq_region table. As you also noted, one of these
>> corresponds to a supercontig and the other a chromosome.
>>
>> The supercontig is the patch scaffold/supercontig released by GRC. The
>> length corresponds to the length of the patch sequence. For HG79_PATCH,
>> this is 330,164 bases.
>>
>> The chromosome is the patched version of the chromosome, i.e. what you
>> would get if you incorporated the patch into the chromosome it is
>> located on.
>>
>> As you say, HG79_PATCH is located on chromosome 9 (this can be seen in
>> the assembly_exception table as well).
>>
>> The original chromosome 9 has a length of 141,213,431 bases.
>>
>> Looking in the assembly exception table or this file, provided by GRC,
>> ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p5/PATCHES/alt_scaffolds/alt_scaffold_placement.txt
>>
>>
>> ensro at ens-livemirror : homo_sapiens_core_64_37 >select seq_region.name,
>> seq_region_start, seq_region_end, exc_seq_region_start,
>> exc_seq_region_end from seq_region, assembly_exception where
>> name='HG79_PATCH' and
>> seq_region.seq_region_id=assembly_exception.seq_region_id;
>> +------------+------------------+----------------+----------------------+--------------------+
>> | name       | seq_region_start | seq_region_end | exc_seq_region_start | exc_seq_region_end |
>> +------------+------------------+----------------+----------------------+--------------------+
>> | HG79_PATCH |        136049442 |      136379605 |            136049442 |          136369192 |
>> +------------+------------------+----------------+----------------------+--------------------+
>> 1 row in set (0.00 sec)
>>
>> we can see that the patch replaces the region of chromosome 9 from
>> 136,049,442 to 136,369,192.
>>
>> Consequently, a region of length 319,751 bases is removed from
>> chromosome 9 (136,369,192 - 136,049,441 = 319,751) and replaced with the
>> patch.
>>
>> So, chromosome 9 (141,213,431) has 319,751 bases removed (giving
>> 140,893,680 bases) and then 330,164 bases added, giving 141,223,844 as
>> the length of the patched chromosome. To differentiate the patched
>> chromosome from the primary version of chromosome 9 it is given the same
>> name as the patch.
>>
>> In this case, the patched version of the chromosome is 10,413 bases
>> longer because the patch is 10,413 bases longer than the region it
>> replaces. Patches can, however, also be shorter or the same length as
>> the regions they replace.
>>
>> Also, it is worth noting that not all patches have names ending in
>> '_PATCH'. As you will see in the GRC file above, patches exist with
>> names like HSCHR3_1_CTG2_1.
>>
>> To identify the patches in the database, you could use the following query:
>>
>> homo_sapiens_core_64_37 >select count(*) from attrib_type,
>> seq_region_attrib, seq_region where attrib_type.code in
>> ('patch_fix','patch_novel') and attrib_type.attrib_type_id =
>> seq_region_attrib.attrib_type_id and seq_region_attrib.seq_region_id =
>> seq_region.seq_region_id ;
>> +----------+
>> | count(*) |
>> +----------+
>> | 105 |
>> +----------+
>> 1 row in set (0.00 sec)
>>
>> This identifies both the novel and the fix patches, indicated by the
>> attrib_types with codes 'patch_fix' and 'patch_novel'.
>>
>> Gene annotation is stored on the patch regions of the patched
>> chromosomes. Currently, the annotation strategy for the patches combines
>> three sets of annotation. Manual annotation from the Havana group is
>> given precedence and can be found in the core database. It is
>> supplemented, however, first with annotation projected from primary
>> assembly and subsequently with annotation built based on the alignment
>> of evidence to the patches. These second and third categories have
>> stable_ids starting with ASMPATCH and are stored in the otherfeatures
>> database.
>>
>> The annotation for HG79_PATCH is visible in the browser here:
>> http://www.ensembl.org/Homo_sapiens/Location/View?db=core;h=Q4SWC8.1%20%28802-855%29;r=HG79_PATCH:136049442-136379605
>>
>>
>> I hope this is of help.
>>
>> Kind regards,
>> Susan.
>>
>> Hervé Pagès wrote:
>>> Hi,
>>>
>>> Related to this thread from Oct 2010:
>>>
>>> http://lists.ensembl.org/pipermail/dev/2010-October/000304.html
>>>
>>> In Ensembl release 64 (and maybe in previous releases, I didn't
>>> check), the 'seq_region' table for homo sapiens
>>>
>>>
>>> ftp://ftp.ensembl.org/pub/release-64/mysql/homo_sapiens_core_64_37/seq_region.txt.gz
>>>
>>>
>>> contains entries for some of the "patch" sequences that belong to
>>> GRCh37.p5. Those "patch" sequences are named with the _PATCH suffix
>>> (e.g. HG7_PATCH, HG79_PATCH, HG506_HG1000_1_PATCH, etc...),
>>> and each "patch" has 2 entries in the table. For example, here are
>>> the 2 rows for HG79_PATCH:
>>>
>>> seq_region_id name coord_system_id length
>>> 100965615 HG79_PATCH 2 141223844
>>> 1000157396 HG79_PATCH 3 330164
>>>
>>> According to the 'coord_system' table, coord_system_id 2 and 3
>>> correspond to "chromosome" and "supercontig", respectively.
>>> So one possible interpretation could be that the 2 lengths
>>> reported for HG79_PATCH are (1) the length of the chromosome
>>> that this patch belongs to, and (2) the length of the patched
>>> region.
>>>
>>> However, according to this page
>>>
>>>
>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-79
>>>
>>>
>>> HG79_PATCH belongs to chr9 and is mapped to region
>>> 136049443 - 136317858. So it's mapped to a region of length
>>> 268416, but the 2nd length reported for the patch is 330164.
>>> That seems to confirm what the OP reported in the above thread
>>> i.e. that the patch is replacing a region in the reference genome
>>> by a larger region.
>>>
>>> Also, according to this page
>>>
>>>
>>> http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
>>>
>>>
>>> the length of chr9 is 141213431. But the first length reported
>>> for HG79_PATCH in the 'seq_region' table is 141223844, which is
>>> the length of chr9 + 10413.
>>>
>>> Where are those 10413 extra nucleotides coming from? Could it be
>>> that this first length reported for HG79_PATCH is the length of
>>> chr9 *after* its alteration by the patch? But that doesn't seem
>>> to be the case either since this alteration would add 61748 bases
>>> to chr9 (330164 - 268416).
>>>
>>> So my question is: what are those 2 lengths reported for HG79_PATCH,
>>> and for the "patch" sequences in general?
>>>
>>> Thanks in advance for any clarification.
>>>
>>> Cheers,
>>> H.
>>>
>>
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
--
Dr. Kerstin Howe
Senior Scientific Manager
Genome Reference Informatics (Team 135)
kerstin at sanger.ac.uk
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK