[ensembl-dev] Ensembl 59 Haplotype Sequences

Susan Fairley sf7 at sanger.ac.uk
Wed Oct 20 16:18:38 BST 2010


Hi,

The method you describe for calculating the length of the alternative 
chromosomes is correct. It should be: (length of the original chromosome 
- length of the region being replaced) + haplotype length = 
alternative chromosome length.
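
As a rough illustration, the calculation can be sketched in Python using 
the APD figures from the original message quoted below. The function and 
parameter names are made up for this example and are not part of any 
Ensembl or GRC tool; coordinates are assumed to be 1-based and inclusive, 
as described for alt_locus_scaf2primary.pos.

```python
def alt_chromosome_length(full_length, chrom_start, chrom_end,
                          alt_start, alt_end):
    """Predict the length of a full alternative chromosome.

    All coordinates are assumed to be 1-based and inclusive.
    """
    # Size of the region ("hole") removed from the original chromosome.
    replaced = chrom_end - chrom_start + 1
    # Size of the haplotype region inserted in its place.
    inserted = alt_end - alt_start + 1
    return full_length - replaced + inserted


# APD haplotype on chromosome 6 (values taken from the quoted message).
apd_length = alt_chromosome_length(
    full_length=171115067,
    chrom_start=28696604,
    chrom_end=33335493,
    alt_start=1,
    alt_end=4622290,
)
print(apd_length)  # 171098467
```

For the five affected haplotypes, the length reported in the release 
should come out one base longer than this prediction.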

Looking at the data, the chromosome lengths were one base too long for 
five of the nine haplotypes. As a result, an extra base (an N) was added 
at the end of each of these five alternative chromosomes. In other 
respects, the chromosome sequences are correct. The scaffold level 
sequences were not affected.

The five alternative chromosomes where this occurred are:
HSCHR6_MHC_MANN
HSCHR6_MHC_MCF
HSCHR6_MHC_SSTO
HSCHR4_1
HSCHR17_1
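
Until the fix is released, one possible workaround is to trim the extra 
base yourself. The following is only a sketch: the helper name is 
hypothetical, and it assumes you have already loaded the sequence as a 
string and worked out the expected length with the formula above. It 
trims only when the sequence is exactly one base too long and ends in N, 
since a chromosome sequence may legitimately end in N.

```python
# Names of the five affected alternative chromosomes, as listed above.
AFFECTED = {
    "HSCHR6_MHC_MANN",
    "HSCHR6_MHC_MCF",
    "HSCHR6_MHC_SSTO",
    "HSCHR4_1",
    "HSCHR17_1",
}


def drop_extra_base(name, sequence, expected_length):
    """Return the sequence without the spurious trailing N, if present.

    Hypothetical helper: the sequence is trimmed only when it belongs to
    an affected haplotype, is exactly one base longer than expected, and
    its last base is an N. Otherwise it is returned unchanged.
    """
    if (
        name in AFFECTED
        and len(sequence) == expected_length + 1
        and sequence[-1] in "Nn"
    ):
        return sequence[:-1]
    return sequence
```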

As release 60 is at an advanced stage of production, we will sadly not 
be able to alter this for the forthcoming release. However, it will be 
corrected in release 61.

Thank you for bringing this to our attention. I hope that it has not 
caused you too much inconvenience.

Kind regards,
Susan.

Bio X2Y wrote:
> Hi,
> 
> I understand that Ensembl 59 is based on GRCh37.p1.
> 
> For haplotypes, GRCh37 seems to include sequences for the alternative 
> part of the target chromosome, rather than a full alternative version of 
> the chromosome. Ensembl seems to take the other approach, releasing a 
> full-sized alternative chromosome sequence (at least for file downloads).
> 
> Intuitively, I imagine this is done by identifying the region in the 
> original chromosome that corresponds to the alternative region, and 
> replacing it with the alternative region.
> 
> When I try to verify this, however, I seem to be seeing an off-by-one 
> error for some haplotypes, and not for others.
> 
> GRCh37 releases a small file (alt_locus_scaf2primary.pos) with each 
> haplotype, and this seems to provide the coordinates (start from 1, 
> inclusive) that determine how to insert the alternative sequence into 
> the parent chromosome. For example, the following details are provided 
> for the APD haplotype for the chromosome 6 MHC:
> 
> Chrom_start = 28696604
> Chrom_end = 33335493
> Alt_loci_start = 1
> Alt_loci_end = 4622290
> 
> The sequence size of APD is 4622290 in GRCh37, and the full length APD 
> haplotype in Ensembl is 171098467.
> Since the original chromosome 6 is length 171115067, I would intuitively 
> think that the following procedure can be used to predict the Ensembl 
> size for the full haplotype:
> 
> (Full_chromosome_length - [chrom_end - chrom_start + 1] + [alt_loci_end 
> - alt_loci_start + 1])
> Where we can imagine that chrom_start and chrom_end describe the region 
> ("hole") in the original chromosome that is replaced with the 
> alternative region.
> 
> Indeed, this works for APD - we get the Ensembl figure of 171098467.
> 
> However, it doesn't work for the haplotypes where the size of the "hole" 
> in the original sequence is smaller than the region being inserted. In 
> these cases, it is off-by-one.
> 
> Also, it doesn't work for the chromosome 4 haplotype, even though the 
> "hole" in the original sequence is larger than the region being inserted.
> 
> Could someone perhaps explain why I'm seeing this? I assume I'm missing 
> something simple.
> 
> Thanks for your time.