[ensembl-dev] Differences in Ensembl GRCh38 fasta and NCBI GRCh38 fasta

Wed Dec 23 21:40:15 GMT 2015

I have some questions about differences I've observed between an 
NCBI-provided GRCh38 reference and the one available from Ensembl 
(Release 76; Homo_sapiens.GRCh38.dna.toplevel.fa). I initially believed 
they would be identical except for chromosome names and N padding, but 
they do not appear to be. I have searched online for answers but found 
nothing.

NCBI has provided a reference for use in mapping NGS reads here: 
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ 
and I believe 1000 Genomes has also used a very similar reference.

Typically, I would use Ensembl to assemble my reference sequence, but I 
would like to include the alternate haplotypes, and the N padding in 
those sequences seemed likely to cause problems when building the 
necessary indices. I therefore looked into using the NCBI reference, 
but wanted to confirm that the coordinate systems and sequence are the 
same so that I can continue to use VEP downstream, even for alternate 
haplotypes. I found some unexpected differences and was hoping to get 
more information.

1) Some sequence within the primary chromosomes is masked (as Ns) in 
Ensembl but not in the NCBI reference. For example:

diff /tmp/ensembl_chr12.fa /tmp/genbank_chr12.fa  | head -100
579551,579554c579551,579554
< ACGGGATTTCTTCATATAATGTTAGACAGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
< NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
< NNNNNNNNNNNNNNNNNNNNNATCATTCTCAGAAACTACTTTGTGATGTGTGCGTTCAAC
---
> ACGGGATTTCTTCATATAATGTTAGACAGAAGAATTCTCAGAAACTTATTTGTGTTATAT
> TTATTCAACTAGCAGAATTGAAACTTCCTTTTGACAGAGCAGATTTGATACACTCCTTTT
> GTGGAATTTCCAGGTGCAGATTTCATTCGCTTTGAGGCCAATGGTAGAAAAGGACATATA
> TTCGTAGAAAAACAAGAGAGAATCATTCTCAGAAACTACTTTGTGATGTGTGCGTTCAAC

Does Ensembl perform additional N masking, even for the toplevel file? 
Under what criteria?
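
A check like the diff above can also be sketched in Python, which makes 
it easier to list every region that is N in one file but real sequence 
in the other. This is only a minimal sketch under assumptions: the 
function names are hypothetical, each input file is assumed to hold a 
single chromosome, and both sequences are assumed to share the same 
coordinates.

```python
import re


def read_fasta_sequence(path):
    """Return the first record of a FASTA file as one uppercase string."""
    parts = []
    seen_header = False
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if seen_header:
                    break  # stop at the start of a second record
                seen_header = True
                continue
            parts.append(line.strip().upper())
    return "".join(parts)


def n_runs(seq):
    """Return (start, end) 0-based half-open intervals for each run of N."""
    return [m.span() for m in re.finditer(r"N+", seq.upper())]


def masked_only_in_first(first, second):
    """Return N runs of `first` where `second` has a non-N base.

    Assumes both sequences use the same coordinate system.
    """
    result = []
    for start, end in n_runs(first):
        other = second[start:end].upper()
        if any(base != "N" for base in other):
            result.append((start, end))
    return result
```

With the two chr12 files from the diff, usage would look like 
masked_only_in_first(read_fasta_sequence("/tmp/ensembl_chr12.fa"), 
read_fasta_sequence("/tmp/genbank_chr12.fa")).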

2) Some of the alternative haplotypes appear to be reverse complemented 
from what I see in the NCBI reference.

For example, chr6_KI270758v1_alt (aka CHR_HSCHR6_8_CTG1) is reverse 
complemented in Ensembl relative to the sequence in NCBI 
(http://www.ncbi.nlm.nih.gov/nuccore/KI270758). Is there some way to 
determine through the API which contigs this is applied to? How is it 
decided which ones are reversed, and why?
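
The orientation test for a single contig can be sketched as below. It 
is a sketch only: the function names are hypothetical, both sequences 
are assumed to be already loaded as strings, and leading/trailing Ns 
are stripped from both on the assumption that they are padding rather 
than part of the contig.

```python
# Complement table covering A/C/G/T, the IUPAC ambiguity codes, and N.
_COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orientation(ensembl_seq, ncbi_seq):
    """Classify how two versions of one contig relate to each other.

    Returns "same", "reverse_complement", or "different".
    Assumption: Ns at either end are padding and can be ignored.
    """
    a = ensembl_seq.upper().strip("N")
    b = ncbi_seq.upper().strip("N")
    if a == b:
        return "same"
    if a == reverse_complement(b):
        return "reverse_complement"
    return "different"
```

Running this over every alt contig pair would give a list of the 
reversed ones, although it would not explain why they were reversed.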

3) I am also seeing differences on many other chromosomes, though some 
are expected (e.g. large sections where NCBI has intentionally masked 
out sequence for the aligners, and IUPAC ambiguity bases changed to N in 
Ensembl). Are there any other differences I should expect?
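
To separate the expected differences from the unexpected ones, the 
mismatching positions can be bucketed by type. The sketch below is 
illustrative only: the category names and the function name are 
hypothetical, and it assumes the two sequences are already in the same 
orientation and coordinate system.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes other than N.
IUPAC_AMBIGUITY = set("RYKMSWBDHV")


def classify_differences(ensembl_seq, ncbi_seq):
    """Count position-wise differences between two sequences by category."""
    counts = Counter()
    for e, n in zip(ensembl_seq.upper(), ncbi_seq.upper()):
        if e == n:
            continue
        if e == "N" and n in IUPAC_AMBIGUITY:
            # Ambiguity code in NCBI replaced by N in Ensembl.
            counts["iupac_to_n_in_ensembl"] += 1
        elif e == "N":
            # Real base in NCBI, N in Ensembl.
            counts["n_only_in_ensembl"] += 1
        elif n == "N":
            # Real base in Ensembl, N in NCBI (e.g. masked for aligners).
            counts["n_only_in_ncbi"] += 1
        else:
            counts["other_mismatch"] += 1
    return counts
```

Anything that lands in the last bucket would be a difference not 
covered by the cases listed above.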

Thanks,

Dave
