[ensembl-dev] Differences in Ensembl GRCh38 fasta and NCBI GRCh38 fasta

Mon Jan 4 14:48:21 GMT 2016

Hi Dave,

Thank you for your feedback.

Issue number 1 seems to be a bug.
I do find the same masked-out region in the file as you do, but I do not
understand why it is there.
The only explanation I can find is that this file was accidentally built
from the repeatmasked sequence rather than the basic (unmasked) one.

The good news is that the toplevel file is correct in release 83
(ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz),
and the chromosome-only file
(ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz)
is correct in both release 76 and the latest release.
I would recommend using one of those if you can.

The reverse complemented sequences arise when a patch is mapped to the
reverse strand of the genome.
As you correctly identified, the fasta files we provide for patches
represent what the chromosome would look like with the patch integrated,
so the sequence is reverse complemented where needed and we add N
padding to represent the full-length chromosome.
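
As a rough illustration of the representation described above, here is a
minimal Python sketch. The function names and the toy coordinates are
hypothetical and are not taken from any Ensembl tool; the sketch assumes
1-based coordinates and simply places the (optionally reverse
complemented) patch into a string of Ns as long as the chromosome.

```python
# Hypothetical helpers for illustration only; not Ensembl's dump code.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def pad_patch(patch_seq, chrom_length, start, strand):
    """Place a patch at a 1-based start on a chromosome-length string of Ns.

    strand is '+' or '-'; a '-' patch is reverse complemented first.
    """
    if strand == "-":
        patch_seq = reverse_complement(patch_seq)
    end = start + len(patch_seq) - 1
    if start < 1 or end > chrom_length:
        raise ValueError("patch does not fit on the chromosome")
    # Everything outside the patch is padding.
    return "N" * (start - 1) + patch_seq + "N" * (chrom_length - end)


print(pad_patch("AACG", 10, 4, "-"))  # NNNCGTTNNN
```

The output always has the length of the chromosome, which is why the
patch files are so much larger than the patch sequence itself.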

To identify the contigs that will be reverse complemented, you can check
the file provided by the NCBI for contigs where the orientation (the
'ori' column) is '-':
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_structure/all_alt_scaffold_placement.txt
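
A check of that file could be scripted along these lines. This is only a
sketch: it assumes the file is tab-separated, that its header line starts
with '#', and that it has columns named 'alt_scaf_acc' and 'ori'. Please
verify those names against your downloaded copy. The function name and
the toy rows are made up for illustration.

```python
import csv
import io


def reverse_oriented_scaffolds(handle):
    """Return accessions of alt scaffolds placed with orientation '-'.

    Assumes a tab-separated file whose header line starts with '#' and
    includes columns named 'alt_scaf_acc' and 'ori'.
    """
    header = None
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if row[0].startswith("#"):
            # Header line: strip the leading '#' from the first column name.
            header = [name.lstrip("#") for name in row]
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        record = dict(zip(header, row))
        if record["ori"] == "-":
            hits.append(record["alt_scaf_acc"])
    return hits


# Toy input in the assumed layout (values are invented).
example = io.StringIO(
    "#alt_scaf_acc\tparent_name\tori\n"
    "SCAF_A.1\t6\t-\n"
    "SCAF_B.1\t12\t+\n"
)
print(reverse_oriented_scaffolds(example))  # ['SCAF_A.1']
```

For a real file, open the downloaded copy with open(path, newline="")
and pass the handle to the function instead of the toy input.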

The only other difference I can think of concerns the pseudoautosomal
regions (PARs).
Currently, we remove these duplicated regions from the Y chromosome.
As a result, the fasta sequence for the Y chromosome contains only the
unique regions of sequence; the rest is padded with Ns.
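
To see which stretches of a sequence are padded or masked in this way,
one option is to list its runs of N. The sketch below is a hypothetical
helper, not an Ensembl utility; it assumes the sequence has already been
loaded as a single string with the fasta header and line breaks removed,
and it reports 1-based inclusive coordinates.

```python
import re


def n_runs(seq, min_length=1):
    """Return (start, end) pairs, 1-based inclusive, for each run of N/n."""
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    # m.start() is 0-based, m.end() is exclusive, so only start needs +1.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(seq)]


print(n_runs("NNNNACGTNNACGNNNNN", min_length=3))  # [(1, 4), (14, 18)]
```

Running the same function on the Ensembl and NCBI versions of a
chromosome and comparing the two lists shows where one file has Ns and
the other has sequence.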

We are, however, working on providing different versions of the same
files, as it is clear that no single format fits all analysis tools.
One of the main areas we want to improve is how the alternative 
sequences are represented.
It made sense to us at the time to provide them as complete chromosomes, 
but it is clearly not compatible with mapping reads.
Your input is really valuable to us, as it gives us a concrete example 
of a format that would work for you.

HTH.

Regards,
Magali

On 23/12/2015 21:40, Dave Larson wrote:
> I have some questions about differences I've observed between an NCBI 
> provided GRCh38 reference and the one available from Ensembl (Release 
> 76; Homo_sapiens.GRCh38.dna.toplevel.fa). I initially believed they 
> should be identical except for chromosome names and N padding, but 
> they don't appear to be. I've done my best to find answers online 
> already, but I'm coming up with nothing.
>
> NCBI has provided a reference for use in mapping NGS reads here: 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ 
> and I believe 1000 Genomes has also utilized a very similar reference.
>
> Typically, I would utilize Ensembl to assemble my reference sequence, 
> but I would like to include the alternate haplotypes and the padding 
> of Ns in those sequences seemed likely to cause problems for the 
> creation of the necessary indices. Therefore, I was looking into using 
> the NCBI reference, but wanted to confirm that the coordinate systems 
> and sequence were the same so that I could continue to utilize VEP 
> downstream, even for alternative haplotypes. I found some unexpected 
> differences and I was hoping I could get more information.
>
> 1) Sequences in Ensembl are masked (as Ns) within the primary 
> chromosomes, but are not in the NCBI reference. For example:
>
> diff /tmp/ensembl_chr12.fa /tmp/genbank_chr12.fa  | head -100
> 579551,579554c579551,579554
> < ACGGGATTTCTTCATATAATGTTAGACAGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> < NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> < NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> < NNNNNNNNNNNNNNNNNNNNNATCATTCTCAGAAACTACTTTGTGATGTGTGCGTTCAAC
> ---
> > ACGGGATTTCTTCATATAATGTTAGACAGAAGAATTCTCAGAAACTTATTTGTGTTATAT
> > TTATTCAACTAGCAGAATTGAAACTTCCTTTTGACAGAGCAGATTTGATACACTCCTTTT
> > GTGGAATTTCCAGGTGCAGATTTCATTCGCTTTGAGGCCAATGGTAGAAAAGGACATATA
> > TTCGTAGAAAAACAAGAGAGAATCATTCTCAGAAACTACTTTGTGATGTGTGCGTTCAAC
>
>
> Does Ensembl perform additional N masking, even for the toplevel file? 
> Under what criteria?
>
> 2) Some of the alternative haplotypes appear to be reverse 
> complemented from what I see in the NCBI reference.
>
> For example, chr6_KI270758v1_alt aka CHR_HSCHR6_8_CTG1 is reversed in 
> Ensembl from the sequence in NCBI 
> (http://www.ncbi.nlm.nih.gov/nuccore/KI270758). Is there some way to 
> determine which contigs this is performed on through the API? Why/how 
> are these decided to be reversed?
>
> 3) I am also seeing differences on many other chromosomes, but some 
> are expected (e.g. large sections where NCBI has masked out sequence 
> intentionally for the aligners, changes of IUPAC ambiguity bases to N 
> in Ensembl etc). Any other differences that should be expected?
>
> Thanks,
>
> Dave