[ensembl-dev] Sorted chromosomes in genome FASTA + chr prefix in GRCh38 + dbSNP updates

Joel Fillon, Mr joel.fillon at mcgill.ca
Wed Oct 15 15:00:04 BST 2014


Dear Fiona,

OK, that's much clearer now. Thanks a lot for the info and for the quick answer, much appreciated!

Joël

________________________________________
From: dev-bounces at ensembl.org [dev-bounces at ensembl.org] on behalf of Fiona Cunningham [fiona at ebi.ac.uk]
Sent: Wednesday, 15 October 2014 07:05
To: Ensembl developers list
Subject: Re: [ensembl-dev] Sorted chromosomes in genome FASTA + chr prefix in GRCh38 + dbSNP updates

Dear Joel,

Many thanks for your feedback and comments. I will address the
question about the dbSNP update.

We decided not to import dbSNP141 into the Ensembl variation databases
because of the known errors in dbSNP141 published on the NCBI website:
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/misc/known_issues/b141/
and also because approximately 300,000 rsIDs have been deleted (many
because they map to multiple locations).

Instead, Ensembl has projected the variants from dbSNP138 (the
previous dbSNP release of human variants) onto GRCh38 for the Ensembl
76 and 77 releases.
http://www.ensembl.info/blog/2014/07/30/variation-annotation-for-grch38/

As soon as the data files for dbSNP142 are released, we will
import these into Ensembl after we evaluate the data.  This process
does indeed take a few months, as we run a QC process and annotate
the variants using the GENCODE transcript set.

Best wishes,
Fiona

-------------------------------------------------------------------------------------------------
Fiona Cunningham, Variation Annotation Coordinator
Ensembl project, Ensembl Variation Project Leader.
European Bioinformatics Institute (EMBL-EBI)
Genome Campus, Hinxton,  CB10 1SD, UK
www.ensembl.org || www.lrg-sequence.org || t: +44 1223 494612


On 14 October 2014 19:58, Joel Fillon, Mr <joel.fillon at mcgill.ca> wrote:
> Hi Ensembl admins,
>
> 3 unrelated questions (should maybe be posted in different messages):
>
> 1. Would it be possible to sort chromosomes in genome FASTA files by "biological" order
> instead of lexicographic order or other order (sequence length?) e.g.:
> ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
>
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> 10
> 11
> 12
> 13
> 14
> 15
> 16
> 17
> 18
> 19
> 20
> 21
> 22
> X
> Y
> M
>
>
> instead of:
> 1
> 10
> 11
> 12
> 13
> 14
> 15
> 16
> 17
> 18
> 19
> 2
> 20
> 21
> 22
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> MT
> X
> Y
>
> or in ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.23.dna.genome.fa.gz
>
> 1
> 2
> 3
> 4
> 5
> Mt
> Pt
>
> instead of:
> Pt
> Mt
> 4
> 2
> 3
> 5
> 1
>
> since random order can cause problems with tools like GATK.
>
> 2. Would it be possible in Homo sapiens GRCh38 genome to prefix chromosome names with "chr" like NCBI and UCSC versions,
> to match them with Ensembl GTF chromosome IDs?
>
> 3.  Regarding dbSNP updates included in Ensembl releases, I understand from this page http://useast.ensembl.org/Help/Faq?id=432
> that it takes several months to curate dbSNP entries. Do you have any rough idea of when dbSNP build 141 would be available for Homo sapiens GRCh38?
> By the end of this year or not before 2015?
>
> Thanks a lot for your help and for the hard work!
> Joël
>
> _______________________________________________
> Dev mailing list    Dev at ensembl.org
> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
> Ensembl Blog: http://www.ensembl.info/
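
Regarding question 1 in the quoted message: until the order in the
distributed files changes, the sequence names can be reordered locally
before the FASTA records are rewritten. The following is only a minimal
sketch of such a "biological" sort key in Python, not anything provided by
Ensembl. The names karyotype_key, sort_chromosomes and SPECIAL_ORDER are
hypothetical, and the ranking of X, Y, M/MT and Pt is an assumption taken
from the orders listed in the question.

```python
import re

# Assumed ranking for non-numeric names, taken from the orders requested
# in the quoted message (X, Y, M/MT for human; Mt, Pt for Arabidopsis).
SPECIAL_ORDER = {"X": 0, "Y": 1, "M": 2, "MT": 2, "PT": 3}


def karyotype_key(name):
    """Sort key: numeric chromosomes first, then X, Y, M/MT, Pt, then the rest."""
    # Ignore an optional "chr" prefix so "chr7" and "7" sort identically.
    bare = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    if re.fullmatch(r"[0-9]+", bare):
        return (0, int(bare), "")
    upper = bare.upper()
    if upper in SPECIAL_ORDER:
        return (1, SPECIAL_ORDER[upper], "")
    # Anything else (scaffolds, patches) goes last, in lexicographic order.
    return (2, 0, bare)


def sort_chromosomes(names):
    """Return the sequence names in karyotype order."""
    return sorted(names, key=karyotype_key)


print(sort_chromosomes(["1", "10", "11", "2", "MT", "X", "Y"]))
print(sort_chromosomes(["Pt", "Mt", "4", "2", "3", "5", "1"]))
```

The resulting list of names can then drive whichever tool is used to
rewrite the FASTA file. Any index or sequence dictionary built from the
old file would need to be regenerated afterwards.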




