[ensembl-dev] Sorted chromosomes in genome FASTA + chr prefix in GRCh38 + dbSNP updates

Mon Oct 20 21:44:51 BST 2014

Dear Andy,

Thanks for your answer. No problem about the wait.

1) OK, I checked again and it seems the latest GATK version handles genome sequences in no particular order (probably by reordering
them in karyotypic order before further analysis).
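
For illustration, here is a minimal Python sketch of such a karyotypic sort. It is not what GATK does internally; the function names are hypothetical, and the placement of X, Y, MT and Pt after the numbered chromosomes is an assumption taken from the ordering requested earlier in this thread.

```python
# Assumed rank for non-numeric names; anything not listed sorts last, alphabetically.
SPECIAL_ORDER = {"X": 0, "Y": 1, "M": 2, "MT": 2, "Mt": 2, "Pt": 3}

def karyotypic_key(name):
    """Sort key: numbered chromosomes by value, then X, Y, MT/Pt, then the rest."""
    if name.isdigit():
        return (0, int(name), "")
    if name in SPECIAL_ORDER:
        return (1, SPECIAL_ORDER[name], "")
    return (2, 0, name)

def sort_karyotypic(names):
    """Return the sequence names in karyotypic rather than lexicographic order."""
    return sorted(names, key=karyotypic_key)

lexicographic = ["1", "10", "11", "2", "20", "3", "MT", "X", "Y"]
print(sort_karyotypic(lexicographic))
# ['1', '2', '3', '10', '11', '20', 'X', 'Y', 'MT']
```

Scaffold or patch names that are neither numeric nor listed in SPECIAL_ORDER simply end up after the chromosomes in alphabetical order.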

2) OK, we actually deal with the Ensembl/UCSC mismatch internally. I guess the ideal way would be to convince UCSC and NCBI to adopt YOUR standard :D
(which I personally prefer and which is indeed consistent with other species).
Yes, it's OK to contact us to discuss usage at a later date.

Thanks again for your answer.
Joël

________________________________________
From: dev-bounces at ensembl.org [dev-bounces at ensembl.org] on behalf of Andy Yates [ayates at ebi.ac.uk]
Sent: Monday, 20 October 2014 06:19
To: Ensembl developers list
Subject: Re: [ensembl-dev] Sorted chromosomes in genome FASTA + chr prefix in GRCh38 + dbSNP updates

Dear Joël,

Sorry for the long wait for a reply to your other questions.

1). We are currently looking into applying the more biological ordering to our DNA files. Could I ask what particular problems GATK is having with the files in their current state?

2). During our 76 release we spent quite a large amount of time discussing the adoption of chr prefixes for Ensembl. A summary of our discussion was:

- GRCh38 & release 76 was a good time to make a switch if we were to make one
- This would have been applied to just human and not retrospectively to other species. It would have been an inconsistent change.
- Other large scale groups have bought into the Ensembl naming convention as much as groups have bought into the UCSC naming. Changing those names risks causing large scale unknown implications for downstream users
- A number of users already deal with the Ensembl/UCSC mismatch by the addition/removal of "chr" from the name. We felt there was a danger of names like chrchr1 being created and causing bugs in users' pipelines. In a great number of cases this technique is wrong

The decisions we made were:

- Not to change because it would be inconsistent & affect downstream Ensembl users
- Provide an easy way to map between the namespaces
        1). Loading UCSC names where available as synonyms (we do this for human but other species are missing)
        2). Using a flat tab delim file
        3). Chain files so remaps can be done using liftOver or CrossMap
- Provide documentation about how to use Ensembl data in common pipelines
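
As an illustration of the flat-file option, here is a minimal Python sketch of a lookup-based translation between the two namespaces. The two-column layout (Ensembl name, tab, UCSC name), the example rows and the function names are all assumptions made for this sketch; the actual mapping file may look different. The point is that names are translated by lookup only, so a blind "chr" prefix can never produce names like chrchr1 or chrMT.

```python
import csv
import io

# Hypothetical two-column mapping (Ensembl name <TAB> UCSC name). The real
# file layout is an assumption; the thread only says "flat tab delim file".
EXAMPLE_TSV = "1\tchr1\n2\tchr2\nX\tchrX\nMT\tchrM\n"

def load_mapping(handle):
    """Read Ensembl-to-UCSC name pairs from a tab-delimited file handle."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping

def translate(name, mapping):
    """Translate by lookup only; unknown names raise instead of being guessed."""
    if name not in mapping:
        raise KeyError(f"no synonym known for {name!r}")
    return mapping[name]

ensembl_to_ucsc = load_mapping(io.StringIO(EXAMPLE_TSV))
# Invert the table for the opposite direction.
ucsc_to_ensembl = {ucsc: ens for ens, ucsc in ensembl_to_ucsc.items()}

print(translate("MT", ensembl_to_ucsc))    # chrM, not chrMT
print(translate("chrX", ucsc_to_ensembl))  # X
```

Passing a name that is already in the target namespace (for example "chr1" to the Ensembl-to-UCSC table) raises a KeyError rather than silently returning a doubled prefix.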

We are producing the mappings for human now and can make these available to you to help with converting to & from UCSC names. Point 3 is in development and, as you are using our data, I hope it's OK to contact you to discuss your usage at a later date.

Thanks,

Andy

------------
Andrew Yates - Ensembl Support Coordinator
European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, United Kingdom
Tel: +44-(0)1223-492538
Fax: +44-(0)1223-494468
Skype: andrewyatz
http://www.ensembl.org/

On 15 Oct 2014, at 12:05, Fiona Cunningham <fiona at ebi.ac.uk> wrote:

> Dear Joel,
>
> Many thanks for your feedback and comments. I will address the
> question about the dbSNP update.
>
> We decided not to import dbSNP141 into the Ensembl variation databases
> because of the known errors in dbSNP141 published on the NCBI website:
> ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/misc/known_issues/b141/
> and also as approximately 300,000 rsIDs are deleted (many because they
> map multiple times).
>
> Instead, Ensembl has projected the variants from dbSNP138 (the
> previous dbSNP release of human variants) onto GRCh38 for the Ensembl
> 76 and 77 releases.
> http://www.ensembl.info/blog/2014/07/30/variation-annotation-for-grch38/
>
> As soon as the data files for dbSNP142 are released, we will
> import these into Ensembl after we evaluate the data.  This process
> does indeed take a few months as we have a QC process and variation
> annotation of these data using the GENCODE transcript set.
>
> Best wishes,
> Fiona
>
> -------------------------------------------------------------------------------------------------
> Fiona Cunningham, Variation Annotation Coordinator
> Ensembl project, Ensembl Variation Project Leader.
> European Bioinformatics Institute (EMBL-EBI)
> Genome Campus, Hinxton,  CB10 1SD, UK
> www.ensembl.org || www.lrg-sequence.org || t: +44 1223 494612
>
>
> On 14 October 2014 19:58, Joel Fillon, Mr <joel.fillon at mcgill.ca> wrote:
>> Hi Ensembl admins,
>>
>> 3 unrelated questions (should maybe be posted in different messages):
>>
>> 1. Would it be possible to sort chromosomes in genome FASTA files by "biological" order
>> instead of lexicographic order or other order (sequence length?) e.g.:
>> ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
>>
>> 1
>> 2
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> 10
>> 11
>> 12
>> 13
>> 14
>> 15
>> 16
>> 17
>> 18
>> 19
>> 20
>> 21
>> 22
>> X
>> Y
>> M
>>
>>
>> instead of:
>> 1
>> 10
>> 11
>> 12
>> 13
>> 14
>> 15
>> 16
>> 17
>> 18
>> 19
>> 2
>> 20
>> 21
>> 22
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> MT
>> X
>> Y
>>
>> or in ftp://ftp.ensemblgenomes.org/pub/plants/release-23/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.23.dna.genome.fa.gz
>>
>> 1
>> 2
>> 3
>> 4
>> 5
>> Mt
>> Pt
>>
>> instead of:
>> Pt
>> Mt
>> 4
>> 2
>> 3
>> 5
>> 1
>>
>> since random order can cause problems with tools like GATK.
>>
>> 2. Would it be possible in Homo sapiens GRCh38 genome to prefix chromosome names with "chr" like NCBI and UCSC versions,
>> to match them with Ensembl GTF chromosome IDs?
>>
>> 3.  Regarding dbSNP updates included in Ensembl releases, I understand from this page http://useast.ensembl.org/Help/Faq?id=432
>> that it takes several months to curate dbSNP entries. Do you have any rough idea of when dbSNP build 141 would be available for Homo sapiens GRCh38?
>> By the end of this year or not before 2015?
>>
>> Thanks a lot for your help and for the hard work!
>> Joël
>>
>> _____________________________________________________
>> Joël Fillon
>> McGill University and Génome Québec Innovation Centre
>> 740, Dr. Penfield Avenue, Room 4200
>> Montréal (QC) H3A 0G1
>> CANADA
>>
>> Phone: 514-398-3311 ext. 00721
>> E-mail: joel.fillon at mcgill.ca
>>
>> _______________________________________________
>> Dev mailing list    Dev at ensembl.org
>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>> Ensembl Blog: http://www.ensembl.info/