[ensembl-dev] Intentions for Ensembl release 61

Stephen Trevanion st3 at sanger.ac.uk
Thu Nov 25 17:12:31 GMT 2010


Please see below a summary of the intentions declared for Ensembl 61 
(scheduled for 19th January). Note that these are intentions and are not 
guaranteed to be in the release.

Regards,

Steve

------------------------------------------------

Compara:

Families-
Updated MCL families including all Ensembl transcript isoforms and 
newest Uniprot Metazoa.
   * Clustering by MCL
   * Multiple Sequence Alignments with MAFFT
   * Family stable ID mapping

Gene Homologies-
GeneTrees with new/updated genebuilds and assemblies
   * Updated build of ncRNA trees
   * Clustering using hcluster_sg
   * Multiple Sequence Alignments using consistency-based MCoffee 
meta-aligner
   * Homology inference including the recent 'possible_ortholog' type 
and 'putative gene split' and
     'contiguous gene split' exceptions
   * Pairwise gene-based dN/dS calculations for high coverage species 
pairs only
   * GeneTree stable ID mapping

Pairwise Alignments-
Human - Lizard tBlat net
Human - Turkey tBlat net
Turkey - Chicken Lastz
Lizard - Chicken Lastz
Dog - Horse Lastz
**Removing chicken - zebrafinch tBlat
Chicken - Turkey - Zebrafinch EPO multiple alignment


Core:

seq region synonyms-
New table seq_region_synonym added to allow multiple names for sequence 
regions.
Species: all species
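
As a rough illustration of how a synonym table lets several names resolve 
to one sequence region, here is a minimal sketch using an in-memory SQLite 
database. The column names, the cut-down seq_region table, the alias values 
and the resolve() helper are assumptions made for the example only; they 
are not taken from the released schema or the Ensembl API.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical, simplified layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE seq_region (
            seq_region_id INTEGER PRIMARY KEY,
            name          TEXT NOT NULL
        );
        -- Assumed layout: one row per extra name for a sequence region.
        CREATE TABLE seq_region_synonym (
            seq_region_synonym_id INTEGER PRIMARY KEY,
            seq_region_id         INTEGER NOT NULL
                                  REFERENCES seq_region(seq_region_id),
            synonym               TEXT NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO seq_region VALUES (1, 'HSCHR6_MHC_MANN')")
    # Made-up aliases, for illustration only.
    conn.executemany(
        "INSERT INTO seq_region_synonym (seq_region_id, synonym) VALUES (?, ?)",
        [(1, "demo_alias_a"), (1, "demo_alias_b")],
    )
    return conn


def resolve(conn, name):
    """Return the primary region name for a primary name or any synonym."""
    row = conn.execute(
        """
        SELECT sr.name
        FROM seq_region sr
        LEFT JOIN seq_region_synonym syn
               ON syn.seq_region_id = sr.seq_region_id
        WHERE sr.name = ? OR syn.synonym = ?
        LIMIT 1
        """,
        (name, name),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    db = build_demo_db()
    print(resolve(db, "demo_alias_b"))
```

Keeping synonyms in their own table means a region can carry any number 
of alternative names without changing the seq_region row itself.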

external database references-
Human, mouse, rat and tree shrew will be updated.

GO term and gene name projections-
Gene display names and GO terms will be projected from high-coverage 
species to those with lower coverage.

Ontology database-
The Ensembl Ontology database will as usual be populated with the latest 
available versions of the
   * Gene Ontology (GO)
   * Sequence Ontology (SO)

embl and genbank dumps-
Only the reference sequence will be dumped in the main directory for 
embl and genbank. Non-reference regions (haplotype/PAR regions) will now 
be dumped in a subdirectory and will contain only the unique regions.


Funcgen:

Array Mapping-
Array mapping was updated on all species which have had an update to 
their genome assemblies or gene builds. The probe/set to transcript 
xrefs were recalculated across all species.

Mouse Regulatory Build-
The mouse RegulatoryBuild was re-run to re-introduce some data which had 
been omitted in the previous build.


Genebuild:

Human cDNA update-
Updated set of cDNA alignments to the human genome.

Haplotype correction-
Correction of an error that added one extra N to the end of the 
alternative versions of the chromosomes for five of the haplotypes. The 
altered alternative chromosomes are: HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, 
HSCHR6_MHC_SSTO, HSCHR4_1 and HSCHR17_1.
Species: Human

Zebrafish Havana merge-
A merge of the zebrafish core gene set with Havana manual annotation. 
The core gene set has been altered to include missing genes that were 
lost in e60 due to a problem in gene clustering.

GENCODE gene set update (release 6)-
Update to the Ensembl/Havana GENCODE gene set using the latest Vega gene 
set.

Updates to mouse and human Vega annotation-
The Vega annotation for both human and mouse has been updated. This 
matches the annotation presented in Vega release 41.

new rnaseq database-
I will provide a new database which consists of the core tables; the 
data will be from the human bodymap project (RNA-seq data). This is a 
new database which has not been released before. It was originally 
planned for e60.

mouse cDNA update-
Updated set of cDNA alignments to the mouse genome.

Zebrafish Vega annotation-
Manual annotation of zebrafish from Havana is now present in Ensembl. 
This represents the annotation presented in Vega release 40.

Mouse gene set update-
A merge of Ensembl core gene set and Vega manual annotation.
The core gene set has been improved by incorporating new data resources 
which had become available since the last NCBIM37 genebuild (April 
2007), resulting in the correction of existing gene models and the 
recovery of new mouse genes with human orthologues.
A new otherfeatures database is also available.

New assembly for lizard-
A new assembly for lizard

Turkey-
The first genebuild for turkey

New Canonical Transcript definition-
For previous releases, the canonical transcript of a gene has been set 
to the transcript with the longest translation (for coding genes) or to 
the transcript with the longest mRNA (for noncoding genes). From release 
61, the canonical transcript for human and mouse will now be set to the 
longest CCDS transcript. Where no CCDS transcript exists for the gene, 
the longest Ensembl-HAVANA merge transcript will be used.
Species: Human, Mouse
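
The old and new selection rules can be compared with a short sketch. This 
is only an illustration of the rules as described above, not the genebuild 
code: the Transcript class and its fields are hypothetical, "longest" is 
assumed to mean longest translation for coding genes and longest mRNA for 
noncoding genes in every case, and the final fallback to the old rule 
(when a gene has neither a CCDS nor a merged transcript) is an assumption, 
since the announcement does not cover it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    """Hypothetical minimal transcript record for this example."""
    stable_id: str
    translation_length: int   # 0 for noncoding transcripts
    mrna_length: int
    is_ccds: bool = False     # transcript matches a CCDS model
    is_merged: bool = False   # Ensembl-HAVANA merge transcript


def _longest(transcripts: List[Transcript], coding: bool) -> Transcript:
    # Assumed measure: translation length if coding, mRNA length otherwise.
    if coding:
        return max(transcripts, key=lambda t: t.translation_length)
    return max(transcripts, key=lambda t: t.mrna_length)


def canonical_pre_e61(transcripts: List[Transcript], coding: bool) -> Transcript:
    """Rule used in previous releases: simply the longest transcript."""
    return _longest(transcripts, coding)


def canonical_e61(transcripts: List[Transcript], coding: bool) -> Transcript:
    """Release 61 rule for human and mouse, as described in the text."""
    ccds = [t for t in transcripts if t.is_ccds]
    if ccds:
        return _longest(ccds, coding)
    merged = [t for t in transcripts if t.is_merged]
    if merged:
        return _longest(merged, coding)
    # Assumption: fall back to the old rule when neither kind exists.
    return _longest(transcripts, coding)


if __name__ == "__main__":
    gene = [
        Transcript("T1", 500, 2000),
        Transcript("T2", 300, 1500, is_ccds=True),
        Transcript("T3", 400, 1800, is_merged=True),
    ]
    print(canonical_pre_e61(gene, coding=True).stable_id)
    print(canonical_e61(gene, coding=True).stable_id)
```

In this made-up gene the old rule picks T1 (longest translation), while 
the new rule picks T2 because it is the only CCDS transcript.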

Removal of ambiguous bases from human DNA sequence-
Ambiguous bases have been replaced with 'N' for the following two human 
contigs:
   * contig::AF152363.1:1:185763:1. This contig held 28 ambiguous bases: 
S(4), W(6), M(5), K(4), R(5), Y(4).
   * contig::AF152364.1:1:170452:1. This contig held 4 ambiguous bases: 
S(1), W(1), Y(1), K(1).
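
The replacement described above can be sketched as follows. This is a 
minimal illustration rather than the script that was actually run: the 
function name mask_ambiguous is hypothetical, and the set of codes is 
assumed to be the IUPAC nucleotide ambiguity codes other than N.

```python
from collections import Counter
from typing import Dict, Tuple

# IUPAC nucleotide ambiguity codes, excluding A, C, G, T and N itself.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def mask_ambiguous(seq: str) -> Tuple[str, Dict[str, int]]:
    """Replace ambiguity codes with 'N' and count what was replaced.

    Unambiguous bases (and existing N's) are passed through unchanged,
    so the sequence length and coordinates are preserved.
    """
    counts: Counter = Counter()
    masked = []
    for base in seq:
        code = base.upper()
        if code in AMBIGUITY_CODES:
            counts[code] += 1
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked), dict(counts)


if __name__ == "__main__":
    new_seq, replaced = mask_ambiguous("ACGTSWNRYacgtk")
    print(new_seq)
    print(replaced)
```

The returned counts give a per-code tally in the same form as the 
S(4), W(6), ... figures listed for each contig.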

Updated CCDS-
Updated CCDS databases for Human and Mouse. Populates other_features 
with new gene models and serves data for CCDS Public Note DAS track.


Mart:

Ensembl Marts for release 61-
Full build of all 7 marts for all species.


Variation:

- import dbSNP 132 (human)
- import dbSNP for further species if available in time (mouse, rat, 
zebrafish, cat, opossum)
- import new release of HGMD database
- corrections to Affymetrix CNV probe data
- import PorcineSNP60 BeadChip
- update of zebrafish variation consequences for new gene build
- variations will now be flagged and retained instead of failed and 
deleted for species with a new import of dbSNP
- produce GVF file dumps of all variants and their consequences by species


Wormbase:

C.elegans WS220-
A new version of the C.elegans database based on the official frozen 
WS220 WormBase release.



