[ensembl-announce] Ensembl Release 60 - summary of declarations of intentions

wm2 at ebi.ac.uk
Mon Sep 20 16:17:45 BST 2010


Below is the summary of declarations of intentions for Ensembl
release 60. Please note these are intentions and are not guaranteed
to be in the release, which is currently scheduled for the 26th of October.

Regards,

William McLaren


====================================================
Summary of declarations of intentions for Ensembl 60
====================================================

### Compara

# Families
- Updated MCL families including all Ensembl transcript isoforms and newest
  UniProt Metazoa
- Clustering by MCL
- Multiple Sequence Alignments with MAFFT
- Family stable ID mapping

# Gene Homologies
- GeneTrees with new/updated genebuilds and assemblies
- Updated build of ncRNA trees
- Clustering using hcluster_sg
- Multiple Sequence Alignments using consistency-based MCoffee meta-aligner
  (mafftgins+muscle+kalign+probcons) and exon-skipping aware "skipper"
  algorithm
- Homology inference including the recent 'possible_ortholog' type and
  'putative gene split' and 'contiguous gene split' exceptions
- Pairwise gene-based dN/dS calculations for high coverage species pairs only
- GeneTree stable ID mapping

# Pairwise Alignments

-- Lastz-net alignments
- H.sap-A.mel
- H.sap-O.cun
- C.fam-A.mel

-- Blat-alignments
- H.sap-D.rer
- M.mus-D.rer
- R.nor-D.rer
- G.gal-D.rer
- T.rub-D.rer
- D.rer-X.tro
- C.int-D.rer
- C.sav-D.rer
- G.acu-D.rer
- O.lat-D.rer
- D.rer-T.nig

-- Non-reference alignments for human vs high coverage blastz-net alignments
- H.sap-P.tro
- H.sap-G.gor
- H.sap-P.pyg
- H.sap-M.mul
- H.sap-M.mus
- H.sap-R.nor
- H.sap-C.fam
- H.sap-B.tau
- H.sap-S.scr
- H.sap-E.cab
- H.sap-O.ana
- H.sap-M.dom
- H.sap-G.gal

# Multiple alignments
- 34 way epo low coverage
- 14 way epo eutherian mammals
- 5 way epo fish

# Synteny
- H.sap-C.jac
- H.sap-O.cun



### Core

# Ontology database
- A new ontology database ("ensembl_ontology_60") will be built using the
  latest data from GO and SO.

# Gene name and GO term projections
- Gene names and GO xrefs will be projected from species where there is high
  coverage to species where there is lower coverage. Panda will be included
  as a target for these projections.

# External database references
- Update external database references for human, mouse and Xenopus

# GO Xrefs are now Ontology Xrefs
- The go_xref table is renamed to ontology_xref. The Bio::EnsEMBL::GoXref
  Perl module is renamed to Bio::EnsEMBL::OntologyXref.
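
As a rough illustration of the table rename described above, here is a
minimal sketch in Python. It is not the actual Ensembl patch file: it uses an
in-memory SQLite database instead of MySQL, and the column names are
hypothetical placeholders chosen only so there is a row to carry over.

```python
import sqlite3


def rename_go_xref(conn):
    """Rename go_xref to ontology_xref; existing rows are kept as they are."""
    conn.execute("ALTER TABLE go_xref RENAME TO ontology_xref")


def table_names(conn):
    """Return the names of all tables in the database, sorted."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor]


# In-memory stand-in for a core database (assumption: the real one is MySQL).
conn = sqlite3.connect(":memory:")
# Hypothetical, simplified columns; the real table layout is not given here.
conn.execute("CREATE TABLE go_xref (object_xref_id INTEGER, linkage_type TEXT)")
conn.execute("INSERT INTO go_xref VALUES (1, 'IEA')")

rename_go_xref(conn)
rows = conn.execute(
    "SELECT object_xref_id, linkage_type FROM ontology_xref"
).fetchall()
print(table_names(conn), rows)
```

Any script that still queries go_xref directly would need the same one-word
change after the rename.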



### Funcgen

# Array Mapping
- The array mapping pipeline will be run for those species which have new
  assemblies, gene builds or array designs. This includes an update to the
  latest version of the Phalanx OneArray for human.

# BindingMatrix
- A new BindingMatrix class will represent position weight matrices (PWMs)
  loaded from Jaspar or inferred directly from ChIP-Seq data. This will
  ultimately be able to identify the consequence of a sequence change at a
  given location, with respect to the PWM score. patch_59_60_c.sql contains
  the relevant changes to update the schema to support this data.
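
To make the idea of a sequence change "with respect to the PWM score" more
concrete, here is a minimal sketch in Python. It is not the BindingMatrix
API: the count matrix is made up, and the pseudocount, the uniform background
and the function names are all assumptions chosen for illustration.

```python
import math

# Hypothetical count matrix for a 4 bp motif, one dict per position.
# The numbers are invented for illustration and do not come from Jaspar.
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 1, "T": 8},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position counts to log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in column.items()
        })
    return matrix


def pwm_score(matrix, site):
    """Sum the log-odds of each base of the site at its position."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal matrix length")
    return sum(column[base] for column, base in zip(matrix, site))


def score_change(matrix, site, offset, alt_base):
    """Score difference caused by substituting alt_base at a 0-based offset."""
    mutated = site[:offset] + alt_base + site[offset + 1:]
    return pwm_score(matrix, mutated) - pwm_score(matrix, site)


pwm = log_odds_matrix(COUNTS)
delta = score_change(pwm, "ACGT", 1, "T")  # C to T at the second position
print(round(delta, 3))
```

A negative value means the substitution makes the site a weaker match to the
matrix, a positive value a stronger one.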

# MotifFeature
- A new MotifFeature class has been added to represent the genomic mapping
  of a position weight matrix (BindingMatrix). patch_59_60_c.sql contains the
  relevant schema updates.

# Schema patch: Schema version
- patch_59_60_a.sql updates the meta table, changing the schema_version
  meta_value to 60.
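
The effect of this patch can be sketched as follows in Python. This is an
illustration only, not the contents of patch_59_60_a.sql: it runs against an
in-memory SQLite table rather than MySQL, and the meta_key column name and
the helper function names are assumptions.

```python
import sqlite3


def bump_schema_version(conn, new_version):
    """Set the schema_version entry of the meta table to new_version."""
    conn.execute(
        "UPDATE meta SET meta_value = ? WHERE meta_key = 'schema_version'",
        (str(new_version),),
    )


def schema_version(conn):
    """Read the current schema_version value back from the meta table."""
    row = conn.execute(
        "SELECT meta_value FROM meta WHERE meta_key = 'schema_version'"
    ).fetchone()
    return row[0]


# Simplified stand-in for the meta table (assumed columns, SQLite not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (meta_key TEXT, meta_value TEXT)")
conn.execute("INSERT INTO meta VALUES ('schema_version', '59')")

bump_schema_version(conn, 60)
print(schema_version(conn))
```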

# Schema patch: associated_feature_type
- patch_59_60_b.sql updates the associated_feature_type table to support
  feature_type to feature_type associations. The relevant adaptors have also
  been updated to reflect the new table fields and values.

# RegulatoryBuild update
- The human RegulatoryBuild has been updated and re-annotated based on the
  new ChIP-Seq data sets.

# Position Weight Matrix (PWM) mapping and visualisation
- PWM mappings which used to be associated with the RegulatoryFeatures are
  now associated with the AnnotatedFeatures representing the specific
  Transcription Factor Binding Site predictions. This utilises the new
  MotifFeature and BindingMatrix classes. These new data are available as new
  tracks in the Regulation panel as well as Region in Detail.

# New ChIP-Seq datasets from ENCODE
- 93 new ENCODE ChIP-Seq datasets for existing cell lines will be added.

# probe_feature.cigar_line patch
- patch_59_60_d.sql patches the probe_feature table, changing the cigar_line
  field from a free text field to a varchar.
Species: Anole lizard, Cow, C.elegans, Marmoset, Dog, Guinea Pig, Sloth,
C.intestinalis, C.savignyi, Zebrafish, Armadillo, Kangaroo rat, Fly, Tenrec,
Horse, Hedgehog, Cat, Chicken, Stickleback, Gorilla, Human, Elephant,
Macaque, Wallaby, Mouse Lemur, Opossum, Mouse, Microbat, Pika, Platypus,
Rabbit, Medaka, Bushbaby, Chimp, Orangutan, Rock Hyrax, Megabat, Rat, Yeast,
Shrew, Ground Squirrel, Pig, Zebra Finch, Fugu, Tarsier, Tetraodon,
Tree Shrew, Dolphin, Alpaca, Xenopus, Panda



### Genebuild

# Update to human vega annotation
- An update to Vega human annotation

# Gencode gene set update
- Update to the Ensembl/Havana Gencode gene set using the latest Vega gene
  set.

# Human cDNA update
- Updated set of cDNA alignments to the human genome.

# Rabbit chromosomes
- Chromosome mapping added for the rabbit genome. Coordinates updated
  accordingly.

# Human (GRCh37) assembly patch release 2
- Addition of the GRCh37 patch release 2 patches. These are toplevel,
  non-reference regions of the assembly.

# Updated human otherfeatures db: EST alignments
- Human ESTs were realigned. New EST-based genes were produced from these EST
  alignments.

# Panda genebuild
- The first genebuild for the panda genome

# Update human otherfeatures db: new CCDS import
- Update to CCDS set for human

# Updated mouse otherfeatures db: New CCDS import
- Update to CCDS set for mouse

# cDNA based gene annotation of human assembly patches
- Annotate the human assembly patches using Exonerate's cDNA2genome model,
  which aligns cDNAs to the genome using annotation identifying the coding
  regions of the cDNAs.

# Zebrafish genebuild
- Full genebuild on the new Zv9 assembly

# Mouse cDNA update
- Updated set of cDNA alignments to the mouse genome

# Flagging Translation attribute where the evidence was removed
- Add a flag to the translation where a human Ensembl translation used as
  evidence was removed from the current human database.
Species: Sloth, Armadillo, Kangaroo rat, Tenrec, Hedgehog, Cat, Wallaby,
Mouse Lemur, Microbat, Pika, Bushbaby, Chimp, Rock Hyrax, Megabat, Shrew,
Ground Squirrel, Tarsier, Tree Shrew, Dolphin, Alpaca

# Flagging Translation attribute where the Uniprot evidence was removed
- Add a flag to the translation where supporting evidence from UniProt was
  removed from the UniProt database.
Species: Anole lizard, Cow, C.elegans, Marmoset, Dog, Guinea Pig, Sloth,
C.intestinalis, C.savignyi, Zebrafish, Armadillo, Kangaroo rat, Fly, Tenrec,
Horse, Hedgehog, Cat, Chicken, Stickleback, Gorilla, Human, Elephant,
Macaque, Wallaby, Mouse Lemur, Opossum, Mouse, Microbat, Pika, Platypus,
Rabbit, Medaka, Bushbaby, Chimp, Orangutan, Rock Hyrax, Megabat, Rat, Yeast,
Shrew, Ground Squirrel, Pig, Zebra Finch, Fugu, Tarsier, Tetraodon,
Tree Shrew, Dolphin, Alpaca, Xenopus, Panda

# Updating the ENCODE excluded regions
- Update of the ENCODE excluded regions

# Fix duplicate transcript attributes
- Duplicate transcript attributes removed
Species: Anole lizard, Armadillo, Chicken, Human, Mouse, Platypus,
Zebra Finch

# homo_sapiens rnaseq data
- RNA-seq data from transcriptome sequencing done by Illumina on human
  tissues will be provided in a stand-alone database, i.e. with no mart /
  compara relationships.



### Mart

# Ensembl marts for release 60
- Full build of the seven marts: Ensembl Mart, SNP Mart, Functional Genomics
  Mart, Genomic Features Mart, Ontology Mart, Vega Mart, Sequence Mart



### Variation

# Data
- update of UniProt identifier links including phenotype information
- import of new information from NHGRI and EGA Genome Wide Association
  Studies
- import of new data sets for structural variants from DGVa
- import of an expanded data set for all short somatic sequence variants from
  COSMIC
- GVF (Genome Variation Format) dumps for all variants
- update of variant consequences for new human gene set
- update of variant consequences for new zebrafish assembly and gene set
- import of a new set of 150,000 zebrafish variants
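
For readers who have not met GVF before, the sketch below shows one way a
single record from such a dump might be read in Python. It assumes the usual
GFF3-style layout that GVF builds on (nine tab-separated columns with
semicolon-separated tag=value attributes). The example line is made up and
is not taken from an Ensembl dump, and the function name is hypothetical.

```python
GVF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gvf_line(line):
    """Split one GVF record into a dict, with attributes as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GVF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    record = dict(zip(GVF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    record["attributes"] = attributes
    return record


# Made-up record for illustration only.
example = "\t".join([
    "1", "example_source", "SNV", "1000", "1000", ".", "+", ".",
    "ID=1;Variant_seq=A;Reference_seq=G",
])
record = parse_gvf_line(example)
print(record["type"], record["start"], record["attributes"]["Variant_seq"])
```

Header lines starting with "#" are not handled here and would need to be
skipped before calling the function.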

# API and schema change
- schema change for Ensembl Genomes to store the population size for each
  frequency calculation




