[ensembl-dev] Intentions for Ensembl release 62

Daniel Sobral sobral at ebi.ac.uk
Thu Feb 24 13:34:15 GMT 2011

Please see below a list of intentions declared for Ensembl 62 (scheduled
for mid-April).
Note these are intentions and are not guaranteed to be in the release.

Daniel Sobral

Declarations of Intentions - Ensembl 62


Families (all species)
Updated MCL families including all Ensembl transcript isoforms and
newest UniProt Metazoa.
* Clustering by MCL
* Multiple Sequence Alignments with MAFFT
* Family stable ID mapping

Gene Homologies (all species)
GeneTrees (protein-coding) with new/updated genebuilds and assemblies
* Clustering using hcluster_sg
* Multiple sequence alignments using MCoffee
* Phylogenetic reconstruction using TreeBeST
* Homology inference including the recent 'possible_ortholog', 'putative
gene split' and 'contiguous gene split' exceptions
* Pairwise gene-based dN/dS scores for high coverage species pairs only
* GeneTree stable ID mapping

GeneTrees (ncRNA) with new/updated genebuilds and assemblies (all species)
* Classification based on RFAM model
* Multiple sequence alignments with Infernal
* Phylogenetic reconstruction using RAxML
* Additional multiple sequence alignments with Prank (w/ genomic flanks)
* Additional phylogenetic reconstruction using PhyML and NJ
* Phylogenetic tree merging using TreeBeST
* Homology inference

Pairwise Alignments (all species)
* Non-reference alignments for human vs high coverage blastz-net
* human vs gibbon lastz
* human vs marmoset lastz
* human vs rabbit lastz
* xenopus vs mouse tblat-net
* xenopus vs chicken tblat-net
* xenopus vs tetraodon tblat-net
* xenopus vs human tblat-net
* xenopus vs danio tblat-net

Multiple alignments (all species)
* update 6way-primate-epo alignments to incorporate new marmoset 
seq_region names
* update 12way-mammal-epo alignments to incorporate new marmoset 
seq_region names
* update 19way-amniota-pecan alignments to incorporate new marmoset 
seq_region names
* 35way-mammal low-coverage-epo alignments (addition of gibbon and new 
marmoset seq_region names)

schema changes (all species)
* meta.meta_value has been extended to TEXT (previously it was VARCHAR) 
and the corresponding indexes have been fixed.
* analysis.module has been extended to VARCHAR(255) - previously it was 
* mapping_session.prefix column has been added to allow EnsEmblGenomes 
to track their different types of stable_ids


Bio::EnsEMBL::DBFile::FileAdaptor (all species)
A new base class for accessing data from flat files

Bio::EnsEMBL::DBFile::CollectionAdaptor (all species)
A new class to access Collection Feature data stored in flat files.

patch_61_62_a: Schema version patch (all species)
Patch file patch_61_62_a.sql, updates the schema version of a core 
database to 62.
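
For readers who have not looked at a schema version patch before, here is a minimal sketch of the kind of change involved. It is not the contents of patch_61_62_a.sql: it uses an in-memory SQLite database as a stand-in for MySQL, and the simplified meta table layout and the UPDATE statement are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for a core database (assumption: the real
# databases are MySQL and the real meta table has more constraints).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta ("
    " meta_id INTEGER PRIMARY KEY,"
    " species_id INTEGER,"
    " meta_key TEXT NOT NULL,"
    " meta_value TEXT NOT NULL)"
)

# A release 61 database would record its schema version like this.
conn.execute(
    "INSERT INTO meta (species_id, meta_key, meta_value) "
    "VALUES (NULL, 'schema_version', '61')"
)

# Illustrative equivalent of the version bump (assumed statement).
conn.execute(
    "UPDATE meta SET meta_value = '62' WHERE meta_key = 'schema_version'"
)

version = conn.execute(
    "SELECT meta_value FROM meta WHERE meta_key = 'schema_version'"
).fetchone()[0]
print(version)
```

A check of this kind can be useful before running code against a database, since an API and a database from different releases may not agree on the schema.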

patch_61_62_b: synonym field extension (all species)
Patch file patch_61_62_b.sql, extends field synonym in external_synonym 
table to 100 chars.

patch_61_62_c: index for db_name (all species)
Patch file patch_61_62_c.sql adds unique index to db_name field in 
external_db table.

Ontology database (all species)
Database ensembl_ontology_62 with latest available GO, SO, and EFO 
ontologies. Synonyms will now be included in a new 'synonym' table.

Schema diagrams for online documentation (all species)
Schema diagrams for the online core database documentation.

Xrefs (Zebrafish)
Update external database references.

xref projection (all species)
Project GO ids and gene names to species. Make alterations to zebrafish 

EMBL/Genbank dumps (all species)
EMBL & Genbank dumps for all species

patch_61_62_d: remove field display_label_linkable (all species)
Patch file patch_61_62_d.sql removes field display_label_linkable from 
table external_db.

Import of LRG sequences (Human)
Newly published LRG sequences will be imported

Ontology API (all species)
Addition of fetch_all_by_name() method to the OntologyTermAdaptor to 
fetch ontology terms by their names or synonyms. Additional synonym() 
method for OntologyTerm objects to get their synonyms.

xrefs (Human)
Update human external database references.

xrefs (Mouse)
Update external database references


patch_61_62_a Update meta schema version (all species)
meta.schema_version will be updated to 62

patch_61_62_b motif_feature.stable_id (all species)
A stable_id will be added to the motif_feature table. NOTE: This is not 
an 'Ensembl stable ID', and will only be used internally to enable 
inter-DB linking between the variation and funcgen schemas.

patch_61_62_c feature_type Sequence Ontology fields (all species)
so_name and so_accession will be added to the feature_type table to 
enable display of Sequence Ontology information and linking to the 
ensembl_ontology DB

patch_61_62_d: Experimental Group Description (all species)
This change serves to support a better annotation of data sources.

ResultFeature DBFile Collections (Human, Mouse)
Where possible data from the result_feature table has been moved outside 
of the database to indexed binary '.col' files. The ResultFeatureAdaptor 
now uses the new core DBFile::CollectionAdaptor and DBFile::FileAdaptor 
to access these data directly.

Array Mapping (all species)
Genomic and transcript alignments and transcript xref annotation have
been re-run for all species with new genome assemblies or genebuilds.

Illumina Methylation Arrays (Human)
HumanMethylation27K and HumanMethylation450K have now been imported.

Update of Human functional genomics data (Human)
New datasets from ENCODE and the Epigenomics Roadmap, covering existing 
cell lines. The Regulatory Build was rerun for cell lines with new data.

Binding Matrix: simpler representation of matrix frequencies (all species)
This change is intended to make the representation simpler, moving
towards something that can be applied to different formats.

patch_61_62_e Addition of dbfile_registry table (all species)
A dbfile_registry table has been added to store the filepaths of result 
feature collection (.col) files

PolIII Transcription Associated Regulatory Features (all species)
The Regulatory Build now also annotates Regulatory Features associated
with PolIII transcription.


Patch for panda (Panda)
Transcript supporting features added for pseudogenes

Patch for rabbit (Rabbit)
Geneset re-clustered
Transcript supporting features added for pseudogenes
Assembly updated to match the official NCBI one

Patch for mouse (Mouse)
Patched the mouse Ensembl-Havana merged gene set to maintain its 
consistency with the latest CCDS gene set (as of 9 February 2011).

Human Vega annotation (Human)
Manual annotation of human from Havana has been updated. This represents 
the annotation presented in Vega release 42

Patch for marmoset (Marmoset)
Deprecated contig sequences removed
Raw-computes re-run
Geneset re-clustered
Mapping added
Transcript supporting features for pseudogenes added
New seq region synonyms

Human otherfeatures (Human)
Removed EST alignments with hcoverage <90 and perc_ident <94.
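
As a rough illustration of the filter described above, here is a minimal sketch. The record layout and function name are hypothetical, and hcoverage is assumed to be the percentage of the hit sequence covered by the alignment. The announcement does not say whether an alignment had to fall below both thresholds or only one to be removed; this sketch takes the "and" literally.

```python
from typing import NamedTuple


class EstAlignment(NamedTuple):
    """Hypothetical, simplified view of one EST alignment row."""
    hit_name: str
    hcoverage: float   # assumed: percent of the hit covered
    perc_ident: float  # percent identity of the alignment


def keep_alignment(aln, min_hcoverage=90.0, min_perc_ident=94.0):
    """Return False only when BOTH values are below their thresholds.

    Assumption: this is a literal reading of 'hcoverage <90 and
    perc_ident <94'; swap 'and' for 'or' for the stricter reading.
    """
    below_both = (aln.hcoverage < min_hcoverage
                  and aln.perc_ident < min_perc_ident)
    return not below_both


# Made-up example records, not real data.
alignments = [
    EstAlignment("EST_A", 95.0, 98.0),  # passes both thresholds
    EstAlignment("EST_B", 85.0, 90.0),  # below both: removed
    EstAlignment("EST_C", 85.0, 97.0),  # below one only: kept here
]
kept = [a.hit_name for a in alignments if keep_alignment(a)]
print(kept)
```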

GENCODE gene set update (release 7) (Human)
Update to the Ensembl/Havana GENCODE gene set based on a complete 
re-annotation of the Ensembl gene set and combined with the latest Vega 
gene set

Human cDNA update (Human)
New cDNA db for human.

GRCh37.p3 (Human)
Adding the third patch release for the human assembly. This alters the 
assembly information in all human databases.

GRCh37.p3 annotation (Human)
Annotation of the patches in the other features db.

Gibbon build (Gibbon)
First release of gene build for Gibbon, Nomascus leucogenys (Northern 
white-cheeked gibbon). Assembly: Nleu1.0.

Zebrafish WGS/clone assembly track (Zebrafish)
Added a WGS/clone assembly track.

Flagging obsolete UniProt proteins (all species)
Flagging Transcript attribute where the UniProt evidence was removed

Flagging obsolete Ensembl proteins (Sloth, Armadillo, Kangaroo rat, 
Tenrec, Hedgehog, Cat, Wallaby, Mouse Lemur, Pika, Bushbaby, Chimp, 
Orangutan, Rock Hyrax, Megabat, Shrew, Ground Squirrel, Tarsier, Tree 
Shrew, Dolphin, Alpaca)
Flagging Transcript attribute where the evidence was removed from 2x genomes

Mouse RefSeq import (Mouse)
RefSeq annotations imported into the mouse otherfeatures database

Xenopus tropicalis new assembly 4.2 (Xenopus)
New assembly of Xenopus tropicalis version 4.2

Human Body Map missing liver (Human)
Add the liver models

Mouse cDNA update (Mouse)
New cDNA db for mouse.

Updated human otherfeatures db: new CCDS import (Human)
Update to CCDS set for human

Updated mouse otherfeatures db: New CCDS import (Mouse)
Update to CCDS set for mouse


Mart databases (all species)
Full build of all 7 marts for all species


New variation consequences (all species)
New variation consequences due to a schema change linking consequences 
to allele and transcript rather than just to a variation and transcript

HGVS coordinates stored in database (all species)
HGVS coordinates for variant alleles will be pre-calculated and stored 
in the database. These were previously calculated on the fly.

New variation database (Human)
The human variation database will be built fresh from dbSNP release 132 
due to data updates by dbSNP.

Data import/update from external sources (Human)
Allele frequencies from 1000 Genomes Project. Variation submissions on 
LRGs from UniProt. Structural variation data from DGVa. Somatic mutation 
data from Cosmic. Variation phenotype data from OMIM, NHGRI, UniProt and 
EGA. Variation synonyms from UniProt.

New variation database (Mouse)
Fresh build from dbSNP 132.

Data import/update from external sources (Dog, Mouse, Pig)
Structural variation data from DGVa.

patch_61_62_a: Meta schema version (all species)
Meta schema version update

patch_61_62_b: Alter failed_variation (all species)
Drop the subsnp_id column from failed_variation

patch_61_62_c: Introduce failed_allele table (all species)
Add a table to store failed alleles

patch_61_62_d: Add type column to source table (all species)
Introduce a type column (enum) to indicate the type of a source

patch: Table to store study data (all species)
A new table to store description of studies will be introduced and 
foreign keys to this table will be introduced in variation_annotation 
and structural_variation tables.

patch: Rationalize data type for allele columns (all species)
The data type of allele columns in e.g. allele, variation and 
variation_feature will be harmonized to use varchar.

patch: Table to store supporting structural variations (all species)
A new table to store supporting structural variations will be introduced

patch: Table to store variation consequences on regulatory regions (all 
A table to support storing variation consequences on regulatory regions 
will be introduced

patch: Re-design of the transcript_variation table (all species)
Variation consequences will be stored by allele instead of by variation. 
The transcript_variation table will be modified to accommodate this. In 
addition, HGVS coordinates will be stored as well.
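
To make the difference between per-variation and per-allele storage concrete, here is a small sketch. It does not reflect the actual transcript_variation columns: the tuple layout, identifiers and consequence labels are all placeholders invented for illustration.

```python
from collections import defaultdict

# Placeholder rows: (variation, transcript, allele, consequence).
# None of these names are real identifiers or consequence terms.
rows = [
    ("variation_1", "transcript_1", "A", "CONSEQUENCE_X"),
    ("variation_1", "transcript_1", "T", "CONSEQUENCE_Y"),
    ("variation_1", "transcript_2", "A", "CONSEQUENCE_X"),
]

# Per-allele model: the key includes the allele, so two alleles of the
# same variation can carry different consequences on one transcript.
per_allele = defaultdict(set)
for variation, transcript, allele, consequence in rows:
    per_allele[(variation, transcript, allele)].add(consequence)


def consequences_for_variation(variation, transcript):
    """Collapse back to a per-variation view by merging all alleles."""
    merged = set()
    for (v, t, _allele), consequences in per_allele.items():
        if v == variation and t == transcript:
            merged |= consequences
    return merged


print(sorted(consequences_for_variation("variation_1", "transcript_1")))
```

The per-variation view can always be derived from the per-allele one, but not the other way round, which is presumably why the finer-grained key is the one being stored.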

patch: Drop somatic column from source table (all species)
The somatic column will be dropped from source and instead introduced in 
the variation table.

API changes (all species)
The API will be updated to accommodate schema patches.

SIFT and PolyPhen consequences (all species)
Non-synonymous coding consequences evaluated by SIFT and PolyPhen will 
be calculated

Add a variation set for variations flagged as failed (Cat, Opossum, Pig, 
Zebra Finch, Tetraodon)
Variations that have been flagged as failed will be grouped in a 
variation set named 'Failed variations'


Support for BigWig format (all species)
In addition to BAM format, the Ensembl website now supports attachment
of BigWig data via URL. Click on "Manage Your Data" then select "Attach
Remote File" from the left-hand menu.

Export data on structural variation (all species)
Enabling data to be exported for the variation page (same
functionality as on location, gene and transcript).

Export on Karyotype (all species)
We will try to make the karyotype exportable to PDF and other formats,
using an Export button just below the karyotype image.

BED Format export (all species)
Adding BED format to the export functionality on location, genes, 
transcript and variation.
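
One detail worth keeping in mind when consuming BED output: BED uses 0-based starts with exclusive ends, whereas Ensembl coordinates are 1-based and inclusive. The sketch below shows that conversion only. The function name and the choice of six columns are assumptions, and it is not the website's export code.

```python
def feature_to_bed(seq_region, start, end, name=".", score=0, strand=0):
    """Format one feature as a six-column BED line.

    start and end are 1-based inclusive (Ensembl style); BED wants a
    0-based start and an exclusive end, so only the start shifts by one.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Ensembl strands are 1 / -1; anything else is written as unknown.
    bed_strand = {1: "+", -1: "-"}.get(strand, ".")
    fields = [seq_region, str(start - 1), str(end), name, str(score), bed_strand]
    return "\t".join(fields)


# Made-up feature, not a real gene location.
line = feature_to_bed("13", 1001, 2000, name="example_gene", strand=1)
print(line)
```

Note that the feature length is unchanged by the conversion: 2000 - 1001 + 1 in inclusive coordinates equals 2000 - 1000 in BED coordinates.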

Highlighting row in feature table for variation (all species)
When clicking on a SNP on the karyotype for phenotype, the corresponding 
row (variation) is highlighted in the feature table


Rebuild otherfeatures database for Yeast (Yeast)
Rebuild otherfeatures database.

Rerun Xrefs pipeline for Yeast (Yeast)
Update the external_db table, and rerun the xrefs pipeline

New variation saccharomyces_cerevisiae database (Yeast)
Provide the variation saccharomyces_cerevisiae database

New funcgen saccharomyces_cerevisiae database (Yeast)
Provide the funcgen saccharomyces_cerevisiae database

BLAT patch (C.elegans)
For aesthetic reasons, we will flip the strand of paired 3'-ESTs.
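
A minimal sketch of what such a strand flip could look like, assuming each alignment record carries a flag marking it as the 3' read of a pair. The record type, field names and function name are hypothetical and not taken from the actual patch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EstHit:
    """Hypothetical, simplified EST alignment record."""
    name: str
    strand: int                  # 1 or -1
    is_paired_three_prime: bool  # assumed flag; not a real column name


def flip_paired_three_prime(hits):
    """Return new records with the strand negated for paired 3'-ESTs."""
    return [
        replace(h, strand=-h.strand) if h.is_paired_three_prime else h
        for h in hits
    ]


# Made-up records for illustration.
hits = [
    EstHit("pair1.5", 1, False),
    EstHit("pair1.3", -1, True),
]
flipped = flip_paired_three_prime(hits)
print([h.strand for h in flipped])
```

After the flip both reads of the pair sit on the same strand, so they display in the same orientation as the transcript they come from.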
