[ensembl-dev] Transcript Biotype

Kieron Taylor ktaylor at ebi.ac.uk
Tue Apr 23 10:19:27 BST 2019

> On 11 Apr 2019, at 11:36, Olson, Andrew <olson at cshl.edu> wrote:
> Is there a mapping between biotypes and Sequence Ontology terms?
> Andrew

Hi Andrew,

For several years now, the Havana annotation team (the originators of "biotype") have collaborated with the Sequence Ontology maintainers to keep the two nomenclatures approximately in parity. Where possible, a biotype should exactly match the name of the corresponding SO term (the term name as a string, not the SO: accession). Sadly, the two vocabularies are not in complete agreement.

Ensembl now maintains a mapping between the two, which is accessible in several ways:

1) The REST API: https://rest.ensembl.org/documentation/info/biotypes_name
2) The master_biotype table in the ensembl_production database, as found on our public MySQL servers
3) The Perl API, via the BiotypeAdaptor: https://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1DBSQL_1_1BiotypeAdaptor.html
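
As a rough illustration of the REST route, here is a minimal Python sketch that looks up one biotype by name. The URL path /info/biotypes/name/<name> is inferred from the documentation page linked above and should be treated as an assumption, as should the helper names biotype_url and fetch_biotype, which are made up for this example. The response fields are deliberately not interpreted here; check the documentation for their exact names.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def biotype_url(name, server=REST_SERVER):
    """Build the lookup URL for one biotype name.

    The path layout is an assumption based on the REST documentation page;
    verify it there before relying on it.
    """
    quoted = urllib.parse.quote(name)
    return f"{server}/info/biotypes/name/{quoted}?content-type=application/json"


def fetch_biotype(name):
    """Fetch and decode the JSON response for a biotype (needs network access)."""
    with urllib.request.urlopen(biotype_url(name)) as response:
        return json.load(response)
```

Calling fetch_biotype("protein_coding") returns whatever JSON the server sends for that biotype, which you can then inspect for the SO accession.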

I have attached the results from release 96 below to illustrate. The functionality was only added in the last few releases, so it is not available in the majority of archives. The mappings are fairly stable, so you can probably apply the current biotype/SO-term mappings to older data. In release 97, the SO term name will be present in the REST API responses along with the accession.

Hopefully that is useful to you and other users.


Kieron Taylor

Ensembl Developer

SELECT name, so_acc FROM master_biotype;

IG_C_gene	SO:0001217
IG_C_gene	SO:0000478
IG_D_gene	SO:0001217
IG_D_gene	SO:0000458
IG_J_gene	SO:0001217
IG_J_gene	SO:0000470
IG_J_pseudogene	SO:0000336
IG_J_pseudogene	SO:0000516
IG_V_gene	SO:0001217
IG_V_gene	SO:0000466
IG_V_pseudogene	SO:0000336
IG_V_pseudogene	SO:0000516
IG_gene	SO:0001217
IG_gene	SO:3000000
IG_pseudogene	SO:0000336
IG_pseudogene	SO:0000516
Mt_rRNA	SO:0001263
Mt_rRNA	SO:0000252
Mt_tRNA	SO:0001263
Mt_tRNA	SO:0000253
Mt_tRNA_pseudogene	SO:0000336
Mt_tRNA_pseudogene	SO:0000516
RNA-Seq_gene	NULL
RNA-Seq_gene	NULL
TEC	SO:0002139
TR_gene	SO:0001217
TR_gene	SO:3000000
TR_pseudogene	SO:0000336
TR_pseudogene	SO:0000516
ambiguous_orf	SO:0001877
cdna_update	NULL
cdna_update	NULL
cdna	SO:0000756
cdna	SO:0000756
disrupted_domain	SO:0000516
est	NULL
est	SO:0000345
lincRNA	SO:0001263
lincRNA	SO:0001877
miRNA	SO:0001263
miRNA	SO:0000276
miRNA_pseudogene	SO:0000336
miRNA_pseudogene	SO:0000516
misc_RNA	SO:0001263
misc_RNA	SO:0000655
misc_RNA_pseudogene	SO:0000336
misc_RNA_pseudogene	SO:0000516
ncRNA	SO:0001263
ncRNA	SO:0000655
non_coding	SO:0001263
non_coding	SO:0001877
nonsense_mediated_decay	SO:0000234
polymorphic	SO:0001217
polymorphic_pseudogene	SO:0001217
polymorphic_pseudogene	SO:0000234
processed_pseudogene	SO:0000336
processed_pseudogene	SO:0000516
processed_transcript	SO:0001263
processed_transcript	SO:0001877
protein_coding	SO:0001217
protein_coding	SO:0000234
pseudogene	SO:0000336
pseudogene	SO:0000516
rRNA	SO:0001263
rRNA	SO:0000252
rRNA_pseudogene	SO:0000336
rRNA_pseudogene	SO:0000516
retained_intron	SO:0001877
retrotransposed	SO:0000569
retrotransposed	SO:0000569
scRNA_pseudogene	SO:0000336
scRNA_pseudogene	SO:0000516
snRNA	SO:0001263
snRNA	SO:0000274
snRNA_pseudogene	SO:0000336
snRNA_pseudogene	SO:0000516
snlRNA	SO:0001263
snlRNA	SO:0000274
snoRNA	SO:0001263
snoRNA	SO:0000275
snoRNA_pseudogene	SO:0000336
snoRNA_pseudogene	SO:0000516
tRNA	SO:0001263
tRNA	SO:0000253
tRNA_pseudogene	SO:0000336
tRNA_pseudogene	SO:0000516
transcribed_processed_pseudogene	SO:0000336
transcribed_processed_pseudogene	SO:0000516
transcribed_unitary_pseudogene	SO:0000336
transcribed_unprocessed_pseudogene	SO:0000336
transcribed_unprocessed_pseudogene	SO:0000516
unitary_pseudogene	SO:0000336
unitary_pseudogene	SO:0000516
unprocessed_pseudogene	SO:0000336
unprocessed_pseudogene	SO:0000516
ccds_gene	NULL
protein_coding_in_progress	NULL
IG_Z_gene	SO:0001217
IG_M_gene	SO:0001217
ncRNA_host	NULL
TR_V_pseudogene	SO:0000336
TR_V_gene	SO:0001217
IG_C_pseudogene	SO:0000336
TR_C_gene	SO:0001217
TR_J_gene	SO:0001217
TR_V_pseudogene	SO:0000516
TR_V_gene	SO:0000466
IG_C_pseudogene	SO:0000516
TR_C_gene	SO:0000478
TR_J_gene	SO:0000470
protein_coding_in_progress	NULL
IG_M_gene	SO:3000000
IG_Z_gene	SO:3000000
3prime_overlapping_ncRNA	SO:0002120
antisense_RNA	SO:0001263
antisense_RNA	SO:0001877
scRNA	SO:0001263
scRNA	SO:0000013
RNase_MRP_RNA	SO:0001263
RNase_MRP_RNA	SO:0000385
RNase_P_RNA	SO:0001263
RNase_P_RNA	SO:0000386
telomerase_RNA	SO:0001263
telomerase_RNA	SO:0000390
sense_intronic	SO:0001877
sense_overlapping	SO:0001877
sense_intronic	SO:0001263
ambiguous_orf	SO:0001263
retained_intron	SO:0001263
3prime_overlapping_ncRNA	NULL
ncRNA_host	NULL
sense_overlapping	SO:0001263
TR_D_gene	SO:0000458
TR_J_pseudogene	SO:0000516
TR_D_gene	SO:0001217
TR_J_pseudogene	SO:0000336
ncbi_pseudogene	SO:0000336
ncbi_pseudogene	SO:0000516
ncbigene	NULL
non_stop_decay	SO:0000234
pre_miRNA	SO:0001244
tmRNA	SO:0001263
tmRNA	SO:0000584
SRP_RNA	SO:0001263
SRP_RNA	SO:0000590
ribozyme	SO:0001877
ncRNA_pseudogene	SO:0000336
ncRNA_pseudogene	SO:0000516
IG_LV_gene	SO:0001217
IG_LV_gene	SO:3000000
translated_processed_pseudogene	SO:0000336
translated_processed_pseudogene	SO:0000516
nontranslating_CDS	SO:0001217
nontranslating_CDS	SO:0000234
translated_unprocessed_pseudogene	SO:0000336
translated_unprocessed_pseudogene	SO:0000516
mRNA	SO:0001217
mRNA	SO:0000234
pre_miRNA	SO:0001263
artifact	NULL
artifact	NULL
lncRNA	SO:0001263
class_I_RNA	SO:0001263
class_I_RNA	SO:0000990
class_II_RNA	SO:0001263
class_II_RNA	SO:0000989
known_ncRNA	NULL
known_ncRNA	SO:0000655
transcribed_unitary_pseudogene	SO:0000516
piRNA	SO:0001263
piRNA	SO:0001035
IG_D_pseudogene	SO:0000336
macro_lncRNA	SO:0001263
vaultRNA	SO:0001263
scaRNA	SO:0001263
scaRNA	SO:0000013
sRNA	SO:0000274
sRNA	SO:0001263
CRISPR	SO:0001263
CRISPR	SO:0001459
antitoxin	SO:0001877
antitoxin	SO:0001263
ribozyme	SO:0001263
vaultRNA	SO:0002040
macro_lncRNA	SO:0001877
IG_D_pseudogene	SO:0000516
guide_RNA	SO:0001263
guide_RNA	SO:0000602
Y_RNA	SO:0001263
Y_RNA	SO:0000405
transposable_element	SO:0000101
transposable_element	SO:0000111
bidirectional_promoter_lncRNA	NULL
bidirectional_promoter_lncRNA	SO:0002185
unknown_likely_coding	NULL
unknown_likely_coding	NULL
other	NULL
lncRNA	SO:0001877
aligned_transcript	NULL
aligned_transcript	NULL
antisense	SO:0001263
antisense	SO:0001877
vault_RNA	SO:0001263
vault_RNA	SO:0001877
rnaseq_putative_cds	NULL
rnaseq_putative_cds	NULL
transcribed_pseudogene	NULL
transcribed_pseudogene	NULL
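
The listing above contains repeated biotype names and NULL entries, so it may be convenient to collapse it into a lookup table before use. Here is a minimal Python sketch, assuming the rows are whitespace-separated name/accession pairs as pasted; the function name parse_biotype_rows is made up for this example. Many names appear twice, presumably once per object type (gene and transcript), but that reading is an assumption since the query did not select that column.

```python
from collections import defaultdict


def parse_biotype_rows(text):
    """Collapse (name, so_acc) rows into {biotype name: set of SO accessions}.

    NULL accessions are dropped, but the biotype is still kept as a key so
    that unmapped biotypes remain visible with an empty set.
    """
    mapping = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, so_acc = fields
        accessions = mapping[name]  # registers the name even if unmapped
        if so_acc != "NULL":
            accessions.add(so_acc)
    return dict(mapping)


# A few rows copied from the release 96 listing
sample = """\
protein_coding\tSO:0001217
protein_coding\tSO:0000234
est\tNULL
est\tSO:0000345
cdna_update\tNULL
cdna_update\tNULL
"""
mapping = parse_biotype_rows(sample)
```

Biotypes whose set ends up empty (cdna_update in the sample) are the ones with no SO mapping in this release.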
