[ensembl-dev] How transcriptome fasta files are created 2.0

Julien Wollbrett julien.wollbrett at unil.ch
Wed Oct 3 15:07:19 BST 2018


Hello,

Since I did not receive an answer to my previous email, I will try to describe my goals and questions in more detail.

I generated my own transcriptome FASTA file using a GTF file from Ensembl, the genome FASTA file from Ensembl (same release), and a tool such as gtf_to_fasta (TopHat) or gffread (Cufflinks). I then compared the generated file with the cDNA transcriptome FASTA file available on the Ensembl FTP site. Unfortunately, I found differences between my transcriptome FASTA file and the one provided by Ensembl, so I tried to determine where these differences come from.
All my tests were run on several species (human, D. melanogaster, ...) and on different releases (84 and 93).

I used the following approach to identify the differences:
- take all transcript_ids from the Ensembl transcriptome FASTA file
- take all transcript_ids from the Ensembl GTF file
- count the transcripts common to both files and the transcripts specific to each file
- map each transcript_id to the GTF annotation and record the gene_biotype associated with the transcripts of each file.
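
For anyone who wants to reproduce the comparison, the steps above could be sketched roughly as follows in Python. This is only an illustration under some assumptions, not the exact script I ran: the function names (open_text, fasta_transcript_ids, gtf_transcript_biotypes, compare) are made up for this example, the transcript ID is assumed to be the first token of each FASTA header, and strip_version=True assumes that the FASTA IDs carry a ".N" version suffix that the GTF transcript_id attribute lacks (do not use it for species whose IDs themselves contain dots).

```python
import gzip
import re
from collections import Counter

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_transcript_ids(path, strip_version=False):
    """Collect the first token of every FASTA header line as a transcript ID."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            tokens = line[1:].split()
            if not tokens:
                continue
            tid = tokens[0]
            if strip_version:
                # Assumption: everything after the first dot is a version suffix.
                tid = tid.split(".")[0]
            ids.add(tid)
    return ids


def gtf_transcript_biotypes(path):
    """Map transcript_id to gene_biotype using the 'transcript' feature lines."""
    biotypes = {}
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "transcript":
                continue
            attrs = dict(ATTR_RE.findall(fields[8]))
            tid = attrs.get("transcript_id")
            if tid:
                biotypes[tid] = attrs.get("gene_biotype", "unknown")
    return biotypes


def compare(fasta_path, gtf_path, strip_version=False):
    """Count shared and file-specific transcripts, plus biotypes of GTF-only ones."""
    fasta_ids = fasta_transcript_ids(fasta_path, strip_version)
    biotypes = gtf_transcript_biotypes(gtf_path)
    gtf_ids = set(biotypes)
    only_gtf = gtf_ids - fasta_ids
    return {
        "common": len(gtf_ids & fasta_ids),
        "only_gtf": len(only_gtf),
        "only_fasta": len(fasta_ids - gtf_ids),
        "only_gtf_biotypes": Counter(biotypes[t] for t in only_gtf),
    }
```

Calling compare() with the paths of the downloaded cdna FASTA file and GTF file returns the three counts and a per-biotype tally of the transcripts found only in the GTF.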

Results for human, release 84:

gtf file : http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
fasta file : http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

transcripts common to both files : 161150
transcripts present only in gtf : 38034
transcripts present only in fasta file : 15091
number of different gene biotypes for transcripts present in gtf: 44
number of different gene biotypes for transcripts in fasta file : 23
list of biotypes present only in gtf and their count :

gene_biotype                    freq
3prime_overlapping_ncrna          32
antisense                      10183
bidirectional_promoter_lncrna      5
lincRNA                        12648
macro_lncRNA                       1
miRNA                           4198
misc_RNA                        2306
Mt_rRNA                            2
Mt_tRNA                           22
non_coding                         3
processed_transcript            2760
ribozyme                           8
rRNA                             549
scaRNA                            49
sense_intronic                   978
sense_overlapping                334
snoRNA                           961
snRNA                           1905
sRNA                              20
TEC                             1069
vaultRNA                           1

All 38034 transcripts present only in the GTF file have a gene_biotype that is absent from the Ensembl transcriptome FASTA file.

Results for D. melanogaster, release 93:

gtf file : http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
fasta file : http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz

transcripts in both files : 30819
transcripts present only in gtf : 3948
transcripts present only in fasta : 9
number of different gene biotypes for transcripts present in gtf: 8
number of different gene biotypes for transcripts present in fasta file : 2
list of biotypes present only in gtf and their count :

gene_biotype    freq
ncRNA           2941
pre_miRNA        259
rRNA             115
snoRNA           289
snRNA             32
tRNA             312

All 3948 transcripts present only in the GTF file have a gene_biotype that is absent from the Ensembl transcriptome FASTA file.
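
The observation that GTF-only transcripts never share a gene_biotype with transcripts found in the FASTA file can be checked with a small helper like the one below. It is a sketch only: biotype_overlap is a hypothetical name, biotypes is assumed to be a dict mapping each GTF transcript_id to its gene_biotype, and fasta_ids is assumed to be the set of transcript IDs taken from the FASTA headers (in the same ID format as the GTF).

```python
def biotype_overlap(biotypes, fasta_ids):
    """Return the biotypes seen both among GTF-only transcripts and among
    transcripts that also occur in the FASTA file."""
    # Biotypes of GTF transcripts that are also present in the FASTA file.
    shared = {b for t, b in biotypes.items() if t in fasta_ids}
    # Biotypes of GTF transcripts that are missing from the FASTA file.
    gtf_only = {b for t, b in biotypes.items() if t not in fasta_ids}
    return shared & gtf_only
```

An empty result means the two groups of transcripts have no gene_biotype in common, which is what I observed for both species.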


Could someone please explain to me:

    1. Why are all the transcripts with these gene biotypes removed when the transcriptome is created?
    2. Where do the transcripts present in the transcriptome FASTA file but not in the GTF file (15091 in human, 9 in D. melanogaster) come from?
    3. How is the cDNA transcriptome FASTA file generated?
    4. Should I generate my own transcriptome FASTA file or use the Ensembl cDNA FASTA file?

Sorry for such a long email, and thank you for your answers.

Best Regards,


Julien Wollbrett



