[ensembl-dev] How transcriptome fasta files are created 2.0
Julien Wollbrett
julien.wollbrett at unil.ch
Thu Oct 4 11:54:17 BST 2018
Hello,
Thank you Premanand for this extremely useful answer.
If I understand correctly, one fasta file is created using the _patch_hapl_scaff.gtf.gz file for mouse and human, or the gtf.gz file for other species. This file is then split by gene biotype to create the cdna, cds, ncrna, and pep fasta files.
Do you know where I can find the list of biotypes that go into each fasta file?
In my previous email I reported 9 transcripts (FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081, FBtr0084085, FBtr0084083, FBtr0084084, FBtr0084080, FBtr0084082) that are present in the cdna.all.fa.gz file but not in the gtf.gz file for D. melanogaster (release 93). All of these transcripts belong to the same gene (FBgn0002781).
Do you know where the annotation of these transcripts comes from?
Thanks,
Julien
On 03.10.18 at 16:39, Premanand Achuthan wrote:
Hi Julien
Apologies for the delay. This may answer both your previous and current questions.
The split in the gtf files is based on the chromosomal regions. Please also note that the “Homo_sapiens.GRCh38.84.gtf” file includes all the top-level chromosomal regions (1..22, X, Y, MT) as well as the scaffolds, as you can see below.
grep -v '^#' Homo_sapiens.GRCh38.84.gtf | cut -f1 | sort -u
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
GL000008.2
GL000009.2
GL000194.1
GL000195.1
GL000205.2
GL000213.1
GL000216.2
GL000218.1
GL000219.1
GL000220.1
GL000224.1
GL000225.1
KI270442.1
KI270706.1
KI270707.1
KI270708.1
KI270711.1
KI270713.1
KI270714.1
KI270721.1
KI270722.1
KI270723.1
KI270724.1
KI270726.1
KI270727.1
KI270728.1
KI270731.1
KI270733.1
KI270734.1
KI270741.1
KI270743.1
KI270744.1
KI270750.1
KI270752.1
MT
X
Y
Also, please note that 'Homo_sapiens.GRCh38.84.chr.gtf.gz' contains features from the primary assembly only.
'Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz' additionally contains the features from haplotypes and patches (for human and mouse only).
More info about haplotype and patches:
https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
The split in the fasta files is based on the biotype, so we have cdna, cds, ncrna, pep, etc. (http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/).
If you are interested only in the primary assembly, then you should use the chr.gtf file and ignore any accessions in the cdna fasta that are not in the gtf, as the cdna fasta includes haplotypes.
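One way to ignore those extra accessions is to filter the cdna fasta down to the transcript IDs found in the chr.gtf file. The following is only a minimal sketch: `filter_fasta` and `strip_version` are hypothetical names, and it assumes that the transcript ID is the first whitespace-delimited token of each fasta header, optionally followed by a `.version` suffix.

```python
def filter_fasta(lines, keep_ids, strip_version=False):
    """Yield only the FASTA lines of records whose header ID is in keep_ids.

    lines         -- iterable of text lines from a FASTA file
    keep_ids      -- set of transcript IDs, e.g. collected from the
                     transcript_id attributes of the chr.gtf file
    strip_version -- assumption: drop a trailing ".N" version suffix
                     from the header ID before the lookup
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            tid = fields[0] if fields else ""
            if strip_version:
                tid = tid.rsplit(".", 1)[0]
            # Decide once per record; sequence lines inherit the decision.
            keep = tid in keep_ids
        if keep:
            yield line
```

With real data, `lines` could be a file handle returned by `gzip.open(path, "rt")`. Version stripping is left off by default because some species use transcript IDs that themselves contain dots.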
Hope it helps.
Thanks
Prem
On 03/10/2018 15:07, Julien Wollbrett wrote:
Hello,
As I did not receive an answer to my previous email, I will try to describe my goals and questions in more detail.
I generated my own transcriptome fasta file using the gtf file from Ensembl, the genome fasta file from Ensembl (same release), and a tool such as gtf_to_fasta (TopHat) or gffread (Cufflinks). I then compared the generated file with the cdna transcriptome fasta file available on the Ensembl ftp. Unfortunately, I found some differences between my transcriptome fasta file and the one provided by Ensembl, so I tried to determine where these differences come from.
All my tests were run on different species (human, D. melanogaster, ...) and different releases (84 and 93).
I used the approach described below to identify the differences:
- take all transcript_ids from the Ensembl transcriptome fasta file
- take all transcript_ids from the Ensembl gtf file
- count the transcripts common to both files and the transcripts specific to each file
- map each transcript_id to the gtf annotation and determine the gene_biotype associated with the transcripts of each file.
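The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the exact scripts used for the numbers in this email: all function names are made up, the fasta transcript ID is assumed to be the first token of each header line (optionally with a `.version` suffix), and the gtf attributes are assumed to use the `transcript_id "..."` and `gene_biotype "..."` form.

```python
import re
from collections import Counter

# Attribute patterns assumed for the 9th GTF column.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def fasta_transcript_ids(lines, strip_version=False):
    """Return the set of IDs taken from the first token of each header."""
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        tid = fields[0]
        if strip_version:
            # Assumption: a trailing ".N" is a version, not part of the ID.
            tid = tid.rsplit(".", 1)[0]
        ids.add(tid)
    return ids


def gtf_transcript_biotypes(lines):
    """Return a dict mapping transcript_id to gene_biotype."""
    biotypes = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        tid = TRANSCRIPT_ID_RE.search(cols[8])
        if tid is None:
            continue  # e.g. gene rows, which carry no transcript_id
        biotype = GENE_BIOTYPE_RE.search(cols[8])
        biotypes[tid.group(1)] = biotype.group(1) if biotype else "unknown"
    return biotypes


def compare(fasta_ids, gtf_biotypes):
    """Split IDs into common / gtf-only / fasta-only and tally biotypes."""
    gtf_ids = set(gtf_biotypes)
    gtf_only = gtf_ids - fasta_ids
    return {
        "common": fasta_ids & gtf_ids,
        "gtf_only": gtf_only,
        "fasta_only": fasta_ids - gtf_ids,
        "gtf_only_biotypes": Counter(gtf_biotypes[t] for t in gtf_only),
    }
```

With real files, the line iterables could come from `gzip.open(path, "rt")` on the .gtf.gz and .fa.gz downloads. Transcripts found only in the fasta file cannot be assigned a biotype this way, since they have no gtf record to look up.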
Results for human, release 84:
gtf file: http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
fasta file: http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
transcripts common to both files: 161150
transcripts present only in gtf: 38034
transcripts present only in fasta file: 15091
number of different gene biotypes for transcripts present in gtf: 44
number of different gene biotypes for transcripts present in fasta file: 23
list of biotypes present only in gtf and their counts:
gene_biotype                   freq
3prime_overlapping_ncrna       32
antisense                      10183
bidirectional_promoter_lncrna  5
lincRNA                        12648
macro_lncRNA                   1
miRNA                          4198
misc_RNA                       2306
Mt_rRNA                        2
Mt_tRNA                        22
non_coding                     3
processed_transcript           2760
ribozyme                       8
rRNA                           549
scaRNA                         49
sense_intronic                 978
sense_overlapping              334
snoRNA                         961
snRNA                          1905
sRNA                           20
TEC                            1069
vaultRNA                       1
All 38034 transcripts present only in the gtf have a gene_biotype that does not appear in the Ensembl cdna fasta file.
Results for D. melanogaster, release 93:
gtf file: http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
fasta file: http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
transcripts common to both files: 30819
transcripts present only in gtf: 3948
transcripts present only in fasta file: 9
number of different gene biotypes for transcripts present in gtf: 8
number of different gene biotypes for transcripts present in fasta file: 2
list of biotypes present only in gtf and their counts:
gene_biotype  freq
ncRNA         2941
pre_miRNA     259
rRNA          115
snoRNA        289
snRNA         32
tRNA          312
All 3948 transcripts present only in the gtf have a gene_biotype that does not appear in the Ensembl cdna fasta file.
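As a quick consistency check on the numbers above, the per-biotype counts can be summed and compared with the totals of gtf-only transcripts. The sketch below simply retypes the two tables as Python dictionaries; the names `HUMAN_R84`, `FLY_R93` and `total` are made up for illustration.

```python
# Counts copied from the two "present only in gtf" tables in this email.
HUMAN_R84 = {
    "3prime_overlapping_ncrna": 32, "antisense": 10183,
    "bidirectional_promoter_lncrna": 5, "lincRNA": 12648,
    "macro_lncRNA": 1, "miRNA": 4198, "misc_RNA": 2306,
    "Mt_rRNA": 2, "Mt_tRNA": 22, "non_coding": 3,
    "processed_transcript": 2760, "ribozyme": 8, "rRNA": 549,
    "scaRNA": 49, "sense_intronic": 978, "sense_overlapping": 334,
    "snoRNA": 961, "snRNA": 1905, "sRNA": 20, "TEC": 1069,
    "vaultRNA": 1,
}
FLY_R93 = {
    "ncRNA": 2941, "pre_miRNA": 259, "rRNA": 115,
    "snoRNA": 289, "snRNA": 32, "tRNA": 312,
}


def total(counts):
    """Sum the transcript counts over all biotypes in one table."""
    return sum(counts.values())


print(total(HUMAN_R84), len(HUMAN_R84))  # 38034 transcripts, 21 biotypes
print(total(FLY_R93), len(FLY_R93))      # 3948 transcripts, 6 biotypes
```

The sums match the gtf-only totals (38034 and 3948), and the numbers of listed biotypes match the differences between the biotype counts of the two files (44 - 23 = 21 for human, 8 - 2 = 6 for D. melanogaster).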
Could someone please explain to me:
1. Why are all the transcripts with these gene biotypes removed during the creation of the transcriptome?
2. Where do the transcripts present in the transcriptome fasta file but not in the gtf file (15091 in human, 9 in D. melanogaster) come from?
3. How is the cdna transcriptome fasta file generated?
4. Should I generate my own transcriptome fasta file or use the Ensembl cdna fasta file?
Sorry for such a long email, and thank you for your answers.
Best Regards,
Julien Wollbrett
_______________________________________________
Dev mailing list Dev at ensembl.org
Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
Ensembl Blog: http://www.ensembl.info/