[ensembl-dev] How transcriptome fasta files are created 2.0
Premanand Achuthan
prem at ebi.ac.uk
Thu Oct 4 15:10:16 BST 2018
Hi Julien,
I checked and can confirm that those transcripts are missing from the GTF
file in the current release as well.
One thing common to all 9 missing transcripts is that they are all
"trans-spliced" transcripts.
http://www.ensembl.org/Drosophila_melanogaster/Transcript/Summary?db=core;g=FBgn0002781;r=3R:21375060-21377399;t=FBtr0307759
Scroll down to the end of the page, where it says:
"Trans-spliced: This is a trans-spliced transcript" (a single RNA
transcript derived from multiple precursor mRNAs).
One possible explanation is that, because it is difficult to represent
trans-spliced transcripts (a single transcript with multiple parents) in
the standard GTF file format, they might have been skipped.
In any case, I have added a Jira ticket to look into them in detail.
Thanks
Prem
On 04/10/2018 13:20, Julien Wollbrett wrote:
> Hi Premanand,
>
> By "where the annotation of these transcripts comes from" I meant to ask
> why these transcripts are absent from the gtf but present in the
> cdna.all.fa.
> I would have expected every transcript in the fasta file to be present
> in the gtf file as well.
>
> Julien
>
>
> On 04.10.18 at 13:19, Premanand Achuthan wrote:
>>
>> Thanks Julien,
>>
>> To get more info about the biotypes, use our REST endpoints and play
>> with the filtering parameters.
>>
>> http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json
>> (list of available biotype groups)
>>
>> http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json
>> (list within the coding group)
>>
>> http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json
>> (list within the coding group for gene object type)
>>
>> To get more info about the transcripts, use our lookup endpoint.
>>
>> eg:
>>
>> http://rest.ensembl.org/lookup/id/FBtr0307760?content-type=application/json
>>
>> {
>>   "Parent": "FBgn0002781",
>>   "display_name": "mod(mdg4)-RAE",
>>   "db_type": "core",
>>   "id": "FBtr0307760",
>>   "is_canonical": 0,
>>   "assembly_name": "BDGP6",
>>   "end": 21377399,
>>   "object_type": "Transcript",
>>   "species": "drosophila_melanogaster",
>>   "biotype": "protein_coding",
>>   "strand": -1,
>>   "seq_region_name": "3R",
>>   "start": 21375060,
>>   "source": "FlyBase",
>>   "logic_name": "flybase"
>> }
>>
>> You can see that the source is 'FlyBase'.
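A minimal Python sketch of the lookup call above, in case it is handy for checking several transcripts at once. The helper names (lookup_url, fetch_lookup, summarize) are made up for this illustration and are not part of any Ensembl client library; only the URL pattern and the field names come from the example above. The live service follows the current Ensembl release, so the values returned today may differ from the record quoted here.

```python
import json
import urllib.request

# Base URL taken from the lookup example in this thread.
REST_SERVER = "http://rest.ensembl.org"


def lookup_url(stable_id, server=REST_SERVER):
    """Build the lookup URL for one stable ID (hypothetical helper)."""
    return f"{server}/lookup/id/{stable_id}?content-type=application/json"


def fetch_lookup(stable_id, server=REST_SERVER):
    """Fetch and decode the JSON record for one stable ID (needs network access)."""
    with urllib.request.urlopen(lookup_url(stable_id, server)) as response:
        return json.load(response)


def summarize(record):
    """Keep only the fields that say where a transcript annotation comes from."""
    keys = ("id", "Parent", "biotype", "source", "logic_name")
    return {key: record.get(key) for key in keys}


# Example usage (performs a network request, so it is left commented out):
# for tid in ("FBtr0307759", "FBtr0307760"):
#     print(summarize(fetch_lookup(tid)))
```

Keeping the URL building and the field selection separate from the network call makes it easy to check the logic against a saved response.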
>>
>> Please note that our REST service is currently running on Ensembl
>> release 94.
>> (http://rest.ensembl.org/info/software?content-type=application/json)
>>
>> Hope it helps.
>>
>> Thanks
>> Prem
>>
>> On 04/10/2018 11:54, Julien Wollbrett wrote:
>>> Hello,
>>>
>>> Thank you Premanand for this extremely useful answer.
>>>
>>> If I understand correctly, one fasta file is created using the
>>> *_patch_hapl_scaff.gtf.gz* file for mouse and human, or the *gtf.gz*
>>> file for other species. This file is then split by gene biotype
>>> to create the cdna, cds, ncrna, and pep fasta files.
>>> Do you know where I can find a list of all the biotypes used to
>>> assign transcripts to each fasta file?
>>>
>>> In my previous email I described that I found 9 transcripts
>>> (FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081, FBtr0084085,
>>> FBtr0084083, FBtr0084084, FBtr0084080, FBtr0084082) that are present
>>> in the *cdna.all.fa.gz* file but not in the *gtf.gz* file for D.
>>> melanogaster (release 93). All these transcripts are from the same
>>> gene (FBgn0002781).
>>> Do you know where the annotation of these transcripts comes from ?
>>>
>>> Thanks,
>>>
>>> Julien
>>>
>>> On 03.10.18 at 16:39, Premanand Achuthan wrote:
>>>>
>>>> Hi Julien
>>>>
>>>> Apologies for the delay. This might possibly answer your previous
>>>> and current questions.
>>>>
>>>> The split in the gtf file is based on the chromosomal regions.
>>>> Also please note that the “*Homo_sapiens.GRCh38.84.gtf*” file
>>>> includes all the top level chromosomal regions (1..22, X,Y, MT) and
>>>> also the scaffolds, as you can see below.
>>>>
>>>> cut -f1 Homo_sapiens.GRCh38.84.gtf | sort | uniq
>>>>
>>>> 1
>>>> 10
>>>> 11
>>>> 12
>>>> 13
>>>> 14
>>>> 15
>>>> 16
>>>> 17
>>>> 18
>>>> 19
>>>> 2
>>>> 20
>>>> 21
>>>> 22
>>>> 3
>>>> 4
>>>> 5
>>>> 6
>>>> 7
>>>> 8
>>>> 9
>>>> GL000008.2
>>>> GL000009.2
>>>> GL000194.1
>>>> GL000195.1
>>>> GL000205.2
>>>> GL000213.1
>>>> GL000216.2
>>>> GL000218.1
>>>> GL000219.1
>>>> GL000220.1
>>>> GL000224.1
>>>> GL000225.1
>>>> KI270442.1
>>>> KI270706.1
>>>> KI270707.1
>>>> KI270708.1
>>>> KI270711.1
>>>> KI270713.1
>>>> KI270714.1
>>>> KI270721.1
>>>> KI270722.1
>>>> KI270723.1
>>>> KI270724.1
>>>> KI270726.1
>>>> KI270727.1
>>>> KI270728.1
>>>> KI270731.1
>>>> KI270733.1
>>>> KI270734.1
>>>> KI270741.1
>>>> KI270743.1
>>>> KI270744.1
>>>> KI270750.1
>>>> KI270752.1
>>>> MT
>>>> X
>>>> Y
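For anyone without the shell tools to hand, here is a rough Python equivalent of the cut/sort/uniq command above. It is only a sketch: seq_region_counts is a name invented for this example, and it assumes the usual tab-separated GTF layout with the sequence region name in the first column. Unlike the shell pipeline, it skips header lines starting with "#" and also reports how many lines each region has.

```python
from collections import Counter


def seq_region_counts(lines):
    """Count GTF lines per sequence region (first tab-separated column)."""
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Example usage on an uncompressed GTF file:
# with open("Homo_sapiens.GRCh38.84.gtf") as handle:
#     for region in sorted(seq_region_counts(handle)):
#         print(region)
```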
>>>>
>>>> Also, please note that the '*Homo_sapiens.GRCh38.84.chr.gtf.gz*'
>>>> contains features from the primary assembly only.
>>>>
>>>> The '*Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz*' also
>>>> contains the features from haplotypes and patches (for human and
>>>> mouse only).
>>>>
>>>> More info about haplotype and patches:
>>>> https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
>>>>
>>>> - The split in the fasta files is based on the *biotype*, so we have
>>>> cdna, cds, ncrna, pep, etc.
>>>> (http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/)
>>>> If you are interested only in the primary assembly, then you should
>>>> use the chr.gtf file and ignore any accessions in the cdna fasta
>>>> that are not in the gtf, as the cdna fasta includes haplotypes.
>>>>
>>>> Hope it helps.
>>>>
>>>> Thanks
>>>> Prem
>>>>
>>>>
>>>>
>>>> On 03/10/2018 15:07, Julien Wollbrett wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> As I did not receive an answer to my previous email, I will try to
>>>>> describe my goals and questions in more detail.
>>>>>
>>>>> I generated my own transcriptome fasta file using the gtf from
>>>>> ensembl, the genome fasta file from ensembl (same release) and a
>>>>> tool like gtf_to_fasta (TopHat) or gffread (Cufflinks). I then
>>>>> compared the generated file with the cdna transcriptome fasta
>>>>> file available on the ensembl ftp. Unfortunately I found some
>>>>> differences between my transcriptome fasta file and the one
>>>>> provided by ensembl, so I tried to determine where these
>>>>> differences come from.
>>>>> I ran all my tests on different species (human, D.
>>>>> melanogaster, ...) and different releases (84 and 93).
>>>>>
>>>>> I used the approach described below to identify the differences:
>>>>> - take all transcript_ids from the ensembl transcriptome fasta file
>>>>> - take all transcript_ids from the ensembl gtf file
>>>>> - count the transcripts common to both files and the
>>>>> transcripts specific to each file
>>>>> - map each transcript_id to the gtf annotation and record the
>>>>> gene_biotype associated with the transcripts of each file.
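The first three steps of this approach can be sketched in Python roughly as follows. This is an illustration under assumptions, not the code that was actually run: the function names are invented here, the transcript ID is assumed to be the first token of each FASTA header, and the GTF is assumed to carry a transcript_id "..." attribute in its ninth column. FASTA headers may carry a version suffix (such as ".1") that the GTF stores separately, so stripping it is offered as an option; it is off by default because some species use IDs that legitimately contain dots.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def fasta_transcript_ids(lines, strip_version=False):
    """Collect the first token of every FASTA header line."""
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        tokens = line[1:].split()
        if not tokens:
            continue
        tid = tokens[0]
        if strip_version:
            # Assumption: the version is whatever follows the last dot.
            tid = tid.rsplit(".", 1)[0]
        ids.add(tid)
    return ids


def gtf_transcript_ids(lines):
    """Collect every transcript_id found in the GTF attribute column."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_ids(fasta_ids, gtf_ids):
    """Split the IDs into shared, gtf-only and fasta-only sets."""
    return {
        "common": fasta_ids & gtf_ids,
        "gtf_only": gtf_ids - fasta_ids,
        "fasta_only": fasta_ids - gtf_ids,
    }


# Example usage with the gzip-compressed files from the ftp site:
# import gzip
# with gzip.open("Drosophila_melanogaster.BDGP6.cdna.all.fa.gz", "rt") as fa, \
#         gzip.open("Drosophila_melanogaster.BDGP6.93.gtf.gz", "rt") as gtf:
#     result = compare_ids(fasta_transcript_ids(fa), gtf_transcript_ids(gtf))
# print({name: len(ids) for name, ids in result.items()})
```

Both readers take any iterable of text lines, so they work the same on plain files, gzip handles, or small in-memory lists used for checking.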
>>>>>
>>>>> *results for human release 84:*
>>>>>
>>>>> gtf file :
>>>>> http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
>>>>> fasta file :
>>>>> http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
>>>>>
>>>>> transcripts common in both files : 161150
>>>>> transcripts present only in gtf : 38034
>>>>> transcripts present only in fasta file : 15091
>>>>> number of different gene biotypes for transcripts present in gtf: 44
>>>>> number of different gene biotypes for transcripts in fasta file : 23
>>>>> list of biotypes present only in gtf and their count :
>>>>>
>>>>> gene_biotype                    freq
>>>>> 3prime_overlapping_ncrna          32
>>>>> antisense                      10183
>>>>> bidirectional_promoter_lncrna      5
>>>>> lincRNA                        12648
>>>>> macro_lncRNA                       1
>>>>> miRNA                           4198
>>>>> misc_RNA                        2306
>>>>> Mt_rRNA                            2
>>>>> Mt_tRNA                           22
>>>>> non_coding                         3
>>>>> processed_transcript            2760
>>>>> ribozyme                           8
>>>>> rRNA                             549
>>>>> scaRNA                            49
>>>>> sense_intronic                   978
>>>>> sense_overlapping                334
>>>>> snoRNA                           961
>>>>> snRNA                           1905
>>>>> sRNA                              20
>>>>> TEC                             1069
>>>>> vaultRNA                           1
>>>>>
>>>>> All 38034 transcripts present only in the gtf have a gene_biotype
>>>>> that is absent from the ensembl transcriptome fasta file.
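A biotype tally like the one above can be reproduced with a short Python sketch along these lines. The function names are invented for this example, and it assumes the GTF has "transcript" feature lines whose attribute column contains transcript_id and gene_biotype as key "value" pairs; transcripts without a gene_biotype attribute are counted as "unknown".

```python
import re
from collections import Counter

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def biotypes_by_transcript(gtf_lines):
    """Map each transcript_id to its gene_biotype using 'transcript' lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "transcript_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs.get("gene_biotype", "unknown")
    return mapping


def tally_biotypes(transcript_ids, mapping):
    """Count gene biotypes over a set of transcript IDs found in the mapping."""
    return Counter(mapping[tid] for tid in transcript_ids if tid in mapping)


# Example usage, where gtf_only is the set of IDs found only in the gtf:
# import gzip
# with gzip.open("Homo_sapiens.GRCh38.84.gtf.gz", "rt") as handle:
#     mapping = biotypes_by_transcript(handle)
# for biotype, freq in sorted(tally_biotypes(gtf_only, mapping).items()):
#     print(biotype, freq)
```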
>>>>>
>>>>> *results for D. melanogaster release 93:*
>>>>>
>>>>> gtf file :
>>>>> http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
>>>>> fasta file :
>>>>> http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
>>>>>
>>>>> transcripts in both files : 30819
>>>>> transcripts present only in gtf : 3948
>>>>> transcripts present only in fasta : 9
>>>>> number of different gene biotypes for transcripts present in gtf: 8
>>>>> number of different gene biotypes for transcripts present in fasta
>>>>> file : 2
>>>>> list of biotypes present only in gtf and their count :
>>>>>
>>>>> gene_biotype   freq
>>>>> ncRNA          2941
>>>>> pre_miRNA       259
>>>>> rRNA            115
>>>>> snoRNA          289
>>>>> snRNA            32
>>>>> tRNA            312
>>>>>
>>>>> All 3948 transcripts present only in the gtf have a gene_biotype
>>>>> that is absent from the ensembl transcriptome fasta file.
>>>>>
>>>>>
>>>>> Could someone please explain to me:
>>>>>
>>>>> 1. Why are all the transcripts with these gene biotypes
>>>>> removed during the creation of the transcriptome?
>>>>> 2. Where do the transcripts present in the transcriptome fasta
>>>>> file but not in the gtf file (15091 in human, 9 in D.
>>>>> melanogaster) come from?
>>>>> 3. How is the cdna transcriptome fasta file generated?
>>>>> 4. Should I generate my own transcriptome fasta file or use
>>>>> the ensembl cdna fasta file?
>>>>>
>>>>> Sorry for such a long email.... and thank you for your answers.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>>
>>>>> Julien Wollbrett
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>
>>
>