[ensembl-dev] How transcriptome fasta files are created 2.0

Premanand Achuthan prem at ebi.ac.uk
Thu Oct 4 15:10:16 BST 2018


Hi Julien

I have checked and can confirm that those transcripts are missing from the GTF file in the 
current release as well.

One thing that all 9 missing transcripts have in common is 
that they are "trans-spliced" transcripts.

http://www.ensembl.org/Drosophila_melanogaster/Transcript/Summary?db=core;g=FBgn0002781;r=3R:21375060-21377399;t=FBtr0307759

Scroll down to the end of the page.

“Trans-spliced  This is a trans-spliced transcript”. (A single RNA 
transcript derived from multiple precursor mRNAs)

One possible explanation is that, because it is difficult to represent 
trans-spliced transcripts (a single transcript with multiple parents) in 
the standard GTF file format, they might have been skipped.

However, I have added a Jira ticket to look into them in detail.

Thanks
Prem

On 04/10/2018 13:20, Julien Wollbrett wrote:
> Hi Premanand,
>
> By "where the annotation of these transcripts comes from" I meant: 
> why are these transcripts absent from the gtf but present in 
> the cdna.all.fa?
> I was expecting that all transcripts of the fasta file would be 
> present in the gtf file.
>
> Julien
>
>
> On 04.10.18 at 13:19, Premanand Achuthan wrote:
>>
>> Thanks Julien,
>>
>> To get more info about the biotypes, use our REST endpoints and play 
>> with the parameters for filtering.
>>
>> http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json 
>> (list of available biotype groups)
>>
>> http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json 
>> (list within the coding group)
>>
>> http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json 
>> (list within the coding group for gene object type)
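
The three URLs above follow one pattern, so they can be built programmatically. The following is a minimal sketch in Python using only the standard library. The helper names `biotype_groups_url` and `fetch_json` are illustrative (they are not part of any Ensembl client), and the path layout is inferred only from the three examples quoted above.

```python
import json
from urllib.request import urlopen

REST_SERVER = "http://rest.ensembl.org"  # server used in the URLs above


def biotype_groups_url(group=None, object_type=None):
    """Build an info/biotypes/groups URL shaped like the three examples above.

    group and object_type are optional path segments, e.g. "coding" and "gene".
    object_type is only used when a group is given.
    """
    segments = [group] if group else []
    if group and object_type:
        segments.append(object_type)
    path = "/".join(segments)
    return f"{REST_SERVER}/info/biotypes/groups/{path}?content-type=application/json"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)
```

For example, `fetch_json(biotype_groups_url("coding", "gene"))` would request the third URL listed above.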
>>
>> To get more info about the transcripts, use our lookup endpoint.
>>
>> eg:
>>
>> http://rest.ensembl.org/lookup/id/FBtr0307760?content-type=application/json
>>
>> {
>>   "Parent": "FBgn0002781",
>>   "display_name": "mod(mdg4)-RAE",
>>   "db_type": "core",
>>   "id": "FBtr0307760",
>>   "is_canonical": 0,
>>   "assembly_name": "BDGP6",
>>   "end": 21377399,
>>   "object_type": "Transcript",
>>   "species": "drosophila_melanogaster",
>>   "biotype": "protein_coding",
>>   "strand": -1,
>>   "seq_region_name": "3R",
>>   "start": 21375060,
>>   "source": "FlyBase",
>>   "logic_name": "flybase"
>> }
>>
>> You can see that the source is 'FlyBase'.
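
To read the provenance fields out of such a lookup response in a script, something like the following sketch may help. It parses an abridged copy of the response quoted above. The function name `summarise_lookup` and the choice of fields are illustrative assumptions, not an Ensembl convention.

```python
import json

# Abridged copy of the lookup response quoted above.
response_text = """
{
  "id": "FBtr0307760",
  "Parent": "FBgn0002781",
  "object_type": "Transcript",
  "biotype": "protein_coding",
  "source": "FlyBase",
  "logic_name": "flybase"
}
"""


def summarise_lookup(record):
    """Pick out the fields that say what a feature is and where it came from."""
    wanted = ("id", "Parent", "biotype", "source", "logic_name")
    # .get() returns None for a missing key instead of raising KeyError.
    return {key: record.get(key) for key in wanted}


summary = summarise_lookup(json.loads(response_text))
print(summary["id"], summary["biotype"], summary["source"])
```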
>>
>> Please note that our REST service is currently running on Ensembl 
>> release 94. 
>> (http://rest.ensembl.org/info/software?content-type=application/json)
>>
>> Hope it helps.
>>
>> Thanks
>> Prem
>>
>> On 04/10/2018 11:54, Julien Wollbrett wrote:
>>> Hello,
>>>
>>> Thank you Premanand for this extremely useful answer.
>>>
>>> If I understand correctly, one fasta file is created using the 
>>> *_patch_hapl_scaff.gtf.gz* file for mouse and human or the *gtf.gz* 
>>> file for other species. This file is then split using gene 
>>> biotypes to create the cdna, cds, ncrna, and pep fasta files.
>>> Do you know where I can find a list of all biotypes used to assign 
>>> transcripts to each fasta file?
>>>
>>> In my previous email I described that I found 9 transcripts 
>>> (FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081, FBtr0084085, 
>>> FBtr0084083, FBtr0084084, FBtr0084080, FBtr0084082) that are present 
>>> in the *cdna.all.fa.gz* file but not in the *gtf.gz* file for D. 
>>> melanogaster (release 93). All these transcripts are from the same 
>>> gene (FBgn0002781).
>>> Do you know where the annotation of these transcripts comes from ?
>>>
>>> Thanks,
>>>
>>> Julien
>>>
>>> On 03.10.18 at 16:39, Premanand Achuthan wrote:
>>>>
>>>> Hi Julien
>>>>
>>>> Apologies for the delay. This might possibly answer your previous 
>>>> and current questions.
>>>>
>>>> The split in the gtf file is based on the chromosomal regions.  
>>>> Also please note that the “*Homo_sapiens.GRCh38.84.gtf*” file 
>>>> includes all the top level chromosomal regions (1..22, X,Y, MT) and 
>>>> also the scaffolds, as you can see below.
>>>>
>>>> cut -f1 Homo_sapiens.GRCh38.84.gtf | sort | uniq
>>>>
>>>> 1
>>>> 10
>>>> 11
>>>> 12
>>>> 13
>>>> 14
>>>> 15
>>>> 16
>>>> 17
>>>> 18
>>>> 19
>>>> 2
>>>> 20
>>>> 21
>>>> 22
>>>> 3
>>>> 4
>>>> 5
>>>> 6
>>>> 7
>>>> 8
>>>> 9
>>>> GL000008.2
>>>> GL000009.2
>>>> GL000194.1
>>>> GL000195.1
>>>> GL000205.2
>>>> GL000213.1
>>>> GL000216.2
>>>> GL000218.1
>>>> GL000219.1
>>>> GL000220.1
>>>> GL000224.1
>>>> GL000225.1
>>>> KI270442.1
>>>> KI270706.1
>>>> KI270707.1
>>>> KI270708.1
>>>> KI270711.1
>>>> KI270713.1
>>>> KI270714.1
>>>> KI270721.1
>>>> KI270722.1
>>>> KI270723.1
>>>> KI270724.1
>>>> KI270726.1
>>>> KI270727.1
>>>> KI270728.1
>>>> KI270731.1
>>>> KI270733.1
>>>> KI270734.1
>>>> KI270741.1
>>>> KI270743.1
>>>> KI270744.1
>>>> KI270750.1
>>>> KI270752.1
>>>> MT
>>>> X
>>>> Y
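
The same listing can be produced in Python, which also makes it easy to skip the `#` comment lines at the top of a GTF file. This is a minimal sketch; the function name `sequence_names` is illustrative, and plain string sorting is assumed to be an acceptable stand-in for `sort`.

```python
def sequence_names(gtf_lines):
    """Return the sorted distinct values of GTF column 1.

    Comment lines (starting with '#') and blank lines are skipped.
    """
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        # Column 1 is everything before the first tab.
        names.add(line.split("\t", 1)[0])
    return sorted(names)


# Usage with a gzipped file (the path is only an example):
#   import gzip
#   with gzip.open("Homo_sapiens.GRCh38.84.gtf.gz", "rt") as handle:
#       print("\n".join(sequence_names(handle)))
```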
>>>>
>>>> Also, please note that the '*Homo_sapiens.GRCh38.84.chr.gtf.gz*' 
>>>> contains features from the primary assembly only.
>>>>
>>>> The '*Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz*' also 
>>>> contains the features from haplotypes and patches (for human and 
>>>> mouse only).
>>>>
>>>> More info about haplotype and patches:
>>>> https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
>>>>
>>>> - The split in the fasta file is based on the *biotype*, so we have 
>>>> cdna, cds, ncrna, pep, etc. 
>>>> (http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/)
>>>> If you are interested only in the primary assembly, then you should 
>>>> use the chr.gtf file and ignore any accessions in the cdna fasta 
>>>> that are not in the gtf, as the cdna fasta includes haplotypes.
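
Ignoring the accessions that are absent from the gtf can be done with a small filter over the FASTA file. The sketch below is one possible way, not an Ensembl tool. It assumes that the record ID is the first token of each header line and that a trailing `.N` version suffix should be dropped before comparing with gtf transcript_ids. Both assumptions should be checked against the actual files, and the version stripping is unsuitable for species whose stable IDs themselves contain dots.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID is in keep_ids.

    The ID is taken as the first token after '>' with everything from the
    first '.' onwards removed (assumed to be a version suffix).
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0].split(".")[0] if tokens else ""
            keep = record_id in keep_ids
        if keep:
            yield line
```

Here `keep_ids` would be the set of transcript_ids collected from the chr.gtf file.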
>>>>
>>>> Hope it helps.
>>>>
>>>> Thanks
>>>> Prem
>>>>
>>>>
>>>>
>>>> On 03/10/2018 15:07, Julien Wollbrett wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> As I did not receive an answer to my previous email, I will try to 
>>>>> describe my goals and questions in a bit more detail.
>>>>>
>>>>> I generated my own transcriptome fasta file using gtf from 
>>>>> ensembl, genome fasta file from ensembl (same release) and a tool 
>>>>> like gtf_to_fasta (TopHat) or gffread (Cufflinks). Once I did that 
>>>>> I compared the generated file with the cdna transcriptome fasta 
>>>>> file available at ensembl ftp. Unfortunately I found some 
>>>>> differences between my transcriptome fasta file and the one 
>>>>> provided by ensembl. That is why I tried to determine the origin 
>>>>> of these differences.
>>>>> All my tests have been run on different species (human, D. 
>>>>> melanogaster, ...) and different releases (84 and 93)
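
For readers unfamiliar with what building a transcriptome from a gtf and a genome involves, the core step can be sketched as follows. This is a simplified illustration of the general idea and not the actual implementation of gtf_to_fasta or gffread. It assumes 1-based inclusive exon coordinates as in GTF, and that all exons of a transcript lie on one sequence and one strand. The function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(genome, seq_name, strand, exons):
    """Join exon sequences into one transcript sequence.

    genome: dict mapping sequence name -> sequence string
    exons:  iterable of (start, end) pairs, 1-based inclusive
    strand: "+" or "-"
    """
    chrom = genome[seq_name]
    # Cut each exon out in genomic order; start - 1 converts to 0-based slicing.
    pieces = [chrom[start - 1:end] for start, end in sorted(exons)]
    joined = "".join(pieces)
    # Reverse-strand transcripts are reverse-complemented once, after joining.
    return reverse_complement(joined) if strand == "-" else joined
```

A transcript whose exons span more than one strand or sequence does not fit this simple model.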
>>>>>
>>>>> I used the approach described below to identify the differences:
>>>>> - take all transcript_ids from the ensembl transcriptome fasta file
>>>>> - take all transcript_ids from the ensembl gtf file
>>>>> - count the transcripts common to both files and the 
>>>>> transcripts specific to each file
>>>>> - map each transcript_id to the gtf annotation and find the gene_biotype 
>>>>> associated with the transcripts of each file.
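
The first three steps of this approach can be sketched with Python sets. This is an illustrative reconstruction, not the script actually used in this thread. It assumes that gtf attributes contain `transcript_id "..."` and that the FASTA record ID is the first header token, with any `.N` version suffix removed so that it matches the gtf.

```python
import re

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect every transcript_id mentioned in a GTF (comment lines skipped)."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = _TRANSCRIPT_ID.search(line)
        if match:  # gene rows carry no transcript_id and are skipped
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(fasta_lines):
    """Collect record IDs from FASTA headers, dropping a '.N' version suffix."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0].split(".")[0])
    return ids


def compare(gtf_ids, fasta_ids):
    """Split the IDs into common, gtf-only and fasta-only sets."""
    return {
        "common": gtf_ids & fasta_ids,
        "gtf_only": gtf_ids - fasta_ids,
        "fasta_only": fasta_ids - gtf_ids,
    }
```

The sizes of the three sets returned by `compare` correspond to the three transcript counts reported per species.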
>>>>>
>>>>> *results for human release 84:*
>>>>>
>>>>> gtf file : 
>>>>> http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
>>>>> fasta file : 
>>>>> http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
>>>>>
>>>>> transcripts common in both files : 161150
>>>>> transcripts present only in gtf : 38034
>>>>> transcripts present only in fasta file : 15091
>>>>> number of different gene biotypes for transcripts present in gtf: 44
>>>>> number of different gene biotypes for transcripts in fasta file : 23
>>>>> list of biotypes present only in gtf and their count :
>>>>>
>>>>> gene_biotype                    freq
>>>>> 3prime_overlapping_ncrna          32
>>>>> antisense                      10183
>>>>> bidirectional_promoter_lncrna      5
>>>>> lincRNA                        12648
>>>>> macro_lncRNA                       1
>>>>> miRNA                           4198
>>>>> misc_RNA                        2306
>>>>> Mt_rRNA                            2
>>>>> Mt_tRNA                           22
>>>>> non_coding                         3
>>>>> processed_transcript            2760
>>>>> ribozyme                           8
>>>>> rRNA                             549
>>>>> scaRNA                            49
>>>>> sense_intronic                   978
>>>>> sense_overlapping                334
>>>>> snoRNA                           961
>>>>> snRNA                           1905
>>>>> sRNA                              20
>>>>> TEC                             1069
>>>>> vaultRNA                           1
>>>>>
>>>>> All 38034 transcripts present only in the gtf have a gene_biotype 
>>>>> that does not occur in the ensembl cdna fasta file.
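
A per-biotype tally like the one above can be produced by counting the gene_biotype attribute over the gtf rows of the transcripts of interest. The following sketch is an assumption about how such a table could be built, not the script used here. It assumes that each transcript has one row with feature type `transcript` and that attributes are written as `key "value";` pairs.

```python
import re
from collections import Counter

_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def biotype_counts(gtf_lines, transcript_ids):
    """Count gene_biotype over 'transcript' rows whose ID is in transcript_ids."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 is the feature type, column 9 holds the attributes.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        if attributes.get("transcript_id") in transcript_ids:
            counts[attributes.get("gene_biotype", "unknown")] += 1
    return counts
```

Passing the set of gtf-only transcript IDs as `transcript_ids` gives one count per gene_biotype.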
>>>>>
>>>>> *results for D. melanogaster release 93:*
>>>>>
>>>>> gtf file : 
>>>>> http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
>>>>> fasta file : 
>>>>> http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
>>>>>
>>>>> transcripts in both files : 30819
>>>>> transcripts present only in gtf : 3948
>>>>> transcripts present only in fasta : 9
>>>>> number of different gene biotypes for transcripts present in gtf: 8
>>>>> number of different gene biotypes for transcripts present in fasta 
>>>>> file : 2
>>>>> list of biotypes present only in gtf and their count :
>>>>>
>>>>> gene_biotype  freq
>>>>> ncRNA         2941
>>>>> pre_miRNA      259
>>>>> rRNA           115
>>>>> snoRNA         289
>>>>> snRNA           32
>>>>> tRNA           312
>>>>>
>>>>> All 3948 transcripts present only in the gtf have a gene_biotype 
>>>>> that does not occur in the ensembl cdna fasta file.
>>>>>
>>>>>
>>>>> Could someone please explain to me:
>>>>>
>>>>>     1. Why are all the transcripts with these gene biotypes 
>>>>> removed during the creation of the transcriptome?
>>>>>     2. Where do the transcripts present in the transcriptome fasta 
>>>>> file but not in the gtf file (15091 in human, 9 in D. 
>>>>> melanogaster) come from?
>>>>>     3. How is the cdna transcriptome fasta file generated?
>>>>>     4. Should I generate my own transcriptome fasta file or use 
>>>>> the ensembl cdna fasta file?
>>>>>
>>>>> Sorry for such a long email.... and thank you for your answers.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>>
>>>>> Julien Wollbrett
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list    Dev at ensembl.org
>>>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>>>> Ensembl Blog: http://www.ensembl.info/
>>>>
>>>
>>
>
