[ensembl-dev] How transcriptome fasta files are created 2.0

Premanand Achuthan prem at ebi.ac.uk
Thu Oct 4 12:18:32 BST 2018


Thanks Julien,

To get more info about the biotypes, use our REST endpoints and adjust 
the path parameters (group and object type) to filter the results.

http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json 
(list of available biotype groups)

http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json 
(list within the coding group)

http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json 
(list within the coding group for gene object type)
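
As a rough illustration of how the three URLs above are composed, here is a minimal Python sketch. The function names `biotype_groups_url` and `fetch_json` are hypothetical helpers made up for this example, not part of any Ensembl client; only the server name and paths come from the URLs above.

```python
import json
import urllib.request

REST_SERVER = "http://rest.ensembl.org"  # server used in the URLs above


def biotype_groups_url(group=None, object_type=None):
    """Build an /info/biotypes/groups URL (hypothetical helper).

    With no arguments it lists the available groups; `group` narrows the
    listing to one group and `object_type` (e.g. "gene") narrows it further.
    """
    if object_type and not group:
        raise ValueError("object_type can only be given together with a group")
    path = "/info/biotypes/groups/"
    if group:
        path += group
    if object_type:
        path += "/" + object_type
    return REST_SERVER + path + "?content-type=application/json"


def fetch_json(url):
    """Download a URL and decode its JSON body (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

Calling `fetch_json(biotype_groups_url("coding", "gene"))` would request the third URL above; the shape of the returned data is whatever the REST server sends, so inspect it before relying on specific fields.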

To get more info about the transcripts, use our lookup endpoint.

eg:

http://rest.ensembl.org/lookup/id/FBtr0307760?content-type=application/json

{
  "Parent": "FBgn0002781",
  "display_name": "mod(mdg4)-RAE",
  "db_type": "core",
  "id": "FBtr0307760",
  "is_canonical": 0,
  "assembly_name": "BDGP6",
  "end": 21377399,
  "object_type": "Transcript",
  "species": "drosophila_melanogaster",
  "biotype": "protein_coding",
  "strand": -1,
  "seq_region_name": "3R",
  "start": 21375060,
  "source": "FlyBase",
  "logic_name": "flybase"
}

You can see that the source of this transcript is 'FlyBase'.
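
A minimal Python sketch of reading those fields programmatically follows. It parses a subset of the response shown above instead of contacting the server; the helper name `annotation_origin` is hypothetical and only for illustration.

```python
import json

# Subset of the lookup response shown above for FBtr0307760.
SAMPLE_RESPONSE = """
{
  "Parent": "FBgn0002781",
  "id": "FBtr0307760",
  "object_type": "Transcript",
  "biotype": "protein_coding",
  "source": "FlyBase",
  "logic_name": "flybase"
}
"""


def annotation_origin(record):
    """Return (transcript id, parent gene, biotype, source) from a lookup record.

    Hypothetical helper: 'Parent' and 'source' are read with .get() in case
    a record lacks them.
    """
    return (
        record["id"],
        record.get("Parent"),
        record["biotype"],
        record.get("source"),
    )


record = json.loads(SAMPLE_RESPONSE)
print(annotation_origin(record))
# ('FBtr0307760', 'FBgn0002781', 'protein_coding', 'FlyBase')
```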

Please note that our REST service currently runs on Ensembl 
release 94, so its results may differ from the release 93 files. 
(http://rest.ensembl.org/info/software?content-type=application/json)
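
For the file comparison described in the quoted message further down (transcript IDs in the cdna fasta versus the gtf, plus the gene_biotype of the gtf-only transcripts), a minimal Python sketch could look like the one below. All function names are hypothetical. It assumes the transcript ID is the first word of each fasta header and that gtf attributes are written as `transcript_id "..."` and `gene_biotype "..."`. Some fasta headers carry a version suffix such as `.1` after the ID; `strip_version` is an optional assumption for that case and should stay off for species whose stable IDs contain dots.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GENE_BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def fasta_transcript_ids(lines, strip_version=False):
    """Collect the first word of every fasta header line (lines starting with '>')."""
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        identifier = fields[0]
        if strip_version:
            # Assumption: everything after the first dot is a version number.
            identifier = identifier.split(".")[0]
        ids.add(identifier)
    return ids


def gtf_transcript_biotypes(lines):
    """Map transcript_id -> gene_biotype (or None) from gtf lines."""
    biotypes = {}
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        transcript = TRANSCRIPT_ID.search(columns[8])
        if not transcript:  # e.g. gene lines have no transcript_id
            continue
        biotype = GENE_BIOTYPE.search(columns[8])
        key = transcript.group(1)
        if biotypes.get(key) is None:
            biotypes[key] = biotype.group(1) if biotype else None
    return biotypes


def compare_ids(fasta_ids, gtf_biotypes):
    """Split IDs into common / gtf-only / fasta-only and count gtf-only biotypes."""
    gtf_ids = set(gtf_biotypes)
    gtf_only = gtf_ids - fasta_ids
    return {
        "common": fasta_ids & gtf_ids,
        "gtf_only": gtf_only,
        "fasta_only": fasta_ids - gtf_ids,
        "gtf_only_biotypes": Counter(gtf_biotypes[t] for t in gtf_only),
    }
```

Both parsers accept any iterable of text lines, so the compressed files can be passed directly, for example `gzip.open(path, "rt")` from the standard library.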

Hope it helps.

Thanks
Prem

On 04/10/2018 11:54, Julien Wollbrett wrote:
> Hello,
>
> Thank you, Premanand, for this extremely useful answer.
>
> If I understand correctly, one fasta file is created using the 
> *_patch_hapl_scaff.gtf.gz* file for mouse and human or the *gtf.gz* 
> file for other species. This file is then split by gene biotype 
> to create the cdna, cds, ncrna, and pep fasta files.
> Do you know where I can find a list of the biotypes assigned to 
> each fasta file?
>
> In my previous email I described that I found 9 transcripts 
> (FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081, FBtr0084085, 
> FBtr0084083, FBtr0084084, FBtr0084080, FBtr0084082) that are present 
> in the *cdna.all.fa.gz* file but not in the *gtf.gz* file for D. 
> melanogaster (release 93). All these transcripts are from the same 
> gene (FBgn0002781).
> Do you know where the annotation of these transcripts comes from ?
>
> Thanks,
>
> Julien
>
> On 03.10.18 at 16:39, Premanand Achuthan wrote:
>>
>> Hi Julien
>>
>> Apologies for the delay. This might possibly answer your previous and 
>> current questions.
>>
>> The split in the gtf file is based on the chromosomal regions.  Also 
>> please note that the “*Homo_sapiens.GRCh38.84.gtf*” file includes all 
>> the top level chromosomal regions (1..22, X,Y, MT) and also the 
>> scaffolds, as you can see below.
>>
>> grep -v '^#' Homo_sapiens.GRCh38.84.gtf | cut -f1 | sort -u
>>
>> 1
>> 10
>> 11
>> 12
>> 13
>> 14
>> 15
>> 16
>> 17
>> 18
>> 19
>> 2
>> 20
>> 21
>> 22
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> GL000008.2
>> GL000009.2
>> GL000194.1
>> GL000195.1
>> GL000205.2
>> GL000213.1
>> GL000216.2
>> GL000218.1
>> GL000219.1
>> GL000220.1
>> GL000224.1
>> GL000225.1
>> KI270442.1
>> KI270706.1
>> KI270707.1
>> KI270708.1
>> KI270711.1
>> KI270713.1
>> KI270714.1
>> KI270721.1
>> KI270722.1
>> KI270723.1
>> KI270724.1
>> KI270726.1
>> KI270727.1
>> KI270728.1
>> KI270731.1
>> KI270733.1
>> KI270734.1
>> KI270741.1
>> KI270743.1
>> KI270744.1
>> KI270750.1
>> KI270752.1
>> MT
>> X
>> Y
>>
>> Also, please note that the '*Homo_sapiens.GRCh38.84.chr.gtf.gz*' 
>> contains features only from the primary assembly.
>>
>> The '*Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz*' 
>> additionally contains the features from haplotypes and patches (for 
>> human and mouse only).
>>
>> More info about haplotype and patches:
>> https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
>>
>> - The split in the fasta files is based on the *biotype*, so we have 
>> cdna, cds, ncrna, pep, etc. 
>> (http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/)
>> If you are interested only in the primary assembly, then you should 
>> use the chr.gtf file and ignore any accessions in the cdna fasta that 
>> are not in the gtf, as the cdna fasta includes haplotypes.
>>
>> Hope it helps.
>>
>> Thanks
>> Prem
>>
>>
>>
>> On 03/10/2018 15:07, Julien Wollbrett wrote:
>>>
>>> Hello,
>>>
>>> As I did not receive an answer to my previous email, I will try to 
>>> describe my goals and questions in more detail.
>>>
>>> I generated my own transcriptome fasta file using gtf from ensembl, 
>>> genome fasta file from ensembl (same release) and a tool like 
>>> gtf_to_fasta (TopHat) or gffread (Cufflinks). Once I did that I 
>>> compared the generated file with the cdna transcriptome fasta file 
>>> available at ensembl ftp. Unfortunately I found some differences 
>>> between my transcriptome fasta file and the one provided by ensembl. 
>>> That is why I tried to determine the origin of these differences.
>>> All my tests have been run on different species (human, D. 
>>> melanogaster, ...) and different releases (84 and 93)
>>>
>>> I used the following approach to identify the differences:
>>> - take all transcript_ids from the Ensembl transcriptome fasta file
>>> - take all transcript_ids from the Ensembl gtf file
>>> - count the transcripts common to both files and the transcripts 
>>> specific to each file
>>> - map each transcript_id to the gtf annotation and record the 
>>> gene_biotype associated with the transcripts of each file.
>>>
>>> *results for human release 84:*
>>>
>>> gtf file : 
>>> http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
>>> fasta file : 
>>> http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
>>>
>>> transcripts common in both files : 161150
>>> transcripts present only in gtf : 38034
>>> transcripts present only in fasta file : 15091
>>> number of different gene biotypes for transcripts present in gtf: 44
>>> number of different gene biotypes for transcripts in fasta file : 23
>>> list of biotypes present only in gtf and their count :
>>>
>>> gene_biotype                    freq
>>> 3prime_overlapping_ncrna          32
>>> antisense                      10183
>>> bidirectional_promoter_lncrna      5
>>> lincRNA                        12648
>>> macro_lncRNA                       1
>>> miRNA                           4198
>>> misc_RNA                        2306
>>> Mt_rRNA                            2
>>> Mt_tRNA                           22
>>> non_coding                         3
>>> processed_transcript            2760
>>> ribozyme                           8
>>> rRNA                             549
>>> scaRNA                            49
>>> sense_intronic                   978
>>> sense_overlapping                334
>>> snoRNA                           961
>>> snRNA                           1905
>>> sRNA                              20
>>> TEC                             1069
>>> vaultRNA                           1
>>>
>>> All 38034 transcripts present only in the gtf have a gene_biotype 
>>> that does not occur in the Ensembl cdna fasta file.
>>>
>>> *results for D. melanogaster release 93:*
>>>
>>> gtf file : 
>>> http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
>>> fasta file : 
>>> http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
>>>
>>> transcripts in both files : 30819
>>> transcripts present only in gtf : 3948
>>> transcripts present only in fasta : 9
>>> number of different gene biotypes for transcripts present in gtf: 8
>>> number of different gene biotypes for transcripts present in fasta 
>>> file : 2
>>> list of biotypes present only in gtf and their count :
>>>
>>> gene_biotype  freq
>>> ncRNA         2941
>>> pre_miRNA      259
>>> rRNA           115
>>> snoRNA         289
>>> snRNA           32
>>> tRNA           312
>>>
>>> All 3948 transcripts present only in the gtf have a gene_biotype 
>>> that does not occur in the Ensembl cdna fasta file.
>>>
>>>
>>> Could someone please explain to me :
>>>
>>>     1. Why are all the transcripts with these gene biotypes removed 
>>> during the creation of the transcriptome?
>>>     2. Where do the transcripts present in the transcriptome fasta 
>>> file but not in the gtf file (15091 in human, 9 in D. melanogaster) 
>>> come from?
>>>     3. How is the cdna transcriptome fasta file generated?
>>>     4. Should I generate my own transcriptome fasta file or use the 
>>> Ensembl cdna fasta file?
>>>
>>> Sorry for such a long email.... and thank you for your answers.
>>>
>>> Best Regards,
>>>
>>>
>>> Julien Wollbrett
>>>
>>> _______________________________________________
>>> Dev mailing list    Dev at ensembl.org
>>> Posting guidelines and subscribe/unsubscribe info: http://lists.ensembl.org/mailman/listinfo/dev
>>> Ensembl Blog: http://www.ensembl.info/
>>
>
