[ensembl-dev] How transcriptome fasta files are created 2.0
Premanand Achuthan
prem at ebi.ac.uk
Thu Oct 4 12:18:32 BST 2018
Thanks Julien,
To get more information about the biotypes, use our REST endpoints and
experiment with the filtering parameters.
http://rest.ensembl.org/info/biotypes/groups/?content-type=application/json
(list of available biotype groups)
http://rest.ensembl.org/info/biotypes/groups/coding?content-type=application/json
(list within the coding group)
http://rest.ensembl.org/info/biotypes/groups/coding/gene?content-type=application/json
(list within the coding group for gene object type)
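
The three URLs above follow one pattern, so they can be built and queried from a script. Below is a minimal Python sketch, not an official Ensembl client. The helper names `biotype_groups_url` and `fetch_json` are hypothetical, and nothing is assumed about the structure of the returned JSON beyond it being valid JSON.

```python
import json
import urllib.request

REST_SERVER = "http://rest.ensembl.org"  # server used in the URLs above


def biotype_groups_url(group=None, object_type=None):
    """Build an info/biotypes/groups URL (hypothetical helper).

    group: a biotype group such as "coding".
    object_type: an object type such as "gene"; only valid with a group.
    """
    if object_type and not group:
        raise ValueError("object_type requires a group")
    path = "/".join(part for part in (group, object_type) if part)
    return f"{REST_SERVER}/info/biotypes/groups/{path}?content-type=application/json"


def fetch_json(url):
    """Fetch a URL and decode its JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# The three URLs listed above, rebuilt from their parts.
for args in [(), ("coding",), ("coding", "gene")]:
    print(biotype_groups_url(*args))

# Live query (needs network access):
# print(json.dumps(fetch_json(biotype_groups_url("coding", "gene")), indent=2))
```

Keeping the URL construction separate from the network call makes it easy to check the URLs before sending any request.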
To get more info about the transcripts, use our lookup endpoint.
eg:
http://rest.ensembl.org/lookup/id/FBtr0307760?content-type=application/json
{
  "Parent": "FBgn0002781",
  "display_name": "mod(mdg4)-RAE",
  "db_type": "core",
  "id": "FBtr0307760",
  "is_canonical": 0,
  "assembly_name": "BDGP6",
  "end": 21377399,
  "object_type": "Transcript",
  "species": "drosophila_melanogaster",
  "biotype": "protein_coding",
  "strand": -1,
  "seq_region_name": "3R",
  "start": 21375060,
  "source": "FlyBase",
  "logic_name": "flybase"
}
You can see that the source is 'FlyBase'.
Please note that our REST service is currently running on Ensembl
release 94.
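
To check the source of several transcripts, the same lookup can be scripted. This is a minimal sketch under stated assumptions: `lookup_url`, `fetch_lookup` and `summarise` are hypothetical helper names, and `SAMPLE` is a subset of the response shown above, embedded so that the example runs without network access.

```python
import json
import urllib.request

REST_SERVER = "http://rest.ensembl.org"


def lookup_url(stable_id):
    """URL of the lookup/id endpoint for one stable ID."""
    return f"{REST_SERVER}/lookup/id/{stable_id}?content-type=application/json"


def fetch_lookup(stable_id):
    """Query the endpoint (needs network access) and return the decoded JSON."""
    with urllib.request.urlopen(lookup_url(stable_id)) as response:
        return json.load(response)


def summarise(record):
    """Keep the fields that indicate where a transcript annotation comes from."""
    keys = ("id", "Parent", "biotype", "source", "logic_name")
    return {key: record.get(key) for key in keys}


# Subset of the response shown above, so the example runs offline.
SAMPLE = {
    "id": "FBtr0307760",
    "Parent": "FBgn0002781",
    "biotype": "protein_coding",
    "source": "FlyBase",
    "logic_name": "flybase",
    "species": "drosophila_melanogaster",
}

print(summarise(SAMPLE))

# Live query (needs network access):
# print(summarise(fetch_lookup("FBtr0307760")))
```

Looping `fetch_lookup` over the nine transcript IDs listed in the question would show whether they all report the same source.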
(http://rest.ensembl.org/info/software?content-type=application/json)
Hope it helps.
Thanks
Prem
On 04/10/2018 11:54, Julien Wollbrett wrote:
> Hello,
>
> Thank you Premanand for this extremely useful answer.
>
> If I understand correctly, one fasta file is created using the
> *_patch_hapl_scaff.gtf.gz* file for mouse and human or the *gtf.gz*
> file for other species. This file is then split by gene biotype
> to create the cdna, cds, ncrna, and pep fasta files.
> Do you know where I can find a list of all the biotypes used to assign
> transcripts to each fasta file?
>
> In my previous email I described that I found 9 transcripts
> (FBtr0307759, FBtr0084079, FBtr0307760, FBtr0084081, FBtr0084085,
> FBtr0084083, FBtr0084084, FBtr0084080, FBtr0084082) that are present
> in the *cdna.all.fa.gz* file but not in the *gtf.gz* file for D.
> melanogaster (release 93). All these transcripts are from the same
> gene (FBgn0002781).
> Do you know where the annotation of these transcripts comes from ?
>
> Thanks,
>
> Julien
>
> On 03.10.18 at 16:39, Premanand Achuthan wrote:
>>
>> Hi Julien
>>
>> Apologies for the delay. This may answer both your previous and
>> current questions.
>>
>> The split in the gtf file is based on the chromosomal regions. Also
>> please note that the “*Homo_sapiens.GRCh38.84.gtf*” file includes all
>> the top level chromosomal regions (1..22, X,Y, MT) and also the
>> scaffolds, as you can see below.
>>
>> cut -f1 Homo_sapiens.GRCh38.84.gtf | sort | uniq
>>
>> 1
>> 10
>> 11
>> 12
>> 13
>> 14
>> 15
>> 16
>> 17
>> 18
>> 19
>> 2
>> 20
>> 21
>> 22
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> GL000008.2
>> GL000009.2
>> GL000194.1
>> GL000195.1
>> GL000205.2
>> GL000213.1
>> GL000216.2
>> GL000218.1
>> GL000219.1
>> GL000220.1
>> GL000224.1
>> GL000225.1
>> KI270442.1
>> KI270706.1
>> KI270707.1
>> KI270708.1
>> KI270711.1
>> KI270713.1
>> KI270714.1
>> KI270721.1
>> KI270722.1
>> KI270723.1
>> KI270724.1
>> KI270726.1
>> KI270727.1
>> KI270728.1
>> KI270731.1
>> KI270733.1
>> KI270734.1
>> KI270741.1
>> KI270743.1
>> KI270744.1
>> KI270750.1
>> KI270752.1
>> MT
>> X
>> Y
>>
>> Also, please note that the '*Homo_sapiens.GRCh38.84.chr.gtf.gz*'
>> contains the features only from the primary assemblies
>>
>> The '*Homo_sapiens.GRCh38.84.chr_patch_hapl_scaff.gtf.gz*' also
>> contains the features from haplotypes and patches (for human and
>> mouse only).
>>
>> More info about haplotype and patches:
>> https://www.ensembl.org/info/genome/genebuild/haplotypes_patches.html
>>
>> - The split in the fasta file is based on the *biotype*. So we have
>> cdna, cds, ncrna, pep etc.,
>> (http://ftp.ensemblorg.ebi.ac.uk/pub/release-84/fasta/homo_sapiens/)
>> If you are interested only in the primary assembly, then you should
>> use the chr.gtf file and ignore any accessions in the cdna fasta that
>> are not in the gtf, as the cdna fasta includes haplotypes.
>>
>> Hope it helps.
>>
>> Thanks
>> Prem
>>
>>
>>
>> On 03/10/2018 15:07, Julien Wollbrett wrote:
>>>
>>> Hello,
>>>
>>> As I did not receive an answer to my previous email, I will try to
>>> describe my goals and questions in a bit more detail.
>>>
>>> I generated my own transcriptome fasta file using gtf from ensembl,
>>> genome fasta file from ensembl (same release) and a tool like
>>> gtf_to_fasta (TopHat) or gffread (Cufflinks). Once I did that I
>>> compared the generated file with the cdna transcriptome fasta file
>>> available at ensembl ftp. Unfortunately I found some differences
>>> between my transcriptome fasta file and the one provided by ensembl.
>>> That is why I tried to determine the origin of these differences.
>>> All my tests have been run on different species (human, D.
>>> melanogaster, ...) and different releases (84 and 93)
>>>
>>> I used the approach described below to identify the differences:
>>> - take all transcript_ids from the ensembl transcriptome fasta file
>>> - take all transcript_ids from the ensembl gtf file
>>> - count the transcripts common to both files and the transcripts
>>> specific to each file
>>> - map each transcript_id to the gtf annotation and determine the
>>> gene_biotype associated with the transcripts of each file.
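
The comparison steps above can be sketched in Python as follows. This is an illustration only, not the script used for the figures in this thread: all function names are hypothetical, the records are made-up toy data, and it assumes that the first word of each FASTA header is the transcript ID and that the GTF attribute column contains `transcript_id "..."` and `gene_biotype "..."` entries.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GENE_BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def fasta_transcript_ids(lines):
    """Collect IDs from FASTA headers (first word after '>')."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_transcript_biotypes(lines):
    """Map transcript_id -> gene_biotype using the GTF attribute column."""
    biotypes = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        tid = TRANSCRIPT_ID.search(fields[8])
        if tid:
            biotype = GENE_BIOTYPE.search(fields[8])
            biotypes[tid.group(1)] = biotype.group(1) if biotype else "unknown"
    return biotypes


def compare(fasta_ids, gtf_biotypes):
    """Count shared and file-specific transcripts, plus biotypes only in GTF."""
    gtf_ids = set(gtf_biotypes)
    only_gtf = gtf_ids - fasta_ids
    return {
        "common": len(fasta_ids & gtf_ids),
        "only_gtf": len(only_gtf),
        "only_fasta": len(fasta_ids - gtf_ids),
        "only_gtf_biotypes": Counter(gtf_biotypes[t] for t in only_gtf),
    }


# Made-up toy records (not real Ensembl data), so the sketch runs as-is.
FASTA = [">T1 cdna", "ACGT", ">T3 cdna", "GGCC"]
GTF = [
    "#!comment",
    '1\tsrc\ttranscript\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";',
    '1\tsrc\ttranscript\t20\t30\t.\t+\t.\tgene_id "G2"; transcript_id "T2"; gene_biotype "snoRNA";',
]

result = compare(fasta_transcript_ids(FASTA), gtf_transcript_biotypes(GTF))
print(result)
```

For the real files, the lines can be read with `gzip.open(path, "rt")`. FASTA header IDs may carry a version suffix that the GTF `transcript_id` lacks, in which case the IDs need to be normalised before comparing.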
>>>
>>> *results for human release 84:*
>>>
>>> gtf file :
>>> http://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
>>> fasta file :
>>> http://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
>>>
>>> transcripts common in both files : 161150
>>> transcripts present only in gtf : 38034
>>> transcripts present only in fasta file : 15091
>>> number of different gene biotypes for transcripts present in gtf: 44
>>> number of different gene biotypes for transcripts in fasta file : 23
>>> list of biotypes present only in gtf and their count :
>>>
>>> gene_biotype                   freq
>>> 3prime_overlapping_ncrna       32
>>> antisense                      10183
>>> bidirectional_promoter_lncrna  5
>>> lincRNA                        12648
>>> macro_lncRNA                   1
>>> miRNA                          4198
>>> misc_RNA                       2306
>>> Mt_rRNA                        2
>>> Mt_tRNA                        22
>>> non_coding                     3
>>> processed_transcript           2760
>>> ribozyme                       8
>>> rRNA                           549
>>> scaRNA                         49
>>> sense_intronic                 978
>>> sense_overlapping              334
>>> snoRNA                         961
>>> snRNA                          1905
>>> sRNA                           20
>>> TEC                            1069
>>> vaultRNA                       1
>>>
>>> All 38034 transcripts present only in the gtf have a gene_biotype
>>> that no longer appears in the ensembl transcriptome.
>>>
>>> *results for D. melanogaster release 93:*
>>>
>>> gtf file :
>>> http://ftp.ensembl.org/pub/release-93/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.93.gtf.gz
>>> fasta file :
>>> http://ftp.ensembl.org/pub/release-93/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz
>>>
>>> transcripts in both files : 30819
>>> transcripts present only in gtf : 3948
>>> transcripts present only in fasta : 9
>>> number of different gene biotypes for transcripts present in gtf: 8
>>> number of different gene biotypes for transcripts present in fasta
>>> file : 2
>>> list of biotypes present only in gtf and their count :
>>>
>>> gene_biotype  freq
>>> ncRNA         2941
>>> pre_miRNA     259
>>> rRNA          115
>>> snoRNA        289
>>> snRNA         32
>>> tRNA          312
>>>
>>> All 3948 transcripts present only in the gtf have a gene_biotype
>>> that no longer appears in the ensembl transcriptome.
>>>
>>>
>>> Could someone please explain to me:
>>>
>>> 1. Why are all the transcripts with these gene biotypes removed
>>> during the creation of the transcriptome?
>>> 2. Where do the transcripts present in the transcriptome fasta
>>> file but not in the gtf file (15091 in human, 9 in D. melanogaster)
>>> come from?
>>> 3. How is the cdna transcriptome fasta file generated?
>>> 4. Should I generate my own transcriptome fasta file or use the
>>> ensembl cdna fasta file?
>>>
>>> Sorry for such a long email.... and thank you for your answers.
>>>
>>> Best Regards,
>>>
>>>
>>> Julien Wollbrett
>>>