[ensembl-dev] Homo_sapiens.GRCh38.92.chr.gtf contents compared to fasta files (cdna + ncrna)

Ben Moore bmoore at ebi.ac.uk
Mon Oct 1 16:18:47 BST 2018


Hi Vivek,

I wanted to send some further information to help answer your question, following some discussions with my colleagues here at Ensembl.

The transcript set in Homo_sapiens.GRCh38.92.chr.gtf is a subset of the transcripts in the cdna + ncrna FASTA files combined. That GTF file includes only the top-level chromosomal regions (1..22, X, Y, MT), whereas the FASTA files represent features on all top-level sequences, including patches and haplotype regions.

The Homo_sapiens.GRCh38.92.chr_patch_hapl_scaff.gtf file, by contrast, includes all the top-level sequences, including the patches, haplotypes and scaffolds.
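If you do want to confirm the relationship by resolving the IDs, as you suggested, a minimal Python sketch along these lines should work. It is not an official Ensembl tool, and the function names are just illustrative. It assumes that the GTF stores the unversioned ID in the transcript_id attribute of "transcript" feature lines, and that each FASTA header starts with a versioned ID (e.g. ENST00000456328.2), so the version suffix is stripped before comparing.

```python
import gzip
import re

# Assumed attribute layout: transcript_id "ENST..."; in column 9 of the GTF.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def gtf_transcript_ids(lines):
    """Collect transcript IDs from GTF lines whose feature type is 'transcript'."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines):
    """Collect IDs from FASTA header lines, dropping any '.N' version suffix."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0].split(".")[0])
    return ids


def compare_ids(gtf_ids, fasta_ids):
    """Summarise how the two ID sets relate to each other."""
    return {
        "gtf_only": gtf_ids - fasta_ids,
        "fasta_only": fasta_ids - gtf_ids,
        "gtf_is_subset": gtf_ids <= fasta_ids,
    }
```

To use it, pass open_text() handles for your local copies of the GTF and of the two FASTA files (for the FASTA side, take the union of the cdna and ncrna ID sets) and then inspect the result of compare_ids(). If the explanation above holds, I would expect "gtf_only" to be empty and "fasta_only" to contain the transcripts on patches, haplotypes and scaffolds.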

I hope this helps, but please do get back in touch if you have any further questions.

Best wishes

Ben

> On 21 Sep 2018, at 14:27, Ben Moore <bmoore at ebi.ac.uk> wrote:
> 
> Hi Vivek,
> 
> Ensembl provides an automatic gene annotation for Homo sapiens. For some species (human, mouse, zebrafish, pig and rat), the annotation provided through Ensembl also includes manual annotation from HAVANA. In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set. There should be a number of GTF files in the Ensembl 92 human GTF folder:
> http://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/
> 
> .gtf:
> This is the default file. It should contain the full annotation for all species except human and mouse. For human and mouse, it will contain all annotation on the primary assembly, i.e. excluding patch and haplotype regions. All species have one.
> 
> .chr.gtf:
> Contains only annotation on chromosomes, so toplevel scaffolds, patches and haplotypes are all excluded.
> 
> .chr_patch_hapl_scaff.gtf:
> Contains all annotation on all toplevel sequences, including patch and haplotype regions.
> It should only exist for human and mouse.
> 
> Species with no chromosomes will have a single file, .gtf
> Species with only chromosomes but no scaffolds will have a single file, .gtf
> Species with chromosomes and scaffolds will have two files, .gtf and .chr.gtf 
> 
> Further information can be found in the README file:
> http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/README
> 
> Best wishes
> 
> Ben
> 
>> On 19 Sep 2018, at 10:50, Vivek Iyer <vvi at sanger.ac.uk> wrote:
>> 
>> Hi all,
>> 
>> From the downloadable data on ftp://ftp.ensembl.org/pub/release-92/gtf/ I can see one GTF file for download (I’m using v92 at the moment): Homo_sapiens.GRCh38.92.chr.gtf
>> 
>> Are the transcripts in here a superset of, a subset of, or identical to the combined transcripts in these two FASTA files under ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/:
>> Homo_sapiens.GRCh38.cdna.all.fa  
>> Homo_sapiens.GRCh38.ncrna.fa
>> 
>> Of course, I could resolve the IDs and do a simple comparison :-) I was hoping someone could point me at docs (along with a nudge to RTFM) or supply some motivation for the split. Both types of files are needed at different points of an RNAseq pipeline.
>> 
>> Thanks,
>> 
>> Vivek
> 
> Ben Moore
> Ensembl Outreach Officer
> 
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge
> CB10 1SD
> UK
> 
> bmoore at ebi.ac.uk
> +44 (0)1223 494265
> 
