[ensembl-dev] Homo_sapiens.GRCh38.92.chr.gtf contents compared to fasta files (cdna + ncrna)

Ben Moore bmoore at ebi.ac.uk
Fri Sep 21 14:27:20 BST 2018


Hi Vivek,

Ensembl provides an automatic gene annotation for Homo sapiens. For some species (human, mouse, zebrafish, pig and rat), the annotation provided through Ensembl also includes manual annotation from HAVANA. In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set. There should be a number of GTF files in the Ensembl 92 human GTF folder:
http://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/

.gtf:
This is the default file. It should contain the full annotation for all species except human and mouse. For human and mouse, it contains all annotation on the primary assembly, i.e. excluding patch and haplotype regions. Every species has one.

.chr.gtf:
Contains only annotation on chromosomes, so toplevel scaffolds are excluded (patch and haplotype regions are not included either).

.chr_patch_hapl_scaff.gtf:
Contains all annotation on all toplevel sequences, including patch and haplotype regions.
It should only exist for human and mouse.

Species with no chromosomes will have a single file, .gtf
Species with only chromosomes but no scaffolds will have a single file, .gtf
Species with chromosomes and scaffolds will have two files, .gtf and .chr.gtf 
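
As a minimal sketch, the rules above can be written as a small Python function. The function name and its boolean parameters are hypothetical and only restate the rules in this email; they are not part of any Ensembl tool. The third flag stands in for "human or mouse", the only species expected to have the patch/haplotype file.

```python
from typing import List


def expected_gtf_suffixes(has_chromosomes: bool,
                          has_scaffolds: bool,
                          has_patches_or_haplotypes: bool = False) -> List[str]:
    """Return the GTF file suffixes expected for a species (hypothetical helper)."""
    # Every species has the default .gtf file.
    suffixes = [".gtf"]
    # A separate chromosome-only file is only useful when there are
    # both chromosomes and toplevel scaffolds.
    if has_chromosomes and has_scaffolds:
        suffixes.append(".chr.gtf")
    # Assumption: only human and mouse have patch and haplotype regions,
    # so only they get the file covering all toplevel sequences.
    if has_patches_or_haplotypes:
        suffixes.append(".chr_patch_hapl_scaff.gtf")
    return suffixes


# Human: chromosomes, scaffolds, and patch/haplotype regions.
human_files = expected_gtf_suffixes(True, True, True)
# A species assembled only to scaffold level.
scaffold_only_files = expected_gtf_suffixes(False, True)
```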

Further information can be found in the README file:
http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/README
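
If you do want to compare the IDs directly, something like the following rough Python sketch may help. It assumes that transcript rows in the GTF carry a transcript_id "..." attribute without a version suffix, and that each FASTA header starts with the transcript ID followed by ".version". Please check both assumptions against your files. The function names are my own and not from any Ensembl library.

```python
import re
from typing import Dict, Iterable, Set

# Matches the transcript_id attribute in column 9 of a GTF line.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines: Iterable[str]) -> Set[str]:
    """Collect transcript IDs from GTF rows whose feature type is 'transcript'."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines: Iterable[str]) -> Set[str]:
    """Collect transcript IDs from FASTA headers, dropping any '.version' suffix."""
    ids = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        tokens = line[1:].split()
        if tokens:
            ids.add(tokens[0].split(".")[0])
    return ids


def compare_ids(gtf_ids: Set[str], fasta_ids: Set[str]) -> Dict[str, Set[str]]:
    """Split the two ID sets into shared IDs and IDs unique to each source."""
    return {
        "shared": gtf_ids & fasta_ids,
        "only_in_gtf": gtf_ids - fasta_ids,
        "only_in_fasta": fasta_ids - gtf_ids,
    }
```

The functions take any iterable of text lines, so for the compressed downloads you could pass a handle from gzip.open(path, "rt"), and combine the cdna and ncrna sets with a set union before comparing.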

Best wishes

Ben

> On 19 Sep 2018, at 10:50, Vivek Iyer <vvi at sanger.ac.uk> wrote:
> 
> Hi all,
> 
> From the downloadable data on ftp://ftp.ensembl.org/pub/release-92/gtf/ I can see one gtf file for download (I’m using v92 at the moment): Homo_sapiens.GRCh38.92.chr.gtf
> 
> Are the transcripts in here a superset of, a subset of, or identical to the combined transcripts in these two fasta files under ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/:
> Homo_sapiens.GRCh38.cdna.all.fa  
> Homo_sapiens.GRCh38.ncrna.fa
> 
> Of course, I could resolve the IDs and do a simple comparison :-) I was hoping someone could point me at docs (along with a nudge to RTFM) or supply some motivation for the split. Both types of files are needed at different points of an RNAseq pipeline.
> 
> Thanks,
> 
> Vivek

Ben Moore
Ensembl Outreach Officer

European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
UK

bmoore at ebi.ac.uk
+44 (0)1223 494265
