[ensembl-dev] Homo_sapiens.GRCh38.92.chr.gtf contents compared to fasta files (cdna + ncrna)
Ben Moore
bmoore at ebi.ac.uk
Fri Sep 21 14:27:20 BST 2018
Hi Vivek,
Ensembl provides an automatic gene annotation for Homo sapiens. For some species (human, mouse, zebrafish, pig and rat), the annotation provided through Ensembl also includes manual annotation from HAVANA. In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set. There should be a number of GTF files in the Ensembl 92 human GTF folder:
http://ftp.ensembl.org/pub/release-92/gtf/homo_sapiens/
.gtf:
This is the default file. For all species except human and mouse, it contains the full annotation. For human and mouse, it contains all annotation on the primary assembly, i.e. excluding patch and haplotype regions. Every species has one.
.chr.gtf:
Contains annotation on chromosomes only: toplevel scaffolds, patches and haplotypes are all excluded.
.chr_patch_hapl_scaff.gtf:
Contains all annotation on all toplevel sequences, including patch and haplotype regions.
It should only exist for human and mouse.
Species with no chromosomes will have a single file, .gtf.
Species with only chromosomes but no scaffolds will have a single file, .gtf.
Species with chromosomes and scaffolds will have two files, .gtf and .chr.gtf.
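
The rules above can be summarised as a small lookup. This is only a sketch of the rules as stated in this email, not code from the Ensembl pipeline: the function name and its boolean parameters are hypothetical, and the third flag stands in for "human or mouse".

```python
def expected_gtf_suffixes(has_chromosomes, has_scaffolds,
                          has_patches_or_haplotypes=False):
    """Return the GTF file suffixes a species is expected to have.

    Hypothetical helper that encodes the rules described above.
    has_patches_or_haplotypes is an assumption standing in for
    "human or mouse", the only species said to have the third file.
    """
    # Every species has the default .gtf file.
    suffixes = [".gtf"]
    # A separate chromosome-only file is only useful when there are
    # both chromosomes and non-chromosomal toplevel scaffolds.
    if has_chromosomes and has_scaffolds:
        suffixes.append(".chr.gtf")
    # The full toplevel file, including patches and haplotypes.
    if has_patches_or_haplotypes:
        suffixes.append(".chr_patch_hapl_scaff.gtf")
    return suffixes


# Species with scaffolds only, chromosomes only, and both:
scaffold_only = expected_gtf_suffixes(False, True)
chromosome_only = expected_gtf_suffixes(True, False)
both = expected_gtf_suffixes(True, True)
# Human-like case with patches and haplotypes:
human_like = expected_gtf_suffixes(True, True, has_patches_or_haplotypes=True)
```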
Further information can be found in the README file:
http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/README
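
If you do want to run the ID comparison mentioned in your question, it could look roughly like the sketch below. It assumes the GTF attribute column carries transcript_id "..." and that each FASTA header starts with the versioned transcript ID (e.g. >ENST00000456328.2). The function names and the sample records are made up for illustration, and the sample data says nothing about what the real files contain.

```python
import re

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID_RE = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines):
    """Collect unversioned transcript IDs from GTF 'transcript' features."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(lines):
    """Collect transcript IDs from FASTA headers, dropping the version."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0].split(".")[0])
    return ids


def compare_ids(gtf_ids, fasta_ids):
    """Report which IDs are shared and which occur on one side only."""
    return {
        "shared": gtf_ids & fasta_ids,
        "only_in_gtf": gtf_ids - fasta_ids,
        "only_in_fasta": fasta_ids - gtf_ids,
    }


# Made-up sample records, for illustration only.
sample_gtf = [
    "#!genome-build GRCh38\n",
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";\n',
    '1\thavana\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";\n',
]
sample_fasta = [
    ">ENST00000456328.2 cdna chromosome:GRCh38:1:11869:14409:1\n",
    "GTTAACTTGCCGTCAGCCTTTTCTTTGACC\n",
    ">ENST00000000001.1 cdna scaffold:GRCh38:example:1:100:1\n",
    "ACGTACGTACGT\n",
]
result = compare_ids(gtf_transcript_ids(sample_gtf),
                     fasta_transcript_ids(sample_fasta))
```

For the real files you would pass open file handles (via gzip.open(path, "rt") if they are still compressed) in place of the sample lists.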
Best wishes
Ben
> On 19 Sep 2018, at 10:50, Vivek Iyer <vvi at sanger.ac.uk> wrote:
>
> Hi all,
>
> From the downloadable data on ftp://ftp.ensembl.org/pub/release-92/gtf/ I can see one gtf file for download (I’m using v92 at the moment): Homo_sapiens.GRCh38.92.chr.gtf
>
> Are the transcripts in here a superset of, a subset of, or identical to the combined transcripts in these two fasta files under ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/:
> Homo_sapiens.GRCh38.cdna.all.fa
> Homo_sapiens.GRCh38.ncrna.fa
>
> Of course, I could resolve the IDs and do a simple comparison :-) I was hoping someone could point me at docs (along with a nudge to RTFM) or supply some motivation for the split. Both types of files are needed at different points of an RNAseq pipeline.
>
> Thanks,
>
> Vivek
Ben Moore
Ensembl Outreach Officer
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
UK
bmoore at ebi.ac.uk
+44 (0)1223 494265