[ensembl-dev] Obtaining the genomic sequences for all the 5'UTR and CDS for mouse genome

Thomas Danhorn tdanhorn at gmail.com
Thu Sep 12 22:49:05 BST 2024


There are actually 70,611 genes annotated in Ensembl release 112, but only 
a fraction are protein coding (and would have the features you are looking 
for).  The reason you are getting so many entries is that each gene can
have several transcripts (some have only one, others have dozens), and
each entry in the GFF3 (or GTF) is linked not only to a gene but also to
a transcript.  So the same 5'-UTR may appear several times, and/or in
several versions, since not all transcripts necessarily have the same
one.  For the CDS entries, you have not only the multiplication factor
of the transcripts, but also the fact that each entry is the CDS part of
a single exon, and each transcript has between one and several dozen
coding exons.
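
To see where the multiplication comes from, here is a minimal Python
sketch that counts rows, distinct genes and distinct transcripts for one
feature type.  It assumes GTF-style lines with gene_id and transcript_id
attributes and the feature name "five_prime_utr" (in the GFF3 it is
"five_prime_UTR"); the helper gtf_line and the toy records are invented
for illustration and are not real annotation.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_features(lines, feature_type):
    """Return (rows, distinct genes, distinct transcripts) for one feature type."""
    rows = 0
    genes = set()
    transcripts = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        rows += 1
        genes.add(attrs.get("gene_id"))
        transcripts.add(attrs.get("transcript_id"))
    return rows, len(genes), len(transcripts)


def gtf_line(feature, start, end, gene_id, transcript_id):
    """Build one made-up GTF line (hypothetical helper for the demo below)."""
    attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    return "\t".join(
        ["1", "example", feature, str(start), str(end), ".", "+", ".", attrs]
    )


# Invented toy data: one gene, two transcripts sharing the same 5'-UTR.
toy = [
    gtf_line("five_prime_utr", 100, 150, "G1", "T1"),
    gtf_line("CDS", 151, 300, "G1", "T1"),
    gtf_line("CDS", 500, 650, "G1", "T1"),
    gtf_line("five_prime_utr", 100, 150, "G1", "T2"),
    gtf_line("CDS", 151, 300, "G1", "T2"),
]

print(count_features(toy, "five_prime_utr"))  # (2, 1, 2)
print(count_features(toy, "CDS"))             # (3, 1, 2)
```

For a real file, the lines could come from gzip.open(path, "rt") instead
of the toy list.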

So if you want 5'-UTR and CDS, you will get those per *transcript*, rather 
than per gene.  You'll have to decide if you want all of them, or focus on 
one of them (for that the concept of "canonical transcript" 
[https://mart.ensembl.org/info/genome/genebuild/canonical.html] may be 
helpful).
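
If you go the canonical route, a rough sketch of the filtering could look
like the following.  It assumes that the Ensembl GTF marks canonical
transcripts with the attribute tag "Ensembl_canonical" on the
"transcript" rows (please check this against your file), and it first
collects those transcript IDs, then keeps only the 5'-UTR and CDS rows
that belong to them.  The function names and the toy records are made up
for illustration.

```python
import re

TAG_RE = re.compile(r'tag "([^"]*)"')
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]*)"')


def canonical_transcript_ids(lines, tag="Ensembl_canonical"):
    """Collect transcript IDs whose 'transcript' row carries the given tag."""
    ids = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # A row can carry several tag attributes, so look at all of them.
        if tag in TAG_RE.findall(cols[8]):
            match = TRANSCRIPT_RE.search(cols[8])
            if match:
                ids.add(match.group(1))
    return ids


def features_of(lines, transcript_ids, feature_types=("five_prime_utr", "CDS")):
    """Keep rows of the wanted feature types that belong to the given transcripts."""
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in feature_types:
            continue
        match = TRANSCRIPT_RE.search(cols[8])
        if match and match.group(1) in transcript_ids:
            kept.append(line)
    return kept


def row(feature, start, end, attrs):
    """Build one made-up GTF line (hypothetical helper for the demo below)."""
    return "\t".join(["1", "example", feature, str(start), str(end), ".", "+", ".", attrs])


# Invented toy data: T1 is tagged canonical, T2 is not.
toy = [
    row("transcript", 100, 650, 'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "Ensembl_canonical";'),
    row("five_prime_utr", 100, 150, 'gene_id "G1"; transcript_id "T1";'),
    row("CDS", 151, 300, 'gene_id "G1"; transcript_id "T1";'),
    row("transcript", 120, 300, 'gene_id "G1"; transcript_id "T2"; tag "basic";'),
    row("CDS", 151, 300, 'gene_id "G1"; transcript_id "T2";'),
]

canonical = canonical_transcript_ids(toy)
selected = features_of(toy, canonical)
print(sorted(canonical))  # ['T1']
print(len(selected))      # 2
```

Note that the lines are read twice here, so a real file would have to be
re-opened or read into a list first.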

It might be easier for this application to work with GTFs than with GFF3s, 
because they don't have a required hierarchy, so they should be easier to 
filter (although the encoded information is the same).  Try to understand 
the structure of a GTF; then you can come up with a strategy for extracting 
only the features you want.  With those you can get the sequences from a 
genome FASTA (something like bedtools getfasta 
[https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html] 
could be used for that).
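
As a sketch of the last step, the selected GTF rows can be turned into
BED lines that bedtools getfasta accepts.  The main trap is the
coordinate system: GTF is 1-based with an inclusive end, BED is 0-based
with an exclusive end, so only the start is shifted by one.  The function
name, the "<transcript_id>|<feature>" naming scheme and the example row
are my own choices for illustration.

```python
import re

TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]*)"')


def gtf_to_bed6(line):
    """Convert one GTF feature line to a BED6 line named '<transcript_id>|<feature>'."""
    cols = line.rstrip("\n").split("\t")
    chrom, feature, strand = cols[0], cols[2], cols[6]
    start, end = int(cols[3]), int(cols[4])
    match = TRANSCRIPT_RE.search(cols[8])
    name = f"{match.group(1) if match else 'NA'}|{feature}"
    # GTF: 1-based, end-inclusive.  BED: 0-based, end-exclusive.
    return "\t".join([chrom, str(start - 1), str(end), name, "0", strand])


# Invented example row: a CDS piece on the minus strand.
example = "\t".join(
    ["1", "example", "CDS", "151", "300", ".", "-", "0",
     'gene_id "G1"; transcript_id "T1";']
)
print(gtf_to_bed6(example))

# The resulting BED file could then be used along the lines of:
#   bedtools getfasta -fi genome.fa -bed features.bed -s -name -fo out.fa
# (-s reverse-complements minus-strand features; file names are placeholders)
```

Keep in mind that this gives one sequence per CDS row, i.e. per coding
exon.  To get the full coding sequence of a transcript, the pieces still
have to be joined in transcript order (descending genomic position for
minus-strand transcripts).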

Hope this makes sense and is helpful.

Best,

Thomas


On Tue, 10 Sep 2024, Allan Kamau wrote:

> I would like to obtain the sequences for the 5' UTR and CDS for the mouse
> genome.
> I began by filtering all the records having "five_prime_UTR" from the
> chromosome.<chromosome_name>.gff3.gz files from "
> https://ftp.ensembl.org/pub/release-112/gff3/mus_musculus/", and obtained
> some 95358 records. This number seems too high, as the mouse genome has
> approximately 25,000 genes.
>
> I did similar filtering for records having the value "CDS" as their
> third field and obtained some 522159 entries, which is a large number
> considering there are only 25,000 genes for the GRCm39 genome.
>
> What would be the preferred way to obtain the 5' UTR and CDS for the entire
> mouse genome?
>
> Regards,
> -Allan.
>


