[ensembl-dev] Quick question about flat file

Thomas Danhorn danhornt at njhealth.org
Wed Mar 20 15:04:52 GMT 2019


Hi Thomas,

I am not aware of such a file.  I think there are just too many possible 
IDs and attributes, which would make a comprehensive table unwieldy, and if 
you pick only a few, someone will always be missing something or other.

Fortunately, it is fairly straightforward to put together your own from 
the Ensembl BioMart by selecting all the IDs and metadata items you 
want/need, and you can do that for any release you use.  (Choose Ensembl 
Genes and your organism from the drop-down selector, then click on 
Attributes on the left and check/uncheck your items.  Click "Results" and 
download a compressed TSV file.)
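
If you would rather script the download than click through the web 
interface, BioMart also accepts the same selection as an XML query.  The 
following is only a minimal sketch of building such a query.  The endpoint 
URL, the dataset name (hsapiens_gene_ensembl) and the attribute names 
(ensembl_gene_id, external_gene_name, description) are assumptions on my 
part; check them against the attribute list for your organism, and use the 
matching archive site if you need a specific release.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint of the current Ensembl BioMart; an archive host would be
# needed to pin the query to a particular release.
MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, attributes):
    """Return a BioMart XML query asking for the given attributes as TSV."""
    attr_xml = "".join(f"<Attribute name={quoteattr(a)} />" for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f"{attr_xml}</Dataset></Query>"
    )


def build_url(dataset, attributes):
    """Return a URL that submits the query to the martservice endpoint."""
    return MARTSERVICE + "?" + urlencode({"query": build_query(dataset, attributes)})


if __name__ == "__main__":
    # Dataset and attribute names are assumptions; adjust to your organism.
    print(build_url("hsapiens_gene_ensembl",
                    ["ensembl_gene_id", "external_gene_name", "description"]))
```

The printed URL can be fetched with any HTTP client (a browser, wget, 
urllib) and the response saved as a TSV file.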

A few things to be aware of:
1) If you select any attributes that only apply to protein-coding 
genes (e.g. certain protein IDs, signal peptides, etc.), your entire table 
will *only* contain protein-coding genes; all others are silently dropped. 
This is a known issue that stems from how the databases in BioMart are 
organized.  Always count your lines to make sure you have what you need! 
The work-around is to download all protein-specific attributes and all 
other attributes into two separate tables, which you can then merge (by 
gene ID).  BTW, the "Ensembl Family Description" *is* available for all 
genes, despite being listed under "Protein Domains and Families".
2) Certain attributes are tied to entities like transcripts, rather than 
genes, so you will get more than one line per gene if you include them. 
If this is undesirable, you will have to consolidate your table 
afterwards, to have the transcript-related attributes as e.g. a 
comma-separated list in a column.
3) If the results table is too large for the system to handle, the query 
will fail.  In this case you will have to split it up by downloading some 
attributes in a separate table and merging afterwards.
4) The Ensembl BioMart does not contain synonyms of gene names.  If you 
want those, you can get them via the Perl API.  (I have a script for 
that, let me know if you want it.)
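
The consolidation in point 2 and the merge by gene ID in points 1 and 3 
can be sketched in a few lines of Python.  This is only an illustration 
under assumptions: the column names (gene_id, symbol, transcript_id, 
protein_id) and the tiny inline tables are made up, and real BioMart 
exports will have different headers.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def collapse(rows, key, column):
    """Return one row per key, joining distinct values of `column` with commas."""
    collapsed = {}
    for row in rows:
        target = collapsed.setdefault(row[key], {**row, column: []})
        value = row[column]
        if value and value not in target[column]:
            target[column].append(value)
    return [{**r, column: ",".join(r[column])} for r in collapsed.values()]


def left_merge(all_genes, protein_rows, key):
    """Keep every row of all_genes and add the protein columns where present.

    Assumes protein_rows has at most one row per key (collapse it first).
    """
    by_key = {r[key]: r for r in protein_rows}
    extra_cols = [c for c in (protein_rows[0] if protein_rows else {}) if c != key]
    merged = []
    for row in all_genes:
        extra = by_key.get(row[key], {})
        merged.append({**row, **{c: extra.get(c, "") for c in extra_cols}})
    return merged


if __name__ == "__main__":
    # Made-up example data standing in for two BioMart downloads.
    genes = read_tsv("gene_id\tsymbol\ttranscript_id\n"
                     "G1\tabc\tT1\nG1\tabc\tT2\nG2\txyz\tT3\n")
    proteins = read_tsv("gene_id\tprotein_id\nG1\tP1\n")
    per_gene = collapse(genes, "gene_id", "transcript_id")
    for row in left_merge(per_gene, proteins, "gene_id"):
        print("\t".join(row.values()))
```

Genes without a protein entry keep their row and simply get an empty 
protein column, which is the point of merging from the "all genes" side.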

Alternatively, if the attributes you need are contained in a GTF/GFF file, 
you can extract them from there (you will almost certainly get IDs and 
symbols, but probably not much in terms of "description"; the Ensembl 
files do have biotypes).
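
As a rough sketch of the GTF route: the snippet below pulls the gene ID, 
symbol and biotype out of the "gene" lines.  It assumes the attribute keys 
used in Ensembl GTF files (gene_id, gene_name, gene_biotype); the file name 
in the usage part is hypothetical, and other GTF sources may use different 
keys.

```python
import re

# Matches key "value" pairs in the ninth (attributes) GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_records(lines):
    """Yield (gene_id, gene_name, gene_biotype) for every 'gene' feature."""
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # skip malformed lines and non-gene features
        attrs = dict(ATTR_RE.findall(fields[8]))
        yield (attrs.get("gene_id", ""),
               attrs.get("gene_name", ""),
               attrs.get("gene_biotype", ""))


if __name__ == "__main__":
    # Hypothetical file name; use your own uncompressed GTF here.
    with open("annotation.gtf") as handle:
        for record in gene_records(handle):
            print("\t".join(record))
```

Genes without a symbol come out with an empty second column rather than 
being dropped.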

Hope this helps,

Thomas


On Wed, 20 Mar 2019, Thomas Chaussepied wrote:

> Hello,
>
> I wanted to know if there was a flat file containing all gene IDs, symbols
> and descriptions for release 95.
>
> Thank you
>
> Thomas Chaussepied
>
> Bioinformatic Engineer
> chaussepied.thomas at gmail.com
>
> Inra - LPGP
> (Laboratoire de Physiologie et génomique des poissons)
> Campus de Beaulieu - Bâtiment 16A
> 35042 Rennes Cedex
> France
>

