[ensembl-dev] species_prefix limits on 7 char

Lel Eory lel.eory at roslin.ed.ac.uk
Mon Aug 25 14:45:02 BST 2014


Dear Developers,

In the ensembl_production.species table the species_prefix field is 
limited to 7 characters. The first three are always ENS, followed by 
0-4 characters representing the species. Will this 7-character limit 
remain in place in the future, and should it? What rules does the 
production team use to define the species_prefix? What happens when no 
four-letter abbreviation based on the binomial name is left for a 
species?

Judging by the production database, species_prefixes are mostly given 
as ENS followed by 'G'enus 'SP'ecies. In some cases there are conflicts, 
so four characters are needed for the species, as with Melopsittacus 
undulatus and Mesitornis unicolor, i.e. MUND and MUNI. At present, with 
over 40 species, we can still find unique abbreviations that reflect the 
binomial names, but with hundreds of new species the 4-character limit 
will inevitably cause clashes. It would be nice to have one system that 
generates the abbreviations for all the "not yet abbreviated" species, 
but I am not sure that this is possible. E.g. genome assemblies are 
frequently named after a species using the 'GEN'us 'SPE'cies 
abbreviation. However, to use this kind of abbreviation we would need 
an extra two characters in the species_prefix field, and I am sure that 
with thousands of species there will be clashes even with 6-character 
species abbreviations. Can a few more characters be added to 
species_prefix, replacing the current varchar(7) definition?
Also, what would be the best way to define the abbreviations?
E.g. we would like to be able to generate species_prefixes 
programmatically from the binomial name. One option would be to check 
whether 'G'enus 'SP'ecies is available as an abbreviation and, if not, 
move on to 'G'enus 'SPE'cies, then 'GEN'us 'SPE'cies, and so on. The 
rule could be to use the shortest abbreviation based on the binomial 
name that is not yet taken. Any suggestions on this?
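
To make the fallback idea concrete, here is a minimal Python sketch. It 
is only an illustration: the names PATTERNS, candidate_abbreviations and 
assign_prefix are hypothetical, the set of taken prefixes stands in for 
a lookup against the species table, and the order of patterns is just 
the one listed above. It does not reproduce the existing assignments 
(for example, it would give Melopsittacus undulatus MUN, not MUND).

```python
# Hypothetical sketch: try 'G'enus 'SP'ecies, then 'G'enus 'SPE'cies,
# then 'GEN'us 'SPE'cies, and keep the first prefix that is still free.

# (genus letters, species letters), in the order they are tried.
PATTERNS = [(1, 2), (1, 3), (3, 3)]


def candidate_abbreviations(binomial, max_len=4, patterns=PATTERNS):
    """Yield abbreviations for a binomial name, in pattern order.

    max_len is the number of characters allowed after 'ENS'
    (4 for the current varchar(7) field).
    """
    genus, species = binomial.upper().split()[:2]
    for g_len, s_len in patterns:
        if g_len + s_len > max_len:
            continue  # would not fit into the field
        if len(genus) < g_len or len(species) < s_len:
            continue  # name too short for this pattern
        yield genus[:g_len] + species[:s_len]


def assign_prefix(binomial, taken, max_len=4):
    """Return the first free 'ENS' + abbreviation and record it in taken."""
    for abbrev in candidate_abbreviations(binomial, max_len):
        prefix = "ENS" + abbrev
        if prefix not in taken:
            taken.add(prefix)
            return prefix
    raise ValueError("no free prefix for %r within %d characters"
                     % (binomial, max_len))


taken = set()  # stand-in for the prefixes already in the database
first = assign_prefix("Melopsittacus undulatus", taken)
second = assign_prefix("Mesitornis unicolor", taken)
```

With an empty starting set, the first call takes the 'G'enus 'SP'ecies 
form and the second falls back to 'G'enus 'SPE'cies. A third species 
whose candidates are both taken raises an error under the current 
limit; it only gets a prefix if max_len is raised to 6, which is the 
case for widening the field.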

Thank you.

Best wishes,
Lel


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.




