[ensembl-dev] species_prefix limits on 7 char
Lel Eory
lel.eory at roslin.ed.ac.uk
Mon Aug 25 14:45:02 BST 2014
Dear Developers,
In the ensembl_production.species table the species_prefix field is
limited to 7 characters. The first three are always ENS, followed by
another 0-4 characters representing the species. Should this
7-character limit stay, and will it stay in the future? What rules
does the production team use to define the species_prefix? What will
happen when no four-letter abbreviation based on the binomial name is
left for a species?
Judging by what is in the production database, species_prefixes are
mostly given as ENS followed by 'G'enus 'SP'ecies. In some cases there
are conflicts, so four characters are needed for the species, as for
Melopsittacus undulatus and Mesitornis unicolor, i.e. MUND and MUNI.
At present, with over 40 species, we can still find unique
abbreviations that reflect the binomial names, but with hundreds of
new species the four-character limit will inevitably become a problem.
It would be nice to have one system that generates the abbreviations
for all the "not yet abbreviated" species, but I am not sure that this
is possible. E.g. genome assemblies are frequently labelled with a
'GEN'us 'SPE'cies abbreviation of the species. However, to use this
kind of abbreviation we would need an extra two characters in the
species_prefix field, and I am sure that with thousands of species
there will be clashes even with 6-character species abbreviations.
Can a few more characters be added to species_prefix in place of the
current definition of varchar(7)?
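To illustrate the clash problem, here is a minimal Python sketch that
groups binomial names by abbreviation for a given scheme. The function
name find_clashes and its parameters are made up for this example and
are not part of any Ensembl code; it assumes each name is written as
"Genus species".

```python
from collections import defaultdict


def find_clashes(binomials, genus_chars, species_chars):
    """Return {abbreviation: [names]} for abbreviations shared by 2+ names.

    Hypothetical helper: the abbreviation is the first `genus_chars`
    letters of the genus plus the first `species_chars` letters of the
    species epithet, upper-cased.
    """
    groups = defaultdict(list)
    for name in binomials:
        genus, species = name.split()[:2]
        abbrev = (genus[:genus_chars] + species[:species_chars]).upper()
        groups[abbrev].append(name)
    # Keep only abbreviations claimed by more than one species.
    return {abbrev: names for abbrev, names in groups.items() if len(names) > 1}


names = ["Melopsittacus undulatus", "Mesitornis unicolor"]
print(find_clashes(names, 1, 2))  # 'G'enus 'SP'ecies: both map to MUN
print(find_clashes(names, 1, 3))  # 'G'enus 'SPE'cies: MUND vs MUNI, no clash
```

Running this over the full species list for each scheme would show how
quickly clashes accumulate as the number of species grows.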
Also, what would be the best way to define the abbreviations?
E.g. we would like to be able to generate species_prefixes
programmatically from the binomial name. One option would be to check
whether 'G'enus 'SP'ecies is still available as an abbreviation and,
if not, move on and check 'G'enus 'SPE'cies, or 'GEN'us 'SPE'cies,
etc. Could there be a rule that defines the abbreviation as the
shortest not-yet-taken name based on the binomial name? Any
suggestions on this?
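The fallback idea above could be sketched roughly as follows in
Python. This is only an illustration under stated assumptions: the
names PATTERNS, candidate_prefixes and assign_prefix are hypothetical,
the order of the (genus letters, species letters) patterns is one
possible choice rather than a production rule, and max_len=7 mirrors
the current varchar(7) definition.

```python
# Hypothetical pattern order: (letters from genus, letters from species).
PATTERNS = ((1, 2), (1, 3), (2, 2), (2, 3), (3, 3))


def candidate_prefixes(binomial, max_len=7, root="ENS", patterns=PATTERNS):
    """Yield candidate prefixes for a "Genus species" name, in pattern order."""
    genus, species = binomial.split()[:2]
    for genus_chars, species_chars in patterns:
        # Skip patterns that the name is too short to fill.
        if len(genus) < genus_chars or len(species) < species_chars:
            continue
        prefix = root + (genus[:genus_chars] + species[:species_chars]).upper()
        # Skip candidates that would not fit in the column.
        if len(prefix) <= max_len:
            yield prefix


def assign_prefix(binomial, taken, max_len=7):
    """Return the first free candidate and record it, or None if all are taken."""
    for prefix in candidate_prefixes(binomial, max_len):
        if prefix not in taken:
            taken.add(prefix)
            return prefix
    return None


taken = set()
print(assign_prefix("Melopsittacus undulatus", taken))  # ENSMUN
print(assign_prefix("Mesitornis unicolor", taken))      # ENSMUNI
```

With max_len=7 only the first three patterns can ever fit, so a name
runs out of candidates quickly; raising max_len to 9 would also allow
the 'GEN'us 'SPE'cies form.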
Thank you.
Best wishes,
Lel
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.