Currently, the simple ensemble in Annif combines the scores from different backends using a plain weighted average. However, the score scales of the backends are not necessarily uniform. For example, a score of 0.8 from MLLM might mean a suggestion is good, while the same score from Bonsai could mean it is very good, yet in the lower score range the MLLM scores could be "more representative". Thus, simple weighting of the scores from different algorithms is not necessarily the optimal way to combine them.
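To make the starting point concrete, here is a minimal sketch of the simple-ensemble behavior described above (function and variable names are hypothetical, not Annif's actual internals): each backend yields per-subject scores, and the ensemble takes a weighted average without rescaling the per-backend distributions.

```python
def weighted_average(scores_by_backend, weights):
    """Combine per-subject scores from several backends by weighted average.

    scores_by_backend: dict of backend name -> dict of subject -> score
    weights: dict of backend name -> float weight
    """
    total_weight = sum(weights[b] for b in scores_by_backend)
    combined = {}
    for backend, scores in scores_by_backend.items():
        w = weights[backend]
        for subject, score in scores.items():
            # Scores are summed as-is: a 0.8 from one backend counts the
            # same as a 0.8 from another, regardless of their score scales.
            combined[subject] = combined.get(subject, 0.0) + w * score
    return {s: v / total_weight for s, v in combined.items()}
```

With equal weights, a subject scored 0.8 by one backend and 0.4 by another ends up at 0.6, even if the two backends mean very different things by those numbers.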
The score distributions of some algorithms on the JYX documents were plotted quite some time ago (see this Slack thread); see the plot below. Notably, the fastText scores are much lower than those of the other algorithms. (I assume MLLM gives similar results to Maui.)
An approach tried out in the GermEval task was not to simply average the base suggestions' scores, but to first raise each score to some power x and then multiply by a weight w: score**x * w. The exponent x can vary per backend and can be optimized using hyperopt. This is similar to the NN ensemble, where all scores are square-rooted (x=0.5), except that here the exponent is made backend-specific.
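The exponent-and-weight combination above can be sketched as follows (again with hypothetical names; the actual implementation lives on the branch mentioned below). An exponent below 1 boosts low scores toward 1, while an exponent above 1 suppresses them, so each backend's score distribution can be reshaped before averaging.

```python
def combine_with_exponents(scores_by_backend, weights, exponents):
    """Combine per-subject scores as score**x * w, with per-backend x and w.

    scores_by_backend: dict of backend name -> dict of subject -> score
    weights: dict of backend name -> float weight w
    exponents: dict of backend name -> float exponent x
    """
    total_weight = sum(weights[b] for b in scores_by_backend)
    combined = {}
    for backend, scores in scores_by_backend.items():
        w = weights[backend]
        x = exponents[backend]
        for subject, score in scores.items():
            # Reshape this backend's score scale before weighting it.
            combined[subject] = combined.get(subject, 0.0) + (score ** x) * w
    return {s: v / total_weight for s, v in combined.items()}
```

Setting x=0.5 for every backend recovers the square-root behavior of the NN ensemble; in a hyperopt search, each backend's x would be tuned separately alongside its weight.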
There is now the exponentiate-scores branch including those changes.