
Addressing differences in the backend score scales #862

@juhoinkinen

Description

Currently, the simple ensemble in Annif combines the scores from different backends by a plain weighted average. However, the score scales of the backends are not necessarily uniform. For example, a score of 0.8 from MLLM might mean a suggestion is good, while the same score from Bonsai could mean it is very good, yet in the lower score range the MLLM scores could be "more representative". Simply weighting the scores from different algorithms when combining them is therefore not necessarily optimal.
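
As a rough illustration (not Annif's actual ensemble code), a plain weighted average treats equal numeric scores from different backends as equally strong evidence, regardless of how each backend's scale should be interpreted:

```python
import numpy as np

def weighted_average(scores, weights):
    """Combine per-subject score vectors from several backends
    with a plain weighted average (illustrative sketch only)."""
    scores = np.asarray(scores, dtype=float)    # shape: (n_backends, n_subjects)
    weights = np.asarray(weights, dtype=float)  # one weight per backend
    return weights @ scores / weights.sum()

# Made-up scores for one subject: 0.8 from MLLM ("good") and 0.8 from
# Bonsai ("very good") contribute identically to the combined score.
print(weighted_average([[0.8], [0.8]], weights=[1.0, 1.0]))  # -> [0.8]
```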

The score distributions of some algorithms for the JYX documents were plotted quite some time ago (see this Slack thread); see the plots below. Notably, the fastText scores are much lower than the scores of the other algorithms. (I assume MLLM gives results similar to Maui.)

[Five plots of per-backend score distributions for the JYX documents]

An approach tried out in the GermEval task was to not just average the base suggestions' scores, but to first raise each score to some power x and then multiply by a weight w: score**x * w. The exponent x can vary per backend and can be optimized using hyperopt. This is similar to the NN ensemble, where all scores are square-rooted (x=0.5), except that here the exponent is made backend-specific.
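
A minimal sketch of that combination, assuming made-up backend values, weights and exponents (in practice the exponents would be tuned with hyperopt together with the weights):

```python
import numpy as np

def exponentiated_average(scores, weights, exponents):
    """Raise each backend's scores to a backend-specific exponent x,
    then combine them with a weighted average (score**x * w)."""
    scores = np.asarray(scores, dtype=float)        # (n_backends, n_subjects)
    weights = np.asarray(weights, dtype=float)      # one weight per backend
    exponents = np.asarray(exponents, dtype=float)  # one exponent per backend
    transformed = scores ** exponents[:, None]      # score**x, per backend
    return weights @ transformed / weights.sum()    # then weight and average

# An exponent below 1 stretches a backend's compressed low scores upward
# (e.g. fastText), while x=1 leaves a backend's scale unchanged.
combined = exponentiated_average(
    scores=[[0.05, 0.02],   # hypothetical fastText scores for two subjects
            [0.80, 0.30]],  # hypothetical Bonsai scores for the same subjects
    weights=[1.0, 1.0],
    exponents=[0.3, 1.0],
)
print(combined)
```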

There is now the exponentiate-scores branch including those changes.
