[BUG] chrF++/chrF/TER metrics receive references in wrong format, causing incorrect corpus-level scoring #1112

@dyurchenko98

Description

Describe the bug

CorpusLevelTranslationMetric passes references to sacrebleu in [sent_id][ref_id] shape (a per-sample list of references), but sacrebleu expects [ref_id][sent_id] (a list of reference streams). As a result, chrF/chrF++/TER score only the first k hypotheses, where k is the (minimum) number of references per sample, and each of those hypotheses is compared against a pooled set of references drawn from across the dataset instead of its own. BLEU is special-cased but still drops every reference after the first.
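
For concreteness, here is a minimal sketch of the two orientations for two samples with two references each (illustrative strings only):

# Per-sample orientation that lighteval currently builds: golds[sent_id][ref_id]
golds_per_sample = [
    ["ref A1", "ref A2"],  # references for hypothesis 1
    ["ref B1", "ref B2"],  # references for hypothesis 2
]

# Reference-stream orientation that sacrebleu expects: refs[ref_id][sent_id]
refs_per_stream = [
    ["ref A1", "ref B1"],  # first reference of every hypothesis
    ["ref A2", "ref B2"],  # second reference of every hypothesis
]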

To Reproduce

Minimal example showing the shape issue (mirrors how lighteval currently passes refs):

from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list

items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]

metric = CorpusLevelTranslationMetric("chrf++")

# Mirrors compute_corpus(): each i.golds is Sequence[str], so this produces
# Sequence[Sequence[str]] in per-sample orientation.
golds = [i.golds for i in items]  # [sent_id][ref_id]
preds = [as_list(i.preds)[0] for i in items]

# Shows only one hypothesis is being scored:
stats = metric.get_metric()._extract_corpus_statistics(preds, golds)
print(len(stats))  # 1 (should be 2)

score_wrong = metric.get_metric().corpus_score(preds, golds).score
print(score_wrong)  # 100.0 even though the 2nd hypothesis is wrong (TER likewise reports a perfect 0.0)

Expected behavior

Each hypothesis should be scored against its own reference set, and corpus statistics should include all hypotheses (len(stats) == len(hypotheses)).
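
As a sanity check (a sketch only, not the proposed fix), scoring each hypothesis against its own references with sacrebleu's sentence_score shows the intended per-sample pairing:

from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2)  # chrF++
hyps = ["GOOD", "PRED2"]
refs_per_sample = [["GOOD"], ["REF2"]]

for hyp, refs in zip(hyps, refs_per_sample):
    print(chrf.sentence_score(hyp, refs).score)  # 100.0 for the first pair, low for the second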

Version info

  • lighteval: 0.13.0
  • Python: 3.13
  • Dependencies: sacrebleu 2.5.1

Suspected root cause

  • GenerativeCorpusMetricInput.golds is list[str] (per-sample refs) (src/lighteval/metrics/sample_preparator.py).
  • compute_corpus() does golds = [i.golds for i in items], producing list[list[str]]. The type checks out, but the orientation is per-sample, not per-reference-stream (src/lighteval/metrics/metrics_corpus.py).
  • sacrebleu expects [ref_id][sent_id]: it builds per-segment reference tuples via zip(*references) and then pairs them with hypotheses via zip(hypotheses, ref_cache), silently truncating to the shorter of the two (sacrebleu/metrics/base.py); see the sketch after this list.
  • chrF++ picks the best-scoring reference among those provided for each segment (sacrebleu/metrics/chrf.py, _compute_segment_statistics, where best_f_score is updated per reference). With the wrong orientation, the best match is usually the hypothesis's own reference (e.g., ref1 for hyp1) but not necessarily, which can inflate scores. TER goes through the same base machinery (sacrebleu/metrics/ter.py + sacrebleu/metrics/base.py).
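
A plain-Python sketch of the effective pairing that the wrong orientation produces (not sacrebleu's actual code, just the resulting behavior, using the data from the reproduction above):

hyps  = ["GOOD", "PRED2"]
golds = [["GOOD"], ["REF2"]]       # [sent_id][ref_id], as currently passed

# sacrebleu treats each inner list as a reference stream and zips them:
segments = list(zip(*golds))       # [("GOOD", "REF2")] -> a single pooled "segment"

# it then pairs hypotheses with segments, truncating to the shorter sequence:
pairs = list(zip(hyps, segments))  # [("GOOD", ("GOOD", "REF2"))]
print(pairs)                       # hyp 2 is silently dropped; hyp 1 sees pooled refs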

Suggested fix

Transpose golds before calling sacrebleu so it matches [ref_id][sent_id]:

from itertools import zip_longest  # zip_longest handles a variable number of refs per sample

# inside compute_corpus(), before corpus_score(...)

golds = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]

We can also consider applying the same transpose for BLEU to keep multi-reference support instead of dropping to gold[0].
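
Applying the transpose to the reproduction above (a sketch continuing that snippet; exact numbers depend on the metric) yields per-hypothesis statistics:

golds_t = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]  # [ref_id][sent_id]

stats = metric.get_metric()._extract_corpus_statistics(preds, golds_t)
print(len(stats))  # 2 -- one entry per hypothesis

score_fixed = metric.get_metric().corpus_score(preds, golds_t).score
print(score_fixed)  # no longer 100: the wrong 2nd hypothesis now lowers the corpus score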

If I missed something or this is intended behavior, please let me know.
