Describe the bug
CorpusLevelTranslationMetric passes references to sacrebleu in [sent_id][ref_id] shape (one list of references per sample), but sacrebleu expects [ref_id][sent_id] (parallel reference streams). As a result, chrF/chrF++/TER score only the first k hypotheses, where k is the smallest number of references any sample has, and compare each of them against a pool of references gathered from every sample in the dataset. BLEU is special-cased but still drops every reference after gold[0].
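For concreteness, here are the two orientations in plain Python (illustrative values, not lighteval code):

# Per-sample orientation -- what lighteval currently builds: golds[sent_id][ref_id]
per_sample = [["ref1a", "ref1b"],  # all refs for sample 1
              ["ref2a", "ref2b"]]  # all refs for sample 2
# Reference-stream orientation -- what sacrebleu expects: golds[ref_id][sent_id]
per_stream = [["ref1a", "ref2a"],  # first ref of every sample
              ["ref1b", "ref2b"]]  # second ref of every sample
# Passing per_sample where per_stream is expected makes sacrebleu treat each
# sample's reference list as a reference stream, pooling refs across samples.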
To Reproduce
Minimal example showing the shape issue (mirrors how lighteval currently passes refs):
from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list
items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]
metric = CorpusLevelTranslationMetric("chrf++")
# Mirrors compute_corpus(): each i.golds is Sequence[str], so this produces
# Sequence[Sequence[str]] in per-sample orientation.
golds = [i.golds for i in items] # [sent_id][ref_id]
preds = [as_list(i.preds)[0] for i in items]
# Shows only one hypothesis is being scored:
stats = metric.get_metric()._extract_corpus_statistics(preds, golds)
print(len(stats)) # 1 (should be 2)
score_wrong = metric.get_metric().corpus_score(preds, golds).score
print(score_wrong) # 100 despite 2nd hyp being wrong (0 for TER)

Expected behavior
Each hypothesis should be scored against its own reference set, and corpus statistics should include all hypotheses (len(stats) == len(hypotheses)).
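In code form (a hypothetical check; golds_fixed stands for correctly oriented references):

stats = metric.get_metric()._extract_corpus_statistics(preds, golds_fixed)
assert len(stats) == len(preds)  # every hypothesis contributes statistics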
Version info
- lighteval: 0.13.0
- Python: 3.13
- Dependencies: sacrebleu 2.5.1
Suspected root cause
- GenerativeCorpusMetricInput.golds is list[str] (per-sample refs) (src/lighteval/metrics/sample_preparator.py). compute_corpus() does golds = [i.golds for i in items], producing list[list[str]]. This type-checks, but the orientation is per-sample, not per-reference (src/lighteval/metrics/metrics_corpus.py).
- sacrebleu expects [ref_id][sent_id]: it builds per-segment refs via zip(*references), then pairs them with hypotheses using zip(hypotheses, ref_cache), silently truncating the hypothesis list to the number of reference segments (sacrebleu/metrics/base.py); see the sketch after this list.
- chrF++ picks the best ref among those provided (sacrebleu/metrics/chrf.py, _compute_segment_statistics, where best_f_score is updated per ref). The best match is usually the corresponding reference (e.g., ref1 for hyp1) but not necessarily, which can inflate scores. TER uses the same base machinery (sacrebleu/metrics/ter.py + sacrebleu/metrics/base.py).
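A minimal sketch of that pairing logic in plain Python (mirroring the zip calls described above, not sacrebleu's actual code):

# Two samples, one reference each, passed in the wrong (per-sample) orientation:
hypotheses = ["GOOD", "PRED2"]
references = [["GOOD"], ["REF2"]]         # [sent_id][ref_id]
# Transposing the supposed streams pools refs from different samples into one segment:
ref_cache = list(zip(*references))        # [('GOOD', 'REF2')] -- a single segment
# Pairing hypotheses with segments then silently truncates:
pairs = list(zip(hypotheses, ref_cache))  # [('GOOD', ('GOOD', 'REF2'))]
print(len(pairs))                         # 1 -- the second hypothesis is never scored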
Suggested fix
Transpose golds before calling sacrebleu so it matches [ref_id][sent_id]:
from itertools import zip_longest  # zip_longest accounts for a variable number of refs per sample
# inside compute_corpus(), before corpus_score(...)
golds = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]

We could also apply the same transpose for BLEU to keep multi-reference support instead of dropping to gold[0].
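Applied to the reproduction above, the transpose restores the expected orientation (a sketch with illustrative values):

from itertools import zip_longest

golds = [["GOOD"], ["REF2"]]  # [sent_id][ref_id], one ref per sample
golds_t = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]
print(golds_t)                # [['GOOD', 'REF2']] -- [ref_id][sent_id]

# With a variable number of refs per sample, None pads the shorter samples:
golds = [["r1a", "r1b"], ["r2a"]]
print([list(refs) for refs in zip_longest(*golds, fillvalue=None)])
# [['r1a', 'r2a'], ['r1b', None]]

With this shape, _extract_corpus_statistics should return one entry per hypothesis (len(stats) == 2 in the reproduction above).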
If I missed something or this is intended behavior, please let me know.