[BUG] chrF++/chrF/TER metrics receive references in wrong format, causing incorrect corpus-level scoring #1112

@dyurchenko98

Description

Describe the bug

CorpusLevelTranslationMetric passes references to sacrebleu in [sent_id][ref_id] shape (a per-sample list of references), but sacrebleu expects [ref_id][sent_id] (a list of reference streams). As a result, chrF/chrF++/TER score only the first k hypotheses, where k is the (minimum) number of references per sample, and each of those hypotheses is compared against a pooled set of references drawn from across the dataset instead of its own. BLEU is special-cased but still drops every reference after the first.
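
For concreteness, here is a minimal sketch of the two orientations for two samples with two references each (illustrative strings only):

# Per-sample orientation that lighteval currently builds: golds[sent_id][ref_id]
golds_per_sample = [
    ["ref A1", "ref A2"],  # references for hypothesis 1
    ["ref B1", "ref B2"],  # references for hypothesis 2
]

# Reference-stream orientation that sacrebleu expects: refs[ref_id][sent_id]
refs_per_stream = [
    ["ref A1", "ref B1"],  # first reference of every hypothesis
    ["ref A2", "ref B2"],  # second reference of every hypothesis
]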

To Reproduce

Minimal example showing the shape issue (mirrors how lighteval currently passes refs):

from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list

items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]

metric = CorpusLevelTranslationMetric("chrf++")

# Mirrors compute_corpus(): each i.golds is Sequence[str], so this produces
# Sequence[Sequence[str]] in per-sample orientation.
golds = [i.golds for i in items]  # [sent_id][ref_id]
preds = [as_list(i.preds)[0] for i in items]

# Shows only one hypothesis is being scored:
stats = metric.get_metric()._extract_corpus_statistics(preds, golds)
print(len(stats))  # 1 (should be 2)

score_wrong = metric.get_metric().corpus_score(preds, golds).score
print(score_wrong)  # 100.0 even though the 2nd hypothesis is wrong (TER likewise reports a perfect 0.0)

Expected behavior

Each hypothesis should be scored against its own reference set, and corpus statistics should include all hypotheses (len(stats) == len(hypotheses)).
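
As a sanity check (a sketch only, not the proposed fix), scoring each hypothesis against its own references with sacrebleu's sentence_score shows the intended per-sample pairing:

from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2)  # chrF++
hyps = ["GOOD", "PRED2"]
refs_per_sample = [["GOOD"], ["REF2"]]

for hyp, refs in zip(hyps, refs_per_sample):
    print(chrf.sentence_score(hyp, refs).score)  # 100.0 for the first pair, low for the second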

Version info

  • lighteval: 0.13.0
  • Python: 3.13
  • Dependencies: sacrebleu 2.5.1

Suspected root cause

  • GenerativeCorpusMetricInput.golds is list[str] (per-sample refs) (src/lighteval/metrics/sample_preparator.py).
  • compute_corpus() does golds = [i.golds for i in items], producing list[list[str]]. The type checks out, but the orientation is per-sample, not per-reference-stream (src/lighteval/metrics/metrics_corpus.py).
  • sacrebleu expects [ref_id][sent_id]: it builds per-segment reference tuples via zip(*references) and then pairs them with hypotheses via zip(hypotheses, ref_cache), silently truncating to the shorter of the two (sacrebleu/metrics/base.py); see the sketch after this list.
  • chrF++ picks the best-scoring reference among those provided for each segment (sacrebleu/metrics/chrf.py, _compute_segment_statistics, where best_f_score is updated per reference). With the wrong orientation, the best match is usually the hypothesis's own reference (e.g., ref1 for hyp1) but not necessarily, which can inflate scores. TER goes through the same base machinery (sacrebleu/metrics/ter.py + sacrebleu/metrics/base.py).
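
A plain-Python sketch of the effective pairing that the wrong orientation produces (not sacrebleu's actual code, just the resulting behavior, using the data from the reproduction above):

hyps  = ["GOOD", "PRED2"]
golds = [["GOOD"], ["REF2"]]       # [sent_id][ref_id], as currently passed

# sacrebleu treats each inner list as a reference stream and zips them:
segments = list(zip(*golds))       # [("GOOD", "REF2")] -> a single pooled "segment"

# it then pairs hypotheses with segments, truncating to the shorter sequence:
pairs = list(zip(hyps, segments))  # [("GOOD", ("GOOD", "REF2"))]
print(pairs)                       # hyp 2 is silently dropped; hyp 1 sees pooled refs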

Suggested fix

Transpose golds before calling sacrebleu so it matches [ref_id][sent_id]:

from itertools import zip_longest  # zip_longest handles a variable number of refs per sample

# inside compute_corpus(), before corpus_score(...)

golds = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]

We can also consider applying the same transpose for BLEU to keep multi-reference support instead of dropping to gold[0].
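
Applying the transpose to the reproduction above (a sketch continuing that snippet; exact numbers depend on the metric) yields per-hypothesis statistics:

golds_t = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]  # [ref_id][sent_id]

stats = metric.get_metric()._extract_corpus_statistics(preds, golds_t)
print(len(stats))  # 2 -- one entry per hypothesis

score_fixed = metric.get_metric().corpus_score(preds, golds_t).score
print(score_fixed)  # no longer 100: the wrong 2nd hypothesis now lowers the corpus score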

If I missed something or this is intended behavior, please let me know.
