@bond005 bond005 commented Jan 16, 2026

  1. Incorrect signatures in examples/calculate_baseline_metrics.ipynb are fixed.
  2. The rag-bench package is refactored, and errors in it are corrected, including:
    • found_ids in the evaluate_rag_results function was processed incorrectly, because it was represented as a string instead of a list of ints;
    • question types were not taken into account when calculating metrics, although the evaluator's logic requires them;
    • and others.
  3. In addition to the Jupyter notebook, a dedicated Python script, examples/calculate_baseline_metrics.py, is provided to make it easier to run the baseline and obtain metrics.
  4. A dedicated readme.md for the rag-bench package is added.
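The found_ids fix in item 2 amounts to normalizing a value that may arrive as a serialized string into a proper list of ints before comparison. A minimal sketch of that kind of normalization, assuming a hypothetical helper (the actual rag-bench code may structure this differently):

```python
import ast


def parse_found_ids(raw):
    """Normalize a found_ids value to a list of ints.

    The value may already be a list of ints, or it may be a string
    such as "[3, 17, 42]" left over from an earlier serialization step.
    Hypothetical helper illustrating the bug class described in item 2.
    """
    if isinstance(raw, str):
        # Safely parse the string representation back into a Python list.
        raw = ast.literal_eval(raw)
    return [int(x) for x in raw]
```

Without such a conversion, comparing a string like `"[3, 17, 42]"` against integer document IDs silently yields zero matches, which is why the retrieval metrics were wrong before the fix.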

Items 1 and 2 are essential for correct use of the benchmark: without those bug fixes, it is impossible to use it to evaluate RAG pipelines. Items 3 and 4 simply improve the benchmark's usability.
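The second bullet of item 2 says metrics must be aggregated per question type, as the evaluator's logic expects. A minimal sketch of that aggregation, assuming a hypothetical result structure with "question_type" and "score" keys (the real rag-bench schema may differ):

```python
from collections import defaultdict


def metrics_by_question_type(results):
    """Average a per-example score separately for each question type.

    `results` is a list of dicts with "question_type" and "score" keys.
    Hypothetical structure illustrating item 2's point that metrics
    should be computed per question type rather than over all examples.
    """
    grouped = defaultdict(list)
    for item in results:
        grouped[item["question_type"]].append(item["score"])
    # One mean score per question type.
    return {qtype: sum(scores) / len(scores) for qtype, scores in grouped.items()}
```

Aggregating over all examples at once would let a dominant question type mask failures on rarer ones, which is the behavior the fix removes.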
