lm-eval v0.4.7 Release Notes

This release includes several bug fixes, minor improvements to model handling, and task additions.

⚠️ Python 3.8 End of Support Notice

Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.

Backwards Incompatibilities

Chat Template Delimiter Handling (in v0.4.6)

An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.

📝 For detailed documentation, please refer to docs/chat-template-readme.md

New Benchmarks & Tasks

Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

Score tasks by @rimashahbazyan in #2452
Filters bugfix; add metrics and filter to logged sample by @baberabb in #2517
skip casting if predict_only by @baberabb in #2524
make utility function to handle until by @baberabb in #2518
Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in #2514
add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
avoid timeout errors with high concurrency in api_model by @dtrawins in #2307
Update README.md by @baberabb in #2534
better doc_to_test testing by @baberabb in #2535
Support pipeline parallel with OpenVINO models by @sstrehlk in #2349
Super little tiny fix doc by @fzyzcjy in #2546
[API] left truncate for generate_until by @baberabb in #2554
Update Lightning import by @maanug-nv in #2549
add optimum-intel ipex model by @yao-matrix in #2566
add warning to readme by @baberabb in #2568
Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in #2558
batch loglikelihood_rolling across requests by @baberabb in #2559
fix DeprecationWarning: invalid escape sequence '\s' for whitespace filter by @baberabb in #2560
increment version to 4.6.7 by @baberabb in #2574

New Contributors

@rimashahbazyan made their first contribution in #2452
@naiarapm made their first contribution in #2531
@dtrawins made their first contribution in #2307
@sstrehlk made their first contribution in #2349
@fzyzcjy made their first contribution in #2546
@maanug-nv made their first contribution in #2549
@yao-matrix made their first contribution in #2566

Full Changelog: v0.4.6...v0.4.7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.4.7