lm-eval v0.4.7 Release Notes
This release includes several bug fixes, minor improvements to model handling, and task additions.
⚠️ Python 3.8 End of Support Notice
Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.
Backwards Incompatibilities
Chat Template Delimiter Handling (in v0.4.6)
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
- Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558
This release also includes several small fixes and changes to existing tasks, as reflected in their incremented version numbers.
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Score tasks by @rimashahbazyan in #2452
- Filters bugfix; add `metrics` and `filter` to logged sample by @baberabb in #2517
- skip casting if predict_only by @baberabb in #2524
- make utility function to handle `until` by @baberabb in #2518
- Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in #2514
- add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- avoid timeout errors with high concurrency in api_model by @dtrawins in #2307
- Update README.md by @baberabb in #2534
- better doc_to_test testing by @baberabb in #2535
- Support pipeline parallel with OpenVINO models by @sstrehlk in #2349
- Super little tiny fix doc by @fzyzcjy in #2546
- [API] left truncate for generate_until by @baberabb in #2554
- Update Lightning import by @maanug-nv in #2549
- add optimum-intel ipex model by @yao-matrix in #2566
- add warning to readme by @baberabb in #2568
- Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in #2558
- batch `loglikelihood_rolling` across requests by @baberabb in #2559
- fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter by @baberabb in #2560
- increment version to 4.6.7 by @baberabb in #2574
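For context on the whitespace filter fix listed above (#2560): in a regular (non-raw) Python string literal, `\s` is not a recognized escape sequence, so the parser emits a warning even though the regex still works. The usual remedy is a raw string literal. The sketch below illustrates the idea only; the pattern and the `strip_leading_whitespace` helper are hypothetical and are not taken from the harness's filter code.

```python
import re

# A raw string keeps the backslash as-is, so the regex engine (not the
# Python parser) interprets \s as "any whitespace character".
LEADING_WHITESPACE = re.compile(r"^\s+")


def strip_leading_whitespace(text: str) -> str:
    """Hypothetical helper: drop leading whitespace from a model response."""
    return LEADING_WHITESPACE.sub("", text)


print(strip_leading_whitespace("  \n answer"))  # prints: answer
```

Writing the same pattern as `"^\s+"` without the `r` prefix matches identically but triggers the invalid escape sequence warning at compile time.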
New Contributors
- @rimashahbazyan made their first contribution in #2452
- @naiarapm made their first contribution in #2531
- @dtrawins made their first contribution in #2307
- @sstrehlk made their first contribution in #2349
- @fzyzcjy made their first contribution in #2546
- @maanug-nv made their first contribution in #2549
- @yao-matrix made their first contribution in #2566
Full Changelog: v0.4.6...v0.4.7