lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now: support for concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.
  - The URL can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as `top_p`, `top_k`, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g., with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g., multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using `vllm serve`): `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with a chat API: `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm and then running under `local-completions` -- see the serving sketch below!
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See the Backwards Incompatibilities section below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallelism) inference using `--model hf` is now supported -- see the launch sketch just after this list. Thank you to @NathanHB and team!
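For the combined data- and model-parallel setup, the launch looks roughly like the following. This is a minimal sketch, not an official recipe: the model name, process count, task, and batch size are placeholders, and it assumes `accelerate` is installed and enough GPUs are available for each replica to be sharded.

```bash
# Launch 2 data-parallel replicas; within each replica, parallelize=True
# shards the model across the remaining visible GPUs via HF's device_map.
accelerate launch --multi_gpu --num_processes 2 -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,parallelize=True \
    --tasks lambada_openai \
    --batch_size 8
```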
This release also includes a number of miscellaneous bugfixes and improvements. Thank you to all contributors who helped out on this release!
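For the vllm serving workflow recommended above, the two-step flow looks roughly like this. A sketch only: the model, tensor-parallel degree, and `--model_args` values are illustrative, and it assumes vllm is installed and serving on its default port (8000).

```bash
# 1) serve the model behind an OpenAI-compatible completions endpoint
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8

# 2) in another shell, point lm-eval's local-completions model at the endpoint
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3.1-405B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface \
    --apply_chat_template \
    --tasks gsm8k
```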
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo and @IsraelAbebe in #2042
- MedConceptsQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
`tag`s versus `group`s, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow "parent" tasks like `mmlu` to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given group name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```
or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
```
These would both have the same effect of reporting an averaged metric for `group_name1` when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated these two use-cases ("shorthand" groupings and hierarchical subtask collections) into distinct `tag` and `group` properties!
To register a shorthand (now called a `tag`), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```
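The tag can then be passed to `--tasks` just like a task or group name. A minimal sketch (the model below is only a placeholder):

```bash
# Runs every task carrying `tag: tag_name1`; scores are reported per task,
# and no aggregate score is computed for the tag.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks tag_name1
```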
Group config files may remain as-is if aggregation is not desired. To opt in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
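With `aggregate_metric_list` set, invoking the group reports the aggregated metric alongside the per-subtask scores; without it, only subtask-level results are shown. A minimal sketch (placeholder model again):

```bash
# Reports an aggregated `acc` row for group_name1 (weighted by subtask size
# when weight_by_size is True) in addition to each subtask's own scores.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks group_name1
```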
Please see our documentation here for more information. We apologize for any headaches this migration may create -- however, we believe separating these two functionalities will make it less likely for users to encounter confusion or errors caused by unintended aggregation.
Future Plans
We're planning to make more planning documents public and to standardize on (most likely) one new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependency by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in #2100
- Fixed colon in Belebele _default_template_yaml by @jab13x in #2111
- Fix haerae task groups by @jungwhank in #2112
- fix: broken discord link in CONTRIBUTING.md by @nathan-weinberg in #2114
- docs: update truthfulqa tasks by @CandiedCode in #2119
- Hotfix `lm_eval.caching` module by @haileyschoelkopf in #2124
- Refactor API models by @baberabb in #2008
- bugfix and docs for API by @baberabb in #2139
- [Bugfix] add temperature=0 to logprobs and seed args to API models by @baberabb in #2149
- refactor: limit usage of `scipy` and `sklearn` dependencies by @nathan-weinberg in #2097
- Update lm-eval-overview.ipynb by @haileyschoelkopf in #2118
- fix typo. by @kargaranamir in #2169
- Incorrect URL by @zhabuye in #2125
- Dp and mp support by @NathanHB in #2056
- [hotfix] API: messages were created twice by @baberabb in #2174
- add okapi machine translated notice. by @kargaranamir in #2168
- IrokoBench: Fix incorrect group assignments by @haileyschoelkopf in #2181
- Mmlu Pro by @ysjprojects in #1961
- added gsm_plus by @ysjprojects in #2103
- Fix `revision` kwarg dtype in edge-cases by @haileyschoelkopf in #2184
- Small README tweaks by @haileyschoelkopf in #2186
- gsm_plus minor fix by @ysjprojects in #2191
- keep new line for task description by @jungwhank in #2116
- Update README.md by @ysjprojects in #2206
- Update citation in README.md by @antonpolishko in #2083
- New task: Lingoly by @am-bean in #2198
- Created a new task for gsm8k which corresponds to the Llama cot settings… by @Cameron7195 in #2215
- Lingoly README update by @am-bean in #2228
- Update yaml to adapt to belebele dataset changes by @Uminosachi in #2216
- Add TMLU Benchmark Dataset by @adamlin120 in #2093
- Update IFEval dataset to official one by @lewtun in #2218
- fix the leaderboard doc to reflect the tasks by @NathanHB in #2219
- Add multiple chat template by @KonradSzafer in #2129
- Update CODEOWNERS by @haileyschoelkopf in #2229
- Fix Zeno Visualizer by @namtranase in #2227
- mela by @Geralt-Targaryen in #1970
- fix the regex string in mmlu_pro template by @lxning in #2238
- Fix logging when resizing embedding layer in peft mode by @WPoelman in #2239
- fix mmlu_pro typo by @baberabb in #2241
- Fix typos in multiple places by @LSinev in #2244
- fix group args of mmlu and mmlu_pro by @eyuansu62 in #2245
- Created new task for testing Llama on Asdiv by @Cameron7195 in #2236
- chat template hotfix by @baberabb in #2250
- [Draft] More descriptive `simple_evaluate()` LM TypeError by @haileyschoelkopf in #2258
- Update NLTK version in `*ifeval` tasks (#2210) by @haileyschoelkopf in #2259
- Fix `loglikelihood_rolling` caching (#1821) by @haileyschoelkopf in #2187
- API: fix maxlen; vllm: prefix_token_id bug by @baberabb in #2262
- hotfix #2262 by @baberabb in #2264
- Chat Template fix (cont. #2235) by @baberabb in #2269
- Bump version to v0.4.4 ; Fixes to TMMLUplus by @haileyschoelkopf in #2280
New Contributors
- @ToluClassics made their first contribution in #2041
- @NathanHB made their first contribution in #2048
- @uni2237 made their first contribution in #2074
- @elronbandel made their first contribution in #1933
- @pankajarm made their first contribution in #2075
- @meg-huggingface made their first contribution in #2084
- @waneon made their first contribution in #2092
- @SkySuperCat made their first contribution in #2032
- @JessicaOjo made their first contribution in #2042
- @nathan-weinberg made their first contribution in #2098
- @Ofir408 made their first contribution in #2010
- @jab13x made their first contribution in #2111
- @jungwhank made their first contribution in #2112
- @CandiedCode made their first contribution in #2119
- @kargaranamir made their first contribution in #2169
- @ysjprojects made their first contribution in #1961
- @antonpolishko made their first contribution in #2083
- @am-bean made their first contribution in #2198
- @Cameron7195 made their first contribution in #2215
- @Uminosachi made their first contribution in #2216
- @adamlin120 made their first contribution in #2093
- @lewtun made their first contribution in #2218
- @namtranase made their first contribution in #2227
- @Geralt-Targaryen made their first contribution in #1970
- @lxning made their first contribution in #2238
- @WPoelman made their first contribution in #2239
- @eyuansu62 made their first contribution in #2245
Full Changelog: v0.4.3...v0.4.4