lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now: support for concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support both more generalizable to new providers and should dramatically speed up API model inference.
  - The URL can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as `top_p`, `top_k`, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g., with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g., multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using `vllm serve`): `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with a chat API: `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm and then running under `local-completions` -- see the serving sketch below!
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See the Backwards Incompatibilities section below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallelism) inference using `--model hf` is now supported -- see the launch sketch just after this list. Thank you to @NathanHB and team!
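For the combined data- and model-parallel setup, the launch looks roughly like the following. This is a minimal sketch, not an official recipe: the model name, process count, task, and batch size are placeholders, and it assumes `accelerate` is installed and enough GPUs are available for each replica to be sharded.

```bash
# Launch 2 data-parallel replicas; within each replica, parallelize=True
# shards the model across the remaining visible GPUs via HF's device_map.
accelerate launch --multi_gpu --num_processes 2 -m lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,parallelize=True \
    --tasks lambada_openai \
    --batch_size 8
```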
This release also includes a number of miscellaneous bugfixes and improvements. Thank you to all contributors who helped out on this release!
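For the vllm serving workflow recommended above, the two-step flow looks roughly like this. A sketch only: the model, tensor-parallel degree, and `--model_args` values are illustrative, and it assumes vllm is installed and serving on its default port (8000).

```bash
# 1) serve the model behind an OpenAI-compatible completions endpoint
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8

# 2) in another shell, point lm-eval's local-completions model at the endpoint
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3.1-405B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface \
    --apply_chat_template \
    --tasks gsm8k
```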
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo and @IsraelAbebe in #2042
- MedConceptsQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
`tag`s versus `group`s, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow "parent" tasks like `mmlu` to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given group name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```
or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
```
These would both have the same effect of reporting an averaged metric for `group_name1` when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated these two use-cases ("shorthand" groupings and hierarchical subtask collections) into distinct `tag` and `group` properties!
To register a shorthand (now called a `tag`), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```
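The tag can then be passed to `--tasks` just like a task or group name. A minimal sketch (the model below is only a placeholder):

```bash
# Runs every task carrying `tag: tag_name1`; scores are reported per task,
# and no aggregate score is computed for the tag.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks tag_name1
```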
Group config files may remain as-is if aggregation is not desired. To opt in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
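With `aggregate_metric_list` set, invoking the group reports the aggregated metric alongside the per-subtask scores; without it, only subtask-level results are shown. A minimal sketch (placeholder model again):

```bash
# Reports an aggregated `acc` row for group_name1 (weighted by subtask size
# when weight_by_size is True) in addition to each subtask's own scores.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks group_name1
```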
Please see our documentation here for more information. We apologize for any headaches this migration may create -- however, we believe separating these two functionalities will make it less likely for users to encounter confusion or errors caused by unintended aggregation.
Future Plans
We're planning to make more planning documents public and to standardize on (most likely) one new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependency by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in #2100
- Fixed colon in Belebele _default_template_yaml by @jab13x in #2111
- Fix haerae task groups by @jungwhank in #2112
- fix: broken discord link in CONTRIBUTING.md by @nathan-weinberg in #2114
- docs: update truthfulqa tasks by @CandiedCode in #2119
- Hotfix `lm_eval.caching` module by @haileyschoelkopf in #2124
- Refactor API models by @baberabb in #2008
- bugfix and docs for API by @baberabb in #2139
- [Bugfix] add temperature=0 to logprobs and seed args to API models by @baberabb in #2149
- refactor: limit usage of `scipy` and `sklearn` dependencies by @nathan-weinberg in #2097
- Update lm-eval-overview.ipynb by @haileyschoelkopf in #2118
- fix typo. by @kargaranamir in #2169
- Incorrect URL by @zhabuye in #2125
- Dp and mp support by @NathanHB in #2056
- [hotfix] API: messages were created twice by @baberabb in #2174
- add okapi machine translated notice. by @kargaranamir in #2168
- IrokoBench: Fix incorrect group assignments by @haileyschoelkopf in #2181
- Mmlu Pro by @ysjprojects in #1961
- added gsm_plus by @ysjprojects in #2103
- Fix `revision` kwarg dtype in edge-cases by @haileyschoelkopf in #2184
- Small README tweaks by @haileyschoelkopf in #2186
- gsm_plus minor fix by @ysjprojects in #2191
- keep new line for task description by @jungwhank in #2116
- Update README.md by @ysjprojects in #2206
- Update citation in README.md by @antonpolishko in #2083
- New task: Lingoly by @am-bean in #2198
- Created a new task for gsm8k which corresponds to the Llama cot settings… by @Cameron7195 in #2215
- Lingoly README update by @am-bean in #2228
- Update yaml to adapt to belebele dataset changes by @Uminosachi in #2216
- Add TMLU Benchmark Dataset by @adamlin120 in #2093
- Update IFEval dataset to official one by @lewtun in #2218
- fix the leaderboard doc to reflect the tasks by @NathanHB in #2219
- Add multiple chat template by @KonradSzafer in #2129
- Update CODEOWNERS by @haileyschoelkopf in #2229
- Fix Zeno Visualizer by @namtranase in #2227
- mela by @Geralt-Targaryen in #1970
- fix the regex string in mmlu_pro template by @lxning in #2238
- Fix logging when resizing embedding layer in peft mode by @WPoelman in #2239
- fix mmlu_pro typo by @baberabb in #2241
- Fix typos in multiple places by @LSinev in #2244
- fix group args of mmlu and mmlu_pro by @eyuansu62 in #2245
- Created new task for testing Llama on Asdiv by @Cameron7195 in #2236
- chat template hotfix by @baberabb in #2250
- [Draft] More descriptive `simple_evaluate()` LM TypeError by @haileyschoelkopf in #2258
- Update NLTK version in `*ifeval` tasks (#2210) by @haileyschoelkopf in #2259
- Fix `loglikelihood_rolling` caching (#1821) by @haileyschoelkopf in #2187
- API: fix maxlen; vllm: prefix_token_id bug by @baberabb in #2262
- hotfix #2262 by @baberabb in #2264
- Chat Template fix (cont. #2235) by @baberabb in #2269
- Bump version to v0.4.4 ; Fixes to TMMLUplus by @haileyschoelkopf in #2280
New Contributors
- @ToluClassics made their first contribution in #2041
- @NathanHB made their first contribution in #2048
- @uni2237 made their first contribution in #2074
- @elronbandel made their first contribution in #1933
- @pankajarm made their first contribution in #2075
- @meg-huggingface made their first contribution in #2084
- @waneon made their first contribution in #2092
- @SkySuperCat made their first contribution in #2032
- @JessicaOjo made their first contribution in #2042
- @nathan-weinberg made their first contribution in #2098
- @Ofir408 made their first contribution in #2010
- @jab13x made their first contribution in #2111
- @jungwhank made their first contribution in #2112
- @CandiedCode made their first contribution in #2119
- @kargaranamir made their first contribution in #2169
- @ysjprojects made their first contribution in #1961
- @antonpolishko made their first contribution in #2083
- @am-bean made their first contribution in #2198
- @Cameron7195 made their first contribution in #2215
- @Uminosachi made their first contribution in #2216
- @adamlin120 made their first contribution in #2093
- @lewtun made their first contribution in #2218
- @namtranase made their first contribution in #2227
- @Geralt-Targaryen made their first contribution in #1970
- @lxning made their first contribution in #2238
- @WPoelman made their first contribution in #2239
- @eyuansu62 made their first contribution in #2245
Full Changelog: v0.4.3...v0.4.4