Release v0.4.0 · EleutherAI/lm-evaluation-harness

What's Changed

Replace stale triviaqa dataset link by @jon-tow in #364
Update actions/setup-python in CI workflows by @jon-tow in #365
Bump triviaqa version by @jon-tow in #366
Update lambada_openai multilingual data source by @jon-tow in #370
Update Pile Test/Val Download URLs by @fattorib in #373
Added ToxiGen task by @Thartvigsen in #377
Added CrowSPairs by @aflah02 in #379
Add accuracy metric to crows-pairs by @haileyschoelkopf in #380
hotfix(gpt2): Remove vocab-size logits slice by @jon-tow in #384
Enable "low_cpu_mem_usage" to reduce the memory usage of HF models by @sxjscience in #390
Upstream hf-causal and hf-seq2seq model implementations by @haileyschoelkopf in #381
Hosting arithmetic dataset on HuggingFace by @fattorib in #391
Hosting wikitext on HuggingFace by @fattorib in #396
Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in #403
Update README installation instructions by @haileyschoelkopf in #407
feat: evaluation using peft models with CLM by @zanussbaum in #414
Update setup.py dependencies by @ret2libc in #416
fix: add seq2seq peft by @zanussbaum in #418
Add support for load_in_8bit and trust_remote_code model params by @philwee in #422
Hotfix: patch issues with the huggingface.py model classes by @haileyschoelkopf in #427
Continuing work on refactor [WIP] by @haileyschoelkopf in #425
Document task name wildcard support in README by @haileyschoelkopf in #435
Add non-programmatic BIG-bench-hard tasks by @yurodiviy in #406
Updated handling for device in lm_eval/models/gpt2.py by @nikhilpinnaparaju in #447
[WIP, Refactor] Staging more changes by @haileyschoelkopf in #465
[Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in #467
Configurable-Tasks by @lintangsutawika in #438
single GPU automatic batching logic by @fattorib in #394
Fix bugs introduced in #394 #406 and max length bug by @juletx in #472
Sort task names to keep the same order always by @juletx in #474
Set PAD token to EOS token by @nikhilpinnaparaju in #448
[Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in #486
fix adaptive batch crash when there are no new requests by @jquesnelle in #490
Add multilingual datasets (XCOPA, XStoryCloze, XWinograd, PAWS-X, XNLI, MGSM) by @juletx in #426
Create output path directory if necessary by @janEbert in #483
Add results of various models in json and md format by @juletx in #477
Update config by @lintangsutawika in #501
P3 prompt task by @lintangsutawika in #493
Evaluation Against Portion of Benchmark Data by @kenhktsui in #480
Add option to dump prompts and completions to a JSON file by @juletx in #492
Add perplexity task on arbitrary JSON data by @janEbert in #481
Update config by @lintangsutawika in #520
Data Parallelism by @fattorib in #488
Fix mgpt fewshot by @lintangsutawika in #522
Extend dtype command line flag to HFLM by @haileyschoelkopf in #523
Add support for loading GPTQ models via AutoGPTQ by @gakada in #519
Change type signature of quantized and its default value for python < 3.11 compatibility by @passaglia in #532
Fix LLaMA tokenization issue by @gakada in #531
[Refactor] Make promptsource an extra / not required for installation by @haileyschoelkopf in #542
Move spaces from context to continuation by @gakada in #546
Use max_length in AutoSeq2SeqLM by @gakada in #551
Fix typo by @kwikiel in #557
Add load_in_4bit and fix peft loading by @gakada in #556
Update task_guide.md by @haileyschoelkopf in #564
[Refactor] Non-greedy generation ; WIP GSM8k yaml by @haileyschoelkopf in #559
Dataset metric log [WIP] by @lintangsutawika in #560
Add Anthropic support by @zphang in #562
Add MultipleChoiceExactTask by @gakada in #537
Revert "Add MultipleChoiceExactTask" by @StellaAthena in #568
[Refactor] [WIP] New YAML advanced docs by @haileyschoelkopf in #567
Remove the registration of "GPT2" as a model type by @StellaAthena in #574
[Refactor] Docs update by @haileyschoelkopf in #577
Better docs by @lintangsutawika in #576
Update evaluator.py cache_db argument str if model is not str by @poedator in #575
Add --max_batch_size and --batch_size auto:N by @gakada in #572
[Refactor] ALL_TASKS now maintained (not static) by @haileyschoelkopf in #581
Fix seqlen issues for bloom, remove extraneous OPT tokenizer check by @haileyschoelkopf in #582
Fix non-callable attributes in CachingLM by @gakada in #584
Add error handling for calling .to(device) by @haileyschoelkopf in #585
fixes some minor issues on tasks. by @lintangsutawika in #580
Add - 4bit-related args by @SONG-WONHO in #579
Fix triviaqa task by @seopbo in #525
[Refactor] Addressing Feedback on new docs pages by @haileyschoelkopf in #578
Logging Samples by @farzanehnakhaee70 in #563
Merge master into big-refactor by @gakada in #590
[Refactor] Package YAMLs alongside pip installations of lm-eval by @haileyschoelkopf in #596
fixes for multiple_choice by @lintangsutawika in #598
add openbookqa config by @farzanehnakhaee70 in #600
[Refactor] Model guide docs by @haileyschoelkopf in #606
[Refactor] More MCQA fixes by @haileyschoelkopf in #599
[Refactor] Hellaswag by @nopperl in #608
[Refactor] Seq2Seq Models with Multi-Device Support by @fattorib in #565
[Refactor] CachingLM support via --use_cache by @haileyschoelkopf in #619
[Refactor] batch generation better for hf model ; deprecate hf-causal in new release by @haileyschoelkopf in #613
[Refactor] Update task statuses on tracking list by @haileyschoelkopf in #629
[Refactor] device_map options for hf model type by @haileyschoelkopf in #625
[Refactor] Misc. cleanup of dead code by @haileyschoelkopf in #609
[Refactor] Log request arguments to per-sample json by @haileyschoelkopf in #624
[Refactor] HellaSwag YAML fix by @nopperl in #639
[Refactor] Add caveats to parallelize=True docs by @haileyschoelkopf in #638
fixed super_glue and removed unused yaml config by @lintangsutawika in #645
[Refactor] Fix sample logging by @haileyschoelkopf in #646
Add PEFT, quantization, remote code, LLaMA fix by @gakada in #644
[Refactor] Handle cuda:0 device assignment by @haileyschoelkopf in #647
[refactor] Add prost config by @farzanehnakhaee70 in #640
[Refactor] Misc. bugfixes ; edgecase quantized models by @haileyschoelkopf in #648
Update __init__.py by @lintangsutawika in #650
[Refactor] Add Lambada Multilingual by @haileyschoelkopf in #658
[Refactor] Add: SWAG,RACE,Arithmetic,Winogrande,PubmedQA by @fattorib in #627
[refactor] Add qa4mre config by @farzanehnakhaee70 in #651
Update generation_kwargs by @lintangsutawika in #657
[Refactor] Move race dataset on HF to EleutherAI group by @fattorib in #661
[Refactor] Add Headqa by @haileyschoelkopf in #659
[Refactor] Add Unscramble ; Toxigen ; Hendrycks_Ethics ; MathQA by @haileyschoelkopf in #660
[Refactor] Port TruthfulQA (mc1 only) by @nopperl in #666
[Refactor] Miscellaneous fixes by @haileyschoelkopf in #676
[Refactor] Patch to revamp-process by @haileyschoelkopf in #678
Revamp process by @lintangsutawika in #671
[Refactor] Fix padding ranks by @haileyschoelkopf in #679
[Refactor] minor edits by @baberabb in #680
[Refactor] Migrate ANLI tasks to yaml by @yeoedward in #682
edited output_path and added help to args by @baberabb in #684
[Refactor] Minor changes by @haileyschoelkopf in #685
[Refactor] typo by @baberabb in #687
[Test] fix test_evaluator.py by @baberabb in #675
Fix dummy model not invoking super class constructor by @yeoedward in #688
[Refactor] Migrate webqs task to yaml by @yeoedward in #689
[Refactor] Fix tests by @baberabb in #693
[Refactor] Migrate xwinograd tasks to yaml by @yeoedward in #695
Early stop bug of greedy_until (primary_until should be a list of str) by @ZZR0 in #700
Remove condition to check for winograd_schema by @lintangsutawika in #690
[Refactor] Use console script by @lintangsutawika in #703
[Refactor] Fixes for when using num_fewshot by @lintangsutawika in #702
[Refactor] Updated anthropic to new API by @baberabb in #710
[Refactor] Cleanup for big-refactor by @haileyschoelkopf in #686
Update README.md by @lintangsutawika in #720
[Refactor] Benchmark scripts by @lintangsutawika in #612
[Refactor] Fix Max Length arg by @lintangsutawika in #723
Add note about MPS by @StellaAthena in #728
Update huggingface.py by @lintangsutawika in #730
Update README.md by @StellaAthena in #732
[Refactor] Port over Autobatching by @fattorib in #673
[Refactor] Fix Anthropic Import and other fixes by @lintangsutawika in #724
[Refactor] Remove Unused Variable in Make-Table by @lintangsutawika in #734
[Refactor] logiqav2 by @baberabb in #711
[Refactor] Fix task packaging by @yeoedward in #739
[Refactor] fixed openai by @baberabb in #736
[Refactor] added some typehints by @baberabb in #742
[Refactor] Port Babi task by @haileyschoelkopf in #752
[Refactor] CrowS-Pairs by @haileyschoelkopf in #751
Update README.md by @haileyschoelkopf in #745
[Refactor] add xcopa by @lintangsutawika in #749
Update README.md by @lintangsutawika in #764
[Refactor] Add Blimp by @lintangsutawika in #763
[Refactor] Use evaluation mode for accelerate to prevent OOM by @tju01 in #770
Patch Blimp by @lintangsutawika in #768
[Refactor] Speedup hellaswag context building by @haileyschoelkopf in #774
[Refactor] Patch crowspairs higher_is_better by @haileyschoelkopf in #766
[Refactor] XNLI by @lintangsutawika in #776
[Refactor] Update Benchmark by @lintangsutawika in #777
[WIP] Update API docs in README by @haileyschoelkopf in #747
[Refactor] Real Toxicity Prompts by @aflah02 in #725
[Refactor] XStoryCloze by @lintangsutawika in #759
[Refactor] Glue by @lintangsutawika in #761
[Refactor] Add triviaqa by @lintangsutawika in #758
[Refactor] Paws-X by @lintangsutawika in #779
[Refactor] MC Taco by @lintangsutawika in #783
[Refactor] Truthfulqa by @lintangsutawika in #782
[Refactor] fix doc_to_target processing by @lintangsutawika in #786
[Refactor] Add README.md by @lintangsutawika in #757
[Refactor] Don't always require Perspective API key to run by @haileyschoelkopf in #788
[Refactor] Added HF model test by @baberabb in #791
[Big refactor] HF test fixup by @baberabb in #793
[Refactor] Process Whitespace for greedy_until by @lintangsutawika in #781
[Refactor] Fix metrics in Greedy Until by @lintangsutawika in #780
Update README.md by @Wehzie in #803
Merge Fix metrics branch by @uSaiPrashanth in #802
[Refactor] Update docs by @lintangsutawika in #744
[Refactor] Superglue T5 Parity by @lintangsutawika in #769
Update main.py by @lintangsutawika in #817
[Refactor] Coqa by @lintangsutawika in #820
[Refactor] drop by @lintangsutawika in #821
[Refactor] Asdiv by @lintangsutawika in #813
[Refactor] Fix IndexError by @lintangsutawika in #819
[Refactor] toxicity: API inside function by @baberabb in #822
[Refactor] wsc273 by @lintangsutawika in #807
[Refactor] Bump min accelerate version and update documentation by @fattorib in #812
Add mypy baseline config by @ethanhs in #809
[Refactor] Fix wikitext task by @haileyschoelkopf in #833
[Refactor] Add WMT tasks by @haileyschoelkopf in #775
[Refactor] consolidated tasks tests by @baberabb in #831
Update README.md by @lintangsutawika in #838
[Refactor] mgsm by @lintangsutawika in #784
[Refactor] Add top-level import by @haileyschoelkopf in #830
Add pyproject.toml by @ethanhs in #810
[Refactor] Additions to docs by @haileyschoelkopf in #799
[Refactor] Fix MGSM by @lintangsutawika in #845
[Refactor] float16 MPS works in torch nightly by @baberabb in #853
[Refactor] Update benchmark by @lintangsutawika in #850
Switch to pyproject.toml based project metadata by @ethanhs in #854
Use Dict to make the code python 3.8 compatible by @chrisociepa in #857
[Refactor] NQopen by @baberabb in #859
[Refactor] NQ-open by @haileyschoelkopf in #798
Fix "local variable 'docs' referenced before assignment" error in write_out.py by @chrisociepa in #856
[Refactor] 3.8 test compatibility by @baberabb in #863
[Refactor] Cleanup dependencies by @haileyschoelkopf in #860
[Refactor] Qasper, MuTual, MGSM (Native CoT) by @lintangsutawika in #840
undefined type and output_type when using promptsource fixed by @Hojjat-Mokhtarabadi in #842
[Refactor] Deactivate select GH Actions by @haileyschoelkopf in #871
[Refactor] squadv2 by @lintangsutawika in #785
[Refactor] Set python3.8 as allowed version by @haileyschoelkopf in #862
Fix positional arguments in HF model generate by @chrisociepa in #877
[Refactor] MATH by @baberabb in #861
Create cot_yaml by @lintangsutawika in #870
[Refactor] Port CSATQA to refactor by @haileyschoelkopf in #865
[Refactor] CMMLU, C-Eval port ; Add fewshot config by @haileyschoelkopf in #864
[Refactor] README.md for Asdiv by @lintangsutawika in #878
[Refactor] Hotfixes to big-refactor by @haileyschoelkopf in #880
Change Python Version to 3.8 in .pre-commit-config.yaml and GitHub Actions by @chrisociepa in #895
[Refactor] Fix PubMedQA by @tmabraham in #890
[Refactor] Fix error when calling lm-eval by @lintangsutawika in #899
[Refactor] bigbench by @lintangsutawika in #852
[Refactor] Fix wildcards by @haileyschoelkopf in #900
Add transformation filters by @chrisociepa in #883
[Refactor] Flan benchmark by @lintangsutawika in #816
[Refactor] WIP: Add MMLU by @haileyschoelkopf in #753
Added notable contributors to the citation block by @StellaAthena in #907
[Refactor] Improve error logging by @baberabb in #908
[Refactor] Add _batch_scheduler in greedy_until by @AndyWolfZwei in #912
add belebele by @ManuelFay in #885
Update README.md by @StellaAthena in #917
[Refactor] Precommit formatting for Belebele by @lintangsutawika in #926
[Refactor] change all mentions of greedy_until to generate_until by @lintangsutawika in #927
[Refactor] Squadv2 updates by @lintangsutawika in #923
[Refactor] Verbose by @lintangsutawika in #910
[Refactor] Fix Unit Tests by @haileyschoelkopf in #905
Fix generate_until rename by @haileyschoelkopf in #929
[Refactor] Generate_until rename by @haileyschoelkopf in #931
Fix "'tqdm' object is not subscriptable" error in huggingface.py when batch size is auto by @jasonkrone in #916
[Refactor] Fix Default Metric Call by @lintangsutawika in #935
Big refactor write out adaption by @MicPie in #937
Update pyproject.toml by @lintangsutawika in #915
[Refactor] Fix whitespace warning by @haileyschoelkopf in #949
[Refactor] Update documentation by @haileyschoelkopf in #954
[Refactor] fix two bugs when run with qasper_bool and toxigen by @AndyWolfZwei in #934
[Refactor] Describe local dataset usage in docs by @haileyschoelkopf in #956
[Refactor] Update README, documentation by @haileyschoelkopf in #955
[Refactor] Don't load MMLU auxiliary_train set by @haileyschoelkopf in #953
[Refactor] Patch for Generation Until by @lintangsutawika in #957
[Refactor] Model written eval by @lintangsutawika in #815
[Refactor] Bugfix: AttributeError: 'Namespace' object has no attribute 'verbose' by @haileyschoelkopf in #966
[Refactor] Mmlu subgroups and weight avg by @lintangsutawika in #922
[Refactor] Remove deprecated gold_alias task YAML option by @haileyschoelkopf in #965
[Refactor] Logging fixes by @haileyschoelkopf in #952
[Refactor] fixes for alternative MMLU tasks. by @lintangsutawika in #981
[Refactor] Alias fix by @lintangsutawika in #987
[Refactor] Minor cleanup on base Task subclasses by @haileyschoelkopf in #996
[Refactor] add squad from master by @lintangsutawika in #971
[Refactor] Squad misc by @lintangsutawika in #999
[Refactor] Fix CI tests by @haileyschoelkopf in #997
[Refactor] will check if group_name is None by @lintangsutawika in #1001
[Refactor] Bugfixes by @haileyschoelkopf in #1002
[Refactor] Verbosity rework by @lintangsutawika in #958
add description on task/group alias by @lintangsutawika in #979
[Refactor] Upstream ggml from big-refactor branch by @haileyschoelkopf in #967
[Refactor] Improve Handling of Stop-Sequences for HF Batched Generation by @haileyschoelkopf in #1009
[Refactor] Update README by @baberabb in #1020
[Refactor] Remove examples/ folder by @haileyschoelkopf in #1018
[Refactor] vllm support by @baberabb in #1011
Allow Generation arguments on greedy_until reqs by @uSaiPrashanth in #897
Social iqa by @StellaAthena in #1030
[Refactor] BBH fixup by @haileyschoelkopf in #1029
Rename bigbench.yml to default.yml by @StellaAthena in #1032
[Refactor] Num_fewshot process by @lintangsutawika in #985
[Refactor] Use correct HF model type for MBart-like models by @haileyschoelkopf in #1024
[Refactor] Urgent fix by @lintangsutawika in #1033
[Refactor] Versioning by @lintangsutawika in #1031
fixes for sampler by @baberabb in #1038
[Refactor] Update README.md by @lintangsutawika in #1046
[refactor] mps requirement by @baberabb in #1037
[Refactor] Additions to example notebook by @haileyschoelkopf in #1048
Miscellaneous documentation updates by @StellaAthena in #1047
[Refactor] add notebook for overview by @lintangsutawika in #1025
Update README.md by @StellaAthena in #1049
[Refactor] Openai completions by @lintangsutawika in #1008
[Refactor] Added support for OpenAI ChatCompletions by @DaveOkpare in #839
[Refactor] Update docs ToC by @haileyschoelkopf in #1051
[Refactor] Fix fewshot cot mmlu descriptions by @lintangsutawika in #1060

New Contributors

@fattorib made their first contribution in #373
@Thartvigsen made their first contribution in #377
@aflah02 made their first contribution in #379
@sxjscience made their first contribution in #390
@Jeffwan made their first contribution in #403
@zanussbaum made their first contribution in #414
@ret2libc made their first contribution in #416
@philwee made their first contribution in #422
@yurodiviy made their first contribution in #406
@nikhilpinnaparaju made their first contribution in #447
@lintangsutawika made their first contribution in #438
@juletx made their first contribution in #472
@janEbert made their first contribution in #483
@kenhktsui made their first contribution in #480
@passaglia made their first contribution in #532
@kwikiel made their first contribution in #557
@poedator made their first contribution in #575
@SONG-WONHO made their first contribution in #579
@seopbo made their first contribution in #525
@farzanehnakhaee70 made their first contribution in #563
@nopperl made their first contribution in #608
@yeoedward made their first contribution in #682
@ZZR0 made their first contribution in #700
@tju01 made their first contribution in #770
@Wehzie made their first contribution in #803
@uSaiPrashanth made their first contribution in #802
@ethanhs made their first contribution in #809
@chrisociepa made their first contribution in #857
@Hojjat-Mokhtarabadi made their first contribution in #842
@AndyWolfZwei made their first contribution in #912
@ManuelFay made their first contribution in #885
@jasonkrone made their first contribution in #916
@MicPie made their first contribution in #937
@DaveOkpare made their first contribution in #839

Full Changelog: v0.3.0...v0.4.0
