From 76a9f4e0e60abb994afa00f5757d97c062ab38fd Mon Sep 17 00:00:00 2001 From: Andrei Alexandru Date: Tue, 19 Mar 2024 13:53:10 +0000 Subject: [PATCH] Add skill acquisition eval (#1497) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit # Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name Skill acquisition ### Eval description This eval tests models' ability to learn a skill with minimal human involvement. In the initial release, models are evaluated on questions related to the [Miskito language](https://en.wikipedia.org/wiki/Miskito_language). Some samples are translation and others are language manipulation exercises. ### What makes this a useful eval? - ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) 
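As a rough orientation, a registry entry of the shape this eval uses could look like the sketch below. This is an illustrative sketch only: the argument names follow the `SkillAcquisition` constructor and the dummy spec in `test_skill_acquisition.py`, and the variant id mirrors the naming in `run_experiments.sh`; the actual `evals/registry/evals/skill_acquisition.yaml` added by this PR may use different ids, metrics and defaults.

```yaml
# Illustrative sketch only; the real registry file added in this PR is
# evals/registry/evals/skill_acquisition.yaml (107 lines) and may differ.
skill_acquisition.miskito:
  id: skill_acquisition.miskito.zero_shot.full
  metrics: [delta_accuracy]

skill_acquisition.miskito.zero_shot.full:
  class: evals.elsuite.skill_acquisition.eval:SkillAcquisition
  args:
    samples_jsonl: skill_acquisition/miskito/variants/miskito_test_all.jsonl
    target_language: miskito
    knowledge_base_directory: skill_acquisition/miskito/knowledge_base/
    max_replies: 50
```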
## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [x] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `mypy`, `black`, `isort`, `autoflake` and `ruff` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON

### Eval

```jsonl
INSERT_EVAL_HERE
```
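For orientation, the samples produced by `scrape_miskito.py` (included below in this PR) follow the standard chat-format schema sketched here. The bracketed strings are placeholders rather than actual dataset rows; "Translate to English:" is the instruction the scraper substitutes for the course's "What do these mean?" prompts.

```jsonl
{"input": [{"role": "user", "content": "Translate to English: <Miskito sentence>"}], "ideal": "<English translation>"}
{"input": [{"role": "user", "content": "<manipulation exercise instructions>: <Miskito text>"}], "ideal": "<expected answer>"}
```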
--- evals/elsuite/skill_acquisition/eval.py | 428 +++ evals/elsuite/skill_acquisition/readme.md | 64 + .../scraping/human_rights.html | 2839 +++++++++++++++++ .../scraping/scrape_distractor_articles.py | 96 + .../scraping/scrape_miskito.py | 135 + .../skill_acquisition/scripts/make_plots.py | 204 ++ .../scripts/run_experiments.sh | 76 + evals/elsuite/skill_acquisition/solvers.py | 22 + .../skill_acquisition/task_description.py | 3 + .../test_skill_acquisition.py | 118 + evals/elsuite/skill_acquisition/utils.py | 179 ++ .../miskito/knowledge_base/honduras.jsonl | 3 + .../knowledge_base/human_rights_miskito.jsonl | 3 + .../knowledge_base/miskito_language.jsonl | 3 + .../knowledge_base/miskito_lessons.jsonl | 3 + .../knowledge_base/miskito_people.jsonl | 3 + .../miskito/knowledge_base/mosquito.jsonl | 3 + .../knowledge_base/mosquito_coast.jsonl | 3 + .../miskito/knowledge_base/nicaragua.jsonl | 3 + .../miskito/qa_pairs_by_lesson.jsonl | 3 + .../miskito/variants/miskito_test_all.jsonl | 3 + .../variants/miskito_test_all_fewshot.jsonl | 3 + .../variants/miskito_test_manipulation.jsonl | 3 + .../miskito_test_manipulation_fewshot.jsonl | 3 + .../variants/miskito_test_translation.jsonl | 3 + .../miskito_test_translation_fewshot.jsonl | 3 + .../miskito/variants/miskito_train_all.jsonl | 3 + .../variants/miskito_train_manipulation.jsonl | 3 + .../variants/miskito_train_translation.jsonl | 3 + evals/registry/evals/skill_acquisition.yaml | 107 + evals/registry/solvers/skill_acquisition.yaml | 287 ++ 31 files changed, 4612 insertions(+) create mode 100644 evals/elsuite/skill_acquisition/eval.py create mode 100644 evals/elsuite/skill_acquisition/readme.md create mode 100644 evals/elsuite/skill_acquisition/scraping/human_rights.html create mode 100644 evals/elsuite/skill_acquisition/scraping/scrape_distractor_articles.py create mode 100644 evals/elsuite/skill_acquisition/scraping/scrape_miskito.py create mode 100644 evals/elsuite/skill_acquisition/scripts/make_plots.py create mode 100755 evals/elsuite/skill_acquisition/scripts/run_experiments.sh create mode 100644 evals/elsuite/skill_acquisition/solvers.py create mode 100644 evals/elsuite/skill_acquisition/task_description.py create mode 100644 evals/elsuite/skill_acquisition/test_skill_acquisition.py create mode 100644 evals/elsuite/skill_acquisition/utils.py create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/honduras.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/human_rights_miskito.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_language.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_lessons.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_people.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito_coast.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/knowledge_base/nicaragua.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/qa_pairs_by_lesson.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl create mode 100644 
evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl create mode 100644 evals/registry/data/skill_acquisition/miskito/variants/miskito_train_translation.jsonl create mode 100644 evals/registry/evals/skill_acquisition.yaml create mode 100644 evals/registry/solvers/skill_acquisition.yaml diff --git a/evals/elsuite/skill_acquisition/eval.py b/evals/elsuite/skill_acquisition/eval.py new file mode 100644 index 0000000000..52c770db7d --- /dev/null +++ b/evals/elsuite/skill_acquisition/eval.py @@ -0,0 +1,428 @@ +import json +import logging +import os +import random +from collections import defaultdict +from typing import Any, Dict, List, Optional, Union + +import evals +import evals.metrics +from evals.api import CompletionFn +from evals.elsuite.skill_acquisition.task_description import TASK_DESCRIPTION +from evals.elsuite.skill_acquisition.utils import ( + PROMPTS, + answer_detected, + get_accuracy, + get_average_bleu_score, + get_average_invalid_retrieval_calls, + get_average_retrieval_calls, + get_average_retrieval_precision, + get_bleu_score, + get_bootstrap_accuracy_std, + get_question_type, + get_std_of_difference, + process_answer, + process_view_instruction, + render_intermediate_prompt, + view_instruction_detected, +) +from evals.eval import SolverEval +from evals.solvers.solver import Solver +from evals.task_state import Message, TaskState + +TARGET_LANGUAGES = ["miskito"] +LESSON_FILE_SUFFIX = "_lessons.jsonl" + +logger = logging.getLogger(__name__) + + +class SkillAcquisition(SolverEval): + def __init__( + self, + completion_fns: List[CompletionFn], + samples_jsonl: str, + target_language: str, + knowledge_base_directory: str, + max_replies: int, + seed: int = 6122023, + n_samples: Optional[int] = None, + *args, + **kwargs, + ): + super().__init__(completion_fns, seed=seed, *args, **kwargs) + + assert ( + target_language.lower() in TARGET_LANGUAGES + ), f"Error: target language must be one of {TARGET_LANGUAGES}" + + self.samples_jsonl = samples_jsonl + self.n_samples = n_samples + self.task_description = TASK_DESCRIPTION.format(target_language=target_language) + self.rng = random.Random(seed) + + # Retrieval-related attributes. + self.knowledge_base_directory = self._prefix_registry_path(knowledge_base_directory) + self.files_available = os.listdir(self.knowledge_base_directory) + self.content_by_file: dict[str, dict] = {} + self.max_replies = max_replies # Used as timeout. + + def eval_sample(self, solver: Solver, sample: Dict, rng: random.Random) -> Dict[str, Any]: + """Runs the appropriate private evaluation function depending on the eval phase: retrieval or non-retrieval. + + Args: + solver (Solver): per-sample solver instantiated in parent. + sample (Dict): input to evaluate on. + rng (random.Random): random number generator, used for reproducibility. + + Returns: + Dict[str, Any]: metrics collected during evaluation. 
+ """ + # since we run two discrete experiments per sample, we have to copy the solver ahead of time + non_retrieval_solver = solver.copy() + retrieval_solver = solver.copy() + non_retrieval_out = self._eval_non_retrieval_sample(non_retrieval_solver, sample) + retrieval_out = self._eval_retrieval_sample(retrieval_solver, sample) + metrics_obj = { + "non_retrieval": non_retrieval_out, + "retrieval": retrieval_out, + } + + evals.record.record_metrics(**metrics_obj) + return metrics_obj + + def _eval_non_retrieval_sample(self, solver: Solver, sample: Dict, *_) -> Dict[str, Any]: + """Evaluates the given sample without using retrieval, ie. using the solver directly. + + Args: + solver (Solver): any compatible solver, instantiated just for this sample. + sample (Dict): input to evaluate on. + + Returns: + Dict[str, Any]: metrics collected during evaluation. + """ + task_state = TaskState( + task_description=self.task_description, + messages=[Message(**msg) for msg in sample["input"]], + ) + + result = solver(task_state) + output = result.output + if answer_detected(output): + answer = process_answer(output) + logger.debug(f"Model answered {answer}") + else: + answer = "NO ANSWER DETECTED" + + picked = evals.record_and_check_match( + prompt=sample["input"], + sampled=answer, + expected=[sample["ideal"]], + ) + + out_obj = { + "prompt": sample["input"], + "raw_output": result.output, + "parsed_output": answer, + "expected": [sample["ideal"]], + "correct": picked is not None, + "bleu": get_bleu_score(sample["ideal"], answer), + "question_type": get_question_type(sample["input"][-1]["content"]), + } + return out_obj + + def _eval_retrieval_sample(self, solver: Solver, sample: Dict, *_) -> Dict[str, Any]: + """Evaluates the given sample using retrieval. The retrieval logic is implemented in the _conversation_loop function. + + Args: + solver (Solver): any compatible solver, instantiated just for this sample. + sample (Dict): input to evaluate on. + + Returns: + Dict[str, Any]: metrics collected during evaluation. 
+ """ + files_available_paths = [ + self.knowledge_base_directory / file for file in self.files_available + ] + assert all([file.exists() for file in files_available_paths]) + task_state = TaskState( + task_description=self.task_description, + messages=[Message(**msg) for msg in sample["input"]], + current_state={"files": files_available_paths}, + ) + + output, metrics = self._conversation_loop(solver, task_state) + + if answer_detected(output): + answer = process_answer(output) + logging.debug(f"Model answered {answer}") + elif output == "Context length exceeded.": + answer = "NO ANSWER DETECTED" + logger.warn("Current interaction exceeded model context length.") + else: + answer = "NO ANSWER DETECTED" + logging.debug(f"Model timed out after {metrics['current_replies']} replies.") + + picked = evals.record_and_check_match( + prompt=sample["input"], + sampled=answer, + expected=[sample["ideal"]], + ) + + out_obj = { + "prompt": sample["input"], + "raw_output": output, + "parsed_output": answer, + "expected": [sample["ideal"]], + "correct": picked is not None, + "bleu": get_bleu_score(sample["ideal"], answer), + "ctx_len_exceeded": output == "Context length exceeded.", + "interaction_timed_out": metrics["current_replies"] >= self.max_replies, + "question_type": get_question_type(sample["input"][-1]["content"]), + "lesson_retrieval_calls": metrics["lesson_retrieval_calls"], + "correct_retrieval_calls": metrics["correct_retrieval_calls"], + "invalid_retrieval_calls": metrics["total_retrieval_calls"] + - metrics["correct_retrieval_calls"], + "total_retrieval_calls": metrics["total_retrieval_calls"], + } + return out_obj + + def run(self, recorder: evals.record.Recorder) -> dict[str, Union[float, int]]: + samples = self.get_samples() + self.rng.shuffle(samples) + samples = samples[: self.n_samples] if self.n_samples is not None else samples + + results = self.eval_all_samples(recorder, samples) + non_retrieval_results = [result["non_retrieval"] for result in results] + retrieval_results = [result["retrieval"] for result in results] + + baseline_accuracy = get_accuracy(non_retrieval_results) + baseline_std = get_bootstrap_accuracy_std(non_retrieval_results) + + retrieval_accuracy = get_accuracy(retrieval_results) + retrieval_std = get_bootstrap_accuracy_std(retrieval_results) + + delta_accuracy = retrieval_accuracy - baseline_accuracy + + # TODO: decide which metric to report – propagated standard deviation + # from bootstrapping or standard error of the mean estimated from repeats + # of the eval experiments. 
+ delta_std = get_std_of_difference(baseline_std, retrieval_std) + + ctx_len_exceeded_rate = sum( + 1 for result in retrieval_results if result["ctx_len_exceeded"] + ) / len(retrieval_results) + timeout_rate = sum( + 1 for result in retrieval_results if result["interaction_timed_out"] + ) / len(retrieval_results) + + num_translation_samples = len( + [result for result in retrieval_results if result["question_type"] == "translation"] + ) + num_non_translation_samples = len( + [result for result in retrieval_results if result["question_type"] == "non-translation"] + ) + + result = { + "baseline_accuracy": baseline_accuracy, + "baseline_std": baseline_std, + "retrieval_accuracy": retrieval_accuracy, + "retrieval_std": retrieval_std, + "delta_accuracy": delta_accuracy, + "delta_std": delta_std, + "average_retrieval_precision": get_average_retrieval_precision(retrieval_results), + "average_non_retrieval_bleu_score": get_average_bleu_score(non_retrieval_results), + "average_retrieval_bleu_score": get_average_bleu_score(retrieval_results), + "average_retrieval_calls": get_average_retrieval_calls(retrieval_results), + "average_invalid_retrieval_calls": get_average_invalid_retrieval_calls( + retrieval_results + ), + "ctx_len_exceeded_rate": ctx_len_exceeded_rate, + "timeout_rate": timeout_rate, + "num_samples": len(retrieval_results), + "num_translation_samples": num_translation_samples, + "num_non_translation_samples": num_non_translation_samples, + } + + return result + + def _view_content( + self, + file_name: str, + section_title: str = None, + sections_visible_to_model: dict[str, set] = defaultdict(set), + sections_viewed: dict[str, set] = defaultdict(set), + ) -> tuple[str, dict[str, set], dict[str, set]]: + """Views content from a JSONL file in the knowledge base. + If a section is provided, only the contents of that section are returned. + If no section is specified, the function returns the table of contents of the file. + + Args: + file_name (str): Name of the file. Full directory prefixed automatically. + section_title (str, optional): Name of the section to view. Defaults to None. + sections_visible_to_model (dict[str, set], optional): Dictionary of sections visible to the model. Defaults to {}. Updated in-place. + sections_viewed (dict[str, set], optional): Dictionary of sections viewed by the model. Defaults to {}. Updated in-place. + + Returns: + tuple(str, dict[str, set], dict[str, set]): A tuple of + the content of the section (if specified) and + the updated dictionaries of sections visible to and viewed by the model. + """ + # TODO: more general file format. + + if file_name in self.content_by_file: + file_content_by_section = self.content_by_file[file_name] + else: + # This should never occur, but if it does it should stop the eval from running. + if not os.path.exists(self.knowledge_base_directory / file_name): + raise ValueError( + f"File {self.knowledge_base_directory / file_name} does not exist." 
+ ) + + file_content_by_section = {} + with open(self.knowledge_base_directory / file_name, "r") as f: + for line in f: + line_dict = json.loads(line) + file_content_by_section[line_dict["title"]] = line_dict["content"] + self.content_by_file[file_name] = file_content_by_section + + if section_title is None: + sections = set(file_content_by_section.keys()) + sections_visible_to_model[file_name] = sections + sections_viewed[file_name].add("Table of Contents") + + return ( + f"Table of contents for {file_name}: {sections}.", + sections_visible_to_model, + sections_viewed, + ) + + sections_viewed[file_name].add(section_title) + return file_content_by_section[section_title], sections_visible_to_model, sections_viewed + + def _conversation_loop( + self, solver: Solver, task_state: TaskState + ) -> tuple[str, Dict[str, int]]: + """Maintains a conversation with the model until it outputs an answer or times out. + The model may request to read a file or a section of a file from the knowledge base. + + Args: + solver (Solver): any compatible solver, instantiated just for this sample. + task_state (TaskState): current task_state, which additionally contains a list of knowledge base files in `current_state`. + + Returns: + tuple[str, Dict[str, int]]: a tuple of the model's output and a dictionary of metrics collected during the conversation. + """ + output = "" + + # Not all retrieval calls are valid, e.g. if the file doesn't exist. + # These two metrics are analogous to an instruction-following rate. + metrics = { + "lesson_retrieval_calls": 0, + "correct_retrieval_calls": 0, + "total_retrieval_calls": 0, + "current_replies": 0, + } + sections_visible_to_model: dict[str, set] = defaultdict(set) + sections_viewed: dict[str, set] = defaultdict(set) + consecutive_instruction_failures = 0 + + while not answer_detected(output) and metrics["current_replies"] < self.max_replies: + if metrics["current_replies"] == 0: + # Beginning of the conversation, prepare instructions. + task_state.task_description = ( + task_state.task_description + + "\n\n" + + PROMPTS["retrieval_instructions"].format(list_of_files=self.files_available) + ) + if len(sections_viewed.items()) > 0: + intermediate_prompt = render_intermediate_prompt(sections_viewed) + task_state.messages += [Message(role="system", content=intermediate_prompt)] + + output = solver(task_state).output + task_state.messages += [Message(role="assistant", content=output)] + metrics["current_replies"] += 1 + + if view_instruction_detected(output) or answer_detected(output): + consecutive_instruction_failures = 0 + + if view_instruction_detected(output): + file, section = process_view_instruction(output) + metrics["total_retrieval_calls"] += 1 + + if file.endswith(LESSON_FILE_SUFFIX): + metrics["lesson_retrieval_calls"] += 1 + + # Handle any errors by logging and re-prompting the model. + if file not in self.files_available: + task_state.messages += [ + Message( + role="system", + content=PROMPTS["wrong_file"].format( + file=file, knowledge_base=self.files_available + ), + ) + ] + logger.debug( + f"Model tried to view {file}, which does not exist in the knowledge base:\n{json.dumps(self.files_available, indent=4)}." 
+ ) + continue + + if section is not None and section not in sections_visible_to_model[file]: + task_state.messages += [ + Message( + role="system", + content=PROMPTS["wrong_section"].format( + file=file, + section=section, + table_of_contents=sections_visible_to_model[file], + ), + ) + ] + logger.debug( + f"Model tried to view section {section} in file {file}, which does not exist.\nAvailable sections are {json.dumps(list(sections_visible_to_model[file]), indent=4)}." + ) + continue + + # If no errors, view the content and update the task state. + content, sections_visible_to_model, sections_viewed = self._view_content( + file, section, sections_visible_to_model, sections_viewed + ) + task_state.messages += [ + Message( + role="system", + content=PROMPTS["present_content"].format( + file=file, + section=section if section is not None else "Table of Contents", + content=content, + ), + ), + ] + metrics["correct_retrieval_calls"] += 1 + if section is None: + logger.debug(f"Model viewed table of contents for file {file}: {content}") + else: + logger.debug(f"Model viewed section {section} in file {file}.") + elif not answer_detected(output): + if consecutive_instruction_failures >= 3: + return "Model failed to follow instructions.", metrics + + consecutive_instruction_failures += 1 + logger.debug( + f"Model output did not contain a view instruction or an answer: {output}" + ) + + # Flag & move onto next sample if context length exceeded. + if ( + "'code': 'context_length_exceeded'" in output + or "Please reduce your prompt; or completion length" in output + ): + return "Context length exceeded.", metrics + + task_state.messages += [ + Message( + role="system", + content="Your output did not contain a view instruction or an answer. Please try again.", + ) + ] + + return output, metrics diff --git a/evals/elsuite/skill_acquisition/readme.md b/evals/elsuite/skill_acquisition/readme.md new file mode 100644 index 0000000000..2d5a8fafcb --- /dev/null +++ b/evals/elsuite/skill_acquisition/readme.md @@ -0,0 +1,64 @@ +# Skill acquisition + +This eval tests models' ability to learn a skill with minimal human involvement. In the initial release, models are evaluated on questions related to the [Miskito language](https://en.wikipedia.org/wiki/Miskito_language). Some samples are translation and others are language manipulation exercises. + +## Usage +Run with: +```bash +oaieval skill_acquisition.miskito +``` + +Where the solver can be any generation solver in `evals/registry/solvers/defaults.yaml`, eg. `generation/cot/gpt-3.5-turbo-16k`. + +## Evaluation process +Every time the eval is run, the model is evaluated twice. The first time, it answers the question directly using whatever prompting technique is executed by the solver you choose. The second time the model runs in a loop, interacting with an interface which gives it access to a knowledge base. The knowledge base contains text files, some of which are relevant for answering the question, while others are unrelated. If models can use this interface to increase their performance on the task, we can say that they've improved or acquired their language translation and manipulation skills. + +## Prompts +See `skill_acquisition/utils.py` to review/adjust the prompts used in this eval. + +## Datasets + +The dataset is generated from [this language course](https://en.wikibooks.org/wiki/Miskito), which comprises 229 questions. 
We further split this into manipulation-only (`miskito_test_manipulation.jsonl`) and translation-only (`miskito_test_translation.jsonl`) subsets. + +## Variants + +We test zero-shot and few-shot prompting techniques on the dataset: + +| Dataset | Zero-shot | Few-shot | +| --------- | -------- | -------- | +| Miskito | `skill_acquisition.miskito.zero-shot.full` | `skill_acquisition.miskito.few-shot.full` | + +The `full` in this case refers to the size of the dataset – there are also variants for testing where only 5 examples are considered, called `dev5`. For full details, look at `evals/registry/evals/skill_acquisition.yaml`. + +For the few-shot setting, use the eval-specific solvers in `evals/registry/solvers/skill_acquisition.yaml` to avoid train/test leakage. + +## Token Usage Estimates + +Below is a rough estimate of the total number of tokens consumed by some variations of the eval, including both input and output tokens: + +| Model | Solver | Prompt tokens | Completion tokens | Total tokens | +| --- | --- | --- | --- | --- | +| gpt-3.5-turbo | direct | 1,000,000 | 23,000 | 1,050,000 | +| gpt-3.5-turbo | cot | 930,000 | 120,000 | 1,050,000 | +| gpt-3.5-turbo | fewshot | 450,000 | 9,600 | 460,000 | +| gpt-3.5-turbo-16k | direct | 1,400,000 | 24,000 | 1,500,000 | +| gpt-3.5-turbo-16k | cot | 2,000,000 | 120,000 | 2,100,000 | +| gpt-3.5-turbo-16k | fewshot | 610,000 | 10,000 | 620,000 | +| gpt-4-base | direct | 1,800,000 | 420,000 | 2,200,000 | +| gpt-4-base | cot | 4,700,000 | 890,000 | 5,600,000 | +| gpt-4-base | fewshot | 1,400,000 | 320,000 | 1,700,000 | +| gpt-4-1106-preview | direct | 1,700,000 | 100,000 | 1,800,000 | +| gpt-4-1106-preview | cot | 1,600,000 | 99,000 | 1,700,000 | +| gpt-4-1106-preview | fewshot | 1,700,000 | 95,000 | 1,800,000 | +| gpt-4-32k | direct | 1,800,000 | 80,000 | 1,900,000 | +| gpt-4-32k | cot | 2,700,000 | 180,000 | 2,900,000 | +| gpt-4-32k | fewshot | 190,000 | 6,000 | 190,000 | + +## Version History +v0: Initial version released + + +## Contribution statement + +Eval design, implementation, and results evaluation were primarily conducted by Andrei Alexandru. Giulio Starace was responsible for code reviews throughout the implementation process, along with fine-grained feedback on the project in general. Additional guidance was provided by (alphabetically by last name) Steven Adler, James Aung and Chan Jun Shern, who scoped and managed the broader research project, including input on evaluation design, results analysis, and interpretation. + diff --git a/evals/elsuite/skill_acquisition/scraping/human_rights.html b/evals/elsuite/skill_acquisition/scraping/human_rights.html new file mode 100644 index 0000000000..c6d49a320c --- /dev/null +++ b/evals/elsuite/skill_acquisition/scraping/human_rights.html @@ -0,0 +1,2839 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + OHCHR | Universal Declaration of Human Rights - Miskito + + + + + + + + + + + + + Skip to main content + +
+
+ +
+ + + +
+ +
+ +
+
+ + + + + + +
+ +
+
+ + + +
+
+ + +

Universal Declaration of Human Rights - Miskito

+
+

SOURCE

+

Comité para la Defensa de los Derechos Humanos, Honduras

+
+
+ + +
+
+
+ +
+
+
+
+ + +
Miskito
+ +
+
+
Language Profile
+ + +

TOTAL SPEAKERS

160,000 (1982)

USAGE BY COUNTRY (OFFICIAL LANGUAGE)

Home Speakers: Nicaragua, Honduras

BACKGROUND

It belongs to the Misumalpan family (Macro-Chibchan subgroup) and is spoken by 11,000 people in Honduras and over 150,000 people in Nicaragua. It is a language of trade in Honduras, whereas it is widely used in Nicaragua, both in primary schools and among older people.

+ +
+
+
+
+
+
+ + +

Upla sut Raitka nani ba Tasba aiska laka ba Bapuia

+

Asla Takanka tara ba Naha Upla sut Raitka nani ba Tasba aiska laka ba Bapuia,

+

Sut lukanka baku, upla sut, kantri nani sut, trai kaikaia; baku, kKumi bani, dakni nani bani, nahara kat luki, tabaikaia, Smalkanka bak, kul nani bak, naha Raitka nani ba, Bara, Prika laka naniba, pramis kum Dauki, kantri laka nani bilkak, at apia, tasba, aiska, laka, nani bilkak, atsa, yaka kakaira takaia bara kulkaia wan kantri nani bui.

+

[Preamble]

+

Kulkanka 1

+

Upla sut ba kulkanka lakara, airaitka nanira bara pri, sin, aikuki, baku takisa. Bamna sins laka bri baku, lukanka bain pri baku aimuihni lakara, pana pana tabaikan kaiasa.

+

Kulkanka 2

+

Naha lakara pas taurá wisa, upla baniba airaitka brisa, bara sin, pri san: nisanka kulkras, taya maplika kulkras, mairin sapa waikna sapa kulkras, bila aisanka kulkras, ani gadkara mayuni sapa kulkras, aipulitikka lukanka ba, apia sa kaka, dia dia dukya kabia sin kulkras, wan tasbaya wina sapa, yuya kira sapa, wan tasbayara baikan sapa, apia kaka, dia dia walanatkara sin kulkras kira.

+

Baku sin wan kantri pulitik ka laka bui sin, wan kantri laka nani bui sin, apia kaka, tasba aiska laka mita sin, apia laka, ani tasbayara iwi ba bui sin upla kumira sin mayara kulkan kaia apia sa. Bamna kantri wala nani natkara iwi bara, kankri wala laka munhtara iwiba, alba laka natkara nanira iwiba sin saura baku kulkan kaia apia sa.

+

Kulkanka 3

+

Upla sut ba airaitka brisa airayaka kum brieia pri lakara iwaia upla baku, aimain kira kaia.

+

Kulkanka 4

+

Sip apia upla kum alba lakara kaia, bamna, baha natkara yus munan kaia sin, kan baha laka apu sa.

+

Kulkanka 5

+

Upla kumi sinra, sip apia sa uba saura munan kaia, silak mankan kaia, swira pask an, an upla apia baku munaia.

+

Kulkanka 6

+

Upla bani ai raitka brisa anira kabia sin lâkat upla baku kulkan kaia.

+

Kulkanka 7

+

La mawanra sut ba kumi kulkan sa, bara kumi bani la ba bui aikuki baku main kaikisa, upla sut ba airaitka brisa aikuki baku main kaikan kaia, wala nani bui mayara kulkan kabiara sin; naha laka tara bapanna kulkras baku.

+

Kulkanka 8

+

Upla sut la airaitka brisa tabaikanka uplika pain kum brikaia, wan kantri laka mawanra, baku mika sipsa airaitka nani kulkras munbia sin, la kulkan ka ba brih wabia.

+

Kulkanka 9

+

Upla kumi sin, sip apia sa ban kakalhni silak ra mangki saura munaia.

+

Kulkanka 10

+

Upla sutba, airaitka brisa, wala nani baku, upla sut mawanra, an la kum taibanka apu kira bui aiturka ba walan kaia. Baku mika airaitka nani, bara, witin daukaia dukia nani ba marikan kabia,m apia kaka, dia dia saurka dukiara munansa kapa sip kabia laki kaikaia.

+

Kulkanka 11

+
    +
  1. Upla bani ba, dia dia saurka dudiara laura lulkansa bara, airaitkabrisapas taura aiturkaba aisaia sip kabia, kau taibi munras bara, la tankaba kat, baku mika, bilka nani yâban kaiasa, upla sut mawanra, bapi buaia sip kaia.
  2. +
  3. Upla kumira sin, saura munan kaia apiasa pât kum dukiara, daukanka ba puyara, saura pali kulkan apia sa kaka, wan kantry raitka nani bui sin, ba wisi sin, saurka uba tara kulkan kaia apia sa, baha pât kaba dukiara.
+

Kulkanka 12

+

Upla bani Rayakaba, Wala bui turban kaia apiasa, Tâika nanira kabia sinm, watla bilara kabia sin, dukya nanira kabia sin ki, apia kaka, nina sauhkaia, rispik ka alahbaia; upla bani ba, airaitka brisa, baha nani saurka mapara la bui main kaikaia.

+

Kulkanka 13

+
    +
  1. Upla bani ba airaitka brisa, pri pali taukaia bara, kantri bilara tasba kum bri kaia, iwaia lahma.
  2. +
  3. Upla bani ba, airaitka brisa aitasbaya wina taki waia, bara, kli balaia, dimaia sin.
+

Kulkanka 14

+
    +
  1. Bankra dia dia patka dukiara nina blikisa kaka, upla bani ba, airaitka brisa, natka kumpliki, tasba wala kum distika makabaia, wala nani baku auya pah iwikaia dukiara.
  2. +
  3. La nani bui pat bahki nani kulkanba dukiara ban sin Tasba Aiska Asla Takanka brinka nani bara lukanka ta nani ba kulkras kira, naha Raitkana makabaia.
+

Kulkanka 15

+

Upla baniba, airaitka brisa kantri kumra iwikaia apia kaka, kantri walara iwaia lukankaba yabalka prakan kaiasa.

+

Kulkanka 16

+
    +
  1. Waikna bani, mairin bani ba, airaitka brisa pyua alkansa bara, nisanka kulkras, ani kantrikara iwiba kulkras, ani Gadkara mayuniba kulkras kira, sipsa marittakaia; sahwaia sin, baku mika, marit takansa bara, apia, mahka wal swibia sinki wal baku iwaiasa.
  2. +
  3. Marit laka daukan, kabia marit uplika naniba, aikupya wilinkira sakaka.
  4. +
  5. Panli laka, upia sut wina kau yamnika bak sakan dukia kumsa, bamna, upla sut bui, Gabament bui sin main pali kaikan kaiasa.
+

Kulkanka 17

+
    +
  1. Upla bani ba airairka brisa aidukia pawaia lahma brikaia, yakan lakara, bamna, upla wala nani aikuki asla lakara sin.
  2. +
  3. Upla kumi ra sin, Aidukia Pawaia lahma pat taki ba, yabalka prakaia.
+

Kulkanka 18

+

Upla bani ba, ai raitka brisa pri lakara dia dia lukaia, lukanka pain nani brikaia, bara, ani ani Gadkara lukaia, naha raitkana ra luki sipsa Gad Wala nani ra lukaia, upla nanira marikaia sipkabia; yakan kabia, upla wala sin, Aikulkanka aikuki kabia sin; upla nani mawanra, prakan ra dauki kaia, lahma, kulkaia, lahma, bara, laki kaikaia mata kabia sin.

+

Kulkanka 19

+

Upla sutba, airaitka brisa prikaia ailukankara aisankara; naha raitkara aisisa, dia dia lukanka dukiara, upla kumira sin warbras kaia sa tanka plikaia sip kaia, dia dia turiba nu kaia, bara, wala nanira maisapakaia, kantry ka kulkras, dia dia bilkak kat kabia sin.

+

Kulkanka 20

+
    +
  1. Upla bani ba, airaitka brisa pri lakara aslatakanka kum daukaia, bara, aslatakanka lamni laka kat brikaia.
  2. +
  3. Upla kumi sin, sip apia sa, taibi munankaia asla takanka kumra tilara kaia.
+

Kulkanka 21

+
    +
  1. Upla sutba, airaitka brisa, gabament dukia tilara kaia, ban sin, wala nanira tabaikaia baha nani tilara kabia.
  2. +
  3. Upla sutba airaitka brisa, wala nani aikuki baku Gabament Warkka nani tilara kaia.
  4. +
  5. Tawan aiska brinka ba, upla sut karhnika sa gabament tanira; naha brinka ba, gabament bani mangkisa bara, klir lakisa, lulkaia laka ba kat kulki, aikuki baku, bara, upla bani aikupya laka kat, ban natka wal nani ni daukkbia sin.
+

Kulkanka 22

+

Upla bani ba, upla baku airaitka brisa, main kaikankaia, baku sin, gabament tabaikanka baku, tasba aiska buisin asla takanka nani bilka brisa bara, gabament dukia nani sut kulki, pawaia natka nani, asla takaia nani, bara, aikulkanka nani sutba brin kaiasa, baku mika, upla baku ailukanka kat, bara, pri pali pawaia sip kabia.

+

Kulkanka 23

+
    +
  1. Upla bani ba airaitka brisa warkka kum brikaia, bara, aikupya pahkira wark plikaia, wala nani baku kaia, wark pain brikaia, bara, warka apu pyuara tabaikan kaiasa.
  2. +
  3. Upla sutba airaitka brisa wala mita mayara kulkankaia apiasa, aiwarkka daukiba baku mâna sin baku kaia su.
  4. +
  5. Upla wark taki naniba airaitka aprisa aimana kum brikaia, mana sin painkaia, baku sip kabia aitaya nani main kaikaia, upla baku iwaia, ban sin bilka sa kaka natka wala nani pliki mainka kaikan kaiasa.
  6. +
  7. Upla bani ba airaitka risa aslatakanka dakni paskaia bara tilara dimaia sin, aibrinka nani dukyara aiklabaia mata.
+

Kulkanka 24

+

Upla bani ba ai raitka brisa ris briaia, riska lilika briaia ai wark ka pyua kum brikaia, bara, baku sin ris pyua yari nani sin, ai mana wal.

+

Kulkanka 25

+
    +
  1. Upla bani ba, airaitka brisa iwaia natka pain kum brikaia, baku, sip kabia witin, bara, aitaika nani sin, siknis nani luhakaia; kau purara ban kulkan kaiasa: plun ba, praka, utla ba, sika nani yabaiaba, upla baku mainka kaikaia ba; baku sin, airaitka brisa wark apu sa pyuara, mainka kaikan kaia, siknis sa pyuara saua sakan sa bara, pyarka takansa bara, almuk takan sa bara, ban sin dia dia bui kra aidukia nani sut sauhki tikan sa bara, tabaikan karia.
  2. +
  3. Mairin ba, kwihra sa bara, baikan pyuara sin airaitka brisa main kaikan kaia; bara, dia dia brinka nani sut yâban kaia, ani tuktika, marit laka kat kulki baikan kabia, apia, tnayara baikan kabia sin airaitka brisa wal baku main kaikan kaia.
+

Kulkanka 26

+
    +
  1. Upla bani ba, airaitka brisa aaisinska kwakaia, smalkanka ba pri natkara kaiasa, ulbaia ba pan, aisikaikaiaba pan. Baku sin, karhna munan kaiasa ulbaia ba, bara aisikaikaiaba, lan takaia; lila kulka naniba, sut lahma kaiasa baku sin, kul nanira dimaia ba sip takan kaiasa sut lahma, kumi bani daukan kaba kaiki.
  2. +
  3. Kul smalkanka brinka kabia; upla ba; upla baku lukanka brikaia dukiara, smalkaia, baku sin, upla bani airaitka ba kulkaia, bara, upla bani aiprika laka kum kum bri nmaniba kulkaia dukiara smalkan kaia; tanka pain briaia bra, aidahra pain walaia, bara, pana laka tasba wala nani aikuki bara, indian nani sut aikuki kau taura kulkan kaiasa, baku si kupya kumi laka, upla sut mata, tasba aiska asla takanka daukiba ta baikan kaiasa.
  4. +
  5. Tuktan nani aisika bani pa, sip kabia ailuhpya dia a dia lan takaia ba, witin pali pliki yabaia.
+

Kulkanka 27

+
    +
  1. Upla bani ba, pri lakara aitasbaya lukanka laka nani tilara, kaia, baku sin paskanka nani tilara, bara, sins laka tara nani pawanka dilara kaia, ban sin baha lilika briaia.
  2. +
  3. Upla bani ba, airaitka brisa airispik ka laka ba, bara, aidukia nani ba sin, main kaikan kaia, witin aisinska tihukani, aiulbanka nani bak kra, apia, aipaskanka nani bak brisa kaka.
+

Kulkanka 28

+

Upla bani ba, airaitka brisa, tasba aiskara, bara, aitasbayara sin la kat, bara, wapni laka kata iwaia; naha laka bapan na; upla nani raitka ba kulkaia, bara, pri laka ba kulkaia nani ba, sut alkaia mata.

+

Kulkanka 29

+
    +
  1. Upla sut bui ai tawan kara, rispik ka ba yaban kaiasa bara baman upla baku ai auya pah pawisa.
  2. +
  3. Ai raitka nani ba , kulki, bara, ai prika lakaba wal, ai auya pah kaiasa kaka, upla bani ba la nani bapanba yabalka kat wapaia sa, baku mika, upla wala nani raitka, bara, prika laka aniba kaikaia, kulkaia sin, baku rispik ka yaia la kat iwaia, bara pana pana kupya pliki natka nani iwaia sa, kaka.
  4. +
  5. Naha raitka naniba, bara prika laka nani ba kulkan kaia apiasa, Tasba Aiska Asla Takanka lukanka mapara sa kaka.
+

Kulkanka 30

+

Naha laka bapanna, gavament ra kabia, dakni kumra kabia, upla kumra kabia, bilka ya bansa lukan kaia apia sa, raitka nani, bara, prika laka nani naha lakara aisan na, alki taibi munaia upla wala nanira.

+ +
+
+
+ + + + + + + + +
+
+ +
+ +
+
+
+ +
+ + +
+
+ + + + + + + + + + + + + diff --git a/evals/elsuite/skill_acquisition/scraping/scrape_distractor_articles.py b/evals/elsuite/skill_acquisition/scraping/scrape_distractor_articles.py new file mode 100644 index 0000000000..93e248b107 --- /dev/null +++ b/evals/elsuite/skill_acquisition/scraping/scrape_distractor_articles.py @@ -0,0 +1,96 @@ +# %% +import json +import re + +import requests +from bs4 import BeautifulSoup +from markdownify import markdownify as md + +articles_to_scrape = [ + "https://en.wikipedia.org/wiki/Mosquito", + "https://en.wikipedia.org/wiki/Mosquito_Coast", + "https://en.wikipedia.org/wiki/Nicaragua", + "https://en.wikipedia.org/wiki/Honduras", + "https://en.wikipedia.org/wiki/Miskito_language", + "https://en.wikipedia.org/wiki/Miskito_people", +] +dirpath = "evals/registry/data/skill_acquisition/distractor_articles/" + + +def clean_soup(content): + for infobox_tag in content.find_all("table", class_="infobox"): + infobox_tag.decompose() + for figure_tag in content.find_all("figure"): + figure_tag.decompose() + for style_tags in content.find_all("style"): + style_tags.decompose() + reflist_div = '
") + + sections = {} + for heading_text in headings: + if "" not in heading_text: + sections["Introduction"] = clean_heading_text(heading_text) + continue + span = heading_text[: heading_text.index("")] + heading_title = BeautifulSoup(span, "html.parser").contents[0].contents[0] + text = heading_text[heading_text.index("") + 5 :] + if heading_title not in ["References", "See also", "External links", "Footnotes"]: + sections[heading_title] = clean_heading_text(text) + + article_title = article.split("/")[-1] + + print(f"Scraped {article_title} successfully. Headings: {sections.keys()}\n") + filename = f"{article_title.lower()}.jsonl" + + with open(dirpath + filename, "w") as f: + for k, v in sections.items(): + f.write(json.dumps({"title": k, "content": v}, ensure_ascii=False) + "\n") + +# Separate code to scrape human rights article, as it's in a different format. +with open("human_rights.html", "r") as f: + html = f.read() + +soup = BeautifulSoup(html, "html.parser") +content = soup.find("div", class_="migrated-content") +md_content = md(str(content)).replace("\xa0", " ").replace("\u3000", " ") + +with open(dirpath + "human_rights_miskito.jsonl", "w") as f: + f.write( + json.dumps( + {"title": "Declaration of Human Rights in Miskito", "content": md_content}, + ensure_ascii=False, + ) + + "\n" + ) diff --git a/evals/elsuite/skill_acquisition/scraping/scrape_miskito.py b/evals/elsuite/skill_acquisition/scraping/scrape_miskito.py new file mode 100644 index 0000000000..697b5667cd --- /dev/null +++ b/evals/elsuite/skill_acquisition/scraping/scrape_miskito.py @@ -0,0 +1,135 @@ +# %% +import json + +import bs4 +import requests +from bs4 import BeautifulSoup +from markdownify import markdownify as md + +# TODO: make sure italicised text is crawled properly and that hints are excluded from answers. +# TODO: Split any multi-part questions into individual questions. + +miskito_base_url = "https://en.wikibooks.org/wiki/Miskito/Lesson_{idx}" + + +def process_practice_section_div(practice_div: bs4.element.Tag): + tds = practice_div.find_all("td") + instructions = ( + md(str(tds[1])) + .replace("*", "") + .replace("|", "") + .strip() + .replace("What do these mean?", "Translate to English:") + .replace("What do these sentences mean?", "Translate to English:") + ) + question_text = tds[2] + questions = question_text.find_all("li") + questions = [str(q.contents[0]) for q in questions] + answer_text = tds[3] + answers = answer_text.find_all("li") + answers = [str(a.contents[0]) for a in answers] + return instructions, questions, answers + + +def extract_toc_sections(content: bs4.element.Tag): + toc = content.find_all("div", class_="toc")[0] + lis = toc.find_all("li", class_="toclevel-1") + lis = [li.find_all("span", class_="toctext")[0].contents[0] for li in lis] + + lis = [md(str(li)).strip().replace("*", "") for li in lis] + return lis + + +def process_miskito_page(): + qa_pairs_by_lesson = {} + articles_without_qa_pairs = [] + for idx in range(1, 11): + response = requests.get(miskito_base_url.format(idx=idx)) + soup = BeautifulSoup(response.text, "html.parser") + content = soup.find("div", class_="mw-content-ltr mw-parser-output") + + # Extract the question-answer pairs. + divs_with_specific_style = content.find_all( + "div", style=lambda value: value and "width:300px; float:right;" in value + ) + lesson_qa_pairs = [] + for i, div in enumerate(divs_with_specific_style): + if i == 0 and idx == 1: # First section of first lesson is not in the same format. 
+ instructions = "Translate to English:" + questions = div.find_all("ul")[0].find_all("li") + questions = [str(q.contents[0]) for q in questions] + answers = div.find_all("ul")[1].find_all("li") + answers = [str(a.contents[0]) for a in answers] + lesson_qa_pairs += [ + {"question": q, "answer": a, "instructions": instructions} + for q, a in zip(questions, answers) + ] + continue + instructions, questions, answers = process_practice_section_div(div) + for q, a in zip(questions, answers): + lesson_qa_pairs += [{"question": q, "answer": a, "instructions": instructions}] + qa_pairs_by_lesson[f"lesson_{idx}"] = lesson_qa_pairs + + # Remove them from the page and store the page contents. + for div in divs_with_specific_style: + div.decompose() + + articles_without_qa_pairs += [content] + + return qa_pairs_by_lesson, articles_without_qa_pairs + + +# %% +# Write to file: all questions by lesson, and all questions in evallib format. +qa_pairs_by_lesson, clean_articles = process_miskito_page() +qa_by_lesson_file = "miskito_qa_pairs_by_lesson.jsonl" + +with open(qa_by_lesson_file, "w") as f: + for lesson, qa_pairs in qa_pairs_by_lesson.items(): + f.write(json.dumps({"lesson": lesson, "qa_pairs": qa_pairs}) + "\n") + +miskito_qa = "miskito_qa.jsonl" +with open(miskito_qa, "w") as f: + for lesson, qa_list in qa_pairs_by_lesson.items(): + for qa_dict in qa_list: + instructions = qa_dict["instructions"][:-1] + ": " + f.write( + json.dumps( + { + "input": [{"role": "user", "content": instructions + qa_dict["question"]}], + "ideal": qa_dict["answer"], + }, + ensure_ascii=False, + ) + + "\n" + ) +# %% +as_text = [str(a).split("
<h2>
")[1:] for a in clean_articles] +sections_by_heading = {} +for article in as_text: + for heading in article: + hsoup = BeautifulSoup(heading, "html.parser") + heading_name = ( + md(str(hsoup.find("span", class_="mw-headline").contents[0])).replace("*", "").strip() + ) + hsoup.find("span", class_="mw-editsection").decompose() + content = ( + md(str(hsoup)) + .strip() + .replace("*", "") + .replace("|", "") + .replace("What do they mean?", "") + .replace(" --- ", "") + .replace("\u2003", " ") + .replace(" ", " ") + ) + content = content.split(" Study ")[1] if "Study " in content else content + sections_by_heading[heading_name] = content.strip() + +sections_by_heading +# %% +file = "lessons_no_exercises.jsonl" +with open(file, "w") as f: + for heading, content in sections_by_heading.items(): + f.write(json.dumps({"title": heading, "content": content}, ensure_ascii=False) + "\n") +# %% diff --git a/evals/elsuite/skill_acquisition/scripts/make_plots.py b/evals/elsuite/skill_acquisition/scripts/make_plots.py new file mode 100644 index 0000000000..01eab83412 --- /dev/null +++ b/evals/elsuite/skill_acquisition/scripts/make_plots.py @@ -0,0 +1,204 @@ +import argparse +import os +from pathlib import Path + +import matplotlib.pyplot as plt +import pandas as pd +import seaborn as sns + +from evals.utils import log_utils + +PLOT_TITLES_BY_METRIC = { + "overall_accuracy": "Accuracy", # ie. both retrieval and non-retrieval in one plot + "baseline_accuracy": "Baseline accuracy (non-retrieval)", + "retrieval_accuracy": "Retrieval accuracy", + "average_retrieval_precision": "Average retrieval precision", + "average_non_retrieval_bleu_score": "Average non-retrieval BLEU score", + "average_retrieval_bleu_score": "Average retrieval BLEU score", + "average_retrieval_calls": "Average retrieval calls", + "average_invalid_retrieval_calls": "Average invalid retrieval calls", + "bleu_score": "BLEU score", + "correct_call_rate": "Correct call rate", + "invalid_call_rate": "Invalid call rate", + "timeout_rate": "Timeout rate", + "ctx_len_exceeded_rate": "Context length exceeded rate", +} + +UNIT_METRICS = set( + ["correct_call_rate", "invalid_call_rate", "timeout_rate", "ctx_len_exceeded_rate"] +) + + +def extract_metrics(datadir: Path) -> pd.DataFrame: + df_rows = [] + for path, results in sorted(list(log_utils.get_final_results_from_dir(datadir).items())): + spec = log_utils.extract_spec(path) + solver_path = Path(spec["completion_fns"][0]) + model = solver_path.name + solver = solver_path.parent.name + # Remove root section of path, which is the eval name + solver_path = solver_path.relative_to(solver_path.parts[0]) + df_rows.append({"solver": solver, "model": model, **results}) + df = pd.DataFrame(df_rows) + + return df + + +def make_plot( + df: pd.DataFrame, + outpath: Path, + metric="baseline_accuracy", + min_ylim=0, + max_ylim=0.08, + dataset="miskito", +): + plt.figure() + sns.set_theme(style="whitegrid") + # Calculating mean and SEM + grouped = df.groupby(["model", "solver"])[metric].agg(["mean", "sem"]).reset_index() + + def compute_sem(x): + sem = x.std() / (len(x) ** 0.5) + sem2 = sem * 2 # 95% confidence interval + return (x.mean() - sem2, x.mean() + sem2) + + # Plotting + sns.set(style="whitegrid") + sns.barplot(x="model", y="mean", hue="solver", data=grouped, errorbar=compute_sem, capsize=0.1) + plt.xticks(rotation=30, ha="right") + plt.ylim(min_ylim, max_ylim) + + # Some of the metrics are in [0, 1]. 
+ if metric in UNIT_METRICS: + plt.ylim(0, 1) + + plt.title(PLOT_TITLES_BY_METRIC[metric] + f" on {dataset.capitalize()} Q&A dataset") + plt.xlabel("Model") + plt.tight_layout() + plt.savefig(outpath) + plt.close() + + +def make_side_bar_plot( + df: pd.DataFrame, + outpath: Path, + metric="overall_accuracy", + min_ylim=0, + max_ylim=0.1, + dataset="miskito", +): + if metric == "overall_accuracy": + df_clean = df[["model", "solver", "baseline_accuracy", "retrieval_accuracy"]] + elif metric == "bleu_score": + df_clean = df[ + ["model", "solver", "average_non_retrieval_bleu_score", "average_retrieval_bleu_score"] + ] + + fig, ax = plt.subplots(figsize=(10, 5)) + # df_clean = df_clean.drop(columns=["solver"]) + df_clean.set_index(["model", "solver"], inplace=True) + + # Group by 'model' and calculate mean and SEM + grouped = df_clean.groupby(["model", "solver"]).agg(["mean", "sem"]) + xlabels = [f"{model}/{solver}" for model, solver in grouped.index] + + # Prepare data for plotting + means = grouped.xs("mean", axis=1, level=1) + errors = grouped.xs("sem", axis=1, level=1) + + # Plotting + means.plot(kind="bar", yerr=errors, capsize=4, ax=ax) # Removed 'stacked=True' + + ax.set_ylabel(metric) + ax.set_xticklabels(xlabels, rotation=30, ha="right") + ax.set_xlabel("model/solver") + ax.set_ylim(min_ylim, max_ylim) + + fig.tight_layout(pad=3.0) + fig.suptitle(PLOT_TITLES_BY_METRIC[metric] + f" on {dataset.capitalize()} dataset") + fig.savefig(outpath) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--log-dir", "-d", type=str, required=True) + parser.add_argument("--out-dir", "-o", type=str, default="./outputs") + args = parser.parse_args() + log_dir = Path(args.log_dir) + out_dir = Path(args.out_dir) + + out_dir.mkdir(exist_ok=True, parents=True) + + datasets = os.listdir(log_dir) + + for dataset in datasets: + print(f"Extracting data for eval dataset {dataset}...") + df = extract_metrics(log_dir / dataset) + + # Rename some of the solver values so they can be represented in the same plot. + df.loc[df["solver"] == "cot_hhh", "solver"] = "cot" + df.loc[df["solver"] == "hhh", "solver"] = "direct" + df.loc[df["solver"] == "fewshot_direct", "solver"] = "fewshot" + + # TODO: report directly as 'average_correct_calls' in future and remove this rename. + df.rename(columns={"average_retrieval_precision": "average_correct_calls"}, inplace=True) + df["correct_call_rate"] = df["average_correct_calls"] / df["average_retrieval_calls"] + df["invalid_call_rate"] = ( + df["average_invalid_retrieval_calls"] / df["average_retrieval_calls"] + ) + + print(f"Plotting other metrics for eval dataset {dataset}...") + + # Generate bar plots for all other metrics. + core_metrics = ( + [] + ) # ["baseline_accuracy", "retrieval_accuracy", "average_non_retrieval_bleu_score", "average_retrieval_bleu_score"] + auxiliary_metrics = [ + "correct_call_rate", + "invalid_call_rate", + "timeout_rate", + "ctx_len_exceeded_rate", + ] + for metric in core_metrics + auxiliary_metrics: + make_plot( + df[["model", "solver", metric]].copy(), + out_dir / f"{dataset}_{metric}.png", + metric, + dataset=dataset, + ) + + print(f"Plotting headline metrics for eval dataset {dataset}...") + + # Generate stacked bar plots for the two headline metrics. 
+ for metric in ["overall_accuracy", "bleu_score"]: + make_side_bar_plot(df, out_dir / f"{dataset}_{metric}.png", metric, dataset=dataset) + + # Print numerical results (and compute % improvement metrics) + grouped = df.groupby(["model", "solver"]).agg(["mean", "sem"]) + for type, closedbook, openbook in [ + ( + "Translation (BLEU)", + "average_non_retrieval_bleu_score", + "average_retrieval_bleu_score", + ), + ("Non-translation (%)", "baseline_accuracy", "retrieval_accuracy"), + ]: + print(f"Improvement Metrics for {type} on {dataset.capitalize()} dataset") + improvement_rows = [] + for idx, row in grouped.iterrows(): + openbook_score = row[openbook]["mean"] + closedbook_score = row[closedbook]["mean"] + rel_improvement_score = (openbook_score - closedbook_score) / (1 - closedbook_score) + improvement_rows.append( + { + "model": idx[0], + "solver": idx[1], + "closedbook": closedbook_score, + "openbook": openbook_score, + "improvement": rel_improvement_score, + } + ) + improvement_df = pd.DataFrame(improvement_rows) + print(improvement_df) + # print to stdout as csv + print(improvement_df.to_csv(index=False)) diff --git a/evals/elsuite/skill_acquisition/scripts/run_experiments.sh b/evals/elsuite/skill_acquisition/scripts/run_experiments.sh new file mode 100755 index 0000000000..aaf81e0745 --- /dev/null +++ b/evals/elsuite/skill_acquisition/scripts/run_experiments.sh @@ -0,0 +1,76 @@ +logdir=./logs +outputdir=./outputs + +timestamp=$(date +%Y%m%d_%H%M%S) +logpathbase=$logdir/$timestamp/ + +size=full +num_repeats=1 +eval_variants_zero_shot=("skill_acquisition.miskito.zero_shot.$size") + +# Check for --num_repeats argument +for arg in "$@" +do + if [[ $arg == --num_repeats=* ]]; then + num_repeats="${arg#*=}" + fi +done + + +echo Running experiments and logging to $logpathbase + +declare -a ZEROSHOT_SOLVERS=( + # Solvers for gpt-3.5-turbo + "generation/direct/gpt-3.5-turbo" + "skill_acquisition/cot/gpt-3.5-turbo" + + + # Solvers for gpt-4-turbo-preview + "generation/direct/gpt-4-turbo-preview" + "skill_acquisition/cot/gpt-4-turbo-preview" +) + +declare -a FEWSHOT_SOLVERS=( + "miskito_all/fewshot_direct/gpt-3.5-turbo" + "miskito_all/fewshot_direct/gpt-4-turbo-preview" +) + +if [ ! -d "$logpathbase/miskito" ]; then + mkdir -p "$logpathbase/miskito" +fi + + +# Run zero-shot experiments. +for eval_variant in "${eval_variants_zero_shot[@]}" +do + if [[ $eval_variant == *"miskito"* ]]; then + record_path="$logpathbase/miskito" + fi + + for solver in "${ZEROSHOT_SOLVERS[@]}" + do + for ((i=1;i<=num_repeats;i++)); do + echo "Running $solver, iteration $i" + oaieval $solver $eval_variant --record_path "$record_path/$solver-$i.log" + done + done +done + +# Run few-shot experiments. 
+# Miskito +for solver in "${FEWSHOT_SOLVERS[@]}" +do + if [[ $solver == *"miskito"* ]]; then + for ((i=1;i<=num_repeats;i++)); do + echo "Running $solver, iteration $i" + oaieval $solver skill_acquisition.miskito.few_shot.$size --record_path "$logpathbase/miskito/$solver-$i.log" + done + fi +done + +echo Done running experiments, all logs in $logpathbase + +echo Producing plots, outputs to $outputdir + +# Produce results +python make_plots.py --log-dir $logpathbase --out-dir $outputdir \ No newline at end of file diff --git a/evals/elsuite/skill_acquisition/solvers.py b/evals/elsuite/skill_acquisition/solvers.py new file mode 100644 index 0000000000..16bd72f653 --- /dev/null +++ b/evals/elsuite/skill_acquisition/solvers.py @@ -0,0 +1,22 @@ +from evals.solvers.openai_assistants_solver import OpenAIAssistantsSolver +from evals.solvers.solver import SolverResult +from evals.task_state import TaskState + + +class SkillAcquisitionAssistantsSolver(OpenAIAssistantsSolver): + def _solve( + self, + task_state: TaskState, + **kwargs, + ) -> SolverResult: + + if task_state.current_state is not None: + task_state.task_description = """ +You are about to take a Q&A test in a language called {target_language}. You should do your best to answer each question correctly. Your score will be your number of correct answers divided by the number of questions. + +This is an open book test, and you have been provided with relevant materials to help you answer the questions via your file browser. + +To provide your final answer, output [ANSWER X], where X is the answer. For example, if you answer is 'this is a bear', output [ANSWER this is a bear] +""".strip() + + return super()._solve(task_state, **kwargs) diff --git a/evals/elsuite/skill_acquisition/task_description.py b/evals/elsuite/skill_acquisition/task_description.py new file mode 100644 index 0000000000..5a356179d0 --- /dev/null +++ b/evals/elsuite/skill_acquisition/task_description.py @@ -0,0 +1,3 @@ +TASK_DESCRIPTION = """You are about to take a Q&A test in a language called {target_language}. You should do your best to answer each question correctly. Your score will be your number of correct answers divided by the number of questions. + +To provide your final answer, output [ANSWER X], where X is the answer. 
For example, if your answer is 'this is a bear', output [ANSWER this is a bear]""" diff --git a/evals/elsuite/skill_acquisition/test_skill_acquisition.py b/evals/elsuite/skill_acquisition/test_skill_acquisition.py new file mode 100644 index 0000000000..079fdcb435 --- /dev/null +++ b/evals/elsuite/skill_acquisition/test_skill_acquisition.py @@ -0,0 +1,118 @@ +import json +import os +from pathlib import Path + +from evals.elsuite.skill_acquisition.eval import SkillAcquisition +from evals.elsuite.skill_acquisition.utils import ( +    answer_detected, +    process_answer, +    process_view_instruction, +    view_instruction_detected, +) +from evals.registry import Registry + +registry = Registry() + +dummy_eval_spec = { +    "eval_registry_path": Path("evals/registry"), +    "completion_fns": [registry.make_completion_fn("gpt-4")], +    "samples_jsonl": "skill_acquisition/miskito/miskito_test.jsonl", +    "target_language": "miskito", +    "n_samples": 5, +    "knowledge_base_directory": "skill_acquisition/miskito/knowledge_base/", +    "max_replies": 50, +} + + +def test_answer_detected(): +    assert answer_detected("[ANSWER foo]") is True +    assert answer_detected("[ANSWER: foo]") is True +    assert answer_detected("ANSWER foo") is False +    assert answer_detected("[ANSWER foo") is False +    assert answer_detected("ANSWER foo]") is False +    assert answer_detected("[ANSWER foo][ANSWER bar]") is True + + +def test_view_instruction_detected(): +    SkillAcquisition(**dummy_eval_spec) +    assert view_instruction_detected("[VIEW file1]") is True +    assert view_instruction_detected("[VIEW: file1]") is True +    assert view_instruction_detected("[VIEW file1 section1]") is True +    assert view_instruction_detected("[VIEW: file1 section1]") is True +    assert view_instruction_detected("VIEW file1") is False +    assert view_instruction_detected("[VIEW file1") is False +    assert view_instruction_detected("VIEW file1]") is False +    assert view_instruction_detected("[VIEW file1][VIEW file2]") is True +    assert view_instruction_detected("[VIEW: file1][VIEW: file2]") is True + + +def test_process_answer(): +    SkillAcquisition(**dummy_eval_spec) +    assert process_answer("[ANSWER foo]") == "foo" +    assert process_answer("[ANSWER: foo]") == "foo" +    assert process_answer("[ANSWER foo bar baz]") == "foo bar baz" +    assert process_answer("[ANSWER: foo bar baz]") == "foo bar baz" +    assert process_answer("[ANSWER foo][ANSWER bar]") == "bar" +    assert process_answer("[ANSWER foo][ANSWER bar") == "foo" + + +def test_process_view_instruction(): +    SkillAcquisition(**dummy_eval_spec) +    assert process_view_instruction("[VIEW file1]") == ("file1", None) +    assert process_view_instruction("[VIEW: file1]") == ("file1", None) +    assert process_view_instruction("[VIEW file1 section1]") == ( +        "file1", +        "section1", +    ) +    assert process_view_instruction("[VIEW: file1 section1]") == ( +        "file1", +        "section1", +    ) +    assert process_view_instruction("[VIEW file1][VIEW file2]") == ( +        "file2", +        None, +    ) +    assert process_view_instruction("[VIEW: file1][VIEW: file2]") == ( +        "file2", +        None, +    ) +    assert process_view_instruction("[VIEW file1 section1][VIEW file2 section2]") == ( +        "file2", +        "section2", +    ) + + +def test_process_view_instruction_spaces_and_quotes(): +    assert process_view_instruction("[VIEW file1 sectionpart1 sectionpart2]") == ( +        "file1", +        "sectionpart1 sectionpart2", +    ) +    assert process_view_instruction("[VIEW file1 sectionpart1 'sectionpart2']") == ( +        "file1", +        "sectionpart1 'sectionpart2'", +    ) + + +def test_view_content(): +    skill_acquisition_eval =
SkillAcquisition(**dummy_eval_spec) + + # Create a file to view first. + filepath = skill_acquisition_eval.knowledge_base_directory / "test_file.jsonl" + with open(filepath, "w") as f: + f.write(json.dumps({"title": "foo", "content": "Test file contents."}) + "\n") + + content, sections_visible_to_model, sections_viewed = skill_acquisition_eval._view_content( + "test_file.jsonl" + ) + assert content == "Table of contents for test_file.jsonl: {'foo'}." + assert sections_visible_to_model == {"test_file.jsonl": {"foo"}} + assert sections_viewed == {"test_file.jsonl": {"Table of Contents"}} + + content, sections_visible_to_model, sections_viewed = skill_acquisition_eval._view_content( + "test_file.jsonl", "foo" + ) + assert content == "Test file contents." + assert sections_visible_to_model == {"test_file.jsonl": {"foo"}} + assert sections_viewed == {"test_file.jsonl": {"Table of Contents", "foo"}} + + os.remove(filepath) diff --git a/evals/elsuite/skill_acquisition/utils.py b/evals/elsuite/skill_acquisition/utils.py new file mode 100644 index 0000000000..ef0ef09349 --- /dev/null +++ b/evals/elsuite/skill_acquisition/utils.py @@ -0,0 +1,179 @@ +import json +import logging +import random +import re +from math import sqrt +from typing import Callable, Dict, List, Set, Union + +import numpy as np +from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu + +ANSWER_PATTERN = r"\[ANSWER:? .+?\]" +VIEW_PATTERN = r"\[VIEW:? .+?\]" +PROMPTS = { + "retrieval_instructions": ( + "This is an open-book test." + " If you are unsure about a question or unable to answer, you may read through a number of" + " knowledge sources that are made available to you in the form of text files." + " To open a file, output [VIEW X] where X is the name of the source you would like." + " For example, to view a file called help.jsonl, output [VIEW help.jsonl]." + " When you first open a file I will tell you its table of contents." + " You may view a particular section by outputting [VIEW X Y] where X is the name of" + " the file and Y is the name of the section you would like to read." + " The section titles won't necessarily match your exact task." + " You should explore them for any useful information, such as key words or phrases," + " before giving your answer. You should always try to provide an answer," + " even if you are not confident. Failure to provide an answer is penalised" + " more strongly than incorrect answers." + "\nHere are the sources available to you: {list_of_files}." + ), + "intermediate_prompt": "You've already viewed the following files and sections: {sections}.\nYou can view another file or section by outputting [VIEW X] or [VIEW X Y], or you can answer the question by outputting [ANSWER X].", + "present_content": "You asked to view file {file}, section {section}. Here is the content: {content}", + "wrong_file": "You tried to view {file}, which does not exist in the knowledge base. Choose another file from {knowledge_base}.", + "wrong_section": "You tried to view section {section} in file {file}, which does not exist. The table of contents for that file contains: {table_of_contents}.", +} + +logger = logging.getLogger(__name__) + + +def answer_detected(output: str) -> bool: + return len(re.findall(ANSWER_PATTERN, output)) > 0 + + +def view_instruction_detected(output: str) -> bool: + return len(re.findall(VIEW_PATTERN, output)) > 0 + + +def process_answer(output: str) -> str: + """Extracts the answer from model output. + The answer looks like [ANSWER X], where X is the answer. 
+ +    Args: +        output (str): model output + +    Returns: +        str: answer provided by the model +    """ +    maybe_multiple_answers = re.findall(ANSWER_PATTERN, output) + +    # Sanity check – this should never happen. +    assert len(maybe_multiple_answers) > 0, f"No answer detected in {output}." + +    if len(maybe_multiple_answers) > 1: +        logger.debug( +            f"Multiple answers detected, using only the final answer: {maybe_multiple_answers}" +        ) + +    final_answer_instruction = maybe_multiple_answers[-1] +    final_answer = " ".join(final_answer_instruction.split(" ")[1:])[:-1] + +    return final_answer + + +def process_view_instruction(output: str) -> Union[tuple[str, str], tuple[str, None]]: +    """Extracts the target of a view instruction from model output. +    The view instruction looks like [VIEW X Y], where X is a file name and Y is a section name. +    This function extracts X and Y. + +    Args: +        output (str): model output + +    Returns: +        Union[tuple[str, str], tuple[str, None]]: tuple of file name and if applicable section name to view +    """ +    maybe_multiple_views = re.findall(VIEW_PATTERN, output) + +    # Sanity check – this should never happen. +    assert len(maybe_multiple_views) > 0, f"No view instruction detected in {output}." + +    if len(maybe_multiple_views) > 1: +        logger.debug( +            f"Multiple view instructions detected, using only the final instruction: {maybe_multiple_views}" +        ) + +    final_view_instruction = maybe_multiple_views[-1][1:-1].split(" ")[1:] +    file = final_view_instruction[0].strip() + +    section = ( +        None if len(final_view_instruction) == 1 else " ".join(final_view_instruction[1:]).strip() +    ) + +    return (file, section) + + +def _get_average_metric( +    results: List[Dict[str, str]], metric_fn: Callable[[List[Dict[str, str]]], List[float]] +) -> float: +    total_metric = sum(metric_fn(results)) +    num_total = len(results) +    if num_total == 0: +        return float("nan") +    else: +        return total_metric / num_total + + +def get_bootstrap_accuracy_std(results: List[Dict[str, str]], num_samples: int = 1000) -> float: +    results = [sample for sample in results if sample["question_type"] != "translation"] +    vals = [result["correct"] for result in results] +    return np.std([np.mean(random.sample(vals, len(vals) // 2)) for _ in range(num_samples)]) + + +def render_intermediate_prompt(sections_viewed: Dict[str, Set]) -> str: +    return PROMPTS["intermediate_prompt"].format( +        sections=json.dumps( +            {k: list(v) for k, v in sections_viewed.items()}, indent=4 +        )  # Cannot serialise sets directly.
+ ) + + +def get_question_type(question: str) -> str: + return "translation" if question.strip().startswith("Translate") else "non-translation" + + +def get_average_bleu_score(results: List[Dict[str, str]]) -> float: + results = [sample for sample in results if sample["question_type"] == "translation"] + return _get_average_metric( + results, + lambda samples: [ + get_bleu_score(sample["expected"][0], sample["parsed_output"]) for sample in samples + ], + ) + + +def get_bleu_score(expected: str, sampled: str) -> float: + punctuation = r"[^\w\s]" + + return sentence_bleu( + [re.sub(punctuation, "", expected).split()], + re.sub(punctuation, "", sampled).split(), + smoothing_function=SmoothingFunction().method1, + ) + + +def get_accuracy(results: List[Dict[str, str]]) -> float: + results = [sample for sample in results if sample["question_type"] != "translation"] + return _get_average_metric( + results, lambda samples: [int(sample["correct"]) for sample in samples] + ) + + +def get_average_retrieval_calls(results: List[Dict[str, str]]) -> float: + return _get_average_metric( + results, lambda samples: [sample["total_retrieval_calls"] for sample in samples] + ) + + +def get_average_invalid_retrieval_calls(results: List[Dict[str, str]]) -> float: + return _get_average_metric( + results, lambda samples: [sample["invalid_retrieval_calls"] for sample in samples] + ) + + +def get_average_retrieval_precision(results: List[Dict[str, str]]) -> float: + return _get_average_metric( + results, lambda samples: [sample["lesson_retrieval_calls"] for sample in samples] + ) + + +def get_std_of_difference(baseline_std: float, retrieval_std: float) -> float: + return sqrt(baseline_std**2 + retrieval_std**2) diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/honduras.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/honduras.jsonl new file mode 100644 index 0000000000..cc818363b7 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/honduras.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:50b995b959aa7308a0be6413d005c0407984cf6f57a953c1fdde745f17df0db4 +size 72360 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/human_rights_miskito.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/human_rights_miskito.jsonl new file mode 100644 index 0000000000..fe48f48eef --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/human_rights_miskito.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a3baae4eade2acc21395c8b29a1f82cc05da00b7f7bc4cd458cc8ee2f7d032cb +size 10298 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_language.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_language.jsonl new file mode 100644 index 0000000000..7b118988e6 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_language.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2972b14f1f6aa0fb4246a3d4a964cf07c0dfc3e717b6036ccff7d1f6284e7812 +size 7399 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_lessons.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_lessons.jsonl new file mode 100644 index 0000000000..fd42093260 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_lessons.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 
+oid sha256:f657754efc73614292b53c313583cd0013a9f7bde1e6018220d0bd15a546838c +size 43506 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_people.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_people.jsonl new file mode 100644 index 0000000000..eb18d39508 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/miskito_people.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2be3a3684c1586cc0779ae4cf47866d0e88bd8f67c5256438fe59aaa2e8a81b7 +size 53928 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito.jsonl new file mode 100644 index 0000000000..f3abc68f3e --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a69fde31a05e3f95e34bcbbc7e9986e3bf107513658a6e002ae8bb303d69d7d8 +size 28786 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito_coast.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito_coast.jsonl new file mode 100644 index 0000000000..29e7de5a3f --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/mosquito_coast.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1539eceeb715376db2db9eb17dcc7c5e43d0e71df710c65b71a7d6276c23dc44 +size 34533 diff --git a/evals/registry/data/skill_acquisition/miskito/knowledge_base/nicaragua.jsonl b/evals/registry/data/skill_acquisition/miskito/knowledge_base/nicaragua.jsonl new file mode 100644 index 0000000000..c71d1603ef --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/knowledge_base/nicaragua.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b446d17a582e1d8bdf2c6c46a742716c7290dc441559817c841361c3e33c39fd +size 80204 diff --git a/evals/registry/data/skill_acquisition/miskito/qa_pairs_by_lesson.jsonl b/evals/registry/data/skill_acquisition/miskito/qa_pairs_by_lesson.jsonl new file mode 100644 index 0000000000..226af11348 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/qa_pairs_by_lesson.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:92c631af79044257aea396b250a93eb466d404d637c6c0fc764a30763576f5ea +size 32651 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all.jsonl new file mode 100644 index 0000000000..d72e8b83fa --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5c9540f646ea2610874b3e33286e300cd92b70d91f6c00f5b0275f1be918b74a +size 38464 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl new file mode 100644 index 0000000000..8114c4e111 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2af150986f257a3e358d76b5a878d17116b583332eee303f0792fcffd1eee6d1 +size 37930 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl 
b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl new file mode 100644 index 0000000000..151136d565 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:51ec99e36e05dd2ee0f87f9177c4c4fc0155c744b1ed30d26bf4464ff7985e4f +size 28627 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl new file mode 100644 index 0000000000..6a5e655d82 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:62b665285283e232bbd670c32900078c77380b5c3c612d3fa11b4369e007edd5 +size 28201 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation.jsonl new file mode 100644 index 0000000000..491624d22c --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:19d8e9bbe4868479d2df0f6c7e72740399db5943dde1d3109c66affe878a62d8 +size 9836 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl new file mode 100644 index 0000000000..7e91554f6f --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:2a76625921e1810e4ce22ba76f4881ee9327e1522555cc9eccc6beb854b7a129 +size 9236 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl new file mode 100644 index 0000000000..17d04c41d2 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f099000b117f1d5d46778263674998e9785f3e993b4c32ca8afb5f82065e1afb +size 560 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl new file mode 100644 index 0000000000..480e645ad3 --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:625f3fa618e52688f6593774b7ba5691879882dbe9e3a8508a8aed43327f7d86 +size 425 diff --git a/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_translation.jsonl b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_translation.jsonl new file mode 100644 index 0000000000..4f83ef3e2d --- /dev/null +++ b/evals/registry/data/skill_acquisition/miskito/variants/miskito_train_translation.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8c2f3a4699303b85d49641ec17ec77bff600c75940078d735406db1539da90c4 +size 599 diff --git a/evals/registry/evals/skill_acquisition.yaml b/evals/registry/evals/skill_acquisition.yaml new file mode 100644 index 0000000000..2b4594efff --- /dev/null +++ 
b/evals/registry/evals/skill_acquisition.yaml @@ -0,0 +1,107 @@ +# --------------- +# Miskito dataset +# --------------- + +skill_acquisition.miskito: + id: skill_acquisition.miskito.zero_shot.full + metrics: [delta_accuracy] + description: Evaluates whether models can learn the Miskito language through retrieval. + +# Miskito manipulation + translation dataset, zero- and few-shot. +skill_acquisition.miskito.zero_shot.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_all.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.zero_shot.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_all.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 + +skill_acquisition.miskito.few_shot.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.few_shot.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_all_fewshot.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 + +# Miskito translation-only, zero- and few-shot. +skill_acquisition.miskito.zero_shot.translation.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_translation.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.zero_shot.translation.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_translation.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 + +skill_acquisition.miskito.few_shot.translation.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.few_shot.translation.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_translation_fewshot.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 + +# Miskito manipulation-only, zero- and few-shot. 
+skill_acquisition.miskito.zero_shot.manipulation.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.zero_shot.manipulation.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_manipulation.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 + +skill_acquisition.miskito.few_shot.manipulation.full: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl + target_language: miskito + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 +skill_acquisition.miskito.few_shot.manipulation.dev5: + class: evals.elsuite.skill_acquisition.eval:SkillAcquisition + args: + samples_jsonl: skill_acquisition/miskito/variants/miskito_test_manipulation_fewshot.jsonl + target_language: miskito + n_samples: 5 + knowledge_base_directory: skill_acquisition/miskito/knowledge_base/ + max_replies: 30 \ No newline at end of file diff --git a/evals/registry/solvers/skill_acquisition.yaml b/evals/registry/solvers/skill_acquisition.yaml new file mode 100644 index 0000000000..e187837297 --- /dev/null +++ b/evals/registry/solvers/skill_acquisition.yaml @@ -0,0 +1,287 @@ +# CoT solvers with a custom extraction prompt. +skill_acquisition/cot/gpt-3.5-turbo: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-3.5-turbo + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: &extract_template Given the above reasoning, what is the next action you wish to take? Please respond in the format required by the instructions. 
+ extract_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-3.5-turbo + extra_options: + temperature: 1 + max_tokens: 512 + +skill_acquisition/cot/gpt-4-turbo-preview: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-turbo-preview + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-turbo-preview + extra_options: + temperature: 1 + max_tokens: 512 + +skill_acquisition/cot/gemini-pro: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.providers.google.gemini_solver:GeminiSolver + args: + model_name: gemini-pro + extract_template: *extract_template + extract_solver: + class: evals.solvers.providers.google.gemini_solver:GeminiSolver + args: + model_name: gemini-pro + +skill_acquisition/cot/gpt-4: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4 + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4 + extra_options: + temperature: 1 + max_tokens: 512 + +skill_acquisition/cot_hhh/gpt-4-base: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.nested.hhh_solver:HHHSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-base + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.nested.hhh_solver:HHHSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-base + extra_options: + temperature: 1 + max_tokens: 512 + +skill_acquisition/assistants/gpt-4-turbo-preview: + class: evals.elsuite.skill_acquisition.solvers:SkillAcquisitionAssistantsSolver + args: + tools: + - type: code_interpreter + - type: retrieval + model: gpt-4-turbo-preview + +skill_acquisition/cot_assistant/gpt-4-turbo-preview: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.elsuite.skill_acquisition.solvers:SkillAcquisitionAssistantsSolver + args: + tools: + - type: code_interpreter + - type: retrieval + model: gpt-4-turbo-preview + extract_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-turbo-preview + extra_options: + temperature: 1 + max_tokens: 512 + +### Few-shot solvers. +# TODO: refactor few-shot solver so that train_jsonl is not parameterised here to reduce verbosity. +# Miskito full. 
+miskito_all/fewshot_direct/gpt-3.5-turbo: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-3.5-turbo + extra_options: + temperature: 1 + max_tokens: 512 + +miskito_all/fewshot_direct/gpt-4-turbo-preview: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-turbo-preview + extra_options: + temperature: 1 + max_tokens: 512 + +miskito_all/fewshot_direct/gpt-4-32k: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-32k + extra_options: + temperature: 1 + max_tokens: 512 + +miskito_all/fewshot_direct/gpt-4-base: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_all.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.nested.hhh_solver:HHHSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-base + extra_options: + temperature: 1 + max_tokens: 512 + +miskito_manipulation/fewshot_direct/gpt-4-32k: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-32k + extra_options: + temperature: 1 + max_tokens: 512 + +miskito_manipulation/fewshot_direct/gpt-4-base: + class: evals.solvers.nested.fewshot_solver:FewShotSolver + args: + train_jsonl: evals/registry/data/skill_acquisition/miskito/variants/miskito_train_manipulation.jsonl + n_shots: 3 + base_solver: + class: evals.solvers.nested.hhh_solver:HHHSolver + args: + solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-base + extra_options: + temperature: 1 + max_tokens: 512 + +# OS models +skill_acquisition/cot/llama-2-13b-chat: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: meta-llama/Llama-2-13b-chat-hf + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: meta-llama/Llama-2-13b-chat-hf + extra_options: + temperature: 1 + max_tokens: 512 + +skill_acquisition/cot/llama-2-70b-chat: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: meta-llama/Llama-2-70b-chat-hf + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: meta-llama/Llama-2-70b-chat-hf + extra_options: + 
temperature: 1 + max_tokens: 512 + +skill_acquisition/cot/mixtral-8x7b-instruct: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: mistralai/Mixtral-8x7B-Instruct-v0.1 + extra_options: + temperature: 1 + max_tokens: 512 + extract_template: *extract_template + extract_solver: + class: evals.solvers.together_solver:TogetherSolver + args: + completion_fn_options: + model: mistralai/Mixtral-8x7B-Instruct-v0.1 + extra_options: + temperature: 1 + max_tokens: 512
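For reviewers, here is a minimal sketch (not part of the patch) of the `[VIEW ...]` / `[ANSWER ...]` protocol implemented in `evals/elsuite/skill_acquisition/utils.py`. It assumes this PR's `evals` package is importable; the file name is taken from the knowledge base added above, while the section title "Lesson 1" is a hypothetical placeholder:

```python
# Illustrative sketch of how model replies are parsed by the skill_acquisition utils.
from evals.elsuite.skill_acquisition.utils import (
    answer_detected,
    process_answer,
    process_view_instruction,
    view_instruction_detected,
)

# A model turn asking to open a section of a knowledge-base file.
reply = "[VIEW miskito_lessons.jsonl Lesson 1]"
if view_instruction_detected(reply):
    file_name, section = process_view_instruction(reply)
    # -> ("miskito_lessons.jsonl", "Lesson 1"); the eval would then surface that
    #    section's content using the PROMPTS["present_content"] template.

# A model turn giving a final answer.
reply = "[ANSWER this is a bear]"
if answer_detected(reply):
    print(process_answer(reply))  # -> "this is a bear"
```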