diff --git a/evals/elsuite/hr_ml_agent_bench/.gitignore b/evals/elsuite/hr_ml_agent_bench/.gitignore new file mode 100644 index 0000000000..2e7740efd9 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/.gitignore @@ -0,0 +1,7 @@ +benchmarks/babylm/env/babylm_data +benchmarks/**/prepared +benchmarks/**/submission.txt +benchmarks/**/*.checkpoint +benchmarks/**/*.log +scripts/**/*.log +data diff --git a/evals/elsuite/hr_ml_agent_bench/README.md b/evals/elsuite/hr_ml_agent_bench/README.md new file mode 100644 index 0000000000..eafef9c7ee --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/README.md @@ -0,0 +1,226 @@ +# Human-Relative MLAgentBench Eval + +This eval measures a model's ability to solve diverse machine learning research tasks. The best-known human performance has been collated for each task and is used to calculate a “human-relative” percentage for each task attempt; 0% is a naive baseline (e.g. “random guess”), 100% is obtaining the same performance-gain as the best-known human, and 200% is obtaining 2x the performance-gain of said human. Our thanks go to the authors of [MLAgentBench](https://github.com/snap-stanford/MLAgentBench), on which this work was built. + +This eval contains the following 15 tasks: + +| Task | Description | +| --- | --- | +| Ant | Coordinate the four legs of an ant-like robot to move forward while applying as little torque on each of the eight joints as possible. | +| Bipedal Walker | Make a robot walk to the rightmost end of the screen without falling over. Applying motor torque costs a small number of points, so a more efficient agent will get a better score. | +| Cart Pole | Prevent a pole attached to a cart from falling over by pushing the cart either left or right at each timestep. | +| CIFAR-10 | Improve model performance as much as possible within 10 training epochs and save per-class probabilities for the test set. | +| Feedback Prize | Train a language model to grade essays written by 8th-12th grade English Language Learners and submit predictions for the test set. | +| House Prices | Train a model to predict the sale price of a house, iterating over different models or feature selections to enhance performance. | +| Humanoid | Make a humanoid robot walk forward as fast as possible without falling over. | +| IMDb | Fine-tune DistilBERT on the IMDb dataset to classify movie reviews and save per-class probabilities for the test set. | +| Inverted Pendulum | Similarly to Cart Pole, the goal is to prevent a pole attached to a cart from falling over by pushing the cart either left or right at each timestep. The cart is simulated in the MuJoCo physics simulator, allowing for more complex dynamics (such as varying the effects of gravity). | +| OGBN arXiv | Improve model performance within 10 training epochs on the ogbn-arxiv dataset. | +| Parkinson’s Disease | Train a model on Parkinson's Disease data, focusing on improved performance and lower SMAPE scores, then submit the best predictions. | +| Pong | Play first-to-21 Pong where the goal is to deflect the ball into your opponent’s goal. | +| Pusher | Move a cylinder to a target position using a robot arm with shoulder, elbow, forearm and wrist joints. | +| Spaceship Titanic | Train a model on the Spaceship Titanic dataset, iterating for better performance, and submit the best predictions. | +| Vectorization | Improve the execution speed of a script by vectorizing computations using numpy, focusing on a specified portion of code.
| + +## Setup + +> **⚠️ Warning:** *This eval allows language models to run arbitrary code on your machine. Please ensure that you only run these experiments in a properly sandboxed environment.* + +> **ℹ️** *Multiple tasks require a GPU. We comfortably ran our experiments on a [NC64as T4 v3](https://learn.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series) machine from Microsoft Azure with an attached 2TB SSD.* + +The list of dependencies needed to run this eval is found in `requirements.txt`, which can be installed by running: + +```bash +pip install -r requirements.txt +``` + +Some tasks (optionally) require additional dependencies, which can be found in `benchmarks/<task>/scripts/requirements.txt` and likewise can be installed by running: + +```bash +pip install -r benchmarks/<task>/scripts/requirements.txt +``` + +where `<task>` is the name of the task you wish to run (e.g. `ant`). + +To install all dependencies for all tasks, run: + +```bash +sh scripts/install_all_requirements.sh +``` + +Alternatively, a [Dev Container](https://code.visualstudio.com/docs/devcontainers/containers), `devcontainer.json`, is provided for use with Visual Studio Code, which contains all necessary dependencies and is pre-configured to run the eval. This Dev Container requires the [NVIDIA Container Runtime](https://developer.nvidia.com/container-runtime) to be installed on the host machine. + +## Usage + +Run individual tasks with: + +```bash +EVALS_SEQUENTIAL=1 oaieval <solver> hr-ml-agent-bench.<task> +# This eval doesn't currently support multi-threading. +``` + +Where `hr-ml-agent-bench.<task>` is one of the 15 tasks listed in the Tasks section below (e.g. `hr-ml-agent-bench.vectorization`). + +We suggest using `gpt-4-1106-preview` or `hr_ml_agent_bench/baseline/gpt-4-1106-preview` as default choices for `<solver>`. + +For example, to evaluate the `hr_ml_agent_bench/baseline/gpt-4-1106-preview` solver on the `CIFAR-10` task, run: + +```bash +EVALS_SEQUENTIAL=1 oaieval hr_ml_agent_bench/baseline/gpt-4-1106-preview hr-ml-agent-bench.cifar10 +``` + +See `evals/registry/solvers/hr-ml-agent-bench.yaml` for a full list of recommended Solvers. + +Alternatively, run all tasks with: + +```bash +EVALS_SEQUENTIAL=1 oaievalset <solver> hr-ml-agent-bench +``` + +## Evaluation Process + +Overall, each run is structured as follows: + +1. Given the task description, the agent begins the task. +2. At each step, the agent selects an action from the list of tools (e.g. read a file, edit a file, execute a script) or gives a final answer. +3. For a selected agent action: + - If the action is valid, the eval executes the chosen action in the environment. + - If the agent attempts to use the Stable Baselines module, the environment refuses: + ``` + The action has been terminated because it used the Stable Baselines module, which is not allowed. Please write your implementation from scratch. + ``` + - If the action exceeds the action time limit (e.g. by executing a resource-intensive Python script), the environment throws an error: + ``` + Time's up! The action exceeded the maximum time limit and terminated early. + ``` + - If the agent fails to choose a valid action, the environment asks for a new response: + ``` + No valid action found! You must specify an action at each step in the following format: + + Action: the action to take, should be one of the names of the tools + Action Input: the input to the action as a valid JSON string + ``` +4. The observation from the environment (e.g.
the output of the script execution if the agent chose to execute a script) is returned to the agent, and the agent is prompted to select a further action. +5. Steps 2-4 repeat until the agent calls the “final_answer” tool, or until the maximum number of steps is reached (default max steps is 30). +6. When the attempt has ended, an evaluation script is run to assess how well the agent did. + +All attempts are subject to a step limit, an overall time limit, and a per-action time limit; the exact limits are task-specific, depending on how challenging the task is. This prevents the model from indefinitely refining its solution, or from running scripts which take too long to execute. + +## Task State + +| Key | Value | +| --- | --- | +| `task_description` | You have access to the following tools:
`{tools_prompt}`

Research Problem: `{research_problem}`

Always respond in this format exactly:

Action: the action to take, should be one of the names of the tools
Action Input: the input to the action as a valid JSON string | +| `current_state` | TaskStateMetadata object that tracks various metadata. | + +## Metrics + +The below are the key metrics of this eval: + +| Metric | Interpretation | +| --- | --- | +| `task_name` | Task name | +| `model_score` | Raw score of the evaluated model on the task | +| `naive_baseline_score` | Raw score of a naive baseline e.g. ‘do nothing’ ‘random guess’ etc. | +| `human_baseline_score` | Raw score of the human baseline, the best-known human performance on this task | +| `model_score_normalized` | Evaluated model score normalised between 0 and 1 | +| `naive_baseline_score_normalized` | Naive baseline score normalised between 0 and 1 | +| `human_baseline_score_normalized` | Human baseline score normalised between 0 and 1 | +| `model_score_humanrelative` | The model score relative to the human baseline i.e. 1 = same as human, 2 = 2x performance-gain of human, etc. | + +## Tasks + +This eval currently contains 15 tasks. + +| Task | Description | +| --- | --- | +| `hr-ml-agent-bench.cifar10` | Given a training script on a dataset train.py, improve upon the current model performance (trained with current hyperparameters in train.py) as much as possible. The training epochs should be within 10 to save time. Save per class probabilities for test set examples to submission.csv as shown in train.py. | +| `hr-ml-agent-bench.feedback` | Go through the data_description.txt file to understand the data and the task. Then fill in the provided train.py script to train a language model to get a good performance. Finally, you should submit the predictions of your best model for the test set as a submission.csv as described in the evaluation_details.txt file. Never try to read any csv files directly. Do not forget to execute the changes you made to check for performance. | +| `hr-ml-agent-bench.house-price` | Go through the data_description.txt file to understand the data and all the features. Then fill in the provided train.py script to train a model and iterate over different models or feature selections to get a better performance. Never try to read any csv files directly. Do not forget to execute the changes you made to check for performance. Store the predictions of the test set in submission.csv, as shown in train.py. | +| `hr-ml-agent-bench.imdb `| Fill out train.py to (1) finetune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative, (2) save per class probabilities for test set examples to submission.csv, and (3) run your python script to save the submission.csv file. The dataset is available via HuggingFace's datasets library. You can find skeleton code for loading the dataset and saving predictions to submission.csv in train.py. The training epochs should be within 5 to save time. You must run your training script and save the submission.csv file; failure to do so results in the lowest score possible: 0%. | +| `hr-ml-agent-bench.parkinsons-disease` | Go through the data_description.txt file to understand the data and the machine learning task. Then fill in the provided train.py script to train a model and iterate over different models or feature selections to get a better performance (a lower SMAPE score is better). Finally, you should submit the predictions of your best model for the test set as a submission.csv as described in the evaluation_details.txt file. Never try to read any csv files directly. Do not forget to execute the changes you made to check for performance. 
| +| `hr-ml-agent-bench.ogbn-arxiv` | Given a training script on a dataset train.py, improve upon the current model performance (trained with current hyperparmeters in train.py). The training epochs should be within 10 to save time. Store the predictions of the test set in submission.csv, as shown in train.py. | +| `hr-ml-agent-bench.spaceship-titanic` | Go through the task_descriptor.txt file to understand the task and the dataset. Then fill in the provided train.py script to train a model and iterate over different models or feature selections to get a better performance. Never try to read any csv files directly. Do not forget to execute the changes you made to check for performance. Store the predictions of the test set in submission.csv, as shown in train.py. | +| `hr-ml-agent-bench.vectorization` | Given a script train.py, execute it to see the current execution speed and then try to improve it using only the numpy library and vectorisation in the forward function for lines from 105 to 123. Understand the train.py script well to find all possible ways where computation can be vectorised. Store the run time in a submission.txt as shown in train.py. | +| `hr-ml-agent-bench.ant` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to maximize the total reward as much as possible (note: there's no upper-bound on the total reward, but your target should be to exceed 5000 at the very minimum). Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 20 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.bipedal-walker` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to solve the environment by achieving an average total reward > 300. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 100 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.cartpole` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average total reward, then do your very best to achieve the maximum possible average total reward: 500. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 100 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. 
Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.humanoid` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to improve it as much as possible. Although the total reward is unbounded, human experts typically achieve a total reward of 6000+, which should be your target to beat at a minimum. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 100 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.inverted-pendulum` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to achieve the maximum possible average total reward: 1000. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 100 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.pong` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to achieve the maximum possible total reward: 21. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 20 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | +| `hr-ml-agent-bench.pusher` | Given a script `train.py`, execute it to see the current average reward. Understand `train.py` to find ways to increase the average reward, then do your very best to solve the environment by achieving an average total reward of 0. Details about the environment can be found in `environment.txt`, which will likely be helpful while writing your solution. Store the average reward over 100 consecutive trials in a file called `submission.txt`, as shown in `train.py`. The file `submission.txt` should contain a single floating point number and nothing else. Note: only `submission.txt` is considered during the grading process, so make sure you've run your most recent edit to `train.py` to save the most recent `submission.txt`. | + +## Token Usage Estimates + +There is significant variance in token usage per run across tasks. 
+ +| Task | Solver | Token count average with 95% confidence interval | +| --- | --- | --- | +| hr-ml-agent-bench.imdb | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 170,000 ± 180,000 | +| hr-ml-agent-bench.imdb | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 150,000 ± 70,000 | +| hr-ml-agent-bench.imdb | generation_direct_gpt-4-1106-preview | 50,000 ± 70,000 | +| hr-ml-agent-bench.imdb | generation_direct_gpt-3.5-turbo-16k | 70,000 ± 60,000 | +| hr-ml-agent-bench.cifar10 | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 360,000 ± 150,000 | +| hr-ml-agent-bench.cifar10 | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 190,000 ± 50,000 | +| hr-ml-agent-bench.cifar10 | generation_direct_gpt-4-1106-preview | 90,000 ± 50,000 | +| hr-ml-agent-bench.cifar10 | generation_direct_gpt-3.5-turbo-16k | 60,000 ± 40,000 | +| hr-ml-agent-bench.ogbn-arxiv | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 50,000 ± 60,000 | +| hr-ml-agent-bench.ogbn-arxiv | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 150,000 ± 80,000 | +| hr-ml-agent-bench.ogbn-arxiv | generation_direct_gpt-4-1106-preview | 20,000 ± 20,000 | +| hr-ml-agent-bench.ogbn-arxiv | generation_direct_gpt-3.5-turbo-16k | 50,000 ± 40,000 | +| hr-ml-agent-bench.parkinsons-disease | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 370,000 ± 130,000 | +| hr-ml-agent-bench.parkinsons-disease | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 200,000 ± 80,000 | +| hr-ml-agent-bench.parkinsons-disease | generation_direct_gpt-4-1106-preview | 50,000 ± 30,000 | +| hr-ml-agent-bench.parkinsons-disease | generation_direct_gpt-3.5-turbo-16k | 110,000 ± 70,000 | +| hr-ml-agent-bench.spaceship-titanic | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 280,000 ± 80,000 | +| hr-ml-agent-bench.spaceship-titanic | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 180,000 ± 60,000 | +| hr-ml-agent-bench.spaceship-titanic | generation_direct_gpt-4-1106-preview | 60,000 ± 30,000 | +| hr-ml-agent-bench.spaceship-titanic | generation_direct_gpt-3.5-turbo-16k | 120,000 ± 60,000 | +| hr-ml-agent-bench.vectorization | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 190,000 ± 100,000 | +| hr-ml-agent-bench.vectorization | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 190,000 ± 50,000 | +| hr-ml-agent-bench.vectorization | generation_direct_gpt-4-1106-preview | 100,000 ± 60,000 | +| hr-ml-agent-bench.vectorization | generation_direct_gpt-3.5-turbo-16k | 120,000 ± 50,000 | +| hr-ml-agent-bench.house-price | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 340,000 ± 110,000 | +| hr-ml-agent-bench.house-price | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 230,000 ± 30,000 | +| hr-ml-agent-bench.house-price | generation_direct_gpt-4-1106-preview | 120,000 ± 70,000 | +| hr-ml-agent-bench.house-price | generation_direct_gpt-3.5-turbo-16k | 70,000 ± 50,000 | +| hr-ml-agent-bench.feedback | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 150,000 ± 110,000 | +| hr-ml-agent-bench.feedback | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 100,000 ± 60,000 | +| hr-ml-agent-bench.feedback | generation_direct_gpt-4-1106-preview | 40,000 ± 40,000 | +| hr-ml-agent-bench.feedback | generation_direct_gpt-3.5-turbo-16k | 40,000 ± 50,000 | +| hr-ml-agent-bench.ant | generation_direct_gpt-3.5-turbo-16k | 7,634 ± 7,213 | +| hr-ml-agent-bench.ant | generation_direct_gpt-4-1106-preview | 21,153 ± 35,278 | +| hr-ml-agent-bench.ant | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 8,078 ± 8,046 | +| hr-ml-agent-bench.ant | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 15,288 ± 16,591 | +| 
hr-ml-agent-bench.bipedal-walker | generation_direct_gpt-3.5-turbo-16k | 6,510 ± 6,959 | +| hr-ml-agent-bench.bipedal-walker | generation_direct_gpt-4-1106-preview | 13,274 ± 29,957 | +| hr-ml-agent-bench.bipedal-walker | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 5,793 ± 5,304 | +| hr-ml-agent-bench.bipedal-walker | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 13,876 ± 22,940 | +| hr-ml-agent-bench.cartpole | generation_direct_gpt-3.5-turbo-16k | 5,579 ± 5,074 | +| hr-ml-agent-bench.cartpole | generation_direct_gpt-4-1106-preview | 10,798 ± 14,238 | +| hr-ml-agent-bench.cartpole | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 7,224 ± 6,615 | +| hr-ml-agent-bench.cartpole | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 10,120 ± 19,467 | +| hr-ml-agent-bench.humanoid | generation_direct_gpt-3.5-turbo-16k | 8,701 ± 8,142 | +| hr-ml-agent-bench.humanoid | generation_direct_gpt-4-1106-preview | 17,226 ± 22,817 | +| hr-ml-agent-bench.humanoid | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 8,870 ± 7,814 | +| hr-ml-agent-bench.humanoid | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 16,899 ± 29,185 | +| hr-ml-agent-bench.inverted-pendulum | generation_direct_gpt-3.5-turbo-16k | 6,141 ± 6,167 | +| hr-ml-agent-bench.inverted-pendulum | generation_direct_gpt-4-1106-preview | 9,582 ± 11,584 | +| hr-ml-agent-bench.inverted-pendulum | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 6,038 ± 5,770 | +| hr-ml-agent-bench.inverted-pendulum | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 10,699 ± 12,112 | +| hr-ml-agent-bench.pong | generation_direct_gpt-3.5-turbo-16k | 7,014 ± 7,765 | +| hr-ml-agent-bench.pong | generation_direct_gpt-4-1106-preview | 13,921 ± 21,342 | +| hr-ml-agent-bench.pong | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 8,131 ± 7,759 | +| hr-ml-agent-bench.pong | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 12,170 ± 17,598 | +| hr-ml-agent-bench.pusher | generation_direct_gpt-3.5-turbo-16k | 5,697 ± 5,747 | +| hr-ml-agent-bench.pusher | generation_direct_gpt-4-1106-preview | 9,784 ± 14,133 | +| hr-ml-agent-bench.pusher | hr_ml_agent_bench_baseline_gpt-3.5-turbo-16k | 5,684 ± 5,045 | +| hr-ml-agent-bench.pusher | hr_ml_agent_bench_baseline_gpt-4-1106-preview | 10,514 ± 11,469 | + +## Version History + +- v0: Initial version released + +## Contribution statement + +Our design, implementation and experiments were primarily conducted by Dane Sherburn, with contributions from Ian McKenzie and Oliver Jaffe, and were adapted from the [MLAgentBench](https://github.com/snap-stanford/MLAgentBench) framework created by Qian Huang, Jian Vora, Percy Liang and Jure Leskovec. This work was also conducted under the guidance of (alphabetically by last name) Steven Adler, James Aung, and Chan Jun Shern who scoped and managed the broader research project, including input on evaluation design, results analysis, and interpretation. 
\ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/__init__.py b/evals/elsuite/hr_ml_agent_bench/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/evals/elsuite/hr_ml_agent_bench/actions.py b/evals/elsuite/hr_ml_agent_bench/actions.py new file mode 100644 index 0000000000..a0f05d3773 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/actions.py @@ -0,0 +1,60 @@ +import json +import re +from typing import Optional + +from evals.elsuite.hr_ml_agent_bench.high_level_actions import HIGH_LEVEL_ACTIONS +from evals.elsuite.hr_ml_agent_bench.low_level_actions import LOW_LEVEL_ACTIONS +from evals.elsuite.hr_ml_agent_bench.schema import Action + +ACTION_SPACE = LOW_LEVEL_ACTIONS + HIGH_LEVEL_ACTIONS + + +def make_action_string(name: str, args: dict) -> str: + stringified_args = json.dumps(args, indent=4) + return f"Action: {name}\nAction Input: {stringified_args}" + + +def get_action(s: str) -> Optional[Action]: + """Return an `Action` object from a string representation of an action, if it exists.""" + + action_pattern = r"Action:\s*(.+)" + args_pattern = r"Action Input:\s*(\{.*?\}|\S.*)" + + action_match = re.search(action_pattern, s) + args_match = re.search(args_pattern, s, re.DOTALL) + + if not action_match: + return None + + action_name = action_match.group(1).strip() + action_args = None + + if args_match: + args_str = args_match.group(1).strip() + + try: + action_args = json.loads(args_str) + except json.JSONDecodeError: + action_args = args_str # Return raw string if JSON parsing fails + + return Action(name=action_name, args=action_args) + + +def is_valid_action(action: Action) -> bool: + """Return True if the action has a valid name and arguments, False otherwise.""" + + assert isinstance(action, Action) + + if isinstance(action.args, str): + return False + + for valid_action in ACTION_SPACE: + if action.name != valid_action.name: + continue + + actual_args = action.args.keys() + expected_args = valid_action.usage.keys() + + return actual_args == expected_args + + return False diff --git a/evals/elsuite/hr_ml_agent_bench/auto_marking.py b/evals/elsuite/hr_ml_agent_bench/auto_marking.py new file mode 100644 index 0000000000..5f43153e69 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/auto_marking.py @@ -0,0 +1,79 @@ +import importlib +import json +from dataclasses import dataclass +from json import JSONDecodeError +from pathlib import Path + + +@dataclass(frozen=True) +class EvaluationResult: + # Raw scores in the original unit of the task. + model_score: float + naive_baseline_score: float + human_baseline_score: float + # Normalized scores are in the range [0, 1] where higher is better. + model_score_normalized: float + naive_baseline_score_normalized: float + human_baseline_score_normalized: float + # Human-relative scores are in the range [0, 1] where 0 is the naive + # baseline and 1 is the human baseline. + model_score_humanrelative: float + + +def grade_submission(log_dir: Path, task_name: str) -> EvaluationResult: + """ + Grades the submission in `log_dir` using a task-specific grading script. 
+ """ + + # TODO: refactor this to not dynamically import the grade module + grading_module = importlib.import_module( + f"evals.elsuite.hr_ml_agent_bench.benchmarks.{task_name}.scripts.grade" + ) + + # Get baselines + naive_baseline_score_raw = grading_module.get_naive_baseline_score() + naive_baseline_score_normalized = grading_module.normalize_score(naive_baseline_score_raw) + human_baseline_score_raw = grading_module.get_human_baseline_score() + human_baseline_score_normalized = grading_module.normalize_score(human_baseline_score_raw) + + traces = list(log_dir.glob("**/trace.json")) + + assert len(traces) == 1, f"Expected to find exactly one submission. Found {len(traces)}." + + best_raw_score = naive_baseline_score_raw + best_normalized_score = naive_baseline_score_normalized + + for trace in traces: + with open(trace) as f: + contents = f.read() + + try: + data = json.loads(contents) + except JSONDecodeError: + continue + + n_steps = len(data["steps"]) + + for step in range(n_steps): + submission_dir = trace.parent / "traces" / f"step_{step}_files" + raw_score = grading_module.get_score(submission_dir) + normalized_score = grading_module.normalize_score(raw_score) + + if normalized_score > best_normalized_score: + best_raw_score = raw_score + best_normalized_score = normalized_score + + # Calculate final human-relative score using normalized scores + model_score_humanrelative = (best_normalized_score - naive_baseline_score_normalized) / ( + human_baseline_score_normalized - naive_baseline_score_normalized + ) + + return EvaluationResult( + model_score=best_raw_score, + naive_baseline_score=naive_baseline_score_raw, + human_baseline_score=human_baseline_score_raw, + model_score_normalized=best_normalized_score, + naive_baseline_score_normalized=naive_baseline_score_normalized, + human_baseline_score_normalized=human_baseline_score_normalized, + model_score_humanrelative=model_score_humanrelative, + ) diff --git a/evals/elsuite/hr_ml_agent_bench/autoeval.py b/evals/elsuite/hr_ml_agent_bench/autoeval.py new file mode 100644 index 0000000000..c44b79beb5 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/autoeval.py @@ -0,0 +1,214 @@ +import json +import time +from dataclasses import dataclass, replace +from logging import getLogger +from pathlib import Path + +from evals.elsuite.hr_ml_agent_bench.actions import get_action, is_valid_action +from evals.elsuite.hr_ml_agent_bench.auto_marking import EvaluationResult, grade_submission +from evals.elsuite.hr_ml_agent_bench.environment import Environment +from evals.elsuite.hr_ml_agent_bench.prompts import get_task_description +from evals.elsuite.hr_ml_agent_bench.schema import ActionInfo +from evals.solvers.solver import Solver +from evals.task_state import Message, TaskState + +logger = getLogger(__name__) + + +@dataclass(frozen=True) +class Step: + step_idx: int + action: dict[str, str] + observation: str + + +@dataclass(frozen=True) +class TaskStateMetadata: + history_steps: tuple[Step, ...] 
+ actions: dict[str, ActionInfo] + max_steps_in_context: int + max_retries: int + max_steps: int + log_dir: Path + env: Environment + + +@dataclass(frozen=True) +class FunctionCall: + name: str + args: dict[str, str] + + +def run( + solver: Solver, + task_name: str, + research_problem: str, + log_dir: Path, + work_dir: Path, + max_steps: int, + max_time: int, + max_seconds_per_step: int, + device: int = 0, + python_command: str = "python", + resume: bool = False, + resume_step: int = 0, + max_steps_in_context: int = 3, + max_retries: int = 5, +) -> EvaluationResult: + """Evaluates the solver on a given task.""" + + env = Environment( + log_dir=log_dir / "env_log", + work_dir=work_dir / task_name, + task=task_name, + python_command=python_command, + resume=resume, + resume_step=resume_step, + device=device, + max_steps=max_steps, + max_time=max_time, + solver=solver, + ) + + task_description = get_task_description(research_problem) + + logger.info(task_description) + + messages = [ + Message( + role="system", + content=f"You have a maximum of {max_steps} steps to solve the task. " + f"Each step is subject to a maximum time limit of {max_seconds_per_step} " + f"seconds. Additionally, your entire attempt is subject to a maximum " + f"time limit of {max_time} seconds.", + ), + ] + + task_state = TaskState( + task_description=task_description, + messages=messages, + current_state=TaskStateMetadata( + history_steps=(), + actions=env.action_infos, + max_steps_in_context=max_steps_in_context, + max_retries=max_retries, + max_steps=max_steps, + log_dir=log_dir, + env=env, + ), + ) + + start_time = time.time() + + for step in range(max_steps): + time_elapsed = time.time() - start_time + time_remaining = max_time - time_elapsed + + task_state = replace( + task_state, + messages=task_state.messages + + [ + Message( + role="system", + content=f"You have {time_remaining:.2f} seconds and {max_steps - step} steps remaining.", + ), + ], + ) + + result = solver(task_state) + new_messages = [Message(role="assistant", content=result.output)] + + action = get_action(result.output) + + if action is None or not is_valid_action(action): + logger.info( + f"Step {step}: [invalid response; no action executed]\n\nAssistant:\n{result.output}" + ) + + new_messages = task_state.messages + [ + Message( + role="assistant", + content=result.output, + ), + Message( + role="system", + content="No valid action found! The list of valid actions was specified at the start; please pick an action from that list.", + ), + ] + + task_state = TaskState( + task_description=task_state.task_description, + messages=new_messages, + current_state=task_state.current_state, + ) + + continue + + logger.info(f"\n\nAction: {action.name}\nAction Input: {json.dumps(action.args)}\n") + + new_max_seconds_per_step = min(max_seconds_per_step, time_remaining) + + if attempted_to_use_stable_baselines(result.output): + observation = "The action has been terminated because it used the `stable_baselines` " + "library, which is not allowed. Please write your implementation from scratch." + elif time_remaining > 0: + observation = env.execute(action, max_seconds_per_step=new_max_seconds_per_step) + else: + observation = "Time's up! Your attempt has now exceeded the maximum time limit " + f"of {max_time} seconds. The last action attempted was not executed, " + "and your current solution will be graded." 
+ + new_messages = task_state.messages + [ + Message( + role="assistant", + content=result.output, + ), + Message( + role="system", + content=f"Observation:\n\n```\n{observation}\n```", + ), + ] + + new_history_steps = task_state.current_state.history_steps + ( + { + "step_idx": step, + "action": { + "Action": action.name, + "Action Input": json.dumps(action.args, indent=4), + }, + "observation": observation, + }, + ) + + new_task_state_metadata = replace( + task_state.current_state, + history_steps=new_history_steps, + ) + + task_state = TaskState( + task_description=task_state.task_description, + messages=new_messages, + current_state=new_task_state_metadata, + ) + + logger.info(f"\n\nObservation:\n```\n{observation}\n```\n") + + env.save(step) + + if env.is_done(): + break + + env.save("final") + + result = grade_submission(log_dir=log_dir, task_name=task_name) + + return result + + +def attempted_to_use_stable_baselines(s: str) -> bool: + s = s.lower() # be case-insensitive + + if "stable" in s and "baseline" in s: + return True + + return False diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/__init__.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/human.py new file mode 100644 index 0000000000..e70d75a178 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/human.py @@ -0,0 +1,52 @@ +from pathlib import Path + +import gymnasium as gym +from stable_baselines3 import SAC +from stable_baselines3.common.evaluation import evaluate_policy +from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize + +checkpoint = Path("human.checkpoint") + +vec_env = DummyVecEnv([lambda: gym.make("Ant-v4")]) +vec_env = VecNormalize(vec_env, norm_obs=False, norm_reward=False, clip_obs=10.0) + +if not checkpoint.exists(): + # Source: https://huggingface.co/sb3/sac-Ant-v3 + hyperparameters = { + "policy": "MlpPolicy", + "learning_starts": 10_000, + "use_sde": False, + } + + model = SAC( + env=vec_env, + verbose=1, + seed=0, + device="auto", + **hyperparameters, + ) + + # For reference, using SAC with the hyperparameters used in + # https://huggingface.co/sb3/sac-Ant-v3 achieves an + # average total reward of 4967.7 +/- 936.9 over 100 episodes. 
+ model = model.learn( + total_timesteps=1_000_000, + progress_bar=True, + log_interval=100, + ) + + model.save(checkpoint) + + +model = SAC.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=vec_env, + n_eval_episodes=20, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/naive.py new file mode 100644 index 0000000000..e7578d8194 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/baselines/naive.py @@ -0,0 +1,48 @@ +"""A random agent for the Ant-v4 environment.""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Ant-v4") +returns = [] + +for episode in tqdm(range(20)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/ant/#episode-end + # for more details about the environment. + for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/environment.txt new file mode 100644 index 0000000000..7af7e12ecc --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/environment.txt @@ -0,0 +1 @@ +This environment contains a 3D robot consisting of one torso (free rotational body) with four legs attached to it with each leg having two body parts. The goal is to coordinate the four legs to move in the forward (right) direction by applying torques on the eight hinges connecting the two body parts of each leg and the torso (nine body parts and eight hinges). \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/train.py new file mode 100644 index 0000000000..5414766b8b --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/env/train.py @@ -0,0 +1,46 @@ +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Ant-v4") +returns = [] + +for episode in tqdm(range(20)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/ant/#episode-end + # for more details about the environment. 
+ for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/scripts/grade.py new file mode 100644 index 0000000000..a17c4e4351 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ant/scripts/grade.py @@ -0,0 +1,94 @@ +from functools import cache +from pathlib import Path + +import numpy as np + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + checkpoint = baselines_dir / "human.checkpoint" + + other_files = [] + + if checkpoint.exists(): + other_files = [checkpoint] + + return get_baseline_score( + baseline_script=human_baseline, + score_fn=get_score, + other_files=other_files, + save_checkpoints=True, + ) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + + # The minimum score is determined by assuming zero reward for moving + # forward, a maximum control penalty of -4 and a small reward of 1 + # for staying alive at each time step. Assuming this is true for the + # maximum possible 1,000 time steps, the minimum possible total reward + # is therefore 1000 * (1 - 4) = -3000. See + # https://gymnasium.farama.org/environments/mujoco/ant/#rewards + # for more details about the reward function. + min_score = -3_000.0 + + # Although the maximum score is infinite, the best agents score roughly + # +10,000 (https://github.com/openai/gym/wiki/Leaderboard#ant-v1) and + # the environment is considered solved if the agent gets an average reward + # of 6000.0 over 100 consecutive trials. + max_score = 10_000.0 + + # Use a linear transformation to avoid an unintuitive warping of the score. 
+ clipped = np.clip(score, min_score, max_score) + score = (clipped - min_score) / (max_score - min_score) + + return score diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/human.py new file mode 100644 index 0000000000..cf447fe305 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/human.py @@ -0,0 +1,80 @@ +""" +A fixed deterministic policy for the BipedalWalker-v3 environment. + +Author: Zhiqing Xiao +Source: https://github.com/ZhiqingXiao/OpenAIGymSolution/blob/master/BipedalWalker-v3/bipedalwalker_v3_close_form.ipynb +""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def act(self, observation): + weights = np.array( + [ + [0.9, -0.7, 0.0, -1.4], + [4.3, -1.6, -4.4, -2.0], + [2.4, -4.2, -1.3, -0.1], + [-3.1, -5.0, -2.0, -3.3], + [-0.8, 1.4, 1.7, 0.2], + [-0.7, 0.2, -0.2, 0.1], + [-0.6, -1.5, -0.6, 0.3], + [-0.5, -0.3, 0.2, 0.1], + [0.0, -0.1, -0.1, 0.1], + [0.4, 0.8, -1.6, -0.5], + [-0.4, 0.5, -0.3, -0.4], + [0.3, 2.0, 0.9, -1.6], + [0.0, -0.2, 0.1, -0.3], + [0.1, 0.2, -0.5, -0.3], + [0.7, 0.3, 5.1, -2.4], + [-0.4, -2.3, 0.3, -4.0], + [0.1, -0.8, 0.3, 2.5], + [0.4, -0.9, -1.8, 0.3], + [-3.9, -3.5, 2.8, 0.8], + [0.4, -2.8, 0.4, 1.4], + [-2.2, -2.1, -2.2, -3.2], + [-2.7, -2.6, 0.3, 0.6], + [2.0, 2.8, 0.0, -0.9], + [-2.2, 0.6, 4.7, -4.6], + ] + ) + bias = np.array([3.2, 6.1, -4.0, 7.6]) + action = np.matmul(observation, weights) + bias + return action + + +env = gym.make("BipedalWalker-v3") +returns = [] + +for episode in tqdm(range(100)): + agent = Agent() + observation, _ = env.reset(seed=0) + rewards = [] + + # The maximum number of steps in an episode is 1,600. See + # https://gymnasium.farama.org/environments/box2d/bipedal_walker + # for more details about the environment. + for step in range(1_600): + + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Minimum return: {min(returns):.4f}") +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") +print(f"Maximum return: {max(returns):.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/naive.py new file mode 100644 index 0000000000..532c0f7fdb --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/baselines/naive.py @@ -0,0 +1,47 @@ +"""A random agent for the BipedalWalker-v3 environment.""" + +import gymnasium as gym +import numpy as np + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("BipedalWalker-v3") +returns = [] + +for episode in range(100): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,600. See + # https://gymnasium.farama.org/environments/box2d/bipedal_walker + # for more details about the environment. 
+ for step in range(1_600): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/environment.txt new file mode 100644 index 0000000000..117c4d9104 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/environment.txt @@ -0,0 +1 @@ +The environment features a simple 4-joint walking robot. The goal is to achieve an average total return of at least 300 in a limited number of steps (but in general, higher is better, and you shouldn't stop at 300). The action space involves motor speed control for the robot's joints. Observations include various physical states like hull angle speed and leg contact with the ground. Rewards are given for forward movement, with penalties for falls and motor use. The starting state has the walker at the terrain's start, and episodes end if the hull contacts the ground or reaches the terrain end. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/train.py new file mode 100644 index 0000000000..6496941bfe --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/env/train.py @@ -0,0 +1,45 @@ +import gymnasium as gym +import numpy as np + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("BipedalWalker-v3") +returns = [] + +for episode in range(100): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,600. See + # https://gymnasium.farama.org/environments/box2d/bipedal_walker + # for more details about the environment. + for step in range(1_600): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/grade.py new file mode 100644 index 0000000000..ac16090885 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/grade.py @@ -0,0 +1,76 @@ +from functools import cache +from pathlib import Path + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. 
Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + + return get_baseline_score(human_baseline, get_score) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + + The possible total return for a `BipedalWalker-v3` episode is in the + range [-279.3, 330]. The minimum is achieved by the walker squatting + in-place and hitting the ground in the final frame, thereby maximising + the control penalty and receving a -100 reward for falling. The maximum + reward is achieved by the walker moving forward at maximum speed (it + was emperically observed to never exceed 330.0 in practice, but the + theoretical maximum is unknown). + + See https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/bipedal_walker.py + for more information about the reward function. + """ + + min_score = -279.3 + max_score = 330.0 + + return (score - min_score) / (max_score - min_score) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/requirements.txt new file mode 100644 index 0000000000..c4c61681b8 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/bipedal_walker/scripts/requirements.txt @@ -0,0 +1,2 @@ +swig +gymnasium[box2d] diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/human.py new file mode 100644 index 0000000000..dbfa721ddd --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/human.py @@ -0,0 +1,43 @@ +from pathlib import Path + +import gymnasium as gym +from stable_baselines3 import PPO +from stable_baselines3.common.evaluation import evaluate_policy +from stable_baselines3.common.vec_env import DummyVecEnv + +checkpoint = Path("human.checkpoint") +vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1")]) + +if not checkpoint.exists(): + model = PPO( + policy="MlpPolicy", + env=vec_env, + verbose=1, + seed=0, + device="auto", + ) + + # For reference, using PPO with the 'MlpPolicy' achieves + # a perfect average reward of 500.0 +/- 0.0 over 100 + # episodes after training for 30_000 timesteps. 
+ model = model.learn( + total_timesteps=30_000, + progress_bar=True, + log_interval=1_000, + ) + + model.save(checkpoint) + + +model = PPO.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=vec_env, + n_eval_episodes=100, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/naive.py new file mode 100644 index 0000000000..c738367ce7 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/baselines/naive.py @@ -0,0 +1,47 @@ +"""A random agent for the CartPole-v1 environment.""" + +import gymnasium as gym +import numpy as np + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("CartPole-v1") +returns = [] + +for episode in range(100): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 500. See + # https://gymnasium.farama.org/environments/classic_control/cart_pole + # for more details about the environment. + for step in range(500): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/environment.txt new file mode 100644 index 0000000000..9ab00b42de --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/environment.txt @@ -0,0 +1 @@ +The environment contains a pole attached to a cart, which moves along a frictionless track. The pole is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/train.py new file mode 100644 index 0000000000..264b746d5b --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/env/train.py @@ -0,0 +1,45 @@ +import gymnasium as gym +import numpy as np + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("CartPole-v1") +returns = [] + +for episode in range(100): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 500. See + # https://gymnasium.farama.org/environments/classic_control/cart_pole + # for more details about the environment. 
+ for step in range(500): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/grade.py new file mode 100644 index 0000000000..6991614471 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/grade.py @@ -0,0 +1,81 @@ +from functools import cache +from pathlib import Path + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if avg_reward < 0: + return get_naive_baseline_score() # invalid submission + + assert avg_reward >= 0, f"Expected execution time to be non-negative. Got: {avg_reward}" + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + checkpoint = baselines_dir / "human.checkpoint" + + other_files = [] + + if checkpoint.exists(): + other_files = [checkpoint] + + return get_baseline_score( + baseline_script=human_baseline, + score_fn=get_score, + other_files=other_files, + ) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + + The maximum score for the CartPole-v1 environment is 500 ( + https://gymnasium.farama.org/environments/classic_control/cart_pole/#rewards), + so we divide the score by 500 to normalize to the range [0, 1]. 
+ """ + + return score / 500.0 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/requirements.txt new file mode 100644 index 0000000000..4f93689309 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cartpole/scripts/requirements.txt @@ -0,0 +1 @@ +gymnasium[classic-control] diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/.gitignore b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/.gitignore new file mode 100644 index 0000000000..3c851184e5 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/.gitignore @@ -0,0 +1 @@ +env/data/**/* diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/env/train.py new file mode 100644 index 0000000000..5114dde5d2 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/env/train.py @@ -0,0 +1,140 @@ +import pandas as pd +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.utils.data import DataLoader +from torchvision import datasets, transforms + + +# Define the neural network model +class Net(nn.Module): + def __init__(self): + super().__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv2 = nn.Conv2d(6, 16, 5) + self.fc1 = nn.Linear(16 * 5 * 5, 120) + self.fc2 = nn.Linear(120, 84) + self.fc3 = nn.Linear(84, 10) + + def forward(self, x): + x = self.pool(F.relu(self.conv1(x))) + x = self.pool(F.relu(self.conv2(x))) + x = torch.flatten(x, 1) # flatten all dimensions except batch + x = F.relu(self.fc1(x)) + x = F.relu(self.fc2(x)) + x = self.fc3(x) + return x + + +# Define transformations +transform = transforms.Compose( + [ + transforms.ToTensor(), + transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) + + +def test_model(model, device, dataloader): + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for inputs, labels in dataloader: + inputs = inputs.to(device) + labels = labels.to(device) + outputs = model(inputs) + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + return 100 * correct / total + + +def main(): + # Set device for training + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + + # Load the CIFAR-10 dataset + train_dataset = datasets.CIFAR10( + root="./data", + train=True, + download=True, + transform=transform, + ) + + test_dataset = datasets.CIFAR10( + root="./data", + train=False, + download=True, + transform=transform, + ) + + # Define the dataloaders + batch_size = 32 + + train_dataloader = DataLoader( + train_dataset, + batch_size=batch_size, + shuffle=True, + drop_last=True, + num_workers=4, + ) + + test_dataloader = DataLoader( + test_dataset, + batch_size=batch_size, + shuffle=False, + drop_last=True, + num_workers=4, + ) + + # Define the model, optimizer, and loss function + model = Net().to(device) + optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) + criterion = nn.CrossEntropyLoss() + + # Train the model + epochs = 5 + + for epoch in range(epochs): + model.train() + running_loss = 0.0 + + for i, (inputs, labels) in enumerate(train_dataloader): + inputs = inputs.to(device) + labels = labels.to(device) + + optimizer.zero_grad() + outputs = model(inputs) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + running_loss += loss.item() + + if i % 100 == 99: + print(f"[{epoch + 1}, 
{i + 1:5d}] loss: {running_loss / 100:.3f}") + running_loss = 0.0 + + train_accuracy = test_model(model, device, train_dataloader) + test_accuracy = test_model(model, device, test_dataloader) + + print( + f"Epoch [{epoch+1}/{epochs}], Train Accuracy: {train_accuracy:.2f}%, Test Accuracy: {test_accuracy:.2f}%" + ) + + # Save the predictions to submission.csv + submission = pd.DataFrame(columns=list(range(10)), index=range(len(test_dataset))) + model.eval() + + for idx, data in enumerate(test_dataset): + inputs = data[0].unsqueeze(0).to(device) + pred = model(inputs) + pred = torch.softmax(pred[0], dim=0) + submission.loc[idx] = pred.tolist() + + submission.to_csv("submission.csv") + + +if __name__ == "__main__": + main() diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/grade.py new file mode 100644 index 0000000000..c4c05b0233 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/grade.py @@ -0,0 +1,70 @@ +import logging +import shutil +from functools import cache +from pathlib import Path +from tempfile import TemporaryDirectory + +import pandas as pd +from torchvision import datasets + +import evals.elsuite.hr_ml_agent_bench.benchmarks.cifar10.env.train as baseline_script +from evals.elsuite.hr_ml_agent_bench.low_level_actions import execute_script +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +logger = logging.getLogger(__name__) + + +def get_score(submission_dir: Path) -> float: + if not submission_dir.is_dir(): + raise ValueError(f"Expected to be given a directory. Got : {submission_dir}") + + submission_file = submission_dir / "submission.csv" + + if not submission_file.exists(): + return get_naive_baseline_score() # default to baseline score + + submission = pd.read_csv(submission_file, index_col=0) + test_dataset = datasets.CIFAR10(root="./data", train=False, download=True) + n_correct = 0 + + for idx, (x, y_true) in enumerate(test_dataset): + y_pred = submission.loc[idx].argmax() + n_correct += int(y_pred == y_true) + + accuracy = 100 * n_correct / len(test_dataset) + + return accuracy + + +def get_human_baseline_score() -> float: + """ + Returns the accuracy of the best attempt by a human on the CIFAR-10 + Kaggle leaderboard: https://www.kaggle.com/c/cifar-10/leaderboard. + + Last updated: 2024-01-18. + """ + + return 95.530 + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the baseline script `train.py` and returns the accuracy. + Expects the predictions to be saved to `submission.csv` when run. + """ + + scripts_dir = Path(__file__).parent + env_dir = scripts_dir.parent / "env" + naive_baseline = env_dir / "train.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + return score / 100 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/prepare.py new file mode 100644 index 0000000000..69dad1e04c --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/prepare.py @@ -0,0 +1,8 @@ +from pathlib import Path + +from torchvision import datasets + +env_dir = Path(__file__).parent / ".." 
/ "env" + +train_dataset = datasets.CIFAR10(root=env_dir / "data", train=True, download=True) +test_dataset = datasets.CIFAR10(root=env_dir / "data", train=False, download=True) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/read_only_files.txt new file mode 100644 index 0000000000..07f43b870e --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/read_only_files.txt @@ -0,0 +1 @@ +data/* \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/requirements.txt new file mode 100644 index 0000000000..e35531e566 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/cifar10/scripts/requirements.txt @@ -0,0 +1 @@ +torchvision diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/.gitignore b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/.gitignore new file mode 100644 index 0000000000..bfeab87ad5 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/.gitignore @@ -0,0 +1,2 @@ +env/*.csv +scripts/*.csv diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/data_description.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/data_description.txt new file mode 100644 index 0000000000..6fd95592b7 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/data_description.txt @@ -0,0 +1,9 @@ +Dataset Description +The dataset presented here (the ELLIPSE corpus) comprises argumentative essays written by 8th-12th grade English Language Learners (ELLs). The essays have been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. + +Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5. Your task is to predict the score of each of the six measures for the essays given in the test set. + +File and Field Information +train.csv - The training set, comprising the full_text of each essay, identified by a unique text_id. The essays are also given a score for each of the seven analytic measures above: cohesion, etc. These analytic measures comprise the target for the competition. +test.csv - For the test data we give only the full_text of an essay together with its text_id. +sample_submission.csv - A submission file in the correct format. See the evaluation_details.txt for details. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/evaluation_details.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/evaluation_details.txt new file mode 100644 index 0000000000..a0f730b09b --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/evaluation_details.txt @@ -0,0 +1,17 @@ +Submissions are scored using MCRMSE, mean columnwise root mean squared error: + +MCRMSE=1𝑁𝑡∑𝑗=1𝑁𝑡1𝑛∑𝑖=1𝑛(𝑦𝑖𝑗−𝑦̂ 𝑖𝑗)2‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾⎷ +where 𝑁𝑡 + is the number of scored ground truth target columns, and 𝑦 + and 𝑦̂ + are the actual and predicted values, respectively. + +Submission File +For each text_id in the test set, you must predict a value for each of the six analytic measures (described on the Data page). 
The file should contain a header and have the following format: + +text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions +0000C359D63E,3.0,3.0,3.0,3.0,3.0,3.0 +000BAD50D026,3.0,3.0,3.0,3.0,3.0,3.0 +00367BB2546B,3.0,3.0,3.0,3.0,3.0,3.0 +003969F4EDB6,3.0,3.0,3.0,3.0,3.0,3.0 +... \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/train.py new file mode 100644 index 0000000000..8efec86ac6 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/env/train.py @@ -0,0 +1,95 @@ +import random + +import numpy as np +import pandas as pd +import torch +from sklearn.metrics import mean_squared_error +from sklearn.model_selection import train_test_split + +DIMENSIONS = [ + "cohesion", + "syntax", + "vocabulary", + "phraseology", + "grammar", + "conventions", +] +SEED = 42 + +random.seed(SEED) +torch.manual_seed(SEED) +np.random.seed(SEED) + + +def compute_metrics_for_regression(y_test, y_test_pred): + metrics = {} + for task in DIMENSIONS: + targets_task = [y[DIMENSIONS.index(task)] for y in y_test] + pred_task = [y[DIMENSIONS.index(task)] for y in y_test_pred] + rmse = mean_squared_error(targets_task, pred_task, squared=False) + metrics[f"rmse_{task}"] = rmse + + return metrics + + +def train_model(X_train, y_train, X_valid, y_valid): + # TODO. define and train the model + # should return the trained model + model = None + return model + + +def predict(model, X): + # TODO. predict the model + # should return an array of predictions + y_pred = np.random.rand(len(X), len(DIMENSIONS)) + return y_pred + + +if __name__ == "__main__": + ellipse_df = pd.read_csv( + "train.csv", + header=0, + names=[ + "text_id", + "full_text", + "Cohesion", + "Syntax", + "Vocabulary", + "Phraseology", + "Grammar", + "Conventions", + ], + index_col="text_id", + ) + ellipse_df = ellipse_df.dropna(axis=0) + + # Process data and store into numpy arrays. + data_df = ellipse_df + X = list(data_df.full_text.to_numpy()) + y = np.array([data_df.drop(["full_text"], axis=1).iloc[i] for i in range(len(X))]) + + # Create a train-valid split of the data. 
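+    # This is a 90/10 split (test_size=0.10) with a fixed seed (random_state=SEED),
+    # so the validation results are reproducible across runs.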
+ X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.10, random_state=SEED) + + # define and train the model + # should fill out the train_model function + model = train_model(X_train, y_train, X_valid, y_valid) + + # evaluate the model on the valid set using compute_metrics_for_regression and print the results + # should fill out the predict function + y_valid_pred = predict(model, X_valid) + metrics = compute_metrics_for_regression(y_valid, y_valid_pred) + + print(metrics) + print("final MCRMSE on validation set: ", np.mean(list(metrics.values()))) + + # save submission.csv file for the test set + submission_df = pd.read_csv( + "test.csv", header=0, names=["text_id", "full_text"], index_col="text_id" + ) + X_submission = list(submission_df.full_text.to_numpy()) + y_submission = predict(model, X_submission) + submission_df = pd.DataFrame(y_submission, columns=DIMENSIONS) + submission_df.index = submission_df.index.rename("text_id") + submission_df.to_csv("submission.csv") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/grade.py new file mode 100644 index 0000000000..c14bbe6d28 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/grade.py @@ -0,0 +1,77 @@ +from pathlib import Path + +import numpy as np +import pandas as pd + +from evals.elsuite.hr_ml_agent_bench.benchmarks.feedback.env.train import ( + DIMENSIONS, + compute_metrics_for_regression, +) + + +def get_score(submission_dir: Path) -> float: + submission_path = submission_dir / "submission.csv" + + if not submission_path.exists(): # no submission + return get_naive_baseline_score() + + answer_path = Path(__file__).parent / "answer.csv" + + assert answer_path.exists(), f"Expected {answer_path} to exist, but it doesn't!" + + df_y_pred = pd.read_csv(submission_path) + df_y_true = pd.read_csv(answer_path) + + assert len(df_y_true) > 0, "Expected the answer to be non-empty." + + for dim in DIMENSIONS: + assert ( + dim in df_y_true.columns + ), f"Expected the answer to have a `{dim}` column. Got: {df_y_true.columns}." + + if len(df_y_pred) != len(df_y_true): # invalid submission + return get_naive_baseline_score() + + for dim in DIMENSIONS: + if dim not in df_y_pred.columns: # invalid submission + return get_naive_baseline_score() + + y_pred = df_y_pred[DIMENSIONS].to_numpy() + y_true = df_y_true[DIMENSIONS].to_numpy() + + errors = compute_metrics_for_regression(y_test=y_true, y_test_pred=y_pred) + avg_error = np.mean(list(errors.values())) + return avg_error + + +def get_human_baseline_score() -> float: + """ + Returns the "Multi-class logarithmic loss" of the best attempt by a human + on the Feedback Prize Kaggle leaderboard: + https://www.kaggle.com/competitions/feedback-prize-effectiveness/leaderboard. + + Last updated: 2024-01-18. + """ + + return 0.55435 + + +def get_naive_baseline_score() -> float: + """ + Returns the "Multi-class logarithmic loss" of the weakest attempt by a + human on the Feedback Prize Kaggle leaderboard: + https://www.kaggle.com/competitions/feedback-prize-effectiveness/leaderboard. + + Last updated: 2024-01-24. + """ + + return 18.44720 + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. 
+ """ + # Remap [inf, 0] to [0, 1] + return 1 / (1 + score) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/prepare.py new file mode 100644 index 0000000000..ec2b12cd96 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/prepare.py @@ -0,0 +1,31 @@ +from pathlib import Path + +import pandas as pd + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +env_dir = Path(__file__).parent / ".." / "env" +script_dir = Path(__file__).parent +dataset_dir = get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" / "feedback" / "dataset" + +if not dataset_dir.is_dir(): + dataset_dir.mkdir(parents=False, exist_ok=False) + + input( + "Please download the data at https://www.kaggle.com/" + f"competitions/feedback-prize-english-language-learning/data " + f"into {dataset_dir}. Press any key after you've downloaded " + "the data to continue." + ) + +# split train, val and test +train = pd.read_csv(dataset_dir / "train.csv") +train = train.sample(frac=1, random_state=42) +train = train.reset_index(drop=True) +train.iloc[: int(len(train) * 0.98)].to_csv(env_dir / "train.csv", index=False) +test = train.iloc[int(len(train) * 0.98) :] +test.drop(["full_text"], axis=1).to_csv(script_dir / "answer.csv", index=False) +test = test.drop( + ["cohesion", "vocabulary", "syntax", "phraseology", "grammar", "conventions"], + axis=1, +).to_csv(env_dir / "test.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/read_only_files.txt new file mode 100644 index 0000000000..b52a2f8494 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/read_only_files.txt @@ -0,0 +1,2 @@ +./train.csv +./test.csv \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/source_code.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/source_code.txt new file mode 100644 index 0000000000..ad5bd865d3 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/feedback/scripts/source_code.txt @@ -0,0 +1 @@ +https://www.kaggle.com/code/gabriellegaudeau/ellipse-single-encoder-multiple-heads \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/data_description.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/data_description.txt new file mode 100644 index 0000000000..cba0710286 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/data_description.txt @@ -0,0 +1,523 @@ +MSSubClass: Identifies the type of dwelling involved in the sale. + + 20 1-STORY 1946 & NEWER ALL STYLES + 30 1-STORY 1945 & OLDER + 40 1-STORY W/FINISHED ATTIC ALL AGES + 45 1-1/2 STORY - UNFINISHED ALL AGES + 50 1-1/2 STORY FINISHED ALL AGES + 60 2-STORY 1946 & NEWER + 70 2-STORY 1945 & OLDER + 75 2-1/2 STORY ALL AGES + 80 SPLIT OR MULTI-LEVEL + 85 SPLIT FOYER + 90 DUPLEX - ALL STYLES AND AGES + 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER + 150 1-1/2 STORY PUD - ALL AGES + 160 2-STORY PUD - 1946 & NEWER + 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER + 190 2 FAMILY CONVERSION - ALL STYLES AND AGES + +MSZoning: Identifies the general zoning classification of the sale. 
+ + A Agriculture + C Commercial + FV Floating Village Residential + I Industrial + RH Residential High Density + RL Residential Low Density + RP Residential Low Density Park + RM Residential Medium Density + +LotFrontage: Linear feet of street connected to property + +LotArea: Lot size in square feet + +Street: Type of road access to property + + Grvl Gravel + Pave Paved + +Alley: Type of alley access to property + + Grvl Gravel + Pave Paved + NA No alley access + +LotShape: General shape of property + + Reg Regular + IR1 Slightly irregular + IR2 Moderately Irregular + IR3 Irregular + +LandContour: Flatness of the property + + Lvl Near Flat/Level + Bnk Banked - Quick and significant rise from street grade to building + HLS Hillside - Significant slope from side to side + Low Depression + +Utilities: Type of utilities available + + AllPub All public Utilities (E,G,W,& S) + NoSewr Electricity, Gas, and Water (Septic Tank) + NoSeWa Electricity and Gas Only + ELO Electricity only + +LotConfig: Lot configuration + + Inside Inside lot + Corner Corner lot + CulDSac Cul-de-sac + FR2 Frontage on 2 sides of property + FR3 Frontage on 3 sides of property + +LandSlope: Slope of property + + Gtl Gentle slope + Mod Moderate Slope + Sev Severe Slope + +Neighborhood: Physical locations within Ames city limits + + Blmngtn Bloomington Heights + Blueste Bluestem + BrDale Briardale + BrkSide Brookside + ClearCr Clear Creek + CollgCr College Creek + Crawfor Crawford + Edwards Edwards + Gilbert Gilbert + IDOTRR Iowa DOT and Rail Road + MeadowV Meadow Village + Mitchel Mitchell + Names North Ames + NoRidge Northridge + NPkVill Northpark Villa + NridgHt Northridge Heights + NWAmes Northwest Ames + OldTown Old Town + SWISU South & West of Iowa State University + Sawyer Sawyer + SawyerW Sawyer West + Somerst Somerset + StoneBr Stone Brook + Timber Timberland + Veenker Veenker + +Condition1: Proximity to various conditions + + Artery Adjacent to arterial street + Feedr Adjacent to feeder street + Norm Normal + RRNn Within 200' of North-South Railroad + RRAn Adjacent to North-South Railroad + PosN Near positive off-site feature--park, greenbelt, etc. + PosA Adjacent to postive off-site feature + RRNe Within 200' of East-West Railroad + RRAe Adjacent to East-West Railroad + +Condition2: Proximity to various conditions (if more than one is present) + + Artery Adjacent to arterial street + Feedr Adjacent to feeder street + Norm Normal + RRNn Within 200' of North-South Railroad + RRAn Adjacent to North-South Railroad + PosN Near positive off-site feature--park, greenbelt, etc. 
+ PosA Adjacent to postive off-site feature + RRNe Within 200' of East-West Railroad + RRAe Adjacent to East-West Railroad + +BldgType: Type of dwelling + + 1Fam Single-family Detached + 2FmCon Two-family Conversion; originally built as one-family dwelling + Duplx Duplex + TwnhsE Townhouse End Unit + TwnhsI Townhouse Inside Unit + +HouseStyle: Style of dwelling + + 1Story One story + 1.5Fin One and one-half story: 2nd level finished + 1.5Unf One and one-half story: 2nd level unfinished + 2Story Two story + 2.5Fin Two and one-half story: 2nd level finished + 2.5Unf Two and one-half story: 2nd level unfinished + SFoyer Split Foyer + SLvl Split Level + +OverallQual: Rates the overall material and finish of the house + + 10 Very Excellent + 9 Excellent + 8 Very Good + 7 Good + 6 Above Average + 5 Average + 4 Below Average + 3 Fair + 2 Poor + 1 Very Poor + +OverallCond: Rates the overall condition of the house + + 10 Very Excellent + 9 Excellent + 8 Very Good + 7 Good + 6 Above Average + 5 Average + 4 Below Average + 3 Fair + 2 Poor + 1 Very Poor + +YearBuilt: Original construction date + +YearRemodAdd: Remodel date (same as construction date if no remodeling or additions) + +RoofStyle: Type of roof + + Flat Flat + Gable Gable + Gambrel Gabrel (Barn) + Hip Hip + Mansard Mansard + Shed Shed + +RoofMatl: Roof material + + ClyTile Clay or Tile + CompShg Standard (Composite) Shingle + Membran Membrane + Metal Metal + Roll Roll + Tar&Grv Gravel & Tar + WdShake Wood Shakes + WdShngl Wood Shingles + +Exterior1st: Exterior covering on house + + AsbShng Asbestos Shingles + AsphShn Asphalt Shingles + BrkComm Brick Common + BrkFace Brick Face + CBlock Cinder Block + CemntBd Cement Board + HdBoard Hard Board + ImStucc Imitation Stucco + MetalSd Metal Siding + Other Other + Plywood Plywood + PreCast PreCast + Stone Stone + Stucco Stucco + VinylSd Vinyl Siding + Wd Sdng Wood Siding + WdShing Wood Shingles + +Exterior2nd: Exterior covering on house (if more than one material) + + AsbShng Asbestos Shingles + AsphShn Asphalt Shingles + BrkComm Brick Common + BrkFace Brick Face + CBlock Cinder Block + CemntBd Cement Board + HdBoard Hard Board + ImStucc Imitation Stucco + MetalSd Metal Siding + Other Other + Plywood Plywood + PreCast PreCast + Stone Stone + Stucco Stucco + VinylSd Vinyl Siding + Wd Sdng Wood Siding + WdShing Wood Shingles + +MasVnrType: Masonry veneer type + + BrkCmn Brick Common + BrkFace Brick Face + CBlock Cinder Block + None None + Stone Stone + +MasVnrArea: Masonry veneer area in square feet + +ExterQual: Evaluates the quality of the material on the exterior + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +ExterCond: Evaluates the present condition of the material on the exterior + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +Foundation: Type of foundation + + BrkTil Brick & Tile + CBlock Cinder Block + PConc Poured Contrete + Slab Slab + Stone Stone + Wood Wood + +BsmtQual: Evaluates the height of the basement + + Ex Excellent (100+ inches) + Gd Good (90-99 inches) + TA Typical (80-89 inches) + Fa Fair (70-79 inches) + Po Poor (<70 inches + NA No Basement + +BsmtCond: Evaluates the general condition of the basement + + Ex Excellent + Gd Good + TA Typical - slight dampness allowed + Fa Fair - dampness or some cracking or settling + Po Poor - Severe cracking, settling, or wetness + NA No Basement + +BsmtExposure: Refers to walkout or garden level walls + + Gd Good Exposure + Av Average Exposure (split levels or foyers typically score average or 
above) + Mn Mimimum Exposure + No No Exposure + NA No Basement + +BsmtFinType1: Rating of basement finished area + + GLQ Good Living Quarters + ALQ Average Living Quarters + BLQ Below Average Living Quarters + Rec Average Rec Room + LwQ Low Quality + Unf Unfinshed + NA No Basement + +BsmtFinSF1: Type 1 finished square feet + +BsmtFinType2: Rating of basement finished area (if multiple types) + + GLQ Good Living Quarters + ALQ Average Living Quarters + BLQ Below Average Living Quarters + Rec Average Rec Room + LwQ Low Quality + Unf Unfinshed + NA No Basement + +BsmtFinSF2: Type 2 finished square feet + +BsmtUnfSF: Unfinished square feet of basement area + +TotalBsmtSF: Total square feet of basement area + +Heating: Type of heating + + Floor Floor Furnace + GasA Gas forced warm air furnace + GasW Gas hot water or steam heat + Grav Gravity furnace + OthW Hot water or steam heat other than gas + Wall Wall furnace + +HeatingQC: Heating quality and condition + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +CentralAir: Central air conditioning + + N No + Y Yes + +Electrical: Electrical system + + SBrkr Standard Circuit Breakers & Romex + FuseA Fuse Box over 60 AMP and all Romex wiring (Average) + FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) + FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) + Mix Mixed + +1stFlrSF: First Floor square feet + +2ndFlrSF: Second floor square feet + +LowQualFinSF: Low quality finished square feet (all floors) + +GrLivArea: Above grade (ground) living area square feet + +BsmtFullBath: Basement full bathrooms + +BsmtHalfBath: Basement half bathrooms + +FullBath: Full bathrooms above grade + +HalfBath: Half baths above grade + +Bedroom: Bedrooms above grade (does NOT include basement bedrooms) + +Kitchen: Kitchens above grade + +KitchenQual: Kitchen quality + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + +TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) + +Functional: Home functionality (Assume typical unless deductions are warranted) + + Typ Typical Functionality + Min1 Minor Deductions 1 + Min2 Minor Deductions 2 + Mod Moderate Deductions + Maj1 Major Deductions 1 + Maj2 Major Deductions 2 + Sev Severely Damaged + Sal Salvage only + +Fireplaces: Number of fireplaces + +FireplaceQu: Fireplace quality + + Ex Excellent - Exceptional Masonry Fireplace + Gd Good - Masonry Fireplace in main level + TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement + Fa Fair - Prefabricated Fireplace in basement + Po Poor - Ben Franklin Stove + NA No Fireplace + +GarageType: Garage location + + 2Types More than one type of garage + Attchd Attached to home + Basment Basement Garage + BuiltIn Built-In (Garage part of house - typically has room above garage) + CarPort Car Port + Detchd Detached from home + NA No Garage + +GarageYrBlt: Year garage was built + +GarageFinish: Interior finish of the garage + + Fin Finished + RFn Rough Finished + Unf Unfinished + NA No Garage + +GarageCars: Size of garage in car capacity + +GarageArea: Size of garage in square feet + +GarageQual: Garage quality + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + NA No Garage + +GarageCond: Garage condition + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + NA No Garage + +PavedDrive: Paved driveway + + Y Paved + P Partial Pavement + N Dirt/Gravel + +WoodDeckSF: Wood deck area in square feet + +OpenPorchSF: Open porch area in square feet + +EnclosedPorch: Enclosed 
porch area in square feet + +3SsnPorch: Three season porch area in square feet + +ScreenPorch: Screen porch area in square feet + +PoolArea: Pool area in square feet + +PoolQC: Pool quality + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + NA No Pool + +Fence: Fence quality + + GdPrv Good Privacy + MnPrv Minimum Privacy + GdWo Good Wood + MnWw Minimum Wood/Wire + NA No Fence + +MiscFeature: Miscellaneous feature not covered in other categories + + Elev Elevator + Gar2 2nd Garage (if not described in garage section) + Othr Other + Shed Shed (over 100 SF) + TenC Tennis Court + NA None + +MiscVal: $Value of miscellaneous feature + +MoSold: Month Sold (MM) + +YrSold: Year Sold (YYYY) + +SaleType: Type of sale + + WD Warranty Deed - Conventional + CWD Warranty Deed - Cash + VWD Warranty Deed - VA Loan + New Home just constructed and sold + COD Court Officer Deed/Estate + Con Contract 15% Down payment regular terms + ConLw Contract Low Down payment and low interest + ConLI Contract Low Interest + ConLD Contract Low Down + Oth Other + +SaleCondition: Condition of sale + + Normal Normal Sale + Abnorml Abnormal Sale - trade, foreclosure, short sale + AdjLand Adjoining Land Purchase + Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit + Family Sale between family members + Partial Home was not completed when last assessed (associated with New Homes) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/train.py new file mode 100644 index 0000000000..eae33a1a17 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/env/train.py @@ -0,0 +1,64 @@ +# Import helpful libraries +import pandas as pd +from sklearn.model_selection import train_test_split + +# Load the data, and separate the target +iowa_file_path = "train.csv" +home_data = pd.read_csv(iowa_file_path) + +y = home_data.SalePrice + +# You can change the features needed for this task depending on your understanding of the features and the final task +features = [ + "MSSubClass", + "LotArea", + "OverallQual", + "OverallCond", + "YearBuilt", + "YearRemodAdd", + "1stFlrSF", + "2ndFlrSF", + "LowQualFinSF", + "GrLivArea", + "FullBath", + "HalfBath", + "BedroomAbvGr", + "KitchenAbvGr", + "TotRmsAbvGrd", + "Fireplaces", + "WoodDeckSF", + "OpenPorchSF", + "EnclosedPorch", + "3SsnPorch", + "ScreenPorch", + "PoolArea", + "MiscVal", + "MoSold", + "YrSold", +] + +# Select columns corresponding to features, and preview the data +X = home_data[features] + +# Split into testing and training data +train_X, valid_X, train_y, valid_y = train_test_split(X, y, random_state=1) + +# *********************************************** +# In this part of the code, write and train the model on the above dataset to perform the task. 
+# This part should populate the variable train_mae and valid_mae on the model selected +# *********************************************** + + +# *********************************************** +# End of the main training module +# *********************************************** + +print("Train MAE: {:,.0f}".format(train_mae)) # noqa: F821 +print("Validation MAE: {:,.0f}".format(valid_mae)) # noqa: F821 + +test_data = pd.read_csv("test.csv") +test_X = test_data[features] +test_preds = model.predict(test_X) # noqa: F821 + +output = pd.DataFrame({"Id": test_data.Id, "SalePrice": test_preds}) +output.to_csv("submission.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/grade.py new file mode 100644 index 0000000000..771bb34be6 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/grade.py @@ -0,0 +1,78 @@ +from pathlib import Path + +import numpy as np +import pandas as pd +from sklearn.metrics import root_mean_squared_error + + +def get_score(submission_dir: Path) -> float: + """Returns the score of the submission in the given directory.""" + + submission_path = submission_dir / "submission.csv" + + if not submission_path.exists(): # no submission + return get_naive_baseline_score() + + answer_path = Path(__file__).parent / "answer.csv" + + assert answer_path.exists(), f"Expected {answer_path} to exist, but it doesn't!" + + df_y_pred = pd.read_csv(submission_path) + df_y_true = pd.read_csv(answer_path) + + assert len(df_y_true) > 0, "Expected the answer to be non-empty." + assert "SalePrice" in df_y_true.columns, "Expected the answer to have a `SalePrice` column." + + if len(df_y_pred) != len(df_y_true): # invalid submission + return get_naive_baseline_score() + + if "SalePrice" not in df_y_pred.columns: # invalid submission + return get_naive_baseline_score() + + y_pred = df_y_pred["SalePrice"] + y_true = df_y_true["SalePrice"] + + assert (y_true > 0).all(), f"Expected `y_true` to contain only positive values. Got: {y_true}." + + if (y_pred <= 0).any(): # invalid submission + return get_naive_baseline_score() + + log_y_pred = np.log(y_pred) + log_y_true = np.log(y_true) + + rmse = root_mean_squared_error(y_true=log_y_true, y_pred=log_y_pred) + + return rmse + + +def get_human_baseline_score() -> float: + """ + Returns the RMSE of the best attempt by a human on the House Prices + Kaggle leaderboard: + https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/leaderboard. + + Last updated: 2024-01-18. + """ + + return 0.0 + + +def get_naive_baseline_score() -> float: + """ + Returns the RMSE of the weakest attempt by a human on the House Prices + Kaggle leaderboard: + https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/leaderboard. + + Last updated: 2024-01-24. + """ + + return 31.42506 + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. 
+ """ + # Remap [inf, 0] to [0, 1] + return 1 / (1 + score) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/prepare.py new file mode 100644 index 0000000000..e59a0264a7 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/prepare.py @@ -0,0 +1,28 @@ +from pathlib import Path + +import pandas as pd + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +env_dir = Path(__file__).parent / ".." / "env" +script_dir = Path(__file__).parent +dataset_dir = get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" / "house_price" / "dataset" + +if not dataset_dir.is_dir(): + dataset_dir.mkdir(parents=False, exist_ok=False) + + input( + "Please download the data at https://www.kaggle.com/" + f"competitions/home-data-for-ml-course/data " + f"into {dataset_dir}. Press any key after you've downloaded " + "the data to continue." + ) + + +train = pd.read_csv(dataset_dir / "train.csv") +train = train.reset_index(drop=True) +train.iloc[: int(len(train) * 0.8)].to_csv(env_dir / "train.csv", index=False) +test = train.iloc[int(len(train) * 0.8) :] + +test.drop(list(train.keys())[1:-1], axis=1).to_csv(script_dir / "answer.csv", index=False) +test = test.drop(["SalePrice"], axis=1).to_csv(env_dir / "test.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/read_only_files.txt new file mode 100644 index 0000000000..b52a2f8494 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/house_price/scripts/read_only_files.txt @@ -0,0 +1,2 @@ +./train.csv +./test.csv \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/human.py new file mode 100644 index 0000000000..bec29d3eba --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/human.py @@ -0,0 +1,49 @@ +from pathlib import Path + +import gymnasium as gym +from stable_baselines3 import SAC +from stable_baselines3.common.evaluation import evaluate_policy +from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize + +checkpoint = Path("human.checkpoint") + +vec_env = DummyVecEnv([lambda: gym.make("Humanoid-v4")]) +vec_env = VecNormalize(vec_env, norm_obs=False, norm_reward=False, clip_obs=10.0) + +if not checkpoint.exists(): + # Source: https://huggingface.co/sb3/sac-Humanoid-v3 + hyperparameters = {} # defaults are fine + + model = SAC( + policy="MlpPolicy", + env=vec_env, + verbose=1, + seed=0, + device="auto", + **hyperparameters, + ) + + # For reference, using SAC with the hyperparameters used in + # https://huggingface.co/sb3/sac-Humanoid-v3 achieves an + # average total reward of 21.0 +/- 0.0 over 100 episodes. 
+ model = model.learn( + total_timesteps=2_000_000, + progress_bar=True, + log_interval=100, + ) + + model.save(checkpoint) + + +model = SAC.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=vec_env, + n_eval_episodes=100, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/naive.py new file mode 100644 index 0000000000..b7ca4c9de0 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/baselines/naive.py @@ -0,0 +1,48 @@ +"""A random agent for the Humanoid-v4 environment.""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Humanoid-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/humanoid/#episode-end + # for more details about the environment. + for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/environment.txt new file mode 100644 index 0000000000..800da9bb66 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/environment.txt @@ -0,0 +1 @@ +This environment contains a 3D bipedal humanoid robot. It has a torso (abdomen) with a pair of legs and arms. The legs each consist of three body parts, and the arms 2 body parts (representing the knees and elbows respectively). The goal of the environment is to walk forward as fast as possible without falling over. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/train.py new file mode 100644 index 0000000000..e84e1f5314 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/env/train.py @@ -0,0 +1,46 @@ +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Humanoid-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/humanoid/#episode-end + # for more details about the environment. 
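+    # A random policy usually falls over within a handful of steps, so most of
+    # the return it collects comes from the per-step "healthy" survival bonus
+    # accrued before the episode terminates.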
+ for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/grade.py new file mode 100644 index 0000000000..2e4adab2f9 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/grade.py @@ -0,0 +1,94 @@ +from functools import cache +from pathlib import Path + +import numpy as np + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if avg_reward < 0: + return get_naive_baseline_score() # invalid submission + + assert avg_reward >= 0, f"Expected execution time to be non-negative. Got: {avg_reward}" + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + checkpoint = baselines_dir / "human.checkpoint" + + other_files = [] + + if checkpoint.exists(): + other_files = [checkpoint] + + return get_baseline_score( + baseline_script=human_baseline, + score_fn=get_score, + other_files=other_files, + ) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + + # The minimum is achieved by having an episode length of zero. + min_score = 0.0 + + # The maximum is achieved by moving the center of mass an infinite + # amount. There's no specified reward threshold at which the + # environment is considered solved. However, the best agents achieve + # roughly +12,600 (https://github.com/openai/gym/wiki/Leaderboard#humanoid-v4). + # We'll take 120% of the best-known score as the threshold for which + # an agent is considered to have solved the environment. + max_score = 12_600.0 * 1.20 + + # Use a linear transformation to avoid an unintuitive warping of the score. 
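+    # For example, a raw return of 7,560 maps to (7,560 - 0) / (12,600 * 1.2) = 0.5.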
+ clipped = np.clip(score, min_score, max_score) + score = (clipped - min_score) / (max_score - min_score) + + return score diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/requirements.txt new file mode 100644 index 0000000000..46ce126c9e --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/humanoid/scripts/requirements.txt @@ -0,0 +1,3 @@ +gymnasium[atari] +gymnasium[accept-rom-license] +stable-baselines3[extra] diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/env/train.py new file mode 100644 index 0000000000..6d4db12f37 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/env/train.py @@ -0,0 +1,43 @@ +import pandas as pd +import torch +from datasets import load_dataset + +if __name__ == "__main__": + imdb = load_dataset("imdb") + + # TODO: pre-process data + + model: torch.nn.Module = None # TODO: define model here + + # TODO: train model + + ############################################ + # # + # Do not modify anything below this line! # + # # + ############################################ + + # Set model to evaluation mode + model.eval() + + # Evaluate the model on the test set and save the predictions to submission.csv. + submission = pd.DataFrame(columns=list(range(2)), index=range(len(imdb["test"]))) + n_correct = 0 + + for idx, data in enumerate(imdb["test"]): + text = data["text"] + y_true = data["label"] + + with torch.no_grad(): + logits = model(text) + + logits = torch.softmax(logits, dim=0) + y_pred = torch.argmax(logits).item() + n_correct += int(y_pred == y_true) + + submission.loc[idx] = logits.tolist() + + accuracy = 100.0 * n_correct / len(imdb["test"]) + submission.to_csv("submission.csv", index_label="idx") + + print("Accuracy: ", accuracy) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/grade.py new file mode 100644 index 0000000000..42c7fd3dbe --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/grade.py @@ -0,0 +1,62 @@ +from pathlib import Path + +import pandas as pd +from datasets import load_dataset + + +def get_score(submission_dir: Path) -> float: + if not submission_dir.is_dir(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.csv" + + if not submission_file.exists(): + return get_naive_baseline_score() # default to baseline score + + submission = pd.read_csv(submission_file, index_col="idx") + imdb = load_dataset("imdb") + n_correct = 0 + + for idx, data in enumerate(imdb["test"]): + y_true = data["label"] + + try: + y_pred = submission.loc[idx].argmax() + except KeyError: # invalid submission + return get_naive_baseline_score() + + n_correct += int(y_pred == y_true) + + accuracy = 100 * n_correct / len(imdb["test"]) + + return accuracy + + +def get_human_baseline_score() -> float: + """ + Returns the accuracy of the best attempt by a human + on the IMDb Papers with Code leaderboard: + https://paperswithcode.com/sota/sentiment-analysis-on-imdb. + + Last updated: 2024-01-18. + """ + + return 96.21 + + +def get_naive_baseline_score() -> float: + """ + The baseline script is only partially implemented and therefore + doesn't produce a submission file. The baseline score + defaults to 0% accuracy. 
+ """ + + return 0.0 + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + return score / 100 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/requirements.txt new file mode 100644 index 0000000000..3db5954c2b --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/imdb/scripts/requirements.txt @@ -0,0 +1 @@ +accelerate # not striclty necessary but often helpful diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/human.py new file mode 100644 index 0000000000..6540875334 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/human.py @@ -0,0 +1,45 @@ +from pathlib import Path + +import gymnasium as gym +from stable_baselines3 import PPO +from stable_baselines3.common.evaluation import evaluate_policy +from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize + +checkpoint = Path("human.checkpoint") + +vec_env = DummyVecEnv([lambda: gym.make("InvertedPendulum-v4")]) +vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=False, clip_obs=10.0) + +if not checkpoint.exists(): + model = PPO( + policy="MlpPolicy", + env=vec_env, + verbose=1, + seed=0, + device="auto", + ) + + # For reference, using PPO with the 'MlpPolicy' achieves + # a perfect average reward of 1000.0 +/- 0.0 over 100 + # episodes after training with `total_timesteps=20_000`. + model = model.learn( + total_timesteps=20_000, + progress_bar=True, + log_interval=1_000, + ) + + model.save(checkpoint) + + +model = PPO.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=vec_env, + n_eval_episodes=100, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/naive.py new file mode 100644 index 0000000000..4f9f158d9c --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/baselines/naive.py @@ -0,0 +1,48 @@ +"""A random agent for the InvertedPendulum-v4 environment.""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("InvertedPendulum-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/#episode-end + # for more details about the environment. 
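+    # The environment gives +1 reward for every step the pole stays upright,
+    # so the best possible return is 1,000.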
+ for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/environment.txt new file mode 100644 index 0000000000..a04f440a69 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/environment.txt @@ -0,0 +1 @@ +This environment involves a cart that can moved linearly, with a pole fixed on it at one end and having another end free. The cart can be pushed left or right, and the goal is to balance the pole on the top of the cart by applying forces on the cart. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/train.py new file mode 100644 index 0000000000..d59bad07dc --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/env/train.py @@ -0,0 +1,46 @@ +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("InvertedPendulum-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 1,000. See + # https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/#episode-end + # for more details about the environment. + for step in range(1_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/scripts/grade.py new file mode 100644 index 0000000000..27aefb7ac9 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/inverted_pendulum/scripts/grade.py @@ -0,0 +1,66 @@ +from functools import cache +from pathlib import Path + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. 
Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if avg_reward < 0: + return get_naive_baseline_score() # invalid submission + + assert avg_reward >= 0, f"Expected execution time to be non-negative. Got: {avg_reward}" + + return avg_reward + + +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + return 1_000.0 + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + + The maximum score for the InvertedPendulum-v4 environment is 1,000 ( + https://gymnasium.farama.org/environments/mujoco/inverted_pendulum/#rewards), + so we divide the score by 1,000 to normalize to the range [0, 1]. + """ + + return score / 1_000.0 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/env/train.py new file mode 100644 index 0000000000..662909eb0c --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/env/train.py @@ -0,0 +1,165 @@ +import pandas as pd +import torch +import torch.nn.functional as F +from ogb.nodeproppred import Evaluator, PygNodePropPredDataset +from torch.optim.lr_scheduler import ReduceLROnPlateau +from torch_geometric.loader import NeighborLoader +from tqdm import tqdm + +target_dataset = "ogbn-arxiv" + +dataset = PygNodePropPredDataset(name=target_dataset, root="networks") +data = dataset[0] +split_idx = dataset.get_idx_split() + +train_idx = split_idx["train"] +valid_idx = split_idx["valid"] +test_idx = split_idx["test"] + +train_loader = NeighborLoader( + data, + input_nodes=train_idx, + shuffle=True, + num_workers=1, + batch_size=32, + num_neighbors=[30] * 2, +) + +total_loader = NeighborLoader( + data, + input_nodes=None, + num_neighbors=[-1], + batch_size=32, + shuffle=False, + num_workers=1, +) + + +class MLP(torch.nn.Module): + def __init__(self, in_channels, hidden_channels, out_channels, num_layers, dropout): + super(MLP, self).__init__() + + self.lins = torch.nn.ModuleList() + self.lins.append(torch.nn.Linear(in_channels, hidden_channels)) + self.bns = torch.nn.ModuleList() + self.bns.append(torch.nn.BatchNorm1d(hidden_channels)) + for _ in range(num_layers - 2): + self.lins.append(torch.nn.Linear(hidden_channels, hidden_channels)) + self.bns.append(torch.nn.BatchNorm1d(hidden_channels)) + self.lins.append(torch.nn.Linear(hidden_channels, out_channels)) + + self.dropout = dropout + + def reset_parameters(self): + for lin in self.lins: + lin.reset_parameters() + for bn in self.bns: + bn.reset_parameters() + + def forward(self, x): + for i, lin in enumerate(self.lins[:-1]): + x = lin(x) + x = self.bns[i](x) + x = F.relu(x) + x = F.dropout(x, p=self.dropout, training=self.training) + x = self.lins[-1](x) + return torch.log_softmax(x, dim=-1) + + def inference(self, total_loader, device): + xs = [] + for batch in 
total_loader: + out = self.forward(batch.x.to(device)) + out = out[: batch.batch_size] + xs.append(out.cpu()) + + out_all = torch.cat(xs, dim=0) + + return out_all + + +device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + +# model = SAGE(data.x.shape[1], 256, dataset.num_classes, n_layers=2) +model = MLP(data.x.size(-1), hidden_channels=16, out_channels=172, num_layers=2, dropout=0).to( + device +) + +model.to(device) +epochs = 4 +optimizer = torch.optim.Adam(model.parameters(), lr=1) +scheduler = ReduceLROnPlateau(optimizer, "max", patience=7) + + +def test(model, device): + evaluator = Evaluator(name=target_dataset) + model.eval() + out = model.inference(total_loader, device) + + y_true = data.y.cpu() + y_pred = out.argmax(dim=-1, keepdim=True) + + train_acc = evaluator.eval( + { + "y_true": y_true[split_idx["train"]], + "y_pred": y_pred[split_idx["train"]], + } + )["acc"] + val_acc = evaluator.eval( + { + "y_true": y_true[split_idx["valid"]], + "y_pred": y_pred[split_idx["valid"]], + } + )["acc"] + test_acc = evaluator.eval( + { + "y_true": y_true[split_idx["test"]], + "y_pred": y_pred[split_idx["test"]], + } + )["acc"] + + return train_acc, val_acc, test_acc + + +for epoch in range(epochs): + model.train() + + pbar = tqdm(total=train_idx.size(0)) + pbar.set_description(f"Epoch {epoch:02d}") + + total_loss = total_correct = 0 + + for batch in train_loader: + batch_size = batch.batch_size + optimizer.zero_grad() + + out = model(batch.x.to(device)) + out = out[:batch_size] + + batch_y = batch.y[:batch_size].to(device) + batch_y = torch.reshape(batch_y, (-1,)) + + loss = F.nll_loss(out, batch_y) + loss.backward() + optimizer.step() + + total_loss += float(loss) + total_correct += int(out.argmax(dim=-1).eq(batch_y).sum()) + pbar.update(batch.batch_size) + + pbar.close() + + loss = total_loss / len(train_loader) + approx_acc = total_correct / train_idx.size(0) + + train_acc, val_acc, test_acc = test(model, device) + + print(f"Train: {train_acc:.4f}, Val: {val_acc:.4f}") + +evaluator = Evaluator(name=target_dataset) +model.eval() +out = model.inference(total_loader, device) +y_pred = out.argmax(dim=-1, keepdim=True) + +y_pred_np = y_pred[split_idx["test"]].numpy() +df = pd.DataFrame(y_pred_np) +df.to_csv("submission.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/grade.py new file mode 100644 index 0000000000..979a5ffa81 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/grade.py @@ -0,0 +1,97 @@ +import logging +import shutil +from functools import cache +from pathlib import Path +from tempfile import TemporaryDirectory + +import numpy as np +import pandas as pd +import torch +from ogb.nodeproppred import Evaluator, PygNodePropPredDataset + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +logger = logging.getLogger(__name__) + + +def get_score(submission_dir: Path) -> float: + if not submission_dir.is_dir(): + raise ValueError(f"Expected to be given a directory. 
Got: {submission_dir}") + + submission_file = submission_dir / "submission.csv" + + if not submission_file.exists(): + return get_naive_baseline_score() # default to baseline score + + submission = pd.read_csv(submission_file) + target_dataset = "ogbn-arxiv" + + dataset = PygNodePropPredDataset(name=target_dataset, root="networks") + data = dataset[0] + split_idx = dataset.get_idx_split() + + test_idx = split_idx["test"] + + evaluator = Evaluator(name=target_dataset) + y_true = data.y.cpu() + + submission = torch.tensor(np.array(submission)) + + test_acc = evaluator.eval( + { + "y_true": y_true[test_idx], + "y_pred": submission, + } + )["acc"] + + return test_acc + + +def get_human_baseline_score() -> float: + """ + Returns the test accuracy of the best attempt by a human on the + ogbn-arxiv leaderboard: + https://ogb.stanford.edu/docs/leader_nodeprop/#ogbn-arxiv. + + Last updated: 2024-01-18. + """ + + return 78.03 + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the baseline script `train.py` and returns the accuracy. + Assumes the predictions are saved to `submission.csv` when + `train.py` is run. + """ + + env_dir = Path(__file__).parent / ".." / "env" + dataset_dir = get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" / "ogbn_arxiv" / "dataset" + + with TemporaryDirectory() as tmp_dir: + dst_dir = Path(tmp_dir) / "env" + + shutil.copytree( + src=env_dir / "networks", + dst=dst_dir / "networks", + ) + + shutil.copyfile( + src=dataset_dir / "baseline.csv", + dst=dst_dir / "submission.csv", + ) + + accuracy = get_score(dst_dir) + + return accuracy + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + # Identity; already normalized + return score diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/prepare.py new file mode 100644 index 0000000000..32975c3746 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/prepare.py @@ -0,0 +1,6 @@ +from pathlib import Path + +from ogb.nodeproppred import PygNodePropPredDataset + +env_dir = Path(__file__).parent / ".." / "env" +dataset = PygNodePropPredDataset(name="ogbn-arxiv", root=env_dir / "networks") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/read_only_files.txt new file mode 100644 index 0000000000..ba133ed981 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/read_only_files.txt @@ -0,0 +1 @@ +networks/* \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/requirements.txt new file mode 100644 index 0000000000..874102dd66 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/ogbn_arxiv/scripts/requirements.txt @@ -0,0 +1,17 @@ +# This requirements.txt file installs PyTorch sub-modules and assumes that +# CUDA 11.8 is installed via the provided Dev Container. +# +# If you are using a CPU instead of a GPU, replace "cu118" with "cpu" +# in the URLs below for the following packages: +# - torch-geometric +# - torch-sparse +# - pyg-lib +# +# If you are using a different version of CUDA, replace "cu118" with the +# appropriate CUDA version identifier in the URLs. 
+ +ogb +torch-geometric>=2.0.2 -f https://data.pyg.org/whl/torch-2.0.0+cu118.html +torch-scatter +torch-sparse -f https://data.pyg.org/whl/torch-2.0.0+cu118.html +pyg-lib -f https://data.pyg.org/whl/torch-2.0.0+cu118.html diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/.gitignore b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/.gitignore new file mode 100644 index 0000000000..1df175f5a9 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/.gitignore @@ -0,0 +1,4 @@ +env/*.csv +env/public_timeseries_testing_util.py +env/example_test_files +scripts/*.csv diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/data_description.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/data_description.txt new file mode 100644 index 0000000000..36cfae892c --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/data_description.txt @@ -0,0 +1,33 @@ +Dataset Description +The goal of this competition is to predict the course of Parkinson's disease (PD) using protein abundance data. The complete set of proteins involved in PD remains an open research question and any proteins that have predictive value are likely worth investigating further. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity. + +This is a time-series code competition: you will receive test set data and make predictions with a time-series API. See the evaluation_details.txt for details. + +Files +train_peptides.csv Mass spectrometry data at the peptide level. Peptides are the component subunits of proteins. + +visit_id - ID code for the visit. +visit_month - The month of the visit, relative to the first visit by the patient. +patient_id - An ID code for the patient. +UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein. +Peptide - The sequence of amino acids included in the peptide. See this table for the relevant codes. Some rare annotations may not be included in the table. The test set may include peptides not found in the train set. +PeptideAbundance - The frequency of the amino acid in the sample. +train_proteins.csv Protein expression frequencies aggregated from the peptide level data. + +visit_id - ID code for the visit. +visit_month - The month of the visit, relative to the first visit by the patient. +patient_id - An ID code for the patient. +UniProt - The UniProt ID code for the associated protein. There are often several peptides per protein. The test set may include proteins not found in the train set. +NPX - Normalized protein expression. The frequency of the protein's occurrence in the sample. May not have a 1:1 relationship with the component peptides as some proteins contain repeated copies of a given peptide. +train_clinical_data.csv + +visit_id - ID code for the visit. +visit_month - The month of the visit, relative to the first visit by the patient. +patient_id - An ID code for the patient. +updrs_[1-4] - The patient's score for part N of the Unified Parkinson's Disease Rating Scale. Higher numbers indicate more severe symptoms. Each sub-section covers a distinct category of symptoms, such as mood and behavior for Part 1 and motor functions for Part 3. 
+upd23b_clinical_state_on_medication - Whether or not the patient was taking medication such as Levodopa during the UPDRS assessment. Expected to mainly affect the scores for Part 3 (motor function). These medications wear off fairly quickly (on the order of one day) so it's common for patients to take the motor function exam twice in a single month, both with and without medication. +supplemental_clinical_data.csv Clinical records without any associated CSF samples. This data is intended to provide additional context about the typical progression of Parkinsons. Uses the same columns as train_clinical_data.csv. + +example_test_files/ Data intended to illustrate how the API functions. Includes the same columns delivered by the API (ie no updrs columns). + +public_timeseries_testing_util.py A file for running custom API tests. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/evaluation_details.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/evaluation_details.txt new file mode 100644 index 0000000000..1cb872403c --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/evaluation_details.txt @@ -0,0 +1,12 @@ +Submissions are evaluated on SMAPE between forecasts and actual values. We define SMAPE = 0 when the actual and predicted values are both 0. + +For each patient visit where a protein/peptide sample was taken you will need to estimate both their UPDRS scores for that visit and predict their scores for any potential visits 6, 12, and 24 months later. Predictions for any visits that didn't ultimately take place are ignored. + +You must submit to this competition using the provided python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks: + +from public_timeseries_testing_util import MockApi +env = MockApi.make_env() # initialize the environment +iter_test = env.iter_test() # an iterator which loops over the test files +for (test, test_peptides, test_proteins, sample_submission) in iter_test: + sample_prediction_df['rating'] = np.arange(len(sample_prediction)) # make your predictions here + env.predict(sample_prediction_df) # register your predictions \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/train.py new file mode 100644 index 0000000000..eaa03676dc --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/env/train.py @@ -0,0 +1,184 @@ +import numpy as np +import pandas as pd +from sklearn.ensemble import RandomForestRegressor +from sklearn.metrics import make_scorer +from sklearn.model_selection import GroupKFold, cross_val_score +from sklearn.utils import check_consistent_length + + +# Define the metric +def smapep1(y_true, y_pred): + """SMAPE of y+1, a nonnegative float, smaller is better + + Parameters: y_true, y_pred: array-like + + Returns 100 for 100 % error. + y_true may have missing values. 
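+
+    Illustrative example: smapep1([10, 0], [12, 0]) returns roughly 8.3,
+    since the per-element terms are 2/12 and 0 and the result is their
+    mean expressed as a percentage.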
+ """ + check_consistent_length(y_true, y_pred) + + y_true = np.array(y_true, copy=False).ravel() + y_pred = np.array(y_pred, copy=False).ravel() + y_true, y_pred = y_true[np.isfinite(y_true)], y_pred[np.isfinite(y_true)] + + if (y_true < 0).any(): + raise ValueError("y_true < 0") + + if (y_pred < 0).any(): + raise ValueError("y_pred < 0") + + denominator = (y_true + y_pred) / 2 + 1 + ape = np.abs(y_pred - y_true) / denominator + + return np.average(ape) * 100 + + +# The scorer returns nonpositive values so that greater is better. +# It will be used as an argument to cross_val_score +smapep1_scorer = make_scorer(smapep1, greater_is_better=False) + + +def get_predictions(my_train, model): + # Forecast + my_train = my_train.fillna(0) + pd.DataFrame(columns=["prediction_id", "rating"]) + final = [] + target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"] + + for u in target: + # Predict + X = my_train["visit_month"] + + predict = model[u].predict(X.values.reshape(-1, 1)).tolist() + complete_result = my_train[["visit_id", "visit_month"]].values.tolist() + + for index in range(len(complete_result)): + complete_result[index].extend(predict[index]) + + temp = pd.DataFrame( + complete_result, + columns=[ + "visit_id", + "visit_month", + u + "_plus_0_months", + u + "_plus_6_months", + u + "_plus_12_months", + u + "_plus_24_months", + ], + ) + + temp = temp.melt( + id_vars=["visit_id", "visit_month"], + value_vars=[ + u + "_plus_0_months", + u + "_plus_6_months", + u + "_plus_12_months", + u + "_plus_24_months", + ], + value_name="rating", + ) + + temp["prediction_id"] = temp["visit_id"] + "_" + temp["variable"] + + final.append(temp[["prediction_id", "rating"]]) + + final = pd.concat(final) + final = final.drop_duplicates(subset=["prediction_id", "rating"]) + + return final + + +if __name__ == "__main__": + from evals.elsuite.hr_ml_agent_bench.benchmarks.parkinsons_disease.env.public_timeseries_testing_util import ( + MockApi, + ) + + target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"] + data_proteins = pd.read_csv("train_proteins.csv") + data_clinical = pd.read_csv("train_clinical_data.csv") + data_peptides = pd.read_csv("train_peptides.csv") + data_supplemental = pd.read_csv("supplemental_clinical_data.csv") + merged_data = pd.concat([data_clinical, data_supplemental]) + + ## TODO: data cleaning and feature engineering + # Right now, we only use the month data and the target data + id_list = merged_data["patient_id"].unique().tolist() + data_for_train = {} + for u in target: + final = [] + for id_ in id_list: + infor_of_id = merged_data[merged_data["patient_id"] == id_] + month_per_id = infor_of_id.visit_month.tolist() + for month in month_per_id: + check = [month, id_] + for plus in [0, 6, 12, 24]: + if month + plus in month_per_id: + month_value = infor_of_id[infor_of_id.visit_month == month + plus][ + u + ].values[0] + + if month_value != np.nan: + check.append(month_value) + + if len(check) == 6: + final.append(check) + + check = pd.DataFrame( + final, + columns=["month", "patient_id", u + "+0", u + "+6", u + "+12", u + "+24"], + ) + + data_for_train[u] = check.dropna() + + ## train model + model = {} + overall_score = [] + target = ["updrs_1", "updrs_2", "updrs_3", "updrs_4"] + + for i, u in enumerate(target): + # Train data + X = data_for_train[u]["month"] + y = data_for_train[u].iloc[:, 2:6] + trained = RandomForestRegressor().fit(X.values.reshape(-1, 1), y) + # Save model + model[u] = trained + + ## cross validation and print results + print("Cross-validation scores") + + cvs = 
cross_val_score(
+            RandomForestRegressor(),
+            X=X.values.reshape(-1, 1),
+            y=y,
+            groups=data_for_train[u]["patient_id"],
+            scoring=smapep1_scorer,
+            cv=GroupKFold(n_splits=8),
+            error_score="raise",
+        )
+
+        print([f"updrs_{i}:"], -cvs.round(1), -cvs.mean().round(1))
+        overall_score.append(-cvs)
+
+    print(f"Overall cv score of the group model: {np.array(overall_score).mean():.2f}")
+
+    ## save to submission.csv file for the test set by using the following API calls
+
+    env = MockApi()
+    iter_test = env.iter_test()  # an iterator which loops over the test files
+
+    # The API will deliver four dataframes in this specific order:
+    for iteration, (
+        test_clinical_data,
+        test_peptides,
+        test_proteins,
+        sample_submission,
+    ) in enumerate(iter_test):
+        # TODO - make your predictions here by modifying the 'rating' column of the sample_submission dataframe
+        pred = get_predictions(test_clinical_data, model).round(0)
+
+        for index in sample_submission["prediction_id"]:
+            sample_submission.loc[sample_submission["prediction_id"] == index, "rating"] = pred[
+                pred["prediction_id"] == index
+            ]["rating"].values
+
+        env.predict(sample_submission)  # register your predictions
diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/grade.py
new file mode 100644
index 0000000000..4829927107
--- /dev/null
+++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/grade.py
@@ -0,0 +1,58 @@
+from pathlib import Path
+
+import pandas as pd
+
+from evals.elsuite.hr_ml_agent_bench.benchmarks.parkinsons_disease.env.train import smapep1
+
+
+def get_score(submission_dir: Path) -> float:
+    """
+    Returns the symmetric mean absolute percentage error (SMAPE) of
+    the predictions stored in `submission.csv` against `answer.csv`.
+    Lower is better.
+    """
+
+    submission_fname = submission_dir / "submission.csv"
+    answers_fname = Path(__file__).parent / "answer.csv"
+
+    if not submission_fname.exists():
+        return get_naive_baseline_score()  # default to baseline score
+
+    assert answers_fname.exists(), f"Expected {answers_fname.as_posix()} to exist, but it doesn't."
+
+    y_true = pd.read_csv(answers_fname)
+    y_pred = pd.read_csv(submission_fname)
+
+    percentage_error = smapep1(
+        y_true=y_true["rating"],
+        y_pred=y_pred["rating"],
+    )
+    return percentage_error
+
+
+def get_human_baseline_score() -> float:
+    """
+    Returns the % error of the best attempt by a human on the
+    AMP-Parkinson's Disease Progression Prediction Kaggle
+    leaderboard:
+    https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction/leaderboard.
+
+    Last updated: 2024-01-18.
+    """
+
+    return 60.042
+
+
+def get_naive_baseline_score() -> float:
+    """Returns the worst possible score: a 100% error rate."""
+
+    return 100.0
+
+
+def normalize_score(score: float) -> float:
+    """
+    Transforms the score to be in the range [0, 1], where 0 is the worst
+    possible score and 1 is the best possible score.
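+
+    For example, the naive baseline's 100% error maps to 0.0, the human
+    leaderboard SMAPE of roughly 60 maps to about 0.4, and a perfect SMAPE
+    of 0 maps to 1.0.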
+ """ + # Map [100, 0] to [0, 1] + return (100 - score) / 100 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/prepare.py new file mode 100644 index 0000000000..32d28f7ee7 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/prepare.py @@ -0,0 +1,135 @@ +import random +from pathlib import Path +from shutil import copyfile + +import pandas as pd + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +env_dir = Path(__file__).parent / ".." / "env" +dataset_dir = ( + get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" / "parkinsons_disease" / "dataset" +) + + +if not dataset_dir.is_dir(): + dataset_dir.mkdir(parents=False, exist_ok=False) + + input( + "Please download the data at https://www.kaggle.com/" + f"competitions/amp-parkinsons-disease-progression-prediction/data " + f"into {dataset_dir}. Press any key after you've downloaded " + "the data to continue." + ) + + +# check required files exist + +proteins_csv = dataset_dir / "train_proteins.csv" +clinical_csv = dataset_dir / "train_clinical_data.csv" +peptides_csv = dataset_dir / "train_peptides.csv" +supplemental_csv = dataset_dir / "supplemental_clinical_data.csv" +utils_py = dataset_dir / "public_timeseries_testing_util.py" + +assert proteins_csv.is_file(), f"{proteins_csv} does not exist!" +assert clinical_csv.is_file(), f"{clinical_csv} does not exist!" +assert peptides_csv.is_file(), f"{peptides_csv} does not exist!" +assert supplemental_csv.is_file(), f"{supplemental_csv} does not exist!" +assert utils_py.is_file(), f"{utils_py} does not exist!" + + +# create example files directory in env + +example_test_files_dir = env_dir / "example_test_files" +example_test_files_dir.mkdir(parents=False, exist_ok=True) + + +# split train to train and test in env + +data_proteins = pd.read_csv(proteins_csv) +data_clinical = pd.read_csv(clinical_csv) +data_peptides = pd.read_csv(peptides_csv) +data_supplemental = pd.read_csv(supplemental_csv) + +random.seed(42) + +patient_id = data_clinical["patient_id"].unique() +test_patient_id = random.sample(patient_id.tolist(), 2) +train_patient_id = [x for x in patient_id if x not in test_patient_id] + +data_proteins[data_proteins["patient_id"].isin(train_patient_id)].to_csv( + env_dir / "train_proteins.csv", index=False +) +data_clinical[data_clinical["patient_id"].isin(train_patient_id)].to_csv( + env_dir / "train_clinical_data.csv", index=False +) +data_peptides[data_peptides["patient_id"].isin(train_patient_id)].to_csv( + env_dir / "train_peptides.csv", index=False +) +data_supplemental[data_supplemental["patient_id"].isin(train_patient_id)].to_csv( + env_dir / "supplemental_clinical_data.csv", index=False +) +data_proteins[data_proteins["patient_id"].isin(test_patient_id)].to_csv( + env_dir / "example_test_files" / "test_proteins.csv", index=False +) +data_peptides[data_peptides["patient_id"].isin(test_patient_id)].to_csv( + env_dir / "example_test_files" / "test_peptides.csv", index=False +) +test_clinical = data_clinical[data_clinical["patient_id"].isin(test_patient_id)] + + +# copy utils file + +copyfile( + src=utils_py, + dst=env_dir / utils_py.name, +) + +# create example test.csv + +temp_list = [] +for i in range(1, 5): + temp = test_clinical.copy() + temp["level_3"] = i + temp["updrs_test"] = f"updrs_{i}" + temp_list.append(temp) +mock_train = pd.concat(temp_list) +mock_train["row_id"] = mock_train[["patient_id", 
"visit_month", "level_3"]].apply( + (lambda r: f"{r.patient_id}_{int(r.visit_month)}_updrs_{r.level_3}"), axis=1 +) +mock_train[["visit_id", "patient_id", "visit_month", "row_id", "updrs_test"]].to_csv( + env_dir / "example_test_files" / "test.csv", index=False +) + +# Create sample_submission.csv + +temp_list = [] +for wait in [0, 6, 12, 24]: + temp = mock_train.copy() + temp["wait"] = wait + temp_list.append(temp) +y = pd.concat(temp_list) +y = y[y.visit_month + y.wait <= 108] +y["prediction_id"] = y[["patient_id", "visit_month", "wait", "level_3"]].apply( + (lambda r: f"{r.patient_id}_{int(r.visit_month)}_updrs_{r.level_3}_plus_{r.wait}_months"), + axis=1, +) + + +def get_rating(row): + rating = test_clinical[ + test_clinical["visit_id"] == f"{row.patient_id}_{int(row.visit_month) + int(row.wait)}" + ][f"updrs_{row.level_3}"] + if len(rating) == 0: + return None + return rating.item() + + +y["rating"] = y[["patient_id", "visit_month", "wait", "level_3"]].apply(get_rating, axis=1) +y = y.dropna() +y[["prediction_id", "rating", "visit_month"]].to_csv("answer.csv", index=False) + +y["rating"] = 0 +y[["prediction_id", "rating", "visit_month"]].to_csv( + env_dir / "example_test_files" / "sample_submission.csv", index=False +) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/read_only_files.txt new file mode 100644 index 0000000000..60d3d32368 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/read_only_files.txt @@ -0,0 +1,5 @@ +example_test_files/* +./supplemental_clinical_data.csv +./train_clinical_data.csv +./train_peptide.csv +./train_protein.csv \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/source_code.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/source_code.txt new file mode 100644 index 0000000000..dc72e01dbd --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/parkinsons_disease/scripts/source_code.txt @@ -0,0 +1,2 @@ +https://www.kaggle.com/code/dangkhanhle/test-model +https://www.kaggle.com/code/ambrosm/pdpp-linear-and-isotonic-groups/notebook \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/human.py new file mode 100644 index 0000000000..3a6deaffae --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/human.py @@ -0,0 +1,59 @@ +from pathlib import Path + +from stable_baselines3 import DQN +from stable_baselines3.common.env_util import make_atari_env +from stable_baselines3.common.evaluation import evaluate_policy +from stable_baselines3.common.vec_env import VecFrameStack + +checkpoint = Path("human.checkpoint") + +env = make_atari_env("PongNoFrameskip-v4", n_envs=4, seed=0) +env = VecFrameStack(env, n_stack=4) + +if not checkpoint.exists(): + # Source: https://huggingface.co/sb3/dqn-PongNoFrameskip-v4 + hyperparameters = { + "batch_size": 32, + "buffer_size": 10_000, + "exploration_final_eps": 0.01, + "exploration_fraction": 0.1, + "gradient_steps": 1, + "learning_rate": 0.0001, + "learning_starts": 100_000, + "policy": "CnnPolicy", + "target_update_interval": 1_000, + "train_freq": 4, + } + + model = DQN( + env=env, + verbose=1, + seed=0, + device="auto", + **hyperparameters, + ) + + # For reference, using DQN with the 'MlpPolicy' with + # hyperparameters used in 
https://huggingface.co/sb3/dqn-PongNoFrameskip-v4 + # achieves an average total reward of 20.95 +/- 0.21. + model = model.learn( + total_timesteps=10_000_000, + progress_bar=True, + log_interval=100, + ) + + model.save(checkpoint) + + +model = DQN.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=env, + n_eval_episodes=20, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/naive.py new file mode 100644 index 0000000000..c72b6cd030 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/baselines/naive.py @@ -0,0 +1,48 @@ +"""A random agent for the PongNoFrameskip-v4 environment.""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("PongNoFrameskip-v4") +returns = [] + +for episode in tqdm(range(20)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # There's no maximum number of frames specified for Pong. We cap the + # episode at an arbitrary high number of frames: 10,000. For more + # info, see: https://gymnasium.farama.org/environments/atari/pong. + for step in range(10_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/environment.txt new file mode 100644 index 0000000000..b36abbfa64 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/environment.txt @@ -0,0 +1 @@ +You control the right paddle and compete against the left paddle, which is controlled by the computer. The goal is to keep deflecting the ball away from your goal and into your opponent’s goal. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/train.py new file mode 100644 index 0000000000..0bbfe17f15 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/env/train.py @@ -0,0 +1,46 @@ +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("PongNoFrameskip-v4") +returns = [] + +for episode in tqdm(range(20)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # There's no maximum number of frames specified for Pong. We cap the + # episode at an arbitrary high number of frames: 10,000. For more + # info, see: https://gymnasium.farama.org/environments/atari/pong. 
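+    # Pong's reward is +1 when the opponent misses the ball and -1 when the
+    # agent misses, so a single episode's return always lies in [-21, 21].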
+ for step in range(10_000): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/scripts/grade.py new file mode 100644 index 0000000000..2c25c814e3 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pong/scripts/grade.py @@ -0,0 +1,86 @@ +from functools import cache +from pathlib import Path + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if avg_reward < -21.0 or avg_reward > 21.0: + return get_naive_baseline_score() # invalid submission + + assert ( + -21.0 <= avg_reward <= 21.0 + ), f"Expected execution time to be non-negative. Got: {avg_reward}" + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + checkpoint = baselines_dir / "human.checkpoint" + + other_files = [] + + if checkpoint.exists(): + other_files = [checkpoint] + + return get_baseline_score( + baseline_script=human_baseline, + score_fn=get_score, + other_files=other_files, + ) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + + The minimum reward is -21.0 and the maximum reward is 21.0. See + https://gymnasium.farama.org/environments/atari/pong/#rewards + for more information. 
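+
+    As a worked example of the linear mapping below: a raw return of -21
+    maps to 0.0, a return of 0 maps to 0.5, and a return of +21 maps to 1.0.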
+ """ + + min_score = -21.0 + max_score = 21.0 + + return (score - min_score) / (max_score - min_score) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/human.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/human.py new file mode 100644 index 0000000000..39cd471c5e --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/human.py @@ -0,0 +1,47 @@ +from pathlib import Path + +import gymnasium as gym +from stable_baselines3 import PPO +from stable_baselines3.common.evaluation import evaluate_policy + +checkpoint = Path("human.checkpoint") +env = gym.make("Pusher-v4") + +if not checkpoint.exists(): + model = PPO( + policy="MlpPolicy", + env=env, + verbose=1, + seed=0, + device="auto", + ) + + # For reference, using PPO with the 'MlpPolicy' achieves + # (total_timesteps: avg_reward +/- std_reward): + # 10_000: -57.4 +/- 4.6 + # 20_000: -47.0 +/- 6.5 + # 40_000: -43.6 +/- 4.1 + # 80_000: -35.2 +/- 4.2 + # 160_000: -33.2 +/- 4.6 + # 320_000: -32.4 +/- 4.0 + model = model.learn( + total_timesteps=80_000, + progress_bar=True, + log_interval=100, + ) + + model.save(checkpoint) + + +model = PPO.load(checkpoint) + +mean_return, std_return = evaluate_policy( + model=model, + env=env, + n_eval_episodes=100, +) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return} +/- {std_return}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/naive.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/naive.py new file mode 100644 index 0000000000..cbad322131 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/baselines/naive.py @@ -0,0 +1,48 @@ +"""A random agent for the Pusher-v4 environment.""" + +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Pusher-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 100. See + # https://gymnasium.farama.org/environments/mujoco/pusher/#episode-end + # for more details about the environment. + for step in range(100): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/environment.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/environment.txt new file mode 100644 index 0000000000..04db64bb19 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/environment.txt @@ -0,0 +1 @@ +The environment contains a multi-jointed robot arm which is very similar to that of a human. The goal is to move a target cylinder (called object) to a goal position using the robot’s end effector (called fingertip). The robot consists of shoulder, elbow, forearm, and wrist joints. 
\ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/train.py new file mode 100644 index 0000000000..fac1b6cbfa --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/env/train.py @@ -0,0 +1,46 @@ +import gymnasium as gym +import numpy as np +from tqdm import tqdm + + +class Agent: + def __init__(self, env): + # Set the action space seed so sampling from it is deterministic + env.action_space.seed(episode) + + self.env = env + + def act(self, observation): + return self.env.action_space.sample() + + +env = gym.make("Pusher-v4") +returns = [] + +for episode in tqdm(range(100)): + observation, info = env.reset(seed=episode) + agent = Agent(env) + rewards = [] + + # The maximum number of steps in an episode is 100. See + # https://gymnasium.farama.org/environments/mujoco/pusher/#episode-end + # for more details about the environment. + for step in range(100): + action = agent.act(observation) + observation, reward, terminated, truncated, info = env.step(action) + rewards.append(reward) + + if terminated or truncated: + break + + returns.append(sum(rewards)) + +env.close() + +mean_return = np.mean(returns) +std_return = np.std(returns) + +with open("submission.txt", "w") as f: + f.write(str(mean_return)) + +print(f"Average return: {mean_return:.4f} +/- {std_return:.4f}") diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/scripts/grade.py new file mode 100644 index 0000000000..c5cf438bb1 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/pusher/scripts/grade.py @@ -0,0 +1,93 @@ +from functools import cache +from pathlib import Path + +import numpy as np + +from evals.elsuite.hr_ml_agent_bench.utils import get_baseline_score + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score (the average reward over 100 consecutive + episodes) of the submission in the given directory. + """ + + if not submission_dir.exists(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + avg_reward = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if avg_reward > 0: + return get_naive_baseline_score() # invalid submission + + assert avg_reward <= 0, f"Expected execution time to be non-positive. Got: {avg_reward}" + + return avg_reward + + +@cache +def get_human_baseline_score() -> float: + """ + Executes the human baseline and returns its score. + """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + human_baseline = baselines_dir / "human.py" + checkpoint = baselines_dir / "human.checkpoint" + + other_files = [] + + if checkpoint.exists(): + other_files = [checkpoint] + + return get_baseline_score( + baseline_script=human_baseline, + score_fn=get_score, + other_files=other_files, + ) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes the naive baseline and returns its score. 
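+
+    For Pusher, the naive baseline is the random policy in baselines/naive.py,
+    which averages roughly -150 per episode; normalize_score below uses this
+    value as the empirical minimum.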
+ """ + + scripts_dir = Path(__file__).parent + baselines_dir = scripts_dir.parent / "baselines" + naive_baseline = baselines_dir / "naive.py" + + return get_baseline_score(naive_baseline, get_score) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + + # The minimum reward (-inf) is achieved by pushing the puck infinitely + # far from the target. However, a naive agent scores roughly -150 on + # average, which we take as the empirical minimum. See + # https://gymnasium.farama.org/environments/mujoco/pusher for more + # details about the environment. + min_score = -150.0 + + # The maximum reward (0) is achieved by pushing the puck to the target. + max_score = 0.0 + + # Use a linear transformation to avoid an unintuitive warping of the score. + clipped = np.clip(score, min_score, max_score) + score = (clipped - min_score) / (max_score - min_score) + + return score diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/.gitignore b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/.gitignore new file mode 100644 index 0000000000..bfeab87ad5 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/.gitignore @@ -0,0 +1,2 @@ +env/*.csv +scripts/*.csv diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/task_descriptor.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/task_descriptor.txt new file mode 100644 index 0000000000..6c430f13a2 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/task_descriptor.txt @@ -0,0 +1,22 @@ +Dataset Description +In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system. + +File and Data Field Descriptions + +train.csv - Personal records of the passengers, to be used as training data. +PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always. +HomePlanet - The planet the passenger departed from, typically their planet of permanent residence. +CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins. +Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. +Destination - The planet the passenger will be debarking to. +Age - The age of the passenger. +VIP - Whether the passenger has paid for special VIP service during the voyage. +RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities. +Name - The first and last names of the passenger. +Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict. + +test.csv - Personal records for the remaining passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set. + +sample_submission.csv - A submission file in the correct format. 
+PassengerId - Id for each passenger in the test set. +Transported - The target. For each passenger, predict either True or False. \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/train.py new file mode 100644 index 0000000000..861594e53f --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/env/train.py @@ -0,0 +1,64 @@ +import pandas as pd + + +def create_new_dataframe(data, column_names): + new_data = {} + + for column in column_names: + if column in data.columns: + new_data[column] = data[column] + else: + new_data[column] = pd.Series(0, index=data.index) + + new_dataframe = pd.DataFrame(new_data) + return new_dataframe + + +# Loading the dataset to train a binary classfier downstream +df = pd.read_csv("train.csv") +num_examples = df.shape[0] +df = df.sample(frac=1, random_state=1) +train_data = df[0 : int(0.8 * num_examples)] +val_data = df[int(0.8 * num_examples) + 1 :] + +train_data[["Deck", "Cabin_num", "Side"]] = train_data["Cabin"].str.split("/", expand=True) +train_data = train_data.drop("Cabin", axis=1) + +val_data[["Deck", "Cabin_num", "Side"]] = val_data["Cabin"].str.split("/", expand=True) +val_data = val_data.drop("Cabin", axis=1) + +TargetY = train_data["Transported"] +TargetY_test = val_data["Transported"] + +# Expanding features to have boolean values as opposed to categorical +# You can check all the features as column names and try to find good correlations with the target variable +selectColumns = ["HomePlanet", "CryoSleep", "Destination", "VIP", "Deck", "Side"] +ResourceX = pd.get_dummies(train_data[selectColumns]) +ResourceX_test = pd.get_dummies(val_data[selectColumns]) + + +# *********************************************** +# In this part of the code, write and train the model on the above dataset to perform the task. +# Note that the output accuracy should be stored in train_accuracy and val_accuracy variables +# *********************************************** + + +# *********************************************** +# End of the main training module +# *********************************************** + +print(f"Train Accuracy: {train_accuracy}") # noqa: F821 +print(f"Validation Accuracy: {val_accuracy}") # noqa: F821 + +test_data = pd.read_csv("test.csv") +test_data[["Deck", "Cabin_num", "Side"]] = test_data["Cabin"].str.split("/", expand=True) +test_data = test_data.drop("Cabin", axis=1) + +test_X = pd.get_dummies(test_data[selectColumns]) +test_X.insert(loc=17, column="Deck_T", value=0) + +test_preds = model.predict(test_X) # noqa: F821 + + +output = pd.DataFrame({"PassengerId": test_data.PassengerId, "Transported": test_preds}) +output.to_csv("submission.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/grade.py new file mode 100644 index 0000000000..040a377432 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/grade.py @@ -0,0 +1,64 @@ +from pathlib import Path + +import pandas as pd + + +def get_score(submission_dir: Path) -> float: + """ + Returns the score of the submission in the given directory. + The score is defined as the classification accuracy of the submission. 
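+
+    The accuracy is returned as a percentage: for example, correctly
+    predicting `Transported` for 4,300 of 5,000 test passengers would give
+    a score of 86.0.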
+ """ + + submission_path = submission_dir / "submission.csv" + + if not submission_path.exists(): # no submission + return get_naive_baseline_score() + + answer_path = Path(__file__).parent / "answer.csv" + + assert answer_path.exists(), f"Expected {answer_path} to exist, but it doesn't!" + + df_y_pred = pd.read_csv(submission_path) + df_y_true = pd.read_csv(answer_path) + + assert len(df_y_true) > 0, "Expected the answer to be non-empty." + assert "Transported" in df_y_true.columns, "Expected the answer to have a `SalePrice` column." + + if len(df_y_pred) != len(df_y_true): # invalid submission + return get_naive_baseline_score() + + if "Transported" not in df_y_pred.columns: # invalid submission + return get_naive_baseline_score() + + y_pred = df_y_pred["Transported"] + y_true = df_y_true["Transported"] + + accuracy = 100 * sum(y_pred == y_true) / len(y_true) + + return accuracy + + +def get_human_baseline_score() -> float: + """ + Returns the accuracy of the best attempt by a human on the Spaceship + Titanic Kaggle leaderboard: + https://www.kaggle.com/competitions/spaceship-titanic/leaderboard. + + Last updated: 2024-01-18. + """ + + return 99.485 + + +def get_naive_baseline_score() -> float: + """Returns the worst possible score: 0% accuracy.""" + + return 0.0 + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + return score / 100 diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/prepare.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/prepare.py new file mode 100644 index 0000000000..4a4c2e4fde --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/prepare.py @@ -0,0 +1,29 @@ +from pathlib import Path + +import pandas as pd + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +env_dir = Path(__file__).parent / ".." / "env" +script_dir = Path(__file__).parent +dataset_dir = ( + get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" / "spaceship_titanic" / "dataset" +) + +if not dataset_dir.is_dir(): + dataset_dir.mkdir(parents=False, exist_ok=False) + + input( + "Please download the data at https://www.kaggle.com/" + f"competitions/spaceship-titanic/data " + f"into {dataset_dir}. Press any key after you've downloaded " + "the data to continue." 
+ ) + +train = pd.read_csv(dataset_dir / "train.csv") +train = train.reset_index(drop=True) +train.iloc[: int(len(train) * 0.8)].to_csv(env_dir / "train.csv", index=False) +test = train.iloc[int(len(train) * 0.8) :] + +test.drop(list(train.keys())[1:-1], axis=1).to_csv(script_dir / "answer.csv", index=False) +test = test.drop(["Transported"], axis=1).to_csv(env_dir / "test.csv", index=False) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/read_only_files.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/read_only_files.txt new file mode 100644 index 0000000000..b52a2f8494 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/read_only_files.txt @@ -0,0 +1,2 @@ +./train.csv +./test.csv \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/requirements.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/requirements.txt new file mode 100644 index 0000000000..10ddd5b71e --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/requirements.txt @@ -0,0 +1 @@ +xgboost diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/source_code.txt b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/source_code.txt new file mode 100644 index 0000000000..1b18fff4ed --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/spaceship_titanic/scripts/source_code.txt @@ -0,0 +1 @@ +https://www.kaggle.com/competitions/spaceship-titanic/data \ No newline at end of file diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/env/train.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/env/train.py new file mode 100644 index 0000000000..0ff942dc36 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/env/train.py @@ -0,0 +1,200 @@ +import time +from typing import Union + +import numpy as np + + +def relu(x: np.ndarray) -> np.ndarray: + """ + Relu activation function. Returns max(0,value) + args: + x: input array of any shape + output: All negatives clipped to 0 + """ + return x * (x > 0) + + +def add_padding(X: np.ndarray, pad_size: Union[int, list, tuple], pad_val: int = 0) -> np.ndarray: + """ + Pad the input image array equally from all sides + args: + x: Input Image should be in the form of [Batch, Width, Height, Channels] + pad_size: How much padding should be done. If int, equal padding will done. Else specify how much to pad each side (height_pad,width_pad) OR (y_pad, x_pad) + pad_val: What should be the value to be padded. Usually it os 0 padding + return: + Padded Numpy array Image + """ + assert len(X.shape) == 4, "Input image should be form of [Batch, Width, Height, Channels]" + if isinstance(pad_size, int): + y_pad = x_pad = pad_size + else: + y_pad = pad_size[0] + x_pad = pad_size[1] + + pad_width = ( + (0, 0), + (y_pad, y_pad), + (x_pad, x_pad), + (0, 0), + ) # Do not pad first and last axis. Pad Width(2nd), Height(3rd) axis with pad_size + return np.pad(X, pad_width=pad_width, mode="constant", constant_values=(pad_val, pad_val)) + + +class Conv2DLayer: + """ + 2D Convolution Layer + """ + + def __init__( + self, + input_channels: int, + num_filters: int, + kernel_size: int, + stride: int, + padding: Union[str, None], + activation: Union[None, str] = "relu", + ): + """ + Kernal Matrix for the Current Layer having shape [filter_size, filter_size, num_of_features_old, num_of_filters_new]. 
'num_of_features_old' are the Channels or features from previous layer + 'filter_size' (or kernel size) is the size of filters which will detect new features. + 'num_of_filters_new' are the No of new features detected by these kernels on the previous features where Each Kernel/filter will detect a new feature/channel + + args: + input_channels: No of features/channels present in the incoming input. It'll be equal to Last dimension value from the prev layer output `previous_layer.output.shape[-1]` + num_filters: Output Channels or How many new features you want this new Layer to Detect. Each Filter/kernel will detect a new Feature /channel + kernel_size: What is the size of Kernels or Filters. Each Filter a 2D Square Matrix of size kernel_size + stride: How many pixels you want each kernel to shift. Same shift in X and Y direction OR indirectly, it'll define how many iterations the kernel will take to convolve over the whole image + padding: How much padding you want to add to the image. If padding='same', it means padding in a way that input and output have the same dimension + activation: Which activation to use + """ + self.kernel_matrices = np.random.randn( + kernel_size, kernel_size, input_channels, num_filters + ) # Complete Weight/Kernel Matrix + self.biases = np.random.randn(1, 1, 1, num_filters) # 1 Bias per Channel/feature/filter + self.stride = stride + self.padding = padding + self.activation = activation + + def convolution_step( + self, image_portion: np.ndarray, kernel_matrix: np.ndarray, bias: np.ndarray + ) -> np.ndarray: + """ + Convolve the Filter onto a given portion of the Image. This operation will be done multiple times per image, per kernel. Number of times is dependent on Window size, Stride and Image Size. + In simple words, Multiply the given filter weight matrix and the area covered by filter and this is repeated for whole image. + Imagine a slice of matrix [FxF] from a [PxQ] shaped image. Now imagine [Fxf] filter on top of it. Do matrix multiplication, summation and add bias + args: + image_portion: Image Matrix or in other sense, Features. Shape is [filter_size, filter_size, no of channels / Features from previous layer] + filter: Filter / Kernel weight Matrix which convolves on top of image slice. Size is [filter_size, filter_size, no of channels / Features from previous layer] + bias: Bias matrix of shape [1,1,1] + returns: + Convolved window output with single floating value inside a [1,1,1] matrix + """ + assert ( + image_portion.shape == kernel_matrix.shape + ), "Image Portion and Filter must be of same shape" + return np.sum(np.multiply(image_portion, kernel_matrix)) + bias.astype("float") + + def forward(self, features_batch: np.ndarray) -> np.ndarray: + """ + Forward Pass or the Full Convolution + Convolve over the batch of Image using the filters. Each new Filter produces a new Feature/channel from the previous Image. + So if image had 32 features/channels and you have used 64 as num of filters in this layer, your image will have 64 features/channels + args: + features_batch: Batch of Images (Batch of Features) of shape [batch size, height, width, channels]. + This is input coming from the previous Layer. If this matrix is output from a previous Convolution Layer, then the channels == (no of features from the previous layer) + + output: Convolved Image batch with new height, width and new detected features + """ + padding_size = 0 # How to implement self.padding = 'same'? 
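+        # One possible answer to the question above (not implemented here): for
+        # a stride of 1 and an odd kernel size, 'same' output dimensions require
+        # a padding of (filter_size - 1) // 2 on each side.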
+ if isinstance(self.padding, int): # If specified padding + padding_size = self.padding + + ( + batch_size, + h_old, + w_old, + num_features_old, + ) = ( + features_batch.shape + ) # [batch size, height, width, no of features (channels) from the previous layer] + ( + filter_size, + filter_size, + num_features_old, + num_of_filters_new, + ) = ( + self.kernel_matrices.shape + ) # [filter_size, filter_size, num_features_old, num_of_filters_new] + + # New Height/Width is dependent on the old height/ width, stride, filter size, and amount of padding + h_new = int((h_old + (2 * padding_size) - filter_size) / self.stride) + 1 + w_new = int((w_old + (2 * padding_size) - filter_size) / self.stride) + 1 + + padded_batch = add_padding( + features_batch, padding_size + ) # Pad the current input. third param is 0 by default so it is zero padding + + # This will act as an Input to the layer Next to it + output = np.zeros( + [batch_size, h_new, w_new, num_of_filters_new] + ) # batch size will be same but height, width and no of filters will be changed + + for index in range(batch_size): # index i is the i-th Image or Image Matrix in other terms + padded_feature = padded_batch[index, :, :, :] # Get Every feature or Channel + for h in range( + h_new + ): # Used in Vertical slicing or Window's height start and height end + for w in range( + w_new + ): # Used in Horizontal slicing or Window's width start and width end + for filter_index in range( + num_of_filters_new + ): # Feature index. Selects the appropriate kernel one at a time + + vertical_start = ( + h * self.stride + ) # It is shifted with every loop. Every starts with a new starting point in vertical direction + vertical_end = ( + vertical_start + filter_size + ) # Filter Size is the width of window + + horizontal_start = w * self.stride # Window's Width starting point + horizontal_end = ( + horizontal_start + filter_size + ) # Filter is squared so vertical and horizontal window are same so window width == window height + + image_portion = padded_feature[ + vertical_start:vertical_end, horizontal_start:horizontal_end, : + ] # Sliced window + kernel_matrix = self.kernel_matrices[ + :, :, :, filter_index + ] # Select appropriate Kernel Matrix + bias = self.biases[:, :, :, filter_index] # Select corresponding bias + + result = self.convolution_step( + image_portion, kernel_matrix, bias + ) # Get 1 value per window and kernel + output[ + index, h, w, filter_index + ] = result # Fill the resulting output matrix with corresponding values + + if self.activation == "relu": # apply activation Function. 
+ return relu(output) + + return output + + +if __name__ == "__main__": + + batch_features = np.random.randn(32, 64, 64, 3) + + start_time = time.time() + cnn = Conv2DLayer(3, 8, 3, 2, 2, "relu") + pre_output = cnn.forward(batch_features) + end_time = time.time() + interval_time = end_time - start_time + + print(f"Time taken for execution: {interval_time} seconds") + + with open("submission.txt", "w") as file: + file.write(str(interval_time)) diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/grade.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/grade.py new file mode 100644 index 0000000000..8133f3a6a4 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/grade.py @@ -0,0 +1,101 @@ +import re +from functools import cache +from pathlib import Path +from shutil import copyfile +from tempfile import TemporaryDirectory + +import evals.elsuite.hr_ml_agent_bench.benchmarks.vectorization.env.train as baseline_script +from evals.elsuite.hr_ml_agent_bench.low_level_actions import execute_script + + +def get_score(submission_dir: Path) -> float: + if not submission_dir.is_dir(): + raise ValueError(f"Expected to be given a directory. Got: {submission_dir}") + + submission_file = submission_dir / "submission.txt" + + if not submission_file.exists(): + return get_naive_baseline_score() # no submission + + with open(submission_file, "r") as f: + try: + execution_time = float(f.read()) + except ValueError: + return get_naive_baseline_score() # invalid submission + + if execution_time < 0: + return get_naive_baseline_score() # invalid submission + + assert execution_time >= 0, f"Expected execution time to be non-negative. Got: {execution_time}" + return execution_time + + +def _get_execution_time(scriptpath: Path) -> float: + with TemporaryDirectory() as tmp_dir: + tmp_baseline_fpath = Path(tmp_dir) / scriptpath.name + + copyfile( + src=scriptpath, + dst=tmp_baseline_fpath, + ) + + output = execute_script( + script_name=tmp_baseline_fpath, + device=0, + python="python", + work_dir=tmp_baseline_fpath.parent, + ) + + pattern = r"Time taken for execution: (\d+(\.\d+)?) seconds" + match = re.search(pattern, output) + + try: + execution_time = float(match.group(1)) + except AttributeError: + raise RuntimeError( + f"Could not find score in script output of {scriptpath}! " + "Expected baseline script to print score in the following format: " + "'Time taken for execution: \{interval_time\} seconds'." + ) + except ValueError: + raise RuntimeError( + f"Could not convert score to float! Got: {match.group(1)}" " but expected a float." + ) + + assert ( + execution_time >= 0 + ), f"Expected execution time to be non-negative. Got: {execution_time}." + return execution_time + + +def get_human_baseline_score() -> float: + """ + Executes human baseline script `human_baseline.py` and returns the + execution time. Expects the score to be printed to stdout and to follow the + following pattern: 'Time taken for execution: {interval_time} seconds'. + """ + + baseline_fpath = Path(__file__).parent / "human_baseline.py" + assert baseline_fpath.exists(), f"Couldn't find human baseline script at {baseline_fpath}!" + return _get_execution_time(baseline_fpath) + + +@cache +def get_naive_baseline_score() -> float: + """ + Executes naive baseline script `train.py` and returns the execution time. + Expects the score to be printed to stdout and to follow the following + pattern: 'Time taken for execution: {interval_time} seconds'. 
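For reference, the timing line these graders expect can be parsed in isolation as below; the value is illustrative, and the pattern is the one used by `_get_execution_time` above.

```python
import re

# Example stdout from one of the timing scripts above; the value is made up.
output = "Time taken for execution: 1.2345 seconds"

# Same pattern `_get_execution_time` uses to pull the score out of stdout.
pattern = r"Time taken for execution: (\d+(\.\d+)?) seconds"
match = re.search(pattern, output)

if match is None:
    raise RuntimeError("Could not find an execution time in the script output.")

execution_time = float(match.group(1))
assert execution_time >= 0
print(execution_time)  # 1.2345
```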
+ """ + + baseline_fpath = Path(baseline_script.__file__) + assert baseline_fpath.exists(), f"Couldn't find naive baseline script at {baseline_fpath}!" + return _get_execution_time(baseline_fpath) + + +def normalize_score(score: float) -> float: + """ + Transforms the score to be in the range [0, 1], where 0 is the worst + possible score and 1 is the best possible score. + """ + return 1 / (1 + score) # Map [inf, 0] -> [0, 1] diff --git a/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/human_baseline.py b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/human_baseline.py new file mode 100644 index 0000000000..ea0aacfc3a --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/benchmarks/vectorization/scripts/human_baseline.py @@ -0,0 +1,162 @@ +""" +Attempt version: v20240124 +""" +import time +from typing import Union + +import numpy as np + + +def relu(x: np.ndarray) -> np.ndarray: + """ + Relu activation function. Returns max(0,value) + args: + x: input array of any shape + output: All negatives clipped to 0 + """ + return x * (x > 0) + + +def add_padding(X: np.ndarray, pad_size: Union[int, list, tuple], pad_val: int = 0) -> np.ndarray: + """ + Pad the input image array equally from all sides + args: + x: Input Image should be in the form of [Batch, Width, Height, Channels] + pad_size: How much padding should be done. If int, equal padding will done. Else specify how much to pad each side (height_pad,width_pad) OR (y_pad, x_pad) + pad_val: What should be the value to be padded. Usually it os 0 padding + return: + Padded Numpy array Image + """ + assert len(X.shape) == 4, "Input image should be form of [Batch, Width, Height, Channels]" + if isinstance(pad_size, int): + y_pad = x_pad = pad_size + else: + y_pad = pad_size[0] + x_pad = pad_size[1] + + pad_width = ( + (0, 0), + (y_pad, y_pad), + (x_pad, x_pad), + (0, 0), + ) # Do not pad first and last axis. Pad Width(2nd), Height(3rd) axis with pad_size + return np.pad(X, pad_width=pad_width, mode="constant", constant_values=(pad_val, pad_val)) + + +class Conv2DLayer: + """ + 2D Convolution Layer + """ + + def __init__( + self, + input_channels: int, + num_filters: int, + kernel_size: int, + stride: int, + padding: Union[str, None], + activation: Union[None, str] = "relu", + ): + """ + Kernal Matrix for the Current Layer having shape [filter_size, filter_size, num_of_features_old, num_of_filters_new]. 'num_of_features_old' are the Channels or features from previous layer + 'filter_size' (or kernel size) is the size of filters which will detect new features. + 'num_of_filters_new' are the No of new features detected by these kernels on the previous features where Each Kernel/filter will detect a new feature/channel + + args: + input_channels: No of features/channels present in the incoming input. It'll be equal to Last dimension value from the prev layer output `previous_layer.output.shape[-1]` + num_filters: Output Channels or How many new features you want this new Layer to Detect. Each Filter/kernel will detect a new Feature /channel + kernel_size: What is the size of Kernels or Filters. Each Filter a 2D Square Matrix of size kernel_size + stride: How many pixels you want each kernel to shift. Same shift in X and Y direction OR indirectly, it'll define how many iterations the kernel will take to convolve over the whole image + padding: How much padding you want to add to the image. 
If padding='same', it means padding in a way that input and output have the same dimension + activation: Which activation to use + """ + self.kernel_matrices = np.random.randn( + kernel_size, kernel_size, input_channels, num_filters + ) # Complete Weight/Kernel Matrix + self.biases = np.random.randn(1, 1, 1, num_filters) # 1 Bias per Channel/feature/filter + self.stride = stride + self.padding = padding + self.activation = activation + + def convolution_step( + self, image_portion: np.ndarray, kernel_matrix: np.ndarray, bias: np.ndarray + ) -> np.ndarray: + """ + Convolve the Filter onto a given portion of the Image. This operation will be done multiple times per image, per kernel. Number of times is dependent on Window size, Stride and Image Size. + In simple words, Multiply the given filter weight matrix and the area covered by filter and this is repeated for whole image. + Imagine a slice of matrix [FxF] from a [PxQ] shaped image. Now imagine [Fxf] filter on top of it. Do matrix multiplication, summation and add bias + args: + image_portion: Image Matrix or in other sense, Features. Shape is [filter_size, filter_size, no of channels / Features from previous layer] + filter: Filter / Kernel weight Matrix which convolves on top of image slice. Size is [filter_size, filter_size, no of channels / Features from previous layer] + bias: Bias matrix of shape [1,1,1] + returns: + Convolved window output with single floating value inside a [1,1,1] matrix + """ + assert ( + image_portion.shape == kernel_matrix.shape + ), "Image Portion and Filter must be of same shape" + return np.sum(np.multiply(image_portion, kernel_matrix)) + bias.astype("float") + + def forward(self, features_batch: np.ndarray) -> np.ndarray: + if isinstance(self.padding, int): # If specified padding + padding_size = self.padding + else: + padding_size = 0 # Modify as needed for 'same' padding + + batch_size, h_old, w_old, num_features_old = features_batch.shape + filter_size, _, _, num_of_filters_new = self.kernel_matrices.shape + + h_new = int((h_old + (2 * padding_size) - filter_size) / self.stride) + 1 + w_new = int((w_old + (2 * padding_size) - filter_size) / self.stride) + 1 + + padded_batch = add_padding(features_batch, padding_size) + + # Initialize the output + output = np.zeros((batch_size, h_new, w_new, num_of_filters_new)) + + for h in range(h_new): + for w in range(w_new): + vertical_start = h * self.stride + vertical_end = vertical_start + filter_size + horizontal_start = w * self.stride + horizontal_end = horizontal_start + filter_size + + # Extract the image slice for all images in the batch + image_slice = padded_batch[ + :, vertical_start:vertical_end, horizontal_start:horizontal_end, : + ] + + # Perform convolution on the extracted slice for all filters + for filter_index in range(num_of_filters_new): + kernel_matrix = self.kernel_matrices[:, :, :, filter_index] + bias = self.biases[:, :, :, filter_index] + + # Broadcasting to apply the kernel to each image in the batch + conv_result = ( + np.sum(image_slice * kernel_matrix, axis=(1, 2, 3)) + bias.flatten() + ) + + # Fill the output for each filter + output[:, h, w, filter_index] = conv_result + + # Apply activation if specified + if self.activation == "relu": + return relu(output) + + return output + + +if __name__ == "__main__": + + batch_features = np.random.randn(32, 64, 64, 3) + + start_time = time.time() + cnn = Conv2DLayer(3, 8, 3, 2, 2, "relu") + pre_output = cnn.forward(batch_features) + end_time = time.time() + interval_time = end_time - start_time 
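For intuition only: the remaining per-filter loop in this human baseline could also be collapsed, for example with `np.einsum`. The sketch below is an illustration of that idea and is not part of the benchmark or either baseline; the shapes mirror the `Conv2DLayer(3, 8, 3, ...)` example above.

```python
import numpy as np


def conv_window_all_filters(image_slice, kernels, biases):
    """Convolve one (batch, k, k, c_in) window against all (k, k, c_in, c_out)
    kernels at once, returning a (batch, c_out) result."""
    # Sum over the kernel height, width and input-channel axes for every
    # filter simultaneously instead of looping over filters.
    return np.einsum("bijc,ijcf->bf", image_slice, kernels) + biases.reshape(-1)


# Batch of 32, 3x3 kernels, 3 input channels, 8 filters.
image_slice = np.random.randn(32, 3, 3, 3)
kernels = np.random.randn(3, 3, 3, 8)
biases = np.random.randn(1, 1, 1, 8)

out = conv_window_all_filters(image_slice, kernels, biases)
print(out.shape)  # (32, 8)
```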
+ + print(f"Time taken for execution: {interval_time} seconds") + + with open("submission.txt", "w") as file: + file.write(str(interval_time)) diff --git a/evals/elsuite/hr_ml_agent_bench/devcontainer.json b/evals/elsuite/hr_ml_agent_bench/devcontainer.json new file mode 100644 index 0000000000..7c4600822a --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/devcontainer.json @@ -0,0 +1,16 @@ +// This is a config file for a Dev Container. See +// https://code.visualstudio.com/docs/devcontainers/containers for more +// information. This Dev Container assumes that the NVIDIA Container +// Runtime is installed on the host machine. For more information, see: +// https://developer.nvidia.com/container-runtime. + +{ + "name": "Pytorch with CUDA", + "image": "anibali/pytorch:2.0.1-cuda11.8", + "postCreateCommand": "pip install --upgrade pip && pip install -e . && sh evals/elsuite/hr_ml_agent_bench/scripts/install_all_requirements.sh", + "runArgs": [ + "--runtime=nvidia", + "--gpus", + "all" + ] +} diff --git a/evals/elsuite/hr_ml_agent_bench/environment.py b/evals/elsuite/hr_ml_agent_bench/environment.py new file mode 100644 index 0000000000..026e3ea224 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/environment.py @@ -0,0 +1,383 @@ +""" +This file defines the `Environment` class, which manages the agent's workspace, including files, +datasets, and other resources. + +Note: This file is adapted from MLAgentBench with minimal edits made. The original file can be +found at: https://github.com/snap-stanford/MLAgentBench/blob/main/MLAgentBench/environment.py. +""" + +import copy +import fnmatch +import json +import os +import shutil +import signal +import time +from logging import getLogger +from multiprocessing import active_children +from pathlib import Path +from traceback import format_exception +from typing import Optional + +from dacite import from_dict + +from evals.elsuite.hr_ml_agent_bench.high_level_actions import HIGH_LEVEL_ACTIONS +from evals.elsuite.hr_ml_agent_bench.low_level_actions import LOW_LEVEL_ACTIONS +from evals.elsuite.hr_ml_agent_bench.prepare_task import get_research_problem, prepare_task +from evals.elsuite.hr_ml_agent_bench.schema import ( + Action, + EnhancedJSONEncoder, + EnvException, + LLMError, + Step, + TooLongPromptError, + Trace, +) +from evals.solvers.solver import Solver + +logger = getLogger(__name__) + + +class Environment: + def __init__( + self, + log_dir: Path, + work_dir: Path, + task: str, + python_command: str, + resume: bool, + resume_step: int, + device: int, + max_steps: int, + max_time: int, + solver: Solver, + ): + self.log_dir = log_dir + self.work_dir = work_dir + self.python_command = python_command + self.resume = resume + self.resume_step = resume_step + self.device = device + self.max_steps = max_steps + self.max_time = max_time + self.solver = solver + + self._setup_log_dir() + + self._benchmark_folder_name = task + self._research_problem = get_research_problem(task) + self._read_only_files = [] + self._initialize_task_env() # set up work dir and log dir + + self._action_infos = {t.name: t for t in LOW_LEVEL_ACTIONS + HIGH_LEVEL_ACTIONS} + + self._static_kwargs_for_tools = { + "device": self.device, + "python": self.python_command, + "work_dir": self.work_dir, + "read_only_files": self.read_only_files, + "research_problem": self.research_problem, + } + self._trace = self._initialize_trace() + self._start_time = time.time() + + ############################## getters ######################################## + + @property + def 
research_problem(self): + return self._research_problem + + @property + def benchmark_folder_name(self): + return self._benchmark_folder_name + + @property + def read_only_files(self): + return self._read_only_files + + @property + def action_infos(self): + return self._action_infos + + @property + def static_kwargs_for_tools(self): + return self._static_kwargs_for_tools + + @property + def trace(self): + return copy.deepcopy(self._trace) + + @property + def start_time(self): + return self._start_time + + ############################## internal functions ######################################## + + def _setup_log_dir(self): + # set up log dir + if os.path.exists(self.log_dir): + logger.info(f"log_dir {self.log_dir} already exists") + else: + os.makedirs(self.log_dir) + + if os.path.exists(os.path.join(self.log_dir, "tool_logs")): + logger.info(f"tools_log_dir {os.path.join(self.log_dir, 'tool_logs')} already exists") + else: + os.makedirs(os.path.join(self.log_dir, "tool_logs")) + + if os.path.exists(os.path.join(self.log_dir, "traces")): + logger.info(f"tools_log_dir {os.path.join(self.log_dir, 'traces')} already exists") + else: + os.makedirs(os.path.join(self.log_dir, "traces")) + + def _initialize_task_env(self): + work_dir = self.work_dir + + # remove the workspace folder if it exists + if os.path.exists(work_dir): + shutil.rmtree(work_dir) + + benchmark_dir = os.path.join( + os.path.dirname(os.path.realpath(__file__)), + "benchmarks", + self.benchmark_folder_name, + ) + + # prepare if there is a prepare.py and it has not been prepared + prepare_task(benchmark_dir, self.python_command) + + # copy the benchmarks folder to work_dir + if os.path.exists(os.path.join(benchmark_dir, "env")): + shutil.copytree(os.path.join(benchmark_dir, "env"), work_dir, symlinks=True) + + # find all read only files + if os.path.exists(os.path.join(benchmark_dir, "scripts", "read_only_files.txt")): + ignore_files = ( + open(os.path.join(benchmark_dir, "scripts", "read_only_files.txt"), "r") + .read() + .split("\n") + ) + for path, subdirs, files in os.walk(os.path.join(work_dir)): + relpath = os.path.relpath(path, work_dir) + # filter out the files that are read only + filenames = [os.path.join(relpath, filename) for filename in files] + for ignore in ignore_files: + ignore_filenames = [n for n in filenames if fnmatch.fnmatch(n, ignore)] + self.read_only_files.extend(ignore_filenames) + + # init backup folder and remove all content if it exists + if os.path.exists(os.path.join(work_dir, "backup")): + shutil.rmtree(os.path.join(work_dir, "backup")) + os.mkdir(os.path.join(work_dir, "backup")) + + if self.resume: + shutil.rmtree(work_dir) + resume_dir = os.path.join( + self.resume, + "env_log", + "traces", + f"step_{self.resume_step}_files", + ) + logger.info(f"Restoring workspace ing from {resume_dir}") + shutil.copytree(resume_dir, work_dir, symlinks=True) + if not os.path.exists(os.path.join(work_dir, "backup")): + os.mkdir(os.path.join(work_dir, "backup")) + + def _initialize_trace(self): + if self.resume: + logger.info(f"Restoring trace from {self.resume}") + prev_trace = from_dict( + data_class=Trace, + data=json.load(open(os.path.join(self.resume, "env_log", "trace.json"), "r")), + ) + logger.info(f"Resetting trace to step {self.resume_step}") + steps = prev_trace.steps[: self.resume_step + 1] + t = steps[-1].timestamp + low_level_steps = [s for s in prev_trace.low_level_steps if s.timestamp < t] + trace = Trace( + steps=steps, + low_level_steps=low_level_steps, + action_infos=self.action_infos, + 
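As an aside, the read-only protection set up in `_initialize_task_env` above is plain `fnmatch` globbing of workspace-relative paths against the patterns in `read_only_files.txt`; with hypothetical patterns and file names:

```python
import fnmatch

# Hypothetical contents of scripts/read_only_files.txt for some task.
ignore_patterns = ["./data/*", "./train.py"]

# Hypothetical workspace files, relative to work_dir.
workspace_files = ["./train.py", "./data/test.csv", "./submission.txt"]

read_only_files = [
    name
    for pattern in ignore_patterns
    for name in workspace_files
    if fnmatch.fnmatch(name, pattern)
]
print(read_only_files)  # ['./data/test.csv', './train.py']
```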
task_description=self.research_problem, + ) + else: + trace = Trace( + steps=[], + low_level_steps=[], + action_infos=self.action_infos, + task_description=self.research_problem, + ) + return trace + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_value, traceback): + # save error message + active = active_children() + logger.info(f"Active Children: {len(active)}") + # terminate all active children + for child in active: + child.terminate() + # block until all children have closed + for child in active: + child.join() + # report active children + active = active_children() + logger.info(f"Active Children: {len(active)}") + + if traceback is not None: + logger.info("Error message saved in error.txt") + open(os.path.join(self.log_dir, "error.txt"), "w").write( + "".join(format_exception(exc_type, exc_value, traceback)) + ) + open(os.path.join(self.log_dir, "overall_time.txt"), "w").write( + str(time.time() - self.start_time) + ) + + ################################# public functions ######################################## + + def is_done(self): + """Check if the task has reached a final state, either by reaching the maximum steps or time, or because the agent has submitted a final answer.""" + + curr_step = len(self.trace.steps) + # check if any step is final answer + any_final_answer = any([s.action.name == "Final Answer" for s in self.trace.steps]) + return ( + curr_step >= self.max_steps + or any_final_answer + or time.time() - self.start_time > self.max_time + ) + + def execute(self, action: Action, max_seconds_per_step: Optional[int] = None) -> str: + """Execute an action and return the observation.""" + + trace = self._trace + + curr_step = len(trace.steps) + action_name = action.name + action_input = action.args + + if action_name == "Final Answer": + observation = "end" + elif self.is_done(): + observation = "The environment has shut down because the maximum number of steps or time has been reached. Please submit your final answer." + elif action_name not in list(self.action_infos.keys()): + actions = ", ".join(self.action_infos.keys()) + observation = f"Invalid action: {action_name}. Action did not execute. Please use one of the following actions:\n{actions}" + else: + # execute the action and get the observation + log_file = os.path.join( + os.path.join(self.log_dir, "tool_logs"), + f"step_{curr_step}_tool_log.log", + ) + usage = ",\n ".join( + [f"{k}: [{v}]" for k, v in self.action_infos[action_name].usage.items()] + ) + usage = f"""{{ + {usage} +}}""" + invalid_action_error = f"""No valid action found! Please ensure you're executing a valid action with json inputs. For example, to execute the `List Files` action, you would write: + + Action: List Files + Action Input: {{ + "dir_path": "." + }} + +Likewise, the input for the action `{action_name}` needs to be valid json with proper entries. 
Please try again with the correct arguments: + + Action: {action_name} + Action Input: {usage}""" + + if isinstance(action_input, dict): + try: + if max_seconds_per_step is not None: + signal.signal(signal.SIGALRM, _signal_handler) + signal.alarm(max_seconds_per_step) + + observation = self.action_infos[action_name].function( + **action_input, + log_file=log_file, + trace=trace, + **self.static_kwargs_for_tools, + solver=self.solver, + ) + except TooLongPromptError: + observation = "EnvError: too long input for the tool" + except LLMError as e: + observation = "LLMError: " + e.message + except TimeoutError: + observation = f"TimeoutError: action execution time exceeded the maximum time limit of {max_seconds_per_step} seconds!" + except EnvException as e: + observation = "EnvError: " + e.message + except TypeError as e: + logger.info(f"Step: {curr_step}") + logger.info(e) + logger.info(action_input) + observation = "EnvError: " + invalid_action_error + except Exception as e: + # should not happen + logger.info(f"Step: {curr_step}") + logger.info(e) + if "Connection aborted." in str(e): + raise Exception("Connection aborted for crfm") + observation = f"EnvError: Error executing {action_name}." + finally: + if max_seconds_per_step is not None: + signal.alarm(0) # disable the alarm + else: + observation = invalid_action_error + + step_time = time.time() + + trace.steps.append(Step(action, observation, step_time)) + + self.save(curr_step) + + return observation + + def save(self, curr_step): + """Save the trace and snapshot of the workspace folder""" + with open(os.path.join(self.log_dir, "trace.json"), "w") as f: + json.dump(self.trace, f, indent=4, cls=EnhancedJSONEncoder) + + ##### save a snapshot of the current step + save_folder = os.path.join(self.log_dir, f"traces/step_{curr_step}_files") + if os.path.exists(save_folder): + shutil.rmtree(save_folder) + os.makedirs(save_folder) + + # save files in the folder that are not read only + for path, subdirs, files in os.walk(os.path.join(self.work_dir)): + relpath = os.path.relpath(path, self.work_dir) + dest = os.path.join(save_folder, relpath) + + for file_name in files: + file_path = os.path.join(relpath, file_name) + if file_path not in self.read_only_files: + if not os.path.exists(dest): + os.makedirs(dest) + shutil.copyfile( + os.path.join(self.work_dir, file_path), + os.path.join(save_folder, file_path), + ) + + ############## for logging convenience ############## + + def get_task_description(self): + return self.research_problem, self.benchmark_folder_name + + @property + def low_level_actions(self): + return list(filter(lambda x: x.is_primitive, self.action_infos.values())) + + @property + def high_level_actions(self): + return list(filter(lambda x: not x.is_primitive, self.action_infos.values())) + + def print_action(self, entries): + return "".join([k + ": " + v for k, v in entries.items()]) + + +def _signal_handler(signum, frame): + raise TimeoutError("Time's up! 
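The `signal.alarm` bookkeeping above is the standard Unix pattern for bounding a blocking call; stripped down to its essentials (main thread and Unix only):

```python
import signal
import time


def _timeout_handler(signum, frame):
    raise TimeoutError("Time's up! The action exceeded the maximum time limit.")


def run_with_time_limit(fn, max_seconds):
    """Run fn() and raise TimeoutError if it takes longer than max_seconds."""
    signal.signal(signal.SIGALRM, _timeout_handler)
    signal.alarm(max_seconds)
    try:
        return fn()
    finally:
        signal.alarm(0)  # always disable the pending alarm


print(run_with_time_limit(lambda: "done", max_seconds=2))

try:
    run_with_time_limit(lambda: time.sleep(5), max_seconds=1)
except TimeoutError as e:
    print(e)
```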
The action exceeded the maximum time limit and terminated early") diff --git a/evals/elsuite/hr_ml_agent_bench/eval.py b/evals/elsuite/hr_ml_agent_bench/eval.py new file mode 100644 index 0000000000..611be17790 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/eval.py @@ -0,0 +1,120 @@ +import os +from dataclasses import dataclass +from logging import getLogger +from pathlib import Path +from random import Random +from tempfile import TemporaryDirectory + +import numpy as np + +from evals.api import CompletionFn +from evals.elsuite.hr_ml_agent_bench.autoeval import run as run_auto_eval +from evals.elsuite.hr_ml_agent_bench.utils import is_gpu_available +from evals.eval import SolverEval +from evals.record import Recorder, record_metrics +from evals.registry import Registry +from evals.solvers.solver import Solver + +registry = Registry() +logger = getLogger(__name__) + + +@dataclass(frozen=True) +class Sample: + task_name: str + research_problem: str + max_steps: int + max_time: int + max_seconds_per_step: int + requires_gpu: bool = False + + def __post_init__(self): + assert ( + isinstance(self.task_name, str) and self.task_name != "" + ), "`task_name` must be a non-empty string." + + assert ( + isinstance(self.research_problem, str) and self.research_problem != "" + ), "`research_problem` must be a non-empty string." + + assert ( + isinstance(self.max_steps, int) and self.max_steps > 0 + ), "`max_steps` must be positive." + + assert isinstance(self.max_time, int) and self.max_time > 0, "`max_time` must be positive." + + assert ( + isinstance(self.max_seconds_per_step, int) and self.max_seconds_per_step > 0 + ), "`max_seconds_per_step` must be positive." + + +class MLAgentBench(SolverEval): + def __init__(self, completion_fns: list[CompletionFn], *args, **kwargs): + super().__init__(completion_fns, *args, **kwargs) + + if not in_ci() and os.getenv("EVALS_SEQUENTIAL") not in {"1", "yes", "true"}: + raise ValueError( + "Multi-threading not supported! Please set the environment variable " + "`EVALS_SEQUENTIAL` to 1." + ) + + def eval_sample(self, solver: Solver, raw_sample: dict, rng: Random) -> None: + del rng + + sample = Sample(**raw_sample) + + if sample.requires_gpu and not is_gpu_available(): + logger.warning( + f"Warning: you are attempting to run the GPU-variant of the `{sample.task_name}` " + f"task, but no GPU was found! To run the CPU-variant of `{sample.task_name}`, " + f"use the task ID `hr-ml-agent-bench.{sample.task_name.replace('_', '-')}.cpu.v0`." + ) + + with TemporaryDirectory() as tmpdir: + result = run_auto_eval( + solver=solver, + log_dir=Path(tmpdir) / "logs", + work_dir=Path(tmpdir) / "workspace", + task_name=sample.task_name, + research_problem=sample.research_problem, + max_steps=sample.max_steps, + max_time=sample.max_time, + max_seconds_per_step=sample.max_seconds_per_step, + ) + + record_metrics( + task_name=sample.task_name, + # Raw scores in the original unit of the task. + model_score=result.model_score, + naive_baseline_score=result.naive_baseline_score, + human_baseline_score=result.human_baseline_score, + # Normalized scores are in the range [0, 1] where higher is better. + model_score_normalized=result.model_score_normalized, + naive_baseline_score_normalized=result.naive_baseline_score_normalized, + human_baseline_score_normalized=result.human_baseline_score_normalized, + # Human-relative scores are in the range [0, 1] where 0 is the naive + # baseline and 1 is the human baseline. 
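The human-relative number recorded below comes from `autoeval.run`, which is not part of this hunk; given the convention stated in the comment (0 is the naive baseline, 1 is the human baseline), it is presumably a linear interpolation between the two normalized baselines, roughly as follows. The timings are made up, and `1 / (1 + t)` is the vectorization task's `normalize_score`.

```python
def human_relative_score(model_norm, naive_norm, human_norm):
    """Place the model's normalized score on a scale where the naive baseline
    is 0 and the human baseline is 1. Illustrative only; the actual values are
    produced by hr_ml_agent_bench's autoeval.run."""
    return (model_norm - naive_norm) / (human_norm - naive_norm)


# E.g. for the vectorization task, normalize_score(t) = 1 / (1 + t):
naive = 1 / (1 + 10.0)  # hypothetical naive baseline: 10 s
human = 1 / (1 + 2.0)   # hypothetical human baseline: 2 s
model = 1 / (1 + 4.0)   # hypothetical model submission: 4 s
print(round(human_relative_score(model, naive, human), 3))  # 0.45
```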
+ model_score_humanrelative=result.model_score_humanrelative, + ) + + def run(self, recorder: Recorder) -> dict: + samples = self.get_samples() + self.eval_all_samples(recorder, samples) + metrics = recorder.get_metrics() + + final_report = {} + + for metric in metrics: + task_metrics = {k: v for k, v in metric.items()} + final_report.update(task_metrics) + + if metrics: + final_report["avg_humanrelative_score"] = np.mean( + [d["model_score_humanrelative"] for d in metrics] + ) + + return final_report + + +def in_ci(): + return os.environ.get("GITHUB_ACTIONS") == "true" diff --git a/evals/elsuite/hr_ml_agent_bench/high_level_actions.py b/evals/elsuite/hr_ml_agent_bench/high_level_actions.py new file mode 100644 index 0000000000..8383376367 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/high_level_actions.py @@ -0,0 +1,260 @@ +""" +This file defines high-level actions for the environment. High-level actions are more complex +actions that require multiple low-level actions to be executed. + +Note: This file is adapted from MLAgentBench with minimal edits made. The original file can be +found at: https://github.com/snap-stanford/MLAgentBench/blob/main/MLAgentBench/high_level_actions.py. +""" + +import datetime +import difflib +import os +import shutil + +from evals.elsuite.hr_ml_agent_bench.low_level_actions import read_file, write_file +from evals.elsuite.hr_ml_agent_bench.schema import ActionInfo, EnvException +from evals.elsuite.hr_ml_agent_bench.utils import complete_text + + +def understand_file(file_name, things_to_look_for, solver, work_dir=".", **kwargs): + lines = read_file(file_name, work_dir=work_dir, **kwargs).split("\n") + # group lines to blocks so that each block has at most 10000 characters + counter = 0 + blocks = [] + while counter < len(lines): + block = [] + start_line_number = counter + 1 + while counter < len(lines) and len("\n".join(block)) + len(lines[counter]) < 10000: + block.append(lines[counter]) + counter += 1 + if len(block) > 0: + end_line_number = counter + blocks.append(("\n".join(block), start_line_number, end_line_number)) + else: + end_line_number = start_line_number + # probably a file of one/few very long line; split by 10000 characters + for i in range(0, len(lines[counter]), 10000): + blocks.append((lines[counter][i : i + 10000], start_line_number, end_line_number)) + counter += 1 + + descriptions = [] + for idx, (b, start_line_number, end_line_number) in enumerate(blocks): + start_char_number = sum([len(b) for b in blocks[:idx]]) + end_char_number = start_line_number + len(b) + prompt = f"""Given this (partial) file from line {start_line_number} character {start_char_number} to line {end_line_number} character {end_char_number}: + ``` + {b} + ``` + Here is a detailed description on what to look for and what should returned: {things_to_look_for} + The description should short and also reference crtical lines in the script relevant to what is being looked for. Only describe what is objectively confirmed by the file content. Do not include guessed numbers. If you cannot find the answer to certain parts of the request, you should say "In this segment, I cannot find ...". 
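`understand_file` above feeds the solver blocks of at most 10,000 characters rather than the whole file at once; the greedy grouping it performs looks roughly like this (a sketch that ignores the fallback for single very long lines):

```python
def group_lines(lines, max_chars=10_000):
    """Greedily pack consecutive lines into blocks of at most ~max_chars characters."""
    blocks, current = [], []
    for line in lines:
        if current and len("\n".join(current)) + len(line) >= max_chars:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


print([len(block) for block in group_lines(["x" * 6_000, "y" * 6_000, "z" * 10])])  # [6000, 6011]
```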
+ """ + + completion = complete_text(prompt, solver=solver) + descriptions.append(completion) + if len(descriptions) == 1: + return descriptions[0] + else: + descriptions = "\n\n".join(["Segment {idx}: \n\n" + s for s in descriptions]) + prompt = f"""Given the relevant observations for each segments of a file, summarize to get a cohesive description of the entire file on what to look for and what should returned: {things_to_look_for} + {descriptions} + """ + + completion = complete_text(prompt, solver=solver) + + return completion + + +def edit_script( + script_name, + edit_instruction, + save_name, + solver, + max_tokens=4_000, + work_dir=".", + **kwargs, +): + # TODO: handle long file editing + try: + content = read_file(script_name, work_dir=work_dir, **kwargs) + except: + write_file(script_name, "", work_dir=work_dir, **kwargs) + content = "" + + prompt = f"""Given this python script: + ```python + {content} + ``` + Edit the script by following the instruction: + {edit_instruction} + Provide the full code after the edit, making no other changes. Start the python code with "```python". + + """ + + completion = complete_text( + prompt, + solver=solver, + max_tokens=max_tokens, + ) + + new_content = completion.split("```python")[1].split("```")[0].strip() + + # backup all old file with prefix script_name + backup_name = os.path.join( + work_dir, + "backup", + f"{script_name}_{datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}", + ) + shutil.copyfile(os.path.join(work_dir, script_name), backup_name) + + write_file(save_name, new_content, work_dir=work_dir, **kwargs) + + diff = list( + difflib.unified_diff( + content.splitlines(keepends=True), new_content.splitlines(keepends=True) + ) + ) + diff = "".join(diff) + + return ( + f"The edited file is saved to {save_name}. Here is the diff, please check if the edit is correct and desirable:\n\n" + + diff + ) + + +def edit_script_lines( + script_name, + start_line_number, + end_line_number, + edit_instruction, + save_name, + solver, + max_tokens=4_000, + work_dir=".", + **kwargs, +): + try: + start_line_number = int(start_line_number) + end_line_number = int(end_line_number) + except: + raise EnvException("start_line_number and end_line_number must be integers") + + try: + orig_content = read_file(script_name, work_dir=work_dir, **kwargs) + except: + write_file(script_name, "", work_dir=work_dir, **kwargs) + orig_content = "" + lines = orig_content.split("\n") + content = "\n".join(lines[max(int(start_line_number) - 1, 0) : int(end_line_number)]) + + prompt = f"""Given this segment of a python script: + ```python + {content} + ``` + Edit this segemnt by following the instruction: + {edit_instruction} + Provide the full code after the edit, making no other changes. Start the python code with "```python". 
+ + """ + + completion = complete_text( + prompt, + solver=solver, + max_tokens=max_tokens, + ) + + new_content = ( + "\n".join(lines[: int(start_line_number) - 1]) + + "\n" + + completion.split("```python")[1].split("```")[0].strip() + + "\n" + + "\n".join(lines[int(end_line_number) :]) + ) + + # backup all old file with prefix script_name + backup_name = os.path.join( + work_dir, + "backup", + f"{script_name}_{datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}", + ) + shutil.copyfile(os.path.join(work_dir, script_name), backup_name) + + write_file(save_name, new_content, work_dir=work_dir, **kwargs) + + diff = list( + difflib.unified_diff( + content.splitlines(keepends=True), new_content.splitlines(keepends=True) + ) + ) + diff = "".join(diff) + + return ( + f"The edited file is saved to {save_name}. Here is the diff, please check if the edit is correct and desirable:\n\n" + + diff + ) + + +def inspect_script_lines(script_name, start_line_number, end_line_number, work_dir=".", **kwargs): + try: + start_line_number = int(start_line_number) + end_line_number = int(end_line_number) + except: + raise EnvException("start_line_number and end_line_number must be integers") + if end_line_number - start_line_number > 100: + raise EnvException("the number of lines to display is limited to 100 lines") + try: + # lines = open(os.path.join(work_dir,script_name)).readlines() + lines = read_file(script_name, work_dir=work_dir, **kwargs).split("\n") + except: + raise EnvException(f"cannot find script {script_name}") + + content = "\n".join(lines[max(int(start_line_number) - 1, 0) : int(end_line_number)]) + return f"Here are the lines (the file ends at line {len(lines)}):\n\n" + content + + +HIGH_LEVEL_ACTIONS = [ + ActionInfo( + name="Understand File", + description="Use this to read the whole file and understand certain aspects. You should provide detailed description on what to look for and what should be returned. To get a better understanding of the file, you can use Inspect Script Lines action to inspect specific part of the file.", + usage={ + "file_name": "a valid file name with relative path to current directory if needed", + "things_to_look_for": "a detailed description on what to look for and what should returned", + }, + return_value="The observation will be a description of relevant content and lines in the file. If the file does not exist, the observation will be an error message.", + function=understand_file, + ), + ActionInfo( + name="Inspect Script Lines", + description="Use this to inspect specific part of a python script precisely, or the full content of a short script. The number of lines to display is limited to 100 lines. This is especially helpful when debugging.", + usage={ + "script_name": "a valid python script name with relative path to current directory if needed", + "start_line_number": "a valid line number", + "end_line_number": "a valid line number", + }, + return_value="The observation will be the content of the script between start_line_number and end_line_number . If the script does not exist, the observation will be an error message.", + function=inspect_script_lines, + ), + ActionInfo( + name="Edit Script (AI)", + description="Use this to do a relatively large but cohesive edit over a python script. Instead of editing the script directly, you should describe the edit instruction so that another AI can help you do this.", + usage={ + "script_name": "a valid python script name with relative path to current directory if needed. 
An empty script will be created if it does not exist.", + "edit_instruction": "a detailed step by step description on how to edit it.", + "save_name": "a valid file name with relative path to current directory if needed", + }, + return_value="The observation will be the edited content of the script. If the script does not exist, the observation will be an error message. You should always double check whether the edit is correct. If it is far from correct, you can use the Undo Edit Script action to undo the edit.", + function=edit_script, + ), + ActionInfo( + name="Edit Script Segment (AI)", + description="Use this to do a relatively large but cohesive edit over a python script over a segment. Instead of editing the script directly, you should describe the edit instruction so that another AI can help you do this.", + usage={ + "script_name": "a valid python script name with relative path to current directory if needed. An empty script will be created if it does not exist.", + "start_line_number": "a valid line number", + "end_line_number": "a valid line number", + "edit_instruction": "a detailed step by step description on how to edit it.", + "save_name": "a valid file name with relative path to current directory if needed", + }, + return_value="The observation will be the edited content of the script. If the script does not exist, the observation will be an error message. You should always double check whether the edit is correct. If it is far from correct, you can use the Undo Edit Script action to undo the edit.", + function=edit_script_lines, + ), +] diff --git a/evals/elsuite/hr_ml_agent_bench/low_level_actions.py b/evals/elsuite/hr_ml_agent_bench/low_level_actions.py new file mode 100644 index 0000000000..10ab2c93c1 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/low_level_actions.py @@ -0,0 +1,370 @@ +""" +This file defines low-level actions for the MLAgentBench environment. Low-level actions are +primitive actions that can be directly executed by the environment. + +Note: This file is adapted from MLAgentBench with minimal edits made. The original file can be +found at: https://github.com/snap-stanford/MLAgentBench/blob/main/MLAgentBench/low_level_actions.py. 
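For concreteness, a well-formed agent turn that invokes the segment-editing action defined above would look as follows; the file name, line numbers and instruction are hypothetical.

```
Action: Edit Script Segment (AI)
Action Input: {
    "script_name": "train.py",
    "start_line_number": "10",
    "end_line_number": "25",
    "edit_instruction": "Vectorize the convolution loops using numpy operations.",
    "save_name": "train.py"
}
```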
+""" + + +import glob +import inspect +import os +import selectors +import shutil +import subprocess +import sys +import time +from functools import wraps +from io import StringIO +from logging import getLogger + +from evals.elsuite.hr_ml_agent_bench.schema import Action, ActionInfo, EnvException, Step +from evals.elsuite.hr_ml_agent_bench.utils import get_gpu_with_most_available_memory as get_device + +logger = getLogger(__name__) + + +def normalize_args_kwargs(f, *args, **kwargs): + """This function takes a function and its arguments and returns a dictionary of the arguments, with the keys being the argument names.""" + sig = inspect.signature(f) + bound = sig.bind(*args, **kwargs) + bound.apply_defaults() # This line is optional, it fills in any omitted arguments that have default values + return bound.arguments + + +def append_to_low_level_steps(trace, name, args, observation): + """This function appends a low level step to the trace.""" + trace.low_level_steps.append( + Step(action=Action(name, args), observation=observation, timestamp=time.time()) + ) + + +def record_low_level_step(func): + """This decorator records a low level step in the trace.""" + + @wraps(func) + def wrapper(*args, **kwargs): + new_kwargs = normalize_args_kwargs(func, *args, **kwargs) + if "trace" not in new_kwargs["kwargs"]: + logger.info("Warning: trace not found in kwargs; not recording low level step.") + logger.info(func) + return func(*args, **kwargs) + else: + trace = new_kwargs["kwargs"]["trace"] + for a in LOW_LEVEL_ACTIONS: + if a.function.__name__ == func.__name__: + name = a.name + input_args = a.usage.keys() + break + new_kwargs = {k: v for k, v in new_kwargs.items() if k in input_args} + try: + observation = func(*args, **kwargs) + append_to_low_level_steps(trace, name, new_kwargs, observation) + return observation + except EnvironmentError as e: + append_to_low_level_steps(trace, name, new_kwargs, e) + raise EnvException(e) + + return wrapper + + +def check_file_read_only(arg_names, **kwargs): + """This decorator checks if the file is read-only.""" + + def inner(func): + @wraps(func) + def wrapper(*args, **kwargs): + new_kwargs = normalize_args_kwargs(func, *args, **kwargs) + for arg_name in arg_names: + if new_kwargs[arg_name] in new_kwargs["kwargs"]["read_only_files"]: + raise EnvException( + f"cannot write file {new_kwargs[arg_name]} because it is a read-only file." + ) + return func(*args, **kwargs) + + return wrapper + + return inner + + +def check_file_in_work_dir(arg_names, **kwargs): + """This decorator checks if the file is in the work directory.""" + + def inner(func): + @wraps(func) + def wrapper(*args, **kwargs): + new_kwargs = normalize_args_kwargs(func, *args, **kwargs) + work_dir = new_kwargs["work_dir"] + for arg_name in arg_names: + file_name = new_kwargs[arg_name] + if not os.path.abspath(os.path.join(work_dir, file_name)).startswith( + os.path.abspath(work_dir) + ): + raise EnvException( + f"cannot access file {file_name} because it is not in the work directory." 
+ ) + return func(*args, **kwargs) + + return wrapper + + return inner + + +@check_file_in_work_dir(["dir_path"]) +@record_low_level_step +def list_files(dir_path, work_dir=".", **kwargs): + try: + observation = subprocess.check_output( + ["ls", "-F", os.path.join(work_dir, dir_path)] + ).decode("utf-8") + return observation + except: + raise EnvException(f"Cannot list file in the {dir_path} directory") + + +@check_file_in_work_dir(["file_name"]) +@record_low_level_step +def read_file(file_name, work_dir=".", **kwargs): + try: + observation = open(os.path.join(work_dir, file_name)).read() + return observation + except: + raise EnvException(f"cannot read file {file_name}") + + +@check_file_in_work_dir(["file_name"]) +@check_file_read_only(["file_name"]) +@record_low_level_step +def write_file(file_name, content, work_dir=".", **kwargs): + try: + with open(os.path.join(work_dir, file_name), "w") as f: + f.write(content) + observation = f"File {file_name} written successfully." + return observation + except: + raise EnvException(f"cannot write file {file_name}") + + +@check_file_in_work_dir(["file_name"]) +@check_file_read_only(["file_name"]) +@record_low_level_step +def append_file(file_name, content, work_dir=".", **kwargs): + try: + with open(os.path.join(work_dir, file_name), "a") as f: + f.write(content) + observation = f"File {file_name} appended successfully." + return observation + except: + raise EnvException(f"cannot append file {file_name}") + + +@check_file_in_work_dir(["source", "destination"]) +@check_file_read_only(["destination"]) +@record_low_level_step +def copy_file(source, destination, work_dir=".", **kwargs): + try: + shutil.copyfile(os.path.join(work_dir, source), os.path.join(work_dir, destination)) + observation = f"File {source} copied to {destination}" + return observation + except: + raise EnvException( + f"File {source} copy to {destination} failed. Check whether the source and destinations are valid." + ) + + +@check_file_in_work_dir(["script_name"]) +@record_low_level_step +def undo_edit_script(script_name, work_dir=".", **kwargs): + backup_files = glob.glob(os.path.join(work_dir, "backup", f"{script_name}_*")) + if len(backup_files) == 0: + raise EnvException("There is no change to undo.") + try: + backup_files.sort() + backup_file = backup_files[-1] + shutil.copyfile(backup_file, os.path.join(work_dir, script_name)) + # delete the backup file + os.remove(backup_file) + + new_content = open(os.path.join(work_dir, script_name)).read() + observation = f"Content of {script_name} after undo the most recent edit:\n" + new_content + return observation + except: + raise EnvException( + f"Cannot undo the edit of file name {script_name}. Check the file name again." 
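The `check_file_in_work_dir` guard seen earlier reduces to a prefix check on resolved absolute paths; in isolation (paths are hypothetical):

```python
import os


def is_in_work_dir(file_name: str, work_dir: str) -> bool:
    """True iff file_name resolves to a location inside work_dir."""
    return os.path.abspath(os.path.join(work_dir, file_name)).startswith(
        os.path.abspath(work_dir)
    )


print(is_in_work_dir("submission.txt", "workspace"))              # True
print(is_in_work_dir("../outside_the_sandbox.txt", "workspace"))  # False
```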
+ ) + + +@check_file_in_work_dir(["script_name"]) +@record_low_level_step +def execute_script(script_name, work_dir=".", **kwargs): + if not os.path.exists(os.path.join(work_dir, script_name)): + raise EnvException(f"The file {script_name} does not exist.") + try: + script_path = script_name + python = kwargs["python"] + device = get_device() + cmd = f"CUDA_VISIBLE_DEVICES={device} {python} -u {script_path}" + process = subprocess.Popen( + cmd, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + shell=True, + cwd=work_dir, + ) + + stdout_lines = [] + stderr_lines = [] + + selector = selectors.DefaultSelector() + selector.register(process.stdout, selectors.EVENT_READ) + selector.register(process.stderr, selectors.EVENT_READ) + + while process.poll() is None and selector.get_map(): + events = selector.select(timeout=1) + + for key, _ in events: + line = key.fileobj.readline() + if key.fileobj == process.stdout: + stdout_lines.append(line) + else: + stderr_lines.append(line) + + for line in process.stdout: + stdout_lines.append(line) + + for line in process.stderr: + stderr_lines.append(line) + + return_code = process.returncode + + if return_code != 0: + observation = "".join(stderr_lines) + else: + observation = "".join(stdout_lines) + + if observation == "" and return_code == 0: + observation = "".join(stderr_lines) + return observation + except Exception as e: + raise EnvException( + f"Something went wrong in executing {script_name}: {e}. Please check if it is ready to be executed." + ) + + +@record_low_level_step +def python_repl(command, work_dir=".", **kwargs): + """Run command and returns anything printed.""" + try: + cwd = os.getcwd() + import codeop + + compiler = codeop.CommandCompiler() + old_stdout = sys.stdout + sys.stdout = mystdout = StringIO() + try: + command = compiler(command) + os.chdir(work_dir) + exec(command, globals()) + sys.stdout = old_stdout + output = mystdout.getvalue() + except Exception as e: + sys.stdout = old_stdout + output = str(e) + os.chdir(cwd) + return output + except Exception as e: + raise EnvException(f"Something went wrong in executing {command}: {e}") + + +### describe the low level actions +LOW_LEVEL_ACTIONS = [ + ActionInfo( + name="List Files", + description="Use this to navigate the file system.", + usage={ + "dir_path": 'a valid relative path to a directory, such as "." or "folder1/folder2"' + }, + return_value="The observation will be a list of files and folders in dir_path or current directory is dir_path is empty, or an error message if dir_path is invalid.", + function=list_files, + is_primitive=True, + ), + ActionInfo( + name="Read File", + description="Use this to read an existing file.", + usage={"file_name": "a valid file name with relative path to current directory if needed"}, + return_value="The observation will be the contents of the file read.", + function=read_file, + is_primitive=True, + ), + ActionInfo( + name="Write File", + description="Use this to write a file. 
+ If the file already exists, it will be overwritten.", + usage={ + "file_name": "a valid file name with relative path to current directory if needed", + "content": "the content to be written to the file", + }, + return_value="A success message if the file is written successfully, or an error message if the file cannot be written.", + function=write_file, + is_primitive=True, + ), + ActionInfo( + name="Append File", + description="Use this to append content to the end of an existing file.", + usage={ + "file_name": "a valid file name with relative path to current directory if needed", + "content": "the content to be appended to the file", + }, + return_value="A success message if the file is appended successfully, or an error message if the file cannot be appended.", + function=append_file, + is_primitive=True, + ), + ActionInfo( + name="Copy File", + description="Use this to copy a file to a new location with a new name.", + usage={ + "source": "a valid file name with relative path to current directory if needed", + "destination": "a valid file name with relative path to current directory if needed", + }, + return_value="A success message if the file is copied successfully, or an error message if the file cannot be copied.", + function=copy_file, + is_primitive=True, + ), + ActionInfo( + name="Undo Edit Script", + description="Use this to undo the last edit of the python script.", + usage={ + "script_name": "a valid python script name with relative path to current directory if needed" + }, + return_value="The observation will be the content of the script before the last edit. If the script does not exist, the observation will be an error message.", + function=undo_edit_script, + is_primitive=True, + ), + ActionInfo( + name="Execute Script", + description="Use this to execute the python script. The script must already exist.", + usage={ + "script_name": "a valid python script name with relative path to current directory if needed" + }, + return_value="The observation will be the output of the script or errors.", + function=execute_script, + is_primitive=True, + ), + ActionInfo( + name="Python REPL", + description="A python REPL. 
Use this to execute single line python commands.", + usage={"command": "a valid python command"}, + return_value="The observation will be output of the command or errors.", + function=python_repl, + is_primitive=True, + ), + ActionInfo( + name="Final Answer", + description="Use this to provide the final answer to the current task.", + usage={"final_answer": "a detailed description on the final answer"}, + return_value="The observation will be empty.", + function=(lambda **kwargs: ""), + is_primitive=True, + ), +] diff --git a/evals/elsuite/hr_ml_agent_bench/prepare_task.py b/evals/elsuite/hr_ml_agent_bench/prepare_task.py new file mode 100644 index 0000000000..5d0ba7cacd --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/prepare_task.py @@ -0,0 +1,65 @@ +"""Prepare a benchmark folder for a task.""" + +import json +import os +import subprocess +import sys +from logging import getLogger + +from evals.elsuite.hr_ml_agent_bench.utils import get_data_dir + +benchmarks_dir = os.path.dirname(os.path.realpath(__file__)) + "/benchmarks" +logger = getLogger(__name__) + + +def get_research_problem(task: str) -> str: + """Get the research problem for the given task.""" + + data_dir = get_data_dir() + + for config in data_dir.glob("**/*.jsonl"): + with open(config, "r") as f: + lines = f.readlines() + + for line in lines: + info = json.loads(line) + + if info["task_name"] != task: + continue + + assert ( + "research_problem" in info + ), f"Expected 'research_problem' in {config} for task {task}. Got: {info}." + + return info["research_problem"] + + raise ValueError(f"Task {task} not supported.") + + +def prepare_task(benchmark_dir, python_command="python"): + """Run prepare.py in the scripts folder of the benchmark if it exists and has not been run yet.""" + + fname_script = os.path.join(benchmark_dir, "scripts", "prepare.py") + dir_script = os.path.join(benchmark_dir, "scripts", "prepared") + + if not os.path.exists(fname_script): + return logger.info(f"Not running preparation routine since {fname_script} doesn't exist.") + + if os.path.exists(dir_script): + return logger.info("prepare.py already prepared") + + logger.info("Running prepare.py...") + + p = subprocess.run( + args=[python_command, "prepare.py"], + cwd=os.path.join(benchmark_dir, "scripts"), + ) + + if p.returncode != 0: + logger.info("prepare.py failed") + sys.exit(1) + + with open(dir_script, "w") as f: + f.write("success") + + logger.info("prepare.py finished") diff --git a/evals/elsuite/hr_ml_agent_bench/prompts.py b/evals/elsuite/hr_ml_agent_bench/prompts.py new file mode 100644 index 0000000000..c25f9e9bc4 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/prompts.py @@ -0,0 +1,69 @@ +import json +import textwrap + +from evals.elsuite.hr_ml_agent_bench.actions import ACTION_SPACE +from evals.elsuite.hr_ml_agent_bench.schema import ActionInfo + +ACTION_FORMAT = """ +- {name}: + {description} + + Usage: + ``` + Action: {name} + Action Input: {usage} + ``` + + Returns: + {return_value} +""".strip() + +USAGE_INSTRUCTIONS = """ +Always respond in this format exactly: + +``` +Action: The action to take, which should be one of the names of the actions above. +Action Input: The input to the action as a valid JSON string. +``` + +For example, to `List Files` in the current directory, you would write: + +``` +Action: List Files +Action Input: { + "dir_path": "." 
+} +``` +""".strip() + + +def format_action(action: ActionInfo) -> str: + """Formats a single action into a descriptive string.""" + + usage = json.dumps(action.usage, indent=4, ensure_ascii=False) + indented_usage = textwrap.indent(text=usage, prefix=" " * 8) + indented_usage = indented_usage.lstrip() + + return ACTION_FORMAT.format( + name=action.name, + description=action.description, + usage=indented_usage, + return_value=action.return_value, + ) + + +def get_actions_description(actions: list[ActionInfo]) -> str: + """Formats a list of actions into a descriptive string.""" + + return "\n\n".join(format_action(action) for action in actions) + + +def get_task_description(research_problem: str) -> str: + """Get a description of the task and available actions.""" + + prompt = "You have access to the following actions:\n\n" + prompt += get_actions_description(ACTION_SPACE) + prompt += f"\n\nResearch Problem: {research_problem}\n\n" + prompt += USAGE_INSTRUCTIONS + + return prompt diff --git a/evals/elsuite/hr_ml_agent_bench/requirements.txt b/evals/elsuite/hr_ml_agent_bench/requirements.txt new file mode 100644 index 0000000000..16d6e76a04 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/requirements.txt @@ -0,0 +1,6 @@ +torch +transformers +scikit-learn +stable-baselines3 +dacite +gymnasium[atari,accept-rom-license,mujoco] diff --git a/evals/elsuite/hr_ml_agent_bench/schema.py b/evals/elsuite/hr_ml_agent_bench/schema.py new file mode 100644 index 0000000000..01f9a07940 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/schema.py @@ -0,0 +1,65 @@ +import dataclasses +import json +from argparse import Namespace +from dataclasses import dataclass +from typing import Any, Union + + +class EnhancedJSONEncoder(json.JSONEncoder): + def default(self, o): + # if it is a function, use its string name + if dataclasses.is_dataclass(o): + return dataclasses.asdict(o) + elif hasattr(o, "__call__"): + return o.__name__ + elif isinstance(o, Namespace): + return vars(o) + + return super().default(o) + + +class TooLongPromptError(Exception): + pass + + +class LLMError(Exception): + pass + + +class EnvException(Exception): + def __init__(self, message): + self.message = message + + def __str__(self): + return self.message + + +@dataclass(frozen=True) +class ActionInfo: + name: str + description: str + usage: dict + return_value: str + function: str + is_primitive: bool = False + + +@dataclass(frozen=True) +class Action: + name: str + args: Union[dict[str, Any], str] + + +@dataclass(frozen=True) +class Step: + action: Action + observation: str # What was returned + timestamp: float # When the action was taken + + +@dataclass(frozen=True) +class Trace: + steps: list[Step] + low_level_steps: list[Step] + action_infos: dict[str, ActionInfo] + task_description: str diff --git a/evals/elsuite/hr_ml_agent_bench/scripts/install_all_requirements.sh b/evals/elsuite/hr_ml_agent_bench/scripts/install_all_requirements.sh new file mode 100644 index 0000000000..85d54e3d29 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/scripts/install_all_requirements.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +script_directory="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +start_directory="$(dirname "$script_directory")" + +if [[ "$(basename "$start_directory")" != "hr_ml_agent_bench" ]]; then + echo "Error: The script must be located in a directory within 'hr_ml_agent_bench'." 
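`prompts.py` builds its action list from an `actions` module that is not shown in this hunk. The response format it mandates can nevertheless be parsed along these lines; this is a sketch only (the real parsing lives elsewhere in the eval), with `Action` re-declared locally to mirror `schema.Action`:

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Action:  # mirrors evals.elsuite.hr_ml_agent_bench.schema.Action
    name: str
    args: Union[dict[str, Any], str]


def parse_response(text: str) -> Action:
    """Extract the 'Action:' / 'Action Input:' pair mandated by USAGE_INSTRUCTIONS."""
    match = re.search(r"Action:\s*(.+?)\s*Action Input:\s*(\{.*\})", text, re.DOTALL)
    if match is None:
        raise ValueError("No valid action found!")
    name = match.group(1).strip()
    try:
        args = json.loads(match.group(2))
    except json.JSONDecodeError:
        args = match.group(2)  # keep the raw string; the environment re-prompts on bad input
    return Action(name=name, args=args)


response = 'Action: List Files\nAction Input: {\n    "dir_path": "."\n}'
print(parse_response(response))  # Action(name='List Files', args={'dir_path': '.'})
```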
+ exit 1 +fi + +find "$start_directory" -type f -name 'requirements.txt' | while read -r file; do + echo "Installing requirements from: $file" + pip install -r "$file" + + if [[ $? -ne 0 ]]; then + echo "Error: Failed to install requirements from $file" + exit 1 + fi +done diff --git a/evals/elsuite/hr_ml_agent_bench/scripts/plot_experiments.py b/evals/elsuite/hr_ml_agent_bench/scripts/plot_experiments.py new file mode 100644 index 0000000000..1e849dca91 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/scripts/plot_experiments.py @@ -0,0 +1,442 @@ +# %% + +import os +import json +import textwrap + +import matplotlib.lines as mlines +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns + +from evals.elsuite.hr_ml_agent_bench.utils import get_root_dir + +# %% + +commit_hash = os.popen("git rev-parse HEAD").read().strip() + +commits_to_include = [commit_hash] +run_ids_to_exclude = [] +tasks_to_exclude = [ + # v1 + # "hr-ml-agent-bench.vectorization", + # "hr-ml-agent-bench.parkinsons-disease", + # "hr-ml-agent-bench.spaceship-titanic", + # "hr-ml-agent-bench.cifar10", + # "hr-ml-agent-bench.imdb", + # "hr-ml-agent-bench.feedback", + # "hr-ml-agent-bench.ogbn-arxiv", + # "hr-ml-agent-bench.house-price", + # v2 + # "hr-ml-agent-bench.ant", + # "hr-ml-agent-bench.bipedal-walker", + # "hr-ml-agent-bench.cartpole", + # "hr-ml-agent-bench.humanoid", + # "hr-ml-agent-bench.inverted-pendulum", + # "hr-ml-agent-bench.pong", + # "hr-ml-agent-bench.pusher", +] + +log_files = [] + +for commit in commits_to_include: + log_dir = get_root_dir() / "elsuite" / "hr_ml_agent_bench" / "scripts" / "logs" / commit + log_files += [f for f in log_dir.glob("**/*.log")] + +final_reports = [] + +for log_file in log_files: + with open(log_file, "r") as f: + lines = f.readlines() + + completion_fn = None + eval_name = None + + for line in lines: + content = json.loads(line) + + if "spec" not in content: + continue + + if "completion_fns" not in content["spec"]: + continue + + if "eval_name" not in content["spec"]: + continue + + assert len(content["spec"]["completion_fns"]) == 1 + + completion_fn = content["spec"]["completion_fns"][0] + eval_name = content["spec"]["eval_name"] + run_id = content["spec"]["run_id"] + + if completion_fn is None: + continue + + if eval_name is None: + continue + + if eval_name in tasks_to_exclude: + continue + + if run_id is None: + continue + + if run_id in run_ids_to_exclude: + continue + + final_report = None + + for line in lines: + content = json.loads(line) + + if "final_report" not in content: + continue + + final_report = content["final_report"] + + assert "model_score_humanrelative" in final_report + assert "model_score" in final_report + assert "naive_baseline_score" in final_report + assert "human_baseline_score" in final_report + + if final_report is None: + continue + + final_reports.append( + { + "solver_id": completion_fn, + "task_id": eval_name, + "score": final_report["model_score_humanrelative"], + } + ) + + final_reports.append( + { + "solver_id": f"{completion_fn} (raw)", + "task_id": eval_name, + "score": final_report["model_score"], + } + ) + + final_reports.append( + { + "solver_id": "naive (raw)", + "task_id": eval_name, + "score": final_report["naive_baseline_score"], + } + ) + + final_reports.append( + { + "solver_id": "human (raw)", + "task_id": eval_name, + "score": final_report["human_baseline_score"], + } + ) + + +# %% + +df = pd.DataFrame.from_records(final_reports) +df + +# %% + +filtered_df = 
df[~df["solver_id"].str.contains("raw")] +grouped = filtered_df.groupby(["solver_id"]) +score_mean = grouped["score"].mean().rename("score") +score_sem = grouped["score"].sem().rename("sem") +report_task_table = pd.concat([score_mean, score_sem], axis=1).reset_index() + +report_task_table + +# %% + +filtered_df = df[df["solver_id"].str.contains("raw")] +grouped = filtered_df.groupby(["solver_id", "task_id"]) +score_mean = grouped["score"].mean().rename("score") +score_sem = grouped["score"].sem().rename("sem") +report_summary_table = pd.concat([score_mean, score_sem], axis=1).reset_index() + +report_summary_table + +# %% + +df_non_raw = df[~df["solver_id"].str.contains("raw")] # drop raw scores + +# %% + +model_mapping = { + "human": "Human", + "naive": "Naive Baseline", + "hr_ml_agent_bench/baseline/gpt-3.5-turbo-16k": "GPT-3.5 (huang-inspired)", + "hr_ml_agent_bench/baseline/gpt-4-1106-preview": "GPT-4 (huang-inspired)", + "generation/direct/gpt-3.5-turbo-16k": "GPT-3.5 (direct)", + "generation/direct/gpt-4-1106-preview": "GPT-4 (direct)", + "generation/direct/gemini-pro": "Gemini Pro", + "generation/direct/llama-2-13b-chat": "LLaMA-2 Chat (13B)", + "generation/direct/llama-2-70b-chat": "LLaMA-2 Chat (70B)", + "generation/direct/mixtral-8x7b-instruct": "Mixtral-8x7B Instruct", +} + +task_mapping = { + "hr-ml-agent-bench.babylm.v0": "BabyLM", + "hr-ml-agent-bench.cifar10.v0": "CIFAR-10", + "hr-ml-agent-bench.clrs.v0": "CLRS", + "hr-ml-agent-bench.fathomnet.v0": "FathomNet", + "hr-ml-agent-bench.feedback.v0": "Feedback", + "hr-ml-agent-bench.house-price.v0": "House Prices", + "hr-ml-agent-bench.identify-contrails.v0": "Identify Contrails", + "hr-ml-agent-bench.imdb.v0": "IMDb", + "hr-ml-agent-bench.parkinsons-disease.v0": "Parkinson's Disease", + "hr-ml-agent-bench.llama-inference.v0": "Llama Inference", + "hr-ml-agent-bench.ogbn-arxiv.v0": "OGBN-ArXiv", + "hr-ml-agent-bench.spaceship-titanic.v0": "Spaceship Titanic", + "hr-ml-agent-bench.vectorization.v0": "Vectorization", + "hr-ml-agent-bench.ant.gpu.v0": "Ant", + "hr-ml-agent-bench.bipedal-walker.v0": "Bipedal Walker", + "hr-ml-agent-bench.cartpole.v0": "Cart Pole", + "hr-ml-agent-bench.humanoid.gpu.v0": "Humanoid", + "hr-ml-agent-bench.inverted-pendulum.v0": "Inverted Pendulum", + "hr-ml-agent-bench.pong.gpu.v0": "Pong", + "hr-ml-agent-bench.pusher.v0": "Pusher", +} + +df_non_raw["solver"] = df_non_raw["solver_id"].map(model_mapping) +df_non_raw["task"] = df_non_raw["task_id"].map(task_mapping) + +df_non_raw + +# %% + +task_categories = { + "Canonical Tasks": [ + "CIFAR-10", + "IMDb", + "OGBN-ArXiv", + ], + "Kaggle (Classic)": [ + "House Prices", + "Spaceship Titanic", + ], + "Kaggle (Modern)": [ + "Feedback", + "Parkinson's Disease", + ], + "Improve Code": [ + "Llama Inference", + "Vectorization", + ], + "Reinforcement Learning": [ + "Ant", + "Bipedal Walker", + "Cart Pole", + "Humanoid", + "Inverted Pendulum", + "Pong", + "Pusher", + ], +} + +task_to_category = {task: category for category, tasks in task_categories.items() for task in tasks} + +task_to_category + +# %% + +category_colors = { + "Canonical Tasks": "skyblue", + "Kaggle (Classic)": "lightgreen", + "Kaggle (Modern)": "lightcoral", + "Improve Code": "lightgoldenrodyellow", + "Reinforcement Learning": "violet", +} + +# %% + +df_only_direct = df_non_raw[df_non_raw["solver_id"].str.contains("direct|human|naive", regex=True)] +df_only_direct + +# %% + +rl_report_summary_table = report_summary_table.copy() + +rl_report_summary_table["task"] = 
rl_report_summary_table["task_id"].map(task_mapping) +rl_report_summary_table["category"] = rl_report_summary_table["task"].map(task_to_category) + +rl_report_summary_table = rl_report_summary_table[ + rl_report_summary_table["category"] == "Reinforcement Learning" +] + +rl_report_summary_table = rl_report_summary_table.sort_values(by=["category", "task", "solver_id"]) + +rl_report_summary_table + +# %% + +grouped = df_non_raw.groupby(["task", "solver"]) +score_mean = grouped["score"].mean().rename("score") +score_sem = grouped["score"].sem().rename("sem") +plot_df = pd.concat([score_mean, score_sem], axis=1).reset_index() + +plot_df + +# %% + +plot_df["category"] = plot_df["task"].map(task_to_category) +plot_df = plot_df.sort_values(by=["category", "task", "solver"]) + +plot_df + +# %% + +palette = { + # OpenAI + "GPT-3.5 (huang-inspired)": "#0055ff", + "GPT-3.5 (direct)": "#78a5ff", + "GPT-4 (huang-inspired)": "#fc5e03", + "GPT-4 (direct)": "#ff9c63", + # Google + "Gemini Pro": "#ff00ff", + # Meta + "LLaMA-2 Chat (13B)": "#ff0000", + "LLaMA-2 Chat (70B)": "#ff7f7f", + # Mistral AI + "Mixtral-8x7B Instruct": "#00ff00", + # Baselines + "Human": "#00a318", + "Naive Baseline": "#c90022", +} + +plt.figure(figsize=(10, 8)) + +ax = sns.barplot( + data=plot_df, + x="task", + y="score", + hue="solver", + errorbar=None, + palette=palette, + zorder=3, +) + +num_hue_levels = len(plot_df["solver"].unique()) +bar_group_width = ax.patches[0].get_width() * num_hue_levels + +for i, task in enumerate(plot_df["task"].unique()): + task_data = plot_df[plot_df["task"] == task] + + positions = np.linspace( + start=i - bar_group_width / 2 + bar_group_width / (2 * num_hue_levels), + stop=i + bar_group_width / 2 - bar_group_width / (2 * num_hue_levels), + num=num_hue_levels, + ) + + plt.errorbar( + x=positions, + y=task_data["score"], + yerr=task_data["sem"], + fmt="none", # This removes the line connecting the error bars + capsize=5, # Sets the width of the error bar caps + color="black", # Error bar color + zorder=3, # Ensure error bars are above the bars but below the legend + linewidth=1.5, # Width of the error bar lines + ) + +solvers_legend = ax.legend(title="Solvers", loc="upper left", bbox_to_anchor=(1, 1)) + +plt.gca().add_artist(solvers_legend) + +naive_baseline = plt.axhline( + y=-0.001, + color="#c90022", + linestyle="--", + linewidth=2, + zorder=2, + alpha=0.5, +) + +human_baseline = plt.axhline( + y=1, + color="#00a318", + linestyle="--", + linewidth=2, + zorder=2, +) + +naive_baseline_legend = mlines.Line2D( + [], + [], + color="#c90022", + linestyle="--", + label="Naive Solution", +) + +human_baseline_legend = mlines.Line2D( + [], + [], + color="#00a318", + linestyle="--", + label="Human", +) + +ax.legend( + handles=[ + naive_baseline_legend, + human_baseline_legend, + ], + title="Baselines", + loc="upper left", + bbox_to_anchor=(1, 0.2), +) + +# Feature flag to toggle background colouring +if True: + for category in task_categories: + task_categories[category] = [ + task for task in task_categories[category] if task in plot_df["task"].values + ] + + task_positions = {task: i for i, task in enumerate(plot_df["task"].unique())} + + for category, color in category_colors.items(): + tasks_in_category = task_categories[category] + + if not tasks_in_category: + continue + + positions = [task_positions[task] for task in tasks_in_category] + min_pos, max_pos = min(positions), max(positions) + + ax.axvspan(min_pos - 0.5, max_pos + 0.5, color=color, alpha=0.2) + + width = 13 + + if category == "Improve 
Code": + width = 10 + + wrapped_label = textwrap.fill(category, width=width) + + plt.text( + x=(min_pos + max_pos) / 2, + y=ax.get_ylim()[1] * 1.00, + s=wrapped_label, + ha="center", + va="center", + fontsize=10, + ) + +plt.xticks(rotation=90) +plt.yticks([x / 10.0 for x in range(-1, 12, 1)]) +plt.xlabel("") +plt.ylabel("Human-relative score") +plt.title("Human-relative score for Model, Human and Naive Baseline") +plt.grid(True, zorder=0) + +plt.savefig("bar.png", bbox_inches="tight", pad_inches=1) + +plt.show() + +# %% diff --git a/evals/elsuite/hr_ml_agent_bench/scripts/run_experiments.py b/evals/elsuite/hr_ml_agent_bench/scripts/run_experiments.py new file mode 100644 index 0000000000..ba591847f4 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/scripts/run_experiments.py @@ -0,0 +1,88 @@ +""" +You can do: + +```bash +nohup python run_experiments.py > output.log 2>&1 & +``` + +which will run all experiments in the background and save the output to `output.log`. +""" + +import logging +import os +import subprocess +from concurrent.futures import ProcessPoolExecutor +from pathlib import Path + +N_SEEDS = 1 +N_PROCESSES = 10 + +logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s]: %(message)s") +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + +commit_hash = os.popen("git rev-parse HEAD").read().strip() +log_dir = Path(__file__).parent / "logs" / commit_hash +out_dir = Path(__file__).parent / "outputs" + +solvers = [ + "hr_ml_agent_bench/baseline/gpt-3.5-turbo-16k", + "hr_ml_agent_bench/baseline/gpt-4-1106-preview", + "generation/direct/gpt-3.5-turbo-16k", + "generation/direct/gpt-4-1106-preview", + "generation/direct/gemini-pro", + "generation/direct/llama-2-70b-chat", + "generation/direct/mixtral-8x7b-instruct", +] + +tasks = [ + # v1 + "hr-ml-agent-bench.cifar10", + "hr-ml-agent-bench.house-price", + "hr-ml-agent-bench.parkinsons-disease", + "hr-ml-agent-bench.spaceship-titanic", + "hr-ml-agent-bench.vectorization", + "hr-ml-agent-bench.ogbn-arxiv", + "hr-ml-agent-bench.feedback", + "hr-ml-agent-bench.imdb", + # v2 + "hr-ml-agent-bench.ant", + "hr-ml-agent-bench.bipedal-walker", + "hr-ml-agent-bench.cartpole", + "hr-ml-agent-bench.humanoid", + "hr-ml-agent-bench.inverted-pendulum", + "hr-ml-agent-bench.pong", + "hr-ml-agent-bench.pusher", +] + +logger.info(f"Writing experiments to {out_dir}...") + +if not out_dir.exists(): + out_dir.mkdir() + + +def run_experiment(solver: str, task: str, seed: int) -> None: + escaped_solver = solver.replace("/", "_") + log_file = log_dir / task / escaped_solver / f"{seed}.log" + + if log_file.exists(): + return logger.info(f"Skipping {log_file} since it already exists.") + + if not log_file.parent.exists(): + log_file.parent.mkdir(parents=True) + + subprocess.run( + f"EVALS_SEQUENTIAL=1 oaieval {solver} {task} --record_path {log_file} --extra_eval_args seed={seed}", + shell=True, + ) + + +with ProcessPoolExecutor(max_workers=N_PROCESSES) as executor: + for seed in range(N_SEEDS): + for solver in solvers: + for task in tasks: + logger.info(f"Running experiment for {solver} on {task} with seed {seed}...") + + executor.submit(run_experiment, solver, task, seed) + +logger.info(f"Finished writing experiments to {log_dir}!") diff --git a/evals/elsuite/hr_ml_agent_bench/solvers/baseline.py b/evals/elsuite/hr_ml_agent_bench/solvers/baseline.py new file mode 100644 index 0000000000..a1e6425f3a --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/solvers/baseline.py @@ -0,0 +1,115 @@ +"""This is a simple agent inspired 
by MLAgentBench's `SimpleActionAgent`""" + +from functools import cache +from logging import getLogger + +import tiktoken + +from evals.registry import Registry, n_ctx_from_model_name +from evals.solvers.openai_solver import OpenAISolver +from evals.solvers.solver import Solver, SolverResult +from evals.task_state import Message, TaskState + +logger = getLogger(__name__) + + +class SimpleActionAgent(Solver): + def __init__(self, registry: Registry, completion_fn_kwargs: dict): + self.completion_fn = OpenAISolver(**completion_fn_kwargs) + self.ctx_len = n_ctx_from_model_name(self.completion_fn.model) + + assert ( + self.ctx_len is not None + ), f"Expected the `ctx_len` to be set for model `{self.completion_fn.model}`." + + def _solve(self, task_state: TaskState, **kwargs) -> SolverResult: + prompt = [ + Message(role="system", content=task_state.task_description), + Message( + role="system", + content="Before deciding which action to take, you should " + "reason aloud about what you are currently doing, what " + "actions to perform and why.", + ), + Message( + role="system", + content="You should also check if GPUs are available and " + "try your best to use them, if appropriate.", + ), + Message(role="system", content="Now let's start!"), + *task_state.messages, + ] + + step = len(task_state.current_state.history_steps) + n_messages_to_include = task_state.current_state.max_steps_in_context + + for idx in range(max(0, step - n_messages_to_include), step): + action_and_input = "".join( + [ + k + ": " + task_state.current_state.history_steps[idx]["action"][k] + for k in ["Action", "Action Input"] + ] + ) + + reminder = f"This is turn number {idx+1} of {task_state.current_state.max_steps}. Remember to keep improving your solution until the turn limit is up, don't finish early!" + + observation = task_state.current_state.history_steps[idx]["observation"] + + encoder = self.get_encoder() + max_tokens_in_observation = min(self.ctx_len // 8, 2**12) + n_tokens_in_observation = len(encoder.encode(observation)) + + if n_tokens_in_observation >= max_tokens_in_observation: + logger.info( + f"Truncating observation. {max_tokens_in_observation=} {n_tokens_in_observation=}" + ) + + chunk_size = max_tokens_in_observation // 2 + first_chunk = observation[:chunk_size] + last_chunk = observation[-chunk_size:] + new_observation = f"{first_chunk}\n\n...\n\n{last_chunk}" + + prompt += [ + Message(role="system", content=reminder), + Message(role="assistant", content=action_and_input), + Message( + role="system", + content="The observation has been truncated since it exceeded " + "your context length. The original observation contained " + f"{len(observation)} character(s). You're viewing the first and " + f"last {chunk_size} character(s) of the observation, which are " + "separated by an ellipsis.", + ), + Message(role="system", content=f"Observation:\n```{new_observation}```"), + ] + + continue + + prompt += [ + Message(role="system", content=reminder), + Message(role="assistant", content=action_and_input), + Message(role="system", content=f"Observation:\n```{observation}```"), + ] + + prompt += [ + Message( + role="system", + content="Remember to keep improving your solution until the turn limit is up, don't finish early!", + ) + ] + + result = self.completion_fn([m.to_dict() for m in prompt]) + completions = result.get_completions() + + assert len(completions) == 1, f"Expected 1 completion. Got {len(completions)}." 
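+        # The completion is expected to contain an action block in the format parsed
+        # by `get_action` in actions.py (see tests/test_actions.py), e.g.:
+        #
+        #   Action: Edit Script (AI)
+        #   Action Input: {
+        #       "script_name": "improved_train.py",
+        #       "edit_instruction": "Correct the line that initializes the q_table.",
+        #       "save_name": "improved_train.py"
+        #   }
+        #
+        # Free-form reasoning before or after the block is tolerated by the parser.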
+ + completion = completions[0] + + return SolverResult(output=completion) + + @cache + def get_encoder(self): + try: + return tiktoken.encoding_for_model(self.completion_fn.model) + except ValueError: + return tiktoken.encoding_for_model("gpt-4") diff --git a/evals/elsuite/hr_ml_agent_bench/tests/test_actions.py b/evals/elsuite/hr_ml_agent_bench/tests/test_actions.py new file mode 100644 index 0000000000..0fe9a834a4 --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/tests/test_actions.py @@ -0,0 +1,185 @@ +import pytest + +from evals.elsuite.hr_ml_agent_bench.actions import ( + ACTION_SPACE, + get_action, + is_valid_action, + make_action_string, +) +from evals.elsuite.hr_ml_agent_bench.schema import Action + + +def test_make_action_string(): + # Given + name = "name" + args = {"arg": "value"} + expected = """ +Action: name +Action Input: { + "arg": "value" +}""".strip() + + # When + actual = make_action_string(name, args) + + # Then + assert actual == expected, f"Expected: {expected}, Actual: {actual}" + + +def test_empty_string(): + # Given + input_str = "" + + # When + actual = get_action(input_str) + + # Then + assert actual is None + + +def test_missing_curly_braces(): + # Given + input_str = """ +Action: MissingBraces +Action Input: + "arg1": "value1" +""" + args_str = input_str.strip().split("Action Input: ")[1].strip() + expected = Action("MissingBraces", args_str) + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +def test_args_on_multiple_lines(): + # Given + input_str = """ +Action: Valid Name +Action Input: { + "arg1": "value1", + "arg2": "value2" +} +""" + expected = Action("Valid Name", {"arg1": "value1", "arg2": "value2"}) + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +def test_args_on_single_line(): + # Given + input_str = """ +Action: Valid Name +Action Input: {"arg1": "value1", "arg2": "value2"} +""" + expected = Action("Valid Name", {"arg1": "value1", "arg2": "value2"}) + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +def test_special_characters_in_name(): + # Given + input_str = """ +Action: Special!@#Name +Action Input: { + "arg1": "value1" +} +""" + expected = Action("Special!@#Name", {"arg1": "value1"}) + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +def test_invalid_arguments(): + # Given + input_str = """ +Action: Invalid Arguments +Action Input: "some invalid json string" +""" + expected = Action("Invalid Arguments", "some invalid json string") + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +def test_surrounded_by_additional_text(): + # Given + input_str = """ +Some thoughts about which action to take. + +Action: Edit Script (AI) +Action Input: { + "script_name": "improved_train.py", + "edit_instruction": "Correct the line that initializes the q_table.", + "save_name": "improved_train.py" +} + +Please execute that action. 
+""" + expected = Action( + name="Edit Script (AI)", + args={ + "script_name": "improved_train.py", + "edit_instruction": "Correct the line that initializes the q_table.", + "save_name": "improved_train.py", + }, + ) + + # When + actual = get_action(input_str) + + # Then + assert actual.name == expected.name + assert actual.args == expected.args + + +@pytest.mark.parametrize("action_info", ACTION_SPACE) +def test_is_valid_action_with_correct_args(action_info): + action = Action( + name=action_info.name, + args={k: "test_value" for k in action_info.usage.keys()}, + ) + + assert is_valid_action(action) + + +@pytest.mark.parametrize("action_info", ACTION_SPACE) +def test_is_valid_action_with_incorrect_args(action_info): + incorrect_args = {k + "_wrong": "test_value" for k in action_info.usage.keys()} + action = Action(name=action_info.name, args=incorrect_args) + + assert not is_valid_action(action) + + +@pytest.mark.parametrize("action_info", ACTION_SPACE) +def test_is_valid_action_with_missing_args(action_info): + if action_info.usage.keys(): + new_keys = list(action_info.usage.keys())[:-1] # remove one arg if possible + missing_args = {k: "test_value" for k in new_keys} + action = Action(name=action_info.name, args=missing_args) + + assert not is_valid_action(action) + else: + pytest.skip("Action does not have any args to test for missing scenario.") diff --git a/evals/elsuite/hr_ml_agent_bench/utils.py b/evals/elsuite/hr_ml_agent_bench/utils.py new file mode 100644 index 0000000000..c37d8b1c4f --- /dev/null +++ b/evals/elsuite/hr_ml_agent_bench/utils.py @@ -0,0 +1,180 @@ +import logging +import os +import subprocess +from pathlib import Path +from shutil import copyfile +from subprocess import CalledProcessError +from tempfile import TemporaryDirectory +from typing import Callable, Optional + +import torch +from openai import OpenAI + +from evals.solvers.solver import Solver +from evals.task_state import TaskState + +client = OpenAI() +logger = logging.getLogger(__name__) + + +def complete_text(prompt: str, solver: Solver, **kwargs) -> str: + """Complete text using the given solver.""" + + assert isinstance(solver, Solver) + + prompt = TaskState(task_description=prompt) + response = solver(prompt, **kwargs) + + return response.output + + +def get_root_dir() -> Path: + """Returns the root directory of the repository.""" + + return get_parent_dir("evals") + + +def get_code_dir() -> Path: + """Returns the `evals/elsuite/hr_ml_agent_bench` directory.""" + + return get_root_dir() / "elsuite" / "hr_ml_agent_bench" + + +def get_data_dir() -> Path: + """Returns the `evals/registry/data/hr_ml_agent_bench` directory.""" + + return get_root_dir() / "registry" / "data" / "hr_ml_agent_bench" + + +def get_parent_dir(name: str, max_depth: int = 64) -> Path: + """Returns the parent directory with the given `name`. 
Only searches up to `max_depth` levels.""" + + curdir = Path(__file__).parent + + for _ in range(max_depth): + if curdir.name == name: + return curdir + + curdir = curdir.parent + + raise ValueError(f"Couldn't find a parent directory of '{curdir}' named '{name}'!") + + +def is_gpu_available() -> bool: + """Returns `True` iff a GPU is available.""" + + return torch.cuda.is_available() + + +def get_gpu_with_most_available_memory() -> Optional[int]: + """Returns the index of the GPU with the most available memory.""" + try: + smi_output = subprocess.check_output( + [ + "nvidia-smi", + "--query-gpu=index,memory.total,memory.free", + "--format=csv,nounits,noheader", + ], + encoding="utf-8", + ) + except (CalledProcessError, FileNotFoundError): + return None + + max_memory = 0 + gpu_with_max_memory = 0 + + for line in smi_output.strip().split("\n"): + gpu_index, total_memory, free_memory = line.split(", ") + free_memory = int(free_memory) + + if free_memory > max_memory: + max_memory = free_memory + gpu_with_max_memory = gpu_index + + return gpu_with_max_memory + + +def get_baseline_score( + baseline_script: Path, + score_fn: Callable[[Path], float], + other_files: Optional[list[Path]] = None, + save_checkpoints: bool = True, +) -> float: + """ + Executes the `baseline_script` in a temporary directory and returns its score + using the provided `score_fn`. Optionally, additional files can be provided + in `other_files` to be copied to the temporary directory. Checkpoints can also + be saved in the same directory of the `baseline_script` if `save_checkpoints` + is `True` to avoid re-running computationally expensive baseline scripts. + """ + + assert baseline_script.exists(), f"Expected to find the naive baseline at: {baseline_script}" + + logger.info(f"Executing script: {baseline_script}") + + if other_files is None: + other_files = [] + + for other_file in other_files: + assert other_file.exists(), f"Expected to find the file at: {other_file}" + + with TemporaryDirectory() as tmp_dir: + tmp_dir = Path(tmp_dir) + + copyfile( + src=baseline_script, + dst=tmp_dir / baseline_script.name, + ) + + for other_file in other_files: + copyfile( + src=other_file, + dst=tmp_dir / other_file.name, + ) + + cmd = ["python", str(baseline_script.name)] + env = os.environ.copy() + device = get_gpu_with_most_available_memory() + + if device is not None: + env["CUDA_VISIBLE_DEVICES"] = device + + with subprocess.Popen( + args=cmd, + cwd=tmp_dir, + env=env, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, # combine stderr and stdout + shell=False, + text=True, + ) as process: + for line in process.stdout: + logging.info(line.strip()) + + # Wait for the process to finish, otherwise the return code + # may be `None` instead of an integer. + process.wait() + + assert process.returncode == 0, ( + f"Expected the baseline script {baseline_script} to " + f"execute successfully, but a return code of: " + f"{process.returncode}." 
+ ) + + if save_checkpoints: + for file in tmp_dir.glob("*.checkpoint"): + dst = baseline_script.parent / file.name + + if dst.exists(): + continue # don't overwrite existing files + + logger.info(f"Saving checkpoint for {baseline_script} to {dst}") + + copyfile( + src=file, + dst=dst, + ) + + score = score_fn(tmp_dir) + + return score diff --git a/evals/registry/data/hr_ml_agent_bench/.gitattributes b/evals/registry/data/hr_ml_agent_bench/.gitattributes new file mode 100644 index 0000000000..cba22df851 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/.gitattributes @@ -0,0 +1,2 @@ +*.csv filter=lfs diff=lfs merge=lfs -text +*.png filter=lfs diff=lfs merge=lfs -text diff --git a/evals/registry/data/hr_ml_agent_bench/.gitignore b/evals/registry/data/hr_ml_agent_bench/.gitignore new file mode 100644 index 0000000000..e1e3600f63 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/.gitignore @@ -0,0 +1,2 @@ +fathomnet/dataset +identify_contrails/dataset diff --git a/evals/registry/data/hr_ml_agent_bench/LICENSE b/evals/registry/data/hr_ml_agent_bench/LICENSE new file mode 100644 index 0000000000..4221b558ac --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/LICENSE @@ -0,0 +1,27 @@ +ogbn-arxiv: +ODC-BY License: https://opendatacommons.org/licenses/by/ +Source: https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv + +House Prices - Advanced Regression Techniques: +MIT License: https://opensource.org/licenses/MIT +Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data + +Spaceship Titanic: +CC BY 4.0 Deed License: https://creativecommons.org/licenses/by/4.0/ +Source: https://www.kaggle.com/competitions/spaceship-titanic/data + +Feedback Prize: +MIT License: https://opensource.org/licenses/MIT +Source: https://www.kaggle.com/competitions/feedback-prize-english-language-learning/rules#7-competition-data + +Google Research - Identify Contrails to Reduce Global Warming: +CC BY 4.0 License: https://creativecommons.org/licenses/by/4.0/ +Source: https://www.kaggle.com/competitions/google-research-identify-contrails-reduce-global-warming/rules#7-competition-data + +BabyLM: +MIT License: https://opensource.org/licenses/MIT +Source: https://github.com/babylm/evaluation-pipeline/blob/main/LICENSE.md + +CLRS: +Apache License: https://www.apache.org/licenses/LICENSE-2.0 +Source: https://github.com/google-deepmind/clrs/blob/master/LICENSE diff --git a/evals/registry/data/hr_ml_agent_bench/ant/cpu.jsonl b/evals/registry/data/hr_ml_agent_bench/ant/cpu.jsonl new file mode 100644 index 0000000000..ff737db9aa --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/ant/cpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:4bdd2db5c066519fe37c85bc52e652e2f05f811d1ed5568dc1d2edb9449df1a6 +size 940 diff --git a/evals/registry/data/hr_ml_agent_bench/ant/gpu.jsonl b/evals/registry/data/hr_ml_agent_bench/ant/gpu.jsonl new file mode 100644 index 0000000000..dd2227e1f1 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/ant/gpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:cc89b4dfac2231142c07f68137cbbe567cc1293dd76eab284998494ff7051375 +size 959 diff --git a/evals/registry/data/hr_ml_agent_bench/bipedal-walker.jsonl b/evals/registry/data/hr_ml_agent_bench/bipedal-walker.jsonl new file mode 100644 index 0000000000..d5cf116c84 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/bipedal-walker.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid 
sha256:2372a3a5b16ef0b5c83aa56e6457dd6a56d3b95b621a5c8f58b3bdc3897f2fe3 +size 847 diff --git a/evals/registry/data/hr_ml_agent_bench/cartpole.jsonl b/evals/registry/data/hr_ml_agent_bench/cartpole.jsonl new file mode 100644 index 0000000000..6bc54a47fc --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/cartpole.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:91c97e340d05dfd1c935fe870bdf5a90276d0c4ad37d9291287d7112e0380847 +size 828 diff --git a/evals/registry/data/hr_ml_agent_bench/cifar10.jsonl b/evals/registry/data/hr_ml_agent_bench/cifar10.jsonl new file mode 100644 index 0000000000..3fb24424e8 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/cifar10.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e726d852996a9ee7886ab0b4a165045138dd74f92010487856d3bea5f3e8a1e +size 428 diff --git a/evals/registry/data/hr_ml_agent_bench/feedback/dataset/train.csv b/evals/registry/data/hr_ml_agent_bench/feedback/dataset/train.csv new file mode 100644 index 0000000000..0d5659033c --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/feedback/dataset/train.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a61d15d4880795da44d948d9cff7f037d39f4272c6d66c112c5547bc3990b569 +size 9289725 diff --git a/evals/registry/data/hr_ml_agent_bench/feedback/feedback.jsonl b/evals/registry/data/hr_ml_agent_bench/feedback/feedback.jsonl new file mode 100644 index 0000000000..21ada6b4cd --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/feedback/feedback.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:25c36b5adff077da5dc456373531c61ed5483da44d75269e5836a6990d81f13d +size 557 diff --git a/evals/registry/data/hr_ml_agent_bench/house_price/dataset/train.csv b/evals/registry/data/hr_ml_agent_bench/house_price/dataset/train.csv new file mode 100644 index 0000000000..a1868bc06a --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/house_price/dataset/train.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e18addf81e5e4d347cc17ee6075bbe4a42b7fa26b9e5b063e8f692a5f929d41 +size 460676 diff --git a/evals/registry/data/hr_ml_agent_bench/house_price/house-price.jsonl b/evals/registry/data/hr_ml_agent_bench/house_price/house-price.jsonl new file mode 100644 index 0000000000..bd651cebfe --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/house_price/house-price.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ca15a466217fb1ffec9618d8dc0c448053d3bb5fcc832d8a6f4f70b9d74fd02c +size 548 diff --git a/evals/registry/data/hr_ml_agent_bench/humanoid/cpu.jsonl b/evals/registry/data/hr_ml_agent_bench/humanoid/cpu.jsonl new file mode 100644 index 0000000000..59b9d349b1 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/humanoid/cpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c1a8dea513eb867ef6542c83f17c765a9716ed8eee2dcc820193a78fe1f6f5f8 +size 965 diff --git a/evals/registry/data/hr_ml_agent_bench/humanoid/gpu.jsonl b/evals/registry/data/hr_ml_agent_bench/humanoid/gpu.jsonl new file mode 100644 index 0000000000..7d54111b88 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/humanoid/gpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0eca72013e1a1f6d37d4d693c82dfa210ba5c04f9fcfc3179dcc5b13cfc30895 +size 983 diff --git a/evals/registry/data/hr_ml_agent_bench/imdb.jsonl b/evals/registry/data/hr_ml_agent_bench/imdb.jsonl new file mode 100644 index 0000000000..4fc9fab134 
--- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/imdb.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:49a65ab1d5b160810f808348a778c71d16b4441c3ae4d4c0ae346989ee4b2469 +size 731 diff --git a/evals/registry/data/hr_ml_agent_bench/inverted-pendulum.jsonl b/evals/registry/data/hr_ml_agent_bench/inverted-pendulum.jsonl new file mode 100644 index 0000000000..121457df8c --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/inverted-pendulum.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a060229b650638cabbda5c751d3387e4a5b06b620dab24e06d230803fbb0a8b1 +size 838 diff --git a/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/dataset/baseline.csv b/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/dataset/baseline.csv new file mode 100644 index 0000000000..a61a30273b --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/dataset/baseline.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:6397a537d2cec0b23c3424615ac4b5c38ee82b0750982763e5ddc4914c741e28 +size 145811 diff --git a/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/ogbn-arxiv.jsonl b/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/ogbn-arxiv.jsonl new file mode 100644 index 0000000000..d6331da3d0 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/ogbn_arxiv/ogbn-arxiv.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b1ddf1a837f161438eebd7591e8720b0b40f25ec432a16e572fc1511af5af172 +size 398 diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/public_timeseries_testing_util.py b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/public_timeseries_testing_util.py new file mode 100644 index 0000000000..5c4bbe7e6d --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/public_timeseries_testing_util.py @@ -0,0 +1,93 @@ +""" +An unlocked version of the timeseries API intended for testing alternate inputs. +Mirrors the production timeseries API in the crucial respects, but won't be as fast. + +ONLY works afer the first three variables in MockAPI.__init__ are populated. +""" + +from typing import Tuple + +import pandas as pd + + +class MockApi: + def __init__(self): + """ + YOU MUST UPDATE THE FIRST THREE LINES of this method. + They've been intentionally been commented out and left in an invalid state. + + Variables to set: + input_paths: a list of two or more paths to the csv files to be served + group_id_column: the column that identifies which groups of rows the API should serve. + A call to iter_test serves all rows of all dataframes with the current group ID value. + export_group_id_column: if true, the dataframes iter_test serves will include the group_id_column values. + """ + # TODO: uncomment and fill in the following three variables + # self.input_paths: Sequence[str] = + # self.group_id_column: str = + # self.export_group_id_column: bool = + + # iter_test is only designed to support at least two dataframes, such as test and sample_submission + assert len(self.input_paths) >= 2 + + self._status = "initialized" + self.predictions = [] + + def iter_test(self) -> Tuple[pd.DataFrame]: + """ + Loads all of the dataframes specified in self.input_paths, + then yields all rows in those dataframes that equal the current self.group_id_column value. 
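+        Yields one tuple of dataframes per group ID; `predict()` must be called on each
+        yielded group before iteration will advance to the next group.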
+ """ + if self._status != "initialized": + + raise Exception("WARNING: the real API can only iterate over `iter_test()` once.") + + dataframes = [] + for pth in self.input_paths: + dataframes.append(pd.read_csv(pth, low_memory=False)) + group_order = dataframes[0][self.group_id_column].drop_duplicates().tolist() + dataframes = [df.set_index(self.group_id_column) for df in dataframes] + + for group_id in group_order: + self._status = "prediction_needed" + current_data = [] + for df in dataframes: + cur_df = df.loc[group_id].copy() + # returning single line dataframes from df.loc requires special handling + if not isinstance(cur_df, pd.DataFrame): + cur_df = pd.DataFrame( + {a: b for a, b in zip(cur_df.index.values, cur_df.values)}, index=[group_id] + ) + cur_df = cur_df.index.rename(self.group_id_column) + cur_df = cur_df.reset_index(drop=not (self.export_group_id_column)) + current_data.append(cur_df) + yield tuple(current_data) + + while self._status != "prediction_received": + print( + "You must call `predict()` successfully before you can continue with `iter_test()`", + flush=True, + ) + yield None + + with open("submission.csv", "w") as f_open: + pd.concat(self.predictions).to_csv(f_open, index=False) + self._status = "finished" + + def predict(self, user_predictions: pd.DataFrame): + """ + Accepts and stores the user's predictions and unlocks iter_test once that is done + """ + if self._status == "finished": + raise Exception("You have already made predictions for the full test set.") + if self._status != "prediction_needed": + raise Exception("You must get the next test sample from `iter_test()` first.") + if not isinstance(user_predictions, pd.DataFrame): + raise Exception("You must provide a DataFrame.") + + self.predictions.append(user_predictions) + self._status = "prediction_received" + + +def make_env(): + return MockApi() diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/supplemental_clinical_data.csv b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/supplemental_clinical_data.csv new file mode 100644 index 0000000000..9818903904 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/supplemental_clinical_data.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c55db4a4a5ce31188621c96f02a6f81bc2861fd5d20b773ff8270efb9a4e7905 +size 75907 diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_clinical_data.csv b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_clinical_data.csv new file mode 100644 index 0000000000..1fa8fc54b9 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_clinical_data.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0897237a7c943f6afda9e083cf534eb0b506e5e00ffedad9bdc0c69248b86722 +size 74055 diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_peptides.csv b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_peptides.csv new file mode 100644 index 0000000000..70e2653508 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_peptides.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3d1eba59f7def39fef4793a6e120e3ec84ab27c8911ae343fdcbf30a7324e301 +size 51376223 diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_proteins.csv 
b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_proteins.csv new file mode 100644 index 0000000000..3ce94367ee --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/dataset/train_proteins.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:e6ec158bf3013338989806897276485ab930787d691f1eb1863c8da18e56f32d +size 7659148 diff --git a/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/parkinsons-disease.jsonl b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/parkinsons-disease.jsonl new file mode 100644 index 0000000000..583df30cbf --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/parkinsons_disease/parkinsons-disease.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d505596d6b9ce4c958e05f48c1c3870097f6493179466ee57b499f4a3ec59c68 +size 665 diff --git a/evals/registry/data/hr_ml_agent_bench/pong/cpu.jsonl b/evals/registry/data/hr_ml_agent_bench/pong/cpu.jsonl new file mode 100644 index 0000000000..d7086f3247 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/pong/cpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:20a146843ff30964d7f7b57af5a119621dcc4766aaf7b460c17e851a7e92bb34 +size 826 diff --git a/evals/registry/data/hr_ml_agent_bench/pong/gpu.jsonl b/evals/registry/data/hr_ml_agent_bench/pong/gpu.jsonl new file mode 100644 index 0000000000..45b2da5086 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/pong/gpu.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5ab1e8aaae30cdbf232d1a4c386339166ea9238bce6ebf8de52dab0d7117b4f4 +size 844 diff --git a/evals/registry/data/hr_ml_agent_bench/pusher.jsonl b/evals/registry/data/hr_ml_agent_bench/pusher.jsonl new file mode 100644 index 0000000000..2ca94f3961 --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/pusher.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:af08159e6a6c3e861a4b78bac598283eb90d15c6448929624a9a4d2d245df449 +size 835 diff --git a/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/dataset/train.csv b/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/dataset/train.csv new file mode 100644 index 0000000000..8b1aa037ce --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/dataset/train.csv @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:17336d553f49ebdf6ecb266d2b5d3746e5dd308445f7c7864141c4f28d2a88d0 +size 805421 diff --git a/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/spaceship-titanic.jsonl b/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/spaceship-titanic.jsonl new file mode 100644 index 0000000000..1642baa0be --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/spaceship_titanic/spaceship-titanic.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:d4c691bb20e58ade4023587e0c9d482a303f1918e68e5a830780043e31895661 +size 548 diff --git a/evals/registry/data/hr_ml_agent_bench/vectorization.jsonl b/evals/registry/data/hr_ml_agent_bench/vectorization.jsonl new file mode 100644 index 0000000000..bb9030e50d --- /dev/null +++ b/evals/registry/data/hr_ml_agent_bench/vectorization.jsonl @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:5951888a170871868916ed6457332bdb74ec935d4ffae3d522308a34c88427da +size 665 diff --git a/evals/registry/eval_sets/hr-ml-agent-bench.yaml b/evals/registry/eval_sets/hr-ml-agent-bench.yaml new file mode 100644 index 0000000000..15ee8a70d2 --- /dev/null +++ 
b/evals/registry/eval_sets/hr-ml-agent-bench.yaml @@ -0,0 +1,35 @@ +hr-ml-agent-bench: + evals: + - hr-ml-agent-bench.ant + - hr-ml-agent-bench.bipedal-walker + - hr-ml-agent-bench.cartpole + - hr-ml-agent-bench.cifar10 + - hr-ml-agent-bench.feedback + - hr-ml-agent-bench.house-price + - hr-ml-agent-bench.humanoid + - hr-ml-agent-bench.imdb + - hr-ml-agent-bench.inverted-pendulum + - hr-ml-agent-bench.ogbn-arxiv + - hr-ml-agent-bench.parkinsons-disease + - hr-ml-agent-bench.pong + - hr-ml-agent-bench.pusher + - hr-ml-agent-bench.spaceship-titanic + - hr-ml-agent-bench.vectorization + +hr-ml-agent-bench-cpu: + evals: + - hr-ml-agent-bench.ant.cpu.v0 + - hr-ml-agent-bench.bipedal-walker + - hr-ml-agent-bench.cartpole + - hr-ml-agent-bench.cifar10 + - hr-ml-agent-bench.feedback + - hr-ml-agent-bench.house-price + - hr-ml-agent-bench.humanoid.cpu.v0 + - hr-ml-agent-bench.imdb + - hr-ml-agent-bench.inverted-pendulum + - hr-ml-agent-bench.ogbn-arxiv + - hr-ml-agent-bench.parkinsons-disease + - hr-ml-agent-bench.pong.cpu.v0 + - hr-ml-agent-bench.pusher + - hr-ml-agent-bench.spaceship-titanic + - hr-ml-agent-bench.vectorization diff --git a/evals/registry/evals/hr-ml-agent-bench.yaml b/evals/registry/evals/hr-ml-agent-bench.yaml new file mode 100644 index 0000000000..ba181b5566 --- /dev/null +++ b/evals/registry/evals/hr-ml-agent-bench.yaml @@ -0,0 +1,152 @@ +hr-ml-agent-bench.test: + id: hr-ml-agent-bench.vectorization.v0 + description: Runs a lightweight task end-to-end which is useful for testing. + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] + +hr-ml-agent-bench.ant: + id: hr-ml-agent-bench.ant.gpu.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.ant.cpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/ant/cpu.jsonl +hr-ml-agent-bench.ant.gpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/ant/gpu.jsonl + +hr-ml-agent-bench.cifar10: + id: hr-ml-agent-bench.cifar10.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.cifar10.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/cifar10.jsonl + +hr-ml-agent-bench.bipedal-walker: + id: hr-ml-agent-bench.bipedal-walker.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.bipedal-walker.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/bipedal-walker.jsonl + +hr-ml-agent-bench.cartpole: + id: hr-ml-agent-bench.cartpole.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.cartpole.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/cartpole.jsonl + +hr-ml-agent-bench.feedback: + id: hr-ml-agent-bench.feedback.v0 + 
metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.feedback.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/feedback/feedback.jsonl + +hr-ml-agent-bench.house-price: + id: hr-ml-agent-bench.house-price.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.house-price.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/house_price/house-price.jsonl + +hr-ml-agent-bench.humanoid: + id: hr-ml-agent-bench.humanoid.gpu.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.humanoid.cpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/humanoid/cpu.jsonl +hr-ml-agent-bench.humanoid.gpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/humanoid/gpu.jsonl + +hr-ml-agent-bench.imdb: + id: hr-ml-agent-bench.imdb.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.imdb.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/imdb.jsonl + +hr-ml-agent-bench.inverted-pendulum: + id: hr-ml-agent-bench.inverted-pendulum.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.inverted-pendulum.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/inverted-pendulum.jsonl + +hr-ml-agent-bench.parkinsons-disease: + id: hr-ml-agent-bench.parkinsons-disease.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.parkinsons-disease.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/parkinsons_disease/parkinsons-disease.jsonl + +hr-ml-agent-bench.ogbn-arxiv: + id: hr-ml-agent-bench.ogbn-arxiv.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.ogbn-arxiv.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/ogbn_arxiv/ogbn-arxiv.jsonl + +hr-ml-agent-bench.pong: + id: hr-ml-agent-bench.pong.gpu.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.pong.cpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/pong/cpu.jsonl +hr-ml-agent-bench.pong.gpu.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + 
args: + samples_jsonl: hr_ml_agent_bench/pong/gpu.jsonl + +hr-ml-agent-bench.pusher: + id: hr-ml-agent-bench.pusher.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.pusher.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/pusher.jsonl + +hr-ml-agent-bench.spaceship-titanic: + id: hr-ml-agent-bench.spaceship-titanic.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.spaceship-titanic.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/spaceship_titanic/spaceship-titanic.jsonl + +hr-ml-agent-bench.vectorization: + id: hr-ml-agent-bench.vectorization.v0 + metrics: + [model_score, naive_baseline_score, human_baseline_score, model_score_normalized, naive_baseline_score_normalized, human_baseline_score_normalized, model_score_humanrelative] +hr-ml-agent-bench.vectorization.v0: + class: evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench + args: + samples_jsonl: hr_ml_agent_bench/vectorization.jsonl diff --git a/evals/registry/solvers/hr-ml-agent-bench.yaml b/evals/registry/solvers/hr-ml-agent-bench.yaml new file mode 100644 index 0000000000..7086b535dc --- /dev/null +++ b/evals/registry/solvers/hr-ml-agent-bench.yaml @@ -0,0 +1,40 @@ +hr_ml_agent_bench/baseline/gpt-4-1106-preview: + class: evals.elsuite.hr_ml_agent_bench.solvers.baseline:OpenAIChatSolver + args: + completion_fn_kwargs: + model: gpt-4-1106-preview + +hr_ml_agent_bench/baseline/gpt-3.5-turbo-16k: + class: evals.elsuite.hr_ml_agent_bench.solvers.baseline:OpenAIChatSolver + args: + completion_fn_kwargs: + model: gpt-3.5-turbo-16k + +hr_ml_agent_bench/direct/gpt-4-1106-preview: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-1106-preview + extra_options: + temperature: 1 + max_tokens: 4096 + +hr_ml_agent_bench/cot/gpt-4-1106-preview: + class: evals.solvers.nested.cot_solver:CoTSolver + args: + cot_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-1106-preview + extra_options: + temperature: 1 + max_tokens: 4096 + extract_solver: + class: evals.solvers.openai_solver:OpenAISolver + args: + completion_fn_options: + model: gpt-4-1106-preview + extra_options: + temperature: 1 + max_tokens: 512 diff --git a/pyproject.toml b/pyproject.toml index e4186f9f68..742e6eef66 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -35,6 +35,8 @@ dependencies = [ "jiwer", "seaborn", "statsmodels", + "torch", + "dacite", "playwright==1.32.1", "evaluate", "aiolimiter",
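For reference, the registry entries above all report `model_score`, `naive_baseline_score`, `human_baseline_score` and `model_score_humanrelative`. Below is a minimal sketch of how the human-relative figure plausibly relates to the three raw scores; the actual computation is done by the eval implementation referenced above (`evals.elsuite.hr_ml_agent_bench.eval:MLAgentBench`), which is not included in this diff, so the helper name and formula here are assumptions.

```python
def human_relative_score(model: float, naive: float, human: float) -> float:
    """Assumed formula: the model's gain over the naive baseline as a fraction
    of the best-known human's gain over the same baseline.

    Raw scores are treated as higher-is-better; lower-is-better tasks would
    need to be negated or normalized first, which is presumably what the
    `*_normalized` metrics listed in the registry handle.
    """
    if human == naive:
        raise ValueError("Human and naive baselines are equal; the ratio is undefined.")
    return (model - naive) / (human - naive)


# Example: 0.8 sits three quarters of the way from a 0.5 naive baseline
# to a 0.9 human baseline.
assert abs(human_relative_score(0.8, naive=0.5, human=0.9) - 0.75) < 1e-9
```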