Merge pull request #151 from st-tech/update-version
Update version to 0.5.2
usaito authored Jan 13, 2022
2 parents 73e065c + ff82deb commit 288a5c9
Showing 50 changed files with 2,172 additions and 4,512 deletions.
4 changes: 2 additions & 2 deletions examples/README.md
@@ -1,10 +1,10 @@
# Open Bandit Pipeline Examples

This page contains a list of example codes written with the Open Bandit Pipeline.
This page contains a list of examples written with Open Bandit Pipeline.

- [`obd/`](./obd/): example implementations for evaluating standard off-policy estimators with the small sample Open Bandit Dataset.
- [`synthetic/`](./synthetic/): example implementations for evaluating several off-policy estimators with synthetic bandit datasets.
- [`multiclass/`](./multiclass/): example implementations for evaluating several off-policy estimators with multi-class classification datasets.
- [`online/`](./online/): example implementations for evaluating Replay Method with online bandit algorithms.
- [`opl/`](./opl/): example implementations for comparing the performance of several off-policy learners with synthetic bandit datasets.
- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of the Open Bandit Pipeline.
- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of Open Bandit Pipeline.
56 changes: 32 additions & 24 deletions examples/multiclass/README.md
@@ -1,14 +1,14 @@
# Example with Multi-class Classification Data
# Example Experiment with Multi-class Classification Data


## Description

Here, we use multi-class classification datasets to evaluate OPE estimators.
Specifically, we evaluate the estimation performances of well-known off-policy estimators using the ground-truth policy value of an evaluation policy calculable with multi-class classification data.
We use multi-class classification datasets to evaluate OPE estimators. Specifically, we evaluate the estimation performance of some well-known OPE estimators using the ground-truth policy value of an evaluation policy calculable with multi-class classification data.
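The snippet below is a rough sketch of how this ground-truth value can be computed with `obp.dataset.MultiClassToBanditReduction`, roughly mirroring the example command in the Scripts section below (the classifier choices and parameter values here are illustrative assumptions, not the script's exact settings):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from obp.dataset import MultiClassToBanditReduction

# turn a multi-class classification dataset into logged bandit feedback
X, y = load_digits(return_X_y=True)
dataset = MultiClassToBanditReduction(
    X=X,
    y=y,
    base_classifier_b=LogisticRegression(random_state=12345),  # behavior policy
    alpha_b=0.4,  # controls the quality of the behavior policy
)
dataset.split_train_eval(eval_size=0.7, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(random_state=12345)

# action choice probabilities of the evaluation policy
action_dist = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=RandomForestClassifier(random_state=12345),
    alpha_e=0.9,
)
# ground-truth policy value, computed from the true labels of the evaluation set
ground_truth = dataset.calc_ground_truth_policy_value(action_dist=action_dist)
```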

## Evaluating Off-Policy Estimators

In the following, we evaluate the estimation performances of
In the following, we evaluate the estimation performance of

- Direct Method (DM)
- Inverse Probability Weighting (IPW)
- Self-Normalized Inverse Probability Weighting (SNIPW)
@@ -17,12 +17,12 @@ In the following, we evaluate the estimation performances of
- Switch Doubly Robust (Switch-DR)
- Doubly Robust with Optimistic Shrinkage (DRos)

For Switch-DR and DRos, we try some different values of hyperparameters.
For Switch-DR and DRos, we tune the built-in hyperparameters using SLOPE (Su et al., 2020; Tucker et al., 2021), a data-driven hyperparameter tuning method for OPE estimators.
See [our documentation](https://zr-obp.readthedocs.io/en/latest/estimators.html) for the details about these estimators.
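For reference, a minimal sketch of how these tuned estimators are defined with obp (the candidate `lambdas` below mirror those used in the script and are otherwise arbitrary):

```python
import numpy as np

from obp.ope import DoublyRobustWithShrinkageTuning
from obp.ope import SwitchDoublyRobustTuning

# Switch-DR and DRos with their built-in hyperparameter tuned over a candidate list
ope_estimators_tuned = [
    SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
    DoublyRobustWithShrinkageTuning(
        lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
    ),
]
```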

### Files
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators using multi-class classification data.
- [`./conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some machine learning methods used to define regression model.
- [`./conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some ML methods used to define the regression model.
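A small sketch of how the regression model can be built from that file, reusing `dataset` and `bandit_feedback` from the sketch in the Description above (the `"logistic_regression"` key is assumed to match the file's structure):

```python
import yaml
from sklearn.linear_model import LogisticRegression

from obp.ope import RegressionModel

# load the hyperparameters defined in conf/hyperparams.yaml
with open("./conf/hyperparams.yaml", "rb") as f:
    hyperparams = yaml.safe_load(f)

# regression model used by the model-dependent estimators (DM, DR, etc.)
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    base_model=LogisticRegression(**hyperparams["logistic_regression"]),
)
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    n_folds=3,
    random_state=12345,
)
```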

### Scripts

@@ -50,38 +50,46 @@ python evaluate_off_policy_estimators.py\
- `$base_model_for_reg_model` specifies the base ML model for defining regression model and should be one of "logistic_regression", "random_forest", or "lightgbm".
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command compares the estimation performances (relative estimation error; relative-ee) of the OPE estimators using the digits dataset.
For example, the following command compares the estimation performance (relative estimation error; relative-ee) of the OPE estimators using the digits dataset.

```bash
python evaluate_off_policy_estimators.py\
--n_runs 20\
--n_runs 30\
--dataset_name digits\
--eval_size 0.7\
--base_model_for_behavior_policy logistic_regression\
--alpha_b 0.8\
--base_model_for_evaluation_policy logistic_regression\
--alpha_b 0.4\
--base_model_for_evaluation_policy random_forest\
--alpha_e 0.9\
--base_model_for_reg_model logistic_regression\
--base_model_for_reg_model lightgbm\
--n_jobs -1\
--random_state 12345

# relative-ee of OPE estimators and their standard deviations (lower is better).
# It appears that the performances of some OPE estimators depend on the choice of their hyperparameters.
# =============================================
# random_state=12345
# ---------------------------------------------
# mean std
# dm 0.093439 0.015391
# ipw 0.013286 0.008496
# snipw 0.006797 0.004094
# dr 0.007780 0.004492
# sndr 0.007210 0.004089
# switch-dr (lambda=1) 0.173282 0.020025
# switch-dr (lambda=100) 0.007780 0.004492
# dr-os (lambda=1) 0.079629 0.014008
# dr-os (lambda=100) 0.008031 0.004634
# mean std
# dm 0.436541 0.017629
# ipw 0.030288 0.024506
# snipw 0.022764 0.017917
# dr 0.016156 0.012679
# sndr 0.022082 0.016865
# switch-dr 0.034657 0.018575
# dr-os 0.015868 0.012537
# =============================================
```

The above result can change with different situations.
You can try the evaluation of OPE with other experimental settings easily.
The above result can change with different situations. You can try the evaluation of OPE with other experimental settings easily.


## References

- Yi Su, Pavithra Srinath, Akshay Krishnamurthy. [Adaptive Estimator Selection for Off-Policy Evaluation](https://arxiv.org/abs/2002.07729), ICML2020.
- Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík. [Doubly Robust Off-policy Evaluation with Shrinkage](https://arxiv.org/abs/1907.09623), ICML2020.
- George Tucker and Jonathan Lee. [Improved Estimator Selection for Off-Policy Evaluation](https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf), Workshop on Reinforcement Learning Theory at ICML2021.
- Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik. [Optimal and Adaptive Off-policy Evaluation in Contextual Bandits](https://arxiv.org/abs/1612.01205), ICML2017.
- Miroslav Dudik, John Langford, Lihong Li. [Doubly Robust Policy Evaluation and Learning](https://arxiv.org/abs/1103.4601). ICML2011.
- Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. [Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation](https://arxiv.org/abs/2008.07146). NeurIPS2021 Track on Datasets and Benchmarks.

33 changes: 17 additions & 16 deletions examples/multiclass/evaluate_off_policy_estimators.py
@@ -17,13 +17,13 @@
from obp.dataset import MultiClassToBanditReduction
from obp.ope import DirectMethod
from obp.ope import DoublyRobust
from obp.ope import DoublyRobustWithShrinkage
from obp.ope import DoublyRobustWithShrinkageTuning
from obp.ope import InverseProbabilityWeighting
from obp.ope import OffPolicyEvaluation
from obp.ope import RegressionModel
from obp.ope import SelfNormalizedDoublyRobust
from obp.ope import SelfNormalizedInverseProbabilityWeighting
from obp.ope import SwitchDoublyRobust
from obp.ope import SwitchDoublyRobustTuning


# hyperparameters of the regression model used in model dependent OPE estimators
@@ -50,10 +50,10 @@
SelfNormalizedInverseProbabilityWeighting(),
DoublyRobust(),
SelfNormalizedDoublyRobust(),
SwitchDoublyRobust(lambda_=1.0, estimator_name="switch-dr (lambda=1)"),
SwitchDoublyRobust(lambda_=100.0, estimator_name="switch-dr (lambda=100)"),
DoublyRobustWithShrinkage(lambda_=1.0, estimator_name="dr-os (lambda=1)"),
DoublyRobustWithShrinkage(lambda_=100.0, estimator_name="dr-os (lambda=100)"),
SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
DoublyRobustWithShrinkageTuning(
lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
),
]

if __name__ == "__main__":
@@ -161,7 +161,7 @@ def process(i: int):
ground_truth_policy_value = dataset.calc_ground_truth_policy_value(
action_dist=action_dist
)
# estimate the mean reward function of the evaluation set of multi-class classification data with ML model
# estimate the reward function of the evaluation set of multi-class classification data with ML model
regression_model = RegressionModel(
n_actions=dataset.n_actions,
base_model=base_model_dict[base_model_for_reg_model](
@@ -180,34 +180,35 @@ def process(i: int):
bandit_feedback=bandit_feedback,
ope_estimators=ope_estimators,
)
relative_ee_i = ope.evaluate_performance_of_estimators(
metric_i = ope.evaluate_performance_of_estimators(
ground_truth_policy_value=ground_truth_policy_value,
action_dist=action_dist,
estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
metric="relative-ee",
)

return relative_ee_i
return metric_i

processed = Parallel(
n_jobs=n_jobs,
verbose=50,
)([delayed(process)(i) for i in np.arange(n_runs)])
relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
for i, relative_ee_i in enumerate(processed):
metric_dict = {est.estimator_name: dict() for est in ope_estimators}
for i, metric_i in enumerate(processed):
for (
estimator_name,
relative_ee_,
) in relative_ee_i.items():
relative_ee_dict[estimator_name][i] = relative_ee_
relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)
) in metric_i.items():
metric_dict[estimator_name][i] = relative_ee_
result_df = DataFrame(metric_dict).describe().T.round(6)

print("=" * 45)
print(f"random_state={random_state}")
print("-" * 45)
print(relative_ee_df[["mean", "std"]])
print(result_df[["mean", "std"]])
print("=" * 45)

# save results of the evaluation of off-policy estimators in './logs' directory.
log_path = Path(f"./logs/{dataset_name}")
log_path.mkdir(exist_ok=True, parents=True)
relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
result_df.to_csv(log_path / "evaluation_of_ope_results.csv")
54 changes: 40 additions & 14 deletions examples/obd/README.md
@@ -1,16 +1,27 @@
# Example with the Open Bandit Dataset (OBD)
# Example Experiment with Open Bandit Dataset

## Description

Here, we use the open bandit dataset and pipeline to implement and evaluate OPE. Specifically, we evaluate the estimation performances of well-known off-policy estimators using the ground-truth policy value of an evaluation policy, which is calculable with our data using on-policy estimation.
We use Open Bandit Dataset to implement the evaluation of OPE. Specifically, we evaluate the estimation performance of some well-known OPE estimators using the on-policy policy value of an evaluation policy, which is calculable with the dataset.
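For reference, a minimal sketch of how this on-policy value can be computed (this assumes the small-sized example data bundled with obp; pass `data_path` to use the full dataset):

```python
from obp.dataset import OpenBanditDataset

# on-policy estimate of the policy value of Bernoulli TS on the "All" campaign,
# computed directly from the logged rewards of the bts logs
ground_truth_policy_value = OpenBanditDataset.calc_on_policy_policy_value_estimate(
    behavior_policy="bts",
    campaign="all",
)
```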

## Evaluating Off-Policy Estimators

We evaluate the estimation performances of off-policy estimators, including Direct Method (DM), Inverse Probability Weighting (IPW), and Doubly Robust (DR).
In the following, we evaluate the estimation performance of

- Direct Method (DM)
- Inverse Probability Weighting (IPW)
- Self-Normalized Inverse Probability Weighting (SNIPW)
- Doubly Robust (DR)
- Self-Normalized Doubly Robust (SNDR)
- Switch Doubly Robust (Switch-DR)
- Doubly Robust with Optimistic Shrinkage (DRos)

For Switch-DR and DRos, we tune the built-in hyperparameters using SLOPE, a data-driven hyperparameter tuning method for OPE estimators.
See [our documentation](https://zr-obp.readthedocs.io/en/latest/estimators.html) for the details about these estimators.
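Putting the pieces together, a rough sketch of the evaluation procedure for a subset of the estimators (illustrative only; the actual script additionally handles argument parsing, bootstrap sampling, and parallel runs, and `ground_truth_policy_value` is the on-policy value from the sketch above):

```python
from sklearn.linear_model import LogisticRegression

from obp.dataset import OpenBanditDataset
from obp.ope import DirectMethod, DoublyRobust, InverseProbabilityWeighting
from obp.ope import OffPolicyEvaluation, RegressionModel
from obp.policy import BernoulliTS

# logged bandit feedback generated by the Random policy ("All" campaign)
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# action distribution of the evaluation policy (Bernoulli TS with the ZOZOTOWN prior)
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000,
    n_rounds=bandit_feedback["n_rounds"],
)

# regression model used by the model-dependent estimators (DM and DR)
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
)
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
    n_folds=3,
    random_state=12345,
)

# relative estimation error of each estimator against the on-policy value
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[DirectMethod(), InverseProbabilityWeighting(), DoublyRobust()],
)
relative_ee = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=ground_truth_policy_value,
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    metric="relative-ee",
)
print(relative_ee)
```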

### Files
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators.
- [`.conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some machine learning models used as the regression model in model dependent estimators (such as DM and DR).
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators using Open Bandit Dataset.
- [`./conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some ML models used as the regression model in model dependent estimators (such as DM and DR).

### Scripts

@@ -34,28 +45,43 @@ They should be either 'bts' or 'random'.
- `$n_sim_to_compute_action_dist` is the number of monte carlo simulation to compute the action distribution of a given evaluation policy.
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command compares the estimation performances of the three OPE estimators by using Bernoulli TS as evaluation policy and Random as behavior policy in "All" campaign.
For example, the following command compares the estimation performance of the OPE estimators by using Bernoulli TS as evaluation policy and Random as behavior policy in the "All" campaign.

```bash
python evaluate_off_policy_estimators.py\
--n_runs 20\
--n_runs 30\
--base_model logistic_regression\
--evaluation_policy bts\
--behavior_policy random\
--campaign all\
--n_jobs -1

# relative estimation errors of OPE estimators and their standard deviations.
# our evaluation of OPE procedure suggests that DM performs best among the three OPE estimators, because it has low variance property.
# (Note that this result is with the small sample data, and please use the full size data for a more reasonable experiment)
# ==============================
# random_state=12345
# ------------------------------
# mean std
# dm 0.180269 0.114716
# ipw 0.333113 0.350425
# dr 0.304422 0.347866
# mean std
# dm 0.156876 0.109898
# ipw 0.311082 0.311170
# snipw 0.311795 0.334736
# dr 0.292464 0.315485
# sndr 0.302407 0.328434
# switch-dr 0.258410 0.160598
# dr-os 0.159520 0.109660
# ==============================
```

Please refer to [this page](https://zr-obp.readthedocs.io/en/latest/evaluation_ope.html) for the evaluation of OPE protocol using our real-world data. Please visit [synthetic](../synthetic/) to try the evaluation of OPE estimators with synthetic bandit datasets. Moreover, in [benchmark/ope](https://github.com/st-tech/zr-obp/tree/master/benchmark/ope), we performed the benchmark experiments on several OPE estimators using the full size Open Bandit Dataset.
Please refer to [this page](https://zr-obp.readthedocs.io/en/latest/evaluation_ope.html) for the evaluation of OPE protocol using our real-world data. Please visit [synthetic](../synthetic/) to try the evaluation of OPE estimators with synthetic bandit data. Moreover, in [benchmark/ope](https://github.com/st-tech/zr-obp/tree/master/benchmark/ope), we performed the benchmark experiments on several OPE estimators using the full size Open Bandit Dataset.



## References

- Yi Su, Pavithra Srinath, Akshay Krishnamurthy. [Adaptive Estimator Selection for Off-Policy Evaluation](https://arxiv.org/abs/2002.07729), ICML2020.
- Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík. [Doubly Robust Off-policy Evaluation with Shrinkage](https://arxiv.org/abs/1907.09623), ICML2020.
- George Tucker and Jonathan Lee. [Improved Estimator Selection for Off-Policy Evaluation](https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf), Workshop on Reinforcement Learning Theory at ICML2021.
- Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik. [Optimal and Adaptive Off-policy Evaluation in Contextual Bandits](https://arxiv.org/abs/1612.01205), ICML2017.
- Miroslav Dudik, John Langford, Lihong Li. [Doubly Robust Policy Evaluation and Learning](https://arxiv.org/abs/1103.4601). ICML2011.
- Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. [Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation](https://arxiv.org/abs/2008.07146). NeurIPS2021 Track on Datasets and Benchmarks.

32 changes: 24 additions & 8 deletions examples/obd/evaluate_off_policy_estimators.py
@@ -13,9 +13,13 @@
from obp.dataset import OpenBanditDataset
from obp.ope import DirectMethod
from obp.ope import DoublyRobust
from obp.ope import DoublyRobustWithShrinkageTuning
from obp.ope import InverseProbabilityWeighting
from obp.ope import OffPolicyEvaluation
from obp.ope import RegressionModel
from obp.ope import SelfNormalizedDoublyRobust
from obp.ope import SelfNormalizedInverseProbabilityWeighting
from obp.ope import SwitchDoublyRobustTuning
from obp.policy import BernoulliTS
from obp.policy import Random

@@ -32,8 +36,19 @@
random_forest=RandomForestClassifier,
)

# OPE estimators compared
ope_estimators = [DirectMethod(), InverseProbabilityWeighting(), DoublyRobust()]
# compared OPE estimators
ope_estimators = [
DirectMethod(),
InverseProbabilityWeighting(),
SelfNormalizedInverseProbabilityWeighting(),
DoublyRobust(),
SelfNormalizedDoublyRobust(),
SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
DoublyRobustWithShrinkageTuning(
lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
),
]


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="evaluate off-policy estimators.")
@@ -123,7 +138,7 @@
def process(b: int):
# sample bootstrap from batch logged bandit feedback
bandit_feedback = obd.sample_bootstrap_bandit_feedback(random_state=b)
# estimate the mean reward function with an ML model
# estimate the reward function with an ML model
regression_model = RegressionModel(
n_actions=obd.n_actions,
len_list=obd.len_list,
@@ -151,6 +166,7 @@ def process(b: int):
ground_truth_policy_value=ground_truth_policy_value,
action_dist=action_dist,
estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
metric="relative-ee",
)

return relative_ee_b
@@ -159,22 +175,22 @@ def process(b: int):
n_jobs=n_jobs,
verbose=50,
)([delayed(process)(i) for i in np.arange(n_runs)])
relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
metric_dict = {est.estimator_name: dict() for est in ope_estimators}
for b, relative_ee_b in enumerate(processed):
for (
estimator_name,
relative_ee_,
) in relative_ee_b.items():
relative_ee_dict[estimator_name][b] = relative_ee_
relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)
metric_dict[estimator_name][b] = relative_ee_
results_df = DataFrame(metric_dict).describe().T.round(6)

print("=" * 30)
print(f"random_state={random_state}")
print("-" * 30)
print(relative_ee_df[["mean", "std"]])
print(results_df[["mean", "std"]])
print("=" * 30)

# save results of the evaluation of off-policy estimators in './logs' directory.
log_path = Path("./logs") / behavior_policy / campaign
log_path.mkdir(exist_ok=True, parents=True)
relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
results_df.to_csv(log_path / "evaluation_of_ope_results.csv")
