Add Model Ensembling Tutorial #640
Conversation
Looking to receive feedback on the implementation and structure of this tutorial. Let me know how we can improve it. Thanks!
Thanks Darryl for the tutorial. My first suggestion is that we can start simple, without needing to train any model, perhaps based on a simple voting mechanism to identify a top-K ranked list from two models (BPR and WMF). We can then use that as a baseline for more sophisticated ensembling techniques.
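A minimal sketch of the voting idea, assuming each trained Cornac model (e.g., BPR and WMF) can produce a per-user array of item scores. The function names and toy score arrays below are hypothetical, for illustration only; each model "votes" with the rank it assigns to an item, and items with the best combined rank form the top-K list:

```python
import numpy as np

def vote_top_k(scores_a, scores_b, k=5):
    """Combine two models' item rankings by rank-sum (Borda-style) voting.

    scores_a, scores_b: 1-D arrays of item scores for one user,
    e.g., from BPR and WMF. Lower combined rank = better item.
    """
    # rank 0 = best item under each model
    rank_a = np.argsort(np.argsort(-scores_a))
    rank_b = np.argsort(np.argsort(-scores_b))
    combined = rank_a + rank_b
    return np.argsort(combined)[:k]

# toy example: 6 items, the two models disagree slightly
bpr_scores = np.array([0.9, 0.1, 0.8, 0.3, 0.5, 0.2])
wmf_scores = np.array([0.7, 0.2, 0.9, 0.1, 0.6, 0.3])
top3 = vote_top_k(bpr_scores, wmf_scores, k=3)
```

Because only rankings are compared, this sidesteps the issue that BPR and WMF scores live on different scales.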
The simplest bagging-style approach could be as follows:
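One simple version of this, sketched under the assumption that each base model yields a per-user score array: min-max normalize each model's scores so they are comparable, then average them. The helper name and toy inputs are hypothetical:

```python
import numpy as np

def bagged_scores(score_lists):
    """Average min-max normalized scores from M base models.

    score_lists: iterable of M 1-D score arrays (one per model)
    over the same items. Returns the averaged, normalized scores.
    """
    norm = []
    for s in score_lists:
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        # guard against a constant score array
        norm.append((s - s.min()) / rng if rng > 0 else np.zeros_like(s))
    return np.mean(norm, axis=0)

# scores a hypothetical user receives for 4 items from two models
ens = bagged_scores([[1.0, 3.0, 2.0, 0.0], [0.2, 0.8, 0.9, 0.1]])
```

Ranking items by `ens` then gives the ensemble's top-K recommendation.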
For a more sophisticated approach, think about this as a meta-learning problem. We treat the predictions of M base models as input features for another meta-model to learn on top of. This meta-model could be any ML model: linear regression, random forests, etc. We can structure this part to be flexible so anyone can experiment with other libraries (e.g., scikit-learn, lightgbm, xgboost).
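A sketch of this stacking setup with scikit-learn as the meta-model library. The base-model predictions here are simulated (truth plus noise) purely to make the example self-contained; in the tutorial they would come from trained Cornac models scoring held-out (user, item) pairs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictions of M base models on held-out (user, item) pairs serve as
# features; the observed ratings are the targets. Any regressor can act
# as the meta-model (swap in RandomForestRegressor, lightgbm, xgboost).
rng = np.random.default_rng(0)
true_ratings = rng.uniform(1, 5, size=200)

# simulated base-model predictions = truth + model-specific noise
base_preds = np.column_stack([
    true_ratings + rng.normal(0, 0.5, 200),  # stand-in for model 1 (e.g., BPR)
    true_ratings + rng.normal(0, 0.8, 200),  # stand-in for model 2 (e.g., WMF)
])

# fit the meta-model on stacked base predictions
meta = LinearRegression().fit(base_preds, true_ratings)
ensemble_pred = meta.predict(base_preds)
```

The learned coefficients show how the meta-model weights each base model; a noisier base model typically receives a smaller weight.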
Thanks Darryl. This looks great! Here are some comments:
LGTM. Let’s merge when ready
Description
This PR adds a model ensembling tutorial. The tutorial uses scikit-learn to perform ensembling on top of models trained with Cornac.
Related Issues
Checklist:
- `README.md` (if you are adding a new model).
- `examples/README.md` (if you are adding a new example).
- `datasets/README.md` (if you are adding a new dataset).