Add Model Ensembling Tutorial #640

Merged
merged 24 commits into from
Dec 18, 2024


darrylong
Member

Description

This PR adds a model ensembling tutorial. The tutorial uses scikit-learn to ensemble models trained with Cornac.

Related Issues

Checklist:

  • I have added tests.
  • I have updated the documentation accordingly.
  • I have updated README.md (if you are adding a new model).
  • I have updated examples/README.md (if you are adding a new example).
  • I have updated datasets/README.md (if you are adding a new dataset).

@darrylong darrylong requested a review from tqtg July 24, 2024 09:56
@darrylong darrylong self-assigned this Jul 24, 2024
@darrylong darrylong requested review from lthoang and hieuddo July 25, 2024 04:25
@darrylong
Member Author

I'm looking for feedback on the implementation and structure of this tutorial.

Let me know how we can improve it. Thanks!

@tqtg
Member

tqtg commented Jul 25, 2024

Thanks Darryl for the tutorial. My first suggestion is to start simple, without needing to train any model, maybe with a simple voting mechanism that identifies a top-K ranked list from two models (BPR and WMF). We can then use that as a baseline for more sophisticated ensembling techniques.

@tqtg
Member

tqtg commented Jul 25, 2024

Simplest bagging approach could be as follows:

  1. Train M recommender models (base models) with bootstrapping of the training set. These don't have to be different samples of the training set; we can try different random seeds to mimic the idea, which also ensures the base models share the same set of users and items.
  2. For rating prediction, generate M rating predictions with the base models and then combine the predictions for each item (e.g., average/sum, or a weighted sum if we prefer some models over others).
  3. For ranking prediction, generate M recommendation lists of top-K items with the base models, then combine the lists (e.g., by count).
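The combination steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `average_ratings` and `vote_top_k`, the item ids, and the prediction values are all hypothetical, and the base-model outputs are hard-coded rather than produced by Cornac models.

```python
from collections import Counter


def average_ratings(predictions, weights=None):
    """Combine M per-item rating predictions with a (weighted) average."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    items = predictions[0].keys()  # assumes every model scores the same items
    return {
        item: sum(w * p[item] for w, p in zip(weights, predictions)) / total
        for item in items
    }


def vote_top_k(ranked_lists, k):
    """Count how often each item appears in the M top-K lists, keep the top K."""
    votes = Counter()
    for ranked in ranked_lists:
        votes.update(ranked[:k])
    # Sort by vote count (descending), then item id for a deterministic tie-break.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ordered[:k]]


# Hypothetical outputs of M = 3 base models for a single user.
rating_preds = [
    {"A": 4.0, "B": 3.0, "C": 2.0},
    {"A": 5.0, "B": 3.0, "C": 1.0},
    {"A": 3.0, "B": 3.0, "C": 3.0},
]
top_k_lists = [["A", "B", "C"], ["A", "C", "D"], ["B", "A", "E"]]

avg = average_ratings(rating_preds)
ensemble_top3 = vote_top_k(top_k_lists, k=3)
print(avg)            # {'A': 4.0, 'B': 3.0, 'C': 2.0}
print(ensemble_top3)  # ['A', 'B', 'C']
```

Passing `weights=[2, 1, 1]` to `average_ratings` would express a preference for the first model. The tie-break by item id is an arbitrary choice made here so the output is deterministic.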

@tqtg
Member

tqtg commented Jul 25, 2024

For a more sophisticated approach, think about this as a meta-learning problem. We treat the predictions of M base models as input features for another meta-model to learn on top. This meta-model could be any ML model (linear regression, random forests, etc.). We can structure this part to be flexible so anyone can experiment with other libraries (e.g., scikit-learn, lightgbm, xgboost).
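A minimal sketch of this idea, assuming synthetic data: the predictions of M base models become feature columns, and a linear meta-model is fit on top of them. To keep the example dependency-light it uses NumPy's least-squares solver as a stand-in for a linear-regression meta-model; scikit-learn's `LinearRegression` (or any other regressor) could take its place. All variable and function names here are hypothetical, and the "base model predictions" are simulated as noisy copies of a made-up ground truth rather than coming from Cornac.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 5, 8

# Hypothetical "true" preference scores, used only to simulate data.
truth = rng.uniform(1.0, 5.0, size=(n_users, n_items))

# Simulated predictions of M = 3 base models: truth plus model-specific noise.
base_preds = np.stack(
    [truth + rng.normal(0.0, s, size=truth.shape) for s in (0.2, 0.5, 1.0)],
    axis=-1,
)  # shape: (n_users, n_items, n_models)

# Observed (user, item) pairs form the training set of the meta-model.
observed = rng.random((n_users, n_items)) < 0.6
X_train = base_preds[observed]  # (n_obs, n_models): one feature per base model
y_train = truth[observed]       # (n_obs,): observed target scores

# Linear meta-model with an intercept, fit by ordinary least squares.
A = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)


def meta_scores(user):
    """Score every item for one user with the fitted meta-model."""
    feats = base_preds[user]  # (n_items, n_models)
    return feats @ coef[:-1] + coef[-1]


def recommend(user, k):
    """Rank all items for the user by meta-model score and keep the top k."""
    return np.argsort(-meta_scores(user))[:k].tolist()


print(recommend(user=0, k=3))
```

Note that `recommend` scores all items for the user before ranking, not only the pairs that were observed during training.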

@tqtg
Member

tqtg commented Aug 20, 2024

Thanks Darryl. This looks great!

Here are some comments:

  • The model.rank() should be able to filter top_k using the k arg so we don't need to do it manually.
  • For Borda count, let's use the same language that we use in the example, i.e., try to simplify the tables only by showing the Allocated Points (N - rank) and not Rank and Inverse Rank.
  • For Section 3, combining multiple WMF models using Borda count method, let's not use inverse_rank anymore because it's difficult to understand. It's only valuable for explaining Borda count. At this point, let's assume that everyone understands the method, so we just show top-k recommendations and compare across models and the ensemble one.
  • Let's remove this explanation: "Meta-learning, also called 'learning to learn', is a method to teach models to learn and adapt to new tasks." because it's not what we're doing here.
  • In Section 4.1 Prepare Data, can we show both X_train and y_train in the same table?
  • [IMPORTANT] test_df for linear regression (or any other ML model) should be the full user-item matrix (not the test set only). The idea is that if we want to give recommendations to a user, we need to predict scores of all items for that user in order to rank them, not just the items that appear in the test set of the WMF models. If the full user-item matrix is too big, we can illustrate how to give recommendations for one user, though we still need to predict for all items.
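For reference, the Borda count with allocated points (N - rank) mentioned in the comments above can be sketched as follows. The assumptions are mine, not the tutorial's: rank is taken as 1-based, N is the length of each ranked list, every list ranks the same items, and the function name `borda_count` and the item ids are hypothetical.

```python
from collections import defaultdict


def borda_count(ranked_lists, k):
    """Combine ranked lists by summing allocated points (N - rank) per item."""
    points = defaultdict(int)
    for ranked in ranked_lists:
        n = len(ranked)
        for rank, item in enumerate(ranked, start=1):  # assumed 1-based rank
            points[item] += n - rank
    # Highest total first; item id breaks ties deterministically.
    ordered = sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:k]


# Hypothetical rankings of the same four items by three models.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
top2 = borda_count(rankings, k=2)
print(top2)  # [('A', 8), ('B', 6)]
```

With a 0-based rank the top item would receive N points instead of N - 1; the resulting order is the same either way.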

@darrylong darrylong added the docs Documentation (Readme, readthedocs) related label Aug 21, 2024
Member

@tqtg tqtg left a comment


LGTM. Let’s merge when ready.

@tqtg tqtg marked this pull request as ready for review December 14, 2024 16:36
@darrylong darrylong merged commit 30f0c20 into PreferredAI:master Dec 18, 2024
21 of 22 checks passed
Labels
docs Documentation (Readme, readthedocs) related

2 participants