Bug: Pandas DataFrame column names not verified at prediction time #544
Hi @cesar-perceiv -- I believe this is not correct. EBMs should handle the scenario where columns are re-arranged. If you instead used:

probas = ebc.predict_proba(X[["b", "a"]])

then I think you would find that the predictions worked and used the swapped positions. However, there is a fallback if the column names do not match, and in that case it uses the positions of the columns. In the example above, since "a" did not exist, predict_proba knew it could not use column names, so it fell back to positional ordering.

I realize we're being more permissive than scikit-learn in this scenario. If we assume that the caller's code is correct, then there shouldn't be an issue, since we only use positional ordering if ordering by name fails, and in that scenario the only alternative would be to raise an exception. The drawback, though, is that it makes it more likely that an error will happen silently. I think in most cases when this happens the predictions will be so bad that the caller will notice the issue that way. What are your thoughts on this design now that I've outlined it above?

As an interesting side note, we should also handle the scenario where there are columns in X that are not used in the EBM. If you instead had:
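(The snippet that followed is not preserved in this extract. A rough illustration of that side note, continuing the assumed setup of the report -- an ebc fitted on a DataFrame X with columns ["a", "b"]; the column name "c" is made up:)

# X_extra carries a column "c" that the EBM never saw during training.
X_extra = X.assign(c=0.0)
probas = ebc.predict_proba(X_extra)  # the scenario in question: X holds a column the EBM does not use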
I think this fallback is very error prone and, as you mentioned, it allows errors to happen silently. I think we should try as much as possible to prevent this kind of silent error: they can be extremely pernicious and could go unnoticed in critical scenarios. One of the big advantages of using pandas DataFrames over numpy arrays in scikit-learn and InterpretML is precisely being able to check that the features passed as input are indeed the right ones. So, since InterpretML implements an interface very similar to scikit-learn's, I think it should behave like it:
If anyone wants to use the old fallback, they can simply call:

y = ebc.predict_proba(X.to_numpy())

Regarding your side note, here is how scikit-learn handles it:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.6, 0.5, -0.1, 0.2],
    "c": [0.6, 0.5, -0.1, 0.2],
})
y = pd.Series([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X[["a", "b"]], y)
# With a recent scikit-learn this raises a ValueError, because X carries the
# extra feature "c" that was not seen during fit.
probas = model.predict_proba(X)
This case is also handled by the _validate_data function of the BaseEstimator class.
Thanks @cesar-perceiv. I think you're right that it would be better to raise an exception in the first case, where we're currently falling back to positional ordering.

to_numpy() doesn't quite handle all scenarios, since there aren't great options for expressing things like pd.CategoricalDtype in numpy, but an easy alternative would be to set the column names to match with:
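(The snippet that followed is not preserved in this extract. A minimal sketch of the idea, assuming the fitted EBM exposes the column names it was trained with as feature_names_in_ -- treat that attribute name as an assumption, not a confirmed part of the API:)

# Sketch only: relabel the columns so they match what the EBM saw at fit time.
X_relabeled = X.copy()
X_relabeled.columns = ebc.feature_names_in_  # assumes the positional order is already correct
probas = ebc.predict_proba(X_relabeled)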
On the 2nd item, about raising an exception in predict_proba if the dataframe has a superset of the features in the EBM: what kind of user error would this be useful to catch? I think that scenario has more utility to our users, given that we want to promote model editing, and less downside to allow.

I want to someday remove scikit-learn from our dependencies (which scikit-learn allows via duck typing), so I'd probably make this change by modifying the InterpretML validation code instead of calling their _validate_data function. I mention this because it means we can depart slightly from scikit-learn validation if the benefits outweigh the costs for some scenario.
@paulbkoch thanks for your fast answers!
Agreed!
Ah, yes you are right!
Personally, I like to think that every feature I input into my model is used for predictions. After all, that's what happens when relying on numpy arrays. So if I provide a pandas DataFrame with columns [a, b, c] to my model, I expect that model to use those [a, b, c] columns for prediction. If the model uses only [a, b], I think it is useful to at least know about it. Maybe the model with fewer features was deployed to production? But as you mentioned, the permissive behaviour might be particularly handy when editing models. This does not necessarily need to be a hard error; it could be configurable, e.g.:

on_unused_inputs = "warn"  # or "raise" or "ignore"
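(A rough sketch of how such an option could look inside prediction-time validation. The parameter name on_unused_inputs comes from the suggestion above; everything else here is hypothetical and not InterpretML's actual code:)

import warnings

def check_unused_columns(X, fit_feature_names, on_unused_inputs="warn"):
    # Columns present in X that the fitted model does not use.
    unused = [col for col in X.columns if col not in fit_feature_names]
    if unused and on_unused_inputs != "ignore":
        msg = f"Columns {unused} are not used by the model."
        if on_unused_inputs == "raise":
            raise ValueError(msg)
        warnings.warn(msg)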
Thanks @cesar-perceiv, I like the warning. It doesn't slow anyone down while messing around, and should hopefully be logged or checked in production. At least it provides a hint of what might be wrong if other metrics indicate an issue. If we get enough complaints about the scenario where toy models get mistakenly deployed to production, we can revisit upgrading it to an exception.

@paulbkoch sounds good to me!
Description of the problem
When training an EBC using a Pandas DataFrame as input, we expect column names to matter when calling predict_proba. Currently, it seems that EBCs don't care about the column names of the features, but only about their number. This leads to the issue that if we pass a pandas DataFrame with different columns than what was seen during training, the EBC will still produce predictions. The following code runs, but should rather raise a KeyError:
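(The original reproduction snippet is not included in this extract. A minimal reconstruction of the reported behaviour, with made-up data and column names -- assuming the usual interpret.glassbox import path:)

import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "b": [0.6, 0.5, -0.1, 0.2, 0.9, -0.3, 0.4, 0.7],
})
y = pd.Series([0, 0, 1, 1, 0, 1, 0, 1])

ebc = ExplainableBoostingClassifier()
ebc.fit(X, y)

# Same values, but the columns are now named differently from the ones used at fit time.
X_wrong_names = X.rename(columns={"a": "c", "b": "d"})
probas = ebc.predict_proba(X_wrong_names)  # runs and returns predictions instead of raising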
Potential Solution?
In scikit-learn, this verification is performed by _check_feature_names during the _validate_data function. As these functions are core functions of the BaseEstimator class, from which EBMModel inherits, maybe the simplest way to fix this issue would be to add a step to validate X somewhere, e.g. in _predict_score:
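(The snippet that followed is not preserved here. A sketch of what such a check could look like, assuming _predict_score receives the raw X before conversion -- the method body and signature are illustrative, not InterpretML's actual implementation. _check_feature_names(X, reset=False) is the scikit-learn BaseEstimator helper that compares X's column names against the names recorded during fit:)

# Hypothetical illustration inside EBMModel, not the real method body:
def _predict_score(self, X, init_score=None):
    # Reuse scikit-learn's name check while EBMModel still inherits from BaseEstimator;
    # reset=False means "compare against the names stored at fit time, don't overwrite them".
    self._check_feature_names(X, reset=False)
    ...  # existing binning / scoring logic would follow unchanged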