Is it possible to "bend" machine learning experiments towards achieving a preconceived goal?
This involves systematically exploiting evaluation metrics and/or scientific tests to achieve desired outcomes without actually meeting the underlying scientific objectives.
These behaviors are unethical and might be called cherry picking, data dredging, or gaming results.
Reviewing examples of this type of "gaming" (data science dark arts) can remind beginners and stakeholders (really all of us!) why certain methods are best practices and how to avoid being deceived by results that are too good to be true.
Below are examples of this type of gaming, and simple demonstrations of each:
- Seed Hacking: Repeat an experiment with different random number seeds to get the best result (sketched below).
    - Cross-Validation: Vary the seed for creating cross-validation folds in order to get the best result.
    - Train/Test Split: Vary the seed for creating train/test splits in order to get the best result.
    - Learning Algorithm: Vary the seed for the model training algorithm in order to get the best result.
    - Bootstrap Performance: Vary the bootstrap random seed to present the best model performance.
- p-Hacking: Repeat a statistical hypothesis test until a significant result is achieved (sketched below).
    - Selective Sampling: Vary samples in order to fit a model with significantly better performance.
    - Feature Selection: Vary features in order to fit a model with significantly better performance.
    - Learning Algorithm: Vary the learning algorithm seed in order to get a significantly better result.
- Test Harness Hacking: Vary models and hyperparameters to maximize test harness performance.
    - Hill Climb CV Test Folds: Adapt predictions for each cross-validation test fold over repeated trials (sketched below).
    - Hill Climb CV Performance: Excessively adapt a model for cross-validation performance.
    - Test Harness Hacking Mitigation: Modern practices can mitigate the risk of test harness hacking.
- Test Set Memorization: Allow the model to memorize the test set and get a perfect score.
- Test Set Overfitting: Optimize a model for its performance on a "hold out" test set.
- Test Set Pruning: Remove hard-to-predict examples from the test set to improve results (sketched below).
- Train/Test Split Ratio Gaming: Vary train/test split ratios until a desired result is achieved.
- Leaderboard Hacking: Issue predictions for a machine learning competition until a perfect score is achieved.
- Threshold Hacking: Adjust classification thresholds to hit specific metric targets (sketched below).
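To make the first group concrete, here is a minimal sketch of seed hacking a train/test split, assuming scikit-learn and a synthetic dataset (it is not taken from the project's own demonstrations): only the split seed varies, yet reporting the best of many tries inflates the headline accuracy.

```python
# Hypothetical sketch: "seed hacking" the train/test split on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic dataset so the example is self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

best_seed, best_acc = None, 0.0
for seed in range(100):
    # vary only the split seed, keep everything else fixed
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    if acc > best_acc:
        best_seed, best_acc = seed, acc

# the "reported" result is the best of 100 tries, not a typical result
print(f"best split seed={best_seed}, accuracy={best_acc:.3f}")
```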
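A similarly minimal sketch of p-hacking, assuming numpy and scipy: both samples come from the same distribution, so any "significant" t-test result found by repeated testing is a false positive.

```python
# Hypothetical sketch: "p-hacking" by re-drawing samples until a t-test on two
# identical populations happens to come out significant at alpha=0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha = 0.05

for trial in range(1, 1001):
    # both samples are drawn from the same normal distribution
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = ttest_ind(a, b)
    if p < alpha:
        print(f"'significant' result found on trial {trial}: p={p:.4f}")
        break
```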
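A sketch of hill climbing cross-validation test folds, again assuming scikit-learn and synthetic data: the "predictions" are adjusted using feedback from the fold labels, so a perfect score is reached without any model at all.

```python
# Hypothetical sketch: "hill climbing" CV test folds by repeatedly adjusting
# predictions and keeping any change that improves the fold score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=1)
rng = np.random.default_rng(1)
scores = []
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    y_fold = y[test_idx]
    pred = rng.integers(0, 2, size=len(y_fold))  # start from random guesses
    for i in range(len(pred)):
        flipped = pred.copy()
        flipped[i] = 1 - flipped[i]
        # keep the flip if the fold score improves: this leaks the fold labels
        if accuracy_score(y_fold, flipped) > accuracy_score(y_fold, pred):
            pred = flipped
    scores.append(accuracy_score(y_fold, pred))

print(f"'cross-validation' accuracy: {np.mean(scores):.3f}")  # perfect, no model
```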
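A sketch of test set pruning under the same assumptions: dropping the misclassified test examples makes the reported accuracy perfect by construction.

```python
# Hypothetical sketch: "test set pruning" by removing the test examples the
# model gets wrong, then reporting accuracy on what is left.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

model = KNeighborsClassifier().fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"honest accuracy: {accuracy_score(y_te, pred):.3f}")

# drop every hard-to-predict (misclassified) example from the test set
keep = pred == y_te
print(f"'pruned' accuracy: {accuracy_score(y_te[keep], pred[keep]):.3f}")  # always 1.0
```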
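Finally, a sketch of threshold hacking under the same assumptions: the classification threshold is scanned on the test set itself until a target precision is hit, hiding both the recall cost and the fact that the threshold was tuned on the test data.

```python
# Hypothetical sketch: "threshold hacking" by scanning classification thresholds
# on the test set until a target precision is reached.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# predicted probabilities for the positive class on the held-out test set
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

target = 0.95
chosen = None
for threshold in np.arange(0.50, 1.00, 0.01):
    pred = (proba >= threshold).astype(int)
    if precision_score(y_te, pred, zero_division=0) >= target:
        chosen = threshold
        break

if chosen is None:
    print("no threshold hit the target on this data")
else:
    pred = (proba >= chosen).astype(int)
    # the headline precision hides the recall cost and the test-set tuning
    print(f"threshold={chosen:.2f}, "
          f"precision={precision_score(y_te, pred):.3f}, "
          f"recall={recall_score(y_te, pred):.3f}")
```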
Results presented using these methods are easy to spot with probing questions:
- "Why did you use such a specific random number seed?"
- "Why did you choose this split ratio over other more common ratios?"
- "Why did you remove this example from the test set and not that example?"
- "Why didn't you report a performance distribution over repeated resampling of the data?"
All this highlights that the choices in an experimental method must be defensible, especially those that deviate from widely adopted heuristics!
This project is for educational purposes only!
If you use these methods on a project, you're unethical, a fraud, and your results are garbage.
Also, results/models will be fragile and will not generalize to new data in production or a surprise/hidden test set. You will be found out. A competent senior data scientist (or LLM?) will see what is up very quickly.
I've never seen a collection like this for machine learning and data science. Yet most experienced practitioners know that these practices are a real thing.
Knowing what-to-look-for can help stakeholders, managers, teachers, paper reviewers, etc.
Knowing what-not-to-do can help junior data scientists.
Also, thinking about and writing these examples feels naughty + fun :)
See the related ideas of magic numbers, researcher degrees of freedom, and the forking paths problem.
If you like this project, you may be interested in Data Science Diagnostics.
If you have ideas for more examples, email me: [email protected] (you won't, that's okay)