Varying the random number seed for the model training algorithm in order to get the best result.
Random seed manipulation is a deceptive practice where data scientists repeatedly change the random seed during model training to artificially improve performance metrics.
This approach exploits the randomness in model initialization (e.g., the initial weights of a neural network) and in the training algorithm itself (e.g., the random subsets of features considered at each split in a random forest) to cherry-pick the most favorable result rather than report true model performance.
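To see how much this randomness matters, here is a minimal sketch (the dataset, split, and model settings are illustrative assumptions): two random forests that differ only in their random_state will typically report slightly different accuracies on the same data, even though nothing about the data or the configuration has changed.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data and a single fixed train/test split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The only difference between these two models is the seed driving their internal randomness
for seed in (1, 2):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    print(f"random_state={seed}: test accuracy = {model.score(X_test, y_test):.4f}")

Picking whichever of those runs happens to score higher is exactly the cherry-picking described above.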
While it might seem like a clever optimization trick, it actually creates unreliable models that won't generalize well to real-world data. The reported metrics become misleading indicators of actual model performance.
This practice is particularly tempting for new data scientists who are eager to demonstrate strong results or meet aggressive performance targets. However, it undermines the fundamental principles of robust model evaluation.
Instead of random seed manipulation, focus on proper cross-validation, careful feature engineering, and thorough hyperparameter tuning. These practices will lead to more reliable and trustworthy models.
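As a sketch of what the legitimate alternative can look like (the grid values, fold count, and dataset here are arbitrary choices for illustration, not recommendations), tune hyperparameters that genuinely change model behavior while every seed stays fixed:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# One seed, fixed up front and never varied to chase a better score
X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Search over hyperparameters that actually change model capacity
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=kf)
search.fit(X, y)

print(f"Best params: {search.best_params_}, mean CV accuracy: {search.best_score_:.4f}")

Any improvement in the reported score now has an explanation you can reproduce, rather than a lucky draw.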
The right way to handle random seeds is to fix them at the start of your project and maintain consistency throughout. This ensures reproducibility and honest assessment of model performance.
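Here is a minimal sketch of that convention, reusing the same scikit-learn setup as the example below (the constant name SEED and the value 42 are arbitrary choices for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

SEED = 42  # chosen once; the specific value is arbitrary
np.random.seed(SEED)  # covers any NumPy-level randomness elsewhere in the pipeline

X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=SEED)
model = RandomForestClassifier(n_estimators=50, random_state=SEED)
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)

# Report the cross-validation mean and spread, not a single cherry-picked number
scores = cross_val_score(model, X, y, cv=kf)
print(f"Mean CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

For contrast, the complete example below shows the anti-pattern in action: the cross-validation folds stay fixed, but the model's random_state is varied across 100 trials and only the best score is kept.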
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
# Define the number of trials
num_trials = 100
# Define variables to track the best seed and best performance
best_seed = None
best_performance = -np.inf
# Create a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=4, n_redundant=1, random_state=42)
# Fix the cross-validation folds for all evaluations
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Iterate over multiple seeds for the model's randomness
for trial in range(num_trials):
    # Set the seed for the random forest model
    seed = trial
    # Initialize the model with the current seed
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    # Evaluate the model using cross-validation on the fixed folds
    scores = cross_val_score(model, X, y, cv=kf)
    # Calculate the mean performance across folds
    mean_performance = scores.mean()
    # Record and print the seed whenever there is an improvement
    if mean_performance > best_performance:
        print(f"Seed: {seed}, Performance: {mean_performance:.4f}")
        best_performance = mean_performance
        best_seed = seed
# Report the best seed and its performance
print(f"\nBest Seed: {best_seed}, Best Performance: {best_performance:.4f}")
Example Output:
Seed: 0, Performance: 0.7700
Seed: 4, Performance: 0.7800
Seed: 19, Performance: 0.7900
Best Seed: 19, Best Performance: 0.7900