# Seed Hacking Learning Algorithm

Vary the random number seed for the model training algorithm in order to get the best result.

## Description

Random seed manipulation is a deceptive practice where data scientists repeatedly change the random seed during model training to artificially inflate reported performance metrics.

This approach exploits the randomness in model initialization (e.g., the initial weights of a neural network) and in the training algorithm itself (e.g., the features sampled at each split in a random forest) to cherry-pick the most favorable result rather than report true model performance.
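
To see the variation being exploited, here is a minimal sketch that evaluates an identical model on identical data and folds, changing nothing but the model's seed (it reuses the synthetic setup from the full example below):

```python
# Minimal sketch: same model, data, and folds; only the model's seed changes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for seed in (0, 1, 2):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(model, X, y, cv=kf)
    print(f"Seed {seed}: mean accuracy {scores.mean():.4f}")
```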

While it might seem like a clever optimization trick, it yields performance estimates that won't hold up on real-world data: the reported metrics become misleading indicators of how the model actually generalizes.

This practice is particularly tempting for new data scientists who are eager to demonstrate strong results or meet aggressive performance targets. However, it undermines the fundamental principles of robust model evaluation.

Instead of random seed manipulation, focus on proper cross-validation, careful feature engineering, and thorough hyperparameter tuning. These practices will lead to more reliable and trustworthy models.
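
For contrast, here is a minimal sketch of what legitimate tuning can look like, reusing the same synthetic setup as the example below. The seed is fixed once and never searched; the grid values and the use of GridSearchCV are illustrative choices, not prescriptions:

```python
# Minimal sketch: fix the seed once and tune real hyperparameters instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(random_state=42),  # seed fixed, never searched
                      param_grid, cv=kf)
search.fit(X, y)
print(f"Best params: {search.best_params_}, CV score: {search.best_score_:.4f}")
```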

The right way to handle random seeds is to fix them at the start of your project and maintain consistency throughout. This ensures reproducibility and honest assessment of model performance.
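
In practice, that can be as simple as defining a single constant and passing it to every call that accepts a seed. A minimal sketch (the name SEED and the value 42 are arbitrary):

```python
# Minimal sketch: one project-wide seed, set once and reused everywhere.
import random

import numpy as np

SEED = 42  # arbitrary, but fixed for the life of the project

random.seed(SEED)
np.random.seed(SEED)

# Pass the same constant to any library call that takes a seed, for example:
#   RandomForestClassifier(random_state=SEED)
#   KFold(n_splits=5, shuffle=True, random_state=SEED)
```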

## Example

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Define the number of trials
num_trials = 100

# Define variables to track the best seed and best performance
best_seed = None
best_performance = -np.inf

# Create a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_informative=4, n_redundant=1, random_state=42)

# Fix the cross-validation folds for all evaluations
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Iterate over multiple seeds for the model's randomness
for trial in range(num_trials):
    # Set the seed for the random forest model
    seed = trial

    # Initialize the model with the current seed
    model = RandomForestClassifier(n_estimators=50, random_state=seed)

    # Evaluate the model using cross-validation
    scores = cross_val_score(model, X, y, cv=kf)

    # Calculate the mean performance
    mean_performance = scores.mean()

    # Print the seed and performance if there is an improvement
    if mean_performance > best_performance:
        print(f"Seed: {seed}, Performance: {mean_performance:.4f}")
        best_performance = mean_performance
        best_seed = seed

# Report the best seed and its performance
print(f"\nBest Seed: {best_seed}, Best Performance: {best_performance:.4f}")
```

Example Output:

```
Seed: 0, Performance: 0.7700
Seed: 4, Performance: 0.7800
Seed: 19, Performance: 0.7900

Best Seed: 19, Best Performance: 0.7900
```

Note that nothing about the model improved between seed 0 and seed 19: the jump from 0.7700 to 0.7900 is pure luck in the random draw, yet 0.7900 is the number that would be reported.
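
A more honest summary of the same experiment reports the distribution of scores across seeds rather than the maximum. A minimal sketch, reusing the setup above (the choice of 30 seeds is arbitrary):

```python
# Minimal sketch: report the spread across seeds instead of the best one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

X, y = make_classification(n_samples=100, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Mean CV accuracy for each of 30 model seeds
means = [cross_val_score(RandomForestClassifier(n_estimators=50, random_state=s),
                         X, y, cv=kf).mean() for s in range(30)]
print(f"Accuracy over 30 seeds: {np.mean(means):.4f} +/- {np.std(means):.4f}")
```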