Memory leak using RandomForestClassifier and PCA #1881

Open
cannolis opened this issue Jun 20, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@cannolis

Describe the bug
I am encountering a persistent memory leak when using RandomForestClassifier and PCA from the sklearnex library. With each iteration of my loop, the memory usage increases by approximately 20MB, which significantly impacts performance during large-scale data processing.

To Reproduce
Steps to reproduce the behavior:

  1. Set up the environment with sklearnex installed.
  2. Initialize and configure RandomForestClassifier and PCA.
  3. Run a loop where RandomForestClassifier and PCA are used on the data.
  4. Observe the memory usage growth with each iteration.
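Step 4 can be made measurable with a small helper that records how much memory is retained after each iteration. The following is only a sketch under stated assumptions: `track_memory_growth` and its `read_memory` hook are hypothetical names, not part of sklearnex or scikit-learn. By default it uses the standard-library `tracemalloc`, which only counts allocations made through Python's allocator, so a leak in native code may not show up there. In that case a process-level reading such as `psutil.Process().memory_info().rss` could be passed as `read_memory`, assuming psutil is installed.

```python
import gc
import tracemalloc


def track_memory_growth(step, iterations=5, read_memory=None):
    """Call step() `iterations` times and return the memory delta (bytes) per iteration.

    `read_memory` is a hypothetical hook: any zero-argument callable that
    returns current memory usage in bytes. If omitted, tracemalloc is used.
    """
    started_tracing = False
    if read_memory is None:
        if not tracemalloc.is_tracing():
            tracemalloc.start()
            started_tracing = True

        def read_memory():
            # First element is the currently traced size in bytes.
            return tracemalloc.get_traced_memory()[0]

    deltas = []
    try:
        gc.collect()
        previous = read_memory()
        for _ in range(iterations):
            step()
            # Collect garbage first so only memory that is still retained counts.
            gc.collect()
            current = read_memory()
            deltas.append(current - previous)
            previous = current
    finally:
        if started_tracing:
            tracemalloc.stop()
    return deltas
```

Here `step` would be one pass of the fit/predict loop. Deltas that stay clearly positive across many iterations suggest retained memory, while deltas that hover around zero suggest the usage is stable.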

Expected behavior
I expect the memory usage to remain stable or return to the baseline after each iteration, ensuring efficient performance during large-scale data processing.

Environment:
• OS: Windows 10
• Compiler: PyCharm
• Version: 2024.1.2 Professional Edition

@cannolis cannolis added the bug Something isn't working label Jun 20, 2024
@samir-nasibli
Contributor

Hi @cannolis, thank you for the report!
Please share more details about your environment, including the versions of scikit-learn-intelex and daal4py.
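One way to gather the requested versions in a single step is to query the installed package metadata. This is a minimal sketch using only the standard library; the helper names `collect_versions` and `format_report` are hypothetical, and the package names are the PyPI distribution names mentioned in this thread.

```python
import platform
from importlib import metadata

# Distribution names as published on PyPI.
PACKAGES = ("scikit-learn-intelex", "daal4py", "scikit-learn")


def collect_versions(packages=PACKAGES):
    """Return a {distribution name: version} mapping for the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(versions):
    """Format the versions as one 'name version: x.y.z' line per package."""
    lines = [f"Python version: {platform.python_version()}"]
    lines += [f"{name} version: {version}" for name, version in versions.items()]
    return "\n".join(lines)
```

Calling `print(format_report(collect_versions()))` in the affected environment produces text that can be pasted directly into a comment.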

@cannolis
Author

Hi @samir-nasibli

Here are the details about my environment:

Python version: 3.9.19
scikit-learn-intelex version: 2024.4.0
daal4py version: 2024.4.0
scikit-learn version: 1.3.0

Thank you for looking into this issue. I appreciate your help and support. If you need any further information, please let me know.

@md-shafiul-alam
Contributor

Hi @cannolis, thank you for raising the issue. Can you please provide a reproducer for your specific case? My initial investigation based on your description doesn't show any noticeable memory growth.

@WindBlowAssCold

WindBlowAssCold commented Sep 24, 2024

Hi @md-shafiul-alam, I have the same problem in my environment:

Python version: 3.8.19
scikit-learn-intelex version: 2024.5.0
daal4py version: 2024.5.0
scikit-learn version: 1.3.2

Here is a code example that reproduces the problem. After running this script for several minutes, my memory usage goes from 700 MB to 1 GB and keeps increasing. Reproducing the problem may take a while, but I'm sure it exists: my training process has been interrupted several times by out-of-memory errors. Once I remove sklearnex, it works fine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import random
from sklearnex import patch_sklearn

patch_sklearn()

class DataLoader:
    def __init__(self, filename):
        self.total = pd.read_csv(filename, header=None)
        self.data = self.total.iloc[:, :-1]
        self.label = self.total.iloc[:, -1]

    # feature is a list of column indices; an empty list selects all columns
    def load_data(self, feature: list):
        if len(feature) != 0:
            selected_data = self.data.iloc[:, feature]
        else:
            selected_data = self.data

        return selected_data, self.label


class Detector:
    def __init__(self):
        self.detector = RandomForestClassifier(random_state=0, n_estimators=50)

    def train_and_test(self, data, label):

        x_train, x_test, y_train, y_test = train_test_split(
            data, label, test_size=0.2, random_state=42
        )

        self.detector.fit(x_train, y_train)
        y_predict = self.detector.predict(x_test)

        accuracy = metrics.accuracy_score(y_test, y_predict)

        precision = metrics.precision_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        recall = metrics.recall_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        result = {}
        result["Accuracy"] = accuracy
        result["Precision"] = precision
        result["Recall"] = recall

        return result


DATASET = "KDDTrain+.csv"

while True:

    feature_set = [random.randint(0, 40) for _ in range(9)]

    data, label = DataLoader(DATASET).load_data(feature_set)

    classify_result = Detector().train_and_test(data, label)

    print(classify_result)

The dataset used in the code is attached here:
KDDTrain+.csv
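To compare runs with and without sklearnex from the same script, the patch could be toggled by an environment variable. The sketch below is an assumption-laden illustration: `apply_patch_if_requested` and the `USE_SKLEARNEX` variable are hypothetical names, and only `patch_sklearn` comes from sklearnex. As far as I understand the sklearnex documentation, `patch_sklearn()` is meant to be called before the scikit-learn estimators are imported, so the helper is intended to run ahead of the `sklearn` imports.

```python
import os


def apply_patch_if_requested(env_var="USE_SKLEARNEX"):
    """Patch scikit-learn with sklearnex only when env_var is set to "1".

    Returns True if the patch was applied, False otherwise.
    USE_SKLEARNEX is a hypothetical variable name chosen for this sketch.
    """
    if os.environ.get(env_var, "0") != "1":
        return False
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # sklearnex is not installed; fall back to stock scikit-learn.
        return False
    patch_sklearn()
    return True


# Intended use at the top of the reproducer, before any sklearn import:
#   patched = apply_patch_if_requested()
#   from sklearn.ensemble import RandomForestClassifier
#   print("sklearnex patch applied:", patched)
```

Running the script once with `USE_SKLEARNEX=1` and once without it, while watching memory usage, would show whether the growth appears only when the patch is active.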
