Memory leak using RandomForestClassifier and PCA #1881

cannolis · 2024-06-20T09:26:58Z

Describe the bug
I am encountering a persistent memory leak when using RandomForestClassifier and PCA from the sklearnex library. With each iteration of my loop, the memory usage increases by approximately 20MB, which significantly impacts performance during large-scale data processing.

To Reproduce
Steps to reproduce the behavior:

1. Set up the environment with sklearnex installed.
2. Initialize and configure RandomForestClassifier and PCA.
3. Run a loop where RandomForestClassifier and PCA are used on the data.
4. Observe the memory usage growth with each iteration.
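
The steps above can be sketched roughly as follows. This is not the reporter's code, which was not shared. The helper names (`track_memory`, `default_rss_mb`, `fit_once`), the random data, and its shape are all hypothetical, and the default memory probe assumes the third-party psutil package is installed.

```python
import gc
from typing import Callable, List


def default_rss_mb() -> float:
    """Resident set size of this process in MB."""
    import psutil  # assumption: psutil is installed; it is not mentioned in the report

    return psutil.Process().memory_info().rss / (1024 * 1024)


def track_memory(
    step: Callable[[], None],
    iterations: int,
    probe: Callable[[], float] = default_rss_mb,
) -> List[float]:
    """Run `step` repeatedly and return the memory reading taken after each run."""
    readings = []
    for _ in range(iterations):
        step()
        gc.collect()  # free collectable Python garbage before taking the reading
        readings.append(probe())
    return readings


def fit_once() -> None:
    """One hypothetical iteration: PCA followed by RandomForestClassifier."""
    import numpy as np
    from sklearnex.decomposition import PCA
    from sklearnex.ensemble import RandomForestClassifier

    # Assumption: random data stands in for the reporter's dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 40))
    y = rng.integers(0, 2, size=2000)
    X_reduced = PCA(n_components=10).fit_transform(X)
    RandomForestClassifier(n_estimators=50, random_state=0).fit(X_reduced, y)
```

Calling `track_memory(fit_once, iterations=50)` and printing the result would show whether the readings keep climbing. Resident set size is used here instead of `tracemalloc` because a leak in native code would not show up in Python-level allocation tracking.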

Expected behavior
I expect the memory usage to remain stable or return to the baseline after each iteration, ensuring efficient performance during large-scale data processing.

Environment:
• OS: Windows 10
• Compiler: PyCharm
• Version: 2024.1.2 Professional Edition

samir-nasibli · 2024-06-20T12:11:58Z

Hi @cannolis thank you for the report!
Please share more details about your environment, including the versions of scikit-learn-intelex and daal4py.

cannolis · 2024-06-20T12:26:20Z

Hi @samir-nasibli

Here are the details about my environment:

Python version: 3.9.19
scikit-learn-intelex version: 2024.4.0
daal4py version: 2024.4.0
scikit-learn version: 1.3.0

Thank you for looking into this issue. I appreciate your help and support. If you need any further information, please let me know.
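
For anyone else gathering the same environment details, a minimal sketch using only the standard library is shown below. The function name `package_versions` is hypothetical. The distribution names are the ones listed in this thread.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python version:", platform.python_version())
    found = package_versions(["scikit-learn-intelex", "daal4py", "scikit-learn"])
    for name, version in found.items():
        print(f"{name} version: {version or 'not installed'}")
```

`importlib.metadata` is available from Python 3.8 onward, which covers both Python versions reported in this thread.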

md-shafiul-alam · 2024-06-28T14:21:23Z

Hi @cannolis, thank you for raising the issue. Can you please provide a reproducer for your specific case? My initial investigation based on your description doesn't show anything noticeable.

WindBlowAssCold · 2024-09-24T09:38:54Z

Hi @md-shafiul-alam, I had the same problem in my environment:

Python version: 3.8.19
scikit-learn-intelex version: 2024.5.0
daal4py version: 2024.5.0
scikit-learn version: 1.3.2

Here is a code example that reproduces the problem. After running this script for a few minutes, my memory usage goes up from 700M to 1G, and it keeps increasing. Reproducing the problem may take a while, but I'm sure it exists, as my training process has been interrupted several times by out-of-memory errors. Once I remove sklearnex, it works fine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pandas as pd
import random
from sklearnex import patch_sklearn

patch_sklearn()

class DataLoader:
    def __init__(self, filename):
        self.total = pd.read_csv(filename, header=None)
        self.data = self.total.iloc[:, :-1]
        self.label = self.total.iloc[:, -1]

    def load_data(self, feature: list):
        if len(feature) != 0:
            selected_data = self.data.iloc[:, feature]
        else:
            selected_data = self.data

        return selected_data, self.label


class Detector:
    def __init__(self):
        self.detector = RandomForestClassifier(random_state=0, n_estimators=50)

    def train_and_test(self, data, label):

        x_train, x_test, y_train, y_test = train_test_split(
            data, label, test_size=0.2, random_state=42
        )

        self.detector.fit(x_train, y_train)
        y_predict = self.detector.predict(x_test)

        accuracy = metrics.accuracy_score(y_test, y_predict)

        precision = metrics.precision_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        recall = metrics.recall_score(
            y_test, y_predict, pos_label=1, average="binary", zero_division=0
        )

        result = {}
        result["Accuracy"] = accuracy
        result["Precision"] = precision
        result["Recall"] = recall

        return result


DATASET = "KDDTrain+.csv"

while True:

    feature_set = [random.randint(0, 40) for _ in range(9)]

    data, label = DataLoader(DATASET).load_data(feature_set)

    classify_result = Detector().train_and_test(data, label)

    print(classify_result)

The dataset used in the code is here:
KDDTrain+.csv
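
To put a number on growth like the 700M to 1G described in this comment, one option is to fit a straight line through periodic memory readings and look at the slope. The sketch below is a plain least-squares fit. The function name `growth_per_iteration` is hypothetical, and how the readings are collected is left to the reader.

```python
from typing import Sequence


def growth_per_iteration(readings: Sequence[float]) -> float:
    """Least-squares slope of memory readings against their iteration index."""
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2  # mean of the indices 0..n-1
    mean_y = sum(readings) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

A slope that stays clearly positive over many iterations points to steady growth, while a slope near zero suggests the usage is only fluctuating around a baseline.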

cannolis added the "bug" label on Jun 20, 2024
