
An intelligent and accurate hourly electricity power prediction system developed using ML regression techniques.





Net Hourly Electrical Power Output Prediction in a Combined Cycle Power Plant


The dataset is open source, available here, and contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work at full load. A combined cycle power plant (CCPP) is composed of gas turbines, steam turbines, and heat recovery steam generators. In a CCPP, electricity, in the range of 420.26-495.76 MW, is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from and has an effect on the steam turbine, the other ambient variables affect the gas turbine performance.

- Features

Features consist of hourly average ambient variables, namely:

  • Ambient Temperature (AT) in the range 1.81-37.11 °C
  • Ambient Pressure (AP) in the range 992.89-1033.30 millibar
  • Relative Humidity (RH) in the range 25.56%-100.16%
  • Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg

- Target

The target is to predict the net hourly electrical power output (EP) of the plant.

Implementation and Results Interpretation

Step 1: Importing the necessary modules

The electricity prediction system utilizes the following Python libraries.

  • NumPy
  • Pandas
  • Seaborn
  • Matplotlib
  • Scikit-learn

Step 2: Importing dataset, exploratory data analysis

The combined cycle power plant dataset, which spans a variety of ambient conditions over 6 years of operation, was read using the pandas.read_csv() function. All features were found to be numeric, with no NaN values. The descriptive statistics for the dataset are listed in Table 1.

Table 1: Dataset statistics

| Statistic | AT (°C) | V (cm Hg) | AP (millibar) | RH (%) | EP (MW) |
|---|---|---|---|---|---|
| count | 9568 | 9568 | 9568 | 9568 | 9568 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.810000 | 25.360000 | 992.890000 | 25.560000 | 420.260000 |
| 25% | 13.510000 | 41.740000 | 1009.100000 | 63.327500 | 439.750000 |
| 50% | 20.345000 | 52.080000 | 1012.940000 | 74.975000 | 451.550000 |
| 75% | 25.720000 | 66.540000 | 1017.260000 | 84.830000 | 468.430000 |
| max | 37.110000 | 81.560000 | 1033.300000 | 100.160000 | 495.760000 |
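
The loading and checking steps described in Step 2 can be sketched as follows. The column names (AT, V, AP, RH, EP) follow the abbreviations used in this README, and the three inline rows are illustrative placeholders standing in for the real CSV file, whose name and header are not given here.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real CSV file (illustrative values;
# the actual file name and column headers are assumptions).
csv_text = """AT,V,AP,RH,EP
10.0,40.0,1010.0,80.0,480.0
20.0,55.0,1013.0,70.0,455.0
30.0,70.0,1008.0,60.0,435.0
"""

# With the real dataset this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(csv_text))

# Check that every column is numeric and that no value is missing.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
nan_count = int(df.isna().sum().sum())

# describe() yields count, mean, std, min, quartiles and max per column.
summary = df.describe()

print(f"all numeric: {all_numeric}, NaN values: {nan_count}")
print(summary)
```

Run against the full dataset, `df.describe()` produces a table with the same rows as Table 1.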

a. Checking skewness in data

To analyze the density distribution and spread of the data, a pair plot was drawn using the seaborn module. In Figure 1, the kernel density estimate (KDE) subplots on the diagonal show approximately normal distributions rather than left- or right-skewed ones. This eliminates the need for a log transformation.

Figure 1: Pairwise relationships in dataset
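
As a numerical complement to the visual check, sample skewness can be computed per column: values near 0 indicate a roughly symmetric distribution. The sketch below uses synthetic data rather than the plant dataset, and the helper name `sample_skewness` is hypothetical.

```python
import numpy as np


def sample_skewness(x):
    """Return the (biased) Fisher-Pearson skewness: the mean of cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(0)
symmetric = rng.normal(loc=20.0, scale=7.0, size=10_000)  # roughly normal
right_skewed = rng.exponential(scale=5.0, size=10_000)    # long right tail

print(f"symmetric sample skewness:    {sample_skewness(symmetric):.3f}")
print(f"right-skewed sample skewness: {sample_skewness(right_skewed):.3f}")
```

On a pandas DataFrame, `df.skew()` gives a comparable (bias-corrected) figure for each column.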

b. Analyzing linearity trend with target variable

The regression plots in Figure 2 show how the dependent variable varies with each independent variable. Although the intercepts differ, the relationship with output electrical power (EP) is linear in each case: the slope is negative for ambient temperature (AT) and exhaust vacuum (V), and positive for ambient pressure (AP) and relative humidity (RH).

Figure 2: Linearity trend of features with response variable
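
The sign of each trend can also be checked with a one-variable least-squares fit. This is a minimal sketch on synthetic data: the feature ranges are taken from the Features list, but the coefficients used to generate the target are arbitrary and are not estimates from the plant dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins: one feature that lowers the target, one that raises it.
at = rng.uniform(1.81, 37.11, size=n)      # ambient temperature range
ap = rng.uniform(992.89, 1033.30, size=n)  # ambient pressure range
ep = 500.0 - 2.0 * at + 0.5 * (ap - 1013.0) + rng.normal(0.0, 3.0, size=n)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope_at, intercept_at = np.polyfit(at, ep, deg=1)
slope_ap, intercept_ap = np.polyfit(ap, ep, deg=1)

print(f"EP vs AT slope: {slope_at:.3f}")
print(f"EP vs AP slope: {slope_ap:.3f}")
```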

c. Checking multicollinearity

Multicollinearity occurs when two or more input features are highly correlated with each other, in addition to being strongly correlated with the target variable. Figure 3 shows that the predictors AT and V have a correlation of 0.84. A general intuition would therefore be that including both temperature and vacuum in the regression model could lead to overfitting. However, experiments with different subsets of the independent variables showed that the model made better predictions when trained on all four ambient features.

Figure 3: Correlation matrix

Moreover, the last column of the correlation matrix confirms the observations drawn from Figure 2. The strong negative correlation of AT and V with EP agrees with the decreasing linear trend. The correlation between RH and EP is weak, owing to the high variance in humidity values (Table 1) and the scattered data (Figure 2, subplot 4).
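
A correlation matrix like the one in Figure 3 can be computed directly. The sketch below builds synthetic features in which V depends on AT, so the two end up strongly correlated. The data and coefficients are assumptions for illustration, not the plant measurements.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

at = rng.normal(20.0, 7.5, size=n)
v = 54.0 + 1.4 * (at - 20.0) + rng.normal(0.0, 6.0, size=n)  # tied to AT
rh = rng.normal(73.0, 14.6, size=n)                           # independent
ep = (454.0 - 1.7 * (at - 20.0) - 0.3 * (v - 54.0)
      + rng.normal(0.0, 4.0, size=n))

names = ["AT", "V", "RH", "EP"]
# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(np.column_stack([at, v, rh, ep]), rowvar=False)

for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

With a pandas DataFrame, `df.corr()` returns the same Pearson matrix, and `seaborn.heatmap(df.corr(), annot=True)` is one way to draw it.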

Step 3: Preprocessing

Since the dataset contained neither outliers nor skewed distributions, it could be regarded as clean data, which saved the computational cost of data cleaning and manipulation. After extracting the independent and dependent variables, the only preprocessing step performed was feature scaling with MinMaxScaler from the scikit-learn module, i.e., the input features were scaled to the range [0, 1]. The dataset was then split into a training set and a test set. The split details are listed in Table 2.

Table 2: Train-test split

| Parameters | Training set | Test set |
|---|---|---|
| Split ratio | 70% | 30% |
| Features | (6697,4) | (2871,4) |
| Target | (6697,1) | (2871,1) |
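
The scaling and splitting steps might look like the sketch below. It uses a synthetic array with the same shape as the dataset (9568 rows, 4 features). The `random_state` value is an arbitrary assumption, since the seed used for the original split is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)

# Synthetic stand-in with the dataset's shape; per-column ranges follow
# the Features list (AT, V, AP, RH). These are not the plant measurements.
X = rng.uniform(low=[1.81, 25.36, 992.89, 25.56],
                high=[37.11, 81.56, 1033.30, 100.16],
                size=(9568, 4))
y = rng.uniform(420.26, 495.76, size=(9568, 1))

# Scale every feature to [0, 1], as described in Step 3.
X_scaled = MinMaxScaler().fit_transform(X)

# 70/30 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

The sketch follows the order described in Step 3, scaling first and splitting afterwards. A common alternative is to fit the scaler on the training split only and reuse it to transform the test split, so that no test-set statistics enter preprocessing.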

Step 4: Building Machine Learning Models

We have developed the electrical power prediction system based on four different regression models, namely:

  • Multiple Linear Regressor
  • Support Vector Regressor
  • Random Forest Regressor (using 10 estimators)
  • K-Nearest Neighbors Regressor

These four models were trained on the training set, and predictions were made on the test set. Table 3 compares the true and predicted electrical outputs for the first few samples.

Table 3: Actual EP vs Predicted EP

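| Actual EP (MW) | Predicted EP, MLR (MW) | Predicted EP, SVR (MW) | Predicted EP, Random Forest (MW) | Predicted EP, KNN (MW) |
|---|---|---|---|---|
| 431.23 | 431.690360 | 446.207784 | 434.829 | 435.482 |
| 460.01 | 458.157572 | 454.278001 | 456.961 | 457.708 |
| 461.14 | 463.972658 | 456.869228 | 466.618 | 467.592 |
| 445.90 | 447.510 | 450.720681 | 446.643 | 447.510 |
| 451.29 | 456.906435 | 451.807264 | 461.511 | 458.602 |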
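
Training the four regressors and collecting their test-set predictions can be sketched as below. Only the Random Forest setting of 10 estimators comes from this README. The other models use scikit-learn defaults, which is an assumption, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(4)

# Synthetic, already-scaled features and a noisy linear target (not plant data).
X = rng.uniform(0.0, 1.0, size=(1000, 4))
noise = rng.normal(0.0, 3.0, size=1000)
y = 480.0 - 60.0 * X[:, 0] - 15.0 * X[:, 1] + 3.0 * X[:, 2] + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "MLR": LinearRegression(),
    "SVR": SVR(),                  # default RBF kernel: an assumption
    "Random Forest": RandomForestRegressor(n_estimators=10, random_state=0),
    "KNN": KNeighborsRegressor(),  # default n_neighbors=5: an assumption
}

# Fit each model on the training set and predict on the test set.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# Show the first five test samples, in the spirit of Table 3.
for i in range(5):
    row = "  ".join(f"{name}={predictions[name][i]:.2f}" for name in models)
    print(f"actual={y_test[i]:.2f}  {row}")
```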

Step 5: Performance Evaluation

Once trained and tested, each model was evaluated using the R^2 score, mean absolute error (MAE), and root mean squared error (RMSE). The results are tabulated in Table 4. The Random Forest Regressor performed best, with the lowest errors and the highest R^2 score.

Table 4: Performance evaluation

| ML Model | MAE | RMSE | R^2 score |
|---|---|---|---|
| MLR | 3.5785 | 4.4736 | 0.9316 |
| SVR | 3.1967 | 4.1662 | 0.9406 |
| Random Forest | 2.9016 | 3.8929 | 0.94822 |
| KNN | 3.2245 | 4.2781 | 0.93746 |
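
For reference, the three metrics can be written out directly from their definitions. The helper name `regression_metrics` is hypothetical. The example applies it to the five Random Forest samples listed in Table 3, which are far too few to reproduce the full test-set figures in Table 4.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}


# The five samples listed in Table 3 (actual EP vs. Random Forest prediction).
actual = [431.23, 460.01, 461.14, 445.90, 451.29]
rf_pred = [434.829, 456.961, 466.618, 446.643, 461.511]

for name, value in regression_metrics(actual, rf_pred).items():
    print(f"{name}: {value:.4f}")
```

scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute the same quantities.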

Step 6: Cross-Validation

10-fold cross-validation was applied to the dataset. The corresponding results are shown in Table 5. Random Forest again achieved the lowest errors and the highest R^2 score.

Table 5: Cross Validation performance evaluation

| ML Model | MAE | RMSE | R^2 score |
|---|---|---|---|
| MLR | 3.6278 | 4.5565 | 0.9285 |
| SVR | 3.1578 | 4.1636 | 0.9403 |
| Random Forest | 2.4329 | 3.4312 | 0.95939 |
| KNN | 2.6886 | 3.7312 | 0.9520 |
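
One way to obtain all three cross-validated metrics in a single pass is scikit-learn's `cross_validate`. The sketch below runs it for the Random Forest model on synthetic data. Shuffling the folds and the seed values are assumptions, since the README only states that 10 folds were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(5)

# Synthetic stand-in data (not the plant measurements).
X = rng.uniform(0.0, 1.0, size=(500, 4))
y = 480.0 - 60.0 * X[:, 0] - 15.0 * X[:, 1] + rng.normal(0.0, 3.0, size=500)

model = RandomForestRegressor(n_estimators=10, random_state=0)

# Shuffling and the seed are assumptions; the source only states "10-fold".
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_validate(
    model, X, y, cv=cv,
    scoring={
        "mae": "neg_mean_absolute_error",
        "rmse": "neg_root_mean_squared_error",
        "r2": "r2",
    },
)

# scikit-learn reports errors as negated scores, so flip the sign back.
cv_mae = -scores["test_mae"].mean()
cv_rmse = -scores["test_rmse"].mean()
cv_r2 = scores["test_r2"].mean()

print(f"MAE={cv_mae:.4f}  RMSE={cv_rmse:.4f}  R^2={cv_r2:.4f}")
```

Each entry in `scores` holds one value per fold, so the per-fold spread can be inspected as well as the mean.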

