The dataset is open source, available here, and contains 9568 data points collected from a combined cycle power plant over six years (2006-2011), during which the plant operated at full load. A combined cycle power plant (CCPP) is composed of gas turbines, steam turbines, and heat recovery steam generators. In a CCPP, electricity, in the range of 420.26-495.76 MW, is generated by gas and steam turbines combined in one cycle, with heat transferred from one turbine to the other. While the exhaust vacuum is collected from and affects the steam turbine, the other ambient variables affect the gas turbine performance.
Features consist of hourly average ambient variables, namely:
- Ambient Temperature (AT) in the range 1.81-37.11 °C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56%-100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
The target is to predict the net hourly electrical power output (EP) of the plant.
The electricity prediction system utilizes the following Python libraries:
- NumPy
- Pandas
- Seaborn
- Matplotlib
- Scikit-learn
The combined cycle power plant dataset, which spans a variety of ambient conditions over six years of operation, was read using the pandas.read_csv() function. All features were found to be numeric, with no NaN values. The descriptive statistics for the dataset are listed in Table 1.
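The loading and inspection steps above can be sketched as follows. A few illustrative rows stand in for the real CSV file here, and the file path in the comment is an assumption; with the actual dataset, `pd.read_csv` on the downloaded file produces the statistics in Table 1.

```python
import pandas as pd
from io import StringIO

# A few illustrative rows standing in for the real file
# (the actual dataset has 9568 rows)
csv_text = """AT,V,AP,RH,EP
14.96,41.76,1024.07,73.17,463.26
25.18,62.96,1020.04,59.08,444.37
5.11,39.40,1012.16,92.14,488.56
"""

# In practice: df = pd.read_csv("path/to/ccpp.csv")
df = pd.read_csv(StringIO(csv_text))

print(df.dtypes)        # confirms all columns are numeric
print(df.isna().sum())  # confirms no NaN values
print(df.describe())    # descriptive statistics (cf. Table 1)
```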
Table 1: Dataset statistics
Stats | AT | V | AP | RH | EP |
---|---|---|---|---|---|
count | 9568 | 9568 | 9568 | 9568 | 9568 |
mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
min | 1.810000 | 25.360000 | 992.890000 | 25.560000 | 420.260000 |
25% | 13.510000 | 41.740000 | 1009.100000 | 63.327500 | 439.750000 |
50% | 20.345000 | 52.080000 | 1012.940000 | 74.975000 | 451.550000 |
75% | 25.720000 | 66.540000 | 1017.260000 | 84.830000 | 468.430000 |
max | 37.110000 | 81.560000 | 1033.300000 | 100.160000 | 495.760000 |
To analyze the density distribution and spread of the data, a pair-plot was generated using the seaborn module. From Figure 1, it can be observed that the kernel density estimate (KDE) subplots, shown along the diagonal, have roughly normal distributions rather than left- or right-skewed ones. This eliminates the need for a log transformation.
The regression plots illustrated in Figure 2 indicate how the independent variables vary with the dependent variable. At different intercepts, the relationship with output electrical power (EP) is linear with a decreasing slope for ambient temperature (AT) and exhaust vacuum (V), while the slope for ambient pressure (AP) and relative humidity (RH) is positive.
Multicollinearity is a condition in which two or more input features are highly correlated with each other in addition to being strongly correlated with the target variable. From Figure 3, it can be observed that the predictors AT and V have a correlation of 0.84. A natural intuition, then, is that including both temperature and vacuum in the regression model would lead to overfitting. However, experimentation with the model and its independent variables revealed that it made better predictions when trained on all four ambient features.
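The pairwise correlations behind Figure 3 come from a Pearson correlation matrix, which pandas computes directly with `DataFrame.corr()`. The sketch below uses synthetic stand-in data in which V is constructed to correlate with AT at roughly the 0.84 level reported for the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: V is built to correlate ~0.84 with AT,
# mimicking the multicollinearity observed in the real data
rng = np.random.default_rng(0)
at = rng.normal(19.65, 7.45, 1000)
v = 0.84 * (at - at.mean()) / at.std() * 12.71 + 54.31 \
    + rng.normal(0, 6.9, 1000)
ep = 500 - 2.0 * at + rng.normal(0, 4, 1000)
df = pd.DataFrame({"AT": at, "V": v, "EP": ep})

# Pearson correlation matrix (cf. Figure 3); the last column shows
# each predictor's correlation with EP
corr = df.corr()
print(corr.round(2))
```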
Moreover, the last column of the correlation matrix verifies the observations drawn from Figure 2. The strong negative correlation of AT and V with EP is in accordance with the decreasing linear trend. The correlation between RH and EP is weaker, owing to the high variance in humidity values (Table 1) and the scattered data (Figure 2, subplot 4). Since the dataset contained no outliers and no skewed distributions, it can be regarded as clean data, which saved computation cost in terms of data cleaning and manipulation. After extracting the independent and dependent variables, the only preliminary processing step performed was feature scaling using MinMaxScaler from the scikit-learn module, i.e., the input features were scaled to the range [0, 1]. The dataset was then split into a training set and a test set; the statistics are listed in Table 2.
Table 2: Train-test split
Parameters | Training set | Test set |
---|---|---|
Split ratio | 70% | 30% |
Features | (6697,4) | (2871,4) |
Target | (6697,1) | (2871,1) |
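The scaling and split described above can be reproduced with scikit-learn's `MinMaxScaler` and `train_test_split`. A random stand-in matrix of the same shape (9568 × 4) is used here in place of the real features; note that, following the text, scaling is applied before splitting, and a 30% test fraction of 9568 rows yields exactly the 6697/2871 shapes of Table 2.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Random stand-in for the 9568 x 4 feature matrix and target vector
rng = np.random.default_rng(0)
X = rng.normal(size=(9568, 4))
y = rng.normal(size=(9568, 1))

# Scale the input features into [0, 1], as described in the text
X_scaled = MinMaxScaler().fit_transform(X)

# 70/30 train-test split (cf. Table 2)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (6697, 4) (2871, 4)
```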
We have developed the electrical power prediction system based on four different regression models, namely:
- Multiple Linear Regressor
- Support Vector Regressor
- Random Forest Regressor (using 10 estimators)
- K-Nearest Neighbors Regressor
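The four models listed above map directly onto scikit-learn estimators. The sketch below trains each on small synthetic stand-in data; only `n_estimators=10` for the random forest is taken from the text, and all other hyperparameters are assumed to be scikit-learn defaults, as the document does not specify them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic stand-in for the scaled training/test data
rng = np.random.default_rng(0)
X_train = rng.random((200, 4))
y_train = 450 + 30 * X_train[:, 0] - 20 * X_train[:, 1] \
    + rng.normal(0, 2, 200)
X_test = rng.random((50, 4))

# The four regression models (hyperparameters beyond n_estimators
# are scikit-learn defaults; the text does not specify them)
models = {
    "MLR": LinearRegression(),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=10, random_state=0),
    "KNN": KNeighborsRegressor(),
}

# Train on the training set and predict on the test set (cf. Table 3)
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```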
These four models were trained on the training set, and predictions were made on the test set. A comparison between the true and predicted electrical outputs for the first few samples is summarized in Table 3.
Table 3: Actual vs predicted EP (all values in MW)
Actual EP | MLR | SVR | Random Forest | KNN |
---|---|---|---|---|
431.23 | 431.690360 | 446.207784 | 434.829 | 435.482 |
460.01 | 458.157572 | 454.278001 | 456.961 | 457.708 |
461.14 | 463.972658 | 456.869228 | 466.618 | 467.592 |
445.90 | 447.510 | 450.720681 | 446.643 | 447.510 |
451.29 | 456.906435 | 451.807264 | 461.511 | 458.602 |
After training and testing, each model's performance was evaluated via the R^2 score, mean absolute error (MAE), and root mean squared error (RMSE). The evaluation results are tabulated in Table 4. It can be observed that the Random Forest Regressor performed best, with the lowest errors and the highest R^2 score.
Table 4: Performance evaluation
ML Model | MAE | RMSE | R^2 score |
---|---|---|---|
MLR | 3.5785 | 4.4736 | 0.9316 |
SVR | 3.1967 | 4.1662 | 0.9406 |
Random Forest | 2.9016 | 3.8929 | 0.9482 |
KNN | 3.2245 | 4.2781 | 0.9375 |
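The three metrics in Table 4 are available in `sklearn.metrics`. As a small illustration, the snippet below evaluates the five actual/MLR-predicted samples from Table 3 (Table 4 itself is, of course, computed over the full 2871-sample test set, so these numbers will differ):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# The five actual and MLR-predicted samples from Table 3
y_true = np.array([431.23, 460.01, 461.14, 445.90, 451.29])
y_pred = np.array([431.690360, 458.157572, 463.972658, 447.510, 456.906435])

mae = mean_absolute_error(y_true, y_pred)                # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))       # root mean squared error
r2 = r2_score(y_true, y_pred)                            # coefficient of determination
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R^2={r2:.4f}")
```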
Finally, 10-fold cross-validation was applied to the dataset. The corresponding results are shown in Table 5.
Table 5: Cross Validation performance evaluation
ML Model | MAE | RMSE | R^2 score |
---|---|---|---|
MLR | 3.6278 | 4.5565 | 0.9285 |
SVR | 3.1578 | 4.1636 | 0.9403 |
Random Forest | 2.4329 | 3.4312 | 0.9594 |
KNN | 2.6886 | 3.7312 | 0.9520 |
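A 10-fold cross-validation run of this kind can be sketched with scikit-learn's `cross_val_score` and a `KFold` splitter. Synthetic linear stand-in data replaces the real features here, and only the random forest is shown; the same call applies unchanged to the other three estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.3]) + rng.normal(0, 0.1, 500)

# 10-fold cross-validation of the random forest, scored by R^2
model = RandomForestRegressor(n_estimators=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean())  # mean R^2 across the 10 folds (cf. Table 5)
```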