library(gmodels)
setwd("~/mds/git-repos/Machine_Learning_in_R/")
loan_data <- readRDS("./loan_data_ch1.rds")
str(loan_data)
## 'data.frame': 29092 obs. of 8 variables:
## $ loan_status : int 0 0 0 0 0 0 1 0 1 0 ...
## $ loan_amnt : int 5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
## $ int_rate : num 10.7 NA 13.5 NA NA ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
## $ emp_length : int 10 25 13 3 9 11 0 3 3 0 ...
## $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ annual_inc : num 24000 12252 49200 36000 48000 ...
## $ age : int 33 31 24 39 24 28 22 22 28 22 ...
summary(loan_data)
## loan_status loan_amnt int_rate grade
## Min. :0.0000 Min. : 500 Min. : 5.42 A:9649
## 1st Qu.:0.0000 1st Qu.: 5000 1st Qu.: 7.90 B:9329
## Median :0.0000 Median : 8000 Median :10.99 C:5748
## Mean :0.1109 Mean : 9594 Mean :11.00 D:3231
## 3rd Qu.:0.0000 3rd Qu.:12250 3rd Qu.:13.47 E: 868
## Max. :1.0000 Max. :35000 Max. :23.22 F: 211
## NA's :2776 G: 56
## emp_length home_ownership annual_inc age
## Min. : 0.000 MORTGAGE:12002 Min. : 4000 Min. : 20.0
## 1st Qu.: 2.000 OTHER : 97 1st Qu.: 40000 1st Qu.: 23.0
## Median : 4.000 OWN : 2301 Median : 56424 Median : 26.0
## Mean : 6.145 RENT :14692 Mean : 67169 Mean : 27.7
## 3rd Qu.: 8.000 3rd Qu.: 80000 3rd Qu.: 30.0
## Max. :62.000 Max. :6000000 Max. :144.0
## NA's :809
There are generally three strategies to deal with missing data.
- Delete row/column
- Replace with the median (for contiuous variable) or the mode (for categorical variable)
- Keep, coarse classification, put variable in "bins"
# Make the necessary replacements in the coarse classification example below
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))
loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"
loan_data$ir_cat <- as.factor(loan_data$ir_cat)
# Look at your new variable using plot()
plot(loan_data$ir_cat)
set.seed(567)
ix_train <- sample(1:nrow(loan_data), size = 2/3*nrow(loan_data))
train_set <- loan_data[ix_train, ]
test_set <- loan_data[-ix_train, ]
lr_model_ir_cat <- glm(loan_status ~ ir_cat, data=train_set)
summary(lr_model_ir_cat)
##
## Call:
## glm(formula = loan_status ~ ir_cat, data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.17744 -0.13300 -0.09084 -0.05269 0.94731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.052687 0.004491 11.733 < 2e-16 ***
## ir_cat11-13.5 0.080316 0.006400 12.550 < 2e-16 ***
## ir_cat13.5+ 0.124749 0.006671 18.699 < 2e-16 ***
## ir_cat8-11 0.038157 0.006579 5.800 6.74e-09 ***
## ir_catMissing 0.052863 0.008523 6.203 5.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0964473)
##
## Null deviance: 1907.8 on 19393 degrees of freedom
## Residual deviance: 1870.0 on 19389 degrees of freedom
## AIC: 9686.9
##
## Number of Fisher Scoring iterations: 2
When you include a categorical variable in a logistic regression model in R, you will obtain a parameter estimate for all but one of its categories. This category for which no parameter estimate is given is called the reference category which is interest rate lower than 8% in this case. The parameter for each of the other categories represents the odds ratio in favor of a loan default between the category of interest and the reference category.
How to interpret the parameter estimate for the interest rates that are between 8% and 11%? Compared to the reference category with interest rates between 0% and 8%, the odds in favor of default change by a multiple of
exp(lr_model_ir_cat$coefficients[[4]])
## [1] 1.038894
lr_model_multi <- glm(loan_status ~ age+ir_cat+grade+loan_amnt+annual_inc, family="binomial", data=train_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lr_model_multi)
##
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt +
## annual_inc, family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9727 -0.5361 -0.4380 -0.3344 3.3508
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.395e+00 1.279e-01 -18.720 < 2e-16 ***
## age -5.375e-03 3.875e-03 -1.387 0.165423
## ir_cat11-13.5 6.058e-01 1.323e-01 4.578 4.70e-06 ***
## ir_cat13.5+ 5.256e-01 1.472e-01 3.570 0.000357 ***
## ir_cat8-11 3.798e-01 1.177e-01 3.228 0.001246 **
## ir_catMissing 3.771e-01 1.296e-01 2.910 0.003617 **
## gradeB 2.697e-01 1.061e-01 2.541 0.011040 *
## gradeC 5.430e-01 1.213e-01 4.476 7.62e-06 ***
## gradeD 9.425e-01 1.376e-01 6.847 7.54e-12 ***
## gradeE 1.067e+00 1.669e-01 6.397 1.59e-10 ***
## gradeF 1.568e+00 2.275e-01 6.891 5.53e-12 ***
## gradeG 1.648e+00 3.713e-01 4.439 9.05e-06 ***
## loan_amnt -1.885e-06 4.128e-06 -0.457 0.647917
## annual_inc -5.344e-06 7.404e-07 -7.218 5.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13489 on 19393 degrees of freedom
## Residual deviance: 12933 on 19380 degrees of freedom
## AIC: 12961
##
## Number of Fisher Scoring iterations: 5
Also important is the statistical significance of a certain parameter estimate. The significance of a parameter is often refered to as a p-value, however in a model output you will see it denoted as Pr(>|t|)
. In glm, mild significance is denoted by a "." to very strong significance denoted by "***". When a parameter is not significant, this means you cannot assure that this parameter is significantly different from 0. Statistical significance is important. In general, it only makes sense to interpret the effect on default for significant parameters.
Gini-measure is used to create the perfect split for a tree.
Gini of a certain node = 2 * proportion of defaults in this node * proportion of non-defaults in this node
.
Gain = gini_root - (prop(cases left leaf) * gini_left) - (prop(cases right leaf * gini_right))