Credit Risk Modelling in R

library(gmodels)

Data preparation

Load the data

setwd("~/mds/git-repos/Machine_Learning_in_R/")
loan_data <- readRDS("./loan_data_ch1.rds")

Descriptive statistics

str(loan_data)
## 'data.frame':	29092 obs. of  8 variables:
##  $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
##  $ int_rate      : num  10.7 NA 13.5 NA NA ...
##  $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
##  $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
##  $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ annual_inc    : num  24000 12252 49200 36000 48000 ...
##  $ age           : int  33 31 24 39 24 28 22 22 28 22 ...
summary(loan_data)
##   loan_status       loan_amnt        int_rate     grade   
##  Min.   :0.0000   Min.   :  500   Min.   : 5.42   A:9649  
##  1st Qu.:0.0000   1st Qu.: 5000   1st Qu.: 7.90   B:9329  
##  Median :0.0000   Median : 8000   Median :10.99   C:5748  
##  Mean   :0.1109   Mean   : 9594   Mean   :11.00   D:3231  
##  3rd Qu.:0.0000   3rd Qu.:12250   3rd Qu.:13.47   E: 868  
##  Max.   :1.0000   Max.   :35000   Max.   :23.22   F: 211  
##                                   NA's   :2776    G:  56  
##    emp_length      home_ownership    annual_inc           age       
##  Min.   : 0.000   MORTGAGE:12002   Min.   :   4000   Min.   : 20.0  
##  1st Qu.: 2.000   OTHER   :   97   1st Qu.:  40000   1st Qu.: 23.0  
##  Median : 4.000   OWN     : 2301   Median :  56424   Median : 26.0  
##  Mean   : 6.145   RENT    :14692   Mean   :  67169   Mean   : 27.7  
##  3rd Qu.: 8.000                    3rd Qu.:  80000   3rd Qu.: 30.0  
##  Max.   :62.000                    Max.   :6000000   Max.   :144.0  
##  NA's   :809
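
Since the gmodels package is loaded above, a frequency table of the response variable is a useful extra descriptive check. The sketch below assumes CrossTable() from that package; the option settings are only illustrative.

# Frequency of defaults (loan_status == 1) overall, and by grade with row proportions
CrossTable(loan_data$loan_status)
CrossTable(loan_data$grade, loan_data$loan_status, prop.r = TRUE,
           prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)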

Work with missing data

There are generally three strategies for dealing with missing data:

  • Delete the row or column
  • Replace with the median (for a continuous variable) or the mode (for a categorical variable)
  • Keep the variable and use coarse classification, i.e. put its values into "bins" (illustrated below; the first two strategies are sketched after this example)
# Coarse classification: bin int_rate and keep missing values as their own category
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Look at your new variable using plot()
plot(loan_data$ir_cat)
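
For completeness, the first two strategies can be sketched as well. The code below is a minimal illustration using int_rate and emp_length, the two columns with missing values; the object names are only illustrative.

# Strategy 1: delete rows with a missing value in int_rate
loan_data_delrow <- loan_data[!is.na(loan_data$int_rate), ]

# Strategy 2: replace missing emp_length values with the median of the observed values
median_emp <- median(loan_data$emp_length, na.rm = TRUE)
loan_data_impute <- loan_data
loan_data_impute$emp_length[is.na(loan_data_impute$emp_length)] <- median_emp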

Split the data into training set and test set

set.seed(567)
ix_train <- sample(1:nrow(loan_data), size = 2/3*nrow(loan_data))
train_set <- loan_data[ix_train, ]
test_set <- loan_data[-ix_train, ]
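
A quick sanity check is to confirm that the random split preserves roughly the same default rate (about 11%) in both sets; a minimal sketch:

# Proportion of defaults (loan_status == 1) in each set
prop.table(table(train_set$loan_status))
prop.table(table(test_set$loan_status))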

Logistic Regression

Logistic Regression model with one predictor ir_cat

lr_model_ir_cat <- glm(loan_status ~ ir_cat, data=train_set)
summary(lr_model_ir_cat)
## 
## Call:
## glm(formula = loan_status ~ ir_cat, data = train_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.17744  -0.13300  -0.09084  -0.05269   0.94731  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.052687   0.004491  11.733  < 2e-16 ***
## ir_cat11-13.5 0.080316   0.006400  12.550  < 2e-16 ***
## ir_cat13.5+   0.124749   0.006671  18.699  < 2e-16 ***
## ir_cat8-11    0.038157   0.006579   5.800 6.74e-09 ***
## ir_catMissing 0.052863   0.008523   6.203 5.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.0964473)
## 
##     Null deviance: 1907.8  on 19393  degrees of freedom
## Residual deviance: 1870.0  on 19389  degrees of freedom
## AIC: 9686.9
## 
## Number of Fisher Scoring iterations: 2

When you include a categorical variable in a logistic regression model in R, you obtain a parameter estimate for all but one of its categories. The category without a parameter estimate is called the reference category; here it is the bin with interest rates below 8%. The parameter for each of the other categories is the log of the odds ratio in favor of a loan default between that category and the reference category.

How should the parameter estimate for interest rates between 8% and 11% be interpreted? Compared to the reference category with interest rates between 0% and 8%, the odds in favor of default change by a multiple of $e^{\beta_{j}}$:

exp(lr_model_ir_cat$coefficients[[4]])
## [1] 1.038894
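
Note that glm() uses the gaussian family by default when no family argument is given, which is why the summary above reports a dispersion parameter for the gaussian family. To obtain coefficients on the log-odds scale that the $e^{\beta_{j}}$ interpretation assumes, pass family = "binomial" explicitly; a minimal sketch (the object name lr_logit_ir_cat is only illustrative):

# Logistic regression proper: coefficients are log odds ratios vs. the "0-8" reference bin
lr_logit_ir_cat <- glm(loan_status ~ ir_cat, family = "binomial", data = train_set)
exp(coef(lr_logit_ir_cat))  # odds ratios relative to the reference category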

Logistic regression with multiple variables

lr_model_multi <- glm(loan_status ~ age+ir_cat+grade+loan_amnt+annual_inc, family="binomial", data=train_set)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lr_model_multi)
## 
## Call:
## glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt + 
##     annual_inc, family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9727  -0.5361  -0.4380  -0.3344   3.3508  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.395e+00  1.279e-01 -18.720  < 2e-16 ***
## age           -5.375e-03  3.875e-03  -1.387 0.165423    
## ir_cat11-13.5  6.058e-01  1.323e-01   4.578 4.70e-06 ***
## ir_cat13.5+    5.256e-01  1.472e-01   3.570 0.000357 ***
## ir_cat8-11     3.798e-01  1.177e-01   3.228 0.001246 ** 
## ir_catMissing  3.771e-01  1.296e-01   2.910 0.003617 ** 
## gradeB         2.697e-01  1.061e-01   2.541 0.011040 *  
## gradeC         5.430e-01  1.213e-01   4.476 7.62e-06 ***
## gradeD         9.425e-01  1.376e-01   6.847 7.54e-12 ***
## gradeE         1.067e+00  1.669e-01   6.397 1.59e-10 ***
## gradeF         1.568e+00  2.275e-01   6.891 5.53e-12 ***
## gradeG         1.648e+00  3.713e-01   4.439 9.05e-06 ***
## loan_amnt     -1.885e-06  4.128e-06  -0.457 0.647917    
## annual_inc    -5.344e-06  7.404e-07  -7.218 5.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13489  on 19393  degrees of freedom
## Residual deviance: 12933  on 19380  degrees of freedom
## AIC: 12961
## 
## Number of Fisher Scoring iterations: 5

Also important is the statistical significance of a parameter estimate. Significance is usually reported as a p-value, which appears in the model output as Pr(>|z|) (or Pr(>|t|) for a gaussian fit). In the summary, mild significance is denoted by a "." and very strong significance by "***". When a parameter is not significant, you cannot conclude that it differs from 0, so in general it only makes sense to interpret the effect on default for significant parameters.
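
With the fitted model, the next step is usually to predict default probabilities for the test set. A minimal sketch using predict() with type = "response"; the cut-off of 0.15 is an arbitrary value chosen here for illustration only.

# Predicted probabilities of default for the test set
pd_test <- predict(lr_model_multi, newdata = test_set, type = "response")
summary(pd_test)

# Turn probabilities into a binary prediction with an (arbitrary) cut-off
pred_status <- ifelse(pd_test > 0.15, 1, 0)
table(test_set$loan_status, pred_status)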

Decision Tree

Gini-measure

The Gini measure is used to choose the best split for a tree node. The Gini impurity of a node is $\text{Gini}_{\text{node}} = 2 \times p_{\text{default}} \times p_{\text{non-default}}$, where $p_{\text{default}}$ and $p_{\text{non-default}}$ are the proportions of defaults and non-defaults in that node. The gain of a candidate split is $\text{Gain} = \text{Gini}_{\text{root}} - \text{prop}_{\text{left}} \times \text{Gini}_{\text{left}} - \text{prop}_{\text{right}} \times \text{Gini}_{\text{right}}$, where $\text{prop}_{\text{left}}$ and $\text{prop}_{\text{right}}$ are the fractions of cases ending up in the left and right leaf.
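
To make the gain formula concrete, a small helper function can compute it for a candidate split; the split of the training set on grade (A-C versus D-G) below is purely a hypothetical example.

# Gini impurity of a node given the number of defaults and non-defaults in it
gini <- function(n_default, n_nondefault) {
  n <- n_default + n_nondefault
  2 * (n_default / n) * (n_nondefault / n)
}

# Hypothetical split of the training set on grade (A-C vs. D-G)
left  <- train_set[train_set$grade %in% c("A", "B", "C"), ]
right <- train_set[!(train_set$grade %in% c("A", "B", "C")), ]

gini_root  <- gini(sum(train_set$loan_status == 1), sum(train_set$loan_status == 0))
gini_left  <- gini(sum(left$loan_status == 1),  sum(left$loan_status == 0))
gini_right <- gini(sum(right$loan_status == 1), sum(right$loan_status == 0))

# Gain = gini_root - prop_left * gini_left - prop_right * gini_right
gain <- gini_root -
  nrow(left)  / nrow(train_set) * gini_left -
  nrow(right) / nrow(train_set) * gini_right
gain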