# Databricks notebook source
# MAGIC %md
# MAGIC You may find this series of notebooks at https://github.com/databricks-industry-solutions/survival-analysis. For more information about this solution accelerator, visit https://www.databricks.com/solutions/accelerators/survival-analysis-for-churn-and-lifetime-value.
# COMMAND ----------
# MAGIC %md
# MAGIC <div style="text-align: left">
# MAGIC <img src="https://brysmiwasb.blob.core.windows.net/demos/images/ME_solution-accelerator.png" width="50%">
# MAGIC </div>
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ## Kaplan-Meier
# MAGIC ### In this lesson you:
# MAGIC * Fit Kaplan-Meier survival probability curves to IBM's Telco dataset.
# MAGIC * Visually assess survival probability curves at the population level and the covariate level.
# MAGIC * Use the log-rank test to determine whether survival curves are statistically equivalent.
# MAGIC * Extract survival probabilities for subsequent modeling.
# COMMAND ----------
# MAGIC %md
# MAGIC ## Overview
# MAGIC
# MAGIC * Kaplan-Meier is a statistical method used to construct survival probability curves. Because it takes censoring into account, it avoids the underestimation of survival that occurs when the mean or median tenure is computed as if customers who are still active had already churned.
# MAGIC * The log-rank test is a chi-square test that is used to test the null hypothesis that two or more survival curves are statistically equivalent.
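# COMMAND ----------
# MAGIC %md
# MAGIC The effect of censoring described above can be illustrated with a minimal, hand-rolled sketch of the Kaplan-Meier (product-limit) estimate. This is not how Lifelines is implemented, and the rest of this notebook uses Lifelines instead. The function `kaplan_meier` and the `toy_*` data are hypothetical and exist only for illustration. The sketch compares the Kaplan-Meier estimate of surviving beyond 8 months with a naive fraction that treats every tenure as a completed lifetime.
# COMMAND ----------
```python
# Illustrative sketch only: a hand-rolled Kaplan-Meier estimate on made-up data.

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with at least one event."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Customers who churned at t versus everyone whose record ends at t
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Multiply by the conditional probability of surviving time t
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Censored customers leave the risk set without lowering the curve
        at_risk -= leaving
    return curve

# Hypothetical tenures in months; False means the customer is still subscribed
# (censored), so only a lower bound on their lifetime is known.
toy_tenure = [2, 3, 3, 5, 8, 8, 12, 12]
toy_churned = [True, True, False, True, False, True, False, False]

toy_curve = kaplan_meier(toy_tenure, toy_churned)

# Naive view: treat every tenure as a completed lifetime
naive_beyond_8 = sum(1 for d in toy_tenure if d > 8) / len(toy_tenure)

print(toy_curve)
print(naive_beyond_8)
```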
# COMMAND ----------
# MAGIC %md
# MAGIC ## Configure the Environment
# MAGIC
# MAGIC * Install the Lifelines library from PyPI
# MAGIC * Import relevant libraries
# MAGIC * Load the telco silver table constructed in `01 Intro`
# MAGIC   * Dataset title: Telco Customer Churn
# MAGIC   * Dataset source URL: https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv
# MAGIC   * Dataset source description: GitHub repository managed by IBM
# MAGIC   * Dataset license: please see dataset source URL above
# COMMAND ----------
# MAGIC %run ./config/notebook_config
# COMMAND ----------
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times
from lifelines.statistics import pairwise_logrank_test
# COMMAND ----------
# Load the telco silver table
telco_pd = spark.table('silver_monthly_customers').toPandas()
# COMMAND ----------
# MAGIC %md
# MAGIC ## Fit the Kaplan-Meier Model
# MAGIC
# MAGIC The first step when using Lifelines for Kaplan-Meier is to fit the model. This step requires two parameters: tenure and churn.
# MAGIC * `Tenure:` the duration that a customer has been with the company (if still a subscriber) or was with the company prior to churning.
# MAGIC * `Churn:` a Boolean indicating whether the customer has churned (the event was observed) or is still a subscriber (the observation is censored).
# COMMAND ----------
# T holds the durations; C flags whether churn (the event) was observed
kmf = KaplanMeierFitter()
T = telco_pd['tenure']
C = telco_pd['churn'].astype(float)
kmf.fit(T, event_observed=C)
# COMMAND ----------
# MAGIC %md
# MAGIC ## Visually Assess the Population-Level Survival Curve
# MAGIC * `The Plot:` As seen in the plot below, the survival probability curve is plotted with time on the x-axis and survival probability on the y-axis.
# MAGIC
# MAGIC * `Interpretation:` The probability of a customer surviving **at least** 0 months is 100%. This is represented by the point (0,1.0) in the plot below. Moving down the survival curve to the median (34 months), a customer has a 50% probability of surviving **at least** 34 months. Please note that the curve is built up from conditional probabilities: the estimate at each month is the estimate for the previous month multiplied by the probability of surviving the current month, given survival up to that point.
# MAGIC
# MAGIC * `Confidence Intervals:` The light blue band surrounding the survival probability curve represents the confidence interval. The wider the interval, the lower the confidence. As seen in the plot below, confidence in the estimates decreases as the timeline increases. While this reduced confidence is likely due to having less data, it is also intuitive that we would have more confidence in our predictions for the near term than in our predictions for the longer term.
# COMMAND ----------
kmf.plot(title='Kaplan-Meier Survival Curve: Population level')
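# COMMAND ----------
# MAGIC %md
# MAGIC To see why the band widens as fewer customers remain at risk, here is a rough sketch of the standard error from Greenwood's formula, computed by hand on made-up data. Lifelines computes its own intervals and may use a different transformation, so these numbers need not match the plot. The function `km_with_greenwood` and the `toy_*` data are hypothetical.
# COMMAND ----------
```python
# Illustrative sketch only: Greenwood standard errors for a Kaplan-Meier curve.
import math

def km_with_greenwood(durations, observed):
    """Return [(time, survival, standard_error)] for each time with an event."""
    at_risk = len(durations)
    survival = 1.0
    variance_sum = 0.0
    rows = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        # Rows where everyone at risk churns are skipped in this sketch,
        # because the Greenwood term would divide by zero there.
        if events and events < at_risk:
            survival *= 1.0 - events / at_risk
            variance_sum += events / (at_risk * (at_risk - events))
            rows.append((t, survival, survival * math.sqrt(variance_sum)))
        at_risk -= leaving
    return rows

# Hypothetical tenures in months; False marks a censored (still active) customer
toy_tenure = [2, 3, 3, 5, 8, 8, 12, 12]
toy_churned = [True, True, False, True, False, True, False, False]

toy_rows = km_with_greenwood(toy_tenure, toy_churned)

# Approximate 95% intervals, clipped to the valid probability range
toy_intervals = [
    (t, max(0.0, s - 1.96 * se), min(1.0, s + 1.96 * se)) for t, s, se in toy_rows
]

for row in toy_rows:
    print(row)
```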
# COMMAND ----------
kmf.median_survival_time_
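# COMMAND ----------
# MAGIC %md
# MAGIC As a rough illustration of how a median can be read off a survival curve, the sketch below returns the first time at which the curve is at or below 0.5. The function `median_survival_time` and the curve `toy_curve` are hypothetical, and returning infinity when the curve never reaches 0.5 is a convention chosen for this sketch.
# COMMAND ----------
```python
# Illustrative sketch only: reading the median off a step survival curve.

def median_survival_time(curve):
    """Smallest time at which the survival probability is at or below 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return float("inf")  # the curve never drops to 0.5

# Hypothetical curve as (time in months, survival probability) pairs
toy_curve = [(2, 0.875), (3, 0.75), (5, 0.6), (8, 0.45)]

toy_median = median_survival_time(toy_curve)
print(toy_median)
```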
# COMMAND ----------
# MAGIC %md
# MAGIC ## Assess Survival Probabilities at the Covariate-Level
# MAGIC
# MAGIC * When viewing Kaplan-Meier curves at the covariate-level, it's ideal to see some level of divergence between the groups as this indicates a difference that can be used for predictions. `Online Security` is a good example of this.
# MAGIC
# MAGIC * For the purposes of prediction, it's less ideal to see the survival curves very close together, as is the case for the variable `Gender`. When survival curves are very close together, you may want to check whether they are statistically equivalent. This is the purpose of the log-rank test. The null hypothesis for the log-rank test states that the groups are statistically equivalent.
# MAGIC
# MAGIC * As seen below, the p-value in our log-rank test for `Gender` is greater than 0.05 and therefore, we cannot reject the null hypothesis that the two groups are statistically equivalent.
# MAGIC
# MAGIC * Although it's not useful for prediction to find out that the two groups are statistically equivalent, there are benefits from an inference standpoint. For example, suppose you have a new promotion that provides subscribers with free access to a streaming service. If you find out that those with the service are the same as those without the service, you may ask whether the return on providing free access is high enough to cover costs. In contrast, suppose you want to understand the effect of throttling. If you find that the effect is minimal, you may have cause to continue doing so.
# COMMAND ----------
# Helper function for plotting Kaplan-Meier curves at the covariate level
def plot_km(col):
    ax = plt.subplot(111)
    # Fit and draw one curve per distinct value of the covariate on shared axes
    for r in telco_pd[col].unique():
        ix = telco_pd[col] == r
        kmf.fit(T[ix], C[ix], label=r)
        kmf.plot(ax=ax)

# Helper function for returning log-rank test results
def print_logrank(col):
    # Arguments: durations, group labels, event indicator
    log_rank = pairwise_logrank_test(telco_pd['tenure'], telco_pd[col], telco_pd['churn'])
    return log_rank.summary
# COMMAND ----------
plot_km('gender')
# COMMAND ----------
print_logrank('gender')
# COMMAND ----------
plot_km('onlineSecurity')
# COMMAND ----------
print_logrank('onlineSecurity')
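# COMMAND ----------
# MAGIC %md
# MAGIC For intuition about what the log-rank test computes, here is a minimal two-group sketch written by hand. At each event time it compares the observed number of events in one group with the number expected if both groups shared the same survival curve, then refers the resulting statistic to a chi-square distribution with one degree of freedom. The function `logrank_two_groups` and the toy data are hypothetical; the notebook's own results come from Lifelines, and the sketch assumes both groups contain events.
# COMMAND ----------
```python
# Illustrative sketch only: a hand-rolled two-group log-rank test.
import math

def logrank_two_groups(durations_a, observed_a, durations_b, observed_b):
    """Return (chi-square statistic, p-value) for two groups."""
    event_times = sorted(
        {d for d, o in zip(durations_a, observed_a) if o}
        | {d for d, o in zip(durations_b, observed_b) if o}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers still at risk just before t, and events exactly at t
        n_a = sum(1 for d in durations_a if d >= t)
        n_b = sum(1 for d in durations_b if d >= t)
        d_a = sum(1 for d, o in zip(durations_a, observed_a) if d == t and o)
        d_b = sum(1 for d, o in zip(durations_b, observed_b) if d == t and o)
        n, d = n_a + n_b, d_a + d_b
        # Observed minus expected events in group A under the null hypothesis
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical groups in which every customer's churn was observed
short_lived = [1, 2, 3, 4, 5]
long_lived = [10, 11, 12, 13, 14]
all_observed = [True] * 5

stat_diff, p_diff = logrank_two_groups(short_lived, all_observed, long_lived, all_observed)
stat_same, p_same = logrank_two_groups(short_lived, all_observed, short_lived, all_observed)

print(stat_diff, p_diff)
print(stat_same, p_same)
```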
# COMMAND ----------
# MAGIC %md
# MAGIC ## Assess Survival Probabilities at the Covariate-Level (continued)
# MAGIC
# MAGIC * For convenience, Kaplan-Meier curves and log-rank tests are provided below for each of the categorical covariates in the dataset.
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Gender
# COMMAND ----------
plot_km('gender')
# COMMAND ----------
print_logrank('gender')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Senior Citizen
# COMMAND ----------
plot_km('seniorCitizen')
# COMMAND ----------
print_logrank('seniorCitizen')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Partner
# COMMAND ----------
plot_km('partner')
# COMMAND ----------
print_logrank('partner')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Dependents
# COMMAND ----------
plot_km('dependents')
# COMMAND ----------
print_logrank('dependents')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Phone Service
# COMMAND ----------
plot_km('phoneService')
# COMMAND ----------
print_logrank('phoneService')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Multiple Lines
# COMMAND ----------
plot_km('multipleLines')
# COMMAND ----------
print_logrank('multipleLines')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Internet Service
# COMMAND ----------
plot_km('internetService')
# COMMAND ----------
print_logrank('internetService')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Streaming TV
# COMMAND ----------
plot_km('streamingTV')
# COMMAND ----------
print_logrank('streamingTV')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Streaming Movies
# COMMAND ----------
plot_km('streamingMovies')
# COMMAND ----------
print_logrank('streamingMovies')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Online Security
# COMMAND ----------
plot_km('onlineSecurity')
# COMMAND ----------
print_logrank('onlineSecurity')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Online Backup
# COMMAND ----------
plot_km('onlineBackup')
# COMMAND ----------
print_logrank('onlineBackup')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Device Protection
# COMMAND ----------
plot_km('deviceProtection')
# COMMAND ----------
print_logrank('deviceProtection')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Tech Support
# COMMAND ----------
plot_km('techSupport')
# COMMAND ----------
print_logrank('techSupport')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Paperless Billing
# COMMAND ----------
plot_km('paperlessBilling')
# COMMAND ----------
print_logrank('paperlessBilling')
# COMMAND ----------
# MAGIC %md
# MAGIC ##### Payment Method
# COMMAND ----------
plot_km('paymentMethod')
# COMMAND ----------
print_logrank('paymentMethod')
# COMMAND ----------
# MAGIC %md
# MAGIC ## Extract Survival Probabilities
# MAGIC
# MAGIC * After you complete your analysis, you may want to extract the survival probabilities to use for another application. Lifelines makes this quite easy to do.
# MAGIC
# MAGIC * Please note that later on, we will extract the output from a Cox Proportional Hazards model and use it as an input for a Customer Lifetime Value dashboard.
# COMMAND ----------
def get_survival_probs(col, val):
    # Refit the shared Kaplan-Meier fitter on the rows where `col` equals `val`
    ix = telco_pd[col] == val
    return kmf.fit(T[ix], C[ix], label=val)
# COMMAND ----------
# Extract survival probabilities for customers with internetService = DSL
sp_internet_dsl = get_survival_probs('internetService','DSL')
# COMMAND ----------
pd.DataFrame(sp_internet_dsl.survival_function_at_times(range(0,10)))
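# COMMAND ----------
# MAGIC %md
# MAGIC A Kaplan-Meier curve is a step function, so evaluating it at an arbitrary time means taking the value at the most recent event time. The sketch below shows one way such a lookup could work on a hand-written curve. The function `survival_at` and the curve `toy_curve` are hypothetical and are not part of Lifelines.
# COMMAND ----------
```python
# Illustrative sketch only: evaluating a step survival curve at chosen times.
from bisect import bisect_right

def survival_at(curve, t):
    """Value of a right-continuous step survival curve at time t."""
    times = [time for time, _ in curve]
    idx = bisect_right(times, t)
    # Before the first event time, nobody has churned yet
    return 1.0 if idx == 0 else curve[idx - 1][1]

# Hypothetical curve as (time in months, survival probability) pairs
toy_curve = [(2, 0.875), (3, 0.75), (5, 0.6), (8, 0.45)]

toy_lookup = {t: survival_at(toy_curve, t) for t in range(0, 10)}
print(toy_lookup)
```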
# COMMAND ----------
# MAGIC %md
# MAGIC ## Closing Thoughts
# MAGIC
# MAGIC * Kaplan-Meier is extremely useful for univariate analysis as it helps establish more intuition for the data at hand. Cox Proportional Hazards, which we cover in the next notebook, is more useful for multivariate analysis.
# MAGIC
# MAGIC * In the plots above, you may notice that the survival curves for a given covariate cross over each other. This is a red flag that we will discuss in the next notebook.
# MAGIC
# MAGIC * As you'll soon see, Kaplan-Meier curves have some interesting applications that make them useful even when leveraging different techniques.
# COMMAND ----------
# MAGIC %md
# MAGIC Copyright Databricks, Inc. [2021]. The source in this notebook is provided subject to the [Databricks License](https://databricks.com/db-license-source). All included or referenced third party libraries are subject to the licenses set forth below.
# MAGIC
# MAGIC |Library Name|Library License|Library License URL|Library Source URL|
# MAGIC |---|---|---|---|
# MAGIC |Lifelines|MIT License|https://github.com/CamDavidsonPilon/lifelines/blob/master/LICENSE|https://github.com/CamDavidsonPilon/lifelines|
# MAGIC |Matplotlib|Python Software Foundation (PSF) License |https://matplotlib.org/stable/users/license.html|https://github.com/matplotlib/matplotlib|
# MAGIC |Numpy|BSD-3-Clause License|https://github.com/numpy/numpy/blob/master/LICENSE.txt|https://github.com/numpy/numpy|
# MAGIC |Pandas|BSD 3-Clause License|https://github.com/pandas-dev/pandas/blob/master/LICENSE|https://github.com/pandas-dev/pandas|
# MAGIC |Python|Python Software Foundation (PSF) |https://github.com/python/cpython/blob/master/LICENSE|https://github.com/python/cpython|
# MAGIC |Seaborn|BSD-3-Clause License|https://github.com/mwaskom/seaborn/blob/master/LICENSE|https://github.com/mwaskom/seaborn|
# MAGIC |Spark|Apache-2.0 License |https://github.com/apache/spark/blob/master/LICENSE|https://github.com/apache/spark|