---
title: "Applications of R in Healthcare"
author: "Thanh Phan"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
github-repo: GNtem2/HealthcareRbook
description: "This is a book describing the use of R in Healthcare. It is aimed at beginners in the R language. The majority of the examples are taken from work in Neurology. Where possible, data from other diseases such as heart disease, cancer and vaccines are used. The ideas and principles can be applied to other aspects of Healthcare. The output format for this example is bookdown::gitbook."
---
# Preamble
Placeholder
<!--chapter:end:index.Rmd-->
# Introduction to R {#intro}
Placeholder
## Plot using base R
## ggplot2
### Histogram
### Bar plot
### Pie chart
### Scatter plot
### Arrange plots in grids
### Line plot
### Facet wrap
### Polygons
### Gantt chart
### Heatmap
## ggplot2 extra
### Alluvial and Sankey diagram
### Survival plot
### ggraph and tidygraph
### ggparty-decision tree
### ggROC- ROC curve
### Map
#### Thematic map
#### tmap
#### Voronoi
### ggwordcloud
### gganimate
### ggneuro
## plotly
### Scatter plot with plotly
### Bar plot with plotly
### Heatmap
### map
<!--chapter:end:01-intro.Rmd-->
# Data Wrangling
Placeholder
## Data
### Vector, Arrays, Matrix
#### Vector and list
#### Matrix and Arrays
### apply, lapply, sapply
#### apply
#### lapply
#### sapply
#### tapply
### Simple function
### for loop
### Functional
#### Mapply
## Data storage
### Data frame
### Excel data
#### Date and time
### Foreign data
### json format
## Tidy data
### Factors
### Multiple files
### Pivot
## Regular Expressions
### base R
### stringr
## PDF to Excel
### Scanned text or picture
## Web scraping
## Manipulating Medical Images
### DICOM and nifti format
### Manipulating array of medical images
### Combining arrays
### Math operation on multidimensional array
### Math operation on list
### Vectorising nifti object
### tar file
### Image registration
### Rescaling
### MNI template
#### Atlases
#### AAL atlas
### Add text
#### Eve template
#### Sensorimotor tract atlas
<!--chapter:end:02-Data-Wrangling.Rmd-->
# Statistics
Placeholder
## Univariable analyses
### Parametric tests
### Non-parametric tests
## Regression
### Brief review of matrix
### Linear (least square) regression
### Logistic regression
#### Discrimination and Calibration
#### Measuring Improvement in Regression Models
#### Shapley value
#### ICE
##### Interaction
#### Propensity matching
## Special types of regression
### Ordinal regression
### Survival analysis
### Quantile regression
##### Non-negative regression
### Poisson regression
### Conditional logistic regression
### Multinomial modelling
## Sample size estimation
### Proportion
#### Non-inferiority
### Logistic regression
### Survival studies
### Multiple regression
## Interpreting clinical trials
### NNT from ARR
### NNT from odds ratio
### NNT from hazard ratio
### NNT from metaanalysis
## Metaanalysis
### PRISMA
### Conversion of mean and median
### Inconsistency I2
### Metaanalysis of proportion
### Bivariate Metaanalysis
### Metaanalysis of clinical trials
### Metaregression
#### Summary Positive and Negative Likelihood Ratio
## Data simulation
<!--chapter:end:03-Statistics.Rmd-->
# Multivariate Analysis
The following section illustrates the different methods in multivariate analyses. These methods are not to be confused with the multivariable analyses discussed under Statistics.
## Multivariate regression
Multivariable and multivariate regression are often used interchangeably. Some use the term multivariate when there is more than one dependent variable. Multivariable regression refers to the linear, logistic or survival analyses of the previous chapter. Multivariate regression refers to nested models, longitudinal models or the more complex types of analyses described below.
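As a minimal illustration of the distinction, base R's `lm()` fits a multivariate regression when given a matrix of outcomes. The data below are simulated and the variable names are hypothetical.

```{r 04-multivariate-lm-sketch, warning=F}
#multivariate regression sketch: two outcomes, simulated data
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- 1 + 2 * x1 - x2 + rnorm(n) #first outcome
y2 <- -1 + 0.5 * x1 + rnorm(n)   #second outcome
#lm() accepts a matrix of outcomes via cbind(); one set of
#coefficients is estimated for each outcome column
fit <- lm(cbind(y1, y2) ~ x1 + x2)
coef(fit)
```

Each column of `coef(fit)` holds the coefficients for one outcome, which is what "more than one dependent variable" means in practice.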
### Penalised regression
We used penalised logistic regression (PLR) to assess the relationship between the ASPECTS regions and stroke disability (binary outcome) [@pmid23838753]. PLR can be conceptualised as a modification of logistic regression. In logistic regression there is no algebraic solution for the parameter estimates (β coefficients), so a numerical method such as maximum likelihood estimation is used. In certain situations, such as collinearity (relatedness) among the predictors, the maximum likelihood method may overfit the model. To circumvent this, a bias factor is introduced into the calculation to prevent overfitting. The tuning (regularisation) parameter for the bias factor is chosen from the quadratic of the norms of the parameter estimates. This method is known as PLR; it also allows handling of a large number of interaction terms in the model.
We employed a forward and backward stepwise PLR over all the ASPECTS regions, calling on the penalized function in the R programming environment. The program assessed the interaction of factors in the regression model as follows. The choice of factors to be added to or deleted from the stepwise regression was based on the cost complexity statistic. The asymmetric hierarchy principle was used to determine the choice of interactions: any factor retained in the model can form interactions with others already in the model and with those not yet in the model. In this analysis we specified a maximum of 5 terms to be added in the selection procedure. The significance of the interactions was plotted using a previously described method.
We regressed the dichotomised mRS score against the ASPECTS regions, demographic variables (such as age and sex), physiological variables (such as blood pressure and serum glucose level) and treatment (rt-PA). The results are expressed as β coefficients rather than as odds ratios, for consistency in the presence of interaction terms.
```{r 04-multivariate-analysis-2, warning=F}
library(mice)
data("BreastCancer",package = "mlbench")
colnames(BreastCancer)
#check for duplicates
sum(duplicated(BreastCancer))
#remove duplicates
#keep Id to avoid creation of new duplicates
#BreastCancer1<-unique(BreastCancer) #reduce 699 to 691 rows
#impute missing data
#m is number of multiple imputation, default is 5
#output is a list
imputed_Data <- mice(BreastCancer, m=5, maxit = 5, method = 'pmm', seed = 500)
#choose among the 5 imputed dataset
completeData <- complete(imputed_Data,2)
#convert multiple columns to numeric
#lapply output a list
BreastCancer2<-lapply(completeData[,-c(11)], as.numeric) #list
BreastCancer2<-as.data.frame(BreastCancer2)
BreastCancer2$Class<-BreastCancer$Class
#convert factor to numeric for calculation of VIF
BreastCancer2$Class<-as.character(BreastCancer2$Class)
BreastCancer2$Class[BreastCancer2$Class=="benign"]<-0
BreastCancer2$Class[BreastCancer2$Class=="malignant"]<-1
BreastCancer2$Class<-as.numeric(BreastCancer2$Class)
BC <- unique(BreastCancer2) # Remove duplicates
#check correlation
library(ggcorrplot)
ggcorrplot(cor(BC), p.mat = cor_pmat(BC), hc.order = TRUE, type = "lower",
           colors = c("red", "white", "blue"), tl.cex = 8)
```
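The bias factor described above can be sketched in a few lines of base R. The following is a toy ridge-penalised logistic regression fitted by Newton iterations on simulated collinear data; it illustrates the penalty idea only, and is not the stepwise PLR of the penalized package used in the text.

```{r 04-plr-sketch, warning=F}
#ridge-penalised logistic regression by Newton iterations (toy sketch)
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  #nearly collinear predictor
X <- cbind(1, x1, x2)          #design matrix with intercept
y <- rbinom(n, 1, plogis(1 + x1))
lambda <- 1                    #tuning (regularisation) parameter
P <- diag(c(0, 1, 1)) * lambda #do not penalise the intercept
beta <- rep(0, ncol(X))
for (i in 1:25) {
  p <- plogis(X %*% beta)
  W <- diag(as.vector(p * (1 - p)))
  #Newton step on the penalised log-likelihood
  beta <- solve(t(X) %*% W %*% X + P,
                t(X) %*% W %*% X %*% beta + t(X) %*% (y - p))
}
round(as.vector(beta), 2)
```

The penalty matrix `P` shrinks the collinear coefficients towards zero, which is what stabilises the estimates when maximum likelihood alone would overfit.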
### MARS
Multivariate adaptive regression spline (MARS) is a non-linear regression method that fits a set of splines (hinge functions) to each of the predictor variables, i.e. a different hinge function for each variable [@pmid8548103]. As such, the method can be used to plot the relationship between each variable and the outcome. Used in this way, the presence of any threshold effect on the predictors can be visualised graphically. The MARS method is implemented in the R programming environment in the _earth_ package.
```{r 04-multivariate-analysis-3, warning=F}
library(earth)
BC<-BC[-1]
Fit <- earth(Class ~ ., data = BC,
             nfold = 10, ncross = 30, varmod.method = "none",
             glm = list(family = binomial))
plotmo(Fit)
summary(Fit)
```
### Mixed modelling
In a standard regression analysis the residual error is assumed to be the only source of random variability. Mixed models assume that there is more than one source of random variability in the data, expressed in terms of fixed and random effects. Mixed modelling is a useful technique for handling multilevel or grouped data. The intraclass correlation (ICC) is used to determine whether a multilevel analysis is necessary, i.e. whether the infarct volume varies among surgeons or not. The ICC is the ratio of the between-group variance to the total variance. If the ICC approaches zero then a simple regression model would suffice.
There are several R packages for performing mixed modelling, such as _lme4_. Mixed modelling in meta-regression is illustrated in the section on Metaanalysis. An example of a mixed model using a Bayesian approach with INLA is provided in the [Bayesian section][INLA, Stan and BUGS].
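The ICC can be computed from one-way ANOVA variance components in base R. The data below are simulated, and the surgeon/infarct-volume framing is for illustration only.

```{r 04-icc-sketch, warning=F}
#ICC from one-way ANOVA variance components (simulated data)
set.seed(7)
k <- 10; m <- 20                       #10 surgeons, 20 cases each
group <- factor(rep(1:k, each = m))
effect <- rnorm(k, sd = 2)             #between-surgeon variability
y <- 50 + effect[group] + rnorm(k * m, sd = 4)
aov.tab <- anova(aov(y ~ group))
msb <- aov.tab["group", "Mean Sq"]     #between-group mean square
msw <- aov.tab["Residuals", "Mean Sq"] #within-group mean square
icc <- (msb - msw) / (msb + (m - 1) * msw)
icc #between-group variance as a share of total variance
```

An ICC near zero would indicate that a simple regression model suffices; a substantial ICC argues for a multilevel model.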
#### Random intercept model
In a random intercept (fixed slope) multilevel model the slopes of the fitted lines are assumed to be parallel to each other while the intercept varies across groups. This can be the case for the same treatment effect in animal experiments performed by different technicians, or the same treatment in different clusters of hospitals. There are several approaches to fitting a random intercept model, and the choice depends on the reason for performing the analysis. For example, the maximum likelihood estimation (MLE) method is better than restricted maximum likelihood (RMLE) in that it generates estimates for fixed effects and allows model comparison. RMLE is preferred if there are outliers.
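A minimal sketch of a random intercept model, using the _nlme_ package that ships with R; the data are simulated and the variable names hypothetical. With _lme4_ the equivalent call would be `lmer(y ~ x + (1 | group))`.

```{r 04-random-intercept-sketch, warning=F}
library(nlme)
#simulated grouped data: 8 groups of 15 observations
set.seed(11)
k <- 8; m <- 15
d <- data.frame(group = factor(rep(1:k, each = m)),
                x = rnorm(k * m))
d$y <- 2 + rnorm(k, sd = 1.5)[d$group] + 0.5 * d$x + rnorm(k * m)
#random intercept (fixed slope) model; method = "ML" allows model
#comparison, method = "REML" (the default) for final estimates
fit <- lme(y ~ x, random = ~ 1 | group, data = d, method = "REML")
fixef(fit) #fixed effects: common intercept and slope
```

`random = ~ 1 | group` is what makes the intercept vary by group while a single slope is shared by all groups.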
#### Random slope model
In a random slope model the slopes are allowed to vary across groups, so the fitted lines are no longer parallel.
### Trajectory modelling
Trajectory analysis attempts to group the behaviour of the subjects of interest over time. There are several approaches: the data can be analysed in raw form or after orthogonal transformation by principal component analysis. Trajectory analysis differs from mixed modelling in that it examines group behaviour. The output of a trajectory analysis is only the beginning of the modelling process. For example, the analysis may identify 3 groups, labelled A, B and C; the next step would be to use these group memberships in a modelling analysis of your choice.
A useful library for performing trajectory analysis is _akmedoids_, which anchors the analysis around the median value and requires the data in long format. The _traj_ library is similar to the one in _Stata_; it uses several steps, including factor and cluster analyses, to identify groups, and prefers data in wide format.
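The grouping idea can be sketched with base R's `kmeans()` on wide-format trajectories (one row per subject, one column per time point). This is a stand-in for the _akmedoids_ / _traj_ algorithms, not a reimplementation of them, and the data are simulated.

```{r 04-trajectory-sketch, warning=F}
#k-means on wide-format trajectories (toy sketch)
set.seed(3)
tt <- 0:5
traj <- rbind(
  t(replicate(20, 10 + 2 * tt + rnorm(6))), #improving group
  t(replicate(20, 10 - 1 * tt + rnorm(6))), #worsening group
  t(replicate(20, 10 + 0 * tt + rnorm(6)))  #stable group
)
cl <- kmeans(traj, centers = 3, nstart = 25)
table(cl$cluster) #group sizes; the labels themselves are arbitrary
matplot(tt, t(cl$centers), type = "l", lty = 1,
        xlab = "time", ylab = "score", main = "Cluster mean trajectories")
```

As in the text, the cluster labels are only the beginning: the group memberships in `cl$cluster` would then feed into a model of your choice.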
### Generalized estimating equation (GEE)
GEE is used for the analysis of longitudinal or clustered data. GEE is preferred when the aim is to estimate the group effect or population average (marginal) log odds [@pmid20220526]. This is in contrast to the mixed model approach, which evaluates the average subject via maximum likelihood estimation. Model fitting for mixed models is complex compared to GEE and can break down. Libraries for performing GEE include _gee_ and _geepack_.
```{r 04-multivariate-analysis-4, warning=F}
library(tidyverse)
library(gee)
#open simulated data from previous chapter
dtTime <- read.csv("./Data-Use/dtTime_simulated.csv") %>%
  rename(NIHSS = Y)
fit <- gee(ENI ~ T + Diabetes + NIHSS,
           id = id,
           corstr = "unstructured",
           data = dtTime)
```
## Principal component analysis
Principal component analysis (PCA) is a data dimension reduction method which can be applied to a large dataset to determine the latent variables (principal components) that best represent the data. A brief description is given here; a more detailed account can be found in the review [@pmid10703049]. The usual approach to PCA involves eigen analysis of a covariance matrix or singular value decomposition.
PCA estimates an orthogonal (variance maximising) transformation to convert a set of observations of correlated variables into a set of values of uncorrelated (orthogonal) variables called principal components. The first principal component aligns in the direction that contains most of the variance of the observed variables. The second principal component is orthogonal to the first and captures the second largest share of the variance, the third captures the next largest share, and so on. The later principal components are likely to represent noise and are discarded. Expressing this in terms of our imaging data, each component yields a linear combination of ‘ischemic’ voxels that covary with each other. These components can be interpreted as patterns of ischemic injury. The unit of measurement in PCA images is the covariance of the data.
In the case of MR images each voxel is a variable, leading to tens of thousands of variables with relatively small numbers of samples, so specialised methods are required to compute the principal components.
Based on the cosine rule, principal components from different data are similar if values approach 1 and dissimilar if values approach 0 [@pmid22551679].
PCA may not work well if there are non-linear relationships in the data.
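Before the imaging example, the mechanics can be seen with base R's `prcomp()` on a small built-in dataset; with images, each voxel would be one column instead.

```{r 04-prcomp-sketch, warning=F}
#PCA on a small numeric dataset with prcomp()
pca <- prcomp(USArrests, scale. = TRUE) #scale variables to unit variance
summary(pca)       #proportion of variance per component
head(pca$x[, 1:2]) #scores on the first two principal components
#components are orthogonal: correlations between scores are ~0
round(cor(pca$x), 2)
```

The `summary()` output shows the decreasing share of variance per component, which is the basis for discarding the later, noise-dominated components.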
```{r PCA}
library(oro.nifti)
library(abind)
library(CHNOSZ) # for working with arrays
library(RNiftyReg)
#create a list using pattern matching
mca.list <- list.files(path = "./Ext-Data/", pattern = "\\.nii$", full.names = TRUE)
#length of list
length(mca.list)
#read multiple files using lapply function
#use lappy to read in the nifti files
#note lapply returns a list
mca.list.nii <- lapply(mca.list, readNIfTI)
#convert to list of array
mca.list.array<-lapply(mca.list.nii, img_data)
#convert to array
m.listarray<-list2array(mca.list.array)
library(mand)
fit <- msma(m.listarray, comp = 2)
#https://rdrr.io/cran/mand/f/vignettes/a_overview.Rmd
#https://rdrr.io/cran/caret/man/plsda.html
```
## Independent component analysis
Independent component analysis is different from PCA in that it seeks components which are statistically independent.
## Partial least squares
There are several versions of partial least squares (PLS). A detailed mathematical exposition of the PLS-PLR technique used here can be found in the paper by Fort and Lambert-Lacroix [@pmid15531609]. For the purposes of exposition we describe the individual components of the method. PLS is a multiple regression method suited to datasets comprising a large set of independent predictor variables (voxels in an image) and a smaller set of dependent variables (neurological outcome scores). Each voxel can take a value of 1 (representing involvement by infarction) or 0 (representing absence of involvement) in the MR image of each patient. PLS employs a data reduction method which generates latent variables: linear combinations of the independent and dependent variables that explain as much of their covariance as possible.
Linear least squares regression of the latent variables produces coefficients or beta weights for the latent variables at each voxel location in the brain in stereotaxic coordinate space [@pmid19660556].
The colon dataset, containing microarray data, comes with the _plsgenomics_ library [@pmid28968879]. The analysis involves partitioning the data into training and test sets. The class labels are in the Y column. This example is provided by the _plsgenomics_ library.
```{r 04-multivariate-analysis-5, warning=F}
library(plsgenomics)
data("Colon")
class(Colon) #list
#62 samples 2000 genes
#Outcome is in Y column as 1 and 2. 62 rows
#2000 gene names
dim(Colon$X)
#heatmap
matrix.heatmap(cbind(Colon$X, Colon$Y))
IndexLearn <- c(sample(which(Colon$Y==2),12),sample(which(Colon$Y==1),8))
Xtrain <- Colon$X[IndexLearn,]
Ytrain <- Colon$Y[IndexLearn]
Xtest <- Colon$X[-IndexLearn,]
# preprocess data
resP <- preprocess(Xtrain = Xtrain, Xtest = Xtest, Threshold = c(100, 16000),
                   Filtering = c(5, 500), log10.scale = TRUE, row.stand = TRUE)
# Determine optimum h and lambda
hlam <- gsim.cv(Xtrain = resP$pXtrain, Ytrain = Ytrain, hARange = c(7, 20),
                LambdaRange = c(0.1, 1), hB = NULL)
# perform prediction by GSIM
# lambda is the ridge regularization parameter from the cross validation
res <- gsim(Xtrain = resP$pXtrain, Ytrain = Ytrain, Xtest = resP$pXtest,
            Lambda = hlam$Lambda, hA = hlam$hA, hB = NULL)
res$Cvg
#difference between predicted and observed
sum(res$Ytest!=Colon$Y[-IndexLearn])
```
## Causal inference
<!--chapter:end:04-multivariate-analysis.Rmd-->
# Machine learning
Placeholder
## Decision tree analysis
### Information theory driven
### Conditional decision tree
### criticisms of decision tree
## Random Forest
### Random survival forest
## Gradient Boost Machine
### Extreme gradient boost machine
## KNN
## Support vector machine
### Survival analysis using random forest
## Non-negative matrix factorisation
## Formal concept analysis
## Evolutionary Algorithm
### Simulated Annealing
### Genetic Algorithm
## Manifold learning
### t-distributed Stochastic Neighbour Embedding
### Self organising map
## Deep learning
### Multilayer Perceptron
### CNN
### RNN
### Reinforcement learning
<!--chapter:end:05-machinelearning.Rmd-->
# Machine Learning Part 2
Placeholder
## Bag of words
### TFIDF
### Extracting data from web
## Wordcloud
## Bigram analysis
## Trigram
## Topic modeling or thematic analysis
### Probabilistic topic model
### NMF
<!--chapter:end:06-machinelearningpt2.Rmd-->
# Bayesian Analysis
Placeholder
## Bayesian belief
### Conditional probability
#### Bayes Rule
#### Conditional independence
## Markov model
## INLA, Stan and BUGS
### Linear regression
### Logistic regression
### Mixed model
### Bayesian Metaanalysis
### Cost
<!--chapter:end:07-Bayesian-analysis.Rmd-->
# Operational Research
Placeholder
## Queueing theory
## Discrete Event Simulations
### Simulate capacity of system
### Queuing network
## Linear Programming
## Forecasting
### Bed requirement
### Length of stay
### Customer churns
## Process mapping
## Supply chains
## Health economics
### Cost
<!--chapter:end:08-operational-research.Rmd-->
# Graph Theory
Placeholder
## Special graphs
### Laplacian matrix
### Bimodal (bipartite) graph
## Centrality Measures
### Local centrality measures
### Global centrality measures
#### Page Rank
## Community
## Visualising graph
### Visnetwork
### Large graph
## Social Media and Network Analysis
### Twitter
### Youtube
<!--chapter:end:09-graph-theory.Rmd-->
# Geospatial analysis
Placeholder
## Geocoding
### OpenStreetMap
### Google Maps API
## Sp and sf objects
## Thematic map
### Calculate distance to Hospital-OpenStreetMap
## Spatial regression
### New York COVID-19 mortality
### Danish Stroke Registry
### INLA
### Stan
## Machine learning in spatial analysis
## Spatio-temporal regression
<!--chapter:end:10-geospatial-analysis.Rmd-->
# App
Placeholder
## Brief introduction to Shiny app
<!--chapter:end:11-App.Rmd-->
# Appendix
Placeholder
## Brief introduction to Matrix
## Regression
### Linear regression
#### Least squares
#### Collinearity
#### Weighted least squares
### Logistic regression
<!--chapter:end:12-Appendix.Rmd-->
`r if (knitr:::is_html_output()) '
# References {-}
'`
<!--chapter:end:13-references.Rmd-->