# Statistics
This section is not intended as a textbook on statistics. Rather, it demonstrates statistical approaches that can be used, including regression and sample size estimation, with R code provided.
## Univariable analyses
### Parametric tests
The t-test is the workhorse for comparing whether 2 samples have the same mean. Performing a t-test in R requires data in 2 columns: one containing the variable for comparison and one labelling the group. There are different forms of t-test depending on whether the two samples are paired or unpaired. In general, the statistic takes the form $t=\frac{\bar{x}_1 - \bar{x}_2}{SE}$, where $SE$ is the standard error of the difference in means. It is recommended to check the distribution of the data using a histogram. For this exercise, we will use the simulated data from ECR trials. The grouping variable is the trial assignment.
```{r 03-Statistics-1}
#comparison of early neurological recovery (ENI) by trial (T)
dtTrial<-read.csv("./Data-Use/dtTrial_simulated.csv")
t.test(dtTrial$ENI~dtTrial$T)
```
### Non-parametric tests
Chi-squared and Fisher exact tests can be performed after using the _table_ function to set up the count data as a 2 x 2 contingency table (confusion matrix). The Chi-squared statistic takes on a familiar form: $\chi^2=\sum\frac{(observed-expected)^2}{expected}$. In this example we will use the data above.
```{r 03-Statistics-1-1}
table(dtTrial$HT,dtTrial$T)
chisq.test(dtTrial$HT,dtTrial$T)
```
The Wilcoxon rank sum test is performed with continuous data organised in the same way as for the t-test. There are several approaches to performing the Wilcoxon rank sum test; the base `wilcox.test` is shown below, while the _coin_ package (via its `wilcox_test` function) handles ties exactly.
```{r 03-Statistics-2}
library(coin)
wilcox.test(ENI~T, data=dtTrial)
```
## Regression
There are many different forms of regression. A key assumption is that the predictors are independent of each other. This issue is expanded on later in the section on collinearity. Special methods are required when the predictors are collinear.
### Brief review of matrix
A vector is an ordered array in one dimension. A matrix is an ordered array in 2 dimensions. A tensor is an ordered array in 3 (or more) dimensions.
A matrix in which the columns are linearly related is said to be rank deficient. The rank of a matrix is the number of linearly independent columns of that matrix. Given that row rank and column rank are equivalent, the rank deficiency of a matrix is the difference between the lesser of the number of rows and columns, and the rank of the matrix.
In a matrix of rank 1, every column is a scalar multiple of a single column.
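The rank of a matrix can be checked in base R via the QR decomposition; a minimal sketch (the matrix here is illustrative only):

```{r 03-Statistics-rank-sketch}
#the third column is the sum of the first two, so the matrix is
#rank deficient: rank 2 rather than 3
A <- cbind(c(1, 2, 3), c(4, 5, 6), c(5, 7, 9))
qr(A)$rank
#rank deficiency = min(nrow, ncol) - rank
min(dim(A)) - qr(A)$rank
```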
### Linear (least square) regression
Least squares regression uses the geometric properties of Euclidean space to identify the line of best fit. The sum of squared errors is $SS_{res}=\sum(observed-expected)^2$. The $R^2$ is a measure of the fit of the model and is given by $1-\frac{SS_{res}}{SS_{total}}$. A low $R^2$ indicates a poorly fitted model and a high $R^2$ indicates an excellent fit. The assumption here is that the outcome variable is a continuous variable.
```{r 03-Statistics-3, warning=F}
library(ggplot2)
load("./Data-Use/world_stroke.Rda")
ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
  geom_smooth(method="lm", aes(group=Income, linetype=Income))+geom_point()+xlab("Life Expectancy")
```
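As a check on the formula above, $R^2$ can be recomputed from the sums of squares by hand; a minimal sketch using the built-in `cars` data:

```{r 03-Statistics-r2-sketch}
#R^2 from sums of squares: 1 - SSres/SStot
fit <- lm(dist ~ speed, data = cars)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
1 - ss_res/ss_tot #matches summary(fit)$r.squared
```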
### Logistic regression
For outcomes that are binary in nature, such as yes or no, least squares regression is not appropriate. There is no closed-form solution for this analysis, and a numerical maximum likelihood approach is needed. When examining the results of logistic regression one is often enchanted by a large odds ratio. It is important to look at the metrics of model calibration (discussed below). A clue to a poorly calibrated model is a wide confidence interval for the odds ratio.
```{r 03-Statistics-4, warning=F}
#glm
data("BreastCancer",package = "mlbench")
#remove id column and column with NA to feed into iml later
BreastCancer2<-lapply(BreastCancer[,-c(1,7)], as.numeric)
BreastCancer2<-as.data.frame(BreastCancer2)
#recode the outcome to 0/1 so that a binomial (logistic) glm can be fitted
BreastCancer2$Class<-as.numeric(BreastCancer$Class=="malignant")
DCa<-glm(Class~., data=BreastCancer2, family=binomial)
summary(DCa)
```
### Discrimination and Calibration
A high $R^2$ suggests that the linear regression model is well calibrated. This metric is often not displayed but should be sought when interpreting the data.
The area under the receiver operating characteristic curve (AUC) is used to assess how well a model discriminates between those who have the disease of interest and those who do not. An AUC of 0.5 is classified as no better than chance; 0.8 to 0.89 provides good discrimination, and 0.9 to 1.0 provides outstanding discrimination. This rule of thumb reflects the language authors use to describe the AUC in the literature. Discrimination is not synonymous with calibration: it is possible to have a model with high discrimination but poor calibration [@pmid1738016]. The AUC is similar to Harrell's c-index, but the interpretation of a difference in AUC or c-index between models is not straightforward; a difference of 0.1 in AUC does not translate simply into a number of subjects ranked correctly. The c-index was originally described for survival analysis [@pmid7069920]. Harrell described the c-index (concordance index) as estimating the probability that, of two randomly chosen patients, the patient with the higher prognostic score will outlive the patient with the lower prognostic score. As such, the c-index should be interpreted as the number of concordant pairs relative to the total number of comparable pairs. It has been proposed that the AUC and c-index are not appropriate for survival analysis as they do not account for the dynamic nature of the data [@LONGATO2020103496]. The integrated Graf score has been proposed to compare the estimated event-free survival probabilities with the actual outcomes [@pmid10474158].
Calibration of a logistic regression model is assessed using the Hosmer–Lemeshow goodness-of-fit test and the Nagelkerke generalized $R^2$. A model is well calibrated when the Hosmer–Lemeshow goodness-of-fit test shows no difference between observed and expected outcomes, i.e. a P value approaching 1. A high generalized $R^2$ value suggests a well-calibrated regression model. Running the regression through the _rms_ or _PredictABEL_ library provides these results. A pseudo-$R^2$ can also be obtained manually in base R by fitting an intercept-only model and again with covariates: McFadden's version is given by $1-\frac{\ln L_1}{\ln L_0}$, where $L_0$ and $L_1$ are the likelihoods of the intercept-only and full models.
```{r 03-Statistics-4-1, warning=F}
#lrm on logistic regression analysis for Breast Cancer
library(rms)
DCa_rms<-lrm(Class~., data=BreastCancer2)
DCa_rms
```
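The manual pseudo-$R^2$ calculation described above can be sketched with simulated data (the variables here are illustrative only):

```{r 03-Statistics-mcfadden-sketch}
#McFadden's pseudo-R2: compare the log-likelihood of the full model
#against that of an intercept-only model
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.2 * x))
full <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)
1 - as.numeric(logLik(full)/logLik(null))
```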
#### Measuring Improvement in Regression Models
The net reclassification improvement (NRI) and integrated discrimination improvement (IDI) have been proposed as more sensitive metrics of improvement in model discrimination. The NRI can be considered a percentage reclassification across the risk categories, and the IDI is the mean difference in predicted probabilities between 2 models (constructed from cases with and without disease). The NRI and IDI scores are expressed as fractions and can be converted to percentages by multiplying by 100. The continuous NRI and IDI can be computed using _PredictABEL_ [@pmid28579970][@pmid26796056].
#### Shapley value
We can use ideas from game theory relating to the fair distribution of profit in coalition games; the coalition (co-operative) game in this case can be interpreted as the contribution of the covariates to the model. The Shapley value regression method calculates the marginal contribution of each covariate as the average, over all permutations, of the value of the coalition containing the covariate of interest minus that of the coalition without it. The advantage of this approach is that it can handle multicollinearity (relatedness) among the covariates.
Feature importance, by contrast, is used to assess the impact of each feature on the model's predictions.
```{r 03-Statistics-5, warning=F}
#this section takes the output from logistic regression above.
library(iml)
X = BreastCancer2[which(names(BreastCancer2) != "Class")]
predictor = Predictor$new(DCa, data = X, y = BreastCancer2$Class)
imp = FeatureImp$new(predictor, loss = "mae")
plot(imp)
```
From the logistic regression above, cell thickness and chromatin have the highest coefficients and lowest p values. This matches the feature importance. By contrast, the Shapley values show that cell shape and marginal adhesion make the largest impact on the model when the contribution is averaged over all coalitions of covariates.
```{r 03-Statistics-5-1}
#explain with game theory
shapley <- Shapley$new(predictor, x.interest = X[1, ])
shapley$plot()
```
#### ICE
The individual conditional expectation (ICE) curve is the plot of the expectation of the predicted value for each observation across the unique values of the feature.
```{r 03-Statistics-5-2}
#feature is the covariate of interest
par(mfrow=c(1,2))
eff_thick <- FeatureEffect$new(predictor,
feature = "Cl.thickness",
method = "ice",
center.at = 0)
plot(eff_thick)
```
### Interaction
Interactions are plotted here as a lollipop chart. The _ggalt_ library can be used to create this type of plot with _geom_lollipop_. The strength of interaction is measured using Friedman's H-statistic, which ranges from 0 to 1, with 0 indicating no interaction and values near 1 indicating that the prediction is driven largely by interaction. In the case of the Breast Cancer data, the interaction strength is low.
When describing interaction terms it is recommended that the results be expressed as β coefficients rather than as odds ratios.
```{r 03-Statistics-5-3}
#plot interactions
interact <- Interaction$new(predictor)
plot(interact)
```
## Confounder
A confounder is a covariate that is a cause of both exposure and outcome and as such confounds the analysis. A mediator exists on the causal pathway from exposure to outcome. A common misconception is that multiple regression adjusts for imbalance in covariates in clinical trials. This issue was observed in the pivotal NINDS alteplase trial; the results of that trial have since been accepted following re-analysis using covariate adjustment [@pmid15345796].
There are several methods for covariate adjustment in randomised trials: direct adjustment, standardisation and inverse-probability-of-treatment weighting.
### Confounder weighted model
The question here is whether one should perform a confounder-adjusted analysis or propensity matching.
### Propensity matching
Propensity matching is an important technique for adjusting for imbalance in covariates between 2 arms. There are concerns about misuse of this technique, such as differences between placebo arms across multiple sclerosis trials [@doi10.1001/jamaneurol.2020.0678]. It has been proposed that this technique should be used only if all the confounders are measurable. This condition may not be satisfied if the data were accrued in different periods, on different continents, etc.
### E-values
The E-value [@pmid28693043] has been proposed as a measure of the potential effect of unmeasured confounders in observational studies. The E-value measures the minimum strength of association an unmeasured confounder would need to have with both treatment and outcome to explain away the observed treatment–outcome association, conditional on the measured covariates. A large E-value is desirable. The E-value is available in the _EValue_ library [@pmid29912013].
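A minimal sketch of the calculation, assuming the _EValue_ package is installed; the risk ratio and confidence interval here are hypothetical:

```{r 03-Statistics-evalue-sketch}
#E-value for a hypothetical observed risk ratio of 0.80 (95% CI 0.70-0.91)
library(EValue)
evalues.RR(est = 0.80, lo = 0.70, hi = 0.91)
```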
## Causal inference
Causation and association are often misconstrued to be the same. However, the finding of correlation (association) does not necessarily imply causation. Causal inference evaluates the response of an effect variable when the cause of that variable changes. There are several approaches to causal inference analysis. It can be performed using frequentist methods such as confounder-weighted models, or Bayesian methods such as [(Bayesian additive regression trees)][Bayesian trees method].
## Special types of regression
### Ordinal regression
Ordinal regression is appropriate when the outcome variable is in the form of ordered categorical values. For example, the Rankin scale of disability is bounded between the values of 0 and 6. This type of analysis uses the proportional odds model, and the requirements of this model are stringent. When examining results of ordinal regression, check that the authors report the Brant test, which assesses the parallel regression (proportional odds) assumption. Ordinal regression is performed using the _polr_ function in the _MASS_ library. The Brant test is available in the _brant_ library.
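A minimal sketch using the `housing` data shipped with _MASS_, assuming the _brant_ package is installed:

```{r 03-Statistics-polr-sketch}
#proportional odds model for an ordered outcome
#(housing satisfaction: Low < Medium < High)
library(MASS)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing,
            Hess = TRUE)
summary(fit)
#Brant test of the parallel regression assumption
brant::brant(fit)
```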
### Survival analysis
Survival analysis is useful when dealing with time-to-event data. Time-to-event data can be subject to left, interval or right censoring. Left censoring exists when the event may have already occurred before observation starts, e.g. purchase of phones. Right censoring exists when the event has not yet happened by the end of follow-up, e.g. a cancer trial. Interval censoring exists when the event is known only to have occurred within an interval, e.g. a purchase made at some point between two observation dates.
The Cox model assesses the hazard of outcome between two groups. The assumption of this model is that the hazards of the arms are proportional [@pmid32167523]. The proportional hazards assumption can be tested using the weighted Schoenfeld residuals [@10.1093/biomet/81.3.515]. Non-parametric models are available for when the proportional hazards assumption does not hold.
In the next chapter on machine learning, illustrations of [random survival forest with the _rfsrc_ library][Random survival forest with rfsrc] and the _ranger_ library are provided. In the section on [clinical trials][Interpreting clinical trials] we illustrate how the results can be converted to numbers needed to treat. The median survival, corresponding to a survival probability of 0.50, can be determined here. Metrics for assessing survival models were described [above][Discrimination and Calibration].
```{r 03-Statistics-6, warning=F}
library(survival)
library(survminer)
#data from survival package on NCCTG lung cancer trial
#https://stat.ethz.ch/R-manual/R-devel/library/survival/html/lung.html
data(cancer, package="survival")
#time is in days
#status: censored=1, dead=2
#sex:Male=1 Female=2
survfit(Surv(time, status) ~ 1, data = cancer)
```
```{r 03-Statistics-6-1, warning=F}
sfit<- coxph(Surv(time, status) ~ age+sex+ph.ecog+ph.karno+wt.loss, data = cancer)
summary(sfit)
```
Test the proportional hazards assumption using weighted residuals [@10.1093/biomet/81.3.515]. The finding below shows that the covariate ph.karno violates the proportional hazards assumption.
```{r 03-Statistics-6-2, warning=F}
cox.zph(sfit)
```
Plot fit of survival model
```{r 03-Statistics-6-3}
ggcoxdiagnostics(sfit, type = "deviance",
ox.scale = "linear.predictions")
```
Forest plot of outcome from survival analysis
```{r 03-Statistics-6-4, warning=F}
ggforest(sfit)
```
An alternative way to display the output from Cox regression is to use the _forestmodel_ library.
```{r 03-Statistics-6-5, warning=F}
forestmodel::forest_model(sfit)
```
The pseudo-$R^2$ for the Cox regression model proposed by Royston can be evaluated.
```{r 03-Statistics-6-6, warning=F}
royston(sfit)
```
### Quantile regression
Least squares regression is appropriate when the data are homoscedastic, i.e. the error term remains constant as the data vary around the fitted line. Quantile regression is appropriate when the distribution of the data is non-normal and it is more appropriate to model the conditional median (or other quantiles) of the dependent variable. There are several libraries for this task, including _quantreg_ and Bayesian libraries. In the example below, the lifetime risk of stroke is regressed against life expectancy using least squares and quantile regression.
```{r 03-Statistics-7, warning=F}
library(quantreg)
library(ggplot2)
load("./Data-Use/world_stroke.Rda")
#quantile regression
rqfit <- rq( MeanLifetimeRisk~ LifeExpectancy, data = world_sfdf)
rqfit_sum<-summary(rqfit)
#least square regression
lsfit<-lm(MeanLifetimeRisk~LifeExpectancy,data=world_sfdf)
lsfit_sum<-summary(lsfit)
#plot
ggplot(world_sfdf, aes(x=LifeExpectancy,y=MeanLifetimeRisk))+
#add fitted line for least square
geom_abline(intercept =lsfit_sum$coefficients[1], slope=lsfit_sum$coefficients[2],color="red")+
#add fitted line for quantile regression
geom_point()+xlab("Life Expectancy")+
geom_abline(intercept =rqfit_sum$coefficients[1], slope=rqfit_sum$coefficients[2],color="blue")
#annotate least square and quantile at position x, y
#annotate("text",x=60, y=27, label=paste0("least square =", round(lsfit_sum$coefficients[1],2) ,"+", round(lsfit_sum$coefficients[2],2),"x ","Life Expectancy"),color="red")+ annotate("text",x=75, y=12,label=paste0("quantile =",round(rqfit_sum$coefficients[1],2), " + ", round(rqfit_sum$coefficients[2],2)," x ","Life Expectancy"),color="blue")
```
### Non-negative regression
In certain situations, it is necessary to constrain the analysis so that the regression coefficients are non-negative. For example, when regressing brain regions against infarct volume, there is no reason to believe that a negative coefficient attributable to a brain region is possible [@pmid16504541]. Non-negative regression can be performed in R using _nnls_.
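A minimal sketch with simulated data, assuming the _nnls_ package is installed; the design matrix and true coefficients here are illustrative only:

```{r 03-Statistics-nnls-sketch}
#non-negative least squares: estimated coefficients are constrained >= 0
library(nnls)
set.seed(2)
A <- matrix(runif(300), ncol = 3)
b <- as.vector(A %*% c(2, 0, 1) + rnorm(100, sd = 0.1))
fit <- nnls(A, b)
fit$x #no negative estimates
```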
### Poisson regression
Poisson regression is used when dealing with counts of events over time or distance, such as the number of new admissions, or new cases of hepatitis or TIA over time. An assumption of the Poisson distribution is that the mean $\lambda$ and variance $\lambda$ are the same.
A generalisation of Poisson regression is negative binomial regression. This latter method is used when the variance is greater than the mean of the data, i.e. over-dispersed data. Negative binomial regression can be applied to the number of 'failure' events over time. Here 'failure' has a loose definition and can be stroke recurrence after TIA or cirrhosis after hepatitis C infection.
Zero-inflated data occur when there is an abundance of zeroes in the data (true and excess zeroes).
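A minimal sketch with simulated counts (the variables are illustrative only); for over-dispersed counts, `glm.nb` from _MASS_ fits the negative binomial alternative:

```{r 03-Statistics-poisson-sketch}
#Poisson regression on simulated count data
set.seed(3)
x <- rnorm(300)
y <- rpois(300, exp(0.2 + 0.5 * x))
summary(glm(y ~ x, family = poisson))
#for over-dispersed counts, use MASS::glm.nb(y ~ x)
```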
### Conditional logistic regression
Conditional logistic regression should be used when the aim is to compare paired observations from the same patient. Examples include left and right arms or left and right carotid arteries. This method is available as _clogit_ in the _survival_ library.
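A minimal sketch using the matched case–control `infert` data shipped with base R, where observations are matched within strata:

```{r 03-Statistics-clogit-sketch}
#conditional logistic regression on matched strata
library(survival)
fit <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summary(fit)
```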
### Multinomial modelling
Multinomial modelling is used when the outcome categories are not ordered. This situation can occur when the analysis involves a choice outcome (choices of fruit: apple, orange or pear). In this case, the log odds of each categorical outcome are modelled as a linear combination of the predictor variables. The _nnet_ library has functions for performing this analysis.
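A minimal sketch using the built-in `iris` data, where the three species play the role of unordered choices:

```{r 03-Statistics-multinom-sketch}
#multinomial logistic regression with nnet
library(nnet)
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris,
                trace = FALSE)
#one row of coefficients per non-reference outcome category
coef(fit)
```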
## Sample size estimation
Clinicians are often frustrated by sample size and power estimation for a study, grant or clinical trial. This aspect is scrutinised by ethics committees and in the peer review process for journals. Luckily, R provides several packages for sample size and power estimation, such as the _pwr_ library. Cohen wrote the reference textbook on this subject [@COHEN1977179].
### Proportion
```{r 03-Statistics-8, warning=F}
library(pwr)
#t-test: d is the effect size
#d = (mean group1 - mean group2)/pooled SD
pwr.t.test(n=300,d=0.2,sig.level=.05,alternative="greater")
```
We provided an example below for generating power of clinical trial. Examples
are taken from a paper on sample size estimation for phase II trials [@pmid16931782].
```{r 03-Statistics-9}
library(pwr)
#h is the effect size (Cohen: 0.2 small, 0.5 medium, 0.8 large)
#sample size
pwr.2p.test(h=0.5,n=50,sig.level=0.05,alternative="two.sided")
#small effect size
pwr.2p.test(h=0.1,n=50,sig.level=0.05,alternative="two.sided")
```
The output of the sample size calculation can be put into a table or plot.
```{r 03-Statistics-10}
library(pwr)
#pwr.2p.test(h=0.3,n=80,sig.level=0.05,alternative="two.sided")
h <- seq(.1,.5,.1) #effect sizes from 0.1 to 0.5 by 0.1
nh <- length(h) #5
p <- seq(.3,.9,.1) #power from 0.3 to 0.9 by 0.1
np <- length(p) #7
# create an empty 5 x 7 array
samplesize <- array(numeric(nh*np), dim=c(nh,np))
for (i in 1:np){
for (j in 1:nh){
result <- pwr.2p.test(n = NULL, h = h[j],
#result <- pwr.r.test(n = NULL, h = h[j],
sig.level = .05, power = p[i],
alternative = "two.sided")
samplesize[j,i] <- ceiling(result$n)
}
}
samplesize
#graph
xrange <- range(h)
yrange <- round(range(samplesize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
xlab="Effect size (h)",
ylab="Sample Size (n)" )
# add power curves
for (i in 1:np){
lines(h, samplesize[,i], type="l", lwd=2, col=colors[i])
}
# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
col="grey89")
title("Sample Size Estimation\n Difference in Proportion")
legend("topright", title="Power", as.character(p),
fill=colors)
```
#### Non-inferiority
Non-inferiority trials may offer information in a way that a traditional superiority design does not. The design may focus on other aspects of the treatment such as cost and lower toxicity [@pmid26080342]. Examples of non-inferiority trial designs include antibiotics versus surgery for appendicitis [@pmid26080338]. There are concerns with the reporting of non-inferiority trials: justification for the margin was provided in only 27.6% of reports [@pmid25781447]. The following describes a trial design where the expected outcome rate is _p1_ in the treatment arm and _p2_ in the control arm, the ratio of subjects in the treatment to the control arm is _k_, and the difference in outcome is _delta_. A negative _margin_ denotes a non-inferiority (rather than superiority) design.
```{r 03-Statistics-11}
library(TrialSize)
TwoSampleProportion.NIS(alpha=.05,
beta=.8,
p1=.6,
p2=.7,
k=1,
delta = .1,
margin=-.2
)
```
### Logistic regression
```{r 03-Statistics-12}
library(powerMediation)
library(ggplot2)
#continuous predictor
#p1=event rate
#exp(0.405) =1.5
powerLogisticCon(n=200, p1=0.265, OR=exp(0.014), alpha=0.05)
# power across a range of event rates (p1); event rates must lie in (0,1)
a=seq(0.05,0.4,0.05)
df_power<-data.frame(`ASPECTS`= a,
"Power"=powerLogisticCon(n=100, p1=a, OR=exp(.695), alpha=0.05)
)
ggplot(data=df_power, aes(x=ASPECTS, y=Power))+geom_point()
```
An alternative library to perform sample size for logistic regression is _WebPower_ library.
```{r 03-Statistics-12-1}
library(WebPower)
wp.logistic(p0=0.007, #Prob (Y=1|X=0)
p1=0.012, #Prob (Y=1|X=1)
alpha=0.05,
power=0.80,
alternative="two.sided",
family="normal")
```
### Survival studies
Sample size estimation for survival studies can be performed using _powerSurvEpi_ or _gsDesign_.
```{r 03-Statistics-13}
library(powerSurvEpi)
#sample size
ssizeEpi.default(power = 0.80,
theta = 2,
p = 0.408 ,
psi = 0.15,
rho2 = 0.344^2,
alpha = 0.05)
#power
powerEpi.default(n = 2691,
theta = 2,
p = 0.408,
psi = 0.250,
rho2 = 0.344^2,
alpha = 0.05)
#Amarenco NEJM 2020; k = ratio of sample sizes between arms
ssizeCT.default(power = 0.8, k = .8, pE = 0.085,
pC = 0.109,
RR = 0.78, alpha = 0.05)
```
### Multiple regression
The power for general linear model can be calculated using the _pwr.f2.test_ function.
```{r 03-Statistics-14}
library(pwr)
#u=degrees of freedom for numerator (number of predictors)
#v=degrees of freedom for denominator (n - u - 1)
#f2=effect size (Cohen: 0.02 small, 0.15 medium, 0.35 large)
#example: 3 predictors, n=100, medium effect size
pwr.f2.test(u=3, v=96, f2=0.15, sig.level=0.05)
```
## Randomised clinical trials
A common misconception is that multiple regression adjusts for imbalance in covariates in clinical trials. This issue was observed in the pivotal NINDS alteplase trial; the results of that trial have since been accepted following re-analysis using covariate adjustment [@pmid15345796]. There are several methods for covariate adjustment in randomised trials: direct adjustment, standardisation and inverse-probability-of-treatment weighting.
### Covariate adjustment in trials
Specifically, covariate adjustment refers to adjustment for covariates available at the time of randomisation, i.e. prespecified variables, and not variables arising after randomisation, such as pneumonia after stroke in stroke trials. The advantage of covariate adjustment is that it results in narrower confidence intervals and can increase the power of the trial by up to 7% [@10.1186/1745-6215-15-139]. The gain in power is highest when prognostic variables are used, but power can decrease when non-prognostic variables are used [@10.1186/1745-6215-15-139].
### Subgroup analysis
Subgroup analyses can be misleading, especially if not specified prior to trial analysis [@doi:10.1056/NEJMsr077003]. Furthermore, increasing the number of subgroup analyses increases the chance of a false positive result (multiplicity). It is important to differentiate between prespecified and posthoc analyses, as posthoc analyses may be biased by examination of the data.
### p value for interaction
The p value for interaction describes whether a baseline variable modifies the treatment effect on outcome in a clinical trial [@doi:10.1056/NEJMsr077003]. In a hypothetical trial, a significant p value for the interaction between sex and treatment on the primary outcome indicates heterogeneity of treatment effect between males and females.
```{r 03-Statistics-14-1}
#Publish provides the subgroupAnalysis function for subgroup analyses
library(Publish)
library(survival)
```
### Interpreting risk reduction in clinical trials
A key issue in the interpretation of clinical trials arises when only the relative risk reduction or relative hazard ratio is provided. This issue affects the clinical interpretation of the trial finding and its application in practice. An example is the ACAS asymptomatic carotid artery trial, which is often quoted as showing a 50% risk reduction. In fact, there was a 2% annual risk of ipsilateral stroke in the medical arm and a 1% risk in the surgical arm. The absolute risk reduction (ARR) was therefore 1% per year; however, the 50% relative risk reduction is often quoted when explaining the result to patients.
### NNT from ARR
In this case, the number needed to treat (NNT) is given by $\frac{1}{ARR}$ or $\frac{1}{0.01}=100$, i.e. 100 patients need to be treated for one year to prevent one additional stroke. The recommendation is that the 95% confidence interval for the NNT should be provided.
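The calculation, and a confidence interval obtained by inverting the limits of the ARR interval, can be sketched as follows (the ARR confidence limits here are hypothetical):

```{r 03-Statistics-nnt-arr-sketch}
#NNT from the absolute risk reduction
ARR <- 0.01
1/ARR #NNT = 100
#95% CI for the NNT: invert the (hypothetical) ARR confidence limits
arr_ci <- c(0.004, 0.016)
sort(1/arr_ci) #62.5 to 250
```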
### NNT from odds ratio
Calculation of the NNT from an odds ratio requires knowledge of the outcome rate in the placebo group. The NNT is given by $\frac{1}{ACR-\frac{OR \times ACR}{1-ACR+OR \times ACR}}$, where ACR represents the assumed control risk. The NNT can be calculated with the _nnt_ function in the _meta_ library.
```{r 03-Statistics-14-2}
library(meta)
#p.c = baseline risk
nnt(0.73, p.c = 0.3, sm = "OR")
```
### NNT from risk ratio
```{r 03-Statistics-14-3}
#data from EXTEND-IV trial in NEJM 2019
#outcome 35.4% in tpa and 29.5% in control
nnt(1.44, p.c = 0.295, sm = "RR")
```
### NNT from hazard ratio
Calculation of the NNT from a hazard ratio (HR) requires knowledge of the outcome rate in the placebo group. Writing $p=1-q$, where $q$ is the ratio of events to the number recruited in the placebo group, the NNT is given by $\frac{1}{p^{HR}-p}$ [@pmid32404155]. We illustrate this using data from the meta-analysis of aspirin use in stroke in Lancet 2016. There were 45 events among 16053 patients in the control group and the HR was 0.44. The NNT from this formula, $\frac{1}{0.9971968^{0.44}-0.9971968}$, is 637.
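The arithmetic above can be reproduced directly:

```{r 03-Statistics-nnt-hr-sketch}
#NNT from a hazard ratio, using the control-arm event rate
q  <- 45/16053 #event rate in the control group
p  <- 1 - q    #event-free proportion
HR <- 0.44
1/(p^HR - p)   #~637
```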
### NNT from meta-analysis
There are concerns with using the NNT derived from the results of a meta-analysis, as the findings are amalgamations of trials with different settings [@pmid12614141] [@pmid10356018]. The caution applies when baseline risks or absolute risk differences vary across trials.
## Diagnostic test
### Sensitivity, specificity
The sensitivity of a diagnostic test is the true positive rate and the specificity is the true negative rate. An example of a 2 x 2 table is provided here. As an exercise, consider a paper about a diagnostic test for peripheral vertigo reporting 100% sensitivity and 94.4% specificity. There are 114 patients; 72 patients without stroke have vertigo and positive test findings. Among patients with stroke, 7 of 42 have positive test findings. The sensitivity is $\frac{TP}{TP+FN}$ and the specificity is $\frac{TN}{TN+FP}$.
```{r 03-Statistics-14-3-1}
# Peripheral Vertigo
# Disease Positive Disease Negative
#$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HIT Test# # $
# Positive# True Positive # False Positive $
# # 72 # 7 $
#$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# HIT Test# # $
# Negative# False Negative # True Negative $
# # 0 # 35 $
#$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# Peripheral Vertigo
#Sensitivity=TP/(TP+FN)=100%
#Specificity=TN/(TN+FP)=83%
```
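The calculation from the 2 x 2 counts above can be done directly:

```{r 03-Statistics-14-3-1-1}
#sensitivity and specificity from the 2 x 2 counts above
TP <- 72; FP <- 7; FN <- 0; TN <- 35
sens <- TP/(TP + FN)
spec <- TN/(TN + FP)
c(sensitivity = sens, specificity = round(spec, 3))
```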
### AUC
The area under the receiver operating characteristic (ROC) curve, or AUC, is a measure of the accuracy of the test. The ROC curve should be used when there are multiple thresholds; it should not be used when the test has only one threshold. Some investigators urge caution regarding the validity of using the ROC curve with single-threshold diagnostic tests [@pmid33250548]. An AUC of 0.5 is classified as no better than chance; 0.6–0.69 provides poor discrimination; 0.7–0.79 acceptable (fair) discrimination; 0.8–0.89 good (excellent) discrimination; and 0.9–1.0 outstanding discrimination.
### Likelihood ratio
The positive likelihood ratio (PLR) is the ratio of sensitivity to the false positive rate (FPR); the negative likelihood ratio (NLR) is the ratio of 1-sensitivity to specificity. The PLR indicates how much more likely a positive spot sign (test) is in a patient with ICH (target disorder) than in a patient without ICH. Using the recommendation by Jaeschke et al, a high PLR (>5) and a low NLR (<0.2) indicate that the test results would make moderate changes in the likelihood of hematoma growth from baseline risk. A PLR of >10 or an NLR of <0.1 would confer very large changes from baseline risk.
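As an illustration, the likelihood ratios can be computed from a sensitivity of 0.623 and specificity of 0.927 (the values used later in this chapter for the likelihood ratio graph):

```{r 03-Statistics-14-3-lr}
#likelihood ratios from sensitivity and specificity
sens <- 0.623; spec <- 0.927
PLR <- sens/(1 - spec) #sensitivity over false positive rate
NLR <- (1 - sens)/spec
round(c(PLR = PLR, NLR = NLR), 2)
```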
#### Fagan's nomogram
Fagan's nomogram can be conceptualised as a sliding ruler that matches the disease
prevalence and likelihood ratios to evaluate the impact on the post-test probability [@pmid1143310]. At the current disease prevalence of 23.4% and a PLR of 4.85, the post-test probability remains low at 0.60.
```{r 03-Statistics-14-3-2, eval=F}
library(tidyverse)
source("https://raw.githubusercontent.com/achekroud/nomogrammer/master/nomogrammer.r")
p<-nomogrammer(Prevalence = .234, Plr = 4.85, Nlr = 0.49)
p+ggtitle("Fagan's nomogram for Spot Sign and ICH growth")
#to save the file
#ggsave(p,file="Fagan_SpotSign.png",width=5.99,height=3.99,units="in")
```
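The post-test probability underlying the nomogram follows from Bayes' theorem in odds form; using the prevalence and PLR from the code above:

```{r 03-Statistics-14-3-fagan}
#post-test probability from prevalence and positive likelihood ratio
prev <- 0.234
PLR <- 4.85
pre.odds <- prev/(1 - prev)
post.odds <- pre.odds * PLR
post.prob <- post.odds/(1 + post.odds)
round(post.prob, 2)
```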
#### Likelihood ratio graph
Likelihood ratio graph is a tool for comparing diagnostic tests [@pmid10700737].
```{r 03-Statistics-14-3-3}
#plot likelihood ratio graph
LR_graph<-function (Read,sheet,Sensitivity, Specificity){
Read1<-readxl::read_xlsx(Read, sheet = sheet)
#binary data
#A=True pos %B=False positive %C=False negative %D=True negative
A=Read1$TP
B=Read1$FP
C=Read1$FN
D=Read1$TN
TPR=A/(A+C)
FPR=1-(D/(D+B))
TPR_DiagnosticTest=Sensitivity
FPR_DiagnosticTest=1-Specificity
# set plot
X=seq(0,1,by=.1)
Y=seq(0,1,by=.1)
plot(X,Y,main="Likelihood Ratio graph", xlab="1-Specificity",ylab="Sensitivity",cex=.25)
  #pch describes the point shape; the value 8 corresponds to *
  points(FPR_DiagnosticTest,TPR_DiagnosticTest,pch=8,col="blue",cex=2)
  #the value 1 corresponds to an open circle o
  points(FPR,TPR,pch=1,col="red",cex=2) #add point
#abline(coef = c(0,1)) #add diagonal line
df1<-data.frame(c1=c(0,TPR_DiagnosticTest),c2=c(0,FPR_DiagnosticTest))
reg1<-lm(c1~c2,data=df1)
df2<-data.frame(c1=c(TPR_DiagnosticTest,1),c2=c(FPR_DiagnosticTest,1))
reg2<-lm(c1~c2,data=df2)
abline(reg1)
abline(reg2)
text(x=FPR_DiagnosticTest,y=TPR_DiagnosticTest+.3,label="Superior",cex=.7)
text(x=FPR_DiagnosticTest+.2,y=TPR_DiagnosticTest+.2,label="Absence",cex=.7)
text(x=.0125,y=TPR_DiagnosticTest-.1,label="Presence",cex=.7)
text(x=FPR_DiagnosticTest+.1,y=TPR_DiagnosticTest,label="Inferior",cex=.7)
text(x=.7,y=.2,label="Reference = Content Expert",cex=.7)
text(x=.7,y=.15, label="Diagnostic Test software", cex=.7)
}
```
Running the function defined above:
```{r 03-Statistics-14-3-3-1}
#Sensitivity=0.623
#Specificity=0.927
LR_graph("./Data-Use/Diagnostic_test_summary.xlsx",1,.623,.927)
```
## Metaanalysis
During journal club, junior doctors are often taught about the importance of metaanalysis. It is worth knowing how to perform a metaanalysis in order to critique a study. This is an important issue as the junior doctor is supervised by someone who is a content expert but not necessarily a methods expert. Metaanalysis can be performed for clinical trials, cohort studies or diagnostic studies. As an example, it is not well known outside of statistics journals that bivariate analysis is the preferred method to evaluate diagnostic studies [@pmid16168343]. By contrast, the majority of metaanalyses of diagnostic studies use the univariate method of Moses and Littenberg [@pmid8210827]. This issue is expanded below.
### Quality of study
A metaanalysis requires evaluation of the quality of the individual studies. This can be done with the QUADAS-2 tool, available at
https://annals.org/aim/fullarticle/474994/quadas-2-revised-tool-quality-assessment-diagnostic-accuracy-studies.
### PRISMA
The PRISMA statement is useful for documenting the search strategy and the papers removed and retained in the metaanalysis. An example of generating the flow diagram in R is provided below. The example given here is from a paper on the use of the spot sign to predict enlargement of intracerebral hemorrhage [@pmid31272327].
```{r 03-Statistics-14-3-4}
library(PRISMAstatement)
#example from Spot sign paper. Stroke 2019
prisma(found = 193,
found_other = 27,
no_dupes = 141,
screened = 141,
screen_exclusions = 3,
full_text = 138,
full_text_exclusions = 112,
qualitative = 26,
quantitative = 26,
width = 800, height = 800)
#https://rich-iannone.github.io/DiagrammeR/graphviz_and_mermaid.html#attributes
library(DiagrammeR)
grViz("
digraph boxes_and_circles {
# a 'graph' statement
graph [overlap = true, fontsize = 10]
# several 'node' statements
node [shape = box,
fontname = Helvetica]
Stroke
node [shape = oval,
fixedsize = false,
color=red,
width = 0.9]
Hypertension; 'No Hypertension'
node [shape= circle,
fontcolor=red,
color=blue,
fixedsize=false]
Hypokalemia; 'No Hypokalemia'
# several 'edge' statements
edge [arrowhead=diamond]
Stroke->{Hypertension, 'No Hypertension'}
Hypertension->{Hypokalemia, 'No Hypokalemia'}
}
")
## alternative
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab3
tab1-> tab2
#tab2->tab3
tab2 -> tab4
tab2-> tab5;
}
[1]: 'Stroke n=19'
[2]: 'Hypertension n=10'
[3]: 'No Hypertension n=9'
[4]: 'Hypokalemia n=?'
[5]: 'No Hypokalemia n=?'
")
```
### Conversion of median to mean
One issue with performing metaanalysis is that one paper may report the mean and another the median age. The formula for the mean is given by $\frac{a+2m+b}{4}$ where _m_ is the median, _a_ is the lower and _b_ is the upper end of the range [@pmid25524443]. The variance is given by $S^2=\frac{1}{12}\left(\frac{(a-2m+b)^2}{4}+(b-a)^2\right)$. This formula requires examination of the data, such as the figures, to obtain the upper and lower range. These methods are incorporated into the _meta_ library in the _metamean_ function [@doi10.1136/ebmental-2019-300117]. More recently, investigators have suggested also considering the skewness of the data from the 5-number summary [@pmid37161735]. The argument _method.mean_ in the _metamean_ function specifies the method for estimating the mean. Here we chose the method of Luo [@doi:10.1177/0962280216669183] for illustration. See the help page by typing a question mark before _metamean_ for more options.
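The range-based formula can be sketched directly in base R, taking _a_ as the minimum and _b_ as the maximum (the values below are illustrative, not from any trial):

```{r 03-Statistics-14-4-wan}
#mean and SD estimated from median and range (Wan et al method)
a <- 1; m <- 5; b <- 12 #min, median, max (illustrative values)
est.mean <- (a + 2*m + b)/4
est.var <- (1/12)*((a - 2*m + b)^2/4 + (b - a)^2)
c(mean = est.mean, sd = round(sqrt(est.var), 2))
```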
```{r 03-Statistics-14-4}
library(meta)
#NIHSS data from ANGEL large core trial in NEJM 2023
metamean(q1=4,q3=20,median=16,n=230, method.mean = "Luo")
```
Here the same data is used for the McGrath method [@doi:10.1177/0962280219889080].
```{r 03-Statistics-14-5}
metamean(q1=4,q3=20,median=16,n=230, method.mean = "QE-McGrath")
```
The conversion from median to mean in _metafor_ is performed using the _conv.fivenum_ function [@JSSv036i03]. There is an update on the _metafor_ page as well as discussion of alternative approaches. By default this function uses the methods of Luo [@doi:10.1177/0962280216669183], Wan [@pmid25524443] and Shi [@pmid37161735].
```{r 03-Statistics-14-6}
library(metafor)
# example data frame
EstMean <- data.frame(Paper=c(1:4,NA), min=c(1,2,NA,2,NA), q1=c(NA,NA,4,4,NA),
median=c(5,6,6,6,NA), q3=c(NA,NA,10,10,NA),
max=c(12,14,NA,14,NA),
mean=c(NA,NA,NA,NA,7.0), sd=c(NA,NA,NA,NA,4.2),
n=c(30,30,30,30,30))
EstMean
EstMean <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=EstMean)
EstMean
```
The _estmeansd_ library uses the quantile estimation method in the _qe.mean.sd_ function when the data available are the median and quartiles [@doi:10.1177/0962280219889080] [@pmid36412105]. The approach uses simulation to estimate the parameters.
```{r 03-Statistics-14-7}
library(estmeansd)
#data from ANGEL large core trial in NEJM 2023
res_qe <- qe.mean.sd(q1.val = 13, med.val = 16, q3.val = 20, n = 230)
res_qe
```
The standard error of the mean can be estimated using the _get_SE_ function.
```{r 03-Statistics-14-8}
get_SE(res_qe)
```
The _estmeansd_ library uses the Box-Cox method for estimating the mean and SD when the data available are the median, minimum and maximum values.
```{r 03-Statistics-14-9}
library(estmeansd)
res_bc <- bc.mean.sd(min.val = 13, med.val = 16, max.val = 42, n = 230)
res_bc
```
Again, the standard error of the mean can be estimated using the _get_SE_ function.
```{r 03-Statistics-14-10}
get_SE(res_bc)
```
### Inconsistency I2
The inconsistency $I^2$ index is derived from Cochran's $Q$, the weighted sum of squared deviations of the study effects from the overall effect, with weights based on study size: $I^2 = \frac{Q-df}{Q} \times 100\%$. A value <25% is classified as low
and greater than 75% as high heterogeneity. This test can be performed using the _metafor_ package [@JSSv036i03]. A high $I^2$ suggests a need to proceed to meta-regression to understand the source of heterogeneity. In the meta-regression, the fixed components are the covariates being tested for their effect on heterogeneity; the random effect components are the sensitivity and FPR. The $I^2$ value for the TIA clinic study is 34.41%, so meta-regression is not needed for that study. By contrast, $I^2$ is much higher for the spot sign study, necessitating meta-regression.
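The computation behind $I^2$ can be sketched in base R with inverse-variance weights; the effect sizes and variances below are illustrative, not taken from either study:

```{r 03-Statistics-15-i2}
#I2 from Cochran's Q with inverse-variance (fixed-effect) weights
yi <- c(0.2, 0.5, -0.1, 0.6) #illustrative effect sizes
vi <- c(0.02, 0.05, 0.03, 0.04) #their variances
wi <- 1/vi
Q <- sum(wi*(yi - sum(wi*yi)/sum(wi))^2)
df <- length(yi) - 1
I2 <- max(0, (Q - df)/Q)*100
round(I2, 1)
```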
### Metaanalysis of proportion
This is an example of metaanalysis of stroke recurrence following management in a rapid TIA clinic. There are a variety of methods for calculating the 95% confidence interval of a binomial proportion. The mean of the binomial distribution is given by $p$ and the variance by $\frac{p \times (1-p)}{n}$. The term $z$ is the $1-\frac{\alpha}{2}$ quantile of the normal distribution. The standard Wald interval is $p\pm z\times \sqrt{\frac{p \times(1-p)}{n}}$. The Freeman-Tukey double arcsine transformation transforms the data towards a normal distribution; this approach is useful when occurrence of the event is rare. The exact or Clopper-Pearson method, based on the cumulative properties of the binomial distribution, is the most conservative of the methods for calculating a confidence interval for a proportion. The Wilson method has similarities to the Wald method but includes an extra term $z^2/n$. Investigators such as Agresti have argued that approximate methods are better than the exact method [@doi:10.1080/00031305.1998.10480550], and Brown and colleagues proposed the use of the Wilson method [@brown2001].
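As a sketch, the Wald and Wilson intervals can be computed directly for one arm of the TIA data (7 recurrences among 468 patients in the Melbourne cohort):

```{r 03-Statistics-16-ci}
#Wald and Wilson 95% confidence intervals for a proportion
x <- 7; n <- 468
p <- x/n
z <- qnorm(0.975)
wald <- p + c(-1, 1)*z*sqrt(p*(1 - p)/n)
#Wilson interval: shifted centre plus widened half-width
wilson <- (p + z^2/(2*n) + c(-1, 1)*z*sqrt(p*(1 - p)/n + z^2/(4*n^2)))/(1 + z^2/n)
round(rbind(wald, wilson), 4)
```

Unlike the Wald interval, the Wilson lower bound cannot fall below zero, which matters for rare events like these.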
```{r 03-Statistics-16, warning=F}
library(metafor) #open software metafor
#create data frame dat
#xi is numerator
#ni is denominator
dat <- data.frame(model=c("melbourne","paris","oxford","stanford","ottawa","new zealand"),
xi=c(7,7,6,2,31,2),
ni=c(468,296, 281,223,982,172))
#calculate new variable pi base on ratio xi/ni
dat$pi <- with(dat, xi/ni)
#Freeman-Tukey double arcsine transformation
dat <- escalc(measure="PFT", xi=xi, ni=ni, data=dat, add=0)
res <- rma(yi, vi, method="REML", data=dat, slab=paste(model))
#create forest plot with labels
metafor::forest(res, transf=transf.ipft.hm, targs=list(ni=dat$ni), xlim=c(-1,1.4),refline=res$beta[1],
cex=.8, ilab=cbind(dat$xi, dat$ni),
#position of data on x-axis
ilab.xpos=c(-.6,-.4),digits=3)
#par function combine multiple plots into one
op <- par(cex=.75, font=2)
#position of column names on x-axis
text(-1.0, 7.5, "model ",pos=4)
text(c(-.55,-.2), 7.5, c("recurrence", " total subjects"))
text(1.4,7.5, "Proportion [95% CI]", pos=2)
par(op)
```
Exact 95% confidence intervals for the proportions are computed below using the TIA
data above. This solution was provided on _stack overflow_ and uses the _binom.test_ function.
```{r 03-Statistics-16-1, warning=F}
#exact confidence interval
sapply(split(dat, dat$model), function(x) binom.test(x$xi, x$ni)$conf.int)
```
The data need to be transposed using the _t_ function. This generates one column
for the lower and another for the upper CI. Note that _sapply_ returns a matrix or array
and the column names have to be assigned.
```{r 03-Statistics-16-2, warning=F}
#the data
tmp <- t(sapply(split(dat, dat$model), function(x) binom.test(x$xi, x$ni)$conf.int))
tmp
```
The exact confidence intervals are now put back into the _dat_ data frame.
```{r Statistics-16-3, warning=F}
dat$ci.lb <- tmp[,1] #adding column to data frame dat
dat$ci.ub <- tmp[,2] #adding column to data frame dat
dat <- escalc(measure="PFT", xi=xi, ni=ni, data=dat, add=0)
res <- rma.glmm(measure="PLO", xi=xi, ni=ni, data=dat)
#insert the exact confidence interval
with(dat, metafor::forest(yi, ci.lb=ci.lb, ci.ub=ci.ub,
ylim=c(-1.5,8.5),
xlim=c(-1.1,1),
refline=predict(res, transf=transf.ilogit)$pred,
cex=.8,
ilab=cbind(dat$xi, dat$ni),
ilab.xpos=c(-.6,-.2),digits=3))
op <- par(cex=.75, font=2)
addpoly(res, row=-1, transf=transf.ilogit)
abline(h=0)
#position of column names on x-axis
text(-.9, 7.5, "Model", pos=2)
text(c(-.55,-.2), 7.5, c("recurrence", " total subjects"))
text( 1, 7.5, "Proportion [95% CI]", pos=2)
```
### Metaanalysis of continuous data
Metaanalysis of continuous outcome data can be performed using the standardised mean difference or the ratio of means. It is available in the _meta_ library via the _metacont_ function [@doi10.1136/ebmental-2019-300117].
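The standardised mean difference that _metacont_ pools can be sketched for a single study in base R; the group sizes, means and SDs below are hypothetical:

```{r 03-Statistics-smd}
#standardised mean difference (Cohen's d) for one hypothetical study
n1 <- 50; m1 <- 12.4; s1 <- 3.2 #treatment group (illustrative)
n2 <- 48; m2 <- 10.1; s2 <- 3.5 #control group (illustrative)
s.pooled <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/(n1 + n2 - 2))
smd <- (m1 - m2)/s.pooled
round(smd, 2)
```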
### Bivariate Metaanalysis
The univariate method of Moses-Shapiro-Littenberg combines sensitivity and specificity into a single measure of accuracy, the diagnostic odds ratio [@pmid8210827]. This approach has been criticized for losing information on the sensitivity and specificity of the test. Like the univariate method, the bivariate method employs a random effect to take into account the within-study correlation [@pmid16168343]. Additionally, the bivariate method accounts
for the between-study correlation in sensitivity and specificity. Bivariate analysis is performed using the _mada_ package. A [Bayesian library][Bayesian Metaanalysis] for bivariate analysis, _meta4diag_, is illustrated later.
The example below is taken from a metaanalysis of spot sign as predictor
expansion of intracerebral hemorrhage [@pmid31272327]. The data for this
analysis is available in the Data-Use sub-folder.
```{r 03-Statistics-17, warning=F}
library(mada)
#spot sign data
Data<-read.csv("./Data-Use/ss150718.csv")
#remove duplicates using subset
#another way is to use filter from dplyr
Dat<-subset(Data, Data$retain=="yes")
(ss<-reitsma(Dat))
summary(ss)
AUC(reitsma(data = Dat))
sumss<-SummaryPts(ss,n.iter = 10^3) #bivariate pooled LR
summary(sumss)
```
### Metaanalysis of clinical trials
Fixed effect (Peto or Mantel-Haenszel) approaches assume that the population is the same for all studies and thus each study is the source of error. Random
effect (DerSimonian-Laird) approaches assume an additional source of error between studies. The random effect approach results in a more conservative estimate of the effect size confidence interval. A criticism of the DerSimonian-Laird approach is that it is prone to type I error, especially when the number of studies is small (n<20) or there is moderate heterogeneity. It is estimated that 25% of the significant findings with the DerSimonian-Laird method may be non-significant with the Hartung-Knapp method (BMC Medical Research Methodology 2014, 14:25). The Hartung-Knapp method (Stat Med 2001;20:3875-89) estimates the between-studies variance and treats it as fixed. It employs quantiles of the t-distribution rather than the normal distribution. The Hartung-Knapp method is available in the _meta_ and _metafor_ packages.
The data are from JAMA Cardiology on Associations of Omega-3 Fatty Acid Supplement Use With Cardiovascular Disease Risks: Meta-analysis of 10 Trials Involving 77917 Individuals [@pmid29387889]. Subsequently, a meta-analysis [@pmid31567003] reported that, contrary to the earlier meta-analysis, omega-3 lowers the risk of cardiovascular disease with a dose-related effect. The analysis using the DerSimonian-Laird and Hartung-Knapp methods is illustrated below.
```{r 03-Statistics-17-1, warning=F}
library(tidyverse)
library(metafor)
#Omega 3 data
Year=c(2010,2014,2010,2007,2010,2010,2013,2008,2012,1999)
Trials=c("DOIT","AREDS-2","SU.FOL.OM3","JELIS","Alpha Omega","OMEGA","R&P","GISSI-HF","ORIGIN","GISSI-P")
Treatment=c(29,213,216,262,332,534,733,783,1276,1552)
Treatment.per=c(10.3,9.9,17.2,2.8,13.8,27.7,11.7,22.4,20.3,27.4)
Control=c(35,208,211,324,331,541,745,831,1295,1550)
Control.per=c(12.5,10.1,16.9,3.5,13.6,28.6, 11.9,23.9,20.7,27.3)
#combine into data frame
rct<-data.frame(Year,Trials,Treatment,Treatment.per,Control,Control.per) %>%
mutate(Treatment.number=round(Treatment*100/Treatment.per,0),
Control.number=round(Control*100/Control.per,0), group=ifelse(Year>2010,"wide","narrow")) %>%
rename(ai=Treatment,n1i=Treatment.number,ci=Control,
n2i=Control.number,study=Trials)