forked from compgenomr/book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-StatsForGenomics.Rmd
1808 lines (1473 loc) · 79.5 KB
/
03-StatsForGenomics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Statistics for Genomics {#stats}
This chapter will summarize statistics methods frequently used
in computational genomics. As these fields are continuously evolving, the
techniques introduced here do not form an exhaustive list but mostly corner
stone methods
that are often and still being used. In addition, we focused on giving intuitive and
practical understanding of the methods with relevant examples from the field.
If you want to dig deeper into statistics and math, beyond what is described
here, we included appropriate references with annotation after each major
section.
## How to summarize collection of data points: The idea behind statistical distributions
In biology and many other fields data is collected via experimentation.
The nature of the experiments and natural variation in biology makes
it impossible to get the same exact measurements every time you measure something.
For example, if you are measuring gene expression values for
a certain gene, say PAX6, and let's assume you are measuring expression
per sample and cell with any method( microarrays, rt-qPCR, etc.). You will not
get the same expression value even if your samples are homogeneous. Due
to technical bias in experiments or natural variation in the samples. Instead,
we would like to describe this collection of data some other way
that represents the general properties of the data. The figure shows a sample of
20 expression values from PAX6 gene.
```{r,fig.align='center', out.width='60%',echo=FALSE,warning=FALSE,fig.height=5.6,fig.cap="Expression of PAX6 gene in 20 replicate experiments"}
set.seed(1)
old.par <- par()
a=rnorm(20,mean=6,sd=0.7)
layout(matrix(c(1,2)))
par(fig=c(0,1,0.15,1))
dotchart(a,labels=paste("experiment",20:1),xlim=c(0,12),
main="PAX6 expression",pch=19)
abline(v=6.13,col="red")
par(fig=c(0,1,0,0.2),mar=c(2,7.65,0.1,2), new=TRUE)
hist(a,xlim=c(0,12),labels = F,main="", col="red",border="white")
par(old.par)
```
### Describing the central tendency: mean and median
As seen in the figure above, the points from this sample are distributed around
a central value and the histogram below the dot plot shows number of points in
each bin. Another observation is that there are some bins that have more points than others. If we want to summarize what we observe, we can try
to represent the collection of data points
with an expression value that is typical to get, something that represents the
general tendency we observe on the dot plot and the histogram. This value is
sometimes called central
value or central tendency, and there are different ways to calculate such a value.
In the figure above, we see that all the values are spread around 6.13 (red line),
and that is indeed what we call mean value of this sample of expression values.
It can be calculated with the following formula $\overline{X}=\sum_{i=1}^n x_i/n$,
where $x_i$ is the expression value of an experiment and $n$ is the number of
expression value obtained from the experiments. In R, `mean()` function will calculate the
mean of a provided vector of numbers. This is called a "sample mean". In reality
the possible values of PAX6 expression for all cells (provided each cell is of the
identical cell type and is in identical conditions) are much much more than 20.
If we had the time and the funding to sample all cells and measure PAX6 expression we would
get a collection values that would be called, in statistics, a "population". In
our case the population will look like the left hand side of the figure below. What we have done with
our 20 data points is that we took a sample of PAX6 expression values from this
population, and calculated the sample mean.
```{r,out.width='75%',fig.width=6.5,echo=FALSE,warning=FALSE,fig.cap="Expression of all possible PAX6 gene expressions measures on all available biological samples (left). Expression of PAX6 gene from statistical sample, a random subset, from the population of biological samples (Right). "}
df=data.frame(x=rnorm(10000,6,0.7))
old.par <- par()
layout(matrix(1:4,ncol=2),heights = c(1,0.5))
par(mar=c( 0, 4.1, 4.1, 2.1))
plot(df[,1],1:nrow(df),pch=19,cex=0.2,yaxt="n",ylab="",xlim=c(2,10),col="blue",
xaxt="n",xlab="",main="Population")
par(mar=c( 5.1, 4.1, 0, 2.1))
hist(df[,1],xlim=c(2,10),col="blue",border="white",main="",
xlab="PAX6 expression values")
par(mar=c( 0, 4.1, 4.1, 2.1))
plot(a,1:length(a),pch=19,cex=0.7,yaxt="n",ylab="",xlim=c(2,10),col="red",
xaxt="n",xlab="",main="Sample")
par(mar=c( 5.1, 4.1, 0, 2.1))
hist(a,xlim=c(2,10),col="red",border="white",main="",
xlab="PAX6 expression values")
par(old.par)
```
The mean of the population is calculated the same way but traditionally
Greek letter $\mu$ is used to denote the population mean. Normally, we would not
have access to the population and we will use sample mean and other quantities
derived from the sample to estimate the population properties. This is the basic
idea behind statistical inference which we will see this in action in later
sections as well. We
estimate the population parameters from the sample parameters and there is some
uncertainty associated with those estimates. We will be trying to assess those
uncertainties and make decisions in the presence of those uncertainties.
We are not yet done with measuring central tendency.
There are other ways to describe it, such as the median value.
Mean can be affected by outliers easily.
If certain values are very high or low from the
bulk of the sample this will shift mean towards those outliers. However, median
is not affected by outliers. It is simply the value in a distribution where half
of the values are above and the other half is below. In R, `median()` function
will calculate the mean of a provided vector of numbers.
Let's create a set of random numbers and calculate their mean and median using
R.
```{r}
#create 10 random numbers from uniform distribution
x=runif(10)
# calculate mean
mean(x)
# calculate median
median(x)
```
### Describing the spread: measurements of variation
Another useful way to summarize a collection of data points is to measure
how variable the values are. You can simply describe the range of the values
, such as minimum and maximum values. You can easily do that in R with `range()`
function. A more common way to calculate variation is by calculating something
called "standard deviation" or the related quantity called "variance". This is a
quantity that shows how variable the values are, a value around zero indicates
there is not much variation in the values of the data points, and a high value
indicates high variation in the values. The variance is the squared distance of
data points from the mean. Population variance is again a quantity we usually
do not have access to and is simply calculate as follows $\sigma^2=\sum_{i=1}^n \frac{(x_i-\mu)^2}{n}$, where $\mu$ is the population mean, $x_i$ is the ith
data point in the population and $n$ is the population size. However, when the
we have only access to a sample this formulation is biased. It means that it
underestimates the population variance, so we make a small adjustment when we
calculate the sample variance, denoted as $s^2$:
$$
\begin{aligned}
s^2=\sum_{i=1}^n \frac{(x_i-\overline{X})^2}{n-1} && \text{ where $x_i$ is the ith data point and
$\overline{X}$ is the sample mean.}
\end{aligned}
$$
The sample standard deviation is simply the square-root of the sample variance.
The good thing about standard deviation is that it has the same unit as the mean
so it is more intuitive.
$$s=\sqrt{\sum_{i=1}^n \frac{(x_i-\overline{X})^2}{n-1}}$$
We can calculate sample standard deviation and variation with `sd()` and `var()`
functions in R. These functions take vector of numeric values as input and
calculate the desired quantities. Below we use those functions on a randomly
generated vector of numbers.
```{r}
x=rnorm(20,mean=6,sd=0.7)
var(x)
sd(x)
```
One potential problem with the variance is that it could be affected by
outliers. The points that are too far away from the mean will have a large
affect on the variance even though there might be few of them.
A way to measure variance that could be less affected by outliers is
looking at where bulk of the distribution is. How do we define where the bulk is?
One common way is to look at the the difference between 75th percentile and 25th
percentile, this effectively removes a lot of potential outliers which will be
towards the edges of the range of values.
This is called interquartile range , and
can be easily calculated using R via `IQR()` function and the quantiles of a vector
is calculated with `quantile()` function.
Let us plot the boxplot for a random vector and also calculate IQR using R.
In the boxplot below, 25th and 75th percentiles are the edges of the box, and
the median is marked with a thick line going through roughly middle the box.
```{r}
x=rnorm(20,mean=6,sd=0.7)
IQR(x)
quantile(x)
```
```{r,eval=FALSE}
boxplot(x,horizontal = T)
```
```{r,out.width='60%',echo=FALSE,warnings=FALSE,message=FALSE,fig.cap="Boxplot showing 25th percentile and 75th percentile and median for a set of points sample from a normal distribution with mean=6 and standard deviation=0.7"}
a=quantile(x)[c(2:4)]
boxplot(x,horizontal = T)
text(a[1],1.25,"25th percentile")
text(a[3],1.25,"75th percentile")
```
#### Frequently used statistical distributions
The distributions have parameters (such as mean and variance) that
summarizes them but also they are functions that assigns each outcome of a
statistical experiment to its probability of occurrence.
One distribution that you
will frequently encounter is the normal distribution or Gaussian distribution.
The normal distribution has a typical "bell-curve" shape
and, characterized by mean and standard deviation. A set of data points
that
follow normal distribution mostly will be close to the mean
but spread around it controlled by the standard deviation parameter. That
means if we sample data points from a normal distribution we are more
likely to sample nearby the mean and sometimes away from the mean.
Probability of an event occurring is higher if it is nearby the mean.
The effect
of the parameters for normal distribution can be observed in the following
plot.
```{r,echo=FALSE,out.width='60%',fig.cap="Different parameters for normal distribution and effect of those on the shape of the distribution"}
plot(function(x) dnorm(x,0,0.5), -5,5,
main = "",col="red",lwd=2,ylab="P(x)")
curve(dnorm(x,0,1),add=TRUE,col="blue",lwd=2)
curve(dnorm(x,0,2),add=TRUE,col="green",lwd=2)
curve(dnorm(x,-2,1),add=TRUE,col="yellow",lwd=2)
legend("topright",c(expression(paste(mu,"=0, ",sigma,"=0.5")),
expression(paste(mu,"=0, ",sigma,"=1")),
expression(paste(mu,"=0, ",sigma,"=2")),
expression(paste(mu,"=-2, ",sigma,"=1"))),
col=c("red","blue","green","yellow"),lwd=3,
bty="n")
```
The normal distribution is often denoted by $\mathcal{N}(\mu,\,\sigma^2)$ When a random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$, we write:
$$X\ \sim\ \mathcal{N}(\mu,\,\sigma^2).$$
The probability
density function of Normal distribution with mean $\mu$ and standard deviation
$\sigma$ is as follows
$$P(x)=\frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} } $$
The probability density function gives the probability of observing a value
on a normal distribution defined by $\mu$ and
$\sigma$ parameters.
Often times, we do not need the exact probability of a value but we need the
probability of observing a value larger or smaller than a critical value or reference
point. For example, we might want to know the probability of $X$ being smaller than or
equal to -2 for a normal distribution with mean 0 and standard deviation 2.
,$P(X <= -2 \; | \; \mu=0,\sigma=2)$. In this case, what we want is the are under the
curve shaded in blue. To be able to that we need to integrate the probability
density function but we will usually let software do that. Traditionally,
one calculates a Z-score which is simply $(X-\mu)/\sigma=(-2-0)/2= -1$, and
corresponds to how many standard deviations you are away from the mean.
This is also called "standardization", the corresponding value is distributed in "standard normal distribution" where $\mathcal{N}(0,\,1)$.
After calculating the Z-score,
we can go look up in a table, that contains the area under the curve for
the left and right side of the Z-score, but again we use software for that
tables are outdated.
Below we are showing the Z-score and the associated probabilities derived
from the calculation above for $P(X <= -2 \; | \; \mu=0,\sigma=2)$.
```{r zscore,echo=FALSE,message=FALSE,out.width='60%',fig.cap='Z-score and associated probabilities for Z= -1'}
require(mosaic)
xpnorm(c(-2), mean=0, sd=2,lower.tail = TRUE,invisible=T,verbose=FALSE)
```
In R, family of `*norm` functions (`rnorm`,`dnorm`,`qnorm` and `pnorm`) can
be used to
operate with normal distribution, such as calculating probabilities and
generating random numbers drawn from normal distribution.
```{r}
# get the value of probability density function when X= -2,
# where mean=0 and sd=2
dnorm(-2, mean=0, sd=2)
# get the probability of P(X =< -2) where mean=0 and sd=2
pnorm(-2, mean=0, sd=2)
# get the probability of P(X > -2) where mean=0 and sd=2
pnorm(-2, mean=0, sd=2,lower.tail = FALSE)
# get 5 random numbers from normal dist with mean=0 and sd=2
rnorm(5, mean=0 , sd=2)
# get y value corresponding to P(X > y) = 0.15 with mean=0 and sd=2
qnorm( 0.15, mean=0 , sd=2)
```
There are many other distribution functions in R that can be used the same
way. You have to enter the distribution specific parameters along
with your critical value, quantiles or number of random numbers depending
on which function you are using in the family.We will list some of those functions below.
- `dbinom` is for binomial distribution. This distribution is usually used
to model fractional data and binary data. Examples from genomics includes
methylation data.
- `dpois` is used for Poisson distribution and `dnbinom` is used for
negative binomial distribution. These distributions are used to model count
data such as sequencing read counts.
- `df` (F distribution) and `dchisq` (Chi-Squared distribution) are used
in relation to distribution of variation. F distribution is used to model
ratios of variation and Chi-Squared distribution is used to model
distribution of variations. You will frequently encounter these in linear models and generalized linear models.
### Precision of estimates: Confidence intervals
When we take a random sample from a population and compute a statistic, such as
the mean, we are trying to approximate the mean of the population. How well this
sample statistic estimates the population value will always be a
concern. A confidence interval addresses this concern because it provides a
range of values which is plausible to contain the population parameter of interest.
Normally, we would not have access to a population. If we did, we would not have to estimate the population parameters and its precision.
When we do not have access
to the population, one way to estimate intervals is to repeatedly take samples from the
original sample with replacement, that is we take a data point from the sample
we replace, and we take another data point until we have sample size of the
original sample. Then, we calculate the parameter of interest, in this case mean, and
repeat this step a large number of times, such as 1000. At this point, we would have a distribution of re-sampled
means, we can then calculate the 2.5th and 97.5th percentiles and these will
be our so-called 95% confidence interval. This procedure, resampling with replacement to
estimate the precision of population parameter estimates, is known as the __bootstrap__.
Let's see how we can do this in practice. We simulate a sample
coming from a normal distribution (but we pretend we don't know the
population parameters). We will try to estimate the precision
of the mean of the sample using bootstrap to build confidence intervals.
```{r,out.width='70%',fig.cap="Precision estimate of the sample mean using 1000 bootstrap samples. Confidence intervals derived from the bootstrap samples are shown with red lines."}
library(mosaic)
set.seed(21)
sample1= rnorm(50,20,5) # simulate a sample
# do bootstrap resampling, sampling with replacement
boot.means=do(1000) * mean(resample(sample1))
# get percentiles from the bootstrap means
q=quantile(boot.means[,1],p=c(0.025,0.975))
# plot the histogram
hist(boot.means[,1],col="cornflowerblue",border="white",
xlab="sample means")
abline(v=c(q[1], q[2] ),col="red")
text(x=q[1],y=200,round(q[1],3),adj=c(1,0))
text(x=q[2],y=200,round(q[2],3),adj=c(0,0))
```
If we had a convenient mathematical method to calculate confidence interval
we could also do without resampling methods. It turns out that if we take
repeated
samples from a population of with sample size $n$, the distribution of means
( $\overline{X}$) of those samples
will be approximately normal with mean $\mu$ and standard deviation
$\sigma/\sqrt{n}$. This is also known as __Central Limit Theorem(CLT)__ and
is one of the most important theorems in statistics. This also means that
$\frac{\overline{X}-\mu}{\sigma\sqrt{n}}$ has a standard normal
distribution and we can calculate the Z-score and then we can get
the percentiles associated with the Z-score. Below, we are showing the
Z-score
calculation for the distribution of $\overline{X}$, and then
we are deriving the confidence intervals starting with the fact that
probability of Z being between -1.96 and 1.96 is 0.95. We then use algebra
to show that the probability that unknown $\mu$ is captured between
$\overline{X}-1.96\sigma/\sqrt{n}$ and $\overline{X}+1.96\sigma/\sqrt{n}$ is 0.95, which is commonly known as 95% confidence interval.
$$\begin{array}{ccc}
Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\\
P(-1.96 < Z < 1.96)=0.95 \\
P(-1.96 < \frac{\overline{X}-\mu}{\sigma/\sqrt{n}} < 1.96)=0.95\\
P(\mu-1.96\sigma/\sqrt{n} < \overline{X} < \mu+1.96\sigma/\sqrt{n})=0.95\\
P(\overline{X}-1.96\sigma/\sqrt{n} < \mu < \overline{X}+1.96\sigma/\sqrt{n})=0.95\\
confint=[\overline{X}-1.96\sigma/\sqrt{n},\overline{X}+1.96\sigma/\sqrt{n}]
\end{array}$$
A 95% confidence interval for population mean is the most common
common interval to use, and would
mean that we would expect 95% of the interval estimates to include the
population parameter, in this case the mean. However, we can pick any value
such as 99% or 90%. We can generalize the confidence interval for
$100(1-\alpha)$ as follows:
$$\overline{X} \pm Z_{\alpha/2}\sigma/\sqrt{n}$$
In R, we can do this using `qnorm()` function to get Z-scores associated
with ${\alpha/2}$ and ${1-\alpha/2}$. As you can see, the confidence intervals we calculated using CLT are very
similar to the ones we got from bootstrap for the same sample. For bootstrap we got $[19.21, 21.989]$ and for the CLT based estimate we got $[19.23638, 22.00819]$.
```{r}
alpha=0.05
sd=5
n=50
mean(sample1)+qnorm(c(alpha/2,1-alpha/2))*sd/sqrt(n)
```
The good thing about CLT as long as the sample size is large regardless of
the population distribution, the distribution of sample means drawn from
that population will always be normal. Here we are repeatedly
drawing samples 1000 times with sample size $n$=10,30, and 100 from a bimodal,
exponential and a uniform distribution and we are getting sample mean distributions
following normal distribution.
```{r,echo=FALSE,message=FALSE,warning=FALSE,fig.cap="Sample means are normally distributed regardless of the population distribution they are drawn from."}
set.seed(101)
#require(mosaic)
par(mfcol=c(4,3))
par(mar=c(5.1-2,4.1-1,4.1,2.1-2))
d=c(rnorm(1000,mean=10,sd=8),rnorm(1000,mean=40,sd=8))
hist(d,main="",
col="black",border="white",breaks=20,xlab="",ylab=""
)
abline(v=mean(d),col="red")
mtext(expression(paste(mu,"=24.8")),cex=0.6)
mtext("bimodal",cex=0.8,line=1)
bimod10=rowMeans(do(1000)*c(rnorm(5,mean=10,sd=8),rnorm(5,mean=40,sd=8)))
bimod30=rowMeans(do(1000)*c(rnorm(15,mean=10,sd=8),rnorm(15,mean=40,sd=8)))
bimod100=rowMeans(do(1000)*c(rnorm(50,mean=10,sd=8),rnorm(50,mean=40,sd=8)))
hist(bimod10,xlim=c(17,33),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
mtext("n=10",side=2,cex=0.8,line=2)
hist(bimod30,xlim=c(17,33),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
mtext("n=30",side=2,cex=0.8,line=2)
hist(bimod100,xlim=c(17,33),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
mtext("n=100",side=2,cex=0.8,line=2)
d=rexp(1000)
hist(d,main="",
col="black",border="white",breaks=20,xlab="",ylab=""
)
abline(v=mean(d),col="red")
mtext(expression(paste(mu,"=1")),cex=0.6)
mtext("exponential",cex=0.8,line=1)
mtext("Distributions of different populations",line=2)
exp10 =rowMeans(do(2000)*rexp(10))
exp30 =rowMeans(do(2000)*rexp(30))
exp100=rowMeans(do(2000)*rexp(100))
hist(exp10,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
mtext("Sampling distribution of sample means",line=2)
hist(exp30,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
hist(exp100,xlim=c(0,2),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
d=runif(1000)
hist(d,main="",
col="black",border="white",breaks=20,xlab="",ylab=""
)
abline(v=mean(d),col="red")
mtext(expression(paste(mu,"=0.5")),cex=0.6)
mtext("uniform",cex=0.8,line=1)
unif10 =rowMeans(do(1000)*runif(10))
unif30 =rowMeans(do(1000)*runif(30))
unif100=rowMeans(do(1000)*runif(100))
hist(unif10,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
hist(unif30,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
hist(unif100,xlim=c(0,1),main="",xlab="",ylab="",breaks=20,col="gray",
border="gray")
```
However, we should note that how we constructed the confidence interval
using standard normal distribution, $N(0,1)$, only works when the when we know the
population standard deviation. In reality, we usually have only access
to a sample and have no idea about the population standard deviation. If
this is the case we should use estimate the standard deviation using
sample standard deviation and use something called _t distribution_ instead
of standard normal distribution in our interval calculation. Our confidence interval becomes
$\overline{X} \pm t_{\alpha/2}s/\sqrt{n}$, with t distribution
parameter $d.f=n-1$, since now the following quantity is t distributed $\frac{\overline{X}-\mu}{s/\sqrt{n}}$ instead of standard normal distribution.
The t distribution is similar to standard normal distribution has mean 0 but its spread is larger than the normal distribution
especially when sample size is small, and has one parameter $v$ for
the degrees of freedom, which is $n-1$ in this case. Degrees of freedom
is simply number of data points minus number of parameters estimated. Here
we are estimating the mean from the data and the distribution is for the means, therefore degrees of freedom is $n-1$.
```{r,echo=FALSE,warning=FALSE,message=FALSE,out.width='60%',fig.cap="Normal distribution and t distribution with different degrees of freedom. With increasing degrees of freedom, t distribution approximates the normal distribution better."}
plot(function(x) dnorm(x,0,1), -4,4,
main = "",col="red",lwd=2,ylab="P(x)")
curve(dt(x,1),add=TRUE,col="orange",lwd=2)
curve(dt(x,3),add=TRUE,col="green",lwd=2)
curve(dt(x,10),add=TRUE,col="blue",lwd=2)
legend("topright",c(expression(paste("N(",mu,"=0, ",sigma,"=1)")),
expression(paste(v,"=1")),
expression(paste(v,"=3")),
expression(paste(v,"=10"))),
col=c("red","orange","green","blue"),lwd=3,
bty="n")
```
## How to test for differences between samples
Often times we would want to compare sets of samples. Such comparisons include
if wild-type samples have different expression compared to mutants or if healthy
samples are different from disease samples in some measurable feature (blood count,
gene expression, methylation of certain loci). Since there is variability in our
measurements, we need to take that into account when comparing the sets of samples.
We can simply subtract the means of two samples, but given the variability
of sampling, at the very least we need to decide a cutoff value for differences
of means, small differences of means can be explained by random chance due to
sampling. That means we need to compare the difference we get to a value that
is typical to get if the difference between two group means were only due to
sampling. If you followed the logic above, here we actually introduced two core
ideas of something called "hypothesis testing", this is simply using
statistics to
determine the probability that a given hypothesis (if two sample sets
are from the same population or not) is true. Formally, those two core
ideas are as follows:
1. Decide on a hypothesis to test, often called "null hypothesis" ($H_0$). In our
case, the hypothesis is there is no difference between sets of samples. An the "Alternative hypothesis" ($H_1$) is there is a difference between the
samples.
2. Decide on a statistic to test the truth of the null hypothesis.
3. Calculate the statistic
4. Compare it to a reference value to establish significance, the P-value. Based on that either reject or not reject the null hypothesis, $H_0$
### randomization based testing for difference of the means
There is one intuitive way to go about this. If we believe there are no
differences between samples that means the sample labels (test-control or
healthy-disease) has no meaning. So, if we randomly assign labels to the
samples
that and calculate the difference of the mean, this creates a null
distribution for the $H_0$ where we can compare the real difference and
measure how unlikely it is to get such a value under the expectation of the
null hypothesis. We can calculate all possible permutations to calculate
the null distribution. However, sometimes that is not very feasible and
equivalent approach would be generating the null distribution by taking a
smaller number of random samples with shuffled group membership.
Below, we are doing this process in R. We are first simulating two samples
from two different distributions.
These would be equivalent to gene expression measurements obtained under
different conditions. Then, we calculate the differences in the means
and do the randomization procedure to get a null distribution when we
assume there is no difference between samples, $H_0$. We then calculate how
often we would get the original difference we calculated under the
assumption that $H_0$ is true.
```{r,out.width='60%',fig.cap="The null distribution for differences of means obtained via randomization. The original difference is marked via blue line. The red line marks the value that corresponds to P-value of 0.05"}
set.seed(100)
gene1=rnorm(30,mean=4,sd=2)
gene2=rnorm(30,mean=2,sd=2)
org.diff=mean(gene1)-mean(gene2)
gene.df=data.frame(exp=c(gene1,gene2),
group=c( rep("test",30),rep("control",30) ) )
exp.null <- do(1000) * diff(mosaic::mean(exp ~ shuffle(group), data=gene.df))
hist(exp.null[,1],xlab="null distribution | no difference in samples",
main=expression(paste(H[0]," :no difference in means") ),
xlim=c(-2,2),col="cornflowerblue",border="white")
abline(v=quantile(exp.null[,1],0.95),col="red" )
abline(v=org.diff,col="blue" )
text(x=quantile(exp.null[,1],0.95),y=200,"0.05",adj=c(1,0),col="red")
text(x=org.diff,y=200,"org. diff.",adj=c(1,0),col="blue")
p.val=sum(exp.null[,1]>org.diff)/length(exp.null[,1])
p.val
```
After doing random permutations and getting a null distribution, it is possible to get a confidence interval for the distribution of difference in means.
This is simply the 2.5th and 97.5th percentiles of the null distribution, and
directly related to the P-value calculation above.
### Using t-test for difference of the means between two samples
We can also calculate the difference between means using a t-test. Sometimes we will have too few data points in a sample to do meaningful
randomization test, also randomization takes more time than doing a t-test.
This is a test that depends on the t distribution. The line of thought follows
from the CLT and we can show differences in means are t distributed.
There are couple of variants of the t-test for this purpose. If we assume
the variances are equal we can use the following version
$$t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
where
$$s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}}$$
In the first equation above the quantity is t distributed with $n_1+n_2-2$ degrees of freedom. We can calculate the quantity then use software
to look for the percentile of that value in that t distribution, which is our P-value. When we can not assume equal variances we use "Welch's t-test"
which is the default t-test in R and also works well when variances and
the sample sizes are the same. For this test we calculate the following
quantity:
$$t = {\overline{X}_1 - \overline{X}_2 \over s_{\overline{X}_1 - \overline{X}_2}}$$
where
$$s_{\overline{X}_1 - \overline{X}_2} = \sqrt{{s_1^2 \over n_1} + {s_2^2 \over n_2}}
$$
and the degrees of freedom equals to:
$$\mathrm{d.f.} = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)}
$$
Luckily, R does all those calculations for us. Below we will show the use of `t.test()` function in R. We will use it on the samples we simulated
above.
```{r,welchTtest}
# Welch's t-test
stats::t.test(gene1,gene2)
# t-test with equal varience assumption
stats::t.test(gene1,gene2,var.equal=TRUE)
```
A final word on t-tests: they generally assume population where samples coming
from have normal
distribution, however it is been shown t-test can tolerate deviations from
normality. Especially, when two distributions are moderately skewed in the
same direction. This is due to central limit theorem which says means of
samples will be distributed normally no matter the population distribution
if sample sizes are large.
### multiple testing correction
We should think of hypothesis testing as a non-error-free method of making
decisions. There will be times when we declare something significant and accept
$H_1$ but we will be wrong.
These decisions are also called "false positives" or "false discoveries", this
is also known as "type I error". Similarly, we can fail to reject a hypothesis
when we actually should. These cases are known as "false negatives", also known
as "type II error".
The ratio of true negatives to the sum of
true negatives and false positives ($\frac{TN}{FP+TN}$) is known as specificity.
And we usually want to decrease the FP and get higher specificity.
The ratio of true positives to the sum of
true positives and false negatives ($\frac{TP}{TP+FN}$) is known as sensitivity.
And, again we usually want to decrease the FN and get higher sensitivity.
Sensitivity is also known as "power of a test" in the context of hypothesis
testing. More powerful tests will be highly sensitive and will do less type
II errors. For the t-test the power is positively associated with sample size
and the effect size. Higher the sample size, smaller the standard error and
looking for the larger effect sizes will similarly increase the power.
The general summary of these the different combination of the decisions are
included in the table below.
-------------------------------------------------------------
$H_0$ is $H_1$ is
TRUE, TRUE,
[Gene is NOT [Gene is
differentially differentially
expressed] expressed]
--------------- -------------------- -------------------- -------------------------
Accept $H_0$ True Negatives (TN) False Negatives (FN) $m_0$: number of truly
(claim that ,type II error null hypotheses
the gene is not
differentially
expressed)
reject $H_0$ False Positives (FP) True Positives (TP) $m-m_0$: number of
(claim that ,type I error truly alternative
the gene is hypotheses
differentially
expressed)
-------------------------------------------------------------
We expect to make more type I errors as the number of tests increase, that
means we will reject the null hypothesis by mistake. For example, if we
perform a test the 5% significance level, there is a 5% chance of
incorrectly rejecting the null hypothesis if the null hypothesis is true.
However, if we make 1000 tests where all null hypotheses are true for
each of them, the average number of incorrect rejections is 50. And if we
apply the rules of probability, there are is almost a 100% chance that
we will have at least one incorrect rejection.
There are multiple statistical techniques to prevent this from happening.
These techniques generally shrink the P-values obtained from multiple
tests to higher values, if the individual P-value is low enough it survives
this process. The most simple method is just to multiply the individual,
P-value ($p_i$) with the number of tests ($m$): $m \cdot p_i$, this is
called "Bonferroni correction". However, this is too harsh if you have thousands
of tests. Other methods are developed to remedy this. Those methods
rely on ranking the P-values and dividing $m \cdot p_i$ by the
rank,$i$, :$\frac{m \cdot p_i }{i}$, this is derived from Benjamini–Hochberg
procedure. This procedure is developed to control for "False Discovery Rate (FDR)"
, which is proportion of false positives among all significant tests. And in
practical terms, we get the "FDR adjusted P-value" from the procedure described
above. This gives us an estimate of proportion of false discoveries for a given
test. To elaborate, p-value of 0.05 implies that 5% of all tests will be false positives. An FDR adjusted p-value of 0.05 implies that 5% of significant tests will be false positives. The FDR adjusted P-values will result in a lower number of false positives.
One final method that is also popular is called the "q-value"
method and related to the method above. This procedure relies on estimating the proportion of true null
hypotheses from the distribution of raw p-values and using that quantity
to come up with what is called a "q-value", which is also an FDR adjusted P-value . That can be practically defined
as "the proportion of significant features that turn out to be false
leads." A q-value 0.01 would mean 1% of the tests called significant at this
level will be truly null on average. Within the genomics community
q-value and FDR adjusted P-value are synonymous although they can be
calculated differently.
In R, the base function `p.adjust()` implements most of the p-value correction
methods described above. For the q-value, we can use the `qvalue` package from
Bioconductor. Below we are demonstrating how to use them on a set of simulated
p-values.The plot shows that Bonferroni correction does a terrible job. FDR(BH) and q-value
approach are better but q-value approach is more permissive than FDR(BH).
```{r,multtest,out.width='60%',fig.cap="Adjusted P-values via different methods and their relationship to raw P-values"}
library(qvalue)
data(hedenfalk)
qvalues <- qvalue(hedenfalk$p)$q
bonf.pval=p.adjust(hedenfalk$p,method ="bonferroni")
fdr.adj.pval=p.adjust(hedenfalk$p,method ="fdr")
plot(hedenfalk$p,qvalues,pch=19,ylim=c(0,1),
xlab="raw P-values",ylab="adjusted P-values")
points(hedenfalk$p,bonf.pval,pch=19,col="red")
points(hedenfalk$p,fdr.adj.pval,pch=19,col="blue")
legend("bottomright",legend=c("q-value","FDR (BH)","Bonferroni"),
fill=c("black","blue","red"))
```
### moderated t-tests: using information from multiple comparisons
In genomics, we usually do not do one test but many, as described above. That means we
may be able to use the information from the parameters obtained from all
comparisons to influence the individual parameters. For example, if you have many variances
calculated for thousands of genes across samples, you can force individual
variance estimates to shrunk towards the mean or the median of the distribution
of variances. This usually creates better performance in individual variance
estimates and therefore better performance in significance testing which
depends on variance estimates. How much the values be shrunk towards a common
value comes in many flavors. These tests in general are called moderated
t-tests or shrinkage t-tests. One approach popularized by Limma software is
to use so-called "Empirical Bayesian methods". The main formulation in these
methods is $\hat{V_g} = aV_0 + bV_g$, where $V_0$ is the background variability
and $V_g$ is the individual variability. Then, these methods estimate $a$ and $b$
in various ways to come up with shrunk version of variability, $\hat{V_g}$. In a Bayesian viewpoint,
the prior knowledge is used to calculate the variability of an individual gene. In this
case, $V_0$ would be the prior knowledge we have on variability of
the genes and we
use that knowledge to influence our estimate for the individual genes.
Below we are simulating a gene expression matrix with 1000 genes, and 3 test
and 3 control groups. Each row is a gene and in normal circumstances we would
like to find out differentially expressed genes. In this case, we are simulating
them from the same distribution so in reality we do not expect any differences.
We then use the adjusted standard error estimates in empirical Bayesian spirit but
in a very crude way. We just shrink the gene-wise standard error estimates towards the median with equal $a$ and $b$ weights. That is to say, we add individual estimate to the
median of standard error distribution from all genes and divide that quantity by 2.
So if we plug that in the to the above formula what we do is:
$$ \hat{V_g} = (V_0 + V_g)/2 $$
In the code below, we are avoiding for loops or apply family functions
by using vectorized operations.
```{r, out.width='60%',fig.width=8,fig.cap="The distributions of P-values obtained by t-tests and moderated t-tests"}
set.seed(100)
#sample data matrix from normal distribution
gset=rnorm(3000,mean=200,sd=70)
data=matrix(gset,ncol=6)
# set groups
group1=1:3
group2=4:6
n1=3
n2=3
dx=rowMeans(data[,group1])-rowMeans(data[,group2])
require(matrixStats)
# get the esimate of pooled variance
stderr <- sqrt( (rowVars(data[,group1])*(n1-1) + rowVars(data[,group2])*(n2-1)) / (n1+n2-2) * ( 1/n1 + 1/n2 ))
# do the shrinking towards median
mod.stderr <- (stderr + median(stderr)) / 2 # moderation in variation
# esimate t statistic with moderated variance
t.mod = dx / mod.stderr
# calculate P-value of rejecting null
p.mod = 2*pt( -abs(t.mod), n1+n2-2 )
# esimate t statistic without moderated variance
t = dx / stderr
# calculate P-value of rejecting null
p = 2*pt( -abs(t), n1+n2-2 )
par(mfrow=c(1,2))
hist(p,col="cornflowerblue",border="white",main="",xlab="P-values t-test")
mtext(paste("signifcant tests:",sum(p<0.05)) )
hist(p.mod,col="cornflowerblue",border="white",main="",xlab="P-values mod. t-test")
mtext(paste("signifcant tests:",sum(p.mod<0.05)) )
```
```{block2, note-text3, type='rmdtip'}
__Want to know more ?__
- basic statistical concepts
- "Cartoon guide to statistics" by Gonick & Smith
- "Introduction to statistics" by Mine Rundel, et al. (Free e-book)
- Hands-on statistics recipes with R
- "The R book" by Crawley
- moderated tests
- comparison of moderated tests for differential expression http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-17
- limma method: Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf
```
## Relationship between variables: linear models and correlation
In genomics, we would often need to measure or model the relationship between
variables. We might want to know about expression of a particular gene in liver
in relation to the dosage of a drug that patient receives. Or, we may want to know
DNA methylation of certain locus in the genome in relation to age of the sample
donor's. Or, we might be interested in the relationship between histone
modifications and gene expression. Is there a linear relationship, the more
histone modification the more the gene is expressed ?
In these
situations and many more, linear regression or linear models can be used to
model the relationship with a "dependent" or "response" variable (expression or
methylation
in the above examples) and one or more "independent"" or "explanatory" variables (age, drug dosage or histone modification in the above examples). Our simple linear model has the
following components.
$$ Y= \beta_0+\beta_1X + \epsilon $$
In the equation above, $Y$ is the response variable and $X$ is the explanatory
variable. $\epsilon$ is the mean-zero error term. Since, the line fit will not
be able to precisely predict the $Y$ values, there will be some error associated
with each prediction when we compare it to the original $Y$ values. This error
is captured in $\epsilon$ term. We can alternatively write the model as
follows to emphasize that the model approximates $Y$, in this case notice that we removed the $\epsilon$ term: $Y \sim \beta_0+\beta_1X$
The graph below shows the relationship between
histone modification (trimethylated forms of histone H3 at lysine 4, aka H3K4me3)
and gene expression for 100 genes. The blue line is our model with estimated
coefficients ($\hat{y}=\hat{\beta}_0 + \hat{\beta}_1X$, where $\hat{\beta}_0$
and $\hat{\beta}_1$ the estimated values of $\beta_0$ and
$\beta_1$, and $\hat{y}$ indicates the prediction). The red lines indicate the individual
errors per data point, indicated as $\epsilon$ in the formula above.
```{r,echo=FALSE,warning=FALSE,message=FALSE,error=FALSE,results='hide',out.width='60%',fig.cap="Relationship between histone modification score and gene expression. Increasing histone modification, H3K4me3, seems to be associated with increasing gene expression. Each dot is a gene"}
set.seed(31)
x1 <- runif(100,10,200)
b0 <- 17
b1 <- 0.5
sigma <- 15
eps <- rnorm(100,0,sigma)
y <- b0 + b1*x1 + eps
plot(x1,y,ylim=c(0,160),xlim=c(0,220),pch=20,
ylab="Gene Expression",xlab="Histone modification score")
mod1=lm(y~x1)
abline(mod1,col="blue")
# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1) # plot distances between points and the regression line
segments(x1, y, x1, pre, col="red")
```
There could be more than one explanatory variable, we then simply add more $X$
and $\beta$ to our model. If there are two explanatory variables our model
will look like this:
$$ Y= \beta_0+\beta_1X_1 +\beta_2X_2 + \epsilon $$
In this case, we will be fitting a plane rather than a line. However, the fitting
process which we will describe in the later sections will not change. For our
gene expression problem. We can introduce one more histone modification, H3K27me3. We will then have a linear model with 2 explanatory variables and the
fitted plane will look like the one below. The gene expression values are shown
as dots below and above the fitted plane.
```{r,echo=FALSE,out.width='70%',warning=FALSE,message=FALSE,fig.cap="Association of Gene expression with H3K4me3 and H27Kme3 histone modifications."}
set.seed(32)
x2 <- runif(100,10,200)
b2 <- -0.3
sigma <- 15
eps <- rnorm(100,0,sigma)
y2 <- b0 + b1*x1 + b2*x2+ eps
library(plot3D)
mod1=lm(y2~x1+x2)
# predict on x-y grid, for surface
x1.pred <- seq(min(x1), max(x1), length.out = 30)
x2.pred <- seq(min(x2), max(x2), length.out = 30)
xy <- expand.grid(x1 = x1.pred,
x2= x2.pred)
y.pred <- matrix (nrow = 30, ncol = 30,
data = predict(mod1, newdata = data.frame(xy), interval = "prediction"))
# predicted z-values, fitted points for droplines to surface
fitpoints <- predict(mod1)
scatter3D(z = y2, x = x1, y = x2, pch = 19, cex = 0.4,colvar=sign(residuals(mod1)),
col = c("magenta","red"),
theta = 20, phi = 40, ticktype = "simple", bty= "g",
xlab = "H3K4me3", ylab = "H3K27me3",
zlab ="Gene exp." ,r=sqrt(5),
surf = list(x = x1.pred, y = x2.pred, z = y.pred,
facets = NA, fit = fitpoints,col="blue"),
colkey = FALSE)
```
#### Matrix notation for linear models
We can naturally have more explanatory variables than just two.The formula
below has $n$ explanatory variables.
$$Y= \beta_0+\beta_1X_1+\beta_2X_2 + \beta_3X_3 + .. + \beta_nX_n +\epsilon$$
If there are many variables, it would be easier
to write the model in matrix notation. The matrix form of linear model with
two explanatory variables will look like the one
below. First matrix would be our data matrix. This contains our explanatory
variables and a column of 1s. The second term is a column vector of $\beta$
values. We add a vector of error terms,$\epsilon$s to the matrix multiplication.
$$
\mathbf{Y} = \left[\begin{array}{rrr}
1 & X_{1,1} & X_{1,2} \\
1 & X_{2,1} & X_{2,2} \\
1 & X_{3,1} & X_{3,2} \\
1 & X_{4,1} & X_{4,2}
\end{array}\right]
%
\left[\begin{array}{rrr}
\beta_0 \\
\beta_1 \\
\beta_2
\end{array}\right]
%
+
\left[\begin{array}{rrr}
\epsilon_1 \\
\epsilon_2 \\
\epsilon_3 \\
\epsilon_0
\end{array}\right]
$$
The multiplication of data matrix and $\beta$ vector and addition of the
error terms simply results in the the following set of equations per data point:
$$
\begin{aligned}
Y_1= \beta_0+\beta_1X_{1,1}+\beta_2X_{1,2} +\epsilon_1 \\
Y_2= \beta_0+\beta_1X_{2,1}+\beta_2X_{2,2} +\epsilon_2 \\
Y_3= \beta_0+\beta_1X_{3,1}+\beta_2X_{3,2} +\epsilon_3 \\
Y_4= \beta_0+\beta_1X_{4,1}+\beta_2X_{4,2} +\epsilon_4
\end{aligned}
$$
This expression involving the multiplication of the data matrix, the
$\beta$ vector and vector of error terms ($\epsilon$)
could be simply written as follows.
$$Y=X\beta + \epsilon$$
In the equation above $Y$ is the vector of response variables and $X$ is the
data matrix and $\beta$ is the vector of coefficients.
This notation is more concise and often used in scientific papers. However, this
also means you need some understanding of linear algebra to follow the math
laid out in such resources.
### How to fit a line
At this point a major questions is left unanswered: How did we fit this line?
We basically need to define $\beta$ values in a structured way.
There are multiple ways or understanding how