This repository has been archived by the owner on Sep 18, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 369
/
block010_dplyr-end-single-table.html
808 lines (754 loc) · 41.9 KB
/
block010_dplyr-end-single-table.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>dplyr functions for a single dataset</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="libs/navigation-1.1/tabsets.js"></script>
<link href="libs/highlightjs-9.12.0/default.css" rel="stylesheet" />
<script src="libs/highlightjs-9.12.0/highlight.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68219208-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs) {
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
if (document.readyState && document.readyState === "complete") {
window.setTimeout(function() { hljs.initHighlighting(); }, 0);
}
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="libs/local/main.css" type="text/css" />
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
<link rel="stylesheet" href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<style type="text/css">
.tabset-dropdown > .nav-tabs {
display: inline-table;
max-height: 500px;
min-height: 44px;
overflow-y: auto;
background: white;
border: 1px solid #ddd;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs > li.active:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
content: "";
border: none;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs > li.active {
display: block;
}
.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
border: none;
display: inline-block;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
display: block;
float: none;
}
.tabset-dropdown > .nav-tabs > li {
display: none;
}
</style>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
$(document).ready(function () {
$('.tabset-dropdown > .nav-tabs > li').click(function () {
$(this).parent().toggleClass('nav-tabs-open')
});
});
</script>
<!-- code folding -->
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">dplyr functions for a single dataset</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#where-were-we">Where were we?</a></li>
<li><a href="#load-dplyr-and-gapminder">Load dplyr and gapminder</a></li>
<li><a href="#create-a-copy-of-gapminder">Create a copy of gapminder</a></li>
<li><a href="#use-mutate-to-add-new-variables">Use <code>mutate()</code> to add new variables</a></li>
<li><a href="#use-arrange-to-row-order-data-in-a-principled-way">Use <code>arrange()</code> to row-order data in a principled way</a></li>
<li><a href="#use-rename-to-rename-variables">Use <code>rename()</code> to rename variables</a></li>
<li><a href="#select-can-rename-and-reposition-variables"><code>select()</code> can rename and reposition variables</a></li>
<li><a href="#group_by-is-a-mighty-weapon"><code>group_by()</code> is a mighty weapon</a><ul>
<li><a href="#counting-things-up">Counting things up</a></li>
<li><a href="#general-summarization">General summarization</a></li>
</ul></li>
<li><a href="#grouped-mutate">Grouped mutate</a><ul>
<li><a href="#computing-with-group-wise-summaries">Computing with group-wise summaries</a></li>
<li><a href="#window-functions">Window functions</a></li>
</ul></li>
<li><a href="#grand-finale">Grand Finale</a></li>
<li><a href="#resources">Resources</a></li>
</ul>
</div>
<div id="where-were-we" class="section level3">
<h3>Where were we?</h3>
<p>In the <a href="block009_dplyr-intro.html">introduction to dplyr</a>, we used two very important verbs and an operator:</p>
<ul>
<li><code>filter()</code> for subsetting data with row logic</li>
<li><code>select()</code> for subsetting data variable- or column-wise</li>
<li>the pipe operator <code>%>%</code>, which feeds the LHS as the first argument to the expression on the RHS</li>
</ul>
<p>We also discussed dplyr’s role inside the tidyverse and tibbles:</p>
<ul>
<li>dplyr is a core package in the <a href="https://github.com/hadley/tidyverse">tidyverse</a> meta-package. Since we often make incidental usage of the others, we will load dplyr and the others via <code>library(tidyverse)</code>.</li>
<li>The tidyverse embraces a special flavor of data frame, called a tibble. The <code>gapminder</code> dataset is stored as a tibble.</li>
</ul>
</div>
<div id="load-dplyr-and-gapminder" class="section level3">
<h3>Load dplyr and gapminder</h3>
<p>I choose to load the tidyverse, which will load dplyr, among other packages we use incidentally below. Also load gapminder.</p>
<pre class="r"><code>library(gapminder)
library(tidyverse)</code></pre>
<pre><code>## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──</code></pre>
<pre><code>## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.99.9004 ✔ dplyr 0.7.99.9000
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.2.0 ✔ forcats 0.3.0</code></pre>
<pre><code>## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()</code></pre>
</div>
<div id="create-a-copy-of-gapminder" class="section level3">
<h3>Create a copy of gapminder</h3>
<p>We’re going to make changes to the <code>gapminder</code> tibble. To eliminate any fear that you’re damaging the data that comes with the package, we create an explicit copy of <code>gapminder</code> for our experiments.</p>
<pre class="r"><code>(my_gap <- gapminder)</code></pre>
<pre><code>## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows</code></pre>
<p><strong>Pay close attention</strong> to when we evaluate statements but let the output just print to screen:</p>
<pre class="r"><code>## let output print to screen, but do not store
my_gap %>% filter(country == "Canada")</code></pre>
<p>… versus when we assign the output to an object, possibly overwriting an existing object.</p>
<pre class="r"><code>## store the output as an R object
my_precious <- my_gap %>% filter(country == "Canada")</code></pre>
</div>
<div id="use-mutate-to-add-new-variables" class="section level3">
<h3>Use <code>mutate()</code> to add new variables</h3>
<p>Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together.</p>
<p><code>mutate()</code> is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.</p>
<pre class="r"><code>my_gap %>%
mutate(gdp = pop * gdpPercap)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # … with 1,694 more rows</code></pre>
<p>Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the <a href="http://fivethirtyeight.com/datalab/xkcd-randall-munroe-qanda-what-if/">advice of Randall Munroe of xkcd</a>:</p>
<blockquote>
<p>One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place."</p>
</blockquote>
<p>Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita, <em>relative to some benchmark country</em>. Since Canada is my adopted home, I’ll go with that.</p>
<p>I need to create a new variable that is <code>gdpPercap</code> divided by Canadian <code>gdpPercap</code>, taking care that I always divide two numbers that pertain to the same year.</p>
<p>How I achieve:</p>
<ul>
<li>Filter down to the rows for Canada.</li>
<li>Create a new temporary variable in <code>my_gap</code>:
<ul>
<li>Extract the <code>gdpPercap</code> variable from the Canadian data.</li>
<li>Replicate it once per country in the dataset, so it has the right length.</li>
</ul></li>
<li>Divide raw <code>gdpPercap</code> by this Canadian figure.</li>
<li>Discard the temporary variable of replicated Canadian <code>gdpPercap</code>.</li>
</ul>
<pre class="r"><code>ctib <- my_gap %>%
filter(country == "Canada")
## this is a semi-dangerous way to add this variable
## I'd prefer to join on year, but we haven't covered joins yet
my_gap <- my_gap %>%
mutate(tmp = rep(ctib$gdpPercap, nlevels(country)),
gdpPercapRel = gdpPercap / tmp,
tmp = NULL)</code></pre>
<p>Note that, <code>mutate()</code> builds new variables sequentially so you can reference earlier ones (like <code>tmp</code>) when defining later ones (like <code>gdpPercapRel</code>). Also, you can get rid of a variable by setting it to <code>NULL</code>.</p>
<p>How could we sanity check that this worked? The Canadian values for <code>gdpPercapRel</code> better all be 1!</p>
<pre class="r"><code>my_gap %>%
filter(country == "Canada") %>%
select(country, year, gdpPercapRel)</code></pre>
<pre><code>## # A tibble: 12 x 3
## country year gdpPercapRel
## <fct> <int> <dbl>
## 1 Canada 1952 1
## 2 Canada 1957 1
## 3 Canada 1962 1
## 4 Canada 1967 1
## 5 Canada 1972 1
## 6 Canada 1977 1
## 7 Canada 1982 1
## 8 Canada 1987 1
## 9 Canada 1992 1
## 10 Canada 1997 1
## 11 Canada 2002 1
## 12 Canada 2007 1</code></pre>
<p>I perceive Canada to be a “high GDP” country, so I predict that the distribution of <code>gdpPercapRel</code> is located below 1, possibly even well below. Check your intuition!</p>
<pre class="r"><code>summary(my_gap$gdpPercapRel)</code></pre>
<pre><code>## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007236 0.061648 0.171521 0.326659 0.446564 9.534690</code></pre>
<p>The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to Canada, across the entire time period.</p>
<p>Remember: Trust No One. Including (especially?) yourself. Always try to find a way to check that you’ve done what meant to. Prepare to be horrified.</p>
</div>
<div id="use-arrange-to-row-order-data-in-a-principled-way" class="section level3">
<h3>Use <code>arrange()</code> to row-order data in a principled way</h3>
<p><code>arrange()</code> reorders the rows in a data frame. Imagine you wanted this data ordered by year then country, as opposed to by country then year.</p>
<pre class="r"><code>my_gap %>%
arrange(year, country)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdpPercapRel
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 0.0686
## 2 Albania Europe 1952 55.2 1282697 1601. 0.141
## 3 Algeria Africa 1952 43.1 9279525 2449. 0.215
## 4 Angola Africa 1952 30.0 4232095 3521. 0.310
## 5 Argentina Americas 1952 62.5 17876956 5911. 0.520
## 6 Australia Oceania 1952 69.1 8691212 10040. 0.883
## 7 Austria Europe 1952 66.8 6927772 6137. 0.540
## 8 Bahrain Asia 1952 50.9 120447 9867. 0.868
## 9 Bangladesh Asia 1952 37.5 46886859 684. 0.0602
## 10 Belgium Europe 1952 68 8730405 8343. 0.734
## # … with 1,694 more rows</code></pre>
<p>Or maybe you want just the data from 2007, sorted on life expectancy?</p>
<pre class="r"><code>my_gap %>%
filter(year == 2007) %>%
arrange(lifeExp)</code></pre>
<pre><code>## # A tibble: 142 x 7
## country continent year lifeExp pop gdpPercap gdpPercapRel
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Swaziland Africa 2007 39.6 1.13e6 4513. 0.124
## 2 Mozambique Africa 2007 42.1 2.00e7 824. 0.0227
## 3 Zambia Africa 2007 42.4 1.17e7 1271. 0.0350
## 4 Sierra Leone Africa 2007 42.6 6.14e6 863. 0.0237
## 5 Lesotho Africa 2007 42.6 2.01e6 1569. 0.0432
## 6 Angola Africa 2007 42.7 1.24e7 4797. 0.132
## 7 Zimbabwe Africa 2007 43.5 1.23e7 470. 0.0129
## 8 Afghanistan Asia 2007 43.8 3.19e7 975. 0.0268
## 9 Central African… Africa 2007 44.7 4.37e6 706. 0.0194
## 10 Liberia Africa 2007 45.7 3.19e6 415. 0.0114
## # … with 132 more rows</code></pre>
<p>Oh, you’d like to sort on life expectancy in <strong>desc</strong>ending order? Then use <code>desc()</code>.</p>
<pre class="r"><code>my_gap %>%
filter(year == 2007) %>%
arrange(desc(lifeExp))</code></pre>
<pre><code>## # A tibble: 142 x 7
## country continent year lifeExp pop gdpPercap gdpPercapRel
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Japan Asia 2007 82.6 1.27e8 31656. 0.872
## 2 Hong Kong, Chi… Asia 2007 82.2 6.98e6 39725. 1.09
## 3 Iceland Europe 2007 81.8 3.02e5 36181. 0.996
## 4 Switzerland Europe 2007 81.7 7.55e6 37506. 1.03
## 5 Australia Oceania 2007 81.2 2.04e7 34435. 0.948
## 6 Spain Europe 2007 80.9 4.04e7 28821. 0.794
## 7 Sweden Europe 2007 80.9 9.03e6 33860. 0.932
## 8 Israel Asia 2007 80.7 6.43e6 25523. 0.703
## 9 France Europe 2007 80.7 6.11e7 30470. 0.839
## 10 Canada Americas 2007 80.7 3.34e7 36319. 1
## # … with 132 more rows</code></pre>
<p>I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.</p>
</div>
<div id="use-rename-to-rename-variables" class="section level3">
<h3>Use <code>rename()</code> to rename variables</h3>
<p>When I first cleaned this Gapminder excerpt, I was a <a href="http://en.wikipedia.org/wiki/CamelCase"><code>camelCase</code></a> person, but now I’m all about <a href="http://en.wikipedia.org/wiki/Snake_case"><code>snake_case</code></a>. So I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!</p>
<pre class="r"><code>my_gap %>%
rename(life_exp = lifeExp,
gdp_percap = gdpPercap,
gdp_percap_rel = gdpPercapRel)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
## country continent year life_exp pop gdp_percap gdp_percap_rel
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 0.0686
## 2 Afghanistan Asia 1957 30.3 9240934 821. 0.0657
## 3 Afghanistan Asia 1962 32.0 10267083 853. 0.0634
## 4 Afghanistan Asia 1967 34.0 11537966 836. 0.0520
## 5 Afghanistan Asia 1972 36.1 13079460 740. 0.0390
## 6 Afghanistan Asia 1977 38.4 14880372 786. 0.0356
## 7 Afghanistan Asia 1982 39.9 12881816 978. 0.0427
## 8 Afghanistan Asia 1987 40.8 13867957 852. 0.0320
## 9 Afghanistan Asia 1992 41.7 16317921 649. 0.0246
## 10 Afghanistan Asia 1997 41.8 22227415 635. 0.0219
## # … with 1,694 more rows</code></pre>
<p>I did NOT assign the post-rename object back to <code>my_gap</code> because that would make the chunks in this tutorial harder to copy/paste and run out of order. In real life, I would probably assign this back to <code>my_gap</code>, in a data preparation script, and proceed with the new variable names.</p>
</div>
<div id="select-can-rename-and-reposition-variables" class="section level3">
<h3><code>select()</code> can rename and reposition variables</h3>
<p>You’ve seen simple use of <code>select()</code>. There are two tricks you might enjoy:</p>
<ol style="list-style-type: decimal">
<li><code>select()</code> can rename the variables you request to keep.</li>
<li><code>select()</code> can be used with <code>everything()</code> to hoist a variable up to the front of the tibble.</li>
</ol>
<pre class="r"><code>my_gap %>%
filter(country == "Burundi", year > 1996) %>%
select(yr = year, lifeExp, gdpPercap) %>%
select(gdpPercap, everything())</code></pre>
<pre><code>## # A tibble: 3 x 3
## gdpPercap yr lifeExp
## <dbl> <int> <dbl>
## 1 463. 1997 45.3
## 2 446. 2002 47.4
## 3 430. 2007 49.6</code></pre>
<p><code>everything()</code> is one of several helpers for variable selection. Read its help to see the rest.</p>
</div>
<div id="group_by-is-a-mighty-weapon" class="section level3">
<h3><code>group_by()</code> is a mighty weapon</h3>
<p>I have found <del>friends and family</del> collaborators love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.</p>
<p>dplyr offers powerful tools to solve this class of problem.</p>
<ul>
<li><code>group_by()</code> adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.</li>
<li><code>summarize()</code> takes a dataset with <span class="math inline">\(n\)</span> observations, computes requested summaries, and returns a dataset with 1 observation.</li>
<li>Window functions take a dataset with <span class="math inline">\(n\)</span> observations and return a dataset with <span class="math inline">\(n\)</span> observations.</li>
<li><code>mutate()</code> and <code>summarize()</code> will honor groups.</li>
<li>You can also do very general computations on your groups with <code>do()</code>, though elsewhere in this course, I advocate for other approaches that I find more intuitive, using the <code>purrr</code> package.</li>
</ul>
<p>Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.</p>
<div id="counting-things-up" class="section level4">
<h4>Counting things up</h4>
<p>Let’s start with simple counting. How many observations do we have per continent?</p>
<pre class="r"><code>my_gap %>%
group_by(continent) %>%
summarize(n = n())</code></pre>
<pre><code>## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24</code></pre>
<p>Let us pause here to think about the tidyverse. You could get these same frequencies using <code>table()</code> from base R.</p>
<pre class="r"><code>table(gapminder$continent)</code></pre>
<pre><code>##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24</code></pre>
<pre class="r"><code>str(table(gapminder$continent))</code></pre>
<pre><code>## 'table' int [1:5(1d)] 624 300 396 360 24
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:5] "Africa" "Americas" "Asia" "Europe" ...</code></pre>
<p>But the object of class <code>table</code> that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as <em>names</em> and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i to become the input of step i + 1.</p>
<p>The <code>tally()</code> function is a convenience function that knows to count rows. It honors groups.</p>
<pre class="r"><code>my_gap %>%
group_by(continent) %>%
tally()</code></pre>
<pre><code>## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24</code></pre>
<p>The <code>count()</code> function is an even more convenient function that does both grouping and counting.</p>
<pre class="r"><code>my_gap %>%
count(continent)</code></pre>
<pre><code>## # A tibble: 5 x 2
## continent n
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24</code></pre>
<p>What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside <code>summarize()</code>. Use the <code>n_distinct()</code> function to count the number of distinct countries within each continent.</p>
<pre class="r"><code>my_gap %>%
group_by(continent) %>%
summarize(n = n(),
n_countries = n_distinct(country))</code></pre>
<pre><code>## # A tibble: 5 x 3
## continent n n_countries
## <fct> <int> <int>
## 1 Africa 624 52
## 2 Americas 300 25
## 3 Asia 396 33
## 4 Europe 360 30
## 5 Oceania 24 2</code></pre>
</div>
<div id="general-summarization" class="section level4">
<h4>General summarization</h4>
<p>The functions you’ll apply within <code>summarize()</code> include classical statistical summaries, like <code>mean()</code>, <code>median()</code>, <code>var()</code>, <code>sd()</code>, <code>mad()</code>, <code>IQR()</code>, <code>min()</code>, and <code>max()</code>. Remember they are functions that take <span class="math inline">\(n\)</span> inputs and distill them down into 1 output.</p>
<p>Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.</p>
<pre class="r"><code>my_gap %>%
group_by(continent) %>%
summarize(avg_lifeExp = mean(lifeExp))</code></pre>
<pre><code>## # A tibble: 5 x 2
## continent avg_lifeExp
## <fct> <dbl>
## 1 Africa 48.9
## 2 Americas 64.7
## 3 Asia 60.1
## 4 Europe 71.9
## 5 Oceania 74.3</code></pre>
<p><code>summarize_at()</code> applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year … but only for 1952 and 2007.</p>
<pre class="r"><code>my_gap %>%
filter(year %in% c(1952, 2007)) %>%
group_by(continent, year) %>%
summarize_at(vars(lifeExp, gdpPercap), funs(mean, median))</code></pre>
<pre><code>## # A tibble: 10 x 6
## # Groups: continent [5]
## continent year lifeExp_mean gdpPercap_mean lifeExp_median
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Africa 1952 39.1 1253. 38.8
## 2 Africa 2007 54.8 3089. 52.9
## 3 Americas 1952 53.3 4079. 54.7
## 4 Americas 2007 73.6 11003. 72.9
## 5 Asia 1952 46.3 5195. 44.9
## 6 Asia 2007 70.7 12473. 72.4
## 7 Europe 1952 64.4 5661. 65.9
## 8 Europe 2007 77.6 25054. 78.6
## 9 Oceania 1952 69.3 10298. 69.3
## 10 Oceania 2007 80.7 29810. 80.7
## # … with 1 more variable: gdpPercap_median <dbl></code></pre>
<p>Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?</p>
<pre class="r"><code>my_gap %>%
filter(continent == "Asia") %>%
group_by(year) %>%
summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))</code></pre>
<pre><code>## # A tibble: 12 x 3
## year min_lifeExp max_lifeExp
## <int> <dbl> <dbl>
## 1 1952 28.8 65.4
## 2 1957 30.3 67.8
## 3 1962 32.0 69.4
## 4 1967 34.0 71.4
## 5 1972 36.1 73.4
## 6 1977 31.2 75.4
## 7 1982 39.9 77.1
## 8 1987 40.8 78.7
## 9 1992 41.7 79.4
## 10 1997 41.8 80.7
## 11 2002 42.1 82
## 12 2007 43.8 82.6</code></pre>
<p>Of course it would be much more interesting to see <em>which</em> country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? We tackle that with window functions shortly.</p>
</div>
</div>
<div id="grouped-mutate" class="section level3">
<h3>Grouped mutate</h3>
<p>Sometimes you don’t want to collapse the <span class="math inline">\(n\)</span> rows for each group into one row. You want to keep your groups, but compute within them.</p>
<div id="computing-with-group-wise-summaries" class="section level4">
<h4>Computing with group-wise summaries</h4>
<p>Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use <code>mutate()</code> to make a new variable. The <code>first()</code> function extracts the first value from a vector. Notice that <code>first()</code> is operating on the vector of life expectancies <em>within each country group</em>.</p>
<pre class="r"><code>my_gap %>%
group_by(country) %>%
select(country, year, lifeExp) %>%
mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>%
filter(year < 1963)</code></pre>
<pre><code>## # A tibble: 426 x 4
## # Groups: country [142]
## country year lifeExp lifeExp_gain
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 0
## 2 Afghanistan 1957 30.3 1.53
## 3 Afghanistan 1962 32.0 3.20
## 4 Albania 1952 55.2 0
## 5 Albania 1957 59.3 4.05
## 6 Albania 1962 64.8 9.59
## 7 Algeria 1952 43.1 0
## 8 Algeria 1957 45.7 2.61
## 9 Algeria 1962 48.3 5.23
## 10 Angola 1952 30.0 0
## # … with 416 more rows</code></pre>
<p>Within country, we take the difference between life expectancy in year <span class="math inline">\(i\)</span> and life expectancy in 1952. Therefore we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.</p>
</div>
<div id="window-functions" class="section level4">
<h4>Window functions</h4>
<p>Window functions take <span class="math inline">\(n\)</span> inputs and give back <span class="math inline">\(n\)</span> outputs. Furthermore, the output depends on all the values. So <code>rank()</code> is a window function but <code>log()</code> is not. Here we use window functions based on ranks and offsets.</p>
<p>Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about <em>which</em> country contributes these extreme values.</p>
<pre class="r"><code>my_gap %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
group_by(year) %>%
filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2) %>%
arrange(year) %>%
print(n = Inf)</code></pre>
<pre><code>## # A tibble: 24 x 3
## # Groups: year [12]
## year country lifeExp
## <int> <fct> <dbl>
## 1 1952 Afghanistan 28.8
## 2 1952 Israel 65.4
## 3 1957 Afghanistan 30.3
## 4 1957 Israel 67.8
## 5 1962 Afghanistan 32.0
## 6 1962 Israel 69.4
## 7 1967 Afghanistan 34.0
## 8 1967 Japan 71.4
## 9 1972 Afghanistan 36.1
## 10 1972 Japan 73.4
## 11 1977 Cambodia 31.2
## 12 1977 Japan 75.4
## 13 1982 Afghanistan 39.9
## 14 1982 Japan 77.1
## 15 1987 Afghanistan 40.8
## 16 1987 Japan 78.7
## 17 1992 Afghanistan 41.7
## 18 1992 Japan 79.4
## 19 1997 Afghanistan 41.8
## 20 1997 Japan 80.7
## 21 2002 Afghanistan 42.1
## 22 2002 Japan 82
## 23 2007 Afghanistan 43.8
## 24 2007 Japan 82.6</code></pre>
<p>We see that (min = Afghanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?</p>
<p>How did that actually work? First, I store and view a partial that leaves off the <code>filter()</code> statement. All of these operations should be familiar.</p>
<pre class="r"><code>asia <- my_gap %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
group_by(year)
asia</code></pre>
<pre><code>## # A tibble: 396 x 3
## # Groups: year [12]
## year country lifeExp
## <int> <fct> <dbl>
## 1 1952 Afghanistan 28.8
## 2 1957 Afghanistan 30.3
## 3 1962 Afghanistan 32.0
## 4 1967 Afghanistan 34.0
## 5 1972 Afghanistan 36.1
## 6 1977 Afghanistan 38.4
## 7 1982 Afghanistan 39.9
## 8 1987 Afghanistan 40.8
## 9 1992 Afghanistan 41.7
## 10 1997 Afghanistan 41.8
## # … with 386 more rows</code></pre>
<p>Now we apply a window function – <code>min_rank()</code>. Since <code>asia</code> is grouped by year, <code>min_rank()</code> operates within mini-datasets, each for a specific year. Applied to the variable <code>lifeExp</code>, <code>min_rank()</code> returns the rank of each country’s observed life expectancy. FYI, the <code>min</code> part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.</p>
<p>For concreteness, I use <code>mutate()</code> to actually create these variables, even though I dropped this in the solution above. Let’s look at a bit of that.</p>
<pre class="r"><code>asia %>%
mutate(le_rank = min_rank(lifeExp),
le_desc_rank = min_rank(desc(lifeExp))) %>%
filter(country %in% c("Afghanistan", "Japan", "Thailand"), year > 1995)</code></pre>
<pre><code>## # A tibble: 9 x 5
## # Groups: year [12]
## year country lifeExp le_rank le_desc_rank
## <int> <fct> <dbl> <int> <int>
## 1 1997 Afghanistan 41.8 1 33
## 2 1997 Japan 80.7 33 1
## 3 1997 Thailand 67.5 12 22
## 4 2002 Afghanistan 42.1 1 33
## 5 2002 Japan 82 33 1
## 6 2002 Thailand 68.6 12 22
## 7 2007 Afghanistan 43.8 1 33
## 8 2007 Japan 82.6 33 1
## 9 2007 Thailand 70.6 12 22</code></pre>
<p>Afghanistan tends to present 1’s in the <code>le_rank</code> variable, Japan tends to present 1’s in the <code>le_desc_rank</code> variable and other countries, like Thailand, present less extreme ranks.</p>
<p>You can understand the original <code>filter()</code> statement now:</p>
<pre class="r"><code>filter(min_rank(desc(lifeExp)) < 2 | min_rank(lifeExp) < 2)</code></pre>
<p>These two sets of ranks are formed on-the-fly, within year group, and <code>filter()</code> retains rows with rank less than 2, which means … the row with rank = 1. Since we do for ascending and descending ranks, we get both the min and the max.</p>
<p>If we had wanted just the min OR the max, an alternative approach using <code>top_n()</code> would have worked.</p>
<pre class="r"><code>my_gap %>%
filter(continent == "Asia") %>%
select(year, country, lifeExp) %>%
arrange(year) %>%
group_by(year) %>%
#top_n(1, wt = lifeExp) ## gets the min
top_n(1, wt = desc(lifeExp)) ## gets the max</code></pre>
<pre><code>## # A tibble: 12 x 3
## # Groups: year [12]
## year country lifeExp
## <int> <fct> <dbl>
## 1 1952 Afghanistan 28.8
## 2 1957 Afghanistan 30.3
## 3 1962 Afghanistan 32.0
## 4 1967 Afghanistan 34.0
## 5 1972 Afghanistan 36.1
## 6 1977 Cambodia 31.2
## 7 1982 Afghanistan 39.9
## 8 1987 Afghanistan 40.8
## 9 1992 Afghanistan 41.7
## 10 1997 Afghanistan 41.8
## 11 2002 Afghanistan 42.1
## 12 2007 Afghanistan 43.8</code></pre>
</div>
</div>
<div id="grand-finale" class="section level3">
<h3>Grand Finale</h3>
<p>So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.</p>
<p>At this point, that’s just too easy, so let’s do it by continent while we’re at it.</p>
<pre class="r"><code>my_gap %>%
select(country, year, continent, lifeExp) %>%
group_by(continent, country) %>%
## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
## positive means lifeExp went up, negative means it went down
mutate(le_delta = lifeExp - lag(lifeExp)) %>%
## within country, retain the worst lifeExp change = smallest or most negative
summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
## within continent, retain the row with the lowest worst_le_delta
top_n(-1, wt = worst_le_delta) %>%
arrange(worst_le_delta)</code></pre>
<pre><code>## # A tibble: 5 x 3
## # Groups: continent [5]
## continent country worst_le_delta
## <fct> <fct> <dbl>
## 1 Africa Rwanda -20.4
## 2 Asia Cambodia -9.10
## 3 Americas El Salvador -1.51
## 4 Europe Montenegro -1.46
## 5 Oceania Australia 0.170</code></pre>
<p>Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.</p>
<p>Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to <em>write</em> such a thing. These commands do not <a href="http://tinyurl.com/athenaforehead">leap fully formed out of anyone’s forehead</a> – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.</p>
<p>In later tutorials, we’ll explore more of dplyr, such as operations based on two datasets.</p>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p><code>dplyr</code> official stuff</p>
<ul>
<li>package home <a href="http://cran.r-project.org/web/packages/dplyr/index.html">on CRAN</a>
<ul>
<li>note there are several vignettes, with the <a href="http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html">introduction</a> being the most relevant right now</li>
<li>the <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/window-functions.html">one on window functions</a> will also be interesting to you now</li>
</ul></li>
<li>development home <a href="https://github.com/hadley/dplyr">on GitHub</a></li>
<li><a href="https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a">tutorial HW delivered</a> (note this links to a DropBox folder) at useR! 2014 conference</li>
</ul>
<p><a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">RStudio Data Wrangling cheatsheet</a>, covering <code>dplyr</code> and <code>tidyr</code>. Remember you can get to these via <em>Help > Cheatsheets.</em></p>
<p><a href="http://r4ds.had.co.nz/transform.html">Data transformation</a> chapter of <a href="http://r4ds.had.co.nz">R for Data Science</a></p>
<p><a href="https://github.com/tjmahr/MadR_Pipelines">Excellent slides</a> on pipelines and <code>dplyr</code> by TJ Mahr, talk given to the Madison R Users Group.</p>
<p>Blog post <a href="http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/">Hands-on dplyr tutorial for faster data manipulation in R</a> by Data School, that includes a link to an R Markdown document and links to videos</p>
<p><a href="bit001_dplyr-cheatsheet.html">Cheatsheet</a> I made for <code>dplyr</code> join functions (not relevant yet but soon)</p>
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>