block010_dplyr-end-single-table.html

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />


<title>dplyr functions for a single dataset</title>

<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="libs/navigation-1.1/tabsets.js"></script>
<link href="libs/highlightjs-9.12.0/default.css" rel="stylesheet" />
<script src="libs/highlightjs-9.12.0/highlight.js"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-68219208-1', 'auto');
  ga('send', 'pageview');

</script>

<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
  pre:not([class]) {
    background-color: white;
  }
</style>
<script type="text/javascript">
if (window.hljs) {
  hljs.configure({languages: []});
  hljs.initHighlightingOnLoad();
  if (document.readyState && document.readyState === "complete") {
    window.setTimeout(function() { hljs.initHighlighting(); }, 0);
  }
}
</script>


<style type="text/css">
h1 {
  font-size: 34px;
}
h1.title {
  font-size: 38px;
}
h2 {
  font-size: 30px;
}
h3 {
  font-size: 24px;
}
h4 {
  font-size: 18px;
}
h5 {
  font-size: 16px;
}
h6 {
  font-size: 12px;
}
.table th:not([align]) {
  text-align: left;
}
</style>

<link rel="stylesheet" href="libs/local/main.css" type="text/css" />
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
<link rel="stylesheet" href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" type="text/css" />

</head>

<body>

<style type = "text/css">
.main-container {
  max-width: 940px;
  margin-left: auto;
  margin-right: auto;
}
code {
  color: inherit;
  background-color: rgba(0, 0, 0, 0.04);
}
img {
  max-width:100%;
  height: auto;
}
.tabbed-pane {
  padding-top: 12px;
}
.html-widget {
  margin-bottom: 20px;
}
button.code-folding-btn:focus {
  outline: none;
}
</style>


<div class="container-fluid main-container">

<!-- tabsets -->

<style type="text/css">
.tabset-dropdown > .nav-tabs {
  display: inline-table;
  max-height: 500px;
  min-height: 44px;
  overflow-y: auto;
  background: white;
  border: 1px solid #ddd;
  border-radius: 4px;
}

.tabset-dropdown > .nav-tabs > li.active:before {
  content: "";
  font-family: 'Glyphicons Halflings';
  display: inline-block;
  padding: 10px;
  border-right: 1px solid #ddd;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
  content: "&#xe258;";
  border: none;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open:before {
  content: "";
  font-family: 'Glyphicons Halflings';
  display: inline-block;
  padding: 10px;
  border-right: 1px solid #ddd;
}

.tabset-dropdown > .nav-tabs > li.active {
  display: block;
}

.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
  border: none;
  display: inline-block;
  border-radius: 4px;
}

.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
  display: block;
  float: none;
}

.tabset-dropdown > .nav-tabs > li {
  display: none;
}
</style>

<script>
$(document).ready(function () {
  window.buildTabsets("TOC");
});

$(document).ready(function () {
  $('.tabset-dropdown > .nav-tabs > li').click(function () {
    $(this).parent().toggleClass('nav-tabs-open')
  });
});
</script>

<!-- code folding -->


<header>
  <div class="nav">
    <a class="nav-logo" href="index.html">
      <img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
    </a>
    <ul>
      <li class="home"><a href="index.html">Home</a></li>
      <li class="faq"><a href="faq.html">FAQ</a></li>
      <li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
      <li class="topics"><a href="topics.html">Topics</a></li>
      <li class="people"><a href="people.html">People</a></li>
    </ul>
  </div>
</header>

<div class="fluid-row" id="header">


<h1 class="title toc-ignore">dplyr functions for a single dataset</h1>

</div>

<div id="TOC">
<ul>
<li><a href="#where-were-we">Where were we?</a></li>
<li><a href="#load-dplyr-and-gapminder">Load dplyr and gapminder</a></li>
<li><a href="#create-a-copy-of-gapminder">Create a copy of gapminder</a></li>
<li><a href="#use-mutate-to-add-new-variables">Use <code>mutate()</code> to add new variables</a></li>
<li><a href="#use-arrange-to-row-order-data-in-a-principled-way">Use <code>arrange()</code> to row-order data in a principled way</a></li>
<li><a href="#use-rename-to-rename-variables">Use <code>rename()</code> to rename variables</a></li>
<li><a href="#select-can-rename-and-reposition-variables"><code>select()</code> can rename and reposition variables</a></li>
<li><a href="#group_by-is-a-mighty-weapon"><code>group_by()</code> is a mighty weapon</a><ul>
<li><a href="#counting-things-up">Counting things up</a></li>
<li><a href="#general-summarization">General summarization</a></li>
</ul></li>
<li><a href="#grouped-mutate">Grouped mutate</a><ul>
<li><a href="#computing-with-group-wise-summaries">Computing with group-wise summaries</a></li>
<li><a href="#window-functions">Window functions</a></li>
</ul></li>
<li><a href="#grand-finale">Grand Finale</a></li>
<li><a href="#resources">Resources</a></li>
</ul>
</div>

<div id="where-were-we" class="section level3">
<h3>Where were we?</h3>
<p>In the <a href="block009_dplyr-intro.html">introduction to dplyr</a>, we used two very important verbs and an operator:</p>
<ul>
<li><code>filter()</code> for subsetting data with row logic</li>
<li><code>select()</code> for subsetting data variable- or column-wise</li>
<li>the pipe operator <code>%&gt;%</code>, which feeds the LHS as the first argument to the expression on the RHS</li>
</ul>
<p>We also discussed dplyr’s role inside the tidyverse and tibbles:</p>
<ul>
<li>dplyr is a core package in the <a href="https://github.com/hadley/tidyverse">tidyverse</a> meta-package. Since we often make incidental usage of the others, we will load dplyr and the others via <code>library(tidyverse)</code>.</li>
<li>The tidyverse embraces a special flavor of data frame, called a tibble. The <code>gapminder</code> dataset is stored as a tibble.</li>
</ul>
</div>
<div id="load-dplyr-and-gapminder" class="section level3">
<h3>Load dplyr and gapminder</h3>
<p>I choose to load the tidyverse, which will load dplyr, among other packages we use incidentally below. Also load gapminder.</p>
<pre class="r"><code>library(gapminder)
library(tidyverse)</code></pre>
<pre><code>## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──</code></pre>
<pre><code>## ✔ ggplot2 3.0.0           ✔ purrr   0.2.5      
## ✔ tibble  1.4.99.9004     ✔ dplyr   0.7.99.9000
## ✔ tidyr   0.8.1           ✔ stringr 1.3.1      
## ✔ readr   1.2.0           ✔ forcats 0.3.0</code></pre>
<pre><code>## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()</code></pre>
</div>
<div id="create-a-copy-of-gapminder" class="section level3">
<h3>Create a copy of gapminder</h3>
<p>We’re going to make changes to the <code>gapminder</code> tibble. To eliminate any fear that you’re damaging the data that comes with the package, we create an explicit copy of <code>gapminder</code> for our experiments.</p>
<pre class="r"><code>(my_gap &lt;- gapminder)</code></pre>
<pre><code>## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    &lt;fct&gt;       &lt;fct&gt;     &lt;int&gt;   &lt;dbl&gt;    &lt;int&gt;     &lt;dbl&gt;
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows</code></pre>
<p><strong>Pay close attention</strong> to when we evaluate statements but let the output just print to screen:</p>
<pre class="r"><code>## let output print to screen, but do not store
my_gap %&gt;% filter(country == &quot;Canada&quot;)</code></pre>
<p>… versus when we assign the output to an object, possibly overwriting an existing object.</p>
<pre class="r"><code>## store the output as an R object
my_precious &lt;- my_gap %&gt;% filter(country == &quot;Canada&quot;)</code></pre>
</div>
<div id="use-mutate-to-add-new-variables" class="section level3">
<h3>Use <code>mutate()</code> to add new variables</h3>
<p>Imagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s multiply them together.</p>
<p><code>mutate()</code> is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.</p>
<pre class="r"><code>my_gap %&gt;%
  mutate(gdp = pop * gdpPercap)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    &lt;fct&gt;       &lt;fct&gt;     &lt;int&gt;   &lt;dbl&gt;    &lt;int&gt;     &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # … with 1,694 more rows</code></pre>
<p>Hmmmm … those GDP numbers are almost uselessly large and abstract. Consider the <a href="http://fivethirtyeight.com/datalab/xkcd-randall-munroe-qanda-what-if/">advice of Randall Munroe of xkcd</a>:</p>
<blockquote>
<p>One thing that bothers me is large numbers presented without context… ‘If I added a zero to this number, would the sentence containing it mean something different to me?’ If the answer is ‘no,’ maybe the number has no business being in the sentence in the first place.&quot;</p>
</blockquote>
<p>Maybe it would be more meaningful to consumers of my tables and figures to stick with GDP per capita. But what if I reported GDP per capita, <em>relative to some benchmark country</em>. Since Canada is my adopted home, I’ll go with that.</p>
<p>I need to create a new variable that is <code>gdpPercap</code> divided by Canadian <code>gdpPercap</code>, taking care that I always divide two numbers that pertain to the same year.</p>
<p>How I achieve:</p>
<ul>
<li>Filter down to the rows for Canada.</li>
<li>Create a new temporary variable in <code>my_gap</code>:
<ul>
<li>Extract the <code>gdpPercap</code> variable from the Canadian data.</li>
<li>Replicate it once per country in the dataset, so it has the right length.</li>
</ul></li>
<li>Divide raw <code>gdpPercap</code> by this Canadian figure.</li>
<li>Discard the temporary variable of replicated Canadian <code>gdpPercap</code>.</li>
</ul>
<pre class="r"><code>ctib &lt;- my_gap %&gt;%
  filter(country == &quot;Canada&quot;)
## this is a semi-dangerous way to add this variable
## I&#39;d prefer to join on year, but we haven&#39;t covered joins yet
my_gap &lt;- my_gap %&gt;%
  mutate(tmp = rep(ctib$gdpPercap, nlevels(country)),
         gdpPercapRel = gdpPercap / tmp,
         tmp = NULL)</code></pre>
<p>Note that, <code>mutate()</code> builds new variables sequentially so you can reference earlier ones (like <code>tmp</code>) when defining later ones (like <code>gdpPercapRel</code>). Also, you can get rid of a variable by setting it to <code>NULL</code>.</p>
<p>How could we sanity check that this worked? The Canadian values for <code>gdpPercapRel</code> better all be 1!</p>
<pre class="r"><code>my_gap %&gt;% 
  filter(country == &quot;Canada&quot;) %&gt;% 
  select(country, year, gdpPercapRel)</code></pre>
<pre><code>## # A tibble: 12 x 3
##    country  year gdpPercapRel
##    &lt;fct&gt;   &lt;int&gt;        &lt;dbl&gt;
##  1 Canada   1952            1
##  2 Canada   1957            1
##  3 Canada   1962            1
##  4 Canada   1967            1
##  5 Canada   1972            1
##  6 Canada   1977            1
##  7 Canada   1982            1
##  8 Canada   1987            1
##  9 Canada   1992            1
## 10 Canada   1997            1
## 11 Canada   2002            1
## 12 Canada   2007            1</code></pre>
<p>I perceive Canada to be a “high GDP” country, so I predict that the distribution of <code>gdpPercapRel</code> is located below 1, possibly even well below. Check your intuition!</p>
<pre class="r"><code>summary(my_gap$gdpPercapRel)</code></pre>
<pre><code>##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007236 0.061648 0.171521 0.326659 0.446564 9.534690</code></pre>
<p>The relative GDP per capita numbers are, in general, well below 1. We see that most of the countries covered by this dataset have substantially lower GDP per capita, relative to Canada, across the entire time period.</p>
<p>Remember: Trust No One. Including (especially?) yourself. Always try to find a way to check that you’ve done what meant to. Prepare to be horrified.</p>
</div>
<div id="use-arrange-to-row-order-data-in-a-principled-way" class="section level3">
<h3>Use <code>arrange()</code> to row-order data in a principled way</h3>
<p><code>arrange()</code> reorders the rows in a data frame. Imagine you wanted this data ordered by year then country, as opposed to by country then year.</p>
<pre class="r"><code>my_gap %&gt;%
  arrange(year, country)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap gdpPercapRel
##    &lt;fct&gt;       &lt;fct&gt;     &lt;int&gt;   &lt;dbl&gt;    &lt;int&gt;     &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan Asia       1952    28.8  8425333      779.       0.0686
##  2 Albania     Europe     1952    55.2  1282697     1601.       0.141 
##  3 Algeria     Africa     1952    43.1  9279525     2449.       0.215 
##  4 Angola      Africa     1952    30.0  4232095     3521.       0.310 
##  5 Argentina   Americas   1952    62.5 17876956     5911.       0.520 
##  6 Australia   Oceania    1952    69.1  8691212    10040.       0.883 
##  7 Austria     Europe     1952    66.8  6927772     6137.       0.540 
##  8 Bahrain     Asia       1952    50.9   120447     9867.       0.868 
##  9 Bangladesh  Asia       1952    37.5 46886859      684.       0.0602
## 10 Belgium     Europe     1952    68    8730405     8343.       0.734 
## # … with 1,694 more rows</code></pre>
<p>Or maybe you want just the data from 2007, sorted on life expectancy?</p>
<pre class="r"><code>my_gap %&gt;%
  filter(year == 2007) %&gt;%
  arrange(lifeExp)</code></pre>
<pre><code>## # A tibble: 142 x 7
##    country          continent  year lifeExp     pop gdpPercap gdpPercapRel
##    &lt;fct&gt;            &lt;fct&gt;     &lt;int&gt;   &lt;dbl&gt;   &lt;int&gt;     &lt;dbl&gt;        &lt;dbl&gt;
##  1 Swaziland        Africa     2007    39.6  1.13e6     4513.       0.124 
##  2 Mozambique       Africa     2007    42.1  2.00e7      824.       0.0227
##  3 Zambia           Africa     2007    42.4  1.17e7     1271.       0.0350
##  4 Sierra Leone     Africa     2007    42.6  6.14e6      863.       0.0237
##  5 Lesotho          Africa     2007    42.6  2.01e6     1569.       0.0432
##  6 Angola           Africa     2007    42.7  1.24e7     4797.       0.132 
##  7 Zimbabwe         Africa     2007    43.5  1.23e7      470.       0.0129
##  8 Afghanistan      Asia       2007    43.8  3.19e7      975.       0.0268
##  9 Central African… Africa     2007    44.7  4.37e6      706.       0.0194
## 10 Liberia          Africa     2007    45.7  3.19e6      415.       0.0114
## # … with 132 more rows</code></pre>
<p>Oh, you’d like to sort on life expectancy in <strong>desc</strong>ending order? Then use <code>desc()</code>.</p>
<pre class="r"><code>my_gap %&gt;%
  filter(year == 2007) %&gt;%
  arrange(desc(lifeExp))</code></pre>
<pre><code>## # A tibble: 142 x 7
##    country         continent  year lifeExp      pop gdpPercap gdpPercapRel
##    &lt;fct&gt;           &lt;fct&gt;     &lt;int&gt;   &lt;dbl&gt;    &lt;int&gt;     &lt;dbl&gt;        &lt;dbl&gt;
##  1 Japan           Asia       2007    82.6   1.27e8    31656.        0.872
##  2 Hong Kong, Chi… Asia       2007    82.2   6.98e6    39725.        1.09 
##  3 Iceland         Europe     2007    81.8   3.02e5    36181.        0.996
##  4 Switzerland     Europe     2007    81.7   7.55e6    37506.        1.03 
##  5 Australia       Oceania    2007    81.2   2.04e7    34435.        0.948
##  6 Spain           Europe     2007    80.9   4.04e7    28821.        0.794
##  7 Sweden          Europe     2007    80.9   9.03e6    33860.        0.932
##  8 Israel          Asia       2007    80.7   6.43e6    25523.        0.703
##  9 France          Europe     2007    80.7   6.11e7    30470.        0.839
## 10 Canada          Americas   2007    80.7   3.34e7    36319.        1    
## # … with 132 more rows</code></pre>
<p>I advise that your analyses NEVER rely on rows or variables being in a specific order. But it’s still true that human beings write the code and the interactive development process can be much nicer if you reorder the rows of your data as you go along. Also, once you are preparing tables for human eyeballs, it is imperative that you step up and take control of row order.</p>
</div>
<div id="use-rename-to-rename-variables" class="section level3">
<h3>Use <code>rename()</code> to rename variables</h3>
<p>When I first cleaned this Gapminder excerpt, I was a <a href="http://en.wikipedia.org/wiki/CamelCase"><code>camelCase</code></a> person, but now I’m all about <a href="http://en.wikipedia.org/wiki/Snake_case"><code>snake_case</code></a>. So I am vexed by the variable names I chose when I cleaned this data years ago. Let’s rename some variables!</p>
<pre class="r"><code>my_gap %&gt;%
  rename(life_exp = lifeExp,
         gdp_percap = gdpPercap,
         gdp_percap_rel = gdpPercapRel)</code></pre>
<pre><code>## # A tibble: 1,704 x 7
##    country     continent  year life_exp      pop gdp_percap gdp_percap_rel
##    &lt;fct&gt;       &lt;fct&gt;     &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;      &lt;dbl&gt;          &lt;dbl&gt;
##  1 Afghanistan Asia       1952     28.8  8425333       779.         0.0686
##  2 Afghanistan Asia       1957     30.3  9240934       821.         0.0657
##  3 Afghanistan Asia       1962     32.0 10267083       853.         0.0634
##  4 Afghanistan Asia       1967     34.0 11537966       836.         0.0520
##  5 Afghanistan Asia       1972     36.1 13079460       740.         0.0390
##  6 Afghanistan Asia       1977     38.4 14880372       786.         0.0356
##  7 Afghanistan Asia       1982     39.9 12881816       978.         0.0427
##  8 Afghanistan Asia       1987     40.8 13867957       852.         0.0320
##  9 Afghanistan Asia       1992     41.7 16317921       649.         0.0246
## 10 Afghanistan Asia       1997     41.8 22227415       635.         0.0219
## # … with 1,694 more rows</code></pre>
<p>I did NOT assign the post-rename object back to <code>my_gap</code> because that would make the chunks in this tutorial harder to copy/paste and run out of order. In real life, I would probably assign this back to <code>my_gap</code>, in a data preparation script, and proceed with the new variable names.</p>
</div>
<div id="select-can-rename-and-reposition-variables" class="section level3">
<h3><code>select()</code> can rename and reposition variables</h3>
<p>You’ve seen simple use of <code>select()</code>. There are two tricks you might enjoy:</p>
<ol style="list-style-type: decimal">
<li><code>select()</code> can rename the variables you request to keep.</li>
<li><code>select()</code> can be used with <code>everything()</code> to hoist a variable up to the front of the tibble.</li>
</ol>
<pre class="r"><code>my_gap %&gt;%
  filter(country == &quot;Burundi&quot;, year &gt; 1996) %&gt;% 
  select(yr = year, lifeExp, gdpPercap) %&gt;% 
  select(gdpPercap, everything())</code></pre>
<pre><code>## # A tibble: 3 x 3
##   gdpPercap    yr lifeExp
##       &lt;dbl&gt; &lt;int&gt;   &lt;dbl&gt;
## 1      463.  1997    45.3
## 2      446.  2002    47.4
## 3      430.  2007    49.6</code></pre>
<p><code>everything()</code> is one of several helpers for variable selection. Read its help to see the rest.</p>
</div>
<div id="group_by-is-a-mighty-weapon" class="section level3">
<h3><code>group_by()</code> is a mighty weapon</h3>
<p>I have found <del>friends and family</del> collaborators love to ask seemingly innocuous questions like, “which country experienced the sharpest 5-year drop in life expectancy?”. In fact, that is a totally natural question to ask. But if you are using a language that doesn’t know about data, it’s an incredibly annoying question to answer.</p>
<p>dplyr offers powerful tools to solve this class of problem.</p>
<ul>
<li><code>group_by()</code> adds extra structure to your dataset – grouping information – which lays the groundwork for computations within the groups.</li>
<li><code>summarize()</code> takes a dataset with <span class="math inline">\(n\)</span> observations, computes requested summaries, and returns a dataset with 1 observation.</li>
<li>Window functions take a dataset with <span class="math inline">\(n\)</span> observations and return a dataset with <span class="math inline">\(n\)</span> observations.</li>
<li><code>mutate()</code> and <code>summarize()</code> will honor groups.</li>
<li>You can also do very general computations on your groups with <code>do()</code>, though elsewhere in this course, I advocate for other approaches that I find more intuitive, using the <code>purrr</code> package.</li>
</ul>
<p>Combined with the verbs you already know, these new tools allow you to solve an extremely diverse set of problems with relative ease.</p>
<div id="counting-things-up" class="section level4">
<h4>Counting things up</h4>
<p>Let’s start with simple counting. How many observations do we have per continent?</p>
<pre class="r"><code>my_gap %&gt;%
  group_by(continent) %&gt;%
  summarize(n = n())</code></pre>
<pre><code>## # A tibble: 5 x 2
##   continent     n
##   &lt;fct&gt;     &lt;int&gt;
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24</code></pre>
<p>Let us pause here to think about the tidyverse. You could get these same frequencies using <code>table()</code> from base R.</p>
<pre class="r"><code>table(gapminder$continent)</code></pre>
<pre><code>## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24</code></pre>
<pre class="r"><code>str(table(gapminder$continent))</code></pre>
<pre><code>##  &#39;table&#39; int [1:5(1d)] 624 300 396 360 24
##  - attr(*, &quot;dimnames&quot;)=List of 1
##   ..$ : chr [1:5] &quot;Africa&quot; &quot;Americas&quot; &quot;Asia&quot; &quot;Europe&quot; ...</code></pre>
<p>But the object of class <code>table</code> that is returned makes downstream computation a bit fiddlier than you’d like. For example, it’s too bad the continent levels come back only as <em>names</em> and not as a proper factor, with the original set of levels. This is an example of how the tidyverse smooths transitions where you want the output of step i to become the input of step i + 1.</p>
<p>The <code>tally()</code> function is a convenience function that knows to count rows. It honors groups.</p>
<pre class="r"><code>my_gap %&gt;%
  group_by(continent) %&gt;%
  tally()</code></pre>
<pre><code>## # A tibble: 5 x 2
##   continent     n
##   &lt;fct&gt;     &lt;int&gt;
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24</code></pre>
<p>The <code>count()</code> function is an even more convenient function that does both grouping and counting.</p>
<pre class="r"><code>my_gap %&gt;% 
  count(continent)</code></pre>
<pre><code>## # A tibble: 5 x 2
##   continent     n
##   &lt;fct&gt;     &lt;int&gt;
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24</code></pre>
<p>What if we wanted to add the number of unique countries for each continent? You can compute multiple summaries inside <code>summarize()</code>. Use the <code>n_distinct()</code> function to count the number of distinct countries within each continent.</p>
<pre class="r"><code>my_gap %&gt;%
  group_by(continent) %&gt;%
  summarize(n = n(),
            n_countries = n_distinct(country))</code></pre>
<pre><code>## # A tibble: 5 x 3
##   continent     n n_countries
##   &lt;fct&gt;     &lt;int&gt;       &lt;int&gt;
## 1 Africa      624          52
## 2 Americas    300          25
## 3 Asia        396          33
## 4 Europe      360          30
## 5 Oceania      24           2</code></pre>
</div>
<div id="general-summarization" class="section level4">
<h4>General summarization</h4>
<p>The functions you’ll apply within <code>summarize()</code> include classical statistical summaries, like <code>mean()</code>, <code>median()</code>, <code>var()</code>, <code>sd()</code>, <code>mad()</code>, <code>IQR()</code>, <code>min()</code>, and <code>max()</code>. Remember they are functions that take <span class="math inline">\(n\)</span> inputs and distill them down into 1 output.</p>
<p>Although this may be statistically ill-advised, let’s compute the average life expectancy by continent.</p>
<pre class="r"><code>my_gap %&gt;%
  group_by(continent) %&gt;%
  summarize(avg_lifeExp = mean(lifeExp))</code></pre>
<pre><code>## # A tibble: 5 x 2
##   continent avg_lifeExp
##   &lt;fct&gt;           &lt;dbl&gt;
## 1 Africa           48.9
## 2 Americas         64.7
## 3 Asia             60.1
## 4 Europe           71.9
## 5 Oceania          74.3</code></pre>
<p><code>summarize_at()</code> applies the same summary function(s) to multiple variables. Let’s compute average and median life expectancy and GDP per capita by continent by year … but only for 1952 and 2007.</p>
<pre class="r"><code>my_gap %&gt;%
  filter(year %in% c(1952, 2007)) %&gt;%
  group_by(continent, year) %&gt;%
  summarize_at(vars(lifeExp, gdpPercap), funs(mean, median))</code></pre>
<pre><code>## # A tibble: 10 x 6
## # Groups:   continent [5]
##    continent  year lifeExp_mean gdpPercap_mean lifeExp_median
##    &lt;fct&gt;     &lt;int&gt;        &lt;dbl&gt;          &lt;dbl&gt;          &lt;dbl&gt;
##  1 Africa     1952         39.1          1253.           38.8
##  2 Africa     2007         54.8          3089.           52.9
##  3 Americas   1952         53.3          4079.           54.7
##  4 Americas   2007         73.6         11003.           72.9
##  5 Asia       1952         46.3          5195.           44.9
##  6 Asia       2007         70.7         12473.           72.4
##  7 Europe     1952         64.4          5661.           65.9
##  8 Europe     2007         77.6         25054.           78.6
##  9 Oceania    1952         69.3         10298.           69.3
## 10 Oceania    2007         80.7         29810.           80.7
## # … with 1 more variable: gdpPercap_median &lt;dbl&gt;</code></pre>
<p>Let’s focus just on Asia. What are the minimum and maximum life expectancies seen by year?</p>
<pre class="r"><code>my_gap %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  group_by(year) %&gt;%
  summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))</code></pre>
<pre><code>## # A tibble: 12 x 3
##     year min_lifeExp max_lifeExp
##    &lt;int&gt;       &lt;dbl&gt;       &lt;dbl&gt;
##  1  1952        28.8        65.4
##  2  1957        30.3        67.8
##  3  1962        32.0        69.4
##  4  1967        34.0        71.4
##  5  1972        36.1        73.4
##  6  1977        31.2        75.4
##  7  1982        39.9        77.1
##  8  1987        40.8        78.7
##  9  1992        41.7        79.4
## 10  1997        41.8        80.7
## 11  2002        42.1        82  
## 12  2007        43.8        82.6</code></pre>
<p>Of course it would be much more interesting to see <em>which</em> country contributed these extreme observations. Is the minimum (maximum) always coming from the same country? We tackle that with window functions shortly.</p>
</div>
</div>
<div id="grouped-mutate" class="section level3">
<h3>Grouped mutate</h3>
<p>Sometimes you don’t want to collapse the <span class="math inline">\(n\)</span> rows for each group into one row. You want to keep your groups, but compute within them.</p>
<div id="computing-with-group-wise-summaries" class="section level4">
<h4>Computing with group-wise summaries</h4>
<p>Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use <code>mutate()</code> to make a new variable. The <code>first()</code> function extracts the first value from a vector. Notice that <code>first()</code> is operating on the vector of life expectancies <em>within each country group</em>.</p>
<pre class="r"><code>my_gap %&gt;% 
  group_by(country) %&gt;% 
  select(country, year, lifeExp) %&gt;% 
  mutate(lifeExp_gain = lifeExp - first(lifeExp)) %&gt;% 
  filter(year &lt; 1963)</code></pre>
<pre><code>## # A tibble: 426 x 4
## # Groups:   country [142]
##    country      year lifeExp lifeExp_gain
##    &lt;fct&gt;       &lt;int&gt;   &lt;dbl&gt;        &lt;dbl&gt;
##  1 Afghanistan  1952    28.8         0   
##  2 Afghanistan  1957    30.3         1.53
##  3 Afghanistan  1962    32.0         3.20
##  4 Albania      1952    55.2         0   
##  5 Albania      1957    59.3         4.05
##  6 Albania      1962    64.8         9.59
##  7 Algeria      1952    43.1         0   
##  8 Algeria      1957    45.7         2.61
##  9 Algeria      1962    48.3         5.23
## 10 Angola       1952    30.0         0   
## # … with 416 more rows</code></pre>
<p>Within country, we take the difference between life expectancy in year <span class="math inline">\(i\)</span> and life expectancy in 1952. Therefore we always see zeroes for 1952 and, for most countries, a sequence of positive and increasing numbers.</p>
</div>
<div id="window-functions" class="section level4">
<h4>Window functions</h4>
<p>Window functions take <span class="math inline">\(n\)</span> inputs and give back <span class="math inline">\(n\)</span> outputs. Furthermore, the output depends on all the values. So <code>rank()</code> is a window function but <code>log()</code> is not. Here we use window functions based on ranks and offsets.</p>
<p>Let’s revisit the worst and best life expectancies in Asia over time, but retaining info about <em>which</em> country contributes these extreme values.</p>
<pre class="r"><code>my_gap %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  group_by(year) %&gt;%
  filter(min_rank(desc(lifeExp)) &lt; 2 | min_rank(lifeExp) &lt; 2) %&gt;% 
  arrange(year) %&gt;%
  print(n = Inf)</code></pre>
<pre><code>## # A tibble: 24 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    &lt;int&gt; &lt;fct&gt;         &lt;dbl&gt;
##  1  1952 Afghanistan    28.8
##  2  1952 Israel         65.4
##  3  1957 Afghanistan    30.3
##  4  1957 Israel         67.8
##  5  1962 Afghanistan    32.0
##  6  1962 Israel         69.4
##  7  1967 Afghanistan    34.0
##  8  1967 Japan          71.4
##  9  1972 Afghanistan    36.1
## 10  1972 Japan          73.4
## 11  1977 Cambodia       31.2
## 12  1977 Japan          75.4
## 13  1982 Afghanistan    39.9
## 14  1982 Japan          77.1
## 15  1987 Afghanistan    40.8
## 16  1987 Japan          78.7
## 17  1992 Afghanistan    41.7
## 18  1992 Japan          79.4
## 19  1997 Afghanistan    41.8
## 20  1997 Japan          80.7
## 21  2002 Afghanistan    42.1
## 22  2002 Japan          82  
## 23  2007 Afghanistan    43.8
## 24  2007 Japan          82.6</code></pre>
<p>We see that (min = Afghanistan, max = Japan) is the most frequent result, but Cambodia and Israel pop up at least once each as the min or max, respectively. That table should make you impatient for our upcoming work on tidying and reshaping data! Wouldn’t it be nice to have one row per year?</p>
<p>How did that actually work? First, I store and view a partial that leaves off the <code>filter()</code> statement. All of these operations should be familiar.</p>
<pre class="r"><code>asia &lt;- my_gap %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  group_by(year)
asia</code></pre>
<pre><code>## # A tibble: 396 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    &lt;int&gt; &lt;fct&gt;         &lt;dbl&gt;
##  1  1952 Afghanistan    28.8
##  2  1957 Afghanistan    30.3
##  3  1962 Afghanistan    32.0
##  4  1967 Afghanistan    34.0
##  5  1972 Afghanistan    36.1
##  6  1977 Afghanistan    38.4
##  7  1982 Afghanistan    39.9
##  8  1987 Afghanistan    40.8
##  9  1992 Afghanistan    41.7
## 10  1997 Afghanistan    41.8
## # … with 386 more rows</code></pre>
<p>Now we apply a window function – <code>min_rank()</code>. Since <code>asia</code> is grouped by year, <code>min_rank()</code> operates within mini-datasets, each for a specific year. Applied to the variable <code>lifeExp</code>, <code>min_rank()</code> returns the rank of each country’s observed life expectancy. FYI, the <code>min</code> part just specifies how ties are broken. Here is an explicit peek at these within-year life expectancy ranks, in both the (default) ascending and descending order.</p>
<p>For concreteness, I use <code>mutate()</code> to actually create these variables, even though I dropped this in the solution above. Let’s look at a bit of that.</p>
<pre class="r"><code>asia %&gt;%
  mutate(le_rank = min_rank(lifeExp),
         le_desc_rank = min_rank(desc(lifeExp))) %&gt;% 
  filter(country %in% c(&quot;Afghanistan&quot;, &quot;Japan&quot;, &quot;Thailand&quot;), year &gt; 1995)</code></pre>
<pre><code>## # A tibble: 9 x 5
## # Groups:   year [12]
##    year country     lifeExp le_rank le_desc_rank
##   &lt;int&gt; &lt;fct&gt;         &lt;dbl&gt;   &lt;int&gt;        &lt;int&gt;
## 1  1997 Afghanistan    41.8       1           33
## 2  1997 Japan          80.7      33            1
## 3  1997 Thailand       67.5      12           22
## 4  2002 Afghanistan    42.1       1           33
## 5  2002 Japan          82        33            1
## 6  2002 Thailand       68.6      12           22
## 7  2007 Afghanistan    43.8       1           33
## 8  2007 Japan          82.6      33            1
## 9  2007 Thailand       70.6      12           22</code></pre>
<p>Afghanistan tends to present 1’s in the <code>le_rank</code> variable, Japan tends to present 1’s in the <code>le_desc_rank</code> variable and other countries, like Thailand, present less extreme ranks.</p>
<p>You can understand the original <code>filter()</code> statement now:</p>
<pre class="r"><code>filter(min_rank(desc(lifeExp)) &lt; 2 | min_rank(lifeExp) &lt; 2)</code></pre>
<p>These two sets of ranks are formed on-the-fly, within year group, and <code>filter()</code> retains rows with rank less than 2, which means … the row with rank = 1. Since we do for ascending and descending ranks, we get both the min and the max.</p>
<p>If we had wanted just the min OR the max, an alternative approach using <code>top_n()</code> would have worked.</p>
<pre class="r"><code>my_gap %&gt;%
  filter(continent == &quot;Asia&quot;) %&gt;%
  select(year, country, lifeExp) %&gt;%
  arrange(year) %&gt;%
  group_by(year) %&gt;%
  #top_n(1, wt = lifeExp)        ## gets the min
  top_n(1, wt = desc(lifeExp)) ## gets the max</code></pre>
<pre><code>## # A tibble: 12 x 3
## # Groups:   year [12]
##     year country     lifeExp
##    &lt;int&gt; &lt;fct&gt;         &lt;dbl&gt;
##  1  1952 Afghanistan    28.8
##  2  1957 Afghanistan    30.3
##  3  1962 Afghanistan    32.0
##  4  1967 Afghanistan    34.0
##  5  1972 Afghanistan    36.1
##  6  1977 Cambodia       31.2
##  7  1982 Afghanistan    39.9
##  8  1987 Afghanistan    40.8
##  9  1992 Afghanistan    41.7
## 10  1997 Afghanistan    41.8
## 11  2002 Afghanistan    42.1
## 12  2007 Afghanistan    43.8</code></pre>
</div>
</div>
<div id="grand-finale" class="section level3">
<h3>Grand Finale</h3>
<p>So let’s answer that “simple” question: which country experienced the sharpest 5-year drop in life expectancy? Recall that this excerpt of the Gapminder data only has data every five years, e.g. for 1952, 1957, etc. So this really means looking at life expectancy changes between adjacent timepoints.</p>
<p>At this point, that’s just too easy, so let’s do it by continent while we’re at it.</p>
<pre class="r"><code>my_gap %&gt;%
  select(country, year, continent, lifeExp) %&gt;%
  group_by(continent, country) %&gt;%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %&gt;% 
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %&gt;% 
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %&gt;% 
  arrange(worst_le_delta)</code></pre>
<pre><code>## # A tibble: 5 x 3
## # Groups:   continent [5]
##   continent country     worst_le_delta
##   &lt;fct&gt;     &lt;fct&gt;                &lt;dbl&gt;
## 1 Africa    Rwanda             -20.4  
## 2 Asia      Cambodia            -9.10 
## 3 Americas  El Salvador         -1.51 
## 4 Europe    Montenegro          -1.46 
## 5 Oceania   Australia            0.170</code></pre>
<p>Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.</p>
<p>Break the code into pieces, starting at the top, and inspect the intermediate results. That’s certainly how I was able to <em>write</em> such a thing. These commands do not <a href="http://tinyurl.com/athenaforehead">leap fully formed out of anyone’s forehead</a> – they are built up gradually, with lots of errors and refinements along the way. I’m not even sure it’s a great idea to do so much manipulation in one fell swoop. Is the statement above really hard for you to read? If yes, then by all means break it into pieces and make some intermediate objects. Your code should be easy to write and read when you’re done.</p>
<p>In later tutorials, we’ll explore more of dplyr, such as operations based on two datasets.</p>
</div>
<div id="resources" class="section level3">
<h3>Resources</h3>
<p><code>dplyr</code> official stuff</p>
<ul>
<li>package home <a href="http://cran.r-project.org/web/packages/dplyr/index.html">on CRAN</a>
<ul>
<li>note there are several vignettes, with the <a href="http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html">introduction</a> being the most relevant right now</li>
<li>the <a href="http://cran.rstudio.com/web/packages/dplyr/vignettes/window-functions.html">one on window functions</a> will also be interesting to you now</li>
</ul></li>
<li>development home <a href="https://github.com/hadley/dplyr">on GitHub</a></li>
<li><a href="https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a">tutorial HW delivered</a> (note this links to a DropBox folder) at useR! 2014 conference</li>
</ul>
<p><a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">RStudio Data Wrangling cheatsheet</a>, covering <code>dplyr</code> and <code>tidyr</code>. Remember you can get to these via <em>Help &gt; Cheatsheets.</em></p>
<p><a href="http://r4ds.had.co.nz/transform.html">Data transformation</a> chapter of <a href="http://r4ds.had.co.nz">R for Data Science</a></p>
<p><a href="https://github.com/tjmahr/MadR_Pipelines">Excellent slides</a> on pipelines and <code>dplyr</code> by TJ Mahr, talk given to the Madison R Users Group.</p>
<p>Blog post <a href="http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/">Hands-on dplyr tutorial for faster data manipulation in R</a> by Data School, that includes a link to an R Markdown document and links to videos</p>
<p><a href="bit001_dplyr-cheatsheet.html">Cheatsheet</a> I made for <code>dplyr</code> join functions (not relevant yet but soon)</p>
</div>

<div class="footer">
This work is licensed under the  <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>


</div>

<script>

// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
  $('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
  bootstrapStylePandocTables();
});


</script>

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>