carpentries-incubator · xortizross · Dec 3, 2024 · Dec 3, 2024 · Dec 3, 2024
diff --git a/.DS_Store b/.DS_Store
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -1,8 +1,13 @@
-If this pull request addresses an open issue on the repository, please add 'Closes #NN' below, where NN is the issue number.
+----- TEMPLATE: PLEASE FILL IN ------
+# Description
 
-Please briefly summarise the changes made in the pull request, and the reason(s) for making these changes.
+*Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.*
 
-If any relevant discussions have taken place elsewhere, please provide links to these.
+Fixes # [ISSUE NUMBER]
 
-For review requests pertaining to specific episodes, please tag `xortizross` for lesson 4 (minimal reproducible data), `peterlaurin` for 
-lesson 3 (identify the problem) and `kaijagahm` for lesson 1 and 6. For other lessons, feel free to tag whomever. 
+# Tag maintainers
+*Please tag the appropriate maintainer, depending on which episode your change pertains to*
+**Identify the Problem**: tag @peterlaurin
+**Minimal Reproducible Data**: tag @xortizross
+**Minimal Reproducible Code**: tag @kaijagahm
+**Anything else**: use your judgment!
diff --git a/config.yaml b/config.yaml
@@ -1,3 +1,4 @@
+
 #------------------------------------------------------------
 # Values for this lesson.
 #------------------------------------------------------------
@@ -58,26 +59,24 @@ contact: '[email protected]'
 # - another-learner.md
 
 # Order of episodes in your lesson
-episodes: 
+episodes:
 - 1-intro-reproducible-examples.md
 - 2-understanding-your-code.md
-- 3-identify-the-problem.md
+- 3-identify-the-problem.Rmd
 - 4-minimal-reproducible-data.Rmd
-- 5-minimal-reproducible-code.md
+- 5-minimal-reproducible-code.Rmd
 - 6-asking-your-question.md
 
 # Information for Learners
-learners: 
+learners:
 
 # Information for Instructors
-instructors: 
+instructors:
 
 # Learner Profiles
-profiles: 
+profiles:
 
 # Customisation ---------------------------------------------
 #
 # This space below is where custom yaml items (e.g. pinning
 # sandpaper and varnish versions) should live
-
-
diff --git a/episodes/.DS_Store b/episodes/.DS_Store
diff --git a/episodes/2-identify-the-problem.Rmd b/episodes/2-identify-the-problem.Rmd
@@ -0,0 +1,279 @@
+
+---
+title: "Identify the problem and make a plan"
+teaching: 0
+exercises: 0
+---
+
+:::::::::::::::::::::::::::::::::::::: questions 
+
+- What do I do when I encounter an error?
+- What do I do when my code outputs something I don’t expect?
+- Why do errors and warnings appear in R? 
+- Which areas of code are responsible for errors? 
+- How can I fix my code? What other options exist if I can't fix it? 
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: objectives
+
+After completing this episode, participants should be able to...
+
+- decode/describe what an error message is trying to communicate
+- Identify specific lines and/or functions generating the error message
+- Lookup function syntax, use, and examples using R Documentation (?help calls)
+- Describe a general category of error message (e.g. syntax error, semantic errors, package-specific errors, etc.) # be more explicit about semantic errors?
+- Describe the output of code you are seeking. ## identify relevant warnings or code output
+- Identify and quickly fix commonly-encountered R errors ###### what was I thinking here
+- Identify which problems are better suited for asking for further help, including online help and reprex
+
+::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+```{r}
+library(readr)
+library(dplyr)
+library(ggplot2)
+library(stringr)
+
+# Read in the data
+rodents <- read_csv("data/surveys_complete_77_89.csv")
+
+```
+
+
+As simple as it may seem, the first step we'll cover is what to do when encountering an error or other undesired output from your code. It is our experience that many seemingly-impossible errors can be fixed on the route to create a reproducible example for an expert helper. With this episode, we hope to teach you the basics about identifying how your error might be occurring, and how to isolate the problem for others to look at. 
+
+
+## 3.1 What do I do when I encounter an error? 
+
+Something that may go wrong is an error in your code. By this, we mean any code which generates an error message. This happens when R is unable to run your code, for a variety of reasons: some common ones include R being unable to read or interpret your commands, expecting different input types than those inputted, and user-written errors from checks you or other package creators have added to ensure your code is running correctly. 
+
+The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:
+
+```{r,error=T}
+ggplot(x = taxa) + geom_bar()
+```
+
+Though we know somewhere there is an object called `taxa` (it is actually a column of the dataset `rodents`), R is trying to communicate that it cannot find any such object in the local environment. Let's try again, appropriately pointing ggplot to the `rodents` dataset and `taxa` column using the `$` operator. 
+
+```{r, error=T}
+ggplot(aes(x = rodents$taxa)) + geom_bar()
+```
+
+
+ Whoops! Here we see another error message -- this time, R responds with a perhaps more-uninterpretable message.
+
+Let's go over each part briefly. First, we see an error from a function called `fortify`, which we didn't even call! Then a much more helpful informational message: Did we accidentally pass `aes()` to the `data` argument? This does seem to relate to our function call, as we do pass `aes`, potentially where our data should go. A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:
+
+` ?ggplot`
+
+Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see ggplot takes positional arguments of `data`, **then** `mapping`, which uses the `aes` call. We can see in "Arguments" that the `aes(x = rodents$taxa)` object used in the plot is attempted by `fortify` to be converted to a data.frame: now the picture is clear! We accidentally passed our `mapping` argument (telling ggplot how to map variables to the plot) into the position it expected `data` in the form of a data frame. And if we scroll down to "Examples", to "Pattern 1", we can see exactly how ggplot expects these arguments in practice. Let's amend our result:
+
+```{r} 
+ggplot(rodents, aes(x = taxa)) + geom_bar()
+```
+
+
+Here we see our desired plot. 
+
+
+
+## Summary
+
+In general, when encountering an error message for which a remedy is not immediately apparent, some steps to take include:
+
+1. Reading each part of the error message, to see if we can interpret and act on any.
+
+2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).
+
+3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error. 
+
+4. Copying and pasting the error message into a search engine for more interpretable explanations.
+
+And, when all else fails, preparing our code into a reproducible example for expert help.
+
+
+## 3.2 What do I do when my code outputs something I don't expect
+
+Another type of 'error' which you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called "semantic errors" (as opposed to syntax errors, though these term themselves are vague within computer science and describe a variety of different scenarios). As with actual R errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error -- you must work out where your code is defective yourself! 
+
+With our rodent analysis, the next step in the plan is to subset to just the `Rodent` taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let's quickly check to see how much data we'd be throwing out by doing so:
+
+```{r}
+table(rodents$taxa)
+```
+
+We're interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. Except wait. In our plot, we can clearly see the presence of NA values. Why are we not seeing them here? This is an example of a semantic error -- our command is executed correctly, but the output is not everything we intended. Having no error message to interpret, let's jump straight to the documentation:
+
+```{r}
+?table
+```
+
+Here, the documentation provides some clues: there seems to be an argument called `useNA` that accepts "no", "ifany", and "always", but it's not immediately apparent which one we should use. As a second approach, let's go to `Examples` to see if we can find any quick fixes. Here we see a couple lines further down: 
+
+```r
+table(a)                 # does not report NA's
+table(a, exclude = NULL) # reports NA's
+```
+
+That seems like it should be inclusive. Let's try again:
+
+```{r}
+table(rodents$taxa, exclude = NULL)
+```
+Now, we do see that by subsetting to the "Rodent" taxa, we are losing about 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let's subset our data to the rodent taxa
+
+```{r}
+rodents <- rodents %>% filter(taxa == "Rodent")
+```
+
+## Summary
+
+In general, when encountering a semantic error for which a remedy is not immediately apparent, some steps to take include:
+
+1. Reading any warning or informational messages that may pop up when executing your code.
+
+2. Changing the input to your function call to see if the behavior is ... 
+
+2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).
+
+3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error. 
+
+And, when all else fails, preparing our code into a reproducible example for expert help. Note, there are fewer options available as when an error message prevents your code from running. You may find yourself isolating and reproducing your problem more often with semantic errors as easily solvable syntax errors. 
+
+
+::::::::::::::::::::: callout
+
+Generally, the more your code deviates from just using base R functions, or the more you use specific packages, both the quality of documentation and online help available from search engines and Googling gets worse and worse. While base R errors will often be solvable in a couple of minutes from a quick `?help` check or a long online discussion and solutions on a website like Stack Overflow, errors arising from  little-used packages applied in bespoke analyses might merit isolating your specific problem to a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often the way packages and documentation improves over time. 
+
+:::::::::::::::::::::
+
+## 3.3 How can I find where my code is failing? 
+
+Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which line(s) in your code are producing the error. 
+
+Consider the example below, where we now are attempting to see how which species of kangaroo rodents appear in different plot types over the years. 
+
+
+```{r, error=T}
+krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus
+
+ggplot(krats, aes(year, fill=plot_type)) + 
+geom_histogram() +
+facet_wrap(~species)
+
+```
+
+Uh-oh. Another error here, this time in "combine_vars?" What is that? "Faceting variables must have at least one value": What does that mean?
+
+Well it may be clear enough that we seem to be missing "species" values where we intend. Maybe we can try to make a different graph looking at what species values are present? Or perhaps there's an error earlier -- our safest approach may actually be seeing what krats looks like:
+
+
+```{r}
+krats
+```
+It's empty! What went wrong with our "Dipadomys" filter?
+
+```{r}
+rodents %>% count(genus)
+```
+We see two things here. For one, we've misspelled Dipodomys, which we can now amend. This quick function call also tells us we should expect a data frame with 9573 values resulting after subsetting to the genus Dipodomys.
+
+```{r}
+krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
+dim(krats)
+
+ggplot(krats, aes(year, fill=plot_type)) + 
+geom_histogram() +
+facet_wrap(~species)
+
+```
+
+
+Our improved code here looks good. A quick "dim" call confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a 'print' statement or some other output before plots or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations. 
+
+However, there's something funky happening here. The bins are definitely weirdly spaced -- we can see some bins are not filled with any observations, while those exactly containing one of the integer years happens to contain all the observations for that year. 
+
+:::::::: challenge 
+
+As a group, name some potential problems or undesired outcomes from this graph...
+
+::::::::
+
+:::::::: solution
+
+ - The graph looks sparse, and unevenly so -- many bins have no observations
+ - Suggests that some years had more observations and others fewer based on somewhat arbitrary measurements (i.e. what calendar year happened to fall on)
+ - Hard to compare trends across time, or even subsequent years...
+
+::::::::
+
+As we discussed in the challenge, there are some issues to visualizing our data this way. A solution here might be to tinker with the bin width in the histogram code, but let's step back a minute. Do we necessarily need to dive into the depths of tinkering with the plot? We can evalulate this problem not in terms of the plot having a problem, but with our data type having a problem. There's  an opportunity to encode the observation times outside of coarse, somewhat arbitrary year groupings with the real, interpretable date they were collected. Let's do that using the tidyverse's 'lubridate' package. The important details here are that we are creating a 'datetime'-type variable using the recorded years, months, and days, which are currently all encoded as numeric types.
+
+```{r}
+krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
+dim(krats)
+
+krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))
+
+ggplot(krats, aes(date, fill=plot_type)) + 
+geom_histogram() +
+facet_wrap(~species)
+
+```
+
+This looks much better, and is easier to see the trends over time as well. Note our x axis still shows bins with year labelings, but the continuous spread of our data over these bins shows that dates are treated more continuously and fall more continuously within histogram bins.
+
+:::::::: callout 
+One aspect we can see with this exercise above is that by setting up a reproducible example, we can isolate the problem with data rather than simply asking a proximal problem (i.e. 'how can i change my plot to look like so'). This allows helpers and you to directly improve your code, but also allows the community to help in identifying the problem. You don't always need to  understand what exact lines of code or function calls are going wrong in order to get help!
+::::::::
+
+
+## Summary
+
+In general, we need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this needs to be, but in general, think about what we would want to include in a reprex. Any early lines which we know run correctly and as intended may not need to be included, and we should seek to isolate the problem area as much as we can to make it understandable to others. 
+
+Let's add to our list of steps for identifying the problem: 
+
+0. Identify the problem area -- add print statements immediately upstream or downstream of problem areas, step into functions, and see whether any intermediate output can be further isolated. 
+
+1. Read each part of the error or warning message (if applicable), to see if we can immediately interpret and act on any.  
+
+2. Pulling up the R Documentation for any function calls causing the error (which may be specific to a package, such as with ggplot).
+
+3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error. 
+
+4. Copying and pasting the error message into a search engine for more interpretable explanations.
+
+And, when all else fails, preparing our code into a reproducible example for expert help.
+
+Whereas before we had a list of steps for addressing spot problems arising in one or two lines, we can now organize identifying the problem into a more organizational workflow. Any step in the above that helps us identify the specific areas or aspects of our code that are failing in particular, we can zoom in on and restart the checklist. We can stop as soon as we don't understand anymore how our code fails, at which point we can excise that area for further help. 
+
+
+
+
+Finally, let's make our plot publication-ready by changing some aesthetics. Let's also add a vertical line to show when the study design changed on the exclosures.
+
+```{r}
+krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
+dim(krats)
+
+krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))
+
+
+krats %>%
+  ggplot(aes(x = date, fill = plot_type)) +
+  geom_histogram()+
+  facet_wrap(~species, ncol = 1)+ 
+  theme_bw()+
+  scale_fill_viridis_d(option = "plasma")+
+  geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue")
+
+```
+
+It looks like the study change helped to reduce merriami sightings in the Rodent and Short-term Krat exclosures. 
+
+
+
diff --git a/episodes/3-identify-the-problem.md b/episodes/3-identify-the-problem.md
diff --git a/episodes/4-minimal-reproducible-data.Rmd → episodes/3-minimal-reproducible-data.Rmd b/episodes/4-minimal-reproducible-data.Rmd → episodes/3-minimal-reproducible-data.Rmd