Skip to content

Commit

Permalink
Merge pull request #78 from carpentries-incubator/peter_edits
Browse files Browse the repository at this point in the history
added identify the problem episode for review, Dec2024. Changed to Rm…
  • Loading branch information
xortizross authored Dec 3, 2024
2 parents b2ef676 + c7d87b9 commit 66501ae
Show file tree
Hide file tree
Showing 3 changed files with 281 additions and 62 deletions.
5 changes: 2 additions & 3 deletions config.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@

#------------------------------------------------------------
# Values for this lesson.
#------------------------------------------------------------
Expand Down Expand Up @@ -61,7 +62,7 @@ contact: '[email protected]'
episodes:
- 1-intro-reproducible-examples.md
- 2-understanding-your-code.md
- 3-identify-the-problem.md
- 3-identify-the-problem.Rmd
- 4-minimal-reproducible-data.Rmd
- 5-minimal-reproducible-code.Rmd
- 6-asking-your-question.md
Expand All @@ -79,5 +80,3 @@ profiles:
#
# This space below is where custom yaml items (e.g. pinning
# sandpaper and varnish versions) should live


279 changes: 279 additions & 0 deletions episodes/3-identify-the-problem.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,279 @@

---
title: "Identify the problem and make a plan"
teaching: 0
exercises: 0
---

:::::::::::::::::::::::::::::::::::::: questions

- What do I do when I encounter an error?
- What do I do when my code outputs something I don’t expect?
- Why do errors and warnings appear in R?
- Which areas of code are responsible for errors?
- How can I fix my code? What other options exist if I can't fix it?


::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

After completing this episode, participants should be able to...

- decode/describe what an error message is trying to communicate
- Identify specific lines and/or functions generating the error message
- Lookup function syntax, use, and examples using R Documentation (?help calls)
- Describe a general category of error message (e.g. syntax error, semantic errors, package-specific errors, etc.) # be more explicit about semantic errors?
- Describe the output of code you are seeking. ## identify relevant warnings or code output
- Identify and quickly fix commonly-encountered R errors ###### what was I thinking here
- Identify which problems are better suited for asking for further help, including online help and reprex

::::::::::::::::::::::::::::::::::::::::::::::::


```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
# Read in the data
rodents <- read_csv("../scripts/data/surveys_complete_77_89.csv")
```


As simple as it may seem, the first step we'll cover is what to do when encountering an error or other undesired output from your code. It is our experience that many seemingly-impossible errors can be fixed on the route to create a reproducible example for an expert helper. With this episode, we hope to teach you the basics about identifying how your error might be occurring, and how to isolate the problem for others to look at.


## 3.1 What do I do when I encounter an error?

Something that may go wrong is an error in your code. By this, we mean any code which generates an error message. This happens when R is unable to run your code, for a variety of reasons: some common ones include R being unable to read or interpret your commands, expecting different input types than those inputted, and user-written errors from checks you or other package creators have added to ensure your code is running correctly.

The accompanying error message attempts to tell you exactly how your code failed. For example, consider the following error message that occurs when I run this command in the R console:

```{r,error=T}
ggplot(x = taxa) + geom_bar()
```

Though we know somewhere there is an object called `taxa` (it is actually a column of the dataset `rodents`), R is trying to communicate that it cannot find any such object in the local environment. Let's try again, appropriately pointing ggplot to the `rodents` dataset and `taxa` column using the `$` operator.

```{r, error=T}
ggplot(aes(x = rodents$taxa)) + geom_bar()
```


Whoops! Here we see another error message -- this time, R responds with a perhaps more-uninterpretable message.

Let's go over each part briefly. First, we see an error from a function called `fortify`, which we didn't even call! Then a much more helpful informational message: Did we accidentally pass `aes()` to the `data` argument? This does seem to relate to our function call, as we do pass `aes`, potentially where our data should go. A helpful starting place when attempting to decipher an error message is checking the documentation for the function which caused the error:

` ?ggplot`

Here, a Help window pops up in RStudio which provides some more information. Skipping the general description at the top, we see ggplot takes positional arguments of `data`, **then** `mapping`, which uses the `aes` call. We can see in "Arguments" that the `aes(x = rodents$taxa)` object used in the plot is attempted by `fortify` to be converted to a data.frame: now the picture is clear! We accidentally passed our `mapping` argument (telling ggplot how to map variables to the plot) into the position it expected `data` in the form of a data frame. And if we scroll down to "Examples", to "Pattern 1", we can see exactly how ggplot expects these arguments in practice. Let's amend our result:

```{r}
ggplot(rodents, aes(x = taxa)) + geom_bar()
```


Here we see our desired plot.



## Summary

In general, when encountering an error message for which a remedy is not immediately apparent, some steps to take include:

1. Reading each part of the error message, to see if we can interpret and act on any.

2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).

3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

4. Copying and pasting the error message into a search engine for more interpretable explanations.

And, when all else fails, preparing our code into a reproducible example for expert help.


## 3.2 What do I do when my code outputs something I don't expect

Another type of 'error' which you may encounter is when your R code runs without errors, but does not produce the desired output. You may sometimes see these called "semantic errors" (as opposed to syntax errors, though these term themselves are vague within computer science and describe a variety of different scenarios). As with actual R errors, semantic errors may occur for a variety of non-intuitive reasons, and are often harder to solve as there is no description of the error -- you must work out where your code is defective yourself!

With our rodent analysis, the next step in the plan is to subset to just the `Rodent` taxa (as opposed to other taxa: Bird, Rabbit, Reptile or NA). Let's quickly check to see how much data we'd be throwing out by doing so:

```{r}
table(rodents$taxa)
```

We're interested in the Rodents, and thankfully it seems like a majority of our observations will be maintained when subsetting to rodents. Except wait. In our plot, we can clearly see the presence of NA values. Why are we not seeing them here? This is an example of a semantic error -- our command is executed correctly, but the output is not everything we intended. Having no error message to interpret, let's jump straight to the documentation:

```{r}
?table
```

Here, the documentation provides some clues: there seems to be an argument called `useNA` that accepts "no", "ifany", and "always", but it's not immediately apparent which one we should use. As a second approach, let's go to `Examples` to see if we can find any quick fixes. Here we see a couple lines further down:

```r
table(a) # does not report NA's
table(a, exclude = NULL) # reports NA's
```

That seems like it should be inclusive. Let's try again:

```{r}
table(rodents$taxa, exclude = NULL)
```
Now, we do see that by subsetting to the "Rodent" taxa, we are losing about 357 NAs, which themselves could be rodents! However, in this case, it seems a small enough portion to safely omit. Let's subset our data to the rodent taxa

```{r}
rodents <- rodents %>% filter(taxa == "Rodent")
```

## Summary

In general, when encountering a semantic error for which a remedy is not immediately apparent, some steps to take include:

1. Reading any warning or informational messages that may pop up when executing your code.

2. Changing the input to your function call to see if the behavior is ...

2. Pulling up the R Documentation for that function (which may be specific to a package, such as with ggplot).

3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

And, when all else fails, preparing our code into a reproducible example for expert help. Note, there are fewer options available as when an error message prevents your code from running. You may find yourself isolating and reproducing your problem more often with semantic errors as easily solvable syntax errors.


::::::::::::::::::::: callout

Generally, the more your code deviates from just using base R functions, or the more you use specific packages, both the quality of documentation and online help available from search engines and Googling gets worse and worse. While base R errors will often be solvable in a couple of minutes from a quick `?help` check or a long online discussion and solutions on a website like Stack Overflow, errors arising from little-used packages applied in bespoke analyses might merit isolating your specific problem to a reproducible example for online help, or even getting in touch with the developers! Such community input and questions are often the way packages and documentation improves over time.

:::::::::::::::::::::

## 3.3 How can I find where my code is failing?

Isolating your problem may not be as simple as assessing the output from a single function call on the line of code which produces your error. Often, it may be difficult to determine which line(s) in your code are producing the error.

Consider the example below, where we now are attempting to see how which species of kangaroo rodents appear in different plot types over the years.


```{r, error=T}
krats <- rodents %>% filter(genus == "Dipadomys") #kangaroo rat genus
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
```

Uh-oh. Another error here, this time in "combine_vars?" What is that? "Faceting variables must have at least one value": What does that mean?

Well it may be clear enough that we seem to be missing "species" values where we intend. Maybe we can try to make a different graph looking at what species values are present? Or perhaps there's an error earlier -- our safest approach may actually be seeing what krats looks like:


```{r}
krats
```
It's empty! What went wrong with our "Dipadomys" filter?

```{r}
rodents %>% count(genus)
```
We see two things here. For one, we've misspelled Dipodomys, which we can now amend. This quick function call also tells us we should expect a data frame with 9573 values resulting after subsetting to the genus Dipodomys.

```{r}
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
ggplot(krats, aes(year, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
```


Our improved code here looks good. A quick "dim" call confirms we now have all the Dipodomys observations, and our plot is looking better. In general, having a 'print' statement or some other output before plots or other major steps can be a good way to check your code is producing intermediate results consistent with your expectations.

However, there's something funky happening here. The bins are definitely weirdly spaced -- we can see some bins are not filled with any observations, while those exactly containing one of the integer years happens to contain all the observations for that year.

:::::::: challenge

As a group, name some potential problems or undesired outcomes from this graph...

::::::::

:::::::: solution

- The graph looks sparse, and unevenly so -- many bins have no observations
- Suggests that some years had more observations and others fewer based on somewhat arbitrary measurements (i.e. what calendar year happened to fall on)
- Hard to compare trends across time, or even subsequent years...

::::::::

As we discussed in the challenge, there are some issues to visualizing our data this way. A solution here might be to tinker with the bin width in the histogram code, but let's step back a minute. Do we necessarily need to dive into the depths of tinkering with the plot? We can evalulate this problem not in terms of the plot having a problem, but with our data type having a problem. There's an opportunity to encode the observation times outside of coarse, somewhat arbitrary year groupings with the real, interpretable date they were collected. Let's do that using the tidyverse's 'lubridate' package. The important details here are that we are creating a 'datetime'-type variable using the recorded years, months, and days, which are currently all encoded as numeric types.

```{r}
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))
ggplot(krats, aes(date, fill=plot_type)) +
geom_histogram() +
facet_wrap(~species)
```

This looks much better, and is easier to see the trends over time as well. Note our x axis still shows bins with year labelings, but the continuous spread of our data over these bins shows that dates are treated more continuously and fall more continuously within histogram bins.

:::::::: callout
One aspect we can see with this exercise above is that by setting up a reproducible example, we can isolate the problem with data rather than simply asking a proximal problem (i.e. 'how can i change my plot to look like so'). This allows helpers and you to directly improve your code, but also allows the community to help in identifying the problem. You don't always need to understand what exact lines of code or function calls are going wrong in order to get help!
::::::::


## Summary

In general, we need to isolate the specific areas of code causing the bug or problem. There is no general rule of thumb as to how large this needs to be, but in general, think about what we would want to include in a reprex. Any early lines which we know run correctly and as intended may not need to be included, and we should seek to isolate the problem area as much as we can to make it understandable to others.

Let's add to our list of steps for identifying the problem:

0. Identify the problem area -- add print statements immediately upstream or downstream of problem areas, step into functions, and see whether any intermediate output can be further isolated.

1. Read each part of the error or warning message (if applicable), to see if we can immediately interpret and act on any.

2. Pulling up the R Documentation for any function calls causing the error (which may be specific to a package, such as with ggplot).

3. Reading through the documentation's Description, Usage, Arguments, Details and Examples entries for greater insight into our error.

4. Copying and pasting the error message into a search engine for more interpretable explanations.

And, when all else fails, preparing our code into a reproducible example for expert help.

Whereas before we had a list of steps for addressing spot problems arising in one or two lines, we can now organize identifying the problem into a more organizational workflow. Any step in the above that helps us identify the specific areas or aspects of our code that are failing in particular, we can zoom in on and restart the checklist. We can stop as soon as we don't understand anymore how our code fails, at which point we can excise that area for further help.




Finally, let's make our plot publication-ready by changing some aesthetics. Let's also add a vertical line to show when the study design changed on the exclosures.

```{r}
krats <- rodents %>% filter(genus == "Dipodomys") #kangaroo rat genus
dim(krats)
krats <- krats %>% mutate(date = lubridate::ymd(paste(year,month,day,sep='-')))
krats %>%
ggplot(aes(x = date, fill = plot_type)) +
geom_histogram()+
facet_wrap(~species, ncol = 1)+
theme_bw()+
scale_fill_viridis_d(option = "plasma")+
geom_vline(aes(xintercept = lubridate::ymd("1988-01-01")), col = "dodgerblue")
```

It looks like the study change helped to reduce merriami sightings in the Rodent and Short-term Krat exclosures.



59 changes: 0 additions & 59 deletions episodes/3-identify-the-problem.md

This file was deleted.

0 comments on commit 66501ae

Please sign in to comment.