-
Notifications
You must be signed in to change notification settings - Fork 153
/
Copy pathboxplot.Rmd
executable file
·213 lines (150 loc) · 8.36 KB
/
boxplot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
# Chart: Boxplot {#box}
```{r, include=FALSE}
knitr::opts_chunk$set(fig.align = 'center', out.width = '60%')
```

## tl;dr
I want a nice example and I want it NOW!
Here's a look at the weights of newborn chicks split by the feed supplement they received:
```{r tldr-boxplot, echo=FALSE, out.width='75%'}
library(ggplot2)
# boxplot by feed supplement
ggplot(chickwts, aes(x = reorder(feed, -weight, median), y = weight)) +
# plotting
geom_boxplot(fill = "#cc9a38", color = "#473e2c") +
# formatting
ggtitle("Casein Makes You Fat?!",
subtitle = "Boxplots of Chick Weights by Feed Supplement") +
labs(x = "Feed Supplement", y = "Chick Weight (g)", caption = "Source: datasets::chickwts") +
theme_grey(16) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
```
And here's the code:
```{r, ref.label="tldr-boxplot", eval=FALSE}
```
For more info on this dataset, type `?datasets::chickwts` into the console.
## Simple examples
Okay...*much* simpler please.
### Single boxplots
Base R will give you a quick boxplot of a vector or a single column of a data frame with very little typing:
```{r, fig.align = 'center', out.width='60%'}
# vector
boxplot(rivers)
```
Or, the horizontal version:
```{r}
# single column of a data frame
boxplot(chickwts$weight, horizontal = TRUE)
```
Creating a single boxplot in **ggplot2** is somewhat problematic. (The joke is that it's the package author's way of saying that if you only have one group, make a histogram instead!)
If you only include one aesthetic mapping, it will be assumed to the `x` (group) variable and you will get an error:
```{r, eval=FALSE}
ggplot(chickwts, aes(weight)) + geom_boxplot()
```
```
Error: stat_boxplot requires the following missing aesthetics: y
```
This can be remedied by adding `y = ` to indicate that `weight` is the numeric variable, but you'll still get a meaningless x-axis:
```{r}
ggplot(chickwts, aes(y = weight)) +
geom_boxplot() +
theme_grey(16) # make all font sizes larger (default is 11)
```
Another, cleaner approach is to create a name for the single group as the `x` aesthetic and remove the x-axis label:
```{r}
ggplot(chickwts, aes(x = "all 71 chickens", y = weight)) +
geom_boxplot() + xlab("") + theme_grey(16)
```
### Multiple boxplots using ggplot2
To create multiple boxplots with **ggplot2**, your data frame needs to be tidy, that is you need to have a column with levels of the grouping variable. It can be be factor, character, or integer class.
```{r}
str(chickwts)
```
We see that `chickwts` is in the right form: we have a `feed` column with six factor levels, so we can set the the `x` aesthetic to `feed`. We also order the boxplots by *decreasing median weight*:
```{r}
ggplot(chickwts, aes(x = reorder(feed, -weight, median), y = weight)) +
geom_boxplot() +
xlab("feed type") +
theme_grey(16)
```
Data frames that contain a separate column of values for each desired boxplot must be tidied first. (For more detail on using `tidy::gather()`, see [this tutorial](https://github.com/jtr13/codehelp/blob/master/R/gather.md){target="_blank"}.)
```{r, message=FALSE}
library(tidyverse)
head(attitude)
tidyattitude <- attitude %>% gather(key = "question", value = "rating")
head(tidyattitude)
```
Now we're ready to plot:
```{r}
ggplot(tidyattitude, aes(reorder(question, -rating, median), rating)) +
geom_boxplot() +
xlab("question short name") +
theme_grey(16)
```
## Theory
Here's a quote by Hadley Wickham that sums up boxplots nicely:
>The boxplot is a compact distributional summary, displaying less detail than a histogram or kernel density, but also taking up less space. Boxplots use robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. They are particularly useful for comparing distributions across groups. - Hadley Wickham
>
Another important use of the boxplot is in showing outliers. A boxplot shows how much of an outlier a data point is with quartiles and fences. Use the boxplot when you have data with outliers so that they can be exposed. What it lacks in specificity it makes up with its ability to clearly summarize large data sets.
* For more info about boxplots and continuous variables, check out [Chapter 3](http://www.gradaanwr.net/content/03-examining-continuous-variables/){target="_blank"} of the textbook.
## When to use
Boxplots should be used to display *continuous variables*. They are particularly useful for identifying outliers and comparing different groups.
*Aside*: Boxplots may even help you [convince someone you are their outlier](https://xkcd.com/539/){target="_blank"} (If you like it when people over-explain jokes, [here is why that comic is funny](https://www.explainxkcd.com/wiki/index.php/539:_Boyfriend){target="_blank"}.).
## Considerations
### Flipping orientation
Often you want boxplots to be horizontal. Super easy to do in **ggplot2**: just tack on `+ coord_flip()` and remove the `-` from the reordering so that the factor level with the highest median will be on top:
```{r flip-boxplot}
ggplot(tidyattitude, aes(reorder(question, rating, median), rating)) +
geom_boxplot() +
coord_flip() +
xlab("question short name") +
theme_grey(16)
```
Note that switching `x` and `y` insteading of using `coord_flip()` doesn't work!
```{r}
ggplot(tidyattitude, aes(rating, reorder(question, rating, median))) +
geom_boxplot() +
ggtitle("This is not what we wanted!") +
ylab("question short name") +
theme_grey(16)
```
### NOT for categorical data
Boxplots are great, but they do NOT work with categorical data. Make sure your variable is continuous before using boxplots.
The data in this example are variables from the `pisaitems` dataset in the **likert** package with ratings of 1, 2, 3 or 4:
```{r, echo = FALSE, message=FALSE}
library(likert) # data
library(dplyr) # data manipulation
# load/format data
data(pisaitems)
pisa <- pisaitems[1:100, 2:7] %>%
dplyr::mutate_all(as.integer) %>%
dplyr::filter(complete.cases(.))
```
```{r}
head(pisa, 4)
```
Creating a boxplot from this data is a good example of what *not* to do:
```{r bad-boxplots, message=FALSE, echo = FALSE, out.width='75%'}
# create theme
theme <- theme_grey(16) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
# create plot
plot <- ggplot(stack(pisa), aes(x = ind, y = values)) +
geom_boxplot(fill = "#9B3535") +
ggtitle("Don't Plot Boxplots of Categorical Variables",
subtitle = "...seriously don't. Here, I'll make it red so it looks scary:") +
labs(x = "Assessment Code", y = "Values", caption = "Source: likert::pisaitems")
# bad boxplot
plot + theme
```
## External resources
- Tukey, John W. 1977. [*Exploratory Data Analysis.*](https://clio.columbia.edu/catalog/136422){target="_blank"} Addison-Wesley. (Chapter 2): the primary source in which boxplots are first presented.
- [Article on boxplots with ggplot2](http://t-redactyl.io/blog/2016/04/creating-plots-in-r-using-ggplot2-part-10-boxplots.html){target="_blank"}: An excellent collection of code examples on how to make boxplots with `ggplot2`. Covers layering, working with legends, faceting, formatting, and more. If you want a boxplot to look a certain way, this article will help.
- [Boxplots with plotly package](https://plot.ly/ggplot2/box-plots/){target="_blank"}: boxplot examples using the `plotly` package. These allow for a little interactivity on hover, which might better explain the underlying statistics of your plot.
- [ggplot2 Boxplot: Quick Start Guide](http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization){target="_blank"}: Article from [STHDA](http://www.sthda.com/english/){target="_blank"} on making boxplots using ggplot2. Excellent starting point for getting immediate results and custom formatting.
- [ggplot2 cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf){target="_blank"}: Always good to have close by.
- [Hadley Wickhan and Lisa Stryjewski on boxplots](http://vita.had.co.nz/papers/boxplots.pdf){target="_blank"}: good for understanding basics of more complex boxplots and some of the history behind them.