GGplot visualization - Part 1

Author

Tobi Gerstenberg, Edited by Bria Long & Martin Zettersten

Load packages

Let’s first load the packages that we need for this chapter. You can click on the green arrow to execute the code chunk below.

Code

library(knitr)
library(tidyverse) # for plotting (and many more cool things we'll discover later)

Setting up a plot

Let’s first get some data.

Code

df.diamonds = diamonds

The diamonds dataset comes with the ggplot2 package. We can get a description of the dataset by running the following command:

Code

?diamonds

Above, we assigned the diamonds dataset to the variable df.diamonds so that we can see it in the data explorer.

Let’s take a look at the full dataset by clicking on it in the explorer.

Tip: You can view a data frame by highlighting the text in the editor (or simply moving the mouse above the text), and then pressing F2.

The df.diamonds data frame contains information about almost 60,000 diamonds, including their price, carat value, size, etc. Let’s use visualization to get a better sense for this dataset.

We start by setting up the plot. To do so, we pass a data frame to the function ggplot() in the following way.

Code

ggplot(data = df.diamonds)

This, by itself, won’t do anything yet. We also need to specify what to plot.

Let’s take a look at how much diamonds of different color cost. The help file says that diamonds labeled D have the best color, and diamonds labeled J the worst color. Let’s make a bar plot that shows the average price of diamonds for different colors.

We do so via specifying a mapping from the data to the plot aesthetics with the function aes(). We need to tell aes() what we would like to display on the x-axis, and the y-axis of the plot.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price))

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "mean", #computes the mean
               geom = "bar") # plots it as a bar graph

Bar plot

Let’s make a bar graph:

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "bar")

Neat! Three lines of code produce an almost-publication-ready plot (to be published in the Proceedings of Unnecessary Diamonds)! Note how we used a + at the end of the first line of code to specify that there will be more. This is a very powerful idea underlying ggplot2. We can start simple and keep adding things to the plot step by step.

We used the stat_summary() function to define what we want to plot (the “mean”), and how (as a “bar” chart). Let’s take a closer look at that function.

Code

help(stat_summary)

Not the the easiest help file … We supplied two arguments to the function, fun = and geom =.

The fun argument specifies what function we’d like to apply to the data for each value of x. Here, we said that we would like to take the mean and we specified that as a string.
The geom (= geometric object) argument specifies how we would like to plot the result, namely as a “bar” plot.

Instead of showing the “mean”, we could also show the “median” instead.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "median", # can vary this function
               geom = "bar")

And instead of making a bar plot, we could plot some points.

Code

ggplot(df.diamonds,
       aes(x = color,
           y = price)) +
  stat_summary(fun = "mean",
               geom = "point") # this is my favorite way to look at data!

Tip: Take a look here to see what other geoms ggplot2 supports.

Somewhat surprisingly, diamonds with the best color (D) are not the most expensive ones. What’s going on here? We’ll need to do some more exploration to figure this out.

Setting the default plot theme

Before moving on, let’s set a different default theme for our plots. Personally, I’m not a big fan of the gray background and the white grid lines. Also, the default size of the text should be bigger. We can change the default theme using the theme_set() function like so:

Code

library(ggthemes)

theme_set(theme_few() + # set the theme
            theme(text = element_text(size = 20))) # set the default text size

From now on, all our plots will use what’s specified in theme_classic(), and the default text size will be larger, too. For any individual plot, we can still override these settings.

Scatter plot

I don’t know much about diamonds, but I do know that diamonds with a higher carat value tend to be more expensive. color was a discrete variable with seven different values. carat, however, is a continuous variable. We want to see how the price of diamonds differs as a function of the carat value. Since we are interested in the relationship between two continuous variables, plotting a bar graph won’t work. Instead, let’s make a scatter plot. Let’s put the carat value on the x-axis, and the price on the y-axis.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price)) +
  geom_point()

Cool! That looks sensible. Diamonds with a higher carat value tend to have a higher price. Our dataset has 53940 rows. So the plot actually shows 53940 circles even though we can’t see all of them since they overlap.

Let’s make some progress on trying to figure out why the diamonds with the better color weren’t the most expensive ones on average. We’ll add some color to the scatter plot in Figure 1. We color each of the points based on the diamond’s color. To do so, we pass another argument to the aesthetics of the plot via aes().

Code

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price,
                     color = color)) +
  geom_point()

Aha! Now we’ve got some color. Notice how in Figure 2 ggplot2 added a legend for us, thanks! We’ll see later how to play around with legends. Form just eye-balling the plot, it looks like the diamonds with the best color (D) tended to have a lower carat value, and the ones with the worst color (J), tended to have the highest carat values.

So this is why diamonds with better colors are less expensive – these diamonds have a lower carat value overall.

There are many other things that we can define in aes(). Take a quick look at the vignette:

Code

vignette("ggplot2-specs")

Practice plot 1

Make a scatter plot that shows the relationship between the variables depth (on the x-axis), and table (on the y-axis). Take a look at the description for the diamonds dataset so you know what these different variables mean. Your plot should look like the one shown in Figure 3).

Code

#enter your plot here

Code

include_graphics("figures/vis1_practice_plot1.png")

Advanced: A neat trick to get a better sense for the data here is to add transparency. Your plot should look like the one shown in Figure 4.

Code

# make advanced practice plot 1 here

Code

include_graphics("figures/vis1_practice_plot1a.png")

Line plot

What else do we know about the diamonds? We actually know the quality of how they were cut. The cut variable ranges from “Fair” to “Ideal”. First, let’s take a look at the relationship between cut and price. This time, we’ll make a line plot instead of a bar plot (just because we can).

Code

ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "line")

`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Oops! All we did is that we replaced x = color with x = cut, and geom = "bar" with geom = "line". However, the plot doesn’t look like expected (i.e. there is no real plot). What happened here? The reason is that the line plot needs to know which points to connect. The error message tells us that each group consists of only one observation. Let’s adjust the group aesthetic to fix this.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price,
                     group = 1)) +
  stat_summary(fun = "mean",
               geom = "line")

By adding the parameter group = 1 to mapping = aes(), we specify that we would like all the levels in x = cut to be treated as coming from the same group. The reason for this is that cut (our x-axis variable) is a factor (and not a numeric variable), so, by default, ggplot2 tries to draw a separate line for each factor level. We’ll learn more about grouping below (and about factors later).

Interestingly, there is no simple relationship between the quality of the cut and the price of the diamond. In fact, “Ideal” diamonds tend to be cheapest.

Adding error bars

We often don’t just want to show the means but also give a sense for how much the data varies. ggplot2 has some convenient ways of specifying error bars. Let’s take a look at how much price varies as a function of clarity (another variable in our diamonds data frame).

Code

ggplot(data = df.diamonds,
       mapping = aes(x = clarity,
                     y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange")

Relationship between diamond clarity and price. Error bars indicate 95% bootstrapped confidence intervals.

Here we have it. The average price of our diamonds for different levels of clarity together with bootstrapped 95% confidence intervals. How do we know that we have 95% confidence intervals? That’s what mean_cl_boot() computes as a default. Let’s take a look at that function:

Code

help(mean_cl_boot)

Note that I had to use the fun.data = argument here instead of fun = because the mean_cl_boot() function produces three data points for each value of the x-axis (the mean, lower, and upper confidence interval).

Order matters

The order in which we add geoms to a ggplot matters! Generally, we want to plot error bars before the points that represent the means. To illustrate, let’s set the color in which we show the means to “red”.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = clarity,
                     y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun = "mean",
               geom = "point",
               color = "red")

Figure 5: This figure looks good. Error bars and means are drawn in the correct order.

Figure 5 looks good.

Code

# I've changed the order in which the means and error bars are drawn.
ggplot(df.diamonds,
       aes(x = clarity,
           y = price)) +
  stat_summary(fun = "mean",
               geom = "point",
               color = "red") +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange")

Figure 6: This figure looks bad. Error bars and means are drawn in the incorrect order.

Figure 6 doesn’t look good. The error bars are on top of the points that represent the means.

One cool feature about using stat_summary() is that we did not have to change anything about the data frame that we used to make the plots. We directly used our raw data instead of having to make separate data frames that contain the relevant information (such as the means and the confidence intervals).

Let’s take a look at two more principles for plotting data that are extremely helpful: groups and facets. But before, another practice plot.

Practice plot 2

Make a bar plot that shows the average price of diamonds (on the y-axis) as a function of their clarity (on the x-axis). Also add error bars. Your plot should look like the one shown in Figure Figure 7.

Code

#your practice plot here

Code

include_graphics("figures/vis1_practice_plot2.png")

Advanced: Try to make the plot shown in Figure 8.

Code

# make advanced practice plot 2 here

Code

include_graphics("figures/vis1_practice_plot2a.png")

Grouping data

Grouping in ggplot2 is a very powerful idea. It allows us to plot subsets of the data – again without the need to make separate data frames first.

Let’s make a plot that shows the relationship between price and color separately for the different qualities of cut.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price,
                     group = cut)) +
  stat_summary(fun = "mean",
               geom = "line")

Well, we got some separate lines here but we don’t know which line corresponds to which cut. Let’s add some color!

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price,
                     group = cut,
                     color = cut)) +
  stat_summary(fun = "mean",
               geom = "line",
               linewidth = 1)

Nice! In addition to adding color, I’ve made the lines a little thicker here by setting the linewidth argument to 2.

Grouping is very useful for bar plots. Let’s take a look at how the average price of diamonds looks like taking into account both cut and color (I know – exciting times!). Let’s put the color on the x-axis and then group by the cut.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price,
                     group = cut,
                     color = cut)) +
  stat_summary(fun = "mean",
               geom = "bar")

That’s a fail! Several things went wrong here. All the bars are gray and only their outline is colored differently. Instead we want the bars to have a different color. For that we need to specify the fill argument rather than the color argument! But things are worse. The bars currently are shown on top of each other. Instead, we’d like to put them next to each other. Here is how we can do that:

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price,
                     group = cut,
                     fill = cut)) +
  stat_summary(fun = "mean",
               geom = "bar",
               position = position_dodge())

Neato! We’ve changed the color argument to fill, and have added the position = position_dodge() argument to the stat_summary() call. This argument makes it such that the bars are nicely dodged next to each other. Let’s add some error bars just for kicks.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price,
                     group = cut,
                     fill = cut)) +
  stat_summary(fun = "mean",
               geom = "bar",
               position = position_dodge(width = 0.9),
               color = "black") +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               position = position_dodge(width = 0.9))

Voila! Now with error bars. Note that we’ve added the width = 0.9 argument to position_dodge(). Somehow R was complaining when this was not defined for geom “linerange”. I’ve also added some outline to the bars by including the argument color = "black". I think it looks nicer this way.

So, still somewhat surprisingly, diamonds with the worst color (J) are more expensive than diamonds with the best color (D), and diamonds with better cuts are not necessarily more expensive.

Practice plot 3

Recreate the plot shown in Figure 9.

Code

# make practice plot 3 here

Code

include_graphics("figures/vis1_practice_plot3.png")

Advanced: Try to recreate the plot show in in Figure 10.

Code

# make advanced practice plot 3 here

Code

include_graphics("figures/vis1_practice_plot3a.png")

Making facets

Having too much information in a single plot can be overwhelming. The previous plot is already pretty busy. Facets are a nice way of splitting up plots and showing information in separate panels.

Let’s take a look at how wide these diamonds tend to be. The width in mm is given in the y column of the diamonds data frame. We’ll make a histogram first. To make a histogram, the only aesthetic we needed to specify is x.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That looks bad! Let’s pick a different value for the width of the bins in the histogram.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.1)

Still bad. There seems to be an outlier diamond that happens to be almost 60 mm wide, while most of the rest is much narrower. One option would be to remove the outlier from the data before plotting it. But generally, we don’t want to make new data frames. Instead, let’s just limit what data we show in the plot.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.1) +
  coord_cartesian(xlim = c(3, 10))

I’ve used the coord_cartesian() function to restrict the range of data to show by passing a minimum and maximum to the xlim argument. This looks better now.

Instead of histograms, we can also plot a density fitted to the distribution.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y)) +
  geom_density() +
  coord_cartesian(xlim = c(3, 10))

Looks pretty similar to our histogram above! Just like we can play around with the binwidth of the histogram, we can change the smoothing bandwidth of the kernel that is used to create the histogram. Here is a histogram with a much wider bandwidth:

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y)) +
  geom_density(bw = 0.5) +
  coord_cartesian(xlim = c(3, 10))

I promised that this section was about making facets, right? We’re getting there! Let’s first take a look at how wide diamonds of different color are. We can use grouping to make this happen.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y,
                     group = color,
                     fill = color)) +
  geom_density(bw = 0.2,
               alpha = 0.2) +
  coord_cartesian(xlim = c(3, 10))

OK! That’s a little tricky to tell apart. Notice that I’ve specified the alpha argument in the geom_density() function so that the densities in the front don’t completely hide the densities in the back. But this plot still looks too busy. Instead of grouping, let’s put the densities for the different colors, in separate panels. That’s what facetting allows you to do.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y,
                     fill = color)) +
  geom_density(bw = 0.2) +
  facet_grid(cols = vars(color)) +
  coord_cartesian(xlim = c(3, 10))

Now we have the densities next to each other in separate panels. I’ve removed the alpha argument since the densities aren’t overlapping anymore. To make the different panels, I used the facet_grid() function and specified that I want separate columns for the different colors (cols = vars(color)). What’s the deal with vars()? Why couldn’t we just write facet_grid(cols = color) instead? The short answer is: that’s what the function wants. The long answer is: long. (We’ll learn more about this later in the course.)

To show the facets in different rows instead of columns we simply replace cols = vars(color) with rows = vars(color).

Code

ggplot(data = df.diamonds,
       mapping = aes(x = y,
                     fill = color)) +
  geom_density(bw = 0.2) +
  facet_grid(rows = vars(color)) +
  coord_cartesian(xlim = c(3, 10))

Several aspects about this plot should be improved:

the y-axis text is overlapping
having both a legend and separate facet labels is redundant
having separate fills is not really necessary here

So, what does this plot actually show us? Well, J-colored diamonds tend to be wider than D-colored diamonds. Fascinating!

Of course, we could go completely overboard with facets and groups. So let’s do it! Let’s look at how the average price (somewhat more interesting) varies as a function of color, cut, and clarity. We’ll put color on the x-axis, and make separate rows for cut and columns for clarity.

Code

ggplot(data = df.diamonds,
       mapping = aes(y = price,
                     x = color,
                     fill = color)) +
  stat_summary(fun = "mean",
               geom = "bar",
               color = "black") +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  facet_grid(rows = vars(cut),
             cols = vars(clarity))

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_segment()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).

Figure 11: A figure that is stretching it in terms of information.

Figure 11 is stretching it in terms of how much information it presents. But it gives you a sense for how to combine the different bits and pieces we’ve learned so far.

Bonus. Why are we getting a few warning messages here?

Practice plot 4

Recreate the plot shown in Figure 12.

Code

# make practice plot 4 here

Code

include_graphics("figures/vis1_practice_plot4.png")

Global, local, and setting `aes()`

ggplot2 allows you to specify the plot aesthetics in different ways.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price,
                     color = color)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = F)

`geom_smooth()` using formula = 'y ~ x'

Here, I’ve drawn a scatter plot of the relationship between carat and price, and I have added the best-fitting regression lines via the geom_smooth(method = "lm") call. (We will learn more about what these regression lines mean later in class.)

Because I have defined all the aesthetics at the top level (i.e. directly within the ggplot() function), the aesthetics apply to all the functions afterwards. Aesthetics defined in the ggplot() call are global. In this case, the geom_point() and the geom_smooth() functions. The geom_smooth() function produces separate best-fit regression lines for each different color.

But what if we only wanted to show one regression line instead that applies to all the data? Here is one way of doing so:

Code

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price)) +
  geom_point(mapping = aes(color = color)) +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Here, I’ve moved the color aesthetic into the geom_point() function call. Now, the x and y aesthetics still apply to both the geom_point() and the geom_smooth() function call (they are global), but the color aesthetic applies only to geom_point() (it is local). Alternatively, we can simply overwrite global aesthetics within local function calls.

Code

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price,
                     color = color)) +
  geom_point() +
  geom_smooth(method = "lm",
              color = "black")

`geom_smooth()` using formula = 'y ~ x'

Here, I’ve set color = "black" within the geom_smooth() function, and now only one overall regression line is displayed since the global color aesthetic was overwritten in the local function call.

Additional resources

Cheatsheets

RStudio IDE –> information about RStudio
RMarkdown –> information about writing in RMarkdown
RMarkdown reference –> RMarkdown reference sheet
Quarto reference –> information about writing in Quarto
Data visualization –> general principles of effective graphic design
ggplot2 –> specific information about ggplot

Datacamp courses

Books and chapters

R graphics cookbook –> quick intro to the the most common graphs
ggplot2 book
R for Data Science book
- Data visualization
- Graphics for communication
Data Visualization – A practical introduction (by Kieran Healy)
Fundamentals of Data Visualization –> very nice resource that goes beyond basic functionality of ggplot and focuses on how to make good figures (e.g. how to choose colors, axes, …)

Misc

nice online ggplot tutorial
how to read R help files
ggplot2 extensions –> gallery of ggplot2 extension packages
ggplot2 visualizations with code –> gallery of plots with code

Session info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.7.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggthemes_5.1.0  lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
 [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
 [9] tibble_3.2.1    ggplot2_3.5.0   tidyverse_2.0.0 knitr_1.45     

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    lattice_0.22-6    stringi_1.8.3    
 [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    evaluate_0.23    
 [9] grid_4.3.2        timechange_0.3.0  fastmap_1.1.1     Matrix_1.6-5     
[13] jsonlite_1.8.8    backports_1.4.1   nnet_7.3-19       Formula_1.2-5    
[17] gridExtra_2.3     mgcv_1.9-1        fansi_1.0.6       viridisLite_0.4.2
[21] scales_1.3.0      cli_3.6.2         rlang_1.1.3       splines_4.3.2    
[25] munsell_0.5.1     Hmisc_5.1-2       base64enc_0.1-3   withr_3.0.0      
[29] yaml_2.3.8        tools_4.3.2       tzdb_0.4.0        checkmate_2.3.1  
[33] htmlTable_2.4.2   colorspace_2.1-0  rpart_4.1.23      vctrs_0.6.5      
[37] R6_2.5.1          lifecycle_1.0.4   htmlwidgets_1.6.4 foreign_0.8-86   
[41] cluster_2.1.6     pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.4     
[45] data.table_1.15.4 glue_1.7.0        xfun_0.43         tidyselect_1.2.1 
[49] rstudioapi_0.16.0 farver_2.1.1      nlme_3.1-164      htmltools_0.5.8.1
[53] rmarkdown_2.26    labeling_0.4.3    compiler_4.3.2

--- title: "GGplot visualization - Part 1" author: "Tobi Gerstenberg, Edited by Bria Long & Martin Zettersten" format: html: code-fold: true code-tools: true toc: true editor: visual --- ## Load packages Let's first load the packages that we need for this chapter. You can click on the green arrow to execute the code chunk below. ```{r, message=FALSE} library(knitr) library(tidyverse) # for plotting (and many more cool things we'll discover later) ``` ### Setting up a plot Let's first get some data. ```{r} df.diamonds = diamonds ``` The `diamonds` dataset comes with the `ggplot2` package. We can get a description of the dataset by running the following command: ```{r, eval=FALSE} ?diamonds ``` Above, we assigned the `diamonds` dataset to the variable `df.diamonds` so that we can see it in the data explorer. Let's take a look at the full dataset by clicking on it in the explorer. > **Tip**: You can view a data frame by highlighting the text in the editor (or simply moving the mouse above the text), and then pressing `F2`. The `df.diamonds` data frame contains information about almost 60,000 diamonds, including their `price`, `carat` value, size, etc. Let's use visualization to get a better sense for this dataset. We start by setting up the plot. To do so, we pass a data frame to the function `ggplot()` in the following way. ```{r} ggplot(data = df.diamonds) ``` This, by itself, won't do anything yet. We also need to specify what to plot. Let's take a look at how much diamonds of different color cost. The help file says that diamonds labeled D have the best color, and diamonds labeled J the worst color. Let's make a bar plot that shows the average price of diamonds for different colors. We do so via specifying a mapping from the data to the plot aesthetics with the function `aes()`. We need to tell `aes()` what we would like to display on the x-axis, and the y-axis of the plot. ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price)) ``` ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price)) + stat_summary(fun = "mean", #computes the mean geom = "bar") # plots it as a bar graph ``` ### Bar plot Let's make a **bar graph**: ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price)) + stat_summary(fun = "mean", geom = "bar") ``` Neat! Three lines of code produce an almost-publication-ready plot (to be published in the *Proceedings of Unnecessary Diamonds*)! Note how we used a `+` at the end of the first line of code to specify that there will be more. This is a very powerful idea underlying `ggplot2`. We can start simple and keep adding things to the plot step by step. We used the `stat_summary()` function to define *what* we want to plot (the "mean"), and *how* (as a "bar" chart). Let's take a closer look at that function. ```{r, eval=FALSE} help(stat_summary) ``` Not the the easiest help file ... We supplied two arguments to the function, `fun =` and `geom =`. 1. The `fun` argument specifies *what* function we'd like to apply to the data for each value of `x`. Here, we said that we would like to take the `mean` and we specified that as a string. 2. The `geom` (= geometric object) argument specifies *how* we would like to plot the result, namely as a "bar" plot. Instead of showing the "mean", we could also show the "median" instead. ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price)) + stat_summary(fun = "median", # can vary this function geom = "bar") ``` And instead of making a bar plot, we could plot some points. ```{r} ggplot(df.diamonds, aes(x = color, y = price)) + stat_summary(fun = "mean", geom = "point") # this is my favorite way to look at data! ``` > **Tip**: Take a look [here](https://ggplot2.tidyverse.org/reference/#section-layer-geoms) to see what other geoms ggplot2 supports. Somewhat surprisingly, diamonds with the best color (D) are not the most expensive ones. What's going on here? We'll need to do some more exploration to figure this out. ### Setting the default plot theme Before moving on, let's set a different default theme for our plots. Personally, I'm not a big fan of the gray background and the white grid lines. Also, the default size of the text should be bigger. We can change the default theme using the `theme_set()` function like so: ```{r} library(ggthemes) theme_set(theme_few() + # set the theme theme(text = element_text(size = 20))) # set the default text size ``` From now on, all our plots will use what's specified in `theme_classic()`, and the default text size will be larger, too. For any individual plot, we can still override these settings. ### Scatter plot I don't know much about diamonds, but I do know that diamonds with a higher `carat` value tend to be more expensive. `color` was a discrete variable with seven different values. `carat`, however, is a continuous variable. We want to see how the price of diamonds differs as a function of the `carat` value. Since we are interested in the relationship between two continuous variables, plotting a bar graph won't work. Instead, let's make a **scatter plot**. Let's put the `carat` value on the x-axis, and the `price` on the y-axis. ```{r scatter, fig.cap="Scatterplot."} #| label: fig-scatter ggplot(data = df.diamonds, mapping = aes(x = carat, y = price)) + geom_point() ``` Cool! That looks sensible. Diamonds with a higher `carat` value tend to have a higher `price`. Our dataset has `r nrow(diamonds)` rows. So the plot actually shows `r nrow(diamonds)` circles even though we can't see all of them since they overlap. Let's make some progress on trying to figure out why the diamonds with the better color weren't the most expensive ones on average. We'll add some color to the scatter plot in @fig-scatter. We color each of the points based on the diamond's color. To do so, we pass another argument to the aesthetics of the plot via `aes()`. ```{r scatter-color, fig.cap="Scatterplot with color."} #| label: fig-scatter-color ggplot(data = df.diamonds, mapping = aes(x = carat, y = price, color = color)) + geom_point() ``` Aha! Now we've got some color. Notice how in @fig-scatter-color `ggplot2` added a legend for us, thanks! We'll see later how to play around with legends. Form just eye-balling the plot, it looks like the diamonds with the best `color` (D) tended to have a lower `carat` value, and the ones with the worst `color` (J), tended to have the highest carat values. So this is why diamonds with better colors are less expensive -- these diamonds have a lower carat value overall. There are many other things that we can define in `aes()`. Take a quick look at the vignette: ```{r, eval=FALSE} vignette("ggplot2-specs") ``` #### Practice plot 1 Make a scatter plot that shows the relationship between the variables `depth` (on the x-axis), and `table` (on the y-axis). Take a look at the description for the `diamonds` dataset so you know what these different variables mean. Your plot should look like the one shown in @fig-practice-plot1). ```{r} #enter your plot here ``` ```{r practice-plot1, fig.align="center", fig.cap="Practice plot 1.", fig.height=6, fig.width=8, out.width="95%"} #| label: fig-practice-plot1 include_graphics("figures/vis1_practice_plot1.png") ``` **Advanced**: A neat trick to get a better sense for the data here is to add transparency. Your plot should look like the one shown in @fig-practice-plot1a. ```{r} # make advanced practice plot 1 here ``` ```{r practice-plot1a, fig.align="center", fig.cap="Practice plot 1.", fig.height=6, fig.width=8, out.width="95%"} #| label: fig-practice-plot1a include_graphics("figures/vis1_practice_plot1a.png") ``` ### Line plot What else do we know about the diamonds? We actually know the quality of how they were cut. The `cut` variable ranges from "Fair" to "Ideal". First, let's take a look at the relationship between `cut` and `price`. This time, we'll make a line plot instead of a bar plot (just because we can). ```{r} ggplot(data = df.diamonds, mapping = aes(x = cut, y = price)) + stat_summary(fun = "mean", geom = "line") ``` Oops! All we did is that we replaced `x = color` with `x = cut`, and `geom = "bar"` with `geom = "line"`. However, the plot doesn't look like expected (i.e. there is no real plot). What happened here? The reason is that the line plot needs to know which points to connect. The error message tells us that each group consists of only one observation. Let's adjust the group aesthetic to fix this. ```{r} ggplot(data = df.diamonds, mapping = aes(x = cut, y = price, group = 1)) + stat_summary(fun = "mean", geom = "line") ``` By adding the parameter `group = 1` to `mapping = aes()`, we specify that we would like all the levels in `x = cut` to be treated as coming from the same group. The reason for this is that `cut` (our x-axis variable) is a factor (and not a numeric variable), so, by default, `ggplot2` tries to draw a separate line for each factor level. We'll learn more about grouping below (and about factors later). Interestingly, there is no simple relationship between the quality of the cut and the price of the diamond. In fact, "Ideal" diamonds tend to be cheapest. ### Adding error bars We often don't just want to show the means but also give a sense for how much the data varies. `ggplot2` has some convenient ways of specifying error bars. Let's take a look at how much `price` varies as a function of `clarity` (another variable in our `diamonds` data frame). ```{r errorbars-normal, fig.cap="Relationship between diamond clarity and price. Error bars indicate 95% bootstrapped confidence intervals."} ggplot(data = df.diamonds, mapping = aes(x = clarity, y = price)) + stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") ``` Here we have it. The average price of our diamonds for different levels of `clarity` together with bootstrapped 95% confidence intervals. How do we know that we have 95% confidence intervals? That's what `mean_cl_boot()` computes as a default. Let's take a look at that function: ```{r, eval=FALSE} help(mean_cl_boot) ``` Note that I had to use the `fun.data =` argument here instead of `fun =` because the `mean_cl_boot()` function produces three data points for each value of the x-axis (the mean, lower, and upper confidence interval). ### Order matters The order in which we add geoms to a ggplot matters! Generally, we want to plot error bars before the points that represent the means. To illustrate, let's set the color in which we show the means to "red". ```{r good-figure, fig.cap='This figure looks good. Error bars and means are drawn in the correct order.'} #| label: fig-good-figure ggplot(data = df.diamonds, mapping = aes(x = clarity, y = price)) + stat_summary(fun.data = "mean_cl_boot", geom = "linerange") + stat_summary(fun = "mean", geom = "point", color = "red") ``` @fig-good-figure looks good. ```{r bad-figure, fig.cap='This figure looks bad. Error bars and means are drawn in the incorrect order.'} #| label: fig-bad-figure # I've changed the order in which the means and error bars are drawn. ggplot(df.diamonds, aes(x = clarity, y = price)) + stat_summary(fun = "mean", geom = "point", color = "red") + stat_summary(fun.data = "mean_cl_boot", geom = "linerange") ``` @fig-bad-figure doesn't look good. The error bars are on top of the points that represent the means. One cool feature about using `stat_summary()` is that we did not have to change anything about the data frame that we used to make the plots. We directly used our raw data instead of having to make separate data frames that contain the relevant information (such as the means and the confidence intervals). Let's take a look at two more principles for plotting data that are extremely helpful: groups and facets. But before, another practice plot. #### Practice plot 2 Make a bar plot that shows the average `price` of diamonds (on the y-axis) as a function of their `clarity` (on the x-axis). Also add error bars. Your plot should look like the one shown in Figure @fig-practice-plot2. ```{r} #your practice plot here ``` ```{r practice-plot2, out.width="90%", fig.align="center", fig.cap="Practice plot 2.", out.width="95%"} #| label: fig-practice-plot2 include_graphics("figures/vis1_practice_plot2.png") ``` **Advanced**: Try to make the plot shown in @fig-practice-plot2a. ```{r} # make advanced practice plot 2 here ``` ```{r practice-plot2a, out.width="90%", fig.align="center", fig.cap="Practice plot 2.", out.width="95%"} #| label: fig-practice-plot2a include_graphics("figures/vis1_practice_plot2a.png") ``` ### Grouping data Grouping in `ggplot2` is a very powerful idea. It allows us to plot subsets of the data -- again without the need to make separate data frames first. Let's make a plot that shows the relationship between `price` and `color` separately for the different qualities of `cut`. ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price, group = cut)) + stat_summary(fun = "mean", geom = "line") ``` Well, we got some separate lines here but we don't know which line corresponds to which cut. Let's add some color! ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price, group = cut, color = cut)) + stat_summary(fun = "mean", geom = "line", linewidth = 1) ``` Nice! In addition to adding color, I've made the lines a little thicker here by setting the `linewidth` argument to 2. Grouping is very useful for bar plots. Let's take a look at how the average price of diamonds looks like taking into account both `cut` and `color` (I know -- exciting times!). Let's put the `color` on the x-axis and then group by the `cut`. ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price, group = cut, color = cut)) + stat_summary(fun = "mean", geom = "bar") ``` That's a fail! Several things went wrong here. All the bars are gray and only their outline is colored differently. Instead we want the bars to have a different color. For that we need to specify the `fill` argument rather than the `color` argument! But things are worse. The bars currently are shown on top of each other. Instead, we'd like to put them next to each other. Here is how we can do that: ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price, group = cut, fill = cut)) + stat_summary(fun = "mean", geom = "bar", position = position_dodge()) ``` Neato! We've changed the `color` argument to `fill`, and have added the `position = position_dodge()` argument to the `stat_summary()` call. This argument makes it such that the bars are nicely dodged next to each other. Let's add some error bars just for kicks. ```{r} ggplot(data = df.diamonds, mapping = aes(x = color, y = price, group = cut, fill = cut)) + stat_summary(fun = "mean", geom = "bar", position = position_dodge(width = 0.9), color = "black") + stat_summary(fun.data = "mean_cl_boot", geom = "linerange", position = position_dodge(width = 0.9)) ``` Voila! Now with error bars. Note that we've added the `width = 0.9` argument to `position_dodge()`. Somehow R was complaining when this was not defined for geom "linerange". I've also added some outline to the bars by including the argument `color = "black"`. I think it looks nicer this way. So, still somewhat surprisingly, diamonds with the worst color (J) are more expensive than diamonds with the best color (D), and diamonds with better cuts are not necessarily more expensive. #### Practice plot 3 Recreate the plot shown in @fig-practice-plot3. ```{r} # make practice plot 3 here ``` ```{r practice-plot3, out.width="90%", fig.align="center", fig.cap="Practice plot 3.", out.width="95%"} #| label: fig-practice-plot3 include_graphics("figures/vis1_practice_plot3.png") ``` **Advanced**: Try to recreate the plot show in in @fig-practice-plot3a. ```{r} # make advanced practice plot 3 here ``` ```{r practice-plot3a, out.width="90%", fig.align="center", fig.cap="Practice plot 3.", out.width="95%"} #| label: fig-practice-plot3a include_graphics("figures/vis1_practice_plot3a.png") ``` ### Making facets Having too much information in a single plot can be overwhelming. The previous plot is already pretty busy. Facets are a nice way of splitting up plots and showing information in separate panels. Let's take a look at how wide these diamonds tend to be. The width in mm is given in the `y` column of the diamonds data frame. We'll make a histogram first. To make a histogram, the only aesthetic we needed to specify is `x`. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y)) + geom_histogram() ``` That looks bad! Let's pick a different value for the width of the bins in the histogram. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y)) + geom_histogram(binwidth = 0.1) ``` Still bad. There seems to be an outlier diamond that happens to be almost 60 mm wide, while most of the rest is much narrower. One option would be to remove the outlier from the data before plotting it. But generally, we don't want to make new data frames. Instead, let's just limit what data we show in the plot. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y)) + geom_histogram(binwidth = 0.1) + coord_cartesian(xlim = c(3, 10)) ``` I've used the `coord_cartesian()` function to restrict the range of data to show by passing a minimum and maximum to the `xlim` argument. This looks better now. Instead of histograms, we can also plot a density fitted to the distribution. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y)) + geom_density() + coord_cartesian(xlim = c(3, 10)) ``` Looks pretty similar to our histogram above! Just like we can play around with the binwidth of the histogram, we can change the smoothing bandwidth of the kernel that is used to create the histogram. Here is a histogram with a much wider bandwidth: ```{r} ggplot(data = df.diamonds, mapping = aes(x = y)) + geom_density(bw = 0.5) + coord_cartesian(xlim = c(3, 10)) ``` I promised that this section was about making facets, right? We're getting there! Let's first take a look at how wide diamonds of different `color` are. We can use grouping to make this happen. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y, group = color, fill = color)) + geom_density(bw = 0.2, alpha = 0.2) + coord_cartesian(xlim = c(3, 10)) ``` OK! That's a little tricky to tell apart. Notice that I've specified the `alpha` argument in the `geom_density()` function so that the densities in the front don't completely hide the densities in the back. But this plot still looks too busy. Instead of grouping, let's put the densities for the different colors, in separate panels. That's what facetting allows you to do. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y, fill = color)) + geom_density(bw = 0.2) + facet_grid(cols = vars(color)) + coord_cartesian(xlim = c(3, 10)) ``` Now we have the densities next to each other in separate panels. I've removed the `alpha` argument since the densities aren't overlapping anymore. To make the different panels, I used the `facet_grid()` function and specified that I want separate columns for the different colors (`cols = vars(color)`). What's the deal with `vars()`? Why couldn't we just write `facet_grid(cols = color)` instead? The short answer is: that's what the function wants. The long answer is: long. (We'll learn more about this later in the course.) To show the facets in different rows instead of columns we simply replace `cols = vars(color)` with `rows = vars(color)`. ```{r} ggplot(data = df.diamonds, mapping = aes(x = y, fill = color)) + geom_density(bw = 0.2) + facet_grid(rows = vars(color)) + coord_cartesian(xlim = c(3, 10)) ``` Several aspects about this plot should be improved: - the y-axis text is overlapping - having both a legend and separate facet labels is redundant - having separate fills is not really necessary here So, what does this plot actually show us? Well, J-colored diamonds tend to be wider than D-colored diamonds. Fascinating! Of course, we could go completely overboard with facets and groups. So let's do it! Let's look at how the average `price` (somewhat more interesting) varies as a function of `color`, `cut`, and `clarity`. We'll put color on the x-axis, and make separate rows for `cut` and columns for `clarity`. ```{r stretching-it, fig.cap="A figure that is stretching it in terms of information."} #| label: fig-stretching-it ggplot(data = df.diamonds, mapping = aes(y = price, x = color, fill = color)) + stat_summary(fun = "mean", geom = "bar", color = "black") + stat_summary(fun.data = "mean_cl_boot", geom = "linerange") + facet_grid(rows = vars(cut), cols = vars(clarity)) ``` @fig-stretching-it is stretching it in terms of how much information it presents. But it gives you a sense for how to combine the different bits and pieces we've learned so far. **Bonus.** Why are we getting a few warning messages here? #### Practice plot 4 Recreate the plot shown in @fig-practice-plot4. ```{r} # make practice plot 4 here ``` ```{r practice-plot4, fig.align="center", fig.cap="Practice plot 4.", out.width="95%"} #| label: fig-practice-plot4 include_graphics("figures/vis1_practice_plot4.png") ``` ### Global, local, and setting `aes()` `ggplot2` allows you to specify the plot aesthetics in different ways. ```{r} ggplot(data = df.diamonds, mapping = aes(x = carat, y = price, color = color)) + geom_point() + geom_smooth(method = "lm", se = F) ``` Here, I've drawn a scatter plot of the relationship between `carat` and `price`, and I have added the best-fitting regression lines via the `geom_smooth(method = "lm")` call. (We will learn more about what these regression lines mean later in class.) Because I have defined all the aesthetics at the top level (i.e. directly within the `ggplot()` function), the aesthetics apply to all the functions afterwards. Aesthetics defined in the `ggplot()` call are **global**. In this case, the `geom_point()` and the `geom_smooth()` functions. The `geom_smooth()` function produces separate best-fit regression lines for each different color. But what if we only wanted to show one regression line instead that applies to all the data? Here is one way of doing so: ```{r} ggplot(data = df.diamonds, mapping = aes(x = carat, y = price)) + geom_point(mapping = aes(color = color)) + geom_smooth(method = "lm") ``` Here, I've moved the color aesthetic into the `geom_point()` function call. Now, the `x` and `y` aesthetics still apply to both the `geom_point()` and the `geom_smooth()` function call (they are **global**), but the `color` aesthetic applies only to `geom_point()` (it is **local**). Alternatively, we can simply overwrite global aesthetics within local function calls. ```{r} ggplot(data = df.diamonds, mapping = aes(x = carat, y = price, color = color)) + geom_point() + geom_smooth(method = "lm", color = "black") ``` Here, I've set `color = "black"` within the `geom_smooth()` function, and now only one overall regression line is displayed since the global color aesthetic was overwritten in the local function call. ## Additional resources ### Cheatsheets - [RStudio IDE](figures/rstudio-ide.pdf) --\> information about RStudio - [RMarkdown](figures/rmarkdown.pdf) --\> information about writing in RMarkdown - [RMarkdown reference](figures/rmarkdown-reference.pdf) --\> RMarkdown reference sheet - [Quarto reference](https://quarto.org/docs/guide/) --\> information about writing in Quarto - [Data visualization](figures/visualization-principles.pdf) --\> general principles of effective graphic design - [ggplot2](figures/data-visualization.pdf) --\> specific information about ggplot ### Datacamp courses - [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r) - [ggplot (intro)](https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2) - [Reporting](https://www.datacamp.com/courses/communicating-with-data-in-the-tidyverse) - [visualization best practices](https://www.datacamp.com/courses/visualization-best-practices-in-r) ### Books and chapters - [R graphics cookbook](http://www.cookbook-r.com/Graphs/) --\> quick intro to the the most common graphs - [ggplot2 book](https://ggplot2-book.org/) - [R for Data Science book](http://r4ds.had.co.nz/) - [Data visualization](http://r4ds.had.co.nz/data-visualisation.html) - [Graphics for communication](http://r4ds.had.co.nz/graphics-for-communication.html) - [Data Visualization -- A practical introduction (by Kieran Healy)](http://socviz.co/) - [Look at data](http://socviz.co/lookatdata.html#lookatdata) - [Make a plot](http://socviz.co/makeplot.html#makeplot) - [Show the right numbers](http://socviz.co/groupfacettx.html#groupfacettx) - [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/) --\> very nice resource that goes beyond basic functionality of `ggplot` and focuses on how to make good figures (e.g. how to choose colors, axes, ...) ### Misc - [nice online ggplot tutorial](https://evamaerey.github.io/ggplot2_grammar_guide/about) - [how to read R help files](https://socviz.co/appendix.html#a-little-more-about-r) - [ggplot2 extensions](https://exts.ggplot2.tidyverse.org/gallery/) --\> gallery of ggplot2 extension packages - [ggplot2 visualizations with code](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html) --\> gallery of plots with code ## Session info ```{r, echo=F} sessionInfo() ```

Load packages

Setting up a plot

Bar plot

Setting the default plot theme

Scatter plot

Practice plot 1

Line plot

Adding error bars

Order matters

Practice plot 2

Grouping data

Practice plot 3

Making facets

Practice plot 4

Global, local, and setting aes()

Additional resources

Cheatsheets

Datacamp courses

Books and chapters

Misc

Session info

Global, local, and setting `aes()`