The following tutorial will introduce you to a set of powerful functions in the tidyverse:

They are especially powerful when used in combination with each other

Load libraries and data

For this exercise, we’re going to use the Gapminder dataset

Code
#install.packages("gapminder")
library(gapminder)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(broom)

Now, let’s just take a quick peek at the gapminder dataset, an excerpt from the Gapminder data. The dataset shows information about population size, life expectancy and GDP (per capita) for a range of different countries from 1952 - 2007.

Code
df <- gapminder
df %>%
  glimpse()
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

One thing we can notice is that gdp per capita has a long tail

Code
df %>%
  ggplot(aes(gdpPercap))+
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s log transform gdp per capita to make it more akin to a normal distribution

Code
df <- df %>%
  mutate(log_gdp_pc = log(gdpPercap))

That’s better

Code
hist(df$log_gdp_pc)

Predicting life expectancy from gdp per capita

Let’s fit a linear model predicting life expectancy from GDP per capita. Just for illustration’s sake, I’m going to focus on one particular year: 1977.

Code
df_77 <- df %>% 
  filter(year == 1977)

fit <- lm(lifeExp ~ log_gdp_pc, data = df_77)
summary(fit)

Call:
lm(formula = lifeExp ~ log_gdp_pc, data = df_77)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8647  -3.7210   0.9703   4.1957  16.8197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.7389     3.8043   -0.72    0.473    
log_gdp_pc    7.5490     0.4561   16.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.552 on 140 degrees of freedom
Multiple R-squared:  0.6618,    Adjusted R-squared:  0.6594 
F-statistic:   274 on 1 and 140 DF,  p-value: < 2.2e-16

You can use the broom package to get a “tidy” output and the kable() function in knitr to get a clean-looking table

Code
fit %>%
  broom::tidy() %>%
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) -2.738859 3.8042673 -0.7199439 0.4727598
log_gdp_pc 7.549041 0.4560659 16.5525242 0.0000000

Plot it!

Code
ggplot(df_77,aes(log_gdp_pc,lifeExp))+
  geom_point()+
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

So it looks like there is a strong relationship between (log) GDP and life expectancy in the year 1977. How general is that pattern?

One way we could approach this is to just copy-paste the code above for each year, and look at the outcome. That seems pretty cumbersome and redundant though. Instead, we’ll use nest() and map() to handle this in just a few lines of code, all while keeping our data in a tidy format (!)

Using nest and map to look across many years

nest()

First, we use nest() to create a data frame in which each row contains the dataset for a particular year (nested within a cell!).

Code
nested <- gapminder %>%
  group_by(year) %>%
  nest()

Take a look at what the outcome looks like. We now have two columns, year and data. The data for a particular year is nested in the cell of each row.

Code
nested %>%
  glimpse()
Rows: 12
Columns: 2
Groups: year [12]
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
$ data <list> [<tbl_df[142 x 5]>], [<tbl_df[142 x 5]>], [<tbl_df[142 x 5]>], […

map()

Next, we use map() to execute a function in each row of a cell.

First, let’s define a quick function to fit a linear model on our data.

Code
fit_ols <- function(df) {
  lm(lifeExp ~ log_gdp_pc, data = df)
}

Combine nest() and map()

Now, we can use map() to run the function (fit_ols()) fitting a linear model in each row of the nested data frame.

Code
nested_fit <- df %>%
  group_by(year) %>%
  nest() %>%
  mutate(model=map(data,fit_ols))

We’ve created a new column that holds the model fit for each of our linear models (per year).

Finally, we can use the tidy() function in broom to summarize the model outputs into a neat format. Again we use map

Code
nested_output <- nested_fit %>%
  mutate(output = map(model,broom::tidy))

Finally, we use unnest() the output back into a neat dataframe.

Code
unnested_output <- nested_output %>%
  unnest(output)

Et voila!! A clean table of model outputs

Code
unnested_output %>%
  #remove the columns we don't want
  select(-data,-model) %>%
  #just show the effect of gdp
  filter(term=="log_gdp_pc") %>%
  knitr::kable()
year term estimate std.error statistic p.value
1952 log_gdp_pc 8.829813 0.6625915 13.32618 0
1957 log_gdp_pc 8.730598 0.6333668 13.78443 0
1962 log_gdp_pc 8.597827 0.6011185 14.30305 0
1967 log_gdp_pc 8.049757 0.5583525 14.41698 0
1972 log_gdp_pc 7.597430 0.4993213 15.21551 0
1977 log_gdp_pc 7.549041 0.4560659 16.55252 0
1982 log_gdp_pc 7.493593 0.3990243 18.77979 0
1987 log_gdp_pc 7.368595 0.3455803 21.32238 0
1992 log_gdp_pc 7.525306 0.3844216 19.57566 0
1997 log_gdp_pc 7.613136 0.3749190 20.30608 0
2002 log_gdp_pc 7.617203 0.4407672 17.28169 0
2007 log_gdp_pc 7.202802 0.4423393 16.28343 0

Plot

Here’s a nice companion plot to show what’s going on in the data. Note that we’re not using nesting here, we’re just taking advantage of ggplot’s facet_wrap() function

Code
ggplot(df,aes(log_gdp_pc,lifeExp))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_wrap(~year)
`geom_smooth()` using formula = 'y ~ x'