Nest and Map

The following tutorial will introduce you to a set of powerful functions in the tidyverse:

nest()
map()

They are especially powerful when used in combination with each other

Load libraries and data

For this exercise, we’re going to use the Gapminder dataset

Code

#install.packages("gapminder")
library(gapminder)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(broom)

Now, let’s just take a quick peek at the gapminder dataset, an excerpt from the Gapminder data. The dataset shows information about population size, life expectancy and GDP (per capita) for a range of different countries from 1952 - 2007.

Code

df <- gapminder
df %>%
  glimpse()

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

One thing we can notice is that gdp per capita has a long tail

Code

df %>%
  ggplot(aes(gdpPercap))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s log transform gdp per capita to make it more akin to a normal distribution

Code

df <- df %>%
  mutate(log_gdp_pc = log(gdpPercap))

That’s better

Code

hist(df$log_gdp_pc)

Predicting life expectancy from gdp per capita

Let’s fit a linear model predicting life expectancy from GDP per capita. Just for illustration’s sake, I’m going to focus on one particular year: 1977.

Code

df_77 <- df %>% 
  filter(year == 1977)

fit <- lm(lifeExp ~ log_gdp_pc, data = df_77)
summary(fit)


Call:
lm(formula = lifeExp ~ log_gdp_pc, data = df_77)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.8647  -3.7210   0.9703   4.1957  16.8197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.7389     3.8043   -0.72    0.473    
log_gdp_pc    7.5490     0.4561   16.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.552 on 140 degrees of freedom
Multiple R-squared:  0.6618,    Adjusted R-squared:  0.6594 
F-statistic:   274 on 1 and 140 DF,  p-value: < 2.2e-16

You can use the broom package to get a “tidy” output and the kable() function in knitr to get a clean-looking table

Code

fit %>%
  broom::tidy() %>%
  knitr::kable()

term	estimate	std.error	statistic	p.value
(Intercept)	-2.738859	3.8042673	-0.7199439	0.4727598
log_gdp_pc	7.549041	0.4560659	16.5525242	0.0000000

Plot it!

Code

ggplot(df_77,aes(log_gdp_pc,lifeExp))+
  geom_point()+
  geom_smooth(method="lm")

`geom_smooth()` using formula = 'y ~ x'

So it looks like there is a strong relationship between (log) GDP and life expectancy in the year 1977. How general is that pattern?

One way we could approach this is to just copy-paste the code above for each year, and look at the outcome. That seems pretty cumbersome and redundant though. Instead, we’ll use nest() and map() to handle this in just a few lines of code, all while keeping our data in a tidy format (!)

Using nest and map to look across many years

nest()

First, we use nest() to create a data frame in which each row contains the dataset for a particular year (nested within a cell!).

Code

nested <- gapminder %>%
  group_by(year) %>%
  nest()

Take a look at what the outcome looks like. We now have two columns, year and data. The data for a particular year is nested in the cell of each row.

Code

nested %>%
  glimpse()

Rows: 12
Columns: 2
Groups: year [12]
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
$ data <list> [<tbl_df[142 x 5]>], [<tbl_df[142 x 5]>], [<tbl_df[142 x 5]>], […

map()

Next, we use map() to execute a function in each row of a cell.

First, let’s define a quick function to fit a linear model on our data.

Code

fit_ols <- function(df) {
  lm(lifeExp ~ log_gdp_pc, data = df)
}

Combine nest() and map()

Now, we can use map() to run the function (fit_ols()) fitting a linear model in each row of the nested data frame.

Code

nested_fit <- df %>%
  group_by(year) %>%
  nest() %>%
  mutate(model=map(data,fit_ols))

We’ve created a new column that holds the model fit for each of our linear models (per year).

Finally, we can use the tidy() function in broom to summarize the model outputs into a neat format. Again we use map

Code

nested_output <- nested_fit %>%
  mutate(output = map(model,broom::tidy))

Finally, we use unnest() the output back into a neat dataframe.

Code

unnested_output <- nested_output %>%
  unnest(output)

Et voila!! A clean table of model outputs

Code

unnested_output %>%
  #remove the columns we don't want
  select(-data,-model) %>%
  #just show the effect of gdp
  filter(term=="log_gdp_pc") %>%
  knitr::kable()

year	term	estimate	std.error	statistic
1952	log_gdp_pc	8.829813	0.6625915	13.32618
1957	log_gdp_pc	8.730598	0.6333668	13.78443
1962	log_gdp_pc	8.597827	0.6011185	14.30305
1967	log_gdp_pc	8.049757	0.5583525	14.41698
1972	log_gdp_pc	7.597430	0.4993213	15.21551
1977	log_gdp_pc	7.549041	0.4560659	16.55252
1982	log_gdp_pc	7.493593	0.3990243	18.77979
1987	log_gdp_pc	7.368595	0.3455803	21.32238
1992	log_gdp_pc	7.525306	0.3844216	19.57566
1997	log_gdp_pc	7.613136	0.3749190	20.30608
2002	log_gdp_pc	7.617203	0.4407672	17.28169
2007	log_gdp_pc	7.202802	0.4423393	16.28343

Plot

Here’s a nice companion plot to show what’s going on in the data. Note that we’re not using nesting here, we’re just taking advantage of ggplot’s facet_wrap() function

Code

ggplot(df,aes(log_gdp_pc,lifeExp))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_wrap(~year)

`geom_smooth()` using formula = 'y ~ x'