Learn Tidy Evaluation from Daniel Chen

Peng Chen

March 2, 2021

Below is a summary of Daniel Chen’s recent talk “Learning Tidy Evaluation by Reimplementing dplyr”. The goal is to re-implement the dplyr::select() function, which is done in four attempts.

To make these notes as concise as possible, we load

library(tidyverse)
library(rlang)

and will use the iris dataset for testing.

iris %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Attempt 1

Function definition

select_1 <- function(data, col) {
  col_position <- match(as.character(col), names(data))
  data[, col_position, drop = FALSE]
}

To understand match(), see

match(c("c", "a"), c("a", "b", "c"))
## [1] 3 1

This is useful for selecting multiple columns because it preserves the selection order. That makes it a better fit than the which() function, whose result follows the order of the column names instead:

which(c("a", "b", "c") %in% c("c", "a"))
## [1] 1 3

Test 1

select_1(iris, "Species") %>% head(3)
##   Species
## 1  setosa
## 2  setosa
## 3  setosa

Test 2

select_1(iris, Species) %>% head(3)
## Error in select_1(iris, Species): object 'Species' not found

The second test fails because no variable named Species exists: R tries to evaluate the argument as ordinary code as soon as as.character(col) uses it. In other words,

as.character(Species)

cannot be evaluated, unlike

as.character("Species")
## [1] "Species"
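The failure happens at the moment the argument is used, not when the function is called. Below is a minimal sketch of that behaviour, assuming no object named Species exists in the session; f_unused and f_forced are hypothetical helpers made up for this note, not part of the talk.

```r
# Hypothetical helpers for illustration; they assume that no object
# called `Species` exists in the calling environment.

# The argument is never used, so R never tries to evaluate `Species`.
f_unused <- function(col) "col was never evaluated"
f_unused(Species)

# as.character() forces the argument, which triggers the lookup of `Species`.
f_forced <- function(col) as.character(col)
msg <- tryCatch(
  f_forced(Species),
  error = function(e) conditionMessage(e)
)
msg  # the "object not found" message from the failed lookup
```

The first call succeeds even though Species does not exist, which is why capturing the argument before it is evaluated can work at all.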

Attempt 2

The solution is to capture your code as an expression (without evaluating it), which can be manipulated later, such as being converted to a string.

Function definition

select_2 <- function(data, col) {
  col <- enexpr(col)
  col_position <- match(as.character(col), names(data))
  data[, col_position, drop = FALSE]
}

Test

select_2(iris, Species) %>% head(3)
##   Species
## 1  setosa
## 2  setosa
## 3  setosa

It works because the following code can be evaluated.

as.character(expr(Species))
## [1] "Species"
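It may also help to see why the function uses enexpr() rather than expr(). The sketch below uses two hypothetical helpers (capture_with_expr and capture_with_enexpr are names made up for this note): expr() quotes what the function author typed, while enexpr() quotes what the caller supplied.

```r
library(rlang)

# Hypothetical helpers, for illustration only.
capture_with_expr <- function(col) expr(col)      # quotes the literal code `col`
capture_with_enexpr <- function(col) enexpr(col)  # quotes the caller's argument

as.character(capture_with_expr(Species))
## [1] "col"
as.character(capture_with_enexpr(Species))
## [1] "Species"
```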

Next, we can generalize the function to select multiple columns using dot-dot-dot.

Attempt 3

Function definition

select_3 <- function(data, ...) {
  cols <- rlang::enexprs(...)
  cols_char <- as.vector(cols, mode = "character")
  cols_positions <- match(cols_char, names(data))
  data[, cols_positions, drop = FALSE]
}

Test 1

select_3(iris, Species, Sepal.Width, Petal.Length) %>% head(3)
##   Species Sepal.Width Petal.Length
## 1  setosa         3.5          1.4
## 2  setosa         3.0          1.4
## 3  setosa         3.2          1.3

Test 2

col_name <- "Species"

select_3(iris, col_name, Sepal.Width, Petal.Length) %>% head(3)
## Error in `[.data.frame`(data, , cols_positions, drop = FALSE): undefined columns selected

This time, the second test fails because

cols <- exprs(col_name, Sepal.Width, Petal.Length)
cols
## [[1]]
## col_name
## 
## [[2]]
## Sepal.Width
## 
## [[3]]
## Petal.Length
as.vector(cols, mode = "character")
## [1] "col_name"     "Sepal.Width"  "Petal.Length"

The code col_name is captured as the expression col_name, which is then converted to the string “col_name”. The value stored in the variable, the string “Species”, is never looked up.
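The error message itself can be traced to the failed name lookup: match() returns NA for a name that is not a column, and indexing a data frame with an NA column position fails. A quick check using only base R and the built-in iris data:

```r
# The strings produced from the captured expressions
cols_char <- c("col_name", "Sepal.Width", "Petal.Length")

# "col_name" is not a column of iris, so its position is NA
cols_positions <- match(cols_char, names(iris))
cols_positions
## [1] NA  2  3
```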

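Even evaluating the expression col_name inside the function is fragile: the lookup starts from the function’s environment, so it succeeds only when col_name happens to be visible from there (for example, when it is defined in the global environment) and fails when col_name is local to a calling function. Therefore, the solution is to capture the dot-dot-dot as quosures (expressions + their environments) and then evaluate the quosures.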
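The environment problem is easiest to see when the variable is local to another function. The sketch below is only an illustration: eval_expr_inside, eval_quo_inside, and the two wrappers are hypothetical names, and it assumes that no object called local_name exists outside the wrappers.

```r
library(rlang)

# Captures a bare expression and evaluates it from inside the function.
eval_expr_inside <- function(x) {
  x <- enexpr(x)
  eval(x, envir = environment())
}

# Captures a quosure, which also remembers where the expression came from.
eval_quo_inside <- function(x) {
  x <- enquo(x)
  eval_tidy(x)
}

wrapper_expr <- function() {
  local_name <- "Species"
  eval_expr_inside(local_name)
}

wrapper_quo <- function() {
  local_name <- "Species"
  eval_quo_inside(local_name)
}

# `local_name` is not visible from eval_expr_inside(), so the lookup fails.
res_expr <- tryCatch(wrapper_expr(), error = function(e) "lookup failed")
res_expr
## [1] "lookup failed"

# The quosure carries wrapper_quo()'s environment, so the lookup succeeds.
res_quo <- wrapper_quo()
res_quo
## [1] "Species"
```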

Attempt 4

Function definition

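select_4 <- function(data, ...) {
  # Capture each argument as a quosure (expression + environment)
  cols <- enquos(...)
  # Named list that maps column names to their positions
  vars <- set_names(seq_along(data), names(data)) %>% as.list()
  # Evaluate each quosure, with the column positions acting as a data mask
  col_char_num <- map(cols, eval_tidy, data = vars)
  # Strings are looked up by name; positions are used as they are
  cols_positions <- map_int(
    col_char_num,
    function(x) ifelse(is.character(x), vars[[x]], x)
  )
  data[, cols_positions, drop = FALSE]
}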

There are quite a few changes in the new function, but they are easy to follow once you see the test below.

Test

select_4(iris, col_name, Sepal.Length, "Petal.Width") %>% head(3)
##   Species Sepal.Length Petal.Width
## 1  setosa          5.1         0.2
## 2  setosa          4.9         0.2
## 3  setosa          4.7         0.2

Here we have it, a pretty robust re-implementation of the dplyr::select() function.

To simulate what is happening inside the function, see

(cols <- quos(col_name, Sepal.Length, "Petal.Width"))
## <list_of<quosure>>
## 
## [[1]]
## <quosure>
## expr: ^col_name
## env:  global
## 
## [[2]]
## <quosure>
## expr: ^Sepal.Length
## env:  global
## 
## [[3]]
## <quosure>
## expr: ^"Petal.Width"
## env:  empty

Notice each expression is captured together with its environment.
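The two parts of a quosure can also be pulled apart with rlang’s accessor functions. A small sketch, assuming col_name is defined as in the test above (q_symbol and q_string are names made up for this note):

```r
library(rlang)

col_name <- "Species"
q_symbol <- quo(col_name)
q_string <- quo("Petal.Width")

# The expression part: the symbol col_name
quo_get_expr(q_symbol)

# Evaluating the quosure looks col_name up in the captured environment
eval_tidy(q_symbol)
## [1] "Species"

# Consistent with the "env: empty" line printed above, a constant needs
# no environment, so it is stored with the empty one.
identical(quo_get_env(q_string), empty_env())
```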

data <- iris
(vars <- set_names(seq_along(data), names(data)) %>% as.list())
## $Sepal.Length
## [1] 1
## 
## $Sepal.Width
## [1] 2
## 
## $Petal.Length
## [1] 3
## 
## $Petal.Width
## [1] 4
## 
## $Species
## [1] 5
(col_char_num <- map(cols, eval_tidy, data = vars))
## [[1]]
## [1] "Species"
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] "Petal.Width"

Function eval_tidy() evaluates a quosure (an expression bundled with an environment). It also takes an additional argument, data. If data is supplied, objects in the data mask always take precedence over the quosure environment, i.e. the data masks the environment. When eval_tidy() is applied to quo(col_name), it first searches for the name col_name in the list vars, finds no match, and then falls back to the quosure environment, where col_name = “Species”.

eval_tidy(quo(col_name), data = vars)
## [1] "Species"

When eval_tidy() is applied to quo(Sepal.Length), it first searches for the name Sepal.Length in the list vars and finds a match with value 1, so the quosure environment is never consulted.

eval_tidy(quo(Sepal.Length), data = vars)
## [1] 1

Lastly, a string constant such as “Petal.Width” always evaluates to itself.

eval_tidy(quo("Petal.Width"), data = vars)
## [1] "Petal.Width"
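The precedence rule matters most when a name exists both in the data mask and in the environment. The sketch below sets up such a clash on purpose (the Sepal.Width variable and the one-element mask are made up for illustration) and assumes an rlang version that provides the .data and .env pronouns.

```r
library(rlang)

# A name that exists both in the environment and in the data mask
Sepal.Width <- "value from the environment"
mask <- list(Sepal.Width = 2L)

# The data mask takes precedence over the environment
eval_tidy(quo(Sepal.Width), data = mask)
## [1] 2

# The pronouns make the intended source explicit
eval_tidy(quo(.data$Sepal.Width), data = mask)
## [1] 2
eval_tidy(quo(.env$Sepal.Width), data = mask)
## [1] "value from the environment"
```

In select_4() this means a variable that shares its name with a column is silently read as that column.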

Based on the elements in col_char_num, it is not difficult to understand how the following code finds the correct column positions.

(
  cols_positions <- map_int(
    col_char_num,
    function(x) ifelse(is.character(x), vars[[x]], x)
  )
)
##       
## 5 1 4

Five big ideas of tidy evaluation

This summary omits many details of tidy evaluation. If you are new to these ideas, I strongly recommend Hadley’s video “5 big ideas of tidy evaluation”, which is only five minutes long. Below are the big five 😀

  1. R code is a tree

  2. Capture the tree by quoting

  3. Unquoting makes it easy to build trees

  4. Quote + unquote

  5. Quosures capture expression & environment
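The five ideas can be tried out in a few lines. This is only a rough sketch of my own, not code from the video; tree, inner, built, and make_quo are names made up for illustration.

```r
library(rlang)

# 1. R code is a tree: a call can be indexed like a list.
tree <- expr(mean(x + 1))
tree[[1]]   # the function name, mean
tree[[2]]   # the argument, itself a call: x + 1

# 2. Capture the tree by quoting: nothing has been evaluated so far,
#    so `x` does not need to exist yet.

# 3. and 4. Unquoting with !! inserts one tree into another while quoting.
inner <- expr(x + 1)
built <- expr(mean(!!inner))
built
## mean(x + 1)

# 5. A quosure remembers the environment where `x` lives.
make_quo <- function() {
  x <- c(1, 2, 3)
  quo(mean(x + 1))
}
eval_tidy(make_quo())
## [1] 3
```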