Below is a summary of Daniel Chen’s recent talk “Learning Tidy Evaluation by Reimplementing dplyr”. The goal is to re-implement the dplyr::select() function, which we do in four attempts.
To make these notes as concise as possible, we load
library(tidyverse)
library(rlang)
and will use the iris dataset for testing.
iris %>% head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Attempt 1
Function definition
select_1 <- function(data, col) {
col_position <- match(as.character(col), names(data))
data[, col_position, drop = FALSE]
}
To understand match(), see
match(c("c", "a"), c("a", "b", "c"))
## [1] 3 1
This is useful for selecting multiple columns because it preserves the selection order. That makes it a better fit than which(), which returns positions in the order of the searched vector:
which(c("a", "b", "c") %in% c("c", "a"))
## [1] 1 3
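To see why the ordering matters for column selection, here is a small sketch on a toy data frame (the data frame and its column names are made up for illustration and are not from the talk):

```r
# Toy data frame; the column names are illustrative only.
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
wanted <- c("c", "a")

by_match <- df[, match(wanted, names(df)), drop = FALSE]
by_which <- df[, which(names(df) %in% wanted), drop = FALSE]

names(by_match)  # "c" "a": the order that was requested
names(by_which)  # "a" "c": the data frame's own column order

# match() reports a name it cannot find as NA rather than dropping it
match(c("c", "z"), names(df))  # 3 NA
```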
Test 1
select_1(iris, "Species") %>% head(3)
## Species
## 1 setosa
## 2 setosa
## 3 setosa
Test 2
select_1(iris, Species) %>% head(3)
## Error in select_1(iris, Species): object 'Species' not found
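The second test fails because R tries to evaluate Species as a variable as soon as as.character(col) runs, and no variable of that name exists in the function or in the global environment. In other words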
as.character(Species)
cannot be evaluated, unlike
as.character("Species")
## [1] "Species"
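The error only appears when the argument is actually used, because R evaluates function arguments lazily. The two helper functions below are hypothetical (they are not from the talk) and only differ in whether they touch col:

```r
# Hypothetical helpers for illustration.
ignores_col <- function(data, col) {
  ncol(data)          # `col` is never touched, so it is never evaluated
}
forces_col <- function(data, col) {
  as.character(col)   # touching `col` forces evaluation of the caller's code
}

ignores_col(iris, Species)  # works: returns 5

# Capture the error message instead of stopping the script
result <- tryCatch(
  forces_col(iris, Species),
  error = function(e) conditionMessage(e)
)
result  # a message saying that object 'Species' was not found
```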
Attempt 2
The solution is to capture your code as an expression (without evaluating it), which can be manipulated later, such as being converted to a string.
Function definition
select_2 <- function(data, col) {
col <- enexpr(col)
col_position <- match(as.character(col), names(data))
data[, col_position, drop = FALSE]
}
Test
select_2(iris, Species) %>% head(3)
## Species
## 1 setosa
## 2 setosa
## 3 setosa
It works because the following code can be evaluated.
as.character(expr(Species))
## [1] "Species"
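Note that inside a function the capturing has to be done with enexpr() rather than expr(). A minimal sketch of the difference, using two hypothetical one-line functions:

```r
library(rlang)

# Hypothetical helpers for illustration.
with_expr   <- function(col) expr(col)    # quotes the literal code `col`
with_enexpr <- function(col) enexpr(col)  # quotes what the caller typed

as.character(with_expr(Species))    # "col"
as.character(with_enexpr(Species))  # "Species"
```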
Next, we can generalize the function to select multiple columns using dot-dot-dot.
Attempt 3
Function definition
select_3 <- function(data, ...) {
cols <- rlang::enexprs(...)
cols_char <- as.vector(cols, mode = "character")
cols_positions <- match(cols_char, names(data))
data[, cols_positions, drop = FALSE]
}
Test 1
select_3(iris, Species, Sepal.Width, Petal.Length) %>% head(3)
## Species Sepal.Width Petal.Length
## 1 setosa 3.5 1.4
## 2 setosa 3.0 1.4
## 3 setosa 3.2 1.3
Test 2
col_name <- "Species"
select_3(iris, col_name, Sepal.Width, Petal.Length) %>% head(3)
## Error in `[.data.frame`(data, , cols_positions, drop = FALSE): undefined columns selected
This time, the second test fails because
cols <- exprs(col_name, Sepal.Width, Petal.Length)
cols
## [[1]]
## col_name
##
## [[2]]
## Sepal.Width
##
## [[3]]
## Petal.Length
as.vector(cols, mode = "character")
## [1] "col_name" "Sepal.Width" "Petal.Length"
The code col_name is captured as the expression col_name, which has nothing to do with what you want: the string “Species”.
Even if you evaluate the expression col_name, a bare expression does not record where it was written, so the lookup only works when col_name happens to be visible from where the evaluation runs (here, the global environment); the variable is not defined in the function environment. Therefore, the solution is to capture the dot-dot-dot as quosures (expressions + their environments) and then evaluate the quosures.
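The environment problem becomes visible once the variable lives inside another function rather than in the global environment. The sketch below uses hypothetical helper names (eval_bare, eval_quo, caller) and assumes nothing beyond rlang:

```r
library(rlang)

# Evaluate a bare expression: names are looked up from eval_bare()'s own frame.
eval_bare <- function(x) {
  code <- enexpr(x)
  eval(code)
}

# Evaluate a quosure: names are looked up where the code was written.
eval_quo <- function(x) {
  code <- enquo(x)
  eval_tidy(code)
}

caller <- function() {
  local_name <- "Species"  # exists only inside caller()
  list(
    bare = tryCatch(eval_bare(local_name), error = function(e) "not found"),
    quo  = eval_quo(local_name)
  )
}

res <- caller()
res$bare  # "not found"
res$quo   # "Species"
```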
Attempt 4
Function definition
select_4 <- function(data, ...) {
cols <- enquos(...)
vars <- set_names(seq_along(data), names(data)) %>% as.list()
col_char_num <- map(cols, eval_tidy, vars)
cols_positions <- map_int(
col_char_num,
function(x) ifelse(is.character(x), vars[[x]], x)
)
data[, cols_positions, drop = FALSE]
}
There are quite a few changes in the new function, but they are easier to understand by stepping through the test below.
Test
select_4(iris, col_name, Sepal.Length, "Petal.Width") %>% head(3)
## Species Sepal.Length Petal.Width
## 1 setosa 5.1 0.2
## 2 setosa 4.9 0.2
## 3 setosa 4.7 0.2
Here we have it, a pretty robust re-implementation of the dplyr::select() function.
To simulate what is happening inside the function, see
(cols <- quos(col_name, Sepal.Length, "Petal.Width"))
## <list_of<quosure>>
##
## [[1]]
## <quosure>
## expr: ^col_name
## env: global
##
## [[2]]
## <quosure>
## expr: ^Sepal.Length
## env: global
##
## [[3]]
## <quosure>
## expr: ^"Petal.Width"
## env: empty
Notice each expression is captured together with its environment.
data <- iris
(vars <- set_names(seq_along(data), names(data)) %>% as.list())
## $Sepal.Length
## [1] 1
##
## $Sepal.Width
## [1] 2
##
## $Petal.Length
## [1] 3
##
## $Petal.Width
## [1] 4
##
## $Species
## [1] 5
(col_char_num <- map(cols, eval_tidy, data = vars))
## [[1]]
## [1] "Species"
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] "Petal.Width"
The function eval_tidy() evaluates a quosure (an expression bundled with an environment) and takes an additional argument, data. If data is supplied, objects in the data mask always take precedence over the quosure environment, i.e. the data masks the environment. When eval_tidy() is applied to quo(col_name), it first searches for the name col_name in the list vars, finds no match, and then evaluates the quosure in its own environment, where col_name = “Species”.
eval_tidy(quo(col_name), data = vars)
## [1] "Species"
When eval_tidy() is applied to quo(Sepal.Length), it first searches for the name Sepal.Length in the list vars and finds a match with value 1.
eval_tidy(quo(Sepal.Length), data = vars)
## [1] 1
Lastly, the string “Petal.Width” is a constant, so it always evaluates to itself.
eval_tidy(quo("Petal.Width"), data = vars)
## [1] "Petal.Width"
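The precedence rule is easiest to see when the same name exists in both places. In the sketch below, the global variable Sepal.Length is a hypothetical name clash introduced only for illustration, and vars is a shortened version of the list used above:

```r
library(rlang)

vars <- list(Sepal.Length = 1L, Species = 5L)
Sepal.Length <- "Species"  # hypothetical variable sharing a column's name

masked   <- eval_tidy(quo(Sepal.Length), data = vars)
unmasked <- eval_tidy(quo(Sepal.Length))

masked    # 1: the data mask is searched first
unmasked  # "Species": without a mask, the quosure environment is used
```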
Based on the elements in col_char_num, it is not difficult to understand how the following code finds the correct column positions.
(
cols_positions <- map_int(
col_char_num,
function(x) ifelse(is.character(x), vars[[x]], x)
)
)
##
## 5 1 4
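As a possible variation (not part of the talk), the conversion step can be written with a plain if/else and an explicit check, so that an unknown column name produces a readable error. The helper name to_position is hypothetical, and this sketch uses base R only:

```r
# Same lookup list as above, built with base R
vars <- as.list(setNames(seq_along(iris), names(iris)))

# Hypothetical helper: turn one evaluated selection into a column position
to_position <- function(x, vars) {
  if (is.character(x)) {
    if (!x %in% names(vars)) {
      stop("Unknown column: ", x, call. = FALSE)
    }
    vars[[x]]
  } else {
    as.integer(x)
  }
}

positions <- vapply(
  list("Species", 1L, "Petal.Width"),
  to_position,
  integer(1),
  vars = vars
)
positions  # 5 1 4
```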
Five big ideas of tidy evaluation
This summary omits many details of tidy evaluation. If you are new to these ideas, I strongly recommend Hadley’s “5 big ideas of tidy evaluation” video, which is only five minutes long. Below are the big five 😀
R code is a tree
Capture the tree by quoting
Unquoting makes it easy to build trees
Quote + unquote
Quosures capture expression & environment
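The five ideas can be tried out in a few lines of rlang. This is a minimal sketch with made-up variable names, not code from the video:

```r
library(rlang)

# Code is a tree: a call is a function name plus its arguments
tree <- expr(a + b * 2)
parts <- as.list(tree)  # `+`, a, b * 2

# Quote, then unquote with !! to splice one tree into another
inner <- expr(a + b)
outer <- expr(!!inner * 2)  # equivalent to (a + b) * 2
value <- eval_tidy(outer, data = list(a = 1, b = 2))
value  # 6

# A quosure carries its environment along with the expression
make_quo <- function() {
  a <- 10  # visible only inside make_quo()
  quo(a + 1)
}
from_quo <- eval_tidy(make_quo())
from_quo  # 11
```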