A Trick to Reduce Severe SQL Duplications

Peng Chen

March 8, 2021

If you want the trick now, jump over to the summary section and see if we feel the same.

Two forms of data

Long form

Suppose we have a dataset, df, simulated as

library(tidyverse)

set.seed(123)
menu <- c("beef", "chicken", "salmon", "broccoli", "apple")
df <- tibble(
  customer_id = sample(x = 7, size = 20, replace = TRUE), 
  food = sample(x = menu, size = 20, replace = TRUE)
) %>% 
  distinct() %>% 
  arrange(customer_id)

df
## # A tibble: 15 x 2
##    customer_id food    
##          <int> <chr>   
##  1           1 broccoli
##  2           1 salmon  
##  3           2 chicken 
##  4           3 beef    
##  5           3 salmon  
##  6           3 apple   
##  7           3 chicken 
##  8           4 beef    
##  9           5 broccoli
## 10           5 beef    
## 11           6 apple   
## 12           6 beef    
## 13           6 salmon  
## 14           7 broccoli
## 15           7 beef

and would like to transform it to the following form.

Wide form

df %>% 
  mutate(value = 1) %>% 
  pivot_wider(
    names_from = food, values_from = value, values_fill = 0, names_sort = TRUE
  )
## # A tibble: 7 x 6
##   customer_id apple  beef broccoli chicken salmon
##         <int> <dbl> <dbl>    <dbl>   <dbl>  <dbl>
## 1           1     0     0        1       0      1
## 2           2     0     0        0       1      0
## 3           3     1     1        0       1      1
## 4           4     0     1        0       0      0
## 5           5     0     1        1       0      0
## 6           6     1     1        0       0      1
## 7           7     0     1        1       0      0

Both forms carry the same information, food items ordered by each customer.

This sort of transformation is particularly useful in AI/ML settings and is often referred as dummy variable creation or one-hot encoding, and you have already seen this can be easily done using R. But what if the long-form table is stored in a database and you have to transform it using SQL?

Same transformation in a database

Well, let us create an in-memory database

library(RSQLite)
library(connections)

con <- connection_open(SQLite(), "customer_and_food.sqlite")

throw in the table

dbWriteTable(con, "customer_and_food", df, overwrite = TRUE)

and try it out.

Use SQL directly

SELECT 
  customer_id,
  MAX(CASE WHEN food = "apple" THEN 1 ELSE 0 END) AS apple,
  MAX(CASE WHEN food = "beef" THEN 1 ELSE 0 END) AS beef,
  MAX(CASE WHEN food = "broccoli" THEN 1 ELSE 0 END) AS broccoli,
  MAX(CASE WHEN food = "chicken" THEN 1 ELSE 0 END) AS chicken,
  MAX(CASE WHEN food = "salmon" THEN 1 ELSE 0 END) AS salmon
FROM 
  customer_and_food
GROUP BY 
  customer_id
Table 1: 7 records
customer_id apple beef broccoli chicken salmon
1 0 0 1 0 1
2 0 0 0 1 0
3 1 1 0 1 1
4 0 1 0 0 0
5 0 1 1 0 0
6 1 1 0 0 1
7 0 1 1 0 0

The SQL code does not look so bad. But who wants to eat apple, beef, broccoli, chicken, and salmon everyday? (at least not me … 🤣) What if we have thousands of food items to record? Do we have to copy and paste

MAX(CASE WHEN food = "food_i" THEN 1 ELSE 0 END) AS food_i

for a thousand times and perhaps blow up our SQL scripts and introduce some errors along the way?

I am sure some SQL experts can eliminate this redundancy using SQL itself. But I prefer the following trick.

Use SQL through dbplyr

tbl(con, "customer_and_food") %>% 
  mutate(value = 1) %>% 
  pivot_wider(
    names_from = food, values_from = value, values_fill = 0, names_sort = TRUE
  )
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.34.1
## #   [/Users/pengchen/Dropbox/P/Project/peng-chen_blog/content/blog/2021-03-08-one-hot-encoder/customer_and_food.sqlite]
##   customer_id apple  beef broccoli chicken salmon
##         <int> <dbl> <dbl>    <dbl>   <dbl>  <dbl>
## 1           1     0     0        1       0      1
## 2           2     0     0        0       1      0
## 3           3     1     1        0       1      1
## 4           4     0     1        0       0      0
## 5           5     0     1        1       0      0
## 6           6     1     1        0       0      1
## 7           7     0     1        1       0      0

Wait, what? Is this SQL? Yes, it is.

tbl(con, "customer_and_food") %>% 
  mutate(value = 1) %>% 
  pivot_wider(
    names_from = food, values_from = value, values_fill = 0, names_sort = TRUE
  ) %>% 
  show_query()
## <SQL>
## SELECT `customer_id`, MAX(CASE WHEN (`food` = 'apple') THEN (`value`) WHEN NOT(`food` = 'apple') THEN (0.0) END) AS `apple`, MAX(CASE WHEN (`food` = 'beef') THEN (`value`) WHEN NOT(`food` = 'beef') THEN (0.0) END) AS `beef`, MAX(CASE WHEN (`food` = 'broccoli') THEN (`value`) WHEN NOT(`food` = 'broccoli') THEN (0.0) END) AS `broccoli`, MAX(CASE WHEN (`food` = 'chicken') THEN (`value`) WHEN NOT(`food` = 'chicken') THEN (0.0) END) AS `chicken`, MAX(CASE WHEN (`food` = 'salmon') THEN (`value`) WHEN NOT(`food` = 'salmon') THEN (0.0) END) AS `salmon`
## FROM (SELECT `customer_id`, `food`, 1.0 AS `value`
## FROM `customer_and_food`)
## GROUP BY `customer_id`

The same (kinda) SQL code is automatically generated from the {dplyr} code, powered by its {dbplyr} backend, and then used to communicate with the database.

What is the benefit? Well, since the {dplyr} code will not grow any longer, we can quietly order some pizza, and no one will notice 🤐🤫🤫.

connection_close(con) # disconnect from the database

Summary

For one-hot-encoding-like operations, the length of your SQL scripts

SELECT 
  customer_id,
  MAX(CASE WHEN food = "apple" THEN 1 ELSE 0 END) AS apple,
  MAX(CASE WHEN food = "beef" THEN 1 ELSE 0 END) AS beef,
  MAX(CASE WHEN food = "broccoli" THEN 1 ELSE 0 END) AS broccoli,
  MAX(CASE WHEN food = "chicken" THEN 1 ELSE 0 END) AS chicken,
  MAX(CASE WHEN food = "salmon" THEN 1 ELSE 0 END) AS salmon
FROM 
  customer_and_food
GROUP BY 
  customer_id

will grow with the number of levels in the factor/categorical column to be transformed (unless you are good at writing functions and loops in SQL). When you have thousands of levels to transform, your SQL scripts tend to explode, which is not uncommon in many industries.

But this will never happen if you use {dplyr} instead.

tbl(con, "customer_and_food") %>% 
  mutate(value = 1) %>% 
  pivot_wider(
    names_from = food, values_from = value, values_fill = 0, names_sort = TRUE
  )