Verb translation • dbplyr

There are two parts to dbplyr SQL translation: translating dplyr verbs, and translating expressions within those verbs. This vignette describes how entire verbs are translated; vignette("translation-function") describes how individual expressions within those verbs are translated.

All dplyr verbs generate a SELECT statement. To demonstrate we’ll make a temporary database with a couple of tables

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)

Single table verbs

select() and mutate() modify the SELECT clause:

flights %>%
  select(contains("delay")) %>%
  show_query()

## <SQL>
## SELECT `dep_delay`, `arr_delay`
## FROM `nycflights13::flights`

flights %>%
  select(distance, air_time) %>%  
  mutate(speed = distance / (air_time / 60)) %>%
  show_query()

## <SQL>
## SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed`
## FROM `nycflights13::flights`

filter() generates a WHERE clause:

flights %>% 
  filter(month == 1, day == 1) %>%
  show_query()

## <SQL>
## SELECT `nycflights13::flights`.*
## FROM `nycflights13::flights`
## WHERE (`month` = 1.0) AND (`day` = 1.0)

arrange() generates an ORDER BY clause:

flights %>% 
  arrange(carrier, desc(arr_delay)) %>%
  show_query()

## <SQL>
## SELECT `nycflights13::flights`.*
## FROM `nycflights13::flights`
## ORDER BY `carrier`, `arr_delay` DESC

summarise() and group_by() work together to generate a GROUP BY clause:

flights %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  show_query()

## `summarise()` has grouped output by "month". You can override using
## the `.groups` argument.

## <SQL>
## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
## FROM `nycflights13::flights`
## GROUP BY `month`, `day`

Dual table verbs

R	SQL
`inner_join()`	`SELECT * FROM x JOIN y ON x.a = y.a`
`left_join()`	`SELECT * FROM x LEFT JOIN y ON x.a = y.a`
`right_join()`	`SELECT * FROM x RIGHT JOIN y ON x.a = y.a`
`full_join()`	`SELECT * FROM x FULL JOIN y ON x.a = y.a`
`semi_join()`	`SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`anti_join()`	`SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`intersect(x, y)`	`SELECT * FROM x INTERSECT SELECT * FROM y`
`union(x, y)`	`SELECT * FROM x UNION SELECT * FROM y`
`setdiff(x, y)`	`SELECT * FROM x EXCEPT SELECT * FROM y`

x and y don’t have to be tables in the same database. If you specify copy = TRUE, dplyr will copy the y table into the same location as the x variable. This is useful if you’ve downloaded a summarised dataset and determined a subset of interest that you now want the full data for. You can use semi_join(x, y, copy = TRUE) to upload the indices of interest to a temporary table in the same database as x, and then perform a efficient semi join in the database.

If you’re working with large data, it maybe also be helpful to set auto_index = TRUE. That will automatically add an index on the join variables to the temporary table.

Behind the scenes

The verb level SQL translation is implemented on top of tbl_lazy, which basically tracks the operations you perform in a pipeline (see lazy-ops.R). Turning that into a SQL query takes place in three steps:

sql_build() recurses over the lazy op data structure building up query objects (select_query(), join_query(), set_op_query() etc) that represent the different subtypes of SELECT queries that we might generate.
sql_optimise() takes a pass over these SQL objects, looking for potential optimisations. Currently this only involves removing subqueries where possible.
sql_render() calls an SQL generation function (sql_query_select(), sql_query_join(), sql_query_semi_join(), sql_query_set_op(), …) to produce the actual SQL. Each of these functions is a generic, taking the connection as an argument, so that the translation can be customised for different databases.