Verb translation • dbplyr

There are two parts to dbplyr SQL translation: translating dplyr verbs, and translating expressions within those verbs. This vignette describes how entire verbs are translated; vignette("translation-function") describes how individual expressions within those verbs are translated.

All dplyr verbs generate a SELECT statement. To demonstrate we’ll make a temporary database with a couple of tables

library(dbplyr)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)

Single table verbs

select() and mutate() modify the SELECT clause:

flights |>
  select(contains("delay")) |>
  show_query()
#> <SQL>
#> SELECT `dep_delay`, `arr_delay`
#> FROM `nycflights13::flights`

flights |>
  select(distance, air_time) |>  
  mutate(speed = distance / (air_time / 60)) |>
  show_query()
#> <SQL>
#> SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed`
#> FROM `nycflights13::flights`

filter() generates a WHERE clause:

flights |> 
  filter(month == 1, day == 1) |>
  show_query()
#> <SQL>
#> SELECT `nycflights13::flights`.*
#> FROM `nycflights13::flights`
#> WHERE (`month` = 1.0) AND (`day` = 1.0)

arrange() generates an ORDER BY clause:

flights |> 
  arrange(carrier, desc(arr_delay)) |>
  show_query()
#> <SQL>
#> SELECT `nycflights13::flights`.*
#> FROM `nycflights13::flights`
#> ORDER BY `carrier`, `arr_delay` DESC

summarise() and group_by() work together to generate a GROUP BY clause:

flights |>
  group_by(month, day) |>
  summarise(delay = mean(dep_delay, na.rm = TRUE)) |>
  show_query()
#> ! Grouped output by "month".
#> ℹ Override behaviour and silence this message with the `.groups`
#>   argument.
#> ℹ Or use `.by` instead of `group_by()`.
#> <SQL>
#> SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
#> FROM `nycflights13::flights`
#> GROUP BY `month`, `day`

Subqueries

It’s not always possible to translate a single dplyr verb into a single SQL query. For example, in SQL, variables in the SELECT clause have to come from another table; you can’t refer to a variable that you just created. For that reason, dbplyr will create subqueries where needed:

flights |>
  select(distance, air_time) |>  
  mutate(
    air_time_h = air_time / 60,
    speed = distance / air_time_h) |>
  show_query()
#> <SQL>
#> SELECT `q01`.*, `distance` / `air_time_h` AS `speed`
#> FROM (
#>   SELECT `distance`, `air_time`, `air_time` / 60.0 AS `air_time_h`
#>   FROM `nycflights13::flights`
#> ) AS `q01`

It’s also possible to use a CTE if you so desire:

flights |>
  select(distance, air_time) |>  
  mutate(
    air_time_h = air_time / 60,
    speed = distance / air_time_h) |>
  show_query(sql_options = sql_options(cte = TRUE))
#> <SQL>
#> WITH `q01` AS (
#>   SELECT `distance`, `air_time`, `air_time` / 60.0 AS `air_time_h`
#>   FROM `nycflights13::flights`
#> )
#> SELECT `q01`.*, `distance` / `air_time_h` AS `speed`
#> FROM `q01`

Sometimes dbplyr will create a subquery where it’s not strictly necessary. We strive to avoid this as much as possible, but our analysis of the generated SQL is not always complete, so we’ll typically err on the side of safety (creating more subqueries) rather than performance.

It’s important to know that most SQL dialects either error or include LIMIT and ORDER BY statements inside of subqueries. For that reason, you should always put head() and arrange() as late as possible in your pipeline.

Dual table verbs

R	SQL
`inner_join()`	`SELECT * FROM x JOIN y ON x.a = y.a`
`left_join()`	`SELECT * FROM x LEFT JOIN y ON x.a = y.a`
`right_join()`	`SELECT * FROM x RIGHT JOIN y ON x.a = y.a`
`full_join()`	`SELECT * FROM x FULL JOIN y ON x.a = y.a`
`semi_join()`	`SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`anti_join()`	`SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)`
`intersect(x, y)`	`SELECT * FROM x INTERSECT SELECT * FROM y`
`union(x, y)`	`SELECT * FROM x UNION SELECT * FROM y`
`setdiff(x, y)`	`SELECT * FROM x EXCEPT SELECT * FROM y`

x and y don’t have to be tables in the same database. If you specify copy = TRUE, dplyr will copy the y table into the same location as the x variable. This is useful if you’ve downloaded a summarised dataset and determined a subset of interest that you now want the full data for. You can use semi_join(x, y, copy = TRUE) to upload the indices of interest to a temporary table in the same database as x, and then perform a efficient semi join in the database.

If you’re working with large data, it maybe also be helpful to set auto_index = TRUE. That will automatically add an index on the join variables to the temporary table.