There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how entire verbs are translated;
vignette("translation-function") describes how individual
expressions within those verbs are translated.
All dplyr verbs generate a SELECT statement. To
demonstrate we’ll make a temporary database with a couple of tables
library(dbplyr)
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)Single table verbs
-
select()andmutate()modify theSELECTclause:flights |> select(contains("delay")) |> show_query() #> <SQL> #> SELECT `dep_delay`, `arr_delay` #> FROM `nycflights13::flights` flights |> select(distance, air_time) |> mutate(speed = distance / (air_time / 60)) |> show_query() #> <SQL> #> SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed` #> FROM `nycflights13::flights` -
filter()generates aWHEREclause:flights |> filter(month == 1, day == 1) |> show_query() #> <SQL> #> SELECT `nycflights13::flights`.* #> FROM `nycflights13::flights` #> WHERE (`month` = 1.0) AND (`day` = 1.0) -
arrange()generates anORDER BYclause:flights |> arrange(carrier, desc(arr_delay)) |> show_query() #> <SQL> #> SELECT `nycflights13::flights`.* #> FROM `nycflights13::flights` #> ORDER BY `carrier`, `arr_delay` DESC -
summarise()andgroup_by()work together to generate aGROUP BYclause:flights |> group_by(month, day) |> summarise(delay = mean(dep_delay, na.rm = TRUE)) |> show_query() #> ! Grouped output by "month". #> ℹ Override behaviour and silence this message with the `.groups` #> argument. #> ℹ Or use `.by` instead of `group_by()`. #> <SQL> #> SELECT `month`, `day`, AVG(`dep_delay`) AS `delay` #> FROM `nycflights13::flights` #> GROUP BY `month`, `day`
Subqueries
It’s not always possible to translate a single dplyr verb into a
single SQL query. For example, in SQL, variables in the
SELECT clause have to come from another table; you can’t
refer to a variable that you just created. For that reason, dbplyr will
create subqueries where needed:
flights |>
select(distance, air_time) |>
mutate(
air_time_h = air_time / 60,
speed = distance / air_time_h) |>
show_query()
#> <SQL>
#> SELECT `q01`.*, `distance` / `air_time_h` AS `speed`
#> FROM (
#> SELECT `distance`, `air_time`, `air_time` / 60.0 AS `air_time_h`
#> FROM `nycflights13::flights`
#> ) AS `q01`It’s also possible to use a CTE if you so desire:
flights |>
select(distance, air_time) |>
mutate(
air_time_h = air_time / 60,
speed = distance / air_time_h) |>
show_query(sql_options = sql_options(cte = TRUE))
#> <SQL>
#> WITH `q01` AS (
#> SELECT `distance`, `air_time`, `air_time` / 60.0 AS `air_time_h`
#> FROM `nycflights13::flights`
#> )
#> SELECT `q01`.*, `distance` / `air_time_h` AS `speed`
#> FROM `q01`Sometimes dbplyr will create a subquery where it’s not strictly necessary. We strive to avoid this as much as possible, but our analysis of the generated SQL is not always complete, so we’ll typically err on the side of safety (creating more subqueries) rather than performance.
It’s important to know that most SQL dialects either error or include
LIMIT and ORDER BY statements inside of
subqueries. For that reason, you should always put head()
and arrange() as late as possible in your pipeline.
Dual table verbs
| R | SQL |
|---|---|
inner_join() |
SELECT * FROM x JOIN y ON x.a = y.a |
left_join() |
SELECT * FROM x LEFT JOIN y ON x.a = y.a |
right_join() |
SELECT * FROM x RIGHT JOIN y ON x.a = y.a |
full_join() |
SELECT * FROM x FULL JOIN y ON x.a = y.a |
semi_join() |
SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
anti_join() |
SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
intersect(x, y) |
SELECT * FROM x INTERSECT SELECT * FROM y |
union(x, y) |
SELECT * FROM x UNION SELECT * FROM y |
setdiff(x, y) |
SELECT * FROM x EXCEPT SELECT * FROM y |
x and y don’t have to be tables in the same
database. If you specify copy = TRUE, dplyr will copy the
y table into the same location as the x
variable. This is useful if you’ve downloaded a summarised dataset and
determined a subset of interest that you now want the full data for. You
can use semi_join(x, y, copy = TRUE) to upload the indices
of interest to a temporary table in the same database as x,
and then perform a efficient semi join in the database.
If you’re working with large data, it maybe also be helpful to set
auto_index = TRUE. That will automatically add an index on
the join variables to the temporary table.
