There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how entire verbs are translated;
vignette("translation-function") describes how individual
expressions within those verbs are translated.
All dplyr verbs generate a SELECT statement. To
demonstrate we’ll make a temporary database with a couple of tables
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)Single table verbs
-
select()andmutate()modify theSELECTclause:flights %>% select(contains("delay")) %>% show_query()## <SQL> ## SELECT `dep_delay`, `arr_delay` ## FROM `nycflights13::flights`flights %>% select(distance, air_time) %>% mutate(speed = distance / (air_time / 60)) %>% show_query()## <SQL> ## SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed` ## FROM `nycflights13::flights` -
filter()generates aWHEREclause:flights %>% filter(month == 1, day == 1) %>% show_query()## <SQL> ## SELECT `nycflights13::flights`.* ## FROM `nycflights13::flights` ## WHERE (`month` = 1.0) AND (`day` = 1.0) -
arrange()generates anORDER BYclause:flights %>% arrange(carrier, desc(arr_delay)) %>% show_query()## <SQL> ## SELECT `nycflights13::flights`.* ## FROM `nycflights13::flights` ## ORDER BY `carrier`, `arr_delay` DESC -
summarise()andgroup_by()work together to generate aGROUP BYclause:flights %>% group_by(month, day) %>% summarise(delay = mean(dep_delay, na.rm = TRUE)) %>% show_query()## `summarise()` has grouped output by "month". You can override using ## the `.groups` argument.## <SQL> ## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay` ## FROM `nycflights13::flights` ## GROUP BY `month`, `day`
Dual table verbs
| R | SQL |
|---|---|
inner_join() |
SELECT * FROM x JOIN y ON x.a = y.a |
left_join() |
SELECT * FROM x LEFT JOIN y ON x.a = y.a |
right_join() |
SELECT * FROM x RIGHT JOIN y ON x.a = y.a |
full_join() |
SELECT * FROM x FULL JOIN y ON x.a = y.a |
semi_join() |
SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
anti_join() |
SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
intersect(x, y) |
SELECT * FROM x INTERSECT SELECT * FROM y |
union(x, y) |
SELECT * FROM x UNION SELECT * FROM y |
setdiff(x, y) |
SELECT * FROM x EXCEPT SELECT * FROM y |
x and y don’t have to be tables in the same
database. If you specify copy = TRUE, dplyr will copy the
y table into the same location as the x
variable. This is useful if you’ve downloaded a summarised dataset and
determined a subset of interest that you now want the full data for. You
can use semi_join(x, y, copy = TRUE) to upload the indices
of interest to a temporary table in the same database as x,
and then perform a efficient semi join in the database.
If you’re working with large data, it maybe also be helpful to set
auto_index = TRUE. That will automatically add an index on
the join variables to the temporary table.
Behind the scenes
The verb level SQL translation is implemented on top of
tbl_lazy, which basically tracks the operations you perform
in a pipeline (see lazy-ops.R). Turning that into a SQL
query takes place in three steps:
sql_build()recurses over the lazy op data structure building up query objects (select_query(),join_query(),set_op_query()etc) that represent the different subtypes ofSELECTqueries that we might generate.sql_optimise()takes a pass over these SQL objects, looking for potential optimisations. Currently this only involves removing subqueries where possible.sql_render()calls an SQL generation function (sql_query_select(),sql_query_join(),sql_query_semi_join(),sql_query_set_op(), …) to produce the actual SQL. Each of these functions is a generic, taking the connection as an argument, so that the translation can be customised for different databases.
