There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how entire verbs are translated;
vignette("translation-function")
describes how individual
expressions within those verbs are translated.
All dplyr verbs generate a SELECT
statement. To
demonstrate we’ll make a temporary database with a couple of tables
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights <- copy_to(con, nycflights13::flights)
airports <- copy_to(con, nycflights13::airports)
Single table verbs
-
select()
andmutate()
modify theSELECT
clause:flights %>% select(contains("delay")) %>% show_query()
## <SQL> ## SELECT `dep_delay`, `arr_delay` ## FROM `nycflights13::flights`
flights %>% select(distance, air_time) %>% mutate(speed = distance / (air_time / 60)) %>% show_query()
## <SQL> ## SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed` ## FROM `nycflights13::flights`
-
filter()
generates aWHERE
clause:flights %>% filter(month == 1, day == 1) %>% show_query()
## <SQL> ## SELECT `nycflights13::flights`.* ## FROM `nycflights13::flights` ## WHERE (`month` = 1.0) AND (`day` = 1.0)
-
arrange()
generates anORDER BY
clause:flights %>% arrange(carrier, desc(arr_delay)) %>% show_query()
## <SQL> ## SELECT `nycflights13::flights`.* ## FROM `nycflights13::flights` ## ORDER BY `carrier`, `arr_delay` DESC
-
summarise()
andgroup_by()
work together to generate aGROUP BY
clause:flights %>% group_by(month, day) %>% summarise(delay = mean(dep_delay, na.rm = TRUE)) %>% show_query()
## `summarise()` has grouped output by "month". You can override using the ## `.groups` argument.
## <SQL> ## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay` ## FROM `nycflights13::flights` ## GROUP BY `month`, `day`
Dual table verbs
R | SQL |
---|---|
inner_join() |
SELECT * FROM x JOIN y ON x.a = y.a |
left_join() |
SELECT * FROM x LEFT JOIN y ON x.a = y.a |
right_join() |
SELECT * FROM x RIGHT JOIN y ON x.a = y.a |
full_join() |
SELECT * FROM x FULL JOIN y ON x.a = y.a |
semi_join() |
SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
anti_join() |
SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
intersect(x, y) |
SELECT * FROM x INTERSECT SELECT * FROM y |
union(x, y) |
SELECT * FROM x UNION SELECT * FROM y |
setdiff(x, y) |
SELECT * FROM x EXCEPT SELECT * FROM y |
x
and y
don’t have to be tables in the same
database. If you specify copy = TRUE
, dplyr will copy the
y
table into the same location as the x
variable. This is useful if you’ve downloaded a summarised dataset and
determined a subset of interest that you now want the full data for. You
can use semi_join(x, y, copy = TRUE)
to upload the indices
of interest to a temporary table in the same database as x
,
and then perform a efficient semi join in the database.
If you’re working with large data, it maybe also be helpful to set
auto_index = TRUE
. That will automatically add an index on
the join variables to the temporary table.
Behind the scenes
The verb level SQL translation is implemented on top of
tbl_lazy
, which basically tracks the operations you perform
in a pipeline (see lazy-ops.R
). Turning that into a SQL
query takes place in three steps:
sql_build()
recurses over the lazy op data structure building up query objects (select_query()
,join_query()
,set_op_query()
etc) that represent the different subtypes ofSELECT
queries that we might generate.sql_optimise()
takes a pass over these SQL objects, looking for potential optimisations. Currently this only involves removing subqueries where possible.sql_render()
calls an SQL generation function (sql_query_select()
,sql_query_join()
,sql_query_semi_join()
,sql_query_set_op()
, …) to produce the actual SQL. Each of these functions is a generic, taking the connection as an argument, so that the translation can be customised for different databases.