This document describes how to add a new SQL backend to dbplyr.

This document is still a work in progress, but it will hopefully get you started. I’d also strongly recommend reading the bundled source code for SQLite, MySQL, and PostgreSQL.

First steps

For interactive exploration, attach dplyr and DBI. If you’re creating a package, you’ll need to import dplyr and DBI.

Check that you can create a tbl from a connection, like:

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

tbl(con, "mtcars")
#> # Source:   table<`mtcars`> [?? x 11]
#> # Database: sqlite 3.51.1 [:memory:]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
#> # ℹ more rows

If you can’t, this likely indicates some problem with the DBI methods. Use DBItest to narrow down the problem.
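
A rough sketch of how you might run DBItest from a testthat file is shown below. The file name and the choice of test functions are illustrative rather than prescriptive (see the DBItest documentation for its recommended setup), and you would swap in your own driver for RSQLite::SQLite().

# tests/testthat/test-DBItest.R (sketch only; adapt connect_args to your driver)
DBItest::make_context(
  RSQLite::SQLite(),
  connect_args = list(dbname = tempfile(fileext = ".sqlite"))
)
DBItest::test_getting_started()
DBItest::test_connection()
DBItest::test_result()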

Write your first method

The first method of your dbplyr backend should always be for the dbplyr_edition() generic:

#' @importFrom dbplyr dbplyr_edition
#' @export
dbplyr_edition.myConnectionClass <- function(con) 2L

This declares that your package uses version 2 of the API, which is the version that this vignette documents.

Copying, computing, collecting and collapsing

Next, check that copy_to(), collapse(), compute(), and collect() work:
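
A quick smoke test, assuming the SQLite connection from above (the names db and db2 are just illustrative), might look like this:

library(dplyr)

db <- copy_to(con, mtcars, overwrite = TRUE) # copy a local data frame to the database
db2 <- db %>% filter(cyl == 4) %>% compute() # force computation into a temporary table
collapse(db2) # wrap the current query in a subquery; still lazy
collect(db2) # execute the query and return the results as a tibble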

SQL translation: verbs

Make sure you’ve read vignette("translation-verb") so you have the lay of the land. First, check that SQL translation for the key verbs works:
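
For example, using the db table from the sketch above, each of these pipelines should print sensible SQL via show_query():

db %>% filter(cyl > 4) %>% select(mpg, cyl) %>% show_query()

db %>% group_by(cyl) %>% summarise(mpg = mean(mpg, na.rm = TRUE)) %>% show_query()

db %>% left_join(db, by = "cyl") %>% show_query()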

Building SQL strings

If you need to generate your own SQL, we recommend using sql_glue2(). It uses glue syntax with type markers for safe SQL generation:

con <- simulate_dbi()

# Create an index
index_name <- "index"
table <- I("schema.table")
columns <- c("column1", "column2")
sql_glue2(con, "CREATE INDEX {.id index_name} ON {.tbl table} {.id columns*}")
#> <SQL> CREATE INDEX "index" ON schema.table ("column1", "column2")

# Insert values safely
name <- "O'Brien"
sql_glue2(con, "INSERT INTO students (name) VALUES {name*}")
#> <SQL> INSERT INTO students (name) VALUES ('O''Brien')

# Build a query
table <- "my_table"
cols <- c("id", "name", "value")
sql_glue2(con, "SELECT {.id cols} FROM {.tbl table}")
#> <SQL> SELECT "id", "name", "value" FROM "my_table"

Behind the scenes

The high-level translation from R code to SQL takes place in a few steps, using two intermediate representations: a “lazy query” and a “query”. A lazy query is closer to R and a query is closer to SQL. This continuum of representation allows us to work with expressions when trying to reason about the transformation pipeline, then later switch to strings of SQL when we get closer to generating a query.

All dbplyr pipelines are built on top of the tbl_lazy data structure, which is the only data structure that the user will ever see. It has two fields:

  • A database connection which is used to specialise SQL translation.
  • A lazy_query which represents the lazy computation described by the pipeline. It always starts off as a lazy_base_query which represents a remote table or SQL query.

Each verb takes a tbl_lazy as input and returns a modified tbl_lazy, updating the lazy_query with the new requirements. At the simplest level, this involves wrapping the existing lazy query in a new lazy query, such as lazy_select_query() or lazy_multi_join_query(). These represent what will later become a subquery: for example, filter(), mutate(), select(), arrange(), and summarise() all use lazy_select_query(), whereas left_join() and inner_join() use lazy_multi_join_query().
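
You can see this nesting for yourself with the simulated backend that dbplyr ships for testing; lazy_frame() and simulate_dbi() are exported, and the class noted in the comment reflects current internals rather than a stable guarantee:

library(dplyr)
library(dbplyr)

lf <- lazy_frame(x = 1, y = 2, con = simulate_dbi())
q <- lf %>% filter(x > 0)

# The pipeline is recorded as a nested lazy query, not as SQL text
class(q$lazy_query) # typically c("lazy_select_query", "lazy_query")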

These lazy queries are typically created by helper functions like add_select(), add_mutate(), and add_filter(). These helpers have two roles:

  1. They abstract away repeated code for substantially similar dplyr verbs. For example, select(), rename(), rename_with() and relocate() all use add_select().

  2. They are responsible for figuring out whether we can reuse the existing query or whether we need to wrap it in a subquery. Typically this work is done by a function called can_inline_select(), which uses what we know about SQL and the verb to determine whether a subquery has to be introduced; the sketch after this list shows the effect.
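
The sketch below illustrates the difference with a simulated connection: a select() followed by a rename() can be inlined into a single SELECT, whereas a mutate() that refers to a column created in the same call needs a subquery.

library(dplyr)
library(dbplyr)

lf <- lazy_frame(x = 1, y = 2, con = simulate_dbi())

# Inlined into a single SELECT
lf %>% select(x) %>% rename(z = x) %>% show_query()

# `b` uses `a`, which is created in the same mutate(), so a subquery is needed
lf %>% mutate(a = x + 1, b = a * 2) %>% show_query()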

The lazy query gives us a representation of the data manipulation pipeline with enough detail that we can reason about when we need a subquery. Then when the user calls collect(), we need to generate an SQL query. This happens in two steps: building and rendering.

First, sql_build() recurses over the lazy query, building up query objects like select_query() and multi_join_query() that represent the different subtypes of SELECT queries. Compared to the previous “lazy” query representation, these are much closer to SQL than R. They can be much simpler because we no longer need to reason about subqueries, but we still need some structure to facilitate backend-specific translations.

Next, sql_render() is called on the result of sql_build() and dispatches to generics like sql_query_select(), sql_query_join(), sql_query_semi_join(), and sql_query_set_op(). These generics have methods for each backend that are responsible for creating the needed SQL.
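
You can inspect both steps directly. With a simulated connection the rendered SQL is generic ANSI-flavoured SQL, but the shape of the output is the same as for a real backend:

library(dplyr)
library(dbplyr)

q <- lazy_frame(x = 1, y = 2, con = simulate_dbi()) %>%
  group_by(x) %>%
  summarise(total = sum(y, na.rm = TRUE))

sql_build(q) # the intermediate query object (a select query)
sql_render(q) # the final SQL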

SQL translation: vectors

Finally, you may have to provide custom R -> SQL translation for functions that work with vectors within verbs. You can do this by providing a method for sql_translation(), which returns an object created by sql_variant(). The sql_variant() function creates a container for three types of function translations:

sql_translation.myConnectionClass <- function(con) {
  sql_variant(
    scalar = sql_translator(.parent = base_scalar, ...), # Functions in SELECT (non-aggregated)
    aggregate = sql_translator(.parent = base_agg, ...), # Aggregation functions (mean, sum, etc.)
    window = sql_translator(.parent = base_win, ...) # Window functions (lead, lag, rank, etc.)
  )
}

Each translator inherits from the base (ANSI SQL) translator and overrides only what’s different for your backend:

sql_translator(
  .parent = base_scalar, # Inherit most translations
  # Override specific functions for your backend
  `+` = sql_infix("+"),
  mean = sql_aggregate("AVERAGE", "mean")
)

Scalar function helpers

dbplyr provides several helper functions to make it easier to translate R functions to SQL:

  • sql_prefix(f, n = NULL): For standard SQL functions. The n argument optionally specifies the number of arguments.
  • sql_infix(f): For infix operators like +, *, or ==.
  • sql_cast(type): For type casting functions.
  • sql_not_supported(f): For functions with no SQL translation.

Here’s an example showing all of these helpers in use:

sql_translation.myConnectionClass <- function(con) {
  sql_variant(
    scalar = sql_translator(
      .parent = base_scalar,
      # Standard SQL functions
      cos = sql_prefix("COS", 1),
      round = sql_prefix("ROUND", 2),
      # Infix operators
      `+` = sql_infix("+"),
      `*` = sql_infix("*"),
      `==` = sql_infix("="),
      # Type casting
      as.numeric = sql_cast("NUMERIC"),
      as.character = sql_cast("VARCHAR")
    ),
    aggregate = sql_translator(.parent = base_agg),
    window = sql_translator(.parent = base_win)
  )
}

Aggregation function helpers

  • sql_aggregate(f, f_r = f): For single-argument SQL aggregate functions. The f_r argument gives the name of the R function.
  • sql_aggregate_2(f): For two-argument SQL aggregate functions.
  • sql_aggregate_n(f, f_r = f): For variadic SQL aggregate functions.

Here’s an example showing all of these helpers in use:

sql_translation.myConnectionClass <- function(con) {
  sql_variant(
    scalar = sql_translator(.parent = base_scalar),
    aggregate = sql_translator(
      .parent = base_agg,
      # Single-argument aggregates
      mean = sql_aggregate("AVG", "mean"),
      var = sql_aggregate("VAR_SAMP", "var"),
      # Two-argument aggregates
      cov = sql_aggregate_2("COVAR_SAMP"),
      # Variadic aggregates
      pmin = sql_aggregate_n("LEAST", "pmin"),
      pmax = sql_aggregate_n("GREATEST", "pmax"),
      # Unsupported functions
      median = sql_not_supported("median")
    ),
    window = sql_translator(.parent = base_win)
  )
}

Window function helpers

Window functions have their own set of helpers:

  • win_rank(f): For ranking functions.
  • win_aggregate(f): For aggregate functions used as window functions.
  • win_cumulative(f): For cumulative functions.
  • win_absent(f): For backends that don’t support certain window functions.

Here’s an example showing all of these helpers in use:

window = sql_translator(
  .parent = base_win,
  # Ranking functions
  row_number = win_rank("ROW_NUMBER"),
  rank = win_rank("RANK"),
  dense_rank = win_rank("DENSE_RANK"),
  # Aggregate functions as window functions
  mean = win_aggregate("AVG"),
  sum = win_aggregate("SUM"),
  # Cumulative functions
  cumsum = win_cumulative("SUM"),
  # Absent functions
  cume_dist = win_absent("cume_dist")
)

Custom translation functions

For more complex translations, you can write custom functions that return SQL expressions using sql_glue(). This uses glue syntax for string interpolation with automatic escaping.

scalar = sql_translator(
  .parent = base_scalar,

  # Custom log function with change of base
  log = function(x, base = exp(1)) {
    if (isTRUE(all.equal(base, exp(1)))) {
      sql_glue("LN({x})")
    } else {
      sql_glue("LOG({x}) / LOG({base})")
    }
  },

  # Custom paste function using CONCAT
  paste = function(..., sep = " ") {
    sql_glue("CONCAT_WS({sep}, {...})")
  }
)