src_dbi() is a general dplyr backend that connects to any DBI driver. src_memdb() connects to a temporary in-memory SQLite database, that's useful for testing and experimenting.

You can generate a tbl() directly from the DBI connection, or go via src_dbi().

src_dbi(con, auto_disconnect = FALSE)

# S3 method for src_dbi
tbl(src, from, ...)

Arguments

con

An object that inherits from DBI::DBIConnection, typically generated by DBI::dbConnect

auto_disconnect

Should the connection be automatically closed when the src is deleted. This is useful for older DBI backends that don't clean up themselves.

src

Either a src_dbi or DBIConnection

from

Either a string (giving a table name) or literal sql().

...

Needed for compatibility with generic; currently ignored.

Value

An S3 object with class src_dbi, src_sql, src.

Details

All data manipulation on SQL tbls are lazy: they will not actually run the query or retrieve the data unless you ask for it: they all return a new tbl_dbi object. Use compute() to run the query and save the results in a temporary in the database, or use collect() to retrieve the results to R. You can see the query with show_query().

For best performance, the database should have an index on the variables that you are grouping by. Use explain() to check that the database is using the indexes that you expect.

There is one excpetion: do() is not lazy since it must pull the data into R.

Examples

# Basic connection using DBI ------------------------------------------- library(dplyr) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") src <- src_dbi(con) # Add some data copy_to(src, mtcars) src
#> src: sqlite 3.11.1 [:memory:] #> tbls: mtcars, sqlite_stat1
DBI::dbListTables(con)
#> [1] "mtcars" "sqlite_stat1"
# To retrieve a single table from a source, use `tbl()` src %>% tbl("mtcars")
#> Source: table<mtcars> [?? x 11] #> Database: sqlite 3.11.1 [:memory:] #> #> # S3: tbl_dbi #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 #> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 #> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 #> # ... with more rows
# You can also use pass raw SQL if you want a more sophisticated query src %>% tbl(sql("SELECT * FROM mtcars WHERE cyl == 8"))
#> Source: SQL [?? x 11] #> Database: sqlite 3.11.1 [:memory:] #> #> # S3: tbl_dbi #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2 #> 2 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4 #> 3 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3 #> 4 17.3 8 275.8 180 3.07 3.73 17.60 0 0 3 3 #> 5 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3 #> 6 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4 #> # ... with more rows
# Alternatively, you can use the `src_sqlite()` helper src2 <- src_sqlite(":memory:", create = TRUE) # If you just want a temporary in-memory database, use src_memdb() src3 <- src_memdb() # To show off the full features of dplyr's database integration, # we'll use the Lahman database. lahman_sqlite() takes care of # creating the database. if (has_lahman("sqlite")) { lahman_p <- lahman_sqlite() batting <- lahman_p %>% tbl("Batting") batting # Basic data manipulation verbs work in the same way as with a tibble batting %>% filter(yearID > 2005, G > 130) batting %>% select(playerID:lgID) batting %>% arrange(playerID, desc(yearID)) batting %>% summarise(G = mean(G), n = n()) # There are a few exceptions. For example, databases give integer results # when dividing one integer by another. Multiply by 1 to fix the problem batting %>% select(playerID:lgID, AB, R, G) %>% mutate( R_per_game1 = R / G, R_per_game2 = R * 1.0 / G ) # All operations are lazy: they don't do anything until you request the # data, either by `print()`ing it (which shows the first ten rows), # or by `collect()`ing the results locally. system.time(recent <- filter(batting, yearID > 2010)) system.time(collect(recent)) # You can see the query that dplyr creates with show_query() batting %>% filter(G > 0) %>% group_by(playerID) %>% summarise(n = n()) %>% show_query() }
#> Auto-disconnecting PostgreSQLConnection
#> <SQL> #> SELECT `playerID`, COUNT() AS `n` #> FROM `Batting` #> WHERE (`G` > 0.0) #> GROUP BY `playerID`