All data manipulation on SQL tbls are lazy: they will not actually run the query or retrieve the data unless you ask for it: they all return a new tbl_dbi object. Use compute() to run the query and save the results in a temporary in the database, or use collect() to retrieve the results to R. You can see the query with show_query().

# S3 method for src_dbi
tbl(src, from, ...)

Arguments

src

A DBIConnection object produced by DBI::dbConnect().

from

Either a string (giving a table name), a fully qualified table name created by in_schema() or wrapped by ident_q(), or a literal sql() string.

...

Needed for compatibility with generic; currently ignored.

Details

For best performance, the database should have an index on the variables that you are grouping by. Use explain() to check that the database is using the indexes that you expect.

There is one verb that is not lazy: do() is eager because it must pull the data into R.

Examples

library(dplyr) # Connect to a temporary in-memory SQLite database con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # Add some data copy_to(con, mtcars) DBI::dbListTables(con)
#> [1] "mtcars" "sqlite_stat1" "sqlite_stat4"
# To retrieve a single table from a source, use `tbl()` con %>% tbl("mtcars")
#> # Source: table<mtcars> [?? x 11] #> # Database: sqlite 3.30.1 [:memory:] #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 #> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 #> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 #> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 #> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 #> # … with more rows
# Use `in_schema()` or `ident_q()` for fully qualified table names con %>% tbl(in_schema("temp", "mtcars")) %>% head(1)
#> # Source: lazy query [?? x 11] #> # Database: sqlite 3.30.1 [:memory:] #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
con %>% tbl(ident_q("temp.mtcars")) %>% head(1)
#> # Source: lazy query [?? x 11] #> # Database: sqlite 3.30.1 [:memory:] #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# You can also use pass raw SQL if you want a more sophisticated query con %>% tbl(sql("SELECT * FROM mtcars WHERE cyl = 8"))
#> # Source: SQL [?? x 11] #> # Database: sqlite 3.30.1 [:memory:] #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> 2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 #> 3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 #> 4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 #> 5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 #> 6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 #> 7 10.4 8 460 215 3 5.42 17.8 0 0 3 4 #> 8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 #> 9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 #> 10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 #> # … with more rows
# If you just want a temporary in-memory database, use src_memdb() src2 <- src_memdb() # To show off the full features of dplyr's database integration, # we'll use the Lahman database. lahman_sqlite() takes care of # creating the database. if (has_lahman("sqlite")) { lahman_p <- lahman_sqlite() batting <- lahman_p %>% tbl("Batting") batting # Basic data manipulation verbs work in the same way as with a tibble batting %>% filter(yearID > 2005, G > 130) batting %>% select(playerID:lgID) batting %>% arrange(playerID, desc(yearID)) batting %>% summarise(G = mean(G), n = n()) # There are a few exceptions. For example, databases give integer results # when dividing one integer by another. Multiply by 1 to fix the problem batting %>% select(playerID:lgID, AB, R, G) %>% mutate( R_per_game1 = R / G, R_per_game2 = R * 1.0 / G ) # All operations are lazy: they don't do anything until you request the # data, either by `print()`ing it (which shows the first ten rows), # or by `collect()`ing the results locally. system.time(recent <- filter(batting, yearID > 2010)) system.time(collect(recent)) # You can see the query that dplyr creates with show_query() batting %>% filter(G > 0) %>% group_by(playerID) %>% summarise(n = n()) %>% show_query() }
#> <SQL> #> SELECT `playerID`, COUNT() AS `n` #> FROM `Batting` #> WHERE (`G` > 0.0) #> GROUP BY `playerID`