sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.
You can install the released version of sparkbq from CRAN, or the latest development version from GitHub.
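A sketch of both installation commands, assuming the standard CRAN and `devtools` workflows; the GitHub repository path below is an assumption and may differ:

```r
# Released version from CRAN
install.packages("sparkbq")

# Latest development version from GitHub
# (repository path "miraisolutions/sparkbq" is assumed)
# install.packages("devtools")
devtools::install_github("miraisolutions/sparkbq")
```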
The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:

| sparkbq | spark-bigquery | Apache Spark | Scala | Google Dataproc |
|---|---|---|---|---|
| 0.1.x | 0.1.0 | 2.2.x and 2.3.x | 2.11 | 1.2.x and 1.3.x |
sparkbq is based on the Spark package spark-bigquery, which is available in a separate GitHub repository.
```r
library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <-
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!

# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")
```
When running outside of Google Cloud, it is necessary to specify a service account JSON key file. The service account key file can be passed as the parameter `serviceAccountKeyFile`, either to `bigquery_defaults` or directly to `spark_read_bigquery` and `spark_write_bigquery`.
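For example, a sketch of passing the key file directly to a read call, using the same public table as above (the placeholder path is illustrative):

```r
# Pass the service account key file per call instead of via bigquery_defaults
hamlet <- spark_read_bigquery(
  sc,
  name = "hamlet",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  serviceAccountKeyFile = "/path/to/your/service_account_keyfile.json"
)
```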
Alternatively, the environment variable `GOOGLE_APPLICATION_CREDENTIALS` can be set, e.g. `export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json` (see https://cloud.google.com/docs/authentication/getting-started for more information). Make sure the variable is set before starting the R session.
When running on Google Cloud, e.g. on Google Cloud Dataproc, application default credentials (ADC) may be used, in which case it is not necessary to specify a service account key file.
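In that case, the defaults can be set without a key file, relying on ADC; a sketch assuming the same placeholder settings as above:

```r
# No serviceAccountKeyFile: application default credentials (ADC)
# are picked up automatically when running on Google Cloud
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  type = "direct"
)
```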