The Arrow C++ library includes a generic filesystem interface and
specific implementations for some cloud storage systems. This setup
allows various parts of the project to read and write data
with different storage backends. In the arrow
R package,
support has been enabled for AWS S3 and Google Cloud Storage (GCS). This
vignette provides an overview of working with S3 and GCS data using
Arrow.
S3 and GCS support is included in the Windows and macOS binary packages. On Linux, when installing from source, S3 and GCS support is not always enabled by default, and it has additional system requirements. See
vignette("install", package = "arrow")
for details.
One way of working with filesystems is to create
?FileSystem
objects. ?S3FileSystem
objects can
be created with the s3_bucket()
function, which
automatically detects the bucket’s AWS region. Similarly,
?GcsFileSystem
objects can be created with the
gs_bucket()
function. The resulting FileSystem
will consider paths relative to the bucket’s path (so for example you
don’t need to prefix the bucket path when listing a directory).
With a FileSystem
object, you can point to specific
files in it with the $path()
method and pass the result to
file readers and writers (read_parquet()
,
write_feather()
, et al.). For example, to read a parquet
file from the example NYC taxi data (used in
vignette("dataset", package = "arrow")
):
bucket <- s3_bucket("voltrondata-labs-datasets")
# Or in GCS (anonymous = TRUE is required if credentials are not configured):
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet"))
Note that this will be slower to read than if the file were local, though if you’re running on a machine in the same AWS region as the file in S3, the cost of reading the data over the network should be much lower.
You can list the files and/or directories in a bucket or subdirectory
using the $ls()
method:
$ls("nyc-taxi")
bucket# Or recursive:
$ls("nyc-taxi", recursive = TRUE) bucket
NOTE: in GCS, you should always use
recursive = TRUE
as directories often don’t appear in
$ls()
results.
See help(FileSystem)
for a list of options that
s3_bucket()
/S3FileSystem$create()
and
gs_bucket()
/GcsFileSystem$create()
can
take.
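For example, a minimal sketch of passing one of these options explicitly (the region value shown is only illustrative, since s3_bucket() normally detects it for you):
bucket <- s3_bucket("voltrondata-labs-datasets", region = "us-east-2")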
The object that s3_bucket()
and gs_bucket()
return is technically a SubTreeFileSystem
, which holds a
path and a file system to which it corresponds.
SubTreeFileSystem
s can be useful for holding a reference to
a subdirectory somewhere (on S3, GCS, or elsewhere).
One way to get a subtree is to call the $cd()
method on
a FileSystem:
june2019 <- bucket$cd("nyc-taxi/year=2019/month=6")
df <- read_parquet(june2019$path("data.parquet"))
A SubTreeFileSystem
can also be made from a URI:
june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
File readers and writers (read_parquet()
,
write_feather()
, et al.) also accept a URI as the source or
destination file, as do open_dataset()
and
write_dataset()
. An S3 URI looks like:
s3://[access_key:secret_key@]bucket/path[?region=]
A GCS URI looks like:
gs://[access_key:secret_key@]bucket/path
gs://anonymous@bucket/path
For example, one of the NYC taxi data files used in
vignette("dataset", package = "arrow")
is found at
s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
# Or in GCS (anonymous is required if credentials are not configured):
gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
Given this URI, you can pass it to read_parquet()
just
as if it were a local file path:
<- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
df # Or in GCS:
<- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet") df
URIs accept additional options in the query parameters (the part
after the ?
) that are passed down to configure the
underlying file system. They are separated by &
. For
example,
s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
is equivalent to:
fs <- S3FileSystem$create(
  endpoint_override = "https://storage.googleapis.com",
  allow_bucket_creation = TRUE
)
fs$path("voltrondata-labs-datasets/")
Both tell the S3FileSystem
that it should allow the
creation of new buckets and to talk to Google Storage instead of S3. The
latter works because GCS implements an S3-compatible API (see File systems that emulate S3
below), but for better GCS support use the GcsFileSystem with
gs://
. Also note that parameters in the URI need to be percent
encoded, which is why ://
is written as
%3A%2F%2F
.
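For example, base R's URLencode() can produce that encoding:
URLencode("https://storage.googleapis.com", reserved = TRUE)
#> [1] "https%3A%2F%2Fstorage.googleapis.com"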
For S3, the only options that can be included in the URI as
query parameters are region
, scheme
,
endpoint_override
, access_key
,
secret_key
, allow_bucket_creation
, and
allow_bucket_deletion
. For GCS, the supported parameters
are scheme
, endpoint_override
, and
retry_limit_seconds
.
In GCS, a useful option is retry_limit_seconds
, which
sets the number of seconds a request may spend retrying before returning
an error. The current default is 15 minutes, so in many interactive
contexts it’s nice to set a lower value:
gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
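The same option can also be passed when creating the filesystem object; a minimal sketch:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE, retry_limit_seconds = 10)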
To access private S3 buckets, you typically need two secret
parameters: an access_key, which is like a user ID, and a
secret_key, which is like a token or password. There are a
few options for passing these credentials:
Include them in the URI, like
s3://access_key:secret_key@bucket-name/path/to/file
. Be
sure to URL-encode
your secrets if they contain special characters like “/” (e.g.,
URLencode("123/456", reserved = TRUE)
).
Pass them as access_key
and secret_key
to S3FileSystem$create()
or
s3_bucket().
Set them as environment variables named
AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
,
respectively.
Define them in a ~/.aws/credentials
file, according
to the AWS
documentation.
Use an AccessRole
for temporary access by passing the role_arn
identifier to
S3FileSystem$create()
or s3_bucket()
.
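For example, a minimal sketch of the second and third options (the bucket name and key values below are placeholders; never hard-code real secrets):
# Pass the keys directly to s3_bucket():
bucket <- s3_bucket(
  "my-private-bucket",
  access_key = "AKIAXXXXXXXXXXXX",
  secret_key = "xxxxxxxxxxxxxxxxxxxx"
)
# Or set the corresponding environment variables first:
Sys.setenv(
  AWS_ACCESS_KEY_ID = "AKIAXXXXXXXXXXXX",
  AWS_SECRET_ACCESS_KEY = "xxxxxxxxxxxxxxxxxxxx"
)
bucket <- s3_bucket("my-private-bucket")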
The simplest way to authenticate with GCS is to run the gcloud command to set up application default credentials:
gcloud auth application-default login
To manually configure credentials, you can pass either
access_token
and expiration
, for using
temporary tokens generated elsewhere, or json_credentials
,
to reference a downloaded credentials file.
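For example, a minimal sketch using a downloaded credentials file (the bucket name and file path are placeholders):
bucket <- gs_bucket(
  "my-private-bucket",
  json_credentials = "/path/to/service-account-key.json"
)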
If you haven’t configured credentials, then to access public
buckets, you must pass anonymous = TRUE
or
anonymous
as the user in a URI:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
fs <- GcsFileSystem$create(anonymous = TRUE)
df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
If you need to use a proxy server to connect to an S3 bucket, you can
provide a URI in the form http://user:password@host:port
to
proxy_options
. For example, a local proxy server running on
port 1316 can be used like this:
<- s3_bucket("voltrondata-labs-datasets", proxy_options = "http://localhost:1316") bucket
The S3FileSystem
machinery enables you to work with any
file system that provides an S3-compatible interface. For example, MinIO is an object-storage server that
emulates the S3 API. If you were to run minio server
locally with its default settings, you could connect to it with
arrow
using S3FileSystem
like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
or, as a URI, it would be
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
(note the URL escaping of the :
in
endpoint_override
).
Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.
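For example, a minimal sketch of round-tripping a data frame through the MinIO server defined above (test-bucket is a placeholder and is assumed to already exist on the server):
write_parquet(mtcars, minio$path("test-bucket/mtcars.parquet"))
df <- read_parquet(minio$path("test-bucket/mtcars.parquet"))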
As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:
Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name
You can unset these environment variables using
Sys.unsetenv()
, for example:
Sys.unsetenv("AWS_DEFAULT_REGION")
Sys.unsetenv("AWS_S3_ENDPOINT")
By default, the AWS SDK tries to retrieve metadata about user
configuration, which can cause conflicts when passing in connection
details via URI (for example when accessing a MinIO bucket). To disable
this metadata lookup, you can set the environment variable
AWS_EC2_METADATA_DISABLED
to TRUE
.
Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)