The Apache Arrow C++ library provides rich, powerful features for
working with columnar data. The arrow
R package provides
both a low-level interface to the C++ library and some higher-level,
R-flavored tools for working with it. This vignette provides an overview
of how the pieces fit together, and it describes the conventions that
the classes and methods follow in R.
The arrow
package lets you work efficiently with large,
multi-file datasets using dplyr
methods. See
vignette("dataset", package = "arrow")
for an overview.
arrow
provides some simple functions for using the Arrow
C++ library to read and write files. These functions are designed to
drop into your normal R workflow without requiring any knowledge of the
Arrow C++ library and use naming conventions and arguments that follow
popular R packages, particularly readr
. The readers return
data.frame
s (or if you use the tibble
package,
they will act like tbl_df
s), and the writers take
data.frame
s.
Importantly, arrow
provides basic read and write support
for the Apache Parquet
columnar data file format.
library(arrow)
<- read_parquet("path/to/file.parquet") df
Just as you can read, you can write Parquet files:
write_parquet(df, "path/to/different_file.parquet")
The arrow
package also includes a faster and more robust
implementation of the Feather file format,
providing read_feather()
and write_feather()
.
This implementation depends on the same underlying C++ library as the
Python version does, resulting in more reliable and consistent behavior
across the two languages, as well as improved
performance. arrow
also by default writes the Feather
V2 format (the
Arrow IPC file format), which supports a wider range of data types,
as well as compression.
For CSV and line-delimited JSON, there are
read_csv_arrow()
and read_json_arrow()
,
respectively. While read_csv_arrow()
currently has fewer
parsing options for dealing with every CSV format variation in the wild,
for the files it can read, it is often significantly faster than other R
CSV readers, such as base::read.csv
,
readr::read_csv
, and data.table::fread
.
Using reticulate
,
arrow
lets you share data between R and Python
(pyarrow
) efficiently, enabling you to take advantage of
the vibrant ecosystem of Python packages that build on top of Apache
Arrow. See vignette("python", package = "arrow")
for
details.
The arrow
package also provides many lower-level
bindings to the C++ library, which enable you to access and manipulate
Arrow objects. You can use these to build connectors to other
applications and services that use Arrow. One example is Spark: the sparklyr
package has
support for using Arrow to move data to and from Spark, yielding significant
performance gains.
Arrow defines the following classes for representing metadata:
Class | Description | How to create an instance |
---|---|---|
DataType |
attribute controlling how values are represented | functions in help("data-type") |
Field |
a character string name and a DataType |
field(name, type) |
Schema |
list of Field s |
schema(...) |
Arrow defines the following classes for representing zero-dimensional (scalar), one-dimensional (array/vector-like), and two-dimensional (tabular/data frame-like) data:
Dim | Class | Description | How to create an instance |
---|---|---|---|
0 | Scalar |
single value and its DataType |
Scalar$create(value, type) |
1 | Array |
vector of values and its DataType |
Array$create(vector, type) |
1 | ChunkedArray |
vectors of values and their DataType |
ChunkedArray$create(..., type) or alias
chunked_array(..., type) |
2 | RecordBatch |
list of Array s with a Schema |
RecordBatch$create(...) or alias
record_batch(...) |
2 | Table |
list of ChunkedArray with a Schema |
Table$create(...) , alias arrow_table(...) ,
or arrow::read_*(file, as_data_frame = FALSE) |
2 | Dataset |
list of Table s with the same Schema |
Dataset$create(sources, schema) or alias
open_dataset(sources, schema) |
Each of these is defined as an R6
class in the
arrow
R package and corresponds to a class of the same name
in the Arrow C++ library. The arrow
package provides a
variety of R6
and S3 methods for interacting with instances
of these classes.
For convenience, the arrow
package also defines several
synthetic classes that do not exist in the C++ library, including:
ArrowDatum
: inherited by Scalar
,
Array
, and ChunkedArray
ArrowTabular
: inherited by RecordBatch
and
Table
ArrowObject
: inherited by all Arrow objectsArrow has a rich data type system that includes direct parallels with R’s data types and much more.
In the tables, entries with a -
are not currently
implemented.
R type | Arrow type |
---|---|
logical | boolean |
integer | int32 |
double (“numeric”) | float641 |
character | utf82 |
factor | dictionary |
raw | uint8 |
Date | date32 |
POSIXct | timestamp |
POSIXlt | struct |
data.frame | struct |
list3 | list |
bit64::integer64 | int64 |
hms::hms | time32 |
difftime | duration |
vctrs::vctrs_unspecified | null |
1: float64
and double
are the
same concept and data type in Arrow C++; however, only
float64()
is used in arrow as the function
double()
already exists in base R
2: If the character vector exceeds 2GB of strings, it will
be converted to a large_utf8
Arrow type
3: Only lists where all elements are the same type are able to be translated to Arrow list type (which is a “list of” some type).
Arrow type | R type |
---|---|
boolean | logical |
int8 | integer |
int16 | integer |
int32 | integer |
int64 | integer1 |
uint8 | integer |
uint16 | integer |
uint32 | integer1 |
uint64 | integer1 |
float16 | -2 |
float32 | double |
float64 | double |
utf8 | character |
large_utf8 | character |
binary | arrow_binary 3 |
large_binary | arrow_large_binary 3 |
fixed_size_binary | arrow_fixed_size_binary 3 |
date32 | Date |
date64 | POSIXct |
time32 | hms::hms |
time64 | hms::hms |
timestamp | POSIXct |
duration | difftime |
decimal | double |
dictionary | factor4 |
list | arrow_list 5 |
large_list | arrow_large_list 5 |
fixed_size_list | arrow_fixed_size_list 5 |
struct | data.frame |
null | vctrs::vctrs_unspecified |
map | arrow_list 5 |
union | -2 |
1: These integer types may contain values that exceed the
range of R’s integer
type (32-bit signed integer). When
they do, uint32
and uint64
are converted to
double
(“numeric”) and int64
is converted to
bit64::integer64
. This conversion can be disabled (so that
int64
always yields a bit64::integer64
vector)
by setting options(arrow.int64_downcast = FALSE)
.
2: Some Arrow data types do not currently have an R equivalent and will raise an error if cast to or mapped to via a schema.
3: arrow*_binary
classes are implemented as
lists of raw vectors.
4: Due to the limitation of R factors, Arrow
dictionary
values are coerced to string when translated to
R if they are not already strings.
5: arrow*_list
classes are implemented as
subclasses of vctrs_list_of
with a ptype
attribute set to what an empty Array of the value type converts to.
Arrow supports custom key-value metadata attached to Schemas. When we
convert a data.frame
to an Arrow Table or RecordBatch, the
package stores any attributes()
attached to the columns of
the data.frame
in the Arrow object’s Schema. These
attributes are stored under the “r” key; you can assign additional
string metadata under any other key you wish, like
x$metadata$new_key <- "new value"
.
This metadata is preserved when writing the table to Feather or
Parquet, and when reading those files into R, or when calling
as.data.frame()
on a Table/RecordBatch, the column
attributes are restored to the columns of the resulting
data.frame
. This means that custom data types, including
haven::labelled
, vctrs
annotations, and
others, are preserved when doing a round-trip through Arrow.
Note that the attributes()
stored in
$metadata$r
are only understood by R. If you write a
data.frame
with haven
columns to a Feather
file and read that in Pandas, the haven
metadata won’t be
recognized there. (Similarly, Pandas writes its own custom metadata,
which the R package does not consume.) You are free, however, to define
custom metadata conventions for your application and assign any (string)
values you want to other metadata keys. For more details, see the
documentation for schema()
.
C++ is an object-oriented language, so the core logic of the Arrow
library is encapsulated in classes and methods. In the R package, these
classes are implemented as R6
classes, most of which
are exported from the namespace.
In order to match the C++ naming conventions, the R6
classes are in TitleCase, e.g. RecordBatch
. This makes it
easy to look up the relevant C++ implementations in the code or documentation. To simplify
things in R, the C++ library namespaces are generally dropped or
flattened; that is, where the C++ library has
arrow::io::FileOutputStream
, it is just
FileOutputStream
in the R package. One exception is for the
file readers, where the namespace is necessary to disambiguate. So
arrow::csv::TableReader
becomes
CsvTableReader
, and arrow::json::TableReader
becomes JsonTableReader
.
Some of these classes are not meant to be instantiated directly; they
may be base classes or other kinds of helpers. For those that you should
be able to create, use the $create()
method to instantiate
an object. For example,
rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))
will create a RecordBatch
. Many of these factory methods
that an R user might most often encounter also have a
snake_case
alias, in order to be more familiar for
contemporary R users. So
record_batch(int = 1:10, dbl = as.numeric(1:10))
would do
the same as RecordBatch$create()
above.
The typical user of the arrow
R package may never deal
directly with the R6
objects. We provide more R-friendly
wrapper functions as a higher-level interface to the C++ library. An R
user can call read_parquet()
without knowing or caring that
they’re instantiating a ParquetFileReader
object and
calling the $ReadFile()
method on it. The classes are there
and available to the advanced programmer who wants fine-grained control
over how the C++ library is used.