Building on data.frame serialization provided by fst, prt offers an interface for working with partitioned data.frames, saved as individual fst files.
You can install the development version of prt from GitHub by running
source("https://install-github.me/nbenn/prt")
Alternatively, if you have the remotes package available, the latest release can be installed by calling install_github() as
# install.packages("remotes")
remotes::install_github("nbenn/prt@*release")
Creating a prt object can be done either by calling new_prt() on a list of previously created fst files or by coercing a data.frame object to prt using as_prt().
tmp <- tempfile()
dir.create(tmp)
flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)
#> fstcore package v0.9.12
#> (OpenMP was not detected, using single threaded mode)
print(flights)
#> # A prt: 336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> …
#> 336,772 2013 9 30 NA 1455 NA NA
#> 336,773 2013 9 30 NA 2200 NA NA
#> 336,774 2013 9 30 NA 1210 NA NA
#> 336,775 2013 9 30 NA 1159 NA NA
#> 336,776 2013 9 30 NA 840 NA NA
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
In case a prt object is created from a data.frame, the specified number of files is written to the directory of choice (a newly created directory within tempdir() by default).
list.files(tmp)
#> [1] "1.fst" "2.fst"
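The other construction route mentioned above, calling new_prt() on existing fst files, might look roughly like the following. This is a minimal sketch, assuming that new_prt() accepts a character vector of file paths; the built-in mtcars data set, the directory name fst_dir and the two-way split are illustrative choices and not part of the original example.

```r
library(prt)

# Hypothetical scratch directory for this illustration.
fst_dir <- tempfile()
dir.create(fst_dir)

# Split a small built-in data set into two halves of 16 rows each.
halves <- split(mtcars, rep(1:2, each = 16))

# Write each half to its own fst file.
files <- file.path(fst_dir, paste0(seq_along(halves), ".fst"))
for (i in seq_along(halves)) {
  fst::write_fst(halves[[i]], files[i])
}

# Assumption: new_prt() takes the vector of fst file paths.
cars <- new_prt(files)
print(cars)
```

Each file becomes one partition, so the order of the paths determines the row order of the resulting object.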
Subsetting and printing are closely modeled after tibble, and behavior that deviates from that of tibble will most likely be considered a bug (please report). Some design choices that do set a prt object apart from a tibble include the use of data.tables for any result of a subsetting operation and the complete disregard for row.names.
In addition to standard subsetting operations involving the functions `[`(), `[[`() and `$`(), the base generic function subset() is implemented for the prt class, enabling subsetting operations using non-standard evaluation. Combined with random access to tables stored as fst files, this can make data access more efficient in cases where only a subset of the data is of interest.
jan <- flights[flights$month == 1, ]
identical(jan, subset(flights, month == 1))
#> [1] TRUE
print(jan)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1: 2013 1 1 517 515 2 830 819
#> 2: 2013 1 1 533 529 4 850 830
#> 3: 2013 1 1 542 540 2 923 850
#> 4: 2013 1 1 544 545 -1 1004 1022
#> 5: 2013 1 1 554 600 -6 812 837
#> ---
#> 27000: 2013 1 31 NA 1325 NA NA 1505
#> 27001: 2013 1 31 NA 1200 NA NA 1430
#> 27002: 2013 1 31 NA 1410 NA NA 1555
#> 27003: 2013 1 31 NA 1446 NA NA 1757
#> 27004: 2013 1 31 NA 625 NA NA 934
#> arr_delay carrier flight tailnum origin dest air_time distance hour
#> 1: 11 UA 1545 N14228 EWR IAH 227 1400 5
#> 2: 20 UA 1714 N24211 LGA IAH 227 1416 5
#> 3: 33 AA 1141 N619AA JFK MIA 160 1089 5
#> 4: -18 B6 725 N804JB JFK BQN 183 1576 5
#> 5: -25 DL 461 N668DN LGA ATL 116 762 6
#> ---
#> 27000: NA MQ 4475 N730MQ LGA RDU NA 431 13
#> 27001: NA MQ 4658 N505MQ LGA ATL NA 762 12
#> 27002: NA MQ 4491 N734MQ LGA CLE NA 419 14
#> 27003: NA UA 337 <NA> LGA IAH NA 1416 14
#> 27004: NA UA 1497 <NA> LGA IAH NA 1416 6
#> minute time_hour
#> 1: 15 2013-01-01 05:00:00
#> 2: 29 2013-01-01 05:00:00
#> 3: 40 2013-01-01 05:00:00
#> 4: 45 2013-01-01 05:00:00
#> 5: 0 2013-01-01 06:00:00
#> ---
#> 27000: 25 2013-01-31 13:00:00
#> 27001: 0 2013-01-31 12:00:00
#> 27002: 10 2013-01-31 14:00:00
#> 27003: 46 2013-01-31 14:00:00
#> 27004: 25 2013-01-31 06:00:00
A subsetting operation on a prt object yields a data.table. If the full table is of interest, a prt-specific implementation of the as.data.table() generic is available.
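As a rough illustration of both points, the sketch below builds a small prt object from the built-in mtcars data set (a stand-in chosen for brevity, not part of the original example), checks the class of a subsetting result and then reads the full table. The object names cars, heavy and full are hypothetical.

```r
library(prt)

# Small stand-in data set; by default as_prt() writes the chunks to a
# newly created directory within tempdir().
cars <- as_prt(mtcars, n_chunks = 2L)

# A subsetting operation returns a data.table with only the matching rows.
heavy <- subset(cars, wt > 4)
data.table::is.data.table(heavy)

# Reading the complete table into memory as a data.table.
full <- data.table::as.data.table(cars)
dim(full)
```

For data that fits comfortably in memory the second call is the simpler option; subsetting is the one to reach for when only part of a large table is needed.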
unlink(tmp, recursive = TRUE)