The fst package for R provides a fast, easy and flexible way to serialize data frames. Its main features are fast reading and writing of R data frames, strong file compression, and the ability to inspect a stored data frame without reading it into memory. Building on these features, tidyfst offers a workflow for manipulating data more efficiently. The core idea is that we rarely need the whole data set at once: we only need to select the parts we want and aggregate them into a summary that provides the target information.
tidyfst provides the following functions to facilitate this workflow:
import_fst: works like fst::read_fst, but always returns a data.table
export_fst: works like fst::write_fst, but always uses the largest compression factor (which yields the smallest file)
In such a workflow, you never need to read the whole data.frame into RAM. You select the target data, process it immediately and get the results in one step. You do not even have to read the data to learn the structure of the data.frame, because parse_fst (a wrapper for the fst function in the fst package) gives access to it directly from the file. Now let’s give it a try.
library(tidyfst)
# Generate a random data frame with 10 million rows and various column types
nr_of_rows <- 1e7
df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
# Write the fst file; make sure no file with the same name already exists in the working directory
export_fst(df, "fst_test.fst")
# remove all variables in the environment
rm(list = ls())
Now we want to inspect the structure of the stored data frame without loading it.
parse_fst("fst_test.fst") -> ft
ft
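Since parse_fst is described above as a wrapper for fst in the fst package, the object it returns is a reference to the file rather than the data itself. The following is a minimal sketch of that idea using the fst package directly on a small built-in data set. The temporary file path and the variable names are chosen here for illustration and are not part of the original example.

```r
library(fst)

# Hypothetical temporary file holding a small built-in data set
demo_path <- tempfile(fileext = ".fst")
write_fst(iris, demo_path, compress = 100)

# Create a reference to the file; the column data is not loaded at this point
demo_ft <- fst(demo_path)

# The structure is available from the file metadata
demo_dim <- dim(demo_ft)
demo_names <- names(demo_ft)

# Only the requested rows and columns are read from disk
demo_subset <- demo_ft[1:5, c("Sepal.Length", "Species")]

# Remove the temporary file
unlink(demo_path)
```

The same kind of indexing is what makes it possible to query a 10 million row file while holding only a small subset in memory.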
If we want to get the information in the Factor column, use:
ft %>%
  select_fst(Factor) %>%
  count_dt(Factor) -> factor_info
factor_info
If we want to calculate the mean of Integer by the group of Factor, use:
ft %>%
  select_fst(Integer, Factor) %>%
  summarise_dt(avg = mean(Integer), by = Factor) -> avg_info
avg_info
In this workflow, we only select, filter or slice the data we need, and get the results directly from the pipeline. We therefore read the minimum amount of data into RAM, release it once the computation is done, and keep only the results we want. This can save a lot of memory in exploratory analysis of large data sets. Last, let’s delete the output file:
# delete the output file
unlink("fst_test.fst")
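The workflow above mentions filtering and slicing as well as selecting. As a rough sketch of those two operations, the example below uses slice_fst and filter_fst from tidyfst on a small built-in data set instead of the 10 million row file. The temporary file path and the object names are assumptions made for this illustration.

```r
library(tidyfst)

# Hypothetical temporary file holding a small built-in data set
demo_path <- tempfile(fileext = ".fst")
export_fst(iris, demo_path)

# Reference to the file; no column data is loaded yet
demo_ft <- parse_fst(demo_path)

# Read only the first five rows
first_rows <- demo_ft %>% slice_fst(1:5)

# Read only the rows that satisfy a condition
long_sepals <- demo_ft %>% filter_fst(Sepal.Length > 7)

# Remove the temporary file
unlink(demo_path)
```

In both cases only the resulting subset is kept as an in-memory table, which is the point of the workflow.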
Since version 0.9.3, tidyfst also provides the function as_fst(), which turns any data.frame into an fst table and saves the data in a temporary file. This means that we might never have to keep the object in RAM (as long as it is a data.frame)!
A small example:
iris %>% as_fst() -> iris_fst
mtcars %>% as_fst() -> mtcars_fst
iris_fst
mtcars_fst
So when you have generated a fairly large data.frame and do not want it to occupy memory, just store it on disk with as_fst and read it back when needed.
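To illustrate reading the data back when needed, here is a small sketch that queries an fst table created by as_fst with the same functions used earlier. It uses the built-in iris data, and the object names are chosen only for this illustration.

```r
library(tidyfst)

# Store the data.frame in a temporary fst file and keep only a reference to it
iris_ref <- as_fst(iris)

# Pull a single column from disk and count its values
species_count <- iris_ref %>%
  select_fst(Species) %>%
  count_dt(Species)

species_count
```

Only the Species column is read from the temporary file here, so the full table does not need to be held in memory for this query.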