This part of the vignette follows dplyr's vignette at https://dplyr.tidyverse.org/articles/dplyr.html. We'll try to reproduce all of its results. First, load the needed packages.
library(tidyfst)
library(nycflights13)
library(data.table)
data.table(flights)
## Filter rows with filter_dt()
filter_dt(flights, month == 1 & day == 1)
Note that a comma cannot be used to separate conditions; combine them with & instead. This means that
filter_dt(flights, month == 1,day == 1)
would return an error.
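As a small illustration of the point above, here is a sketch on toy data (the values are made up and are not taken from nycflights13). It joins both conditions with & in a single expression and shows the data.table form of the same filter alongside it.

```r
library(data.table)
library(tidyfst)

# Toy data with hypothetical values, only for illustration
toy <- data.table(
  month = c(1, 1, 2),
  day = c(1, 2, 1),
  dep_delay = c(5, -3, 12)
)

# Both conditions go into one logical expression joined by `&`
jan_first <- filter_dt(toy, month == 1 & day == 1)

# The same filter written directly in data.table
jan_first_dt <- toy[month == 1 & day == 1]

print(jan_first)
```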
## Arrange rows with arrange_dt()
arrange_dt(flights, year, month, day)
Use - (the minus symbol) to order a column in descending order:
arrange_dt(flights, -arr_delay)
## Select columns with select_dt()
select_dt(flights, year, month, day)
Note that select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. However, select_dt() can select columns with a regular expression, which means you can write:
select_dt(flights, "^dep")
Renaming works almost the same way as in dplyr:
select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)
## Add new columns with mutate_dt()
mutate_dt(flights,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
However, if a new column refers to another column created in the same call, split the call in two. The following code would not work:
mutate_dt(flights,
gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60)
)
Instead, use:
mutate_dt(flights,gain = arr_delay - dep_delay) %>%
mutate_dt(gain_per_hour = gain / (air_time / 60))
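The reason for splitting can be sketched in plain data.table, under the assumption that mutate_dt() evaluates all of its expressions against the existing columns, as data.table's `:=` does. The toy values below are made up. In the one-step form, gain does not exist yet when gain_per_hour is evaluated; chaining two steps avoids the problem.

```r
library(data.table)

# Toy data with hypothetical values, only for illustration
toy <- data.table(
  arr_delay = c(20, -5),
  dep_delay = c(10, 5),
  air_time = c(120, 60)
)

# One step: `gain` is not yet a column when the second expression is evaluated,
# so the call fails; tryCatch() captures the error instead of stopping
one_step <- tryCatch(
  copy(toy)[, `:=`(
    gain = arr_delay - dep_delay,
    gain_per_hour = gain / (air_time / 60)
  )],
  error = function(e) e
)
inherits(one_step, "error")

# Two chained steps: `gain` exists before `gain_per_hour` is computed
two_step <- copy(toy)[, gain := arr_delay - dep_delay][
  , gain_per_hour := gain / (air_time / 60)]

print(two_step)
```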
If you only want to keep the new variables, use transmute_dt():
transmute_dt(flights,
gain = arr_delay - dep_delay
)
## Summarise values with summarise_dt()
summarise_dt(flights,
delay = mean(dep_delay, na.rm = TRUE)
)
## Randomly sample rows with sample_n_dt() and sample_frac_dt()
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
For the following dplyr code:
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
We could get it via:
flights %>%
  summarise_dt(count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),by = tailnum) %>%
  filter_dt(count > 20 & dist < 2000)
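For comparison, the same grouped summary followed by the filtering step can be sketched directly in data.table. The toy table and the threshold count > 1 are hypothetical and chosen only so that the result is easy to check by hand.

```r
library(data.table)

# Toy data with hypothetical values, only for illustration
toy <- data.table(
  tailnum = c("A", "A", "A", "B", "B", "C"),
  distance = c(100, 200, 300, 2500, 2700, 400),
  arr_delay = c(10, NA, 20, 5, 15, -2)
)

# Summarise per plane, then keep planes seen more than once with short mean distance
delay <- toy[, .(
  count = .N,
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
), by = tailnum][count > 1 & dist < 2000]

print(delay)
```

Only plane "A" survives here: "B" has a mean distance above 2000 and "C" appears only once.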
summarise_dt (or summarize_dt) has a "by" parameter with which you can specify the groups. For example, we can find the number of planes and the number of flights that go to each possible destination:
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
# planes = n_distinct(tailnum),
# flights = n()
# )
summarise_dt(flights,planes = uniqueN(tailnum),flights = .N,by = dest) %>%
arrange_dt(dest)
If you need to group by many variables, use:
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day <- summarise(daily, flights = n()))
flights %>%
  summarise_dt(by = .(year,month,day),flights = .N)
# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year,month,day),flights = .N) %>%
  summarise_dt(by = .(year,month),flights = sum(flights))
# (per_year <- summarise(per_month, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year,month,day),flights = .N) %>%
  summarise_dt(by = .(year),flights = sum(flights))
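The roll-up above can also be written by storing each intermediate result once and reusing it, which mirrors the per_day, per_month and per_year objects of the dplyr version. This is only a sketch on made-up toy data; the variable names follow the dplyr comments above.

```r
library(data.table)
library(tidyfst)

# Toy data with hypothetical values: four flights in 2013
toy <- data.table(
  year = 2013,
  month = c(1, 1, 1, 2),
  day = c(1, 1, 2, 1)
)

# Count flights per day, then roll the counts up one level at a time
per_day <- toy %>% summarise_dt(by = .(year, month, day), flights = .N)
per_month <- per_day %>% summarise_dt(by = .(year, month), flights = sum(flights))
per_year <- per_month %>% summarise_dt(by = .(year), flights = sum(flights))

print(per_year)
```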
tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the analogous data.table code. Nevertheless, it lets dplyr users gain computational performance quickly and guides them to learn more about data.table for speed. Below, we compare the syntax of tidyfst and data.table (following the "Introduction to data.table" vignette). This shows how the two differ and lets users choose the one they prefer. Ideally, tidyfst will lead even more users to learn about data.table and its wonderful features, so as to design more extensions for tidyfst in the future.
Because we want a more stable data source, here we'll use the flights data from the nycflights13 package used above.
library(tidyfst)
library(data.table)
library(nycflights13)
flights = data.table(flights) %>% na.omit()
# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)]

# tidyfst
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin,-dest)
# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)
# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
  .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  nrow()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  count_dt()
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(.N)
The examples above show that data.table's special symbols, such as .N, can still be used in tidyfst.
# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year, month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay,-dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
  .(mean(arr_delay), mean(dep_delay)),
  by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
    by = .(origin, dest, month))
Note that keyby is currently not used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt automatically sorts its result by the counted number; this can be controlled with the "sort" parameter.
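To make the ordering behaviour concrete, here is a minimal data.table sketch on made-up toy data. It contrasts by (groups appear in order of first appearance), keyby (groups are sorted by the grouping column) and an explicit sort by count, which is roughly what a count sorted by frequency looks like.

```r
library(data.table)

# Toy data with hypothetical values, only for illustration
toy <- data.table(origin = c("LGA", "JFK", "LGA", "EWR", "LGA", "JFK"))

# `by` keeps groups in order of first appearance
by_first_seen <- toy[, .N, by = origin]

# `keyby` sorts the result by the grouping column
by_key <- toy[, .N, keyby = origin]

# Sorting by the count in descending order
by_count <- toy[, .N, by = origin][order(-N)]

print(by_first_seen)
print(by_key)
print(by_count)
```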
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin,dest,sort = FALSE) %>%
  arrange_dt(origin,-dest)
flights %>%
  summarise_dt(.N,by = .(dep_delay>0, arr_delay>0))
Now let’s try a more complex example:
# data.table
flights[carrier == "AA",
  lapply(.SD, mean),
  by = .(origin, dest, month),
  .SDcols = c("arr_delay", "dep_delay")]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean)
  )
Let me explain what happens here, especially in group_dt. First, the rows are filtered by the condition carrier == "AA". Then the data are grouped by three variables: origin, dest and month. Last, every column with "_delay" in its name is summarised by its mean. This design builds on .SD in data.table and extends the group_by function in dplyr: you never need to ungroup, because the grouped operations are simply placed inside group_dt. You can also use pipes inside the group_dt function. Let's play with it a little further:
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean) %>%
      mutate_dt(sum = dep_delay + arr_delay)
  )
However, I don't recommend doing this unless the extra step actually needs to be computed by group; otherwise, just start another pipe after group_dt. Now let's end with a simple example:
# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>%
  group_dt(by = month,head(2))
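The grouped examples in this section rest on data.table's .SD, the subset of data for each group. As a rough sketch of that mechanism on made-up toy data, the code below selects the "_delay" columns by pattern with base R's grep(), averages them per group, adds their sum in a separate step after the grouped step, and finally takes the first row of each group.

```r
library(data.table)

# Toy data with hypothetical values, only for illustration
toy <- data.table(
  origin = c("JFK", "JFK", "LGA"),
  arr_delay = c(10, 20, 5),
  dep_delay = c(0, 4, 1),
  distance = c(100, 200, 300)
)

# Pick the columns whose names contain "_delay"
delay_cols <- grep("_delay", names(toy), value = TRUE)

# Mean of each selected column within each group; .SD holds only delay_cols
means <- toy[, lapply(.SD, mean), by = origin, .SDcols = delay_cols]

# The row-wise sum does not need grouping, so it is added afterwards
means[, sum := dep_delay + arr_delay]

# First row of every group, again through .SD
first_rows <- toy[, head(.SD, 1), by = origin]

print(means)
print(first_rows)
```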
Deep down, tidyfst is born from dplyr and data.table, and uses stringr to build flexible APIs, so as to bring their strengths into full play.