Time series data wrangling is an essential skill for any forecaster. timetk
includes the essential data wrangling tools. In this tutorial, we’ll cover:

- summarise_by_time() - for time-based aggregations
- filter_by_time() - for quickly filtering a continuous time range
- pad_by_time() - for filling in gaps and going from low to high frequency
- slidify() - for turning any function into a sliding (rolling) window function

Additional concepts covered:

- The %+time% infix operation (see Padding Data: Low to High Frequency)
- plot_time_series() for all visualizations

Load the following libraries.
library(tidyverse)
library(tidyquant)
library(timetk)
This tutorial will use the FANG
dataset:
FANG
## # A tibble: 4,032 x 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FB 2013-01-02 27.4 28.2 27.4 28 69846400 28
## 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600 27.8
## 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400 28.8
## 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800 29.4
## 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300 29.1
## 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700 30.6
## 7 FB 2013-01-10 30.6 31.5 30.3 31.3 95316400 31.3
## 8 FB 2013-01-11 31.3 32.0 31.1 31.7 89598000 31.7
## 9 FB 2013-01-14 32.1 32.2 30.6 31.0 98892800 31.0
## 10 FB 2013-01-15 30.6 31.7 29.9 30.1 173242600 30.1
## # ... with 4,022 more rows
The adjusted column contains the adjusted closing prices for each day.
FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
The volume column contains the trade volume (the number of shares traded) for the day.
FANG %>%
  group_by(symbol) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE)
summarise_by_time() aggregates by a period. It’s great for aggregation functions like SUM() and period-smoothing functions like AVERAGE(), FIRST(), and LAST().

Objective: Get the total trade volume by quarter

- Use SUM() to aggregate the volume column.
- Use .by = "quarter" to collapse the daily data to quarters.
FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "quarter",
    volume = SUM(volume)
  ) %>%
  plot_time_series(date, volume, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)
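Since summarise_by_time() behaves like a time-aware summarise(), multiple named aggregations can be combined in one call. A minimal sketch (the column names here are illustrative):

FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "quarter",
    volume_total = SUM(volume),       # total traded volume per quarter
    adjusted_avg = AVERAGE(adjusted)  # mean adjusted price per quarter
  )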
Objective: Get the first value in each month

- Use FIRST() to get the first value, which has the effect of reducing the data (i.e. smoothing). We could use AVERAGE() or MEDIAN() instead.
- Use .by = "month" to aggregate by month.
FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "month",
    adjusted = FIRST(adjusted)
  ) %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
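The same pattern works with the other smoothing helpers. For example, a minimal sketch using MEDIAN() instead of FIRST():

FANG %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "month",
    adjusted = MEDIAN(adjusted)  # monthly median instead of first value
  )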
Used to quickly filter a continuous time range.

Objective: Get the adjusted stock prices from September 2013 through the end of 2013.

- .start_date = "2013-09": Converts to "2013-09-01".
- .end_date = "2013": Converts to "2013-12-31".
- A more advanced example of filtering using %+time% and %-time% is shown in “Padding Data: Low to High Frequency”.
FANG %>%
  group_by(symbol) %>%
  filter_by_time(date, "2013-09", "2013") %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
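The %+time% (“add time”) and %-time% (“subtract time”) infix operators mentioned above shift a date by a period string. A minimal sketch of the assumed behavior on a bare Date:

as.Date("2013-09-01") %+time% "1 month"  # expect 2013-10-01
as.Date("2013-09-01") %-time% "2 weeks"  # expect 2013-08-18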
Used to fill in (pad) gaps and to go from low frequency to high frequency. This function uses the awesome padr
library for filling and expanding timestamps.
Objective: Make an irregular series regular.

- Padded rows are filled with NA.
- We can fill with a value using .pad_value, or we can impute using a function like ts_impute_vec() (shown next).
FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "auto") # Guesses .by = "day"
## pad applied on the interval: day
## # A tibble: 5,836 x 8
## # Groups: symbol [4]
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AMZN 2013-01-02 256. 258. 253. 257. 3271000 257.
## 2 AMZN 2013-01-03 257. 261. 256. 258. 2750900 258.
## 3 AMZN 2013-01-04 258. 260. 257. 259. 1874200 259.
## 4 AMZN 2013-01-05 NA NA NA NA NA NA
## 5 AMZN 2013-01-06 NA NA NA NA NA NA
## 6 AMZN 2013-01-07 263. 270. 263. 268. 4910000 268.
## 7 AMZN 2013-01-08 267. 269. 264. 266. 3010700 266.
## 8 AMZN 2013-01-09 268. 270. 265. 266. 2265600 266.
## 9 AMZN 2013-01-10 269. 269. 262. 265. 2863400 265.
## 10 AMZN 2013-01-11 265. 268. 264. 268. 2413300 268.
## # ... with 5,826 more rows
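If NA padding is not desired, the .pad_value argument mentioned above can fill the new rows with a constant instead. A minimal sketch (filling with 0 is illustrative only; it is rarely right for prices):

FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "day", .pad_value = 0)  # padded rows get 0 instead of NA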
Objective: Go from Daily to Hourly timestamp intervals for 1 month from the start date. Impute the missing values.

- .by = "hour" pads from daily to hourly.
- We impute with ts_impute_vec(), which performs linear interpolation when period = 1.
- FIRST(date) %+time% "1 month": Selects the first date in the sequence, then uses a special infix operation, %+time%, called “add time”. In this case I add “1 month”.
FANG %>%
  group_by(symbol) %>%
  pad_by_time(date, .by = "hour") %>%
  mutate_at(vars(open:adjusted), .funs = ts_impute_vec, period = 1) %>%
  filter_by_time(date, "start", FIRST(date) %+time% "1 month") %>%
  plot_time_series(date, adjusted, .facet_ncol = 2, .interactive = FALSE)
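For intuition on the imputation step, ts_impute_vec() also works on a bare numeric vector. A minimal sketch, assuming period = 1 triggers linear interpolation as described above:

x <- c(1, 2, NA, NA, 5)
ts_impute_vec(x, period = 1)  # expect roughly 1, 2, 3, 4, 5 via linear interpolation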
We have a new function, slidify(), that turns any function into a sliding (rolling) window function. It takes concepts from tibbletime::rollify() and improves them with the R package slider.

Objective: Calculate a “centered” simple rolling average with partial window rolling at the start and end windows.
slidify() turns the AVERAGE() function into a rolling average.

# Make the rolling function
roll_avg_30 <- slidify(.f = AVERAGE, .period = 30, .align = "center", .partial = TRUE)
# Apply the rolling function
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply the sliding function
  mutate(rolling_avg_30 = roll_avg_30(adjusted)) %>%
  pivot_longer(cols = c(adjusted, rolling_avg_30)) %>%
  plot_time_series(date, value, .color_var = name,
                   .facet_ncol = 2, .smooth = FALSE,
                   .interactive = FALSE)
For simple rolling calculations (e.g. a rolling average), we can accomplish this operation faster with slidify_vec() - a vectorized rolling function for simple summary rolls (e.g. mean(), sd(), sum(), etc.).
FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  # Apply the vectorized rolling function
  mutate(rolling_avg_30 = slidify_vec(adjusted, ~ AVERAGE(.),
                                      .period = 30, .partial = TRUE))
## # A tibble: 4,032 x 4
## # Groups: symbol [4]
## symbol date adjusted rolling_avg_30
## <chr> <date> <dbl> <dbl>
## 1 FB 2013-01-02 28 30.0
## 2 FB 2013-01-03 27.8 30.1
## 3 FB 2013-01-04 28.8 30.2
## 4 FB 2013-01-07 29.4 30.2
## 5 FB 2013-01-08 29.1 30.3
## 6 FB 2013-01-09 30.6 30.3
## 7 FB 2013-01-10 31.3 30.3
## 8 FB 2013-01-11 31.7 30.2
## 9 FB 2013-01-14 31.0 30.1
## 10 FB 2013-01-15 30.1 30.1
## # ... with 4,022 more rows
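slidify_vec() is not limited to averages; any simple summary can roll. A hedged sketch of a 30-period rolling standard deviation, using the same lambda style as above:

FANG %>%
  select(symbol, date, adjusted) %>%
  group_by(symbol) %>%
  mutate(rolling_sd_30 = slidify_vec(adjusted, ~ sd(., na.rm = TRUE),
                                     .period = 30, .partial = TRUE))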
Objective: Calculate a rolling regression.

- slidify() is built for this.
- Use the purrr-style ..1, ..2, ..3, etc. notation to set up a function.

# Rolling regressions are easy to implement using `.unlist = FALSE`
lm_roll <- slidify(~ lm(..1 ~ ..2 + ..3), .period = 90,
                   .unlist = FALSE, .align = "right")
FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  # Apply rolling regression
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm))
## # A tibble: 3,676 x 6
## # Groups: symbol [4]
## symbol date adjusted volume numeric_date rolling_lm
## <chr> <date> <dbl> <dbl> <dbl> <list>
## 1 FB 2013-05-10 26.7 30847100 15835 <lm>
## 2 FB 2013-05-13 26.8 29068800 15838 <lm>
## 3 FB 2013-05-14 27.1 24930300 15839 <lm>
## 4 FB 2013-05-15 26.6 30299800 15840 <lm>
## 5 FB 2013-05-16 26.1 35499100 15841 <lm>
## 6 FB 2013-05-17 26.2 29462700 15842 <lm>
## 7 FB 2013-05-20 25.8 42402900 15845 <lm>
## 8 FB 2013-05-21 25.7 26261300 15846 <lm>
## 9 FB 2013-05-22 25.2 45314500 15847 <lm>
## 10 FB 2013-05-23 25.1 37663100 15848 <lm>
## # ... with 3,666 more rows
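Because rolling_lm is a list column of lm objects, standard purrr mapping can pull out quantities of interest. A minimal sketch extracting the slope on volume (the second coefficient, since volume is the first predictor in the formula above):

FANG %>%
  select(symbol, date, adjusted, volume) %>%
  group_by(symbol) %>%
  mutate(numeric_date = as.numeric(date)) %>%
  mutate(rolling_lm = lm_roll(adjusted, volume, numeric_date)) %>%
  filter(!is.na(rolling_lm)) %>%
  # coef() returns (Intercept), then the volume and numeric_date slopes
  mutate(volume_coef = purrr::map_dbl(rolling_lm, ~ coef(.x)[[2]]))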
My Talk on High-Performance Time Series Forecasting
Time series is changing. Businesses now need 10,000+ time series forecasts every day.
High-Performance Forecasting Systems will save companies MILLIONS of dollars. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).
I teach how to build a HPTSF System in my High-Performance Time Series Forecasting Course. If interested in learning Scalable High-Performance Forecasting Strategies then take my course. You will learn:

- Modeltime - 30+ models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- GluonTS (Competition Winners)