---
title: "Custom Feature Lags"
author: "Nickalus Redell"
date: "`r lubridate::today()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Custom Feature Lags}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
## Purpose
The purpose of this vignette is to demonstrate how custom predictor or feature
lags can be created for forecast model inputs in `forecastML` with the `forecastML::create_lagged_df()`
function. The rationale behind creating custom feature
lags is to improve model accuracy by removing noisy or redundant features in high dimensional training data.
Keeping only those feature lags that show high autocorrelation or cross-correlation with the modeled
outcome--e.g., 3 and 12 months for monthly data--is a good place to start.
```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 7.15, fig.height = 4)
```
## Load packages and data
```{r, warning = FALSE, message = FALSE}
library(forecastML)
library(DT)
data("data_seatbelts", package = "forecastML")
data <- data_seatbelts
data <- data[, c("DriversKilled", "kms", "PetrolPrice", "law")]
DT::datatable(head(data, 5))
```
## Create a Date Column
* Dates are optional for forecasting with non-grouped data, but we'll add a date column here to
illustrate the functionality.
* The dataset does not come with a date column, but the data was collected monthly from 1969
through 1984. This actually works out nicely because dates are passed in a separate argument,
`dates`, in `create_lagged_df()`.
```{r}
date_frequency <- "1 month"
dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = date_frequency)
```
## Custom Feature Lags
* We'll use custom lags for 3 of our 4 modeled features using the `lookback_control` argument
in `create_lagged_df()`.
+ Column 1: 3 and 12 months in the past.
+ Column 2: 1 through 3 months in the past.
+ Column 3: 1 through 12 months in the past.
+ Column 4: We'll treat `law` as a dynamic feature which won't be lagged.
* Although it's not required, our feature lags will be common across forecast
models--3 models that forecast (1) 1 month out, (2) 1:6 months out, and (3) 1:12 months out.
As we'll see below, feature lags that don't support direct forecasting to the given horizon are
silently dropped from the lagged data.frames.
```{r}
horizons <- c(1, 6, 12) # forecasting 1, 1:6, and 1:12 months into the future.
# Create a list of length 3, one slot for each modeled forecast horizon.
lookback_control <- vector("list", length(horizons))
# Within each horizon-specific list, we'll identify the custom feature lags.
lookback_control <- lapply(lookback_control, function(x) {
list(
c(3, 12), # column 1: DriversKilled
1:3, # column 2: kms
1:12, # column 3: PetrolPrice
0 # column 4: law; this could be any value, dynamic features are set to '0' internally.
)
})
data_train <- forecastML::create_lagged_df(data, type = "train",
outcome_col = 1,
horizons = horizons,
lookback_control = lookback_control,
dates = dates,
frequency = date_frequency,
dynamic_features = "law")
```
* Below is a series of feature-level plots of the resulting lagged data.frame features
for each forecast horizon in `data_train`.
* Notice, for instance, how the 1:3 month lags for `kms` were dropped from the 6- and 12-month-out forecast
modeling datasets as these lags don't support direct forecasting at these time horizons.
```{r, results = 'hide'}
plot(data_train)
```
## Removing Features
* Now, let's say that a lag of 12 months for `PetrolPrice` is a poor predictor for our long-term,
12-month-out forecast model. We can remove it by assigning a `NULL` value in the appropriate
slot in our `lookback_control` argument.
* Notice that the `NULL` has to be placed in a `list()` to avoid removing the list slot altogether.
```{r}
horizons <- c(1, 6, 12) # forecasting 1, 1:6, and 1:12 months into the future.
# A list of length 3, one slot for each modeled forecast horizon.
lookback_control <- vector("list", length(horizons))
lookback_control <- lapply(lookback_control, function(x) {
# 12 feature lags for each of our 4 modeled features. Dynamic features will be coerced to "0" internally.
lapply(1:4, function(x) {1:12})
})
# Find the column index of the feature that we're removing.
remove_col <- which(grepl("PetrolPrice", names(data)))
# Remove the feature from the 12-month-out lagged data.frame.
lookback_control[[which(horizons == 12)]][remove_col] <- list(NULL)
data_train <- forecastML::create_lagged_df(data, type = "train",
outcome_col = 1,
lookback_control = lookback_control,
horizons = horizons,
dates = dates,
frequency = date_frequency,
dynamic_features = "law")
```
* Inspecting the plot confirms that the 12-month-lagged feature for
`PetrolPrice` is not a feature in our 12-month-out forecast model training data set.
```{r, results = 'hide'}
plot(data_train)[[remove_col]] # we're selecting 1 of our 3 lagged feature-level plots.
```
* Inspecting the modeling data.frame confirms that the 12-month-lagged feature for
`PetrolPrice` is not a feature in our 12-month-out forecast model training data set.
```{r}
DT::datatable(head(data_train$horizon_12))
```
***