The purpose of this vignette is to demonstrate how custom predictor
or feature lags can be created for forecast model inputs in forecastML
with the forecastML::create_lagged_df() function. The rationale
behind creating custom feature lags is to improve model accuracy by
removing noisy or redundant features in high dimensional training data.
Keeping only those feature lags that show high autocorrelation or
cross-correlation with the modeled outcome (e.g., 3 and 12 months for
monthly data) is a good place to start.
library(forecastML)
library(DT)
data("data_seatbelts", package = "forecastML")
data <- data_seatbelts
data <- data[, c("DriversKilled", "kms", "PetrolPrice", "law")]
DT::datatable(head(data, 5))
Dates are optional for forecasting with non-grouped data, but we’ll add a date column here to illustrate the functionality.
The dataset does not come with a date column, but the data was
collected monthly from 1969 through 1984. This works out nicely
because dates are passed in a separate argument, dates, in
create_lagged_df().
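The create_lagged_df() calls in this vignette refer to two objects, dates and date_frequency, that are not defined in the surrounding text. A minimal sketch of how they could be built, assuming the first observation is dated 1969-01-01 and the series runs monthly through 1984-12-01 (the exact day of the month is an assumption; only the year range and monthly spacing come from the text):

```r
# Assumption: one observation per month, January 1969 through December 1984.
date_frequency <- "1 month"  # Same string format as the 'by' argument of seq().

# A Date vector with one entry per row of the dataset.
dates <- seq(as.Date("1969-01-01"), as.Date("1984-12-01"), by = date_frequency)

length(dates)  # 16 years x 12 months = 192
```

The length of dates should match the number of rows in data; comparing length(dates) with nrow(data) is a quick sanity check before creating the lagged datasets.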
Custom feature lags are specified with the lookback_control argument in
create_lagged_df(). We'll also set law as a dynamic feature which
won't be lagged.
horizons <- c(1, 6, 12) # forecasting 1, 1:6, and 1:12 months into the future.
# Create a list of length 3, one slot for each modeled forecast horizon.
lookback_control <- vector("list", length(horizons))
# Within each horizon-specific list, we'll identify the custom feature lags.
lookback_control <- lapply(lookback_control, function(x) {
  list(
    c(3, 12), # column 1: DriversKilled
    1:3,      # column 2: kms
    1:12,     # column 3: PetrolPrice
    0         # column 4: law; this could be any value, dynamic features are set to '0' internally.
  )
})
data_train <- forecastML::create_lagged_df(data, type = "train",
                                           outcome_col = 1,
                                           horizons = horizons,
                                           lookback_control = lookback_control,
                                           dates = dates,
                                           frequency = date_frequency,
                                           dynamic_features = "law")
Below is a series of feature-level plots of the resulting lagged
data.frame features for each forecast horizon in data_train.
Notice, for instance, how the 1:3 month lags for kms were dropped from
the 6- and 12-month-out forecast modeling datasets, as these lags don’t
support direct forecasting at these time horizons.
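The reason short lags drop out can be sketched in a few lines of base R. For a direct forecast h steps ahead, a feature value lagged by fewer than h steps is not yet observed when the forecast is made, so only lags of at least h remain usable. The helper below, usable_lags(), is hypothetical and written only to illustrate that rule; it is not a forecastML function and makes no claim about how the package implements the filtering internally.

```r
# Hypothetical helper for illustration; not part of forecastML.
# Keeps only the lags that are at least as long as the forecast horizon.
usable_lags <- function(lags, horizon) {
  lags[lags >= horizon]
}

horizons <- c(1, 6, 12)
kms_lags <- 1:3

# One element per forecast horizon; an empty vector means no lag survives.
kms_by_horizon <- lapply(horizons, function(h) usable_lags(kms_lags, h))
names(kms_by_horizon) <- paste0("horizon_", horizons)
str(kms_by_horizon)
```

Under this rule all three kms lags survive at the 1-month horizon and none survive at 6 or 12 months, while the c(3, 12) lags for DriversKilled reduce to the 12-month lag alone at the longer horizons.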
Now, let’s say that a lag of 12 months for PetrolPrice is a poor
predictor for our long-term, 12-month-out forecast model. We can remove
it by assigning a NULL value in the appropriate slot in our
lookback_control argument.
Notice that the NULL has to be placed in a list() to avoid removing the
list slot altogether.
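The difference between the two kinds of NULL assignment is plain base R behavior and can be checked on a small list. The list x below is a made-up stand-in for one horizon's slot in lookback_control, with one element per data column:

```r
# A stand-in for one horizon's feature lags: one element per column.
x <- list(1:3, 4:6, 7:9)

# Single brackets with list(NULL): the slot is kept and holds NULL.
keep_slot <- x
keep_slot[2] <- list(NULL)
length(keep_slot)          # 3
is.null(keep_slot[[2]])    # TRUE

# Double brackets with NULL: the slot is deleted and later elements shift up.
drop_slot <- x
drop_slot[[2]] <- NULL
length(drop_slot)          # 2
```

Because the lags in lookback_control are matched to columns by position, deleting a slot would shift the remaining lags onto the wrong columns, which is why the list(NULL) form is the one to use here.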
horizons <- c(1, 6, 12) # forecasting 1, 1:6, and 1:12 months into the future.
# A list of length 3, one slot for each modeled forecast horizon.
lookback_control <- vector("list", length(horizons))
lookback_control <- lapply(lookback_control, function(x) {
  # 12 feature lags for each of our 4 modeled features. Dynamic features will be coerced to "0" internally.
  lapply(1:4, function(x) {1:12})
})
# Find the column index of the feature that we're removing.
remove_col <- which(grepl("PetrolPrice", names(data)))
# Remove the feature from the 12-month-out lagged data.frame.
lookback_control[[which(horizons == 12)]][remove_col] <- list(NULL)
data_train <- forecastML::create_lagged_df(data, type = "train",
                                           outcome_col = 1,
                                           lookback_control = lookback_control,
                                           horizons = horizons,
                                           dates = dates,
                                           frequency = date_frequency,
                                           dynamic_features = "law")
PetrolPrice is not a feature in our 12-month-out forecast model
training data set.