Data Preprocessing Routines — DataPreprocessing • stochtree

The functions in this group handle data preprocessing for stochastic forest models. For example, factor-valued columns in data frames are either one-hot encoded or converted to integer indices before the data frame is converted to a standard matrix format for sampling. This preprocessing routine defines a set of "steps" that must be repeated on out-of-sample datasets before predictions can be obtained from a sampling model.

preprocessTrainData preprocesses covariates for the forest sampler routines, depending on the input type. Data frames are preprocessed based on their column types (numeric columns are not modified, ordered factors are integer coded, and unordered factors are one-hot encoded). Matrices are returned unmodified (all columns are assumed to be numeric). This function also records and returns a "metadata" list with preprocessing details to ensure that other datasets can be preprocessed identically.
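The three column-type rules described above can be illustrated with a small base R sketch. This is not stochtree's internal implementation, and the exact column layout and naming that preprocessTrainData produces may differ; the data frame and variable names here are made up for illustration.

```r
# Illustrative base R only; not stochtree's internal code.
df <- data.frame(
  x1 = c(0.5, -1.2, 3.0, 0.1),
  x2 = factor(c("low", "high", "mid", "low"),
              levels = c("low", "mid", "high"), ordered = TRUE),
  x3 = factor(c("a", "c", "b", "a"), ordered = FALSE)
)

# Numeric column: passed through unchanged
x1_out <- df$x1

# Ordered factor: replaced by the integer position of each level
# (low = 1, mid = 2, high = 3)
x2_codes <- as.integer(df$x2)

# Unordered factor: one 0/1 indicator column per level
x3_levels <- levels(df$x3)
x3_onehot <- sapply(x3_levels, function(lv) as.numeric(df$x3 == lv))

# Combine everything into a single numeric matrix
X <- cbind(x1 = x1_out, x2 = x2_codes, x3_onehot)
```

Here X is a numeric matrix with one column each for x1 and x2 and three indicator columns for x3, which is the general shape of input a matrix-based sampler expects.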

preprocessPredictionData preprocesses covariates for the forest sampler routines, based on the steps outlined in a metadata list produced by preprocessTrainData.
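To see why prediction data must be encoded against the training-set categories rather than its own, consider the following base R sketch. The helper one_hot and the variable train_levels are hypothetical; train_levels merely stands in for the kind of category information a metadata list might record, and none of this is stochtree's actual code.

```r
# Illustrative base R only; `one_hot` and `train_levels` are hypothetical.
train_x <- factor(c("a", "b", "c", "a"))
train_levels <- levels(train_x)

# One-hot encode `x` against a fixed set of levels
one_hot <- function(x, lvls) {
  x <- factor(as.character(x), levels = lvls)
  out <- sapply(lvls, function(lv) as.numeric(x == lv))
  matrix(out, nrow = length(x), dimnames = list(NULL, lvls))
}

# New data in which level "b" never appears
new_x <- factor(c("c", "a"))

# Encoding against the new data's own levels drops a column ...
naive <- one_hot(new_x, levels(new_x))

# ... while encoding against the training levels keeps the layout fixed
aligned <- one_hot(new_x, train_levels)
```

The naive encoding has two columns while the aligned one has three, matching the training layout. Reusing the recorded training categories is what keeps out-of-sample matrices compatible with a fitted model.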

These procedures are handled internally in the bart() and bcf() functions, but they are provided in stochtree as convenience functions for users writing custom samplers. Furthermore, while R lists can be serialized to RDS format, we offer a number of JSON serialization routines for the metadata list produced by preprocessTrainData for consistency with the broader serialization approach of stochtree (see BARTSerialization and BCFSerialization).

Following the API for serializing bartmodel and bcfmodel objects, we can convert metadata to JSON or JSON strings via savePreprocessorToJson and savePreprocessorToJsonString. Similarly, we can reload a metadata list from JSON or JSON strings via createPreprocessorFromJson and createPreprocessorFromJsonString.
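One possible use of the JSON string form is to persist a preprocessor to disk and reload it in a later session. The sketch below assumes the stochtree package is installed and uses only the functions documented on this page together with base R file I/O; the temporary file path and the toy data frame are illustrative choices, not part of the package.

```r
# A minimal sketch, assuming stochtree is installed.
library(stochtree)

set.seed(1)
df <- data.frame(
  x1 = rnorm(20),
  x2 = factor(sample(1:3, 20, replace = TRUE), ordered = FALSE)
)
metadata <- preprocessTrainData(df)$metadata

# Dump the metadata to a JSON string and write it to a temporary file
json_string <- savePreprocessorToJsonString(metadata)
path <- tempfile(fileext = ".json")
writeLines(json_string, path)

# Later: read the file back and rebuild the metadata list
json_reloaded <- paste(readLines(path), collapse = "\n")
metadata_reloaded <- createPreprocessorFromJsonString(json_reloaded)

# The reloaded metadata can then be applied to new data
X_new <- preprocessPredictionData(df, metadata_reloaded)
```

Writing the string with writeLines and reading it with readLines is only one option; any mechanism that stores and returns the string unchanged should work equally well.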

Usage

preprocessTrainData(input_data)

preprocessPredictionData(input_data, metadata)

savePreprocessorToJson(object)

savePreprocessorToJsonString(object)

createPreprocessorFromJson(json_object)

createPreprocessorFromJsonString(json_string)

Arguments

input_data: Covariates, provided as either a dataframe or a matrix
metadata: List containing information on variables, including train set categories for categorical variables
object: List containing information on variables, including train set categories for categorical variables
json_object: in-memory wrapper around JSON C++ object containing covariate preprocessor metadata
json_string: in-memory JSON string containing covariate preprocessor metadata

Value

preprocessTrainData returns a list with transformed matrix data and a "metadata" list with details on the preprocessing procedures applied. preprocessPredictionData returns a matrix reflecting the data transformations specified in the provided metadata list.

savePreprocessorToJson returns an object of type CppJson. savePreprocessorToJsonString returns a string dump of the preprocessor's JSON representation.

createPreprocessorFromJson and createPreprocessorFromJsonString both return metadata lists.

Examples

# Check that running the same data through `preprocessTrainData`
# and `preprocessPredictionData` yields the same result
n <- 100
x1 <- rnorm(n)
x2 <- factor(sample(1:3, n, replace = TRUE), ordered = TRUE)
x3 <- factor(sample(1:3, n, replace = TRUE), ordered = FALSE)
df1 <- data.frame(x1 = x1, x2 = x2, x3 = x3)
df2 <- data.frame(x1 = x1, x2 = x2, x3 = x3)
preprocess_train_list <- preprocessTrainData(df1)
df1_process <- preprocess_train_list$data
df1_metadata <- preprocess_train_list$metadata
df2_process <- preprocessPredictionData(df2, df1_metadata)
all.equal(df1_process, df2_process)
#> [1] TRUE

# Save to in-memory JSON
metadata_json <- savePreprocessorToJson(df1_metadata)
# Save to JSON string
metadata_json_string <- savePreprocessorToJsonString(df1_metadata)

# Reload metadata list from in-memory JSON object
metadata_roundtrip <- createPreprocessorFromJson(metadata_json)
# Reload metadata list from JSON string
metadata_roundtrip <- createPreprocessorFromJsonString(metadata_json_string)

# Preprocess a numeric-only data frame using a manually specified metadata list
cov_df <- data.frame(x1 = 1:5, x2 = 5:1, x3 = 6:10)
metadata <- list(num_ordered_cat_vars = 0, num_unordered_cat_vars = 0,
                 num_numeric_vars = 3, numeric_vars = c("x1", "x2", "x3"))
X_preprocessed <- preprocessPredictionData(cov_df, metadata)