data.Dataset

data.Dataset()

Wrapper around a C++ class that stores all of the non-outcome data used in stochtree. This includes:

  1. Features used for partitioning (also referred to as “covariates” in many places in these docs).
  2. Basis vectors used to define non-constant leaf models. This is optional but may be included via the add_basis method.
  3. Variance weights used to define heteroskedastic or otherwise weighted models. This is optional but may be included via the add_variance_weights method.
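
As a minimal sketch of how these three components might be assembled (assuming the class is importable as `stochtree.data.Dataset`, a path inferred from the `data.Dataset` name on this page, and using synthetic data), one could write the following. The import is guarded so that the array preparation still runs where stochtree is not installed.

```python
import numpy as np

try:
    # Import path inferred from the "data.Dataset" name on this page (assumption)
    from stochtree.data import Dataset
except ImportError:
    Dataset = None

rng = np.random.default_rng(1234)
n, p = 100, 5

# 1. Features used for partitioning: an (n, p) numeric array
covariates = rng.uniform(size=(n, p))
# 2. Optional basis for non-constant leaf models: an (n, k) array, here k = 2
basis = np.column_stack([np.ones(n), rng.normal(size=n)])
# 3. Optional variance weights: one positive value per observation
variance_weights = rng.uniform(0.5, 2.0, size=n)

if Dataset is not None:
    dataset = Dataset()
    dataset.add_covariates(covariates)
    dataset.add_basis(basis)
    dataset.add_variance_weights(variance_weights)
    print(dataset.num_observations(), dataset.num_covariates(), dataset.num_basis())
```

Only the covariates are needed for a constant-leaf, homoskedastic model; the basis and variance weights can be omitted in that case.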

Methods

Name Description
add_covariates Add covariates to a dataset
add_basis Add basis matrix to a dataset
update_basis Update basis matrix in a dataset. Allows users to build an ensemble whose leaves regress on bases that are updated throughout the sampler.
add_variance_weights Add variance weights to a dataset
update_variance_weights Update variance weights in a dataset. Allows users to build an ensemble that depends on variance weights that are updated throughout the sampler.
num_observations Query the number of observations in a dataset
num_covariates Query the number of covariates (features) in a dataset
num_basis Query the dimension of the basis vector in a dataset
get_covariates Return the covariates in a Dataset as a numpy array
get_basis Return the bases in a Dataset as a numpy array
get_variance_weights Return the variance weights in a Dataset as a numpy array
has_basis Whether or not a dataset has a basis vector (for leaf regression)
has_variance_weights Whether or not a dataset has variance weights
add_auxiliary_dimension Add an auxiliary data dimension to the dataset
set_auxiliary_data_value Set a value in the auxiliary data
get_auxiliary_data_value Get a value from the auxiliary data
get_auxiliary_data_vector Get an auxiliary data vector as a numpy array

add_covariates

data.Dataset.add_covariates(covariates)

Add covariates to a dataset

Parameters

Name Type Description Default
covariates np.array Numpy array of covariates. If data contain categorical, string, time series, or other columns in a dataframe, please first preprocess using the CovariateTransformer. required
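
For data that are already numeric, a lightweight check before calling add_covariates may be enough. The sketch below uses a hypothetical helper, `as_covariate_matrix` (not part of stochtree), that coerces input to a two-dimensional float array and rejects non-numeric data, which would need preprocessing first.

```python
import numpy as np

def as_covariate_matrix(x):
    """Hypothetical helper: coerce input to a 2-D float array, rejecting non-numeric data."""
    arr = np.asarray(x)
    if not np.issubdtype(arr.dtype, np.number):
        raise TypeError("non-numeric columns must be preprocessed first")
    arr = arr.astype(float)
    if arr.ndim == 1:
        # Treat a 1-D input as a single covariate column
        arr = arr.reshape(-1, 1)
    if arr.ndim != 2:
        raise ValueError("covariates must be 1-D or 2-D")
    return arr

X = as_covariate_matrix([[1, 0.5], [2, 1.5], [3, 2.5]])  # shape (3, 2)
x_single = as_covariate_matrix([0.1, 0.2, 0.3])          # shape (3, 1)
```

The resulting array `X` could then be passed to `add_covariates` on a Dataset instance.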

add_basis

data.Dataset.add_basis(basis)

Add basis matrix to a dataset

Parameters

Name Type Description Default
basis np.array Numpy array of basis vectors. required

update_basis

data.Dataset.update_basis(basis)

Update basis matrix in a dataset. Allows users to build an ensemble whose leaves regress on bases that are updated throughout the sampler.

Parameters

Name Type Description Default
basis np.array Numpy array of basis vectors. required
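
A rough sketch of the "updated throughout the sampler" pattern follows. The basis function `make_basis`, the parameter `theta`, and the loop itself are hypothetical stand-ins for whatever a custom sampler does, and the import path is assumed from the `data.Dataset` name on this page.

```python
import numpy as np

try:
    from stochtree.data import Dataset  # assumed import path
except ImportError:
    Dataset = None

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(size=(n, 3))
t = np.linspace(0.0, 1.0, n)

def make_basis(theta):
    """Hypothetical basis that depends on a parameter sampled elsewhere."""
    return np.column_stack([np.ones(n), np.exp(-theta * t)])

theta = 1.0
basis = make_basis(theta)
if Dataset is not None:
    dataset = Dataset()
    dataset.add_covariates(X)
    # Add the initial basis first; updates presumably expect one to exist already
    dataset.add_basis(basis)

for _ in range(3):
    # ... a forest update using the current basis would go here ...
    theta = theta + 0.5        # stand-in for a sampled parameter update
    basis = make_basis(theta)  # same shape as the original basis
    if Dataset is not None:
        dataset.update_basis(basis)
```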

add_variance_weights

data.Dataset.add_variance_weights(variance_weights)

Add variance weights to a dataset

Parameters

Name Type Description Default
variance_weights np.array One-dimensional numpy array of variance weights. required

update_variance_weights

data.Dataset.update_variance_weights(variance_weights, exponentiate=False)

Update variance weights in a dataset. Allows users to build an ensemble that depends on variance weights that are updated throughout the sampler.

Parameters

Name Type Description Default
variance_weights np.array One-dimensional numpy array of variance weights. required
exponentiate bool Whether to exponentiate the variance weights before storing them in the dataset. False
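
The `exponentiate` flag is convenient when a variance model works on the log scale. The sketch below, with synthetic values and an import path assumed from the `data.Dataset` name on this page, shows two routes that should store equivalent weights if `exponentiate=True` simply applies `exp` elementwise, as the description suggests.

```python
import numpy as np

try:
    from stochtree.data import Dataset  # assumed import path
except ImportError:
    Dataset = None

rng = np.random.default_rng(42)
n = 20
X = rng.uniform(size=(n, 2))
initial_weights = np.ones(n)

# Suppose a variance model produces log-scale values (synthetic here)
log_weights = rng.normal(scale=0.3, size=n)
weights = np.exp(log_weights)  # natural-scale weights, always positive

if Dataset is not None:
    dataset = Dataset()
    dataset.add_covariates(X)
    dataset.add_variance_weights(initial_weights)
    # Option A: exponentiate beforehand and pass natural-scale weights
    dataset.update_variance_weights(weights)
    # Option B: pass log-scale values and let the dataset exponentiate them
    dataset.update_variance_weights(log_weights, exponentiate=True)
```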

num_observations

data.Dataset.num_observations()

Query the number of observations in a dataset

Returns

Name Type Description
int Number of observations in the dataset

num_covariates

data.Dataset.num_covariates()

Query the number of covariates (features) in a dataset

Returns

Name Type Description
int Number of covariates in the dataset

num_basis

data.Dataset.num_basis()

Query the dimension of the basis vector in a dataset

Returns

Name Type Description
int Dimension of the basis vector in the dataset, returning 0 if the dataset does not have a basis

get_covariates

data.Dataset.get_covariates()

Return the covariates in a Dataset as a numpy array

Returns

Name Type Description
np.array Covariate data

get_basis

data.Dataset.get_basis()

Return the bases in a Dataset as a numpy array

Returns

Name Type Description
np.array Basis data

get_variance_weights

data.Dataset.get_variance_weights()

Return the variance weights in a Dataset as a numpy array

Returns

Name Type Description
np.array Variance weights data

has_basis

data.Dataset.has_basis()

Whether or not a dataset has a basis vector (for leaf regression)

Returns

Name Type Description
bool True if the dataset has a basis, False otherwise

has_variance_weights

data.Dataset.has_variance_weights()

Whether or not a dataset has variance weights

Returns

Name Type Description
bool True if the dataset has variance weights, False otherwise
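
The query and accessor methods above can be combined to inspect a dataset safely. The following sketch uses a hypothetical helper, `describe` (not part of stochtree), that fetches the optional components only when the corresponding `has_*` check passes; the import path is assumed from the `data.Dataset` name on this page.

```python
import numpy as np

try:
    from stochtree.data import Dataset  # assumed import path
except ImportError:
    Dataset = None

def describe(dataset):
    """Hypothetical helper: summarize a Dataset using only its query methods."""
    summary = {
        "n": dataset.num_observations(),
        "p": dataset.num_covariates(),
        "basis_dim": dataset.num_basis(),  # 0 when no basis has been added
        "has_basis": dataset.has_basis(),
        "has_variance_weights": dataset.has_variance_weights(),
    }
    # Only fetch optional components that are actually present
    summary["basis"] = dataset.get_basis() if summary["has_basis"] else None
    summary["variance_weights"] = (
        dataset.get_variance_weights() if summary["has_variance_weights"] else None
    )
    return summary

X = np.random.default_rng(7).uniform(size=(30, 4))
if Dataset is not None:
    dataset = Dataset()
    dataset.add_covariates(X)
    info = describe(dataset)
    print(info["n"], info["p"], info["basis_dim"])
```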

add_auxiliary_dimension

data.Dataset.add_auxiliary_dimension(dim_size)

Add an auxiliary data dimension to the dataset

Parameters

Name Type Description Default
dim_size int Number of elements in the new auxiliary dimension required

set_auxiliary_data_value

data.Dataset.set_auxiliary_data_value(dim_idx, element_idx, value)

Set a value in the auxiliary data

Parameters

Name Type Description Default
dim_idx int Index of the auxiliary dimension required
element_idx int Index of the element within the dimension required
value float Value to set required

get_auxiliary_data_value

data.Dataset.get_auxiliary_data_value(dim_idx, element_idx)

Get a value from the auxiliary data

Parameters

Name Type Description Default
dim_idx int Index of the auxiliary dimension required
element_idx int Index of the element within the dimension required

Returns

Name Type Description
float The auxiliary data value

get_auxiliary_data_vector

data.Dataset.get_auxiliary_data_vector(dim_idx)

Get an auxiliary data vector as a numpy array

Parameters

Name Type Description Default
dim_idx int Index of the auxiliary dimension required

Returns

Name Type Description
np.array The auxiliary data vector
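
A round trip through the auxiliary data methods might look like the sketch below. It assumes zero-based indices, that the first dimension added receives index 0, and that the class is importable as `stochtree.data.Dataset`; none of these is stated on this page.

```python
import numpy as np

try:
    from stochtree.data import Dataset  # assumed import path
except ImportError:
    Dataset = None

values = np.array([0.25, -1.0, 3.5])

if Dataset is not None:
    dataset = Dataset()
    dataset.add_covariates(np.zeros((3, 1)))
    # Reserve one auxiliary dimension with len(values) elements
    dataset.add_auxiliary_dimension(len(values))
    dim_idx = 0  # assumption: the first dimension added has index 0
    for element_idx, value in enumerate(values):
        dataset.set_auxiliary_data_value(dim_idx, element_idx, float(value))
    # Read back a single element and then the whole vector
    first = dataset.get_auxiliary_data_value(dim_idx, 0)
    vector = dataset.get_auxiliary_data_vector(dim_idx)
    print(first, vector)
```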