StochTree 0.0.1
|
Classes | |
class | CategorySampleTracker |
Mapping categories to the indices they contain. TODO: Add run-time checks for categories with a few observations. | |
class | ColumnMatrix |
Internal wrapper around Eigen::MatrixXd interface for multidimensional floating point data. More... | |
class | ColumnVector |
Internal wrapper around Eigen::VectorXd interface for univariate floating point data. The (frequently updated) full / partial residual used in sampling forests is stored internally as a ColumnVector by the sampling functions (see Forest Sampler API). More... | |
class | CutpointGridContainer |
Container class for FeatureCutpointGrid objects stored for every feature in a dataset. More... | |
class | FeatureCutpointGrid |
Computing and tracking cutpoints available for a given feature at a given node. Store cutpoint bins in 0-indexed fashion, so that if a given node has. More... | |
class | FeaturePresortPartition |
Data structure that tracks pre-sorted feature values through a tree's split lifecycle. More... | |
class | FeaturePresortRoot |
Data structure for presorting a feature by its values. More... | |
class | FeaturePresortRootContainer |
Container class for FeaturePresortRoot objects stored for every feature in a dataset. More... | |
class | FeatureUnsortedPartition |
Mapping nodes to the indices they contain. More... | |
class | ForestContainer |
Container of TreeEnsemble forest objects. This is the primary (in-memory) storage interface for multiple "samples" of a decision tree ensemble in stochtree. More... | |
class | ForestDataset |
API for loading and accessing data used to sample tree ensembles. The covariates / bases / weights used in sampling forests are stored internally as a ForestDataset by the sampling functions (see Forest Sampler API). More... | |
class | ForestTracker |
"Superclass" wrapper around tracking data structures for forest sampling algorithms. More... | |
class | GammaSampler |
class | GaussianConstantLeafModel |
Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More... | |
class | GaussianConstantSuffStat |
Sufficient statistic and associated operations for gaussian homoskedastic constant leaf outcome model. More... | |
class | GaussianMultivariateRegressionLeafModel |
Marginal likelihood and posterior computation for gaussian homoskedastic multivariate regression leaf outcome model. More... | |
class | GaussianMultivariateRegressionSuffStat |
Sufficient statistic and associated operations for gaussian homoskedastic multivariate regression leaf outcome model. More... | |
class | GaussianUnivariateRegressionLeafModel |
Marginal likelihood and posterior computation for gaussian homoskedastic univariate regression leaf outcome model. More... | |
class | GaussianUnivariateRegressionSuffStat |
Sufficient statistic and associated operations for gaussian homoskedastic univariate regression leaf outcome model. More... | |
class | GlobalHomoskedasticVarianceModel |
Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More... | |
class | IGVariancePrior |
class | InverseGammaSampler |
class | LabelMapper |
Standalone container for the map from category IDs to 0-based indices. More... | |
class | LeafNodeHomoskedasticVarianceModel |
Marginal likelihood and posterior computation for gaussian homoskedastic constant leaf outcome model. More... | |
class | LogLinearVarianceLeafModel |
Marginal likelihood and posterior computation for heteroskedastic log-linear variance model. More... | |
class | LogLinearVarianceSuffStat |
Sufficient statistic and associated operations for heteroskedastic log-linear variance model. More... | |
class | MultivariateNormalSampler |
class | MultivariateRegressionRandomEffectsModel |
Posterior computation and sampling and state storage for random effects model with a group-level multivariate basis regression. More... | |
class | NodeCutpointTracker |
Computing and tracking cutpoints available for a given feature at a given node. More... | |
class | NodeOffsetSize |
Tracking cutpoints available at a given node. More... | |
class | RandomEffectsContainer |
class | RandomEffectsDataset |
API for loading and accessing data used to sample (additive) random effects. More... | |
class | RandomEffectsGaussianPrior |
class | RandomEffectsRegressionGaussianPrior |
class | RandomEffectsTracker |
Wrapper around data structures for random effects sampling algorithms. More... | |
class | SampleCategoryMapper |
Class storing sample-node map for each tree in an ensemble. TODO: Add run-time checks for categories with a few observations. More... | |
class | SampleNodeMapper |
Class storing sample-node map for each tree in an ensemble. More... | |
class | SamplePredMapper |
Class storing sample-prediction map for each tree in an ensemble. More... | |
class | SortedNodeSampleTracker |
Data structure for tracking observations through a tree partition with each feature pre-sorted. More... | |
class | Tree |
Decision tree data structure. More... | |
class | TreeEnsemble |
Class storing a "forest," or an ensemble of decision trees. More... | |
class | TreePrior |
class | TreeSplit |
Representation of arbitrary tree split rules, including numeric split rules (X[,i] <= c) and categorical split rules (X[,i] in {2,4,6,7}). More... | |
class | UnivariateNormalSampler |
class | UnsortedNodeSampleTracker |
Mapping nodes to the indices they contain. More... | |
Enumerations | |
enum | ModelType |
Leaf models for the forest sampler: More... | |
enum | TreeNodeType |
Tree node type. | |
Functions | |
static void | ExtractMultipleFeaturesFromMemory (std::vector< std::string > *text_data, const Parser *parser, std::vector< int32_t > &column_indices, Eigen::MatrixXd &data, data_size_t num_rows) |
Extract multiple features from the raw data loaded from a file into an Eigen::MatrixXd. Lightly modified from LightGBM's datasetloader interface to support stochtree's use cases. | |
static void | ExtractSingleFeatureFromMemory (std::vector< std::string > *text_data, const Parser *parser, int32_t column_index, Eigen::VectorXd &data, data_size_t num_rows) |
Extract a single feature from the raw data loaded from a file into an Eigen::VectorXd. Lightly modified from LightGBM's datasetloader interface to support stochtree's use cases. | |
std::string | TreeNodeTypeToString (TreeNodeType type) |
Get string representation of TreeNodeType. | |
TreeNodeType | TreeNodeTypeFromString (std::string const &name) |
Get TreeNodeType from string. | |
bool | operator== (const Tree &lhs, const Tree &rhs) |
Comparison operator for trees. | |
bool | SplitTrueNumeric (double fvalue, double threshold) |
Determine whether an observation produces a "true" value in a numeric split node. | |
bool | SplitTrueCategorical (double fvalue, std::vector< std::uint32_t > const &category_list) |
Determine whether an observation produces a "true" value in a categorical split node. | |
int | NextNodeNumeric (double fvalue, double threshold, int left_child, int right_child) |
Return left or right node id based on a numeric split. | |
int | NextNodeCategorical (double fvalue, std::vector< std::uint32_t > const &category_list, int left_child, int right_child) |
Return left or right node id based on a categorical split. | |
int | EvaluateTree (Tree const &tree, Eigen::MatrixXd &data, int row) |
int | EvaluateTree (Tree const &tree, Eigen::Map< Eigen::Matrix< double, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor > > &data, int row) |
bool | RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, double split_value) |
Determine whether a given observation is "true" at a split proposed by split_index and split_value. | |
bool | RowSplitLeft (Eigen::MatrixXd &covariates, int row, int split_index, std::vector< std::uint32_t > const &category_list) |
Determine whether a given observation is "true" at a split proposed by split_index and category_list. | |
static void | VarSplitRange (ForestTracker &tracker, ForestDataset &dataset, int tree_num, int leaf_split, int feature_split, double &var_min, double &var_max) |
Compute the range of available split values for a continuous variable, given the current structure of a tree. | |
static bool | NodesNonConstantAfterSplit (ForestDataset &dataset, ForestTracker &tracker, TreeSplit &split, int tree_num, int leaf_split, int feature_split) |
Determines whether a proposed split creates two leaf nodes with constant values for every feature (thus ensuring that the tree cannot split further). | |
template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs> | |
static void | GFRSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, double global_variance, std::vector< FeatureType > &feature_types, int cutpoint_grid_size, bool keep_forest, bool pre_initialized, bool backfitting, LeafSuffStatConstructorArgs &... leaf_suff_stat_args) |
template<typename LeafModel , typename LeafSuffStat , typename... LeafSuffStatConstructorArgs> | |
static void | MCMCSampleOneIter (TreeEnsemble &active_forest, ForestTracker &tracker, ForestContainer &forests, LeafModel &leaf_model, ForestDataset &dataset, ColumnVector &residual, TreePrior &tree_prior, std::mt19937 &gen, std::vector< double > &variable_weights, double global_variance, bool keep_forest, bool pre_initialized, bool backfitting, LeafSuffStatConstructorArgs &... leaf_suff_stat_args) |
Runs one iteration of the MCMC sampler for a tree ensemble model, which consists of two steps for every tree in a forest: | |
Copyright (c) 2024 stochtree authors.
General-purpose data structures used for keeping track of categories in a training dataset.
SampleCategoryMapper is a simplified version of SampleNodeMapper. It is not tree-specific: it tracks the categories loaded into a training dataset, and we do not expect to modify it during training.
SampleCategoryMapper is used in two places:
CategorySampleTracker is a simplified version of FeatureUnsortedPartition, which as above does not vary based on tree / partition and is not expected to change during training.
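The two mappings described above (sample to category, and category to the sample indices it contains) can be illustrated with a minimal sketch. The struct and member names below are hypothetical and are not the stochtree classes; the sketch only assumes that raw category IDs are remapped to 0-based indices, as the LabelMapper description suggests.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical illustration only; not the stochtree implementation.
struct CategoryIndexSketch {
  // Raw category ID -> 0-based category index (LabelMapper-style).
  std::map<std::int32_t, std::int32_t> label_map;
  // Observation index -> 0-based category index (SampleCategoryMapper-style).
  std::vector<std::int32_t> sample_category;
  // 0-based category index -> observation indices (CategorySampleTracker-style).
  std::vector<std::vector<std::int32_t>> indices_by_category;

  explicit CategoryIndexSketch(const std::vector<std::int32_t>& group_ids) {
    // First pass: collect the unique IDs (std::map keeps them sorted).
    for (std::int32_t id : group_ids) label_map.emplace(id, 0);
    std::int32_t next = 0;
    for (auto& kv : label_map) kv.second = next++;
    // Second pass: fill both mappings. Neither changes after construction.
    sample_category.resize(group_ids.size());
    indices_by_category.resize(label_map.size());
    for (std::size_t i = 0; i < group_ids.size(); ++i) {
      const std::int32_t c = label_map.at(group_ids[i]);
      sample_category[i] = c;
      indices_by_category[c].push_back(static_cast<std::int32_t>(i));
    }
  }
};
```

Because both mappings are built once from the loaded data, a structure like this can be shared across trees, in contrast to the per-tree, per-partition trackers.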
SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:
Copyright (c) 2016 Microsoft Corporation. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.
Copyright (c) 2024 stochtree authors. All rights reserved.
Simple container-like interfaces for samples of common models.
Copyright (c) 2024 stochtree authors.
Data structures for enumerating potential cutpoint candidates.
This is used in the XBART family of algorithms, which samples split rules based on the log marginal likelihood of every potential cutpoint. For numeric variables with large sample sizes, it is often unnecessary to consider every unique value, so we allow for an adaptive "grid" of potential cutpoint values.
Algorithms for enumerating cutpoints take Dataset and SortedNodeSampleTracker objects as inputs, so that each feature is "pre-sorted" according to its value within a given node. The size of the adaptive cutpoint grid is set by the cutpoint_grid_size configuration parameter.
When a node has fewer available observations than cutpoint_grid_size, full enumeration of unique available cutpoints is done via the EnumerateNumericCutpointsDeduplication function.
When a node has more available observations than cutpoint_grid_size, potential cutpoints are "thinned out" by considering every k-th observation, where k is implied by the number of observations and the target cutpoint_grid_size.
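The two numeric cases above can be sketched as a single helper. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, the stride is assumed to be the integer quotient of the observation count and cutpoint_grid_size, and the boundary case where the two are equal is treated as full enumeration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper. Assumes `sorted_values` holds the feature values
// available in one node, already sorted in ascending order.
std::vector<double> SketchNumericCutpointGrid(
    const std::vector<double>& sorted_values, std::size_t cutpoint_grid_size) {
  std::vector<double> grid;
  const std::size_t n = sorted_values.size();
  if (n == 0 || cutpoint_grid_size == 0) return grid;
  // Stride of 1 enumerates every observation; otherwise take every k-th one.
  // The formula k = n / cutpoint_grid_size is an assumption of this sketch.
  const std::size_t k = (n <= cutpoint_grid_size) ? 1 : n / cutpoint_grid_size;
  for (std::size_t i = 0; i < n; i += k) {
    // Deduplicate so each candidate cutpoint value appears once.
    if (grid.empty() || sorted_values[i] != grid.back()) {
      grid.push_back(sorted_values[i]);
    }
  }
  return grid;
}
```

With integer division the thinned grid can contain slightly more than cutpoint_grid_size candidates, so the parameter acts as a target rather than a hard cap in this sketch.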
For ordered categorical features, the grid is every unique value of the feature in ascending order.
For unordered categorical features, the grid is every unique value of the feature, arranged in an outcome-dependent order, as described in Fisher (1958).
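One common reading of an outcome-dependent order is to sort the categories by the mean of the outcome (or residual) observed within each category, after which they can be scanned like an ordered feature. The sketch below assumes that reading; the function name is hypothetical, and the statistic stochtree actually uses may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper. Assumes `categories` and `outcome` have equal length
// and that ordering by per-category mean outcome is the intended rule.
std::vector<std::uint32_t> SketchOrderCategoriesByMeanOutcome(
    const std::vector<std::uint32_t>& categories,
    const std::vector<double>& outcome) {
  // Accumulate (sum, count) of the outcome for each category.
  std::map<std::uint32_t, std::pair<double, std::size_t>> stats;
  for (std::size_t i = 0; i < categories.size(); ++i) {
    auto& s = stats[categories[i]];
    s.first += outcome[i];
    s.second += 1;
  }
  // Sort categories by mean outcome; ties are broken by category value.
  std::vector<std::pair<double, std::uint32_t>> means;
  for (const auto& kv : stats) {
    means.emplace_back(kv.second.first / static_cast<double>(kv.second.second),
                       kv.first);
  }
  std::sort(means.begin(), means.end());
  std::vector<std::uint32_t> ordered;
  for (const auto& m : means) ordered.push_back(m.second);
  return ordered;
}
```

Each prefix of the returned order is then a candidate category set for a categorical split rule of the form X[,i] in {...}.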
Copyright (c) 2024 stochtree authors. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.
Data structures used for tracking a dataset through the tree-building process.
The first category of data structure tracks observations available in nodes of a tree.
a. UnsortedNodeSampleTracker tracks the observations available in every leaf of every tree in an ensemble, in no feature-specific sort order. It is primarily designed for use in BART-based algorithms.
b. SortedNodeSampleTracker tracks the observations available in every leaf of a tree, pre-sorted separately for each feature. It is primarily designed for use in XBART-based algorithms.
The second category, SampleNodeMapper, maps observations from a dataset to leaf nodes.
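A minimal sketch of how the two kinds of tracker might fit together for a single tree follows. It stores observation indices in one array grouped by node (similar in spirit to the LightGBM DataPartition design mentioned here) alongside a per-observation leaf id. All names are hypothetical, and the layout is an assumption of the sketch rather than a description of the stochtree classes.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration for one tree; not the stochtree implementation.
struct PartitionSketch {
  std::vector<int> indices;      // observation indices, grouped by node
  std::vector<int> node_begin;   // offset of each node's block in `indices`
  std::vector<int> node_length;  // number of observations in each node
  std::vector<int> sample_node;  // observation index -> current leaf id

  // Start with every observation in the root node (id 0).
  explicit PartitionSketch(int n)
      : indices(n), node_begin{0}, node_length{n}, sample_node(n, 0) {
    for (int i = 0; i < n; ++i) indices[i] = i;
  }

  // Split leaf `node` with the numeric rule x[i] <= threshold.
  // Returns the ids of the new left and right children.
  std::pair<int, int> Split(int node, const std::vector<double>& x,
                            double threshold) {
    auto first = indices.begin() + node_begin[node];
    auto last = first + node_length[node];
    // Move "true" observations to the front of the node's block.
    auto mid = std::stable_partition(
        first, last, [&](int i) { return x[i] <= threshold; });
    const int n_left = static_cast<int>(mid - first);
    const int left_id = static_cast<int>(node_begin.size());
    const int right_id = left_id + 1;
    node_begin.push_back(node_begin[node]);
    node_length.push_back(n_left);
    node_begin.push_back(node_begin[node] + n_left);
    node_length.push_back(node_length[node] - n_left);
    // Keep the observation -> leaf map in sync with the partition.
    for (auto it = first; it != mid; ++it) sample_node[*it] = left_id;
    for (auto it = mid; it != last; ++it) sample_node[*it] = right_id;
    return {left_id, right_id};
  }
};
```

The node-to-indices view answers "which observations are in this leaf" when evaluating a proposed split, while the observation-to-leaf view answers "which leaf predicts this row" when updating residuals.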
SampleNodeMapper is inspired by the design of the DataPartition class in LightGBM, released under the MIT license with the following copyright:
Copyright (c) 2016 Microsoft Corporation. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.
SortedNodeSampleTracker is inspired by the "approximate" split finding method in xgboost, released under the Apache license with the following copyright:
Copyright 2015~2023 by XGBoost Contributors
Copyright (c) 2023 stochtree authors. All rights reserved. Licensed under the MIT License. See LICENSE file in the project root for license information.