# StochTree

stochtree (short for "stochastic trees") unlocks flexible decision tree modeling in R or Python.

## What does the software do?

Boosted decision tree models (like xgboost, LightGBM, or scikit-learn's HistGradientBoostingRegressor) are great, but often require time-consuming hyperparameter tuning. stochtree can help you avoid this by running a fast Bayesian analog of gradient boosting called BART (Bayesian Additive Regression Trees).
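
For example, here is a minimal sketch of a BART fit through the Python interface. The `BARTModel` class and the `sample`/`predict` argument names shown are assumptions based on the high-level interface described below; check the package documentation for exact signatures.

```python
import numpy as np
from stochtree import BARTModel  # assumed high-level class name

# Simulate a nonlinear regression problem
rng = np.random.default_rng(2024)
X = rng.uniform(size=(500, 5))
y = np.sin(4 * np.pi * X[:, 0]) + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Fit BART: MCMC sampling takes the place of a hyperparameter search
# (argument names are assumptions; consult the API reference)
model = BARTModel()
model.sample(X_train=X, y_train=y, num_burnin=100, num_mcmc=500)

# Posterior predictions: one column per retained MCMC draw
y_hat_draws = model.predict(covariates=X)
y_hat = y_hat_draws.mean(axis=1)  # posterior mean prediction
```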

stochtree has two primary interfaces:

  1. "High-level": robust implementations of many popular stochastic tree algorithms (BART, XBART, BCF, XBCF), with support for serialization and parallelism.
  2. "Low-level": access to the "inner loop" of a forest sampler, allowing custom tree algorithm development in <50 lines of code.

The "core" of the software is written in C++, but it provides R and Python APIs. The R package is available on CRAN and the python package will soon be on PyPI.

Why "stochastic" trees?#

"Stochastic" loosely means the same thing as "random." This naturally raises the question: how is stochtree different from a random forest library? At a superficial level, both are decision tree ensembles that use randomness in training.

The difference lies in how that "randomness" is deployed. Random forests randomize the data: each tree is fit to a random subset of the training dataset (a bootstrap sample) by a deterministic recursive partitioning algorithm. Stochastic tree algorithms keep the training dataset fixed and instead randomize the fitting itself, sampling tree structures and leaf values rather than choosing them greedily.

The original stochastic tree model, Bayesian Additive Regression Trees (BART), uses Markov chain Monte Carlo (MCMC) to sample forests from their posterior distribution.
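
Concretely, BART (Chipman, George, and McCulloch, 2010) models the response as a sum of $m$ regression trees plus Gaussian noise,

$$
y_i = \sum_{j=1}^{m} g(x_i; T_j, M_j) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),
$$

where $T_j$ is the structure of the $j$-th tree and $M_j$ its leaf parameters. Regularization priors keep each tree small, and the MCMC sampler draws the trees, leaf values, and $\sigma^2$ from their joint posterior.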

So why not call our project bayesiantree?

Some algorithms implemented in stochtree (like XBART) are "quasi-Bayesian": they are inspired by a Bayesian model but sampled with fast algorithms that do not target a valid Bayesian posterior distribution.

Moreover, we think of stochastic forests as general-purpose modeling tools. What makes them useful is their strong empirical performance -- especially on small or noisy datasets -- not their adherence to any statistical framework.

So why not just call our project decisiontree?

Put simply, the sampling approach is part of what makes BART and other stochtree algorithms work so well. We know because we have tested versions that fit the trees deterministically instead of sampling them, and they did not perform as well.

So we settled on the term "stochastic trees", or "stochtree" for short (pronounced "stoke-tree").