
Additive Groves

Quick start

If you want to make a simple test run of Additive Groves on your data without reading the whole manual, go here.

Algorithm overview

Additive Groves is a supervised learning algorithm that consistently shows high performance on regression and classification problems. It is based on regression trees, additive models and bagging, and is capable of both fitting the additive structure of the problem and modelling its highly nonlinear components with very large trees at the same time. Because it is based on bagging, it does not overfit when the number of iterations is increased. The combination of these properties makes Additive Groves superior in performance to other tree ensemble methods such as bagging, boosting and Random Forests.

Additive Groves consists of bagged additive models, where every single model is a tree. The size of a single Grove (a single additive model) is defined by two parameters: the size of the trees and the number of trees. Each Grove of non-trivial size is iteratively built from smaller Groves, so on every bagging iteration a set of Groves of different sizes is built. A single Grove consisting of large trees can and will overfit heavily to the training set, so bagging is applied on top to reduce variance. See [1] for the details of the algorithm.
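The backfitting idea behind a single Grove can be sketched in a few lines. The following Python toy is not the TreeExtra implementation - it uses depth-1 stumps on a single feature instead of large trees, and all names are illustrative - but it shows the two layers: N trees fit so that each models the residuals of the others, and bagging averaging many such Groves:

```python
import random

def fit_stump(xs, ys):
    # Depth-1 regression tree on one feature: pick the split threshold
    # that minimizes the squared error of the two leaf means.
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2.0
        left = [ys[i] for i in range(len(xs)) if xs[i] <= thr]
        right = [ys[i] for i in range(len(xs)) if xs[i] > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    if best is None:                      # degenerate sample: constant model
        return lambda x: sum(ys) / len(ys)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def fit_grove(xs, ys, n_trees, passes=5):
    # A Grove is an additive model: each tree is repeatedly refit on the
    # residuals left by all the other trees (backfitting).
    trees = [lambda x: 0.0] * n_trees
    for _ in range(passes):
        for j in range(n_trees):
            resid = [ys[i] - sum(t(xs[i]) for k, t in enumerate(trees) if k != j)
                     for i in range(len(xs))]
            trees[j] = fit_stump(xs, resid)
    return lambda x: sum(t(x) for t in trees)

def bagged_groves(xs, ys, n_trees, n_bags, seed=1):
    # Bagging on top: train each Grove on a bootstrap sample and average.
    rng = random.Random(seed)
    groves = []
    for _ in range(n_bags):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        groves.append(fit_grove([xs[i] for i in idx], [ys[i] for i in idx], n_trees))
    return lambda x: sum(g(x) for g in groves) / len(groves)
```

The real algorithm additionally grows each Grove from smaller Groves and uses large trees whose size is controlled by α.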


[1] Daria Sorokina, Rich Caruana, Mirek Riedewald.
Additive Groves of Regression Trees. In Proceedings of the 18th European Conference on Machine Learning (ECML'07).


Current code release

This code implements the original Additive Groves for regression*, where root mean squared error is used for fitting the trees. However, it can be used for binary classification as well. The code provides a choice of two metrics for evaluating models at the stage of choosing parameters: RMS - root mean squared error - and ROC - area under the ROC curve (also referred to as AUC). ROC is preferable for binary classification, because it is sensitive only to the relative order of the predictions and ignores their actual values. (Additive Groves can produce predictions greater than 1 and less than 0, and RMS will treat such predictions as errors.) For true regression problems, of course, RMS is the only available performance metric.
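The difference between the two metrics is easy to see in a small sketch (hypothetical helper names, not the TreeExtra code): predictions outside the [0, 1] range are penalized by RMS but leave AUC untouched as long as the ordering of positives and negatives is right.

```python
import math

def rmse(preds, labels):
    # Root mean squared error: sensitive to the actual predicted values.
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds))

def auc(preds, labels):
    # Area under the ROC curve: the probability that a random positive is
    # ranked above a random negative (ties count 1/2). Only the relative
    # order of the predictions matters.
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, the predictions [1.3, 0.9, 0.1, -0.2] against labels [1, 1, 0, 0] have a perfect AUC of 1.0, while RMS reports a nonzero error for the values outside [0, 1].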
The main drawback of Additive Groves is its running time. If the data set is large (~10,000 data points or > 100 features) and you are training your models in the default non-parallel mode, it is strongly recommended to run bagging with feature evaluation as a preprocessing step. Running Additive Groves on the top 20 features returned by feature evaluation often results in good performance and reasonable running time. Another option is to run different bagging iterations in parallel; see the section on parallelization below.

* In versions 2.2 and up, the implementation contains a few improvements over the algorithm described in the original paper.

Important parameters of the algorithm


Training, validation and test sets.

To ensure good performance of the algorithm, it is important to choose the best values for all three parameters: α (which controls the size of the trees), N (the number of trees in a Grove) and the number of bagging iterations. Good values of α and N vary significantly between data sets, so no predefined values can be recommended. Therefore estimating these parameters on a validation data set - a separate part of the training data, not used for fitting the models - is crucial for Additive Groves. In the current implementation the process of finding the best parameter values is built in: models of different sizes that are naturally produced during the training phase are evaluated on the validation set, and the best combination of parameter values is reported along with recommendations on whether more models, or models of larger sizes, should be built for this data set. Note that if you want to get a fair estimate of how well the algorithm performs, you cannot use the performance on the validation set measured during training - instead you need to save the best model, produce predictions for a separate test set and evaluate the final performance based on these predictions.
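The three-way split described above can be sketched as follows (a hypothetical helper, not part of TreeExtra; fractions are illustrative): the validation set drives parameter selection, while the test set is touched only once, for the final estimate.

```python
import random

def three_way_split(rows, valid_frac=0.2, test_frac=0.2, seed=1):
    # Shuffle once, then carve out a test set (used only for the final
    # performance estimate) and a validation set (used to pick alpha, N
    # and the number of bagging iterations); the rest is training data.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_valid = int(len(rows) * valid_frac)
    test = rows[:n_test]
    valid = rows[n_test:n_test + n_valid]
    train = rows[n_test + n_valid:]
    return train, valid, test
```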


Multi-stage training

Due to the specifics of building AG models with different parameters - more complex models are built from simpler ones, and we do not know in advance which models will be the best - training is performed with a sequence of three commands: ag_train, ag_expand and ag_save. ag_train builds an initial set of models, ag_expand creates more complex models without rebuilding those already built, and ag_save picks a model with specific parameters and saves it in a named file. All parameters have default values, and ag_train and ag_expand give recommendations on what to call next. Here is an example of a simple training sequence:

> ag_train -t data.train -v data.valid -r data.attr
... recommendation: ag_expand -a 0.001 -n 11 -b 100

> ag_expand -a 0.001 -n 11 -b 100
... recommendation: ag_expand -b 140

> ag_expand -b 140
... recommendation: ag_save -a 0.005 -n 8

> ag_save -a 0.005 -n 8

Therefore the user can choose not to care about the parameter values at all and still get the best-performing model. To run the whole sequence automatically, check out the python script written by Alex Sorokin.
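A driver script of that kind only needs to scrape the recommendation line from the log output and feed it back as the next command. A minimal sketch of the parsing step (hypothetical helper; flag handling is simplified to flag/value pairs, which matches the recommendations ag_train and ag_expand print):

```python
import re
import shlex

def parse_recommendation(log_line):
    # Pull the recommended command out of a line such as
    # "... recommendation: ag_expand -a 0.001 -n 11 -b 100".
    m = re.search(r"recommendation:\s*(ag_\w+.*)", log_line)
    if m is None:
        return None
    tokens = shlex.split(m.group(1))
    # Assumes every flag takes exactly one value.
    return tokens[0], dict(zip(tokens[1::2], tokens[2::2]))
```

A loop would then run the returned command (e.g. via subprocess) until the recommendation switches to ag_save.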

After the model is saved, it can be used by the ag_predict command to produce predictions on new data.


Temporary files

The commands ag_expand and ag_save rely on information saved by previous runs of ag_train and ag_expand in temporary files, therefore this sequence of commands should always be run in the same directory. Temporary files are stored in the directory AGTemp, which is created by ag_train. Although referred to as temporary, these files are never deleted by any of the commands, so that you can run ag_save and ag_expand at any point in time. Once the model is saved into a separate file, it no longer relies on the temporary files and can be moved to other directories, where it can be used for producing predictions.


Training speed

As the running time can be the main issue with Additive Groves, several speed modes are provided.

Slow mode

Slow mode follows the original algorithm exactly: on every bagging iteration it builds and tests two versions of each complex Grove. The algorithm provides its best performance in the slow mode.

Fast mode

In the fast mode, Groves of all sizes are still built on every iteration; however, they are built faster, because the best path for building them is determined during the first bagging iteration and then reused by all subsequent iterations. Therefore each Grove is built twice only during the first iteration, and the running time of all other iterations is decreased by almost a factor of two. An Additive Groves model trained in the fast mode performs slightly worse than one trained in the slow mode.
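The "best path" idea can be illustrated abstractly: each cell of the (α, N) grid of Groves can be grown from the neighbor with smaller trees or from the neighbor with fewer trees, and fast mode fixes the winning parent per cell once. In this toy sketch (illustrative names; the evaluate callback stands in for actually training and testing the two candidate Groves) the path would be computed on the first bagging iteration and reused afterwards:

```python
def choose_path(grid_shape, evaluate):
    # evaluate(cell, parent) -> validation error of the Grove in `cell`
    # when grown from `parent`; a stand-in for real training and testing.
    rows, cols = grid_shape
    path = {}
    for i in range(rows):
        for j in range(cols):
            parents = []
            if i > 0:
                parents.append((i - 1, j))  # same number of trees, smaller trees
            if j > 0:
                parents.append((i, j - 1))  # same tree size, fewer trees
            if parents:  # the corner cell is the trivial Grove, grown from scratch
                path[(i, j)] = min(parents, key=lambda p: evaluate((i, j), p))
    return path
```

Slow mode corresponds to calling evaluate for both parents on every bagging iteration instead of reusing this table.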

Layered mode

In the layered mode each Grove is trained from its "left neighbor" - a Grove with the same number of trees, each of a smaller size. The running time of training in this mode is the same as in the fast mode. The models trained this way are more stable, and the layered mode is required for feature selection and interaction detection analysis. The best model parameters and the expansion recommendations produced in the layered mode correspond to the best model for interaction detection, not to the model providing the best performance on the validation set.

ag_addbag - "superfast" mode

One can decrease the fast mode running time further, by an order of magnitude, if the size of the Groves that will constitute the final model is known beforehand. In this case we need to build only the sequence of Groves of different sizes that is directly needed for iterative training of the final model. This is only a subset of the table of all possible sizes that we would have to build otherwise.

To determine the sequence, one needs to run at least one bagging iteration in the fast mode, save the model, and run ag_addbag. This command increases the number of bagged Groves in the saved ensemble. Note that you cannot change the size parameters of the model after running ag_addbag, so it should always be run only after the decision on the final values of these parameters is made.

Parallelization by bagging iterations

To further increase the speed of training, one can train several model grids in parallel in different directories (using commands ag_train / ag_expand) and then merge them using the command ag_merge.

Note: the format of ag_merge has changed between versions 2.0 and 2.1.

Each run of ag_merge merges several model grids. They should have the same size with respect to α and N. The number of bagging iterations in the resulting model grid is the sum of the numbers of bagging iterations in the original grids. The models built on different bagging iterations need to be different, so it is very important to make sure that the original model grids are created with different random seeds. ag_merge should be called in a new directory, different from the directories where the original model grids reside. Those directories (where ag_train / ag_expand were run) are passed as input arguments. The resulting merged model grid will be located in the directory where ag_merge was called. The output of ag_merge is the same as the output of ag_train / ag_expand, and the resulting model grid can be treated the same way as the output of those commands.
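Conceptually, merging is cheap because a bagged ensemble predicts the average of its base models, so merging grids amounts to concatenating their lists of Groves. A toy illustration (plain Python functions stand in for Groves; names are illustrative):

```python
def ensemble_predict(models, x):
    # A bagged ensemble averages the predictions of its base models.
    return sum(m(x) for m in models) / len(models)

def merge(*model_lists):
    # Merging bagged ensembles = concatenating their base-model lists.
    # The lists must come from runs with different random seeds,
    # otherwise the merge adds duplicates instead of diversity.
    merged = []
    for ml in model_lists:
        merged.extend(ml)
    return merged
```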

Parallelization by training different branches of the same tree

Starting with version 2.3, the Linux version of the TreeExtra package uses multithreading for training the trees: parallel branches are trained at the same time. The default number of threads is 6, but it can be changed using the input argument -h.

Commands specification

ag_train -t _train_set_ -v _validation_set_ -r _attr_file_ [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_] [-s slow|fast|layered] [-i _init_random_] [-c rms|roc] [-h _threads_]
O argument description default value
-t _train_set_ training set file name
-v _validation_set_ validation set file name
-r _attr_file_ attribute file name
-a _alpha_value_ parameter that controls max size of tree 0.01
-n _N_value_ max number of trees in a Grove 8
-b _bagging_iterations_ number of bagging iterations 60
-s slow | fast | layered training mode fast
-i _init_random_ init value for random number generator 1
-c rms|roc performance metric rms
-h _threads_ number of threads, linux version only 6

Output:

  1. Saves the whole α×N grid of models in temporary files with fixed names.
  2. Saves input parameters and training paths in separate temporary files.
  3. Outputs the α×N grid of performance on the validation set, measured by RMSE or ROC, into the file performance.txt. The same file also contains binary information about the convergence of the bagging learning curve: 1 means the learning curve has converged for the given values of (α, N), 0 means it has not.
  4. Outputs the bagging curve in the best (α, N) point into the file bagging_rms.txt (and bagging_roc.txt, if applicable).
  5. Log output gives the best (α, N) combination, the RMSE of this model on the validation set and recommendations on expanding by values of α, N or bagging iterations. (Log output is directed both to stdout and to log.txt in all commands.)
  6. The training log is saved in the log.txt file. If an old log.txt file already exists in the working directory, its contents are appended to logs.archive.txt.
ag_expand [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_] [-i _init_random_] [-h _threads_]
O argument description default value
-a _alpha_value_ parameter that controls max size of tree value used in previous train/expand session
-n _N_value_ max number of trees in a Grove value used in previous train/expand session
-b _bagging_iterations_ number of bagging iterations value used in previous train/expand session
-i _init_random_ init value for random number generator 10000 + value used in previous train/expand session
-h _threads_ number of threads, linux version only 6

Output: same as for ag_train . The log output is appended to the log.txt file.


ag_merge -d _dir1_ _dir2_ _dir3_ ...
O argument description
-d _dir1_ _dir2_ _dir3_ ... directories where the input model grids were created

Output: same as for ag_train .


ag_save [-m _model_file_name_] [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_]
O argument description default value
-m _model_file_name_ name of the output file for the model model.bin
-a _alpha_value_ parameter that controls max size of tree value with best results on validation set
-n _N_value_ max number of trees in a Grove value with best results on validation set
-b _bagging_iterations_ overall number of bagging iterations value used in previous train/expand session

Output: saves a model with given parameters in a specified file.

ag_addbag [-m _model_file_name_] [-b _bagging_iterations_] [-i _init_random_] [-h _threads_]
O argument description default value
-m _model_file_name_ name of the model file model.bin
-b _bagging_iterations_ overall number of bagging iterations previous value + 40
-i _init_random_ init value for random number generator number of bagging iterations
-h _threads_ number of threads, linux version only 6

Output:

  1. New base models are added to the ensemble.
  2. Outputs bagging curve to a fixed file bagging_rms.txt (and bagging_roc.txt, if applicable)
ag_predict -p _test_set_ -r _attr_file_ [-m _model_file_name_] [-o _output_file_name_] [-c rms|roc]
O argument description default value
-p _test_set_ cases that need predictions
-r _attr_file_ attribute file name
-m _model_file_name_ name of the input file containing the model model.bin
-o _output_file_name_ name of the output file for predictions preds.txt
-c rms|roc performance metric rms

Output:

  1. Predictions are saved into a specified output file, one prediction value per line.
  2. If true values are specified in the test file, standard output shows performance on the test set.