
Bagged Trees with Feature Evaluation

This code provides tools for building bagged tree models with an extremely fast built-in feature evaluation technique. It is recommended as a preprocessing step for Additive Groves when the original data set contains a large number of features. The tool can also be used to build bagged tree models in their own right.

Bagging

Bagging was invented by Leo Breiman [1]. This ensemble technique decreases the variance of the original single models by building every new model on a bootstrap sample of the training set and averaging the predictions of those models. In this tool, bagging is applied to decision trees.
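The bagging loop can be sketched in a few lines. This is a minimal illustration, not TreeExtra's implementation: the base learner here is a hypothetical one-feature decision stump, whereas the real tool builds full decision trees.

```python
import random

def bagged_predict(train, test_x, build_model, n_bags=100, seed=1):
    """Bagging: fit each model on a bootstrap sample of the training
    set, then average the models' predictions."""
    random.seed(seed)
    preds = [0.0] * len(test_x)
    for _ in range(n_bags):
        # bootstrap: sample the training set with replacement
        boot = [random.choice(train) for _ in train]
        model = build_model(boot)
        for i, x in enumerate(test_x):
            preds[i] += model(x)
    return [p / n_bags for p in preds]

def stump(data):
    """Toy base learner: a stump on a single scalar feature, splitting
    at the mean of x and predicting the mean target on each side."""
    thr = sum(x for x, _ in data) / len(data)
    left = [y for x, y in data if x <= thr]
    right = [y for x, y in data if x > thr]
    if not right:  # degenerate split: predict a constant
        mean = sum(left) / len(left)
        return lambda x: mean
    lm = sum(left) / len(left)
    rm = sum(right) / len(right)
    return lambda x: lm if x <= thr else rm
```

With 0/1 targets the averaged predictions land in [0, 1], which is why the same machinery serves both regression and binary classification.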
In this implementation, the size of the trees is controlled through the input parameter α. α bounds the maximum percentage of training data in a leaf*, so in some sense it is inversely related to the size of the tree: α = 1 produces a stump, α = 0 a full tree. The following values of α can be used in training: 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, ..., 0.
The best strategy is to build several ensembles with different values of α and compare their performance on the validation set.
Root mean squared error is used both as the splitting criterion and as the performance measure, so this tool can be used for both binary classification and regression data sets.
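For reference, the metric is the standard RMSE; with 0/1 labels it doubles as a binary classification loss. A minimal sketch:

```python
import math

def rms(preds, targets):
    """Root mean squared error: the splitting criterion and the default
    performance measure of the bagged trees tool."""
    assert len(preds) == len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))
```

For example, rms([0.9, 0.1], [1, 0]) treats the 0/1 class labels as regression targets, so no separate classification mode is needed.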

* In versions 2.3 and up, the maximum size of a leaf is defined by both α and the height of the branch. This way lower nodes are less likely to be split than higher nodes, and the resulting trees are more balanced. If you are interested in the exact algorithm, read the code or contact Daria.
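The recommended α sweep is easy to script. This sketch only assembles the bt_train invocations to run; the file names (train.txt, valid.txt, data.attr) are placeholders, and the resulting models' validation performance would then be compared by hand or by a wrapper script.

```python
# Assemble one bt_train call per recommended alpha value.
# train.txt / valid.txt / data.attr are placeholder file names.
ALPHAS = [1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0]

def sweep_commands(train="train.txt", valid="valid.txt", attr="data.attr"):
    return [
        f"bt_train -t {train} -v {valid} -r {attr} -a {a} -m model_a{a}.bin"
        for a in ALPHAS
    ]
```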

[1] Leo Breiman. Bagging predictors. Machine Learning 24(2), 1996


Feature evaluation technique: multiple counts

Most feature selection techniques require repeated training of models on different combinations of features. When the number of features in the data set is large, such an approach can be infeasible in practice.
In [2] we suggested a feature evaluation technique referred to as multiple counts. It ranks the features based solely on how they are used by a single ensemble of bagged trees: multiple counts scores every feature by the number of data points present in the nodes split on that feature. Empirical results show that this technique produces a ranking very similar to a much more expensive sensitivity analysis method commonly used for this purpose.
Starting with TreeExtra 2.4, this score is also normalized by the feature entropy to ensure comparability of features with different numbers of values.
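The scoring itself is a single pass over the trained trees. The sketch below is a stand-in: the node layout (dicts with "feature", "n_points", "left", "right") is hypothetical and not TreeExtra's actual model format, and dividing the raw count by the feature entropy is our reading of the normalization described above.

```python
import math
from collections import defaultdict

def multiple_counts(trees):
    """Multiple counts: a feature's score is the total number of data
    points reaching the internal nodes split on that feature, summed
    over all trees of the ensemble.  Node layout is a hypothetical
    stand-in, not TreeExtra's model format."""
    scores = defaultdict(float)

    def walk(node):
        if not node or "feature" not in node:  # leaf or absent child
            return
        scores[node["feature"]] += node["n_points"]
        walk(node.get("left"))
        walk(node.get("right"))

    for tree in trees:
        walk(tree)
    return dict(scores)

def entropy(column):
    """Empirical entropy (in bits) of one feature's value distribution;
    dividing a score by this makes features with different numbers of
    values comparable."""
    counts = defaultdict(int)
    for v in column:
        counts[v] += 1
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```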

The bt_train command implements bagging with feature evaluation. The -k argument specifies the number of top features to include in the output: k = -1 means that all features should be ranked, k = 0 that no ranking is needed. If you want to use this feature evaluation as fast feature selection, bt_train also generates a new attribute file in which only the top k features are left active.
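The -k semantics reduce to a simple selection rule over the feature scores, sketched here for illustration:

```python
def top_k_features(scores, k):
    """Mimic the -k semantics of bt_train: k == 0 disables ranking,
    k == -1 ranks every feature, any other k keeps the k best."""
    if k == 0:
        return []
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if k == -1 else ranked[:k]
```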

[2] R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. Hochachka, S. Kelling.
Mining Citizen Science Data to Predict Prevalence of Wild Bird Species. In proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06).


Train, test, and validation sets

It is highly recommended to use a separate validation set for tuning α. However, a validation set is not strictly required for training: some data must be passed to the train command as a validation set, but it is used solely for checking whether the bagging curve has converged.


Commands specification

bt_train -t _train_set_ -v _validation_set_ -r _attr_file_ [-a _alpha_value_] [-b _bagging_iterations_] [-i _init_random_] [-m _model_file_name_] [-k _attributes_to_output_] [-o _output_file_name_] [-l log|nolog] [-c rms|roc] [-h _threads_]
option  argument  description  default value
-t _train_set_ training set file name
-v _validation_set_ validation set file name
-r _attr_file_ attribute file name
-a _alpha_value_ parameter that controls max size of tree 0
-b _bagging_iterations_ number of bagging iterations 100
-i _init_random_ init value for random number generator 1
-m _model_file_name_ name of the output file for the model model.bin
-k _attributes_to_output_ number of ranked features to output (-1 = all) 0 (no feature selection)
-o _output_file_name_ name of the output file with the prediction scores on the validation data preds.txt
-l log | nolog amount of log output to stdout log
-c rms|roc performance metric used in the output rms
-h _threads_ number of threads (Linux version only) 6

Output:

  1. Saves the resulting model into the specified file.
  2. Outputs the bagging curve on the validation set into bagging_rms.txt (and bagging_roc.txt, if applicable).
  3. Outputs the list of the k top-ranked features, with their scores and column numbers, into feature_scores.txt. Set k to -1 to rank all features.
  4. Saves an attribute file with only the top k features active. The name of the new file gets the suffix ".fs" before the file extension.
  5. Saves predictions into the specified output file, one prediction value per line.
  6. Saves the training log in log.txt. If an old log.txt already exists in the working directory, its contents are appended to logs.archive.txt.
  7. If the log flag is on, the full log is shown in the standard output; if it is off, standard output shows only performance on the validation set.
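The bagging curve from item 2 is what tells you whether more bagging iterations would help. A simple convergence heuristic, assuming bagging_rms.txt holds one validation RMSE per iteration (an assumption about the file layout):

```python
def curve_converged(rms_values, window=10, tol=1e-3):
    """Heuristic check on a bagging curve: converged if the RMSE varies
    by less than tol over the last `window` iterations."""
    if len(rms_values) <= window:
        return False
    tail = rms_values[-(window + 1):]
    return max(tail) - min(tail) < tol
```

If the curve is still dropping, rerun bt_train with a larger -b value.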

bt_predict -p _test_set_ -r _attr_file_ [-m _model_file_name_] [-o _output_file_name_] [-l log|nolog] [-c rms|roc]
option  argument  description  default value
-p _test_set_ cases that need predictions
-r _attr_file_ attribute file name
-m _model_file_name_ name of the input file containing the model model.bin
-o _output_file_name_ name of the output file for predictions preds.txt
-l log | nolog amount of log output to stdout log
-c rms|roc performance metric used in the output rms

Output:

  1. Saves predictions into the specified output file, one prediction value per line.
  2. If true values are provided in the test file, performance on the test set is saved to the log.
  3. If the log flag is on, the full log is shown in the standard output; if it is off, standard output shows only performance on the test set.