Additive Groves code: TreeExtra package
TreeExtra is a set of tools implementing the following algorithms:
- Additive Groves - a supervised learning algorithm
- Bagged trees with "multiple counts" feature evaluation
- Interaction detection with Additive Groves
- Feature selection by backward elimination with Additive Groves
- Effect and interaction visualization
Additive Groves is an ensemble of regression trees developed by Daria Sorokina,
Rich Caruana and Mirek Riedewald.
The feature evaluation technique referred to as "multiple counts" was developed by Art Munson
and all of the above.
All code is written by Daria Sorokina unless stated otherwise. The code is available on GitHub
under the BSD license and is free to use for any purpose. (It also makes use of external libraries available under the LGPLv2.1 license.)
Contact: Daria Sorokina
( fi...@gmail.com )
Please e-mail me any comments, suggestions, bug reports or feature requests. I am
interested in how my algorithm is doing: if you have successfully (or unsuccessfully)
applied Additive Groves to your data, I'd be happy to hear about your experience.
- 10 Feb '17. TreeExtra 2.5 is released on GitHub. See release notes for details.
- 25 Mar '16. TreeExtra 2.4 is released on GitHub
- The feature evaluation algorithm in bagged trees has changed. A score in each node is now normalized by the entropy of the split feature in that node. This way the scores of binary features become comparable with the scores of continuous features with many distinct values.
- Effect and interaction plots now consider all data, including data points with missing values. For features with a substantial number of missing values, the effect of the missing value is also plotted.
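The entropy normalization described in the 2.4 notes can be sketched roughly as follows. This is a minimal illustration under my own naming, not the actual TreeExtra code: `entropy` and `normalized_score` are hypothetical helpers, and the real implementation may compute the score and entropy differently.

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy (base 2) of a feature's empirical value distribution
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def normalized_score(raw_score, split_feature_values):
    # Divide a node's raw importance score by the entropy of the feature
    # used in the split, putting binary features and many-valued continuous
    # features on a comparable scale
    h = entropy(split_feature_values)
    return raw_score / h if h > 0 else 0.0
```

A fair binary feature carries 1 bit of entropy while a feature with eight equally likely values carries 3 bits, so raw scores of many-valued features are scaled down more aggressively, which is the comparability effect the release note describes.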
- 13 Aug '13. TreeExtra 2.3 is released
- The tree-building algorithm is modified to build more balanced trees, giving up to 10% improvement in predictive performance on some data sets. Note that the
best parameter values might differ from those produced by previous versions.
- The Linux version now makes use of multithreading and trains different branches of a tree in parallel. Training time decreased by a factor of 1.5.
- A major issue is fixed in the Windows version. It is now possible to train good models on data sets larger than 32,000 data points. (The Linux version did not
have this problem.)
- One of the bt_train output files is renamed from features.txt to feature_scores.txt to reduce the chance of a conflict with the input
data file name.
- 21 Apr '12. TreeExtra 2.2 is released
- Tree training is now faster without any impact on performance
- Treatment of missing values is improved: both probabilistic and "missing value is a separate value" approaches are evaluated in every split.
- N - the number of trees in a grove - is now increased exponentially instead of linearly. It takes the values 1, 2, 3, 4, 6, 8, 11, 16, 23, 32, 45, 64, ..., giving major running
time savings on data sets with strong additive structure.
- Several tweaks to the original Additive Groves algorithm result in further performance improvements. Namely, in most cases the convergence test and the vertical vs
horizontal step tests are now made on training data instead of validation data.
- The -x option in the vis_iplot tool allows visualizing slices of n-dimensional effect plots for higher-order interactions.
- Nominal (categorical) features are accepted in the data files as long as they are not used.
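The grove sizes listed in the 2.2 notes match rounding successive powers of sqrt(2). A minimal sketch of such a schedule, assuming that formula (this is my reconstruction, not code from the TreeExtra source):

```python
def grove_size_schedule(n_max):
    # Candidate values of N grow by roughly a factor of sqrt(2):
    # round(2^(k/2)) for k = 0, 1, 2, ..., with duplicates dropped
    # and values capped at n_max
    sizes = []
    k = 0
    while True:
        n = round(2 ** (k / 2))
        if n > n_max:
            break
        if not sizes or n != sizes[-1]:
            sizes.append(n)
        k += 1
    return sizes
```

grove_size_schedule(64) reproduces the documented sequence 1, 2, 3, 4, 6, 8, 11, 16, 23, 32, 45, 64: reaching N trees visits O(log N) candidate sizes instead of O(N), which is where the running-time savings come from.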
- 05 Oct '10. TreeExtra 2.1 is released
- More convenient interface for ag_merge: now you can run any
number of iterations in parallel and merge them with a single command.
- Memory usage is decreased almost by a factor of two, without any impact on performance.
- 13 Jun '10. Additive Groves officially got 4th place in the main track of the
Yahoo! Learning To Rank Challenge.
- 30 Dec '09. Additive Groves took third place in the supervised challenge of
2009 IEEE ICDM Data Mining Contest.
- 29 Nov '09. TreeExtra 2.0 is released.
- A new set of commands implements interaction detection, feature selection and effect /
interaction visualization with Additive Groves.
- A new training mode of Additive Groves: the option -s layered
invokes the layered training algorithm.
- The header format of the model files has changed and is not compatible with
TreeExtra 1.x versions.
- The bt_train command now saves a new attribute file in which only the
top k features are left active. Setting the -k option to -1 creates a
ranking for all features.
- 8 Oct '09. Several updates on the web site.
- 11 Aug '09. TreeExtra 1.2 is released.
- The new ag_merge command joins several model grids, so it
is now possible to parallelize the training of Additive Groves.
- A new option -l allows controlling the volume of information in the
standard output of the bagged trees commands.
- The size of temporary folder AGTemp is decreased.
- 08 Jun '09. TreeExtra 1.1 is released. ROC can now be chosen as the model evaluation metric
for binary classification problems.
TreeExtra 2.4 and up
Versions after 2.3 are released on GitHub. You can find both code and binaries for Linux and Windows there.
Earlier TreeExtra versions
Application of Additive Groves to the Yahoo! Learning to Rank Challenge.
Modeling Additive Structure and Detecting Interactions with Additive Groves of Regression Trees
CMU Machine Learning Lunch, March 2010
Video (You need to scroll down to the March 1, 2010 talk. The sound is bad only for the first few minutes.)
Daria Sorokina, Rich Caruana, Mirek Riedewald, Wes Hochachka, Steve Kelling.
Detecting and Interpreting Variable Interactions in Observational Ornithology
In proceedings of the ICDM'09 Workshop on Domain Driven Data Mining (DDDM'09).
Application of Additive Groves Ensemble with Multiple Counts Feature Evaluation to KDD Cup'09 Small
In proceedings of the KDD Cup 2009 workshop.
Modeling Additive Structure and Detecting Interactions with Groves of Trees.
PhD dissertation, Cornell University, 2008.
Daria Sorokina, Rich Caruana, Mirek Riedewald, Daniel Fink.
Detecting Statistical Interactions with Additive Groves of Trees.
In proceedings of the 25th International Conference on Machine Learning (ICML'08).
Video of ICML presentation
Daria Sorokina, Rich Caruana, Mirek Riedewald.
Additive Groves of Regression Trees.
In proceedings of the 18th European Conference on Machine Learning (ECML'07). (Best Student Paper Award)
Video of ECML presentation
R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. Hochachka,
Mining Citizen Science Data to Predict Prevalence of Wild Bird Species.
(The feature evaluation methods are described in this paper.)
In proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06).