<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Bagged Trees with Feature Evaluation</title>
<link rel="stylesheet" type="text/css" href="TreeExtra.css" />
</head>
<body style="font-family: Times New Roman">
<a href="http://additivegroves.net">Back to TreeExtra package web page</a>
<h2> Bagged Trees with Feature Evaluation </h2>
This code provides tools for building bagged tree models with an extremely fast built-in feature evaluation technique. We recommend using this tool as a preprocessing step for Additive Groves when the original data set contains a large number of features. On average, Additive Groves builds good models using the top 20 features identified by our feature evaluation technique. This tool can also be used for building bagged tree models <em>per se</em>.<br />
<br />
<h3> Bagging </h3>
Bagging was invented by Leo Breiman [1]. This ensemble technique decreases the variance of the original single models by building every new model on a bootstrap sample of the training set and averaging the predictions of those models. In this tool bagging is applied to decision trees.<br />
The tree-growing algorithm used in this code reflects the author's strong view that every tree-based ensemble should treat the size of its trees as an external parameter. In this implementation the size of trees is controlled through the input parameter &alpha;. &alpha; defines the maximum proportion of the training data in a leaf, so in a sense it is inversely related to the size of the tree. &alpha; = 1 produces a stump, &alpha; = 0 produces a full tree. The following values of &alpha; can be used in training: 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, ... , 0. <br />
Note that the popular belief that bagging the largest possible trees gives the best performance is a myth: noisy data sets are often handled much better by smaller trees.
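<br />
As a rough illustration of how the tree-size parameter (alpha, passed with <span class="code">-a</span>) relates to leaf size, the sketch below converts each of the listed values into a cap on the number of training points per leaf. The helper <span class="code">max_leaf_size</span>, the training set size of 10000 and the rounding rule are assumptions made for this example; they are not taken from the TreeExtra code.<br />
<pre>
```python
def max_leaf_size(alpha, n_train):
    """Cap on the number of training points in one leaf for a given alpha.

    Assumption: the cap is alpha * n_train rounded down, with a floor of one
    point, so that alpha = 0 corresponds to a fully grown tree.
    """
    return max(1, int(alpha * n_train))


# Values of alpha listed in the text (the elided values between 0.005 and 0 are omitted).
alphas = [1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0]
n_train = 10000  # hypothetical training set size

leaf_limits = {alpha: max_leaf_size(alpha, n_train) for alpha in alphas}

for alpha, limit in leaf_limits.items():
    print(f"alpha = {alpha}: at most {limit} training points per leaf")
```
</pre>
Smaller values of alpha permit smaller leaves and therefore larger trees.<br />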
The best strategy is to build several ensembles with different values of &alpha; and compare their performance on the validation set.<br />
Root mean squared error is used both as the splitting criterion and as the performance measure, so this tool can be used for both binary classification and regression data sets.<br />
<br />
<em>[1] Leo Breiman.</em> <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9399"> Bagging predictors</a>. Machine Learning 24(2), 1996<br />
<br />
<br />
<h3> Feature evaluation technique: multiple counts</h3>
<p> Most feature selection techniques require repeated training of models using different combinations of features. When the number of features in the data set is large, such an approach can be infeasible in practice.<br />
In [2] we suggested a feature evaluation technique referred to as <em>multiple counts</em>. It ranks the features based solely on how they are used by a single ensemble of bagged trees. Empirical results show that this technique produces a ranking very similar to that of a more expensive sensitivity analysis method commonly used for this purpose. </p>
<p> The <span class="code">bt_train</span> command implements bagging with feature evaluation. The <span class="code">-k</span> argument specifies the number of top features to include in the output. <span class="code">k</span> = -1 means that all features should be ranked; <span class="code">k</span> = 0 means that no ranking is needed. If you want to use this feature evaluation as a fast feature selection method, <span class="code">bt_train</span> generates a new attribute file in which only the top <span class="code">k</span> features are left active. </p>
<em>[2] R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. Hochachka, S. Kelling.
<br /> </em><a href="http://additivegroves.net/papers/BirdMining.pdf" onClick="javascript: _gaq.push(['_trackPageview', 'BirdMining.pdf']);">Mining Citizen Science Data to Predict Prevalence of Wild Bird Species.</a> In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). <br />
<br />
<br />
<h3> Train, test and validation set </h3>
It is highly recommended to use a separate validation set for tuning &alpha;. However, a validation set is not directly required for training itself. Some data must be provided as a validation set to the train command, but it is used solely to check whether the bagging curve has converged.<br />
<br /><br />
<h3> Commands specification</h3>
<span class="codeblue" >bt_train -t _train_set_ -v _validation_set_ -r _attr_file_ [-a _alpha_value_] [-b _bagging_iterations_] [-i _init_random_] [-m _model_file_name_] [-k _attributes_to_output_] [-l log|nolog] [-c rms|roc]</span>
<table border="1">
<tr> <td style="width: 26px; height: 24px"> <span style="font-family: Wingdings">O</span></td> <td style="width: 124px; font-family: Times New Roman; height: 24px"> <strong>argument</strong></td> <td style="width: 454px; font-family: Times New Roman; height: 24px"> <strong>description</strong></td> <td style="width: 115px; font-family: Times New Roman; height: 24px"> <strong>default value</strong></td> </tr>
<tr> <td>-t</td> <td>_train_set_</td> <td>training set file name</td> <td></td> </tr>
<tr> <td>-v</td> <td>_validation_set_</td> <td>validation set file name</td> <td></td> </tr>
<tr> <td>-r</td> <td>_attr_file_</td> <td>attribute file name</td> <td></td> </tr>
<tr> <td>-a</td> <td>_alpha_value_</td> <td>parameter that controls the size of trees (maximum proportion of training data in a leaf)</td> <td>0</td> </tr>
<tr> <td>-b</td> <td>_bagging_iterations_</td> <td>number of bagging iterations</td> <td>100</td> </tr>
<tr> <td>-i</td> <td>_init_random_</td> <td>initial value (seed) for the random number generator</td> <td>1</td> </tr>
<tr> <td>-m</td>
<td>_model_file_name_</td> <td>name of the output file for the model</td> <td>model.bin</td> </tr>
<tr> <td>-k</td> <td>_attributes_to_output_</td> <td>number of ranked features to output (-1 = all)</td> <td>0 (no feature selection)</td> </tr>
<tr> <td>-l</td> <td>log | nolog</td> <td>amount of log output to stdout</td> <td>log</td> </tr>
<tr> <td>-c</td> <td>rms|roc</td> <td>performance metric used in the output</td> <td>rms</td> </tr>
</table>
<p> Output: </p>
<ol>
<li>Saves the resulting model into the specified file.</li>
<li>Outputs the bagging curve on the validation set into bagging_rms.txt (and bagging_roc.txt, if applicable).</li>
<li>Outputs the list of the k top-ranked features with their scores and column numbers into features.txt. Set k to -1 to rank all features.</li>
<li>Saves a copy of the attribute file in which only the top k features are active. The name of the new file has the suffix ".fs" before the file extension.</li>
<li>The training log is saved to log.txt. If the log flag is on, the full log is also shown on standard output.</li>
<li>If the log flag is off, standard output shows performance on the validation set only.</li>
</ol>
<br />
<span class="codeblue" >bt_predict -p _test_set_ -r _attr_file_ [-m _model_file_name_] [-o _output_file_name_] [-l log|nolog] [-c rms|roc]</span>
<table border="1">
<tr> <td style="width: 26px; height: 24px"> <span style="font-family: Wingdings">O</span></td> <td style="width: 180px; font-family: Times New Roman; height: 24px"> <strong>argument</strong></td> <td style="width: 454px; font-family: Times New Roman; height: 24px"> <strong>description</strong></td> <td style="width: 115px; font-family: Times New Roman; height: 24px"> <strong>default value</strong></td> </tr>
<tr> <td>-p</td> <td>_test_set_</td> <td>cases that need predictions</td> <td></td> </tr>
<tr> <td>-r</td> <td>_attr_file_</td> <td>attribute file name</td> <td></td> </tr>
<tr> <td>-m</td> <td>_model_file_name_</td> <td>name of the input file containing the model</td> <td>model.bin</td> </tr>
<tr> <td>-o</td> <td>_output_file_name_</td> <td>name of the output file for predictions</td> <td>preds.txt</td> </tr>
<tr> <td>-l</td> <td>log | nolog</td> <td>amount of log output to stdout</td> <td>log</td> </tr>
<tr> <td>-c</td> <td>rms|roc</td> <td>performance metric used in the output</td> <td>rms</td> </tr>
</table>
<p> Output: </p>
<ol>
<li>Predictions are saved into the specified output file, one prediction value per line.</li>
<li>If true values are specified in the test file, performance on the test set is written to the log.</li>
<li>If the log flag is off, standard output shows performance on the test set only.</li>
</ol>
<script type="text/javascript"> var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-13054580-2']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })(); </script>
</body>
</html>