CGAL 5.4 - Classification
#include <CGAL/Classification/OpenCV/Random_forest_classifier.h>
Classifier based on the OpenCV version of the random forest algorithm.
Constructor | |
Random_forest_classifier (const Label_set &labels, const Feature_set &features, int max_depth=20, int min_sample_count=5, int max_categories=15, int max_number_of_trees_in_the_forest=100, float forest_accuracy=0.01f) | |
instantiates the classifier using the sets of labels and features. | |
Parameters | |
void | set_max_depth (int max_depth) |
void | set_min_sample_count (int min_sample_count) |
void | set_max_categories (int max_categories) |
void | set_max_number_of_trees_in_the_forest (int max_number_of_trees_in_the_forest) |
void | set_forest_accuracy (float forest_accuracy) |
Training | |
template<typename LabelIndexRange > | |
void | train (const LabelIndexRange &ground_truth) |
runs the training algorithm. | |
Input/Output | |
void | save_configuration (const char *filename) |
saves the current configuration in the file named filename. | |
void | load_configuration (const char *filename) |
loads a configuration from the file named filename. | |
CGAL::Classification::OpenCV::Random_forest_classifier::Random_forest_classifier (const Label_set &labels, const Feature_set &features, int max_depth = 20, int min_sample_count = 5, int max_categories = 15, int max_number_of_trees_in_the_forest = 100, float forest_accuracy = 0.01f)
instantiates the classifier using the sets of labels and features.
The parameter documentation below is copied from the official OpenCV documentation. Please refer to it for more details on this method.
labels | label set used. |
features | feature set used. |
max_depth | the depth of the tree. A low value will likely underfit and conversely a high value will likely overfit. The optimal value can be obtained using cross validation or other suitable methods. |
min_sample_count | minimum samples required at a leaf node for it to be split. A reasonable value is a small percentage of the total data e.g. 1%. |
max_categories | Cluster possible values of a categorical variable into \( K \leq max\_categories \) clusters to find a suboptimal split. If a discrete variable, on which the training procedure tries to make a split, takes more than max_categories values, the precise best subset estimation may take a very long time because the algorithm is exponential. Instead, many decision tree engines (including ML) try to find a suboptimal split in this case by clustering all the samples into max_categories clusters, that is, some categories are merged together. The clustering is applied only in \( n \)-class classification problems with \( n > 2 \), for categorical variables with \( N > max\_categories \) possible values. In case of regression and 2-class classification, the optimal split can be found efficiently without employing clustering, so the parameter is not used in these cases. |
max_number_of_trees_in_the_forest | The maximum number of trees in the forest. Typically, the more trees you have, the better the accuracy. However, the improvement in accuracy generally diminishes and levels off past a certain number of trees. Also keep in mind that prediction time increases linearly with the number of trees. |
forest_accuracy | Sufficient accuracy (out-of-bag, or OOB, error). |
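The description of min_sample_count suggests a small percentage of the total data, for example 1%. A minimal sketch of that rule of thumb follows. The helper name `suggest_min_sample_count` and its `fraction` parameter are hypothetical and are not part of CGAL or OpenCV; the lower bound of 1 is also an assumption made here so that tiny inputs still yield a usable value.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not part of CGAL or OpenCV): derives a candidate
// min_sample_count as a fraction of the training-set size.
inline int suggest_min_sample_count(std::size_t number_of_items,
                                    double fraction = 0.01)
{
  const double raw = static_cast<double>(number_of_items) * fraction;
  // Assumption: never go below 1, so the value stays usable for tiny inputs.
  return std::max(1, static_cast<int>(raw));
}
```

The returned value could then be passed to set_min_sample_count() or to the constructor. Cross validation remains the more reliable way to pick it.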
void CGAL::Classification::OpenCV::Random_forest_classifier::load_configuration (const char *filename)
loads a configuration from the file named filename.
The input file should be in the XML format written by the save_configuration() method. The feature set of the classifier should contain the exact same features in the exact same order as the ones present when the file was generated using save_configuration().
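Because a loaded configuration is only meaningful when the features match in content and order, it can help to record the feature names alongside the saved file and compare them before loading. The sketch below assumes the caller keeps such a list of names; the helper `first_feature_mismatch` is hypothetical and not part of CGAL.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of CGAL): returns the index of the first
// position at which the two feature-name lists differ, or -1 if they are
// identical in content and order.
inline long first_feature_mismatch(const std::vector<std::string>& saved,
                                   const std::vector<std::string>& current)
{
  const std::size_t common = std::min(saved.size(), current.size());
  for (std::size_t i = 0; i < common; ++i)
    if (saved[i] != current[i])
      return static_cast<long>(i);
  // One list is a strict prefix of the other: they differ at its end.
  if (saved.size() != current.size())
    return static_cast<long>(common);
  return -1;
}
```

A result other than -1 would indicate that the feature set should be rebuilt before calling load_configuration().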
void CGAL::Classification::OpenCV::Random_forest_classifier::save_configuration (const char *filename)
saves the current configuration in the file named filename.
This makes it easy to save and recover a specific classification configuration.
The output file is written in an XML format that is readable by the load_configuration() method.
void CGAL::Classification::OpenCV::Random_forest_classifier::train (const LabelIndexRange &ground_truth)
runs the training algorithm.
From the provided ground truth, this algorithm sets up the random trees that produce the most accurate result with respect to this ground truth.
ground_truth | vector of label indices. It should contain, for each input item and in the same order as the input set, the index of the corresponding label in the Label_set provided in the constructor. Input items that do not have ground truth information should be given the value -1. |
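The layout expected for ground_truth can be sketched as follows: one entry per input item, in input order, with -1 for items that have no ground truth. The helper `make_ground_truth` and the sparse (item index, label index) annotation format are hypothetical and only serve the illustration; they are not part of CGAL.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of CGAL): expands sparse annotations,
// given as (item index, label index) pairs, into a dense ground-truth
// vector with one entry per input item. Unannotated items keep -1.
inline std::vector<int> make_ground_truth(
    std::size_t number_of_items,
    const std::vector<std::pair<std::size_t, int>>& annotations)
{
  std::vector<int> ground_truth(number_of_items, -1);
  for (const auto& annotation : annotations)
  {
    assert(annotation.first < number_of_items); // item index must be valid
    assert(annotation.second >= 0);             // label index must be valid
    ground_truth[annotation.first] = annotation.second;
  }
  return ground_truth;
}
```

The resulting vector has the shape described above and could be passed as the ground_truth argument of train(), provided the label indices refer to the Label_set given to the constructor.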