CGAL 5.2 - Classification
CGAL::Classification::ETHZ::Random_forest_classifier Class Reference

#include <CGAL/Classification/ETHZ/Random_forest_classifier.h>

Definition

Classifier based on the ETH Zurich version of the random forest algorithm [2].

Note
This classifier is distributed under the MIT license.
Is Model Of:
CGAL::Classification::Classifier

Constructor

 Random_forest_classifier (const Label_set &labels, const Feature_set &features)
 instantiates the classifier using the sets of labels and features.
 
 Random_forest_classifier (const Random_forest_classifier &other, const Feature_set &features)
 copies the other classifier's configuration using another set of features.
 

Training

template<typename ConcurrencyTag , typename LabelIndexRange >
void train (const LabelIndexRange &ground_truth, bool reset_trees=true, std::size_t num_trees=25, std::size_t max_depth=20)
 runs the training algorithm.
 

Miscellaneous

void get_feature_usage (std::vector< std::size_t > &count) const
 computes, for each feature, how many nodes in the forest use it as a split criterion.
 

Input/Output

void save_configuration (std::ostream &output) const
 saves the current configuration in the stream output.
 
void load_configuration (std::istream &input)
 loads a configuration from the stream input.
 

Deprecated Input/Output

static void convert_deprecated_configuration_to_new_format (std::istream &input, std::ostream &output)
 converts a deprecated configuration (in compressed ASCII format) to a new configuration (in binary format).
 

Constructor & Destructor Documentation

◆ Random_forest_classifier()

CGAL::Classification::ETHZ::Random_forest_classifier::Random_forest_classifier ( const Random_forest_classifier &  other,
const Feature_set &  features 
)

copies the other classifier's configuration using another set of features.

This constructor can be used to apply a trained random forest to another data set.

Warning
The feature set should be composed of the same features as the ones used by other, in the same order.
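
The ordering requirement in the warning above can be checked before the copy is constructed. The following is a minimal sketch in standard C++ only. It assumes the feature names of both sets have already been collected into `std::vector<std::string>` (how they are obtained from a `Feature_set` is left out), and `first_feature_mismatch` is a hypothetical helper, not part of CGAL.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of CGAL): returns the index of the first
// position where the two feature-name lists differ, or -1 if both lists
// contain the same names in the same order.
long first_feature_mismatch(const std::vector<std::string>& trained,
                            const std::vector<std::string>& target)
{
  const std::size_t common = std::min(trained.size(), target.size());
  for (std::size_t i = 0; i < common; ++i)
    if (trained[i] != target[i])
      return static_cast<long>(i);
  // Same prefix: the lists only match if they also have the same length.
  if (trained.size() != target.size())
    return static_cast<long>(common);
  return -1;
}
```

A result other than -1 suggests that the trained forest should not be reused with the second feature set as it stands.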

Member Function Documentation

◆ convert_deprecated_configuration_to_new_format()

static void CGAL::Classification::ETHZ::Random_forest_classifier::convert_deprecated_configuration_to_new_format ( std::istream &  input,
std::ostream &  output 
)

converts a deprecated configuration (in compressed ASCII format) to a new configuration (in binary format).

The input file should be a GZIP container written by the save_configuration() method from CGAL 5.1 and earlier. The output is a valid configuration for CGAL 5.2 and later.

Note
This function depends on the Boost libraries Serialization and IO Streams (compiled with the GZIP dependency).
Examples:
Classification/example_deprecated_conversion.cpp.

◆ get_feature_usage()

void CGAL::Classification::ETHZ::Random_forest_classifier::get_feature_usage ( std::vector< std::size_t > &  count) const

computes, for each feature, how many nodes in the forest use it as a split criterion.

Each tree of the random forest recursively splits the training data set using at each node one of the input features. This method counts, for each feature, how many times it was selected by the training algorithm as a split criterion.

This method makes it possible to evaluate how useful a feature was with respect to a training set. If a feature is used a lot, it has strong discriminative power with respect to how the labels are represented by the feature set. Conversely, if a feature is rarely used, its discriminative power is probably low. If a feature is never used, it is likely of no interest at all and uncorrelated with the label segmentation of the training set.

Parameters
count: vector where the result is stored. After running the method, it contains, for each feature, the number of nodes in the forest that use it as a split criterion, in the same order as the feature set order.
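
As an illustration of how the counts might be interpreted, the sketch below ranks features by usage and lists the ones that are never used. It is written in standard C++ only and assumes `count` has already been filled by `get_feature_usage()`. Both helpers are hypothetical and not part of CGAL.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of CGAL): given the per-feature node counts
// (in feature-set order), returns the feature indices sorted from the most
// used to the least used. Ties keep their feature-set order.
std::vector<std::size_t>
rank_features_by_usage(const std::vector<std::size_t>& count)
{
  std::vector<std::size_t> order(count.size());
  std::iota(order.begin(), order.end(), std::size_t(0));
  std::stable_sort(order.begin(), order.end(),
                   [&count](std::size_t a, std::size_t b)
                   { return count[a] > count[b]; });
  return order;
}

// Hypothetical helper: indices of the features that no node uses as a split
// criterion, i.e. candidates for removal from the feature set.
std::vector<std::size_t>
unused_features(const std::vector<std::size_t>& count)
{
  std::vector<std::size_t> unused;
  for (std::size_t i = 0; i < count.size(); ++i)
    if (count[i] == 0)
      unused.push_back(i);
  return unused;
}
```

The returned indices refer to positions in the feature set, since `count` follows the feature set order.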

◆ load_configuration()

void CGAL::Classification::ETHZ::Random_forest_classifier::load_configuration ( std::istream &  input)

loads a configuration from the stream input.

The input file should be a binary file written by the save_configuration() method. The feature set of the classifier should contain the exact same features in the exact same order as the ones present when the file was generated using save_configuration().

Warning
If the file you are trying to load was saved using CGAL 5.1 or earlier, you have to convert it first using convert_deprecated_configuration_to_new_format(), because the exchange format for ETHZ Random Forest changed in CGAL 5.2.

◆ save_configuration()

void CGAL::Classification::ETHZ::Random_forest_classifier::save_configuration ( std::ostream &  output) const

saves the current configuration in the stream output.

This makes it easy to save and recover a specific classification configuration.

The output file is written in a binary format that is readable by the load_configuration() method.
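
Because the configuration is binary, file streams are best opened in binary mode when saving and loading. The sketch below shows one possible round trip. `save_to_file` and `load_from_file` are hypothetical helpers that only assume the two member functions documented here, and `Toy_classifier` is a stand-in that exists solely so the sketch compiles without CGAL.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helpers (not part of CGAL). They only rely on the classifier
// exposing save_configuration(std::ostream&) and
// load_configuration(std::istream&), as documented on this page.
template <typename Classifier>
bool save_to_file(const Classifier& classifier, const std::string& path)
{
  // Binary mode prevents newline translation from corrupting the data.
  std::ofstream output(path, std::ios::binary);
  if (!output)
    return false;
  classifier.save_configuration(output);
  output.flush();
  return static_cast<bool>(output);
}

template <typename Classifier>
bool load_from_file(Classifier& classifier, const std::string& path)
{
  std::ifstream input(path, std::ios::binary);
  if (!input)
    return false; // file missing or unreadable
  classifier.load_configuration(input);
  return !input.bad();
}

// Stand-in used only to make this sketch compile without CGAL.
struct Toy_classifier
{
  std::string state;
  void save_configuration(std::ostream& output) const { output << state; }
  void load_configuration(std::istream& input)
  {
    std::ostringstream buffer;
    buffer << input.rdbuf();
    state = buffer.str();
  }
};
```

With CGAL, a `Random_forest_classifier` would take the place of `Toy_classifier`. The classifier that loads the file must be built on the same features, in the same order, as the one that saved it.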

◆ train()

template<typename ConcurrencyTag , typename LabelIndexRange >
void CGAL::Classification::ETHZ::Random_forest_classifier::train ( const LabelIndexRange &  ground_truth,
bool  reset_trees = true,
std::size_t  num_trees = 25,
std::size_t  max_depth = 20 
)

runs the training algorithm.

From the set of provided ground truth, this algorithm sets up the random trees that produce the most accurate result with respect to this ground truth.

Precondition
At least one ground truth item should be assigned to each label.
Template Parameters
ConcurrencyTag: selects the sequential or the parallel version of the algorithm. Possible values are Parallel_tag (default value if CGAL is linked with TBB) or Sequential_tag (default value otherwise).
Parameters
ground_truth: vector of label indices. It should contain, for each input item and in the same order as the input set, the index of the corresponding label in the Label_set provided to the constructor. Input items without ground truth information should be given the value -1.
reset_trees: should be set to false if the user wants to add new trees to the existing forest, and kept at true if the training should be recomputed from scratch (discarding the current forest).
num_trees: number of trees generated by the training algorithm. Higher values may improve results at the cost of longer computation times (in general, a few dozen trees are enough).
max_depth: maximum depth of the trees. Higher values improve how closely the forest fits the training set. An overly low value will likely underfit the data and, conversely, an overly high value will likely overfit.
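
The ground truth layout and the precondition can be illustrated with a short sketch in standard C++. It assumes label indices are stored as `int` so that -1 can mark items without ground truth. Both helpers are hypothetical and not part of CGAL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CGAL): builds a ground truth vector for
// `num_items` input items, all initially without ground truth (-1).
std::vector<int> make_unlabeled_ground_truth(std::size_t num_items)
{
  return std::vector<int>(num_items, -1);
}

// Hypothetical helper: checks the precondition of train(), i.e. that every
// label index in [0, num_labels) is assigned to at least one item.
bool every_label_has_ground_truth(const std::vector<int>& ground_truth,
                                  std::size_t num_labels)
{
  std::vector<bool> seen(num_labels, false);
  for (int label : ground_truth)
  {
    if (label < 0)
      continue; // item without ground truth information
    assert(static_cast<std::size_t>(label) < num_labels);
    seen[static_cast<std::size_t>(label)] = true;
  }
  for (bool s : seen)
    if (!s)
      return false;
  return true;
}

// With CGAL available, and following the signature documented above,
// training would then look like this (not compiled here):
//   classifier.train<CGAL::Sequential_tag>(ground_truth, true, 25, 20);
```

Running such a check before calling `train()` turns a violated precondition into an explicit, early failure.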