
gradBoostTrn

Description

Gradient Boosting is an ensemble learning technique that builds a sequence of decision trees, where each tree corrects the errors of the previous ones. It is widely used for both classification and regression tasks because of its strong predictive performance. This implementation relies on the version provided by scikit-learn, which allows for flexible and efficient training. Fidex uses the rules of the generated decision trees to identify hyperplanes in the feature space that discriminate between classes. Decision rules can then be extracted, which makes the model's decisions more transparent and easier to explain. For more details on the Gradient Boosting algorithm with Dimlp, you can refer to this paper.

Arguments list

The gradBoostTrn algorithm works with both required and optional arguments. Each argument has specific properties:

  • Is required indicates whether the argument must be specified when calling the program.
  • Type specifies the argument datatype.
  • CLI argument syntax is the exact name to use if you are writing the argument along with the program call.
  • JSON identifier is the exact name to use if you are writing the argument inside a JSON configuration file.
  • Default value is the value that will be used by the program if the argument is not specified. If None, it could mean that the argument is not used at all during the algorithm execution or could also mean that you have to specify it yourself.

Show help

Display parameters and other helpful information concerning the program usage and terminate it when done.

Property Value
Is required No
Type None
CLI argument syntax -h, --help or None
JSON identifier N/A
Default value None

Warning

Every other specified argument will be ignored.


JSON configuration file

File containing the configuration for the algorithm in JSON format (see more about JSON configuration files).

Property Value
Is required No
Type String
CLI argument syntax --json-configuration-file
JSON identifier N/A
Default value None

Warning

If you use this argument, it must be the only one specified. No other argument can be specified with it.
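As an illustration, a configuration file could be generated as sketched below. The keys are the JSON identifiers documented on this page and the values mirror the usage example; the file name gradboost_config.json and its location are hypothetical, and the sketch only writes and re-reads the file without calling the program.

```python
import json
import tempfile
from pathlib import Path

# Keys are "JSON identifier" values from this page; values mirror the
# usage example. Arguments left out keep their documented defaults.
config = {
    "root_folder": "dimlp/datafiles",
    "train_data_file": "train_data.txt",
    "train_class_file": "train_class.txt",
    "test_data_file": "test_data.txt",
    "test_class_file": "test_class.txt",
    "nb_attributes": 16,
    "nb_classes": 2,
    "n_estimators": 100,
    "learning_rate": 0.1,
}

# Hypothetical file name; the temp directory is used only to keep the
# sketch free of side effects in the working directory.
config_path = Path(tempfile.gettempdir()) / "gradboost_config.json"
config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")

# Read the file back to confirm that it is valid JSON.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded["nb_attributes"], loaded["nb_classes"])
```

The resulting path is what would be passed as the only argument when using a JSON configuration file.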


Root folder path

Base path from which all other file-path arguments are resolved. Setting it lets you use paths relative to this location instead of absolute paths or lengthy relative paths.

Property Value
Is required No
Type String
CLI argument syntax --root_folder
JSON identifier root_folder
Default value .

Train data file

File containing the train portion of the dataset. It can also contain training "true classes" (see Train true classes file).

Property Value
Is required Yes
Type String
CLI argument syntax --train_data_file
JSON identifier train_data_file
Default value None

Test data file

Path to the file containing the test portion of the dataset. It can also contain testing "true classes" (see Test true classes file).

Property Value
Is required Yes
Type String
CLI argument syntax --test_data_file
JSON identifier test_data_file
Default value None

Number of attributes

Number of attributes in the dataset (should be equal to the number of inputs of the model). Takes values in the range [1,∞[.

Property Value
Is required Yes
Type Integer
CLI argument syntax --nb_attributes
JSON identifier nb_attributes
Default value None

Number of classes

Number of classes in the dataset (should be equal to the number of outputs of the model). Takes values in the range [2,∞[.

Property Value
Is required Yes
Type Integer
CLI argument syntax --nb_classes
JSON identifier nb_classes
Default value None

Train true classes file

File containing "true classes" (expected predictions), from the train portion of the dataset used to train the model.

Property Value
Is required No**
Type String
CLI argument syntax --train_class_file
JSON identifier train_class_file
Default value None

Warning

This argument is not required if, and only if, the true classes are already specified inside the train data file.


Test true classes file

File containing "true classes" (expected predictions), from the test portion of the dataset used to evaluate the model.

Property Value
Is required No**
Type String
CLI argument syntax --test_class_file
JSON identifier test_class_file
Default value None

Warning

This argument is not required if, and only if, the true classes are already specified inside the test data file.


Train prediction output file

Path to the file where the train predictions will be stored.

Property Value
Is required No
Type String
CLI argument syntax --train_pred_outfile
JSON identifier train_pred_outfile
Default value predTrain.out

Test prediction output file

Path to the file where the test predictions will be stored.

Property Value
Is required No
Type String
CLI argument syntax --test_pred_outfile
JSON identifier test_pred_outfile
Default value predTest.out

Statistics output file

Name of the output file that will contain all computed statistics.

Property Value
Is required No
Type String
CLI argument syntax --stats_file
JSON identifier stats_file
Default value stats.txt

Logs output file

Name of the file that receives all feedback produced by the algorithm during its execution. If not specified, the feedback is displayed in the terminal.

Property Value
Is required No
Type String
CLI argument syntax --console_file
JSON identifier console_file
Default value None

Rules output file

Path to the file where the gradient boosting output rules will be stored.

Property Value
Is required No
Type String
CLI argument syntax --rules_outfile
JSON identifier rules_outfile
Default value GB_rules.rls

Number of estimators

Number of generated trees in the forest. Takes values in the range [1,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --n_estimators
JSON identifier n_estimators
Default value 100

Loss function

Loss function to be used and optimized. Options are log_loss and exponential.

Property Value
Is required No
Type String
CLI argument syntax --loss
JSON identifier loss
Default value log_loss

Learning rate

Shrinks the contribution of each tree. Takes values in the range [0,∞[.

Property Value
Is required No
Type Float
CLI argument syntax --learning_rate
JSON identifier learning_rate
Default value 0.1

Subsample

Fraction of samples to be used for fitting the individual base learners. Takes values in the range ]0,1].

Property Value
Is required No
Type Float
CLI argument syntax --subsample
JSON identifier subsample
Default value 1.0

Criterion

Function to measure split quality. Options are friedman_mse and squared_error.

Property Value
Is required No
Type String
CLI argument syntax --criterion
JSON identifier criterion
Default value friedman_mse

Maximum depth

Maximum depth of the individual regression estimators. Can be an Integer in the range [2,∞[ or no_max_depth.

Property Value
Is required No
Type Integer or String
CLI argument syntax --max_depth
JSON identifier max_depth
Default value 3

Minimum of samples to split

Minimum number of samples required to split an internal node. If a float is given, it is interpreted as a fraction of the number of samples. Takes integers in the range [2,∞[ and floats in the range ]0,1].

Property Value
Is required No
Type Integer or Float
CLI argument syntax --min_samples_split
JSON identifier min_samples_split
Default value 2
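The difference between the integer and float forms can be sketched as follows. The helper name effective_min_samples_split is hypothetical, and the rounding rule (round the fraction up, never go below 2) is the convention used by scikit-learn, assumed here to carry over unchanged.

```python
import math


def effective_min_samples_split(value, n_samples):
    """Return the sample count implied by an integer or fractional setting.

    Hypothetical helper; assumes scikit-learn's rounding convention.
    """
    if isinstance(value, bool):
        raise TypeError("min_samples_split must be an integer or a float")
    if isinstance(value, int):
        # Integers are absolute sample counts in the range [2, inf[.
        if value < 2:
            raise ValueError("integer values must be at least 2")
        return value
    if isinstance(value, float):
        # Floats are fractions of the training set in the range ]0, 1].
        if not 0.0 < value <= 1.0:
            raise ValueError("float values must lie in ]0, 1]")
        return max(2, math.ceil(value * n_samples))
    raise TypeError("min_samples_split must be an integer or a float")


print(effective_min_samples_split(2, 1000))     # absolute count
print(effective_min_samples_split(0.25, 1000))  # a quarter of the samples
```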

Minimum of samples to be leaf

Minimum number of samples required to be at a leaf node. If a float is given, it is interpreted as a fraction of the number of samples. Takes integers in the range [1,∞[ and floats in the range ]0,1[.

Property Value
Is required No
Type Integer or Float
CLI argument syntax --min_samples_leaf
JSON identifier min_samples_leaf
Default value 1

Minimum weighted fraction to be leaf

Minimum weighted fraction of the sum total of input samples weights required to be at a leaf node. Takes values in the range [0,0.5].

Property Value
Is required No
Type Float
CLI argument syntax --min_weight_fraction_leaf
JSON identifier min_weight_fraction_leaf
Default value 0.0

Maximum number of features

Number of features to consider when looking for the best split. Takes integers in the range [1,∞[, floats in the range ]0,1[ (interpreted as a fraction of the number of features), or one of the strings sqrt, log2 or all. Note that 1 stands for a single feature; to use all features, write all rather than 1.0.

Property Value
Is required No
Type Integer, Float or String
CLI argument syntax --max_features
JSON identifier max_features
Default value sqrt
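The accepted forms can be compared with a short sketch. The helper name resolve_max_features is hypothetical, and the rounding (truncate, never go below one feature) follows scikit-learn's convention, which is assumed to apply here.

```python
import math


def resolve_max_features(value, n_features):
    """Return how many features are examined at each split.

    Hypothetical helper; assumes scikit-learn's rounding convention.
    """
    if isinstance(value, str):
        if value == "all":
            return n_features
        if value == "sqrt":
            return max(1, int(math.sqrt(n_features)))
        if value == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError("string values must be sqrt, log2 or all")
    if isinstance(value, bool):
        raise TypeError("max_features must be an integer, float or string")
    if isinstance(value, int):
        # An integer is an absolute number of features.
        if value < 1:
            raise ValueError("integer values must be at least 1")
        return value
    if isinstance(value, float):
        # A float is a fraction in ]0, 1[; 1.0 is rejected, use "all".
        if not 0.0 < value < 1.0:
            raise ValueError("float values must lie in ]0, 1[")
        return max(1, int(value * n_features))
    raise TypeError("max_features must be an integer, float or string")


for setting in ("sqrt", "log2", "all", 1, 0.5):
    print(setting, resolve_max_features(setting, 16))
```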

Maximum number of leaf nodes

Grow trees with a limited amount of leaf nodes in a best-first fashion. Takes values in the range [2,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --max_leaf_nodes
JSON identifier max_leaf_nodes
Default value None

Minimum impurity decrease

A node will be split if this split induces a decrease of the impurity greater than or equal to this value. Takes values in the range [0,∞[.

Property Value
Is required No
Type Float
CLI argument syntax --min_impurity_decrease
JSON identifier min_impurity_decrease
Default value 0.0

Initial estimator

Estimator object used to compute the initial predictions. The only available option is zero.

Property Value
Is required No
Type String
CLI argument syntax --init
JSON identifier init
Default value None

Seed

Seed for random number generation. Takes values in the range [0,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --seed
JSON identifier seed
Default value None

Verbosity level

Controls the verbosity when fitting and predicting. Takes values in the range [0,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --verbose
JSON identifier verbose
Default value 0

Warm start

Whether to reuse the solution of the previous call to fit and add more estimators to the ensemble.

Property Value
Is required No
Type Boolean
CLI argument syntax --warm_start
JSON identifier warm_start
Default value False

Validation Fraction

Proportion of training data to set aside as validation set for early stopping. Takes values in the range ]0,1[.

Property Value
Is required No
Type Float
CLI argument syntax --validation_fraction
JSON identifier validation_fraction
Default value 0.1

Number of non-significant iterations before stopping

Enables early stopping: training terminates when the validation score has not improved for this number of consecutive iterations. If not specified, early stopping is not used. Takes values in the range [1,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --n_iter_no_change
JSON identifier n_iter_no_change
Default value None

Tolerance

Tolerance for the early stopping. Takes values in the range [0,∞[.

Property Value
Is required No
Type Float
CLI argument syntax --tol
JSON identifier tol
Default value 0.0001
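The interplay of validation_fraction, n_iter_no_change and tol can be seen directly in scikit-learn, whose GradientBoostingClassifier uses the same parameter names. The sketch below runs on synthetic data and calls scikit-learn itself; how gradBoostTrn forwards its arguments internally is an assumption, not something stated on this page.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data standing in for a real dataset (16 attributes, 2 classes).
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=6, random_state=0
)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    validation_fraction=0.1,  # share of training data held out for validation
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
clf.fit(X, y)

# n_estimators_ is the number of trees actually fitted, which can be
# lower than n_estimators when early stopping triggers.
print(clf.n_estimators_)
```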

CCP alpha

Complexity parameter used for Minimal Cost-Complexity Pruning. Takes values in the range [0,∞[.

Property Value
Is required No
Type Float
CLI argument syntax --ccp_alpha
JSON identifier ccp_alpha
Default value 0.0

Usage example

Example

from trainings import gradBoostTrn

gradBoostTrn(
"""--train_data_file train_data.txt 
--train_class_file train_class.txt 
--test_data_file test_data.txt 
--test_class_file test_class.txt 
--stats_file gb/stats.txt 
--train_pred_outfile gb/predTrain.out 
--test_pred_outfile gb/predTest.out 
--rules_outfile gb/GB_rules.rls 
--nb_attributes 16 
--nb_classes 2 
--root_folder dimlp/datafiles"""
)

Command-line form:

./gradBoostTrn --train_data_file train_data.txt --train_class_file train_class.txt --test_data_file test_data.txt --test_class_file test_class.txt --stats_file gb/stats.txt --train_pred_outfile gb/predTrain.out --test_pred_outfile gb/predTest.out --rules_outfile gb/GB_rules.rls --nb_attributes 16 --nb_classes 2 --root_folder ../dimlp/datafiles
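When many arguments are involved, the argument string can be assembled from a dictionary. The helper build_argument_string below is hypothetical and only produces the string; it assumes every value is written as plain text after its flag and does not handle values containing spaces.

```python
def build_argument_string(options):
    """Join option names and values into a single argument string.

    Hypothetical helper: names are the CLI argument names without the
    leading dashes, values are converted with str().
    """
    parts = []
    for name, value in options.items():
        parts.append(f"--{name} {value}")
    return " ".join(parts)


arguments = build_argument_string(
    {
        "train_data_file": "train_data.txt",
        "nb_attributes": 16,
        "nb_classes": 2,
    }
)
print(arguments)
```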

Output interpretation


Train/Test prediction file

This file contains the predicted probabilities of each class for every train (or test) sample. Each row is the prediction for a single sample and holds N values, one per class, giving the probability that the sample belongs to class 0, 1, ..., N-1. The values in each row sum to 1. The class with the highest probability is considered the predicted class for that sample, unless a decision threshold is applied to a specific class. In that case, the sample is assigned to that class whenever its predicted probability exceeds the threshold.

For example:

0.000718874 0.999281
0.949143 0.050857

In the first row, the model predicts a probability of approximately 0.0007 that the sample belongs to class 0, and 0.9993 that it belongs to class 1. Therefore, the model predicts class 1 for this sample. In the second row, the model predicts a probability of 0.949 that the sample belongs to class 0, and 0.051 that it belongs to class 1. Hence, the model predicts class 0 for this sample.

Each row of probabilities allows you to interpret the model's confidence in its predictions, enabling you to understand the likelihood of each sample belonging to a particular class.
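A minimal sketch of how such a file could be read back and turned into predicted classes, assuming whitespace-separated values and no decision threshold. The function names are illustrative and the two rows are the ones shown above.

```python
sample_output = """0.000718874 0.999281
0.949143 0.050857"""


def parse_predictions(text):
    """Parse one row of class probabilities per non-empty line."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append([float(value) for value in line.split()])
    return rows


def predicted_classes(rows):
    """Return the index of the highest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


probabilities = parse_predictions(sample_output)
classes = predicted_classes(probabilities)
print(classes)  # [1, 0]
```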


Rules output file

This file contains the decision rules generated by the Gradient Boosting Decision Tree model. Each set of rules corresponds to a specific tree in the model, and these rules are used to identify the discriminant hyperplanes in the feature space during the Fidex algorithm.

File structure:

The file is organized into trees, with each tree containing a series of rules. Each tree contributes to the overall decision-making process in the Gradient Boosting model.

Rule structure:

Each rule consists of conditions on various attributes, followed by a score value. Let's break down this rule as an example:

Rule 1: X2<=0.8075000047683716 X0<=0.7166664898395538 X4<=0.5 -> value [1.69695588]
  • X2, X0, X4: These represent the variables from the dataset.
  • Rule 1: This indicates the rule number within the tree.
  • Conditions: The conditions are logical comparisons between a feature (X) and a threshold value (e.g., X1<=0.5). Multiple conditions are combined in a rule, and all conditions must be satisfied for the rule to apply.
  • Value: This represents the output value (or prediction) of the tree if the conditions of the rule are satisfied. The value is typically a contribution to the final prediction in the Gradient Boosting model.
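The structure described above can be parsed with a couple of regular expressions. This is a sketch based only on the single rule shown here: the function names are illustrative, and the operators other than <= are an assumption about what other rules may contain.

```python
import operator
import re

# Patterns derived from the example rule; ">", ">=" and "<" are assumed.
RULE_PATTERN = re.compile(r"^Rule\s+(\d+):\s*(.*?)\s*->\s*value\s*\[(.*?)\]\s*$")
CONDITION_PATTERN = re.compile(r"X(\d+)(<=|>=|<|>)(\S+)")
OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_rule(line):
    """Split a rule line into its number, conditions and value list."""
    match = RULE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised rule line: {line!r}")
    number, conditions_text, value_text = match.groups()
    conditions = [
        (int(index), op, float(threshold))
        for index, op, threshold in CONDITION_PATTERN.findall(conditions_text)
    ]
    values = [float(value) for value in value_text.split()]
    return int(number), conditions, values


def rule_applies(conditions, sample):
    """A rule applies only if every one of its conditions holds."""
    return all(OPERATORS[op](sample[index], threshold)
               for index, op, threshold in conditions)


line = ("Rule 1: X2<=0.8075000047683716 X0<=0.7166664898395538 "
        "X4<=0.5 -> value [1.69695588]")
number, conditions, values = parse_rule(line)
print(number, conditions, values)
print(rule_applies(conditions, [0.5, 0.0, 0.3, 0.0, 0.2]))
```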

Statistics file

This file contains accuracy on the training and testing sets. It offers a clear overview of the model’s performance across different datasets, helping to evaluate how well the model has learned and generalized to unseen data.

  • Accuracy: Indicates the proportion of correctly classified samples in each dataset (training or testing).
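For reference, accuracy as defined here can be recomputed from predicted and true class indices. The sketch below uses made-up class lists for illustration only; how the true classes are stored on disk is not covered by it.

```python
def accuracy(predicted, expected):
    """Proportion of samples whose predicted class matches the true class."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(predicted)


# Illustrative values only: three of four samples are classified correctly.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```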