normalization¶
Description¶
The normalization is a method that adjusts the scale of data attributes, typically by converting values to a standard range (e.g., zero mean and unit variance), facilitating more efficient and accurate model training. This algorithm offers flexibility by allowing normalization based on pre-defined parameters, calculated statistics from provided data files, or manual input of mean and standard deviation. Additionally, it supports denormalizing rule files for more interpretable results. The process generates normalized data files and denormalized rule files, and is especially useful when preparing data for Dimlp models.
Keep the following in mind:
- Normalization is recommended before training with DimlpTrn and DimlpBT.
- Normalization is not necessary for CNN, MLP, and SVM as normalization is handled internally during the training process.
- Decision trees (e.g., Gradient Boosting, Random Forests) do not require normalization as they are robust to unnormalized data.
Do not forget to denormalize the generated rules afterwards if you have normalized your data.
Arguments list¶
The normalization
algorithm works with both required and optional arguments. Each argument has specific properties:
- Is required means whether an argument must be specified when calling the program or not.
- Type specifies the argument datatype.
- CLI argument syntax is the exact name to use if you are writing the argument along with the program call.
- JSON identifier is the exact name to use if you are writing the argument inside a JSON configuration file.
- Default value is the value that will be used by the program if the argument is not specified. If
None
, it could mean that the argument is not used at all during the algorithm execution or could also mean that you have to specify it yourself.
Show help¶
Display parameters and other helpful information concerning the program usage and terminate it when done.
Property | Value |
---|---|
Is required | No |
Type | None |
CLI argument syntax | -h , --help or None |
JSON identifier | N/A |
Default value | None |
Warning
Every other specified argument will be ignored.
JSON configuration file¶
File containing the configuration for the algorithm in JSON format (see more about JSON configuration files).
Property | Value |
---|---|
Is required | No |
Type | String |
CLI argument syntax | --json-configuration-file |
JSON identifier | N/A |
Default value | None |
Warning
If you use this argument, it must be the only one specified. No other argument can be specified with it.
Root folder path¶
Default path from where all the other arguments related to file paths are going to be based. Using this allows you to work with paths relative to this location and avoid writing absolute paths or lengthy relative paths.
Property | Value |
---|---|
Is required | No |
Type | String |
CLI argument syntax | --root_folder |
JSON identifier | root_folder |
Default value | . |
Number of attributes¶
Number of attributes in the dataset (should be equal to the number of inputs of the model). Takes values in the range [1,∞[
.
Property | Value |
---|---|
Is required | Yes |
Type | Integer |
CLI argument syntax | --nb_attributes |
JSON identifier | nb_attributes |
Default value | None |
Number of classes¶
Number of classes in the dataset (should be equal to the number of outputs of the model). Takes values in the range [2,∞[
.
Property | Value |
---|---|
Is required | No |
Type | Integer |
CLI argument syntax | --nb_classes |
JSON identifier | nb_classes |
Default value | None |
Attributes file¶
File containing attributes and classes names.
Property | Value |
---|---|
Is required | No |
Type | String |
CLI argument syntax | --attributes_file |
JSON identifier | attributes_file |
Default value | None |
Data files¶
List of data files to normalize, they are normalized with respect to the first one if normalization file is not specified.
Property | Value |
---|---|
Is required | No** |
Type | List of strings |
CLI argument syntax | --data_files |
JSON identifier | data_files |
Default value | None |
Warning
This argument is not required if, and only if, the rule files is specified.
Rule files¶
List of rule files to denormalize, denormalization is possible only if a normalization file or mus, sigmas and normalization indices are given.
Property | Value |
---|---|
Is required | No** |
Type | List of strings |
CLI argument syntax | --rule_files |
JSON identifier | rule_files |
Default value | None |
Warning
This argument is not required if, and only if, the data files is specified.
Missing values¶
String representing a missing value in your data.
Property | Value |
---|---|
Is required | No** |
Type | String |
CLI argument syntax | --normalization_file |
JSON identifier | normalization_file |
Default value | None |
Warning
This argument is required for normalization. Put NaN
or any string not present in your data if there is no missing data.
Normalization file¶
File containing the mean and standard deviation for specified attributes to normalize data or denormalize rules.
Property | Value |
---|---|
Is required | No |
Type | String |
CLI argument syntax | --normalization_file |
JSON identifier | normalization_file |
Default value | None |
Mus¶
Mean or median of each attribute index specified in normalization indices to normalize data or denormalize rules. This argument is used alongside sigmas and normalization indices. Takes values in the range ]-∞,∞[
.
Property | Value |
---|---|
Is required | No** |
Type | Float list |
CLI argument syntax | --mus |
JSON identifier | mus |
Default value | None |
Warning
If sigmas or normalization indices are used, then this argument is required. Not used if a normalization file is given.
Sigmas¶
Standard deviation of each attribute index specified in normalization indices to normalize data or denormalize rules. This argument is used alongside mus and normalization indices. Takes values in the range ]-∞,∞[
.
Property | Value |
---|---|
Is required | No** |
Type | Float list |
CLI argument syntax | --sigmas |
JSON identifier | sigmas |
Default value | None |
Warning
If mus or normalization indices are used, then this argument is required. Not used if a normalization file is given.
Normalization indices¶
Indices of attributes to normalize or denormalize. Index starts at 0. Each index takes values in the range [0,nb_attributes-1]
.
Property | Value |
---|---|
Is required | No** |
Type | List of integers |
CLI argument syntax | --normalization_indices |
JSON identifier | normalization_indices |
Default value | [0,...,nb_attributes-1] |
Warning
If mus or sigmas are used, then this argument is required. Not used if a normalization file is given.
Normalization output file¶
Path to the file where the mean and standard deviation of the normalized attributes will be stored.
Property | Value |
---|---|
Is required | No |
Type | String |
CLI argument syntax | --output_normalization_file |
JSON identifier | output_normalization_file |
Default value | normalization_stats.txt |
Data output files¶
List containing the paths where the normalized data files will be saved.
Property | Value |
---|---|
Is required | No |
Type | List of strings |
CLI argument syntax | --output_data_files |
JSON identifier | output_data_files |
Default value | <original_name>_denormalized<original_extension> |
Warning
If one name is specified, all names are required.
Rule output files¶
List containing the paths where the denormalized rule files will be saved.
Property | Value |
---|---|
Is required | No |
Type | List of strings |
CLI argument syntax | --output_rule_files |
JSON identifier | output_rule_files |
Default value | <original_name>_denormalized<original_extension> |
Warning
If one name is specified, all names are required.
Use median¶
Whether to use median instead of mean to normalize.
Property | Value |
---|---|
Is required | No |
Type | Boolean |
CLI argument syntax | --with_median |
JSON identifier | with_median |
Default value | False |
Fill missing values¶
Whether to fill missing values with mean (or median) during normalization.
Property | Value |
---|---|
Is required | No |
Type | Boolean |
CLI argument syntax | --fill_missing_values |
JSON identifier | fill_missing_values |
Default value | True |
Usage example¶
For datafile normalization :
Example
For rulefile denormalization :
Example
Output interpretation¶
Normalized output data files¶
This file contains the normalized values of a dataset (taken from an original data file), where certain attributes have been adjusted based on a predefined mean (or median) and standard deviation (std). For attributes with defined mean and std values, normalization is applied using the formula: \( \text{normalized_value} = \frac{\text{(value - mean)}}{std} \). If no mean and std are defined for an attribute, its values remain unchanged during the normalization process. This normalization helps standardize data for more effective use in machine learning models.
Denormalized output rule files¶
This file contains denormalized values for a set of rules (taken from an original rules file) based on predefined mean (or median) and standard deviation (std) values for certain attributes. For attributes with defined mean and std values, denormalization is applied using the formula: \( \text{denormalized_value} =(\text{normalized_value}×std)+\text{mean} \). Attributes without defined mean and std values will remain unchanged during the denormalization process. This procedure helps convert normalized rule thresholds back to their original scale, making the rules easier to interpret.
Normalization output file¶
This file stores the mean(or median) and standard deviation (std) values for specific attributes, which are used for normalization or denormalization purposes. Each line corresponds to an attribute, with the format:
[attribute index] : original mean: [mean value], original std: [std value]
These mean(or median) and std values are applied during normalization to transform raw data into normalized values, and during denormalization to revert normalized values back to their original scale.
Attribute indices can be replaced with attribute names. In this case, an attribute file is required.