Skip to content

normalization

Description

The normalization is a method that adjusts the scale of data attributes, typically by converting values to a standard range (e.g., zero mean and unit variance), facilitating more efficient and accurate model training. This algorithm offers flexibility by allowing normalization based on pre-defined parameters, calculated statistics from provided data files, or manual input of mean and standard deviation. Additionally, it supports denormalizing rule files for more interpretable results. The process generates normalized data files and denormalized rule files, and is especially useful when preparing data for Dimlp models.

Keep the following in mind:

  • Normalization is recommended before training with DimlpTrn and DimlpBT.
  • Normalization is not necessary for CNN, MLP, and SVM as normalization is handled internally during the training process.
  • Decision trees (e.g., Gradient Boosting, Random Forests) do not require normalization as they are robust to unnormalized data.

Do not forget to denormalize the generated rules afterwards if you have normalized your data.

Arguments list

The normalization algorithm works with both required and optional arguments. Each argument has specific properties:

  • Is required means whether an argument must be specified when calling the program or not.
  • Type specifies the argument datatype.
  • CLI argument syntax is the exact name to use if you are writing the argument along with the program call.
  • JSON identifier is the exact name to use if you are writing the argument inside a JSON configuration file.
  • Default value is the value that will be used by the program if the argument is not specified. If None, it could mean that the argument is not used at all during the algorithm execution or could also mean that you have to specify it yourself.

Show help

Display parameters and other helpful information concerning the program usage and terminate it when done.

Property Value
Is required No
Type None
CLI argument syntax -h, --help or None
JSON identifier N/A
Default value None

Warning

Every other specified argument will be ignored.


JSON configuration file

File containing the configuration for the algorithm in JSON format (see more about JSON configuration files).

Property Value
Is required No
Type String
CLI argument syntax --json-configuration-file
JSON identifier N/A
Default value None

Warning

If you use this argument, it must be the only one specified. No other argument can be specified with it.


Root folder path

Default path from where all the other arguments related to file paths are going to be based. Using this allows you to work with paths relative to this location and avoid writing absolute paths or lengthy relative paths.

Property Value
Is required No
Type String
CLI argument syntax --root_folder
JSON identifier root_folder
Default value .

Number of attributes

Number of attributes in the dataset (should be equal to the number of inputs of the model). Takes values in the range [1,∞[.

Property Value
Is required Yes
Type Integer
CLI argument syntax --nb_attributes
JSON identifier nb_attributes
Default value None

Number of classes

Number of classes in the dataset (should be equal to the number of outputs of the model). Takes values in the range [2,∞[.

Property Value
Is required No
Type Integer
CLI argument syntax --nb_classes
JSON identifier nb_classes
Default value None

Attributes file

File containing attributes and classes names.

Property Value
Is required No
Type String
CLI argument syntax --attributes_file
JSON identifier attributes_file
Default value None

Data files

List of data files to normalize, they are normalized with respect to the first one if normalization file is not specified.

Property Value
Is required No**
Type List of strings
CLI argument syntax --data_files
JSON identifier data_files
Default value None

Warning

This argument is not required if, and only if, the rule files is specified.


Rule files

List of rule files to denormalize, denormalization is possible only if a normalization file or mus, sigmas and normalization indices are given.

Property Value
Is required No**
Type List of strings
CLI argument syntax --rule_files
JSON identifier rule_files
Default value None

Warning

This argument is not required if, and only if, the data files is specified.


Missing values

String representing a missing value in your data.

Property Value
Is required No**
Type String
CLI argument syntax --normalization_file
JSON identifier normalization_file
Default value None

Warning

This argument is required for normalization. Put NaN or any string not present in your data if there is no missing data.


Normalization file

File containing the mean and standard deviation for specified attributes to normalize data or denormalize rules.

Property Value
Is required No
Type String
CLI argument syntax --normalization_file
JSON identifier normalization_file
Default value None

Mus

Mean or median of each attribute index specified in normalization indices to normalize data or denormalize rules. This argument is used alongside sigmas and normalization indices. Takes values in the range ]-∞,∞[.

Property Value
Is required No**
Type Float list
CLI argument syntax --mus
JSON identifier mus
Default value None

Warning

If sigmas or normalization indices are used, then this argument is required. Not used if a normalization file is given.


Sigmas

Standard deviation of each attribute index specified in normalization indices to normalize data or denormalize rules. This argument is used alongside mus and normalization indices. Takes values in the range ]-∞,∞[.

Property Value
Is required No**
Type Float list
CLI argument syntax --sigmas
JSON identifier sigmas
Default value None

Warning

If mus or normalization indices are used, then this argument is required. Not used if a normalization file is given.


Normalization indices

Indices of attributes to normalize or denormalize. Index starts at 0. Each index takes values in the range [0,nb_attributes-1].

Property Value
Is required No**
Type List of integers
CLI argument syntax --normalization_indices
JSON identifier normalization_indices
Default value [0,...,nb_attributes-1]

Warning

If mus or sigmas are used, then this argument is required. Not used if a normalization file is given.


Normalization output file

Path to the file where the mean and standard deviation of the normalized attributes will be stored.

Property Value
Is required No
Type String
CLI argument syntax --output_normalization_file
JSON identifier output_normalization_file
Default value normalization_stats.txt

Data output files

List containing the paths where the normalized data files will be saved.

Property Value
Is required No
Type List of strings
CLI argument syntax --output_data_files
JSON identifier output_data_files
Default value <original_name>_denormalized<original_extension>

Warning

If one name is specified, all names are required.


Rule output files

List containing the paths where the denormalized rule files will be saved.

Property Value
Is required No
Type List of strings
CLI argument syntax --output_rule_files
JSON identifier output_rule_files
Default value <original_name>_denormalized<original_extension>

Warning

If one name is specified, all names are required.


Use median

Whether to use median instead of mean to normalize.

Property Value
Is required No
Type Boolean
CLI argument syntax --with_median
JSON identifier with_median
Default value False

Fill missing values

Whether to fill missing values with mean (or median) during normalization.

Property Value
Is required No
Type Boolean
CLI argument syntax --fill_missing_values
JSON identifier fill_missing_values
Default value True

Usage example

For datafile normalization :

Example

from trainings import normalization

normalization(
"""--data_files [train_data.txt,test_data.txt]
--normalization_indices [0,2,4]
--nb_attributes 16
--missing_values NaN
--root_folder dimlp/datafiles"""
)
./normalization --data_files [train_data.txt,test_data.txt] --normalization_indices [0,2,4] --nb_attributes 16 --missing_values NaN --root_folder ../dimlp/datafiles

For rulefile denormalization :

Example

from trainings import normalization

normalization(
"""--normalization_file normalization_stats.txt
--rule_files globalRules.rls
--nb_attributes 16
--root_folder dimlp/datafiles"""
)
./normalization --normalization_file normalization_stats.txt --rule_files globalRules.rls --nb_attributes 16 --root_folder ../dimlp/datafiles

Output interpretation


Normalized output data files

This file contains the normalized values of a dataset (taken from an original data file), where certain attributes have been adjusted based on a predefined mean (or median) and standard deviation (std). For attributes with defined mean and std values, normalization is applied using the formula: \( \text{normalized_value} = \frac{\text{(value - mean)}}{std} \)​. If no mean and std are defined for an attribute, its values remain unchanged during the normalization process. This normalization helps standardize data for more effective use in machine learning models.


Denormalized output rule files

This file contains denormalized values for a set of rules (taken from an original rules file) based on predefined mean (or median) and standard deviation (std) values for certain attributes. For attributes with defined mean and std values, denormalization is applied using the formula: \( \text{denormalized_value} =(\text{normalized_value}×std)+\text{mean} \). Attributes without defined mean and std values will remain unchanged during the denormalization process. This procedure helps convert normalized rule thresholds back to their original scale, making the rules easier to interpret.


Normalization output file

This file stores the mean(or median) and standard deviation (std) values for specific attributes, which are used for normalization or denormalization purposes. Each line corresponds to an attribute, with the format:

[attribute index] : original mean: [mean value], original std: [std value]

These mean(or median) and std values are applied during normalization to transform raw data into normalized values, and during denormalization to revert normalized values back to their original scale.

Attribute indices can be replaced with attribute names. In this case, an attribute file is required.