trainings package

Submodules

trainings.cnnTrn module

trainings.cnnTrn.cnnTrn(args: str = None)

Trains a convolutional neural network (CNN) model using the Keras library, with optional support for popular architectures such as ResNet and VGG. The function performs data preprocessing, including resizing and normalization, and uses a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This makes it possible to use Fidex afterwards for comprehensible rule extraction. It accommodates various image datasets, including MNIST, CIFAR-10, and CIFAR-100, and allows for extensive customization through command-line arguments. Other data types can also be used.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of classes in the data, as well as the train and test datasets.

  • Validation data can either be specified directly or split from the training data based on a provided ratio.

  • If validation files are given and you want to use Fidex algorithms later, you will have to pass both the train and validation data given here as the train data and classes of Fidex.

  • It’s mandatory to specify the size of the original inputs as well as the number of channels (it should be 3 for RGB and 1 for B&W). The number of attributes is inferred from it.

  • It’s mandatory to choose a model. A large model, a small one, a VGG16 and a ResNet50 are available. You can add any other model you want by modifying the code.

  • It’s mandatory to specify the format of the data values: ‘normalized_01’ if the data are normalized between 0 and 1, ‘classic’ if they are between 0 and 255, and ‘other’ otherwise.

  • Data is reshaped into a 3-channel shape if it has only one channel and VGG or ResNet is used.

  • If Fidex is meant to be executed afterward for rule extraction, resizing inputs beforehand to a smaller size is recommended, as Fidex will otherwise take a lot of time because of the number of parameters.

  • It is also possible to resize the inputs just for training with the model_input_size parameter. Training with smaller inputs will yield worse results but will save a lot of time.

  • Parameters can be specified using the command line or a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.

Outputs:

  • train_valid_pred_outfile : File containing the model’s train and validation (in this order) predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample (input/image) per line, with numbers separated either by spaces, tabs, semicolons or commas. Pixels must be given one after the other. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.
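The integer-ID and one-hot class formats encode the same information. A minimal sketch of the correspondence (the helper names here are illustrative, not part of the package):

```python
# Illustration of the two supported class-file encodings: an integer
# class ID and the equivalent one-hot vector.

def id_to_one_hot(class_id, nb_classes):
    """Return the one-hot encoding of an integer class ID."""
    one_hot = [0] * nb_classes
    one_hot[class_id] = 1
    return one_hot

def one_hot_to_id(one_hot):
    """Return the integer class ID encoded by a one-hot vector."""
    return one_hot.index(1)

# With nb_classes=3, the class-file lines "2" and "0 0 1" encode the same class.
print(id_to_one_hot(2, 3))       # [0, 0, 1]
print(one_hot_to_id([0, 0, 1]))  # 2
```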

Example of how to call the function:

from trainings.cnnTrn import cnnTrn

cnnTrn('--model small --train_data_file trainData.txt --train_class_file trainClass.txt --test_data_file testData.txt --test_class_file testClass.txt --original_input_size (28,28) --nb_channels 1 --data_format classic --nb_classes 10 --root_folder dimlp/datafiles/Mnist')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments or all arguments for the function, formatted like command-line input. This includes dataset selection, file paths, training parameters, and options for model architecture and output files.

Returns:

Returns 0 for successful execution, -1 for any errors encountered during the process.

trainings.cnnTrn.get_and_check_parameters(init_args)

Processes and validates command-line arguments for convolutional model training. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the training process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.computeRocCurve module

trainings.computeRocCurve.computeRocCurve(args: str = None)

Computes and plots the Receiver Operating Characteristic (ROC) curve for a given set of test predictions and true class labels. The function supports various customizations through command-line arguments, including specifying input files, choosing the positive class index, and output options.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • The function is not compatible with SVM models directly due to the different process required for generating ROC curves for them.

  • It’s mandatory to specify the number of classes, the index of the positive class, and provide the test class labels and prediction scores.

  • Parameters can be specified using the command line or a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

Outputs:

  • stats_file : If specified, contains AUC scores.

  • output_roc : PNG file containing the ROC curve.

File formats:

  • Class file: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

  • Prediction file : These files should contain the prediction scores for the test set, with one sample per line, with scores (float) for each class separated either by spaces, tabs, semicolons or commas.

Example of how to call the function:

from trainings.computeRocCurve import computeRocCurve

computeRocCurve('--test_class_file dataclass2Test.txt --test_pred_file predTest.out --positive_class_index 1 --output_roc roc_curve.png --stats_file stats.txt --root_folder dimlp/datafiles --nb_classes 2')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments or all arguments for the function, formatted like command-line input. This includes file paths, the positive class index, and options for the output and statistical analysis.

Returns:

Returns 0 for successful execution, -1 for any errors encountered during the process. Additionally, it returns an array containing interpolated false positive rates (FPR), true positive rates (TPR), and the area under the ROC curve (AUC) for further analysis or cross-validation purposes.
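For intuition, the FPR/TPR points and AUC returned here can be computed with a plain trapezoidal rule. This pure-Python sketch is illustrative only and is not the package's implementation (which also interpolates the curve and plots it):

```python
# Illustrative computation of ROC points and AUC for a binary problem,
# from true labels and positive-class prediction scores.

def roc_points(labels, scores):
    """Return (fpr, tpr) points for binary labels and positive-class scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sweep the decision threshold from the highest score downward.
    for label, _ in sorted(zip(labels, scores), key=lambda p: -p[1]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a list of (fpr, tpr) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc(pts))  # 0.75
```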

trainings.computeRocCurve.get_and_check_parameters(init_args)

Processes and validates command-line arguments for a ROC curve computation. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the ROC curve computation. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.crossValid module

trainings.crossValid.create_or_clear_directory(folder_name)
trainings.crossValid.crossValid(*args, **kwargs)
trainings.crossValid.formatting(number)
trainings.crossValid.get_dimlprul_stats(rule_file)
trainings.crossValid.get_test_acc(stats_file, train_method)

trainings.gradBoostTrn module

trainings.gradBoostTrn.get_and_check_parameters(init_args)

Processes and validates command-line arguments for model training with gradient boosting. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the training process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.gradBoostTrn.gradBoostTrn(args: str = None)

Trains a gradient boosting decision trees model. The nodes of the trees represent the discriminating hyperplanes used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of gradient boosting parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because decision trees don’t need normalization.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • rules_outfile : File containing the model’s trained rules.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.gradBoostTrn import gradBoostTrn

gradBoostTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --stats_file gb/stats.txt --train_pred_outfile gb/predTrain.out --test_pred_outfile gb/predTest.out --rules_outfile gb/RF_rules.rls --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, gradient boosting parameters, and options for output.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.mlpTrn module

trainings.mlpTrn.get_and_check_parameters(init_args)

Processes and validates command-line arguments for model training with an MLP. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the training process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.mlpTrn.mlpTrn(args: str = None)

Trains an MLP model with data preprocessing that includes normalization and a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of MLP parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.mlpTrn import mlpTrn

mlpTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --weights_outfile mlp/weights.wts --stats_file mlp/stats.txt --train_pred_outfile mlp/predTrain.out --test_pred_outfile mlp/predTest.out --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, MLP parameters, and options for output and for the staircase activation process.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.normalization module

trainings.normalization.denormalize_rule(line, pattern, antecedent_pattern, dimlp_rule, with_attribute_names, normalization_indices, attributes, sigmas, mus)

This function denormalizes a given rule line based on specified rule patterns and normalization parameters. If the line doesn’t match the pattern, it is left unchanged. Otherwise, it parses the rule line using the provided regular expression patterns, identifies each antecedent whose attribute index needs to be denormalized, and applies denormalization to the numeric values using the provided sigma (standard deviation) and mu (mean) values. The function then reconstructs the rule with the denormalized values.

Parameters:
  • line (str) – The rule line to be denormalized.

  • pattern (str) – The regular expression pattern for identifying and parsing the entire rule line.

  • antecedent_pattern (str) – The regular expression pattern for identifying each antecedent in the rule.

  • dimlp_rule (bool) – A flag indicating whether the rule is a DIMLP rule or a Fidex rule, which affects attribute indexing and rule pattern.

  • with_attribute_names (bool) – A flag indicating whether the rules use attribute names instead of numeric IDs.

  • normalization_indices (list of int) – A list of indices of the attributes to be denormalized.

  • attributes (list of str or int) – A list of attribute names or numeric IDs of the attributes.

  • sigmas (list of float) – A list of standard deviation values for denormalization, corresponding to each attribute of normalization_indices.

  • mus (list of float) – A list of mean or median values for denormalization, corresponding to each attribute of normalization_indices.

Returns:

The denormalized rule line.

Return type:

str
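Numerically, denormalization inverts the Gaussian normalization: x_original = x_normalized * sigma + mu. A minimal sketch on a Fidex-style antecedent (the regex and helper below are illustrative, not the package's internals):

```python
import re

def denormalize_antecedent(antecedent, sigmas, mus):
    """Denormalize the threshold of an antecedent like 'X1>=0.414584'.

    sigmas and mus map attribute indices to the statistics used during
    normalization (illustrative; the real function works on whole rule lines).
    """
    match = re.fullmatch(r"X(\d+)([<>]=?)(-?\d+\.?\d*)", antecedent)
    idx, op, value = int(match.group(1)), match.group(2), float(match.group(3))
    # Invert (x - mu) / sigma.
    original = value * sigmas[idx] + mus[idx]
    return f"X{idx}{op}{original:.6f}"

# With sigma=0.0425 and mu=0.8307 for attribute X1:
print(denormalize_antecedent("X1>=0.414584", {1: 0.0425}, {1: 0.8307}))
# X1>=0.848320
```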

trainings.normalization.gaussian_normalization(data_files, normalization_indices, normalized_file, fill_missing_values, normalization_file=None, attributes=None, missing_value=None, with_median=False, mus=None, sigmas=None)

Perform Gaussian normalization on specified attributes of the given data.

This function normalizes the data based on the Gaussian distribution, where each selected attribute is adjusted to have a mean (or median) of zero and a standard deviation of one. It handles missing values, writes the normalization statistics to a file if provided, and saves the normalized data to a specified file.

Parameters:
  • data_files (list of lists) – The dataset to be normalized.

  • normalization_indices (list) – Indices of the attributes to be normalized.

  • normalized_file (str) – File path to save the normalized data.

  • fill_missing_values (bool) – Flag to fill missing values with the mean/median.

  • normalization_file (str, optional) – Optional file path to save the normalization statistics.

  • attributes (list, optional) – Optional list of attribute names corresponding to indices.

  • missing_value (str or None) – Representation of missing values in the data.

  • with_median (bool) – Flag to use median for normalization instead of mean.

  • mus (list of float, optional) – Predefined list of mean (or median) values for each attribute, defaults to None.

  • sigmas (list of float, optional) – Predefined list of standard deviation values for each attribute, defaults to None.

Raises:

ValueError – If mus, sigmas, and normalization_indices do not have the same length, or if other validity checks fail.

Returns:

Tuple containing the mean (or median) and standard deviation used for normalization.

Return type:

(float, float)
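The transform described above can be sketched in a few lines of pure Python (illustrative only; the actual function additionally handles missing values and file output):

```python
import statistics

def gaussian_normalize_column(values, with_median=False):
    """Shift and scale one attribute column to mean (or median) 0 and std 1.

    Returns (normalized values, mu, sigma). Illustrative sketch, not the
    package implementation.
    """
    mu = statistics.median(values) if with_median else statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values], mu, sigma

column = [2.0, 4.0, 6.0, 8.0]
normalized, mu, sigma = gaussian_normalize_column(column)
print(mu, sigma)      # mu = 5.0, sigma = sqrt(5) ≈ 2.236
print(normalized[0])  # (2 - 5) / sqrt(5) ≈ -1.342
```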

trainings.normalization.get_and_check_parameters(init_args)

Processes and validates command-line arguments for a data normalization application. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the normalization process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.normalization.get_pattern_from_rule_file(rule_file, possible_patterns)

This function reads through a given rule file and identifies the pattern that matches the rules in the file. It narrows down the possible patterns to the one that matches the content of the file. If no pattern matches, or if multiple patterns match different rules, an error is raised.

Parameters:
  • rule_file (str) – The path to the rule file to be analyzed.

  • possible_patterns (list of str) – A list of regular expression patterns to check against the lines in the file.

Raises:
  • ValueError – If no rule matches the provided patterns, or if multiple patterns match different rules.

  • ValueError – If the file is not found or cannot be opened.

Returns:

The pattern that matches the rules in the file.

Return type:

str

trainings.normalization.normalization(args: str = None)

This function serves two primary purposes: to normalize data files and to denormalize rule files. It offers flexibility in the normalization process through various options.

Normalization can be performed in several ways:

  1. Using a normalization_file containing normalization parameters along with one or more data files.

  2. Providing data files directly, where the first file is normalized to determine mean/median and standard deviation, which are then applied to other files.

  3. Supplying mean/median (mus) and standard deviations (sigmas) as lists, along with the data files.

In the last two cases, indices of attributes to normalize must be provided, and a normalization_file is generated for future use.

Denormalization can also be done in multiple ways:

  1. Using a normalization_file with one or more rule files.

  2. Directly providing mean/median (mus) and standard deviations (sigmas) along with the rule files. Attribute indices to be denormalized must be provided in this case.

The function generates new normalized and/or denormalized files.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes in the data and the symbol representing missing data.

  • You must choose whether or not to replace missing data.

  • If normalizing training data, it is advisable to normalize test/validation files simultaneously for consistency.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

When to use :

  • It’s good to normalize data before training with Dimlp and dimlpBT.

  • It’s not necessary to normalize data before training with cnnTrn, MLP and SVM because a normalization is done during the process.

  • It’s not necessary to normalize data before training with GradientBoosting and RandomForests because decision trees don’t need normalization.

Outputs :

  • output_normalization_file : File containing the mean and std of the normalized attributes.

  • output_data_files : Files containing the original data files normalized.

  • output_rule_files : Files containing the original rule files denormalized.

File formats:

  • Normalization file: Each line contains the mean/median and standard deviation for an attribute.

    Format: '2 : original mean: 0.8307, original std: 0.0425'

    Attribute indices (index 2 here) can be replaced with attribute names, then an attribute file is required.

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Rule files: Contain rules in Dimlp or Fidex format. Formats:

    Dimlp: 'Rule 1: (x2 > 0.785787) (x5 > 0.591247) (x8 < 0.443135) Class = 1 (187)'

    Fidex: 'X1>=0.414584 X10<0.507982 X5>=0.314835 X6>=0.356158 -> class 0'

    In both formats, attribute indices (e.g., X1, x2) and class identifiers can be replaced with attribute names and class names, respectively, then an attribute file is required.

  • Attribute file: Each line corresponds to an attribute’s name, with optional class names at the end. Names can’t contain spaces (replace them with _).

Examples of how to call the function:

from trainings.normalization import normalization

  • For data files: normalization('--data_files [datanormTrain.txt,datanormTest.txt] --normalization_indices [0,2,4] --nb_attributes 16 --missing_values NaN --root_folder dimlp/datafiles')

  • For rule files: normalization('--normalization_file normalization_stats.txt --rule_files globalRulesDatanorm.txt --nb_attributes 16 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths for the normalization/denormalization process and other options.

Returns:

Returns 0 for successful execution, -1 for errors.

trainings.normalization.parse_normalization_file(file_name, nb_attributes, attributes=None)

Parse a file containing normalization statistics previously generated using the normalization function.

Parameters:
  • file_name – Name of the file containing the normalization statistics. The expected format for each line in the file is either: "[attribute_name] : original (mean|median): [value], original std: [value]" or "[attribute_index] : original (mean|median): [value], original std: [value]", where [attribute_name] is a string from the attributes list (if provided), [attribute_index] is an integer (if attributes is not provided), and [value] is a floating-point number.

  • nb_attributes – Number of attributes.

  • attributes – List of attribute names (optional).

Raises:

ValueError – If the file is not found, cannot be opened, is not in the correct format, or if there is an inconsistency in using mean or median across different lines.

Returns:

Tuple of (indices_list, with_median, mean_median_values, std_values).
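A line in the stated format can be parsed with a simple regular expression. This sketch is illustrative only and is not the package's own parser:

```python
import re

# Matches "[attribute] : original (mean|median): [value], original std: [value]".
LINE_PATTERN = re.compile(
    r"(\S+) : original (mean|median): (-?\d+\.?\d*), original std: (-?\d+\.?\d*)"
)

def parse_line(line):
    """Return (attribute, uses_median, mu, sigma) for one statistics line."""
    m = LINE_PATTERN.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"line not in expected format: {line!r}")
    return m.group(1), m.group(2) == "median", float(m.group(3)), float(m.group(4))

print(parse_line("2 : original mean: 0.8307, original std: 0.0425"))
# ('2', False, 0.8307, 0.0425)
```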

trainings.parameters module

class trainings.parameters.CustomArgumentParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True, exit_on_error=True)

Bases: ArgumentParser

A custom argument parser that overrides the default exit behavior to raise an exception instead of exiting.

exit(status=0, message=None)

Overrides the default exit method to prevent system exit on parsing errors.

Parameters:
  • status – The exit status code.

  • message – The error message to print.

Raises:

ValueError – Always raised to avoid exiting the script.

class trainings.parameters.CustomHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: ArgumentDefaultsHelpFormatter

A custom help formatter for argparse that categorizes arguments into required, optional, and tagged parameters and enhances help message formatting.

add_arguments(actions)

Organizes and adds argument descriptions to the help message.

Parameters:

actions – A list of argparse actions.

add_text(text, raw=False)

Adds custom text to the help message. This method allows for raw text addition, bypassing the standard formatting applied by argparse.

Parameters:
  • text (str) – The text to be added to the help message.

  • raw (bool) – If True, adds the text without any formatting. If False, the standard formatting is applied.

format_help()

Cleans up and formats the overall help message, removing unnecessary lines and organizing content.

Returns:

A cleaned and formatted help message string.

class trainings.parameters.TaggableAction(*args, **kwargs)

Bases: Action

A custom argparse action that supports tagging, allowing for additional metadata to be associated with arguments.

Parameters:

tag – An optional tag to associate with the action.

trainings.parameters.bool_type(value)

Converts a string to a boolean value, recognizing various true/false representations.

Parameters:

value – The input string to convert.

Returns:

The converted boolean value.

Raises:

argparse.ArgumentTypeError – If the input cannot be interpreted as a boolean.
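A minimal sketch of such a converter, assuming it accepts common spellings like "true"/"false", "yes"/"no" and "1"/"0" (the exact set recognized by the package may differ):

```python
import argparse

def bool_type_sketch(value):
    """Convert a string to a boolean, or raise ArgumentTypeError."""
    lowered = value.strip().lower()
    if lowered in ("true", "yes", "1"):
        return True
    if lowered in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"invalid boolean value: {value!r}")

print(bool_type_sketch("Yes"))  # True
print(bool_type_sketch("0"))    # False
```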

trainings.parameters.dict_type(value: str)

Validates and converts a string representation of a dictionary into a dictionary.

Parameters:

value_str – The string representation of a dictionary.

Returns:

The dictionary represented by the input string.

Raises:

argparse.ArgumentTypeError – If the input string does not represent a valid dictionary.

trainings.parameters.directory(path: str)

An argparse type function that validates if the provided path is a directory.

Parameters:

path – The input path string.

Returns:

The validated directory path.

Raises:

argparse.ArgumentTypeError – If the path is not a valid directory.

trainings.parameters.enum_type(value: str, *valid_strings, **valid_types)

Attempts to match the given value against a set of valid string values or some specified types.

Parameters:
  • value – The value to be validated or converted.

  • valid_strings – A variable number of string arguments representing valid values for value.

  • valid_types – Keyword arguments where each key is a type identifier and its value is a dict containing the type function under the key ‘func’ and any additional keyword arguments for the function.

Returns:

The original value if it matches one of the valid_strings, or the type converted value if it matches one of the type constraints specified in valid_types.

Raises:

argparse.ArgumentTypeError – If value does not match any of the valid_strings or cannot be converted to match the specified type constraints, with a detailed error message.

trainings.parameters.float_type(value: str, min=-inf, max=inf, min_inclusive=True, max_inclusive=True)

Validates and converts a string to a float, with optional range constraints.

Parameters:
  • value – The input string to convert.

  • min – The minimum acceptable value.

  • max – The maximum acceptable value.

  • min_inclusive – Whether the minimum value is inclusive.

  • max_inclusive – Whether the maximum value is inclusive.

Returns:

The converted float.

Raises:

argparse.ArgumentTypeError – If the input is invalid or out of the specified range.
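Such a type function is typically wired into argparse through a lambda that fixes the range constraints. The sketch below assumes an implementation consistent with the documented signature; it is not the package's actual code:

```python
import argparse
import math

def float_type_sketch(value, min=-math.inf, max=math.inf,
                      min_inclusive=True, max_inclusive=True):
    """Parse a float and enforce the optional range constraints."""
    try:
        number = float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid float value: {value!r}")
    below = number < min or (number == min and not min_inclusive)
    above = number > max or (number == max and not max_inclusive)
    if below or above:
        raise argparse.ArgumentTypeError(f"{number} out of range")
    return number

# Typical usage: constrain a dropout rate to [0, 1].
parser = argparse.ArgumentParser()
parser.add_argument("--dropout", type=lambda v: float_type_sketch(v, min=0.0, max=1.0))
args = parser.parse_args(["--dropout", "0.3"])
print(args.dropout)  # 0.3
```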

trainings.parameters.get_args(args, init_args, parser)

Finalizes argument parsing, either from the JSON configuration file or command-line input, using a given parser.

Parameters:
  • args – The arguments previously parsed by the initial parser.

  • init_args – The initial command-line arguments.

  • parser – The argparse parser instance to use for final argument parsing.

Returns:

The fully parsed arguments.

trainings.parameters.get_common_parser(args, initial_parser)

Creates and returns a common argument parser for handling shared training arguments.

Parameters:
  • args – The arguments previously parsed by the initial parser.

  • initial_parser – The instance of the initial parser to use as a parent for common arguments.

Returns:

The instance of the common argument parser.

trainings.parameters.get_initial_parser(init_args)

Creates and returns an initial argument parser for handling the root folder and JSON configuration file options.

Parameters:

init_args – The initial command-line arguments.

Returns:

A tuple containing the parsed arguments and the initial parser instance.

trainings.parameters.get_tag_value(actions)

Retrieves the tag attribute from a list of argparse actions.

Parameters:

actions – A list of argparse actions.

Returns:

The first tag attribute value found among the actions, or None if no tag is present.

trainings.parameters.int_type(value: str, min=-inf, max=inf, allow_value=None)

Validates and converts a string to an integer, with optional range constraints and the option to allow None as a value.

Parameters:
  • value – The input string to convert.

  • min – The minimum acceptable value.

  • max – The maximum acceptable value.

  • allow_value – An additional specific value to accept as valid even outside the range (e.g. to allow None), defaults to None.

Returns:

The converted integer.

Raises:

argparse.ArgumentTypeError – If the input is invalid or out of the specified range.

trainings.parameters.json_to_args(jsonfile: str)

Parses a JSON file and converts it into a list of command-line arguments.

Parameters:

jsonfile – The path to the JSON configuration file.

Returns:

A list of command-line arguments derived from the JSON file.

Raises:

ValueError – If there is an error parsing the JSON file.
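A sketch of the JSON-to-arguments conversion, assuming a flat {parameter: value} configuration file (the real function may handle more cases):

```python
import json

def json_to_args(jsonfile):
    # Hypothetical sketch: flatten {"key": value} pairs into
    # ["--key", "value", ...] command-line tokens.
    with open(jsonfile) as f:
        try:
            config = json.load(f)
        except json.JSONDecodeError as e:
            raise ValueError(f"Error parsing JSON file: {e}")
    args = []
    for key, value in config.items():
        args.append(f"--{key}")
        args.append(str(value))
    return args
```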

trainings.parameters.list_type(str_list: str, valid_type: dict)

Converts a string representation of a list into a list of values of a specified type, checking this type.

Parameters:
  • str_list – The string representation of the list to be converted. The string should be delimited by commas or spaces, and optionally enclosed in brackets or parentheses.

  • valid_type – A dictionary specifying the type to which each element of the list should be converted. It must contain a ‘func’ key with a function for type conversion, and can include additional keys for type-specific constraints.

Returns:

A list of values of the specified type, with each element having passed the defined constraints.

Example usage: list_type("[1, 2, 3]", valid_type={'func': int}) -> [1, 2, 3]; list_type("4 5 6", valid_type={'func': int, 'min': 3}) -> [4, 5, 6] with the min value constraint applied

trainings.parameters.pair_type(str_list: str, valid_type: dict)

Converts a string representation of a pair into a pair of values of a specified type, ensuring that the pair contains exactly 2 elements.

Parameters:
  • str_list – The string representation of the pair to be converted. The string should be delimited by commas or spaces, and optionally enclosed in brackets or parentheses.

  • valid_type – A dictionary specifying the type to which each element of the pair should be converted. It must contain a ‘func’ key with a function for type conversion, and can include additional keys for type-specific constraints.

Returns:

A pair of exactly 2 values of the specified type, with each element having passed the defined constraints. Raises a ValueError if the list does not contain exactly 2 elements.

Example usage:

pair_type("[1, 2]", valid_type={'func': int}) -> (1, 2)

pair_type("(4 5)", valid_type={'func': int, 'min': 3}) -> (4, 5) with the min value constraint applied

trainings.parameters.print_parameters(args)

Prints the list of parameters passed to the program.

This function iterates over all arguments contained within args (expected to be an argparse Namespace) and prints them to the standard output along with their values, provided they are not None.

Parameters:

args – A Namespace containing the program’s arguments.

trainings.parameters.sanitizepath(path: str, file: str, access_type: str = 'r')

Validates and constructs a full file path based on a base path and a filename, checking for file existence based on access type.

Parameters:
  • path – The base path.

  • file – The filename to append to the base path.

  • access_type – The file access type (‘r’ for read, ‘w’ for write).

Returns:

The constructed and validated file path.

Raises:

argparse.ArgumentTypeError – If the path does not exist or the file is not valid.
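A sketch of the described behaviour, assuming 'r' requires the file itself to exist while 'w' only requires its containing directory to exist:

```python
import argparse
import os

def sanitizepath(path, file, access_type="r"):
    # Hypothetical sketch of the path check described above.
    full_path = os.path.join(path, file)
    if access_type == "r":
        # For reading, the file must already exist.
        if not os.path.isfile(full_path):
            raise argparse.ArgumentTypeError(f"{full_path} does not exist")
    else:
        # For writing, only the containing directory must exist.
        directory = os.path.dirname(full_path) or "."
        if not os.path.isdir(directory):
            raise argparse.ArgumentTypeError(f"{directory} does not exist")
    return full_path
```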

trainings.randForestsTrn module

trainings.randForestsTrn.get_and_check_parameters(init_args)

Processes and validates command-line arguments for training a model with random forests. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the training process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.randForestsTrn.randForestsTrn(args: str = None)

Trains a random forests model, an ensemble of decision trees. The nodes of the trees represent the discriminating hyperplanes used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of random forests parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because decision trees don’t need normalization.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • rules_outfile : File containing the model’s trained rules.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.randForestsTrn import randForestsTrn

randForestsTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --stats_file rf/stats.txt --train_pred_outfile rf/predTrain.out --test_pred_outfile rf/predTest.out --rules_outfile rf/RF_rules.rls --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, random forests parameters, and options for output.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.stairObj module

class trainings.stairObj.StairObj(nb_bins, hiknot)

Bases: object

This class represents an object for manipulating stair-like structures in data.

A StairObj is initialized with a number of bins (steps) and a high knot value, defining a range of activation where the value gradually increases across bins until it reaches the high knot point. The low knot point is automatically set to the opposite of the high knot value, creating a symmetric range of activation around zero. It calculates and stores knot values and their corresponding activation levels.

activate_knots()

Calculates and stores the positions of knots and their respective activation levels based on the stair structure defined by nb_bins and hiknot.

activation(x)

Computes the activation level for a given input x using the sigmoid function.

Parameters:

x (float) – The input value for which to compute the activation.

Returns:

The activation level for x.

Return type:

float

funct(x)

Returns the activation level for a given value x, taking into account the defined stair structure of the object.

Parameters:

x (float) – The input value for which to find the activation.

Returns:

The activation level for x.

Return type:

float

init_member_const_for_ansi()

Initializes member constants of the stair structure.

sigmoid(x)

The sigmoid function, used as the activation function in this stair object. Handles overflow by returning 1.0 for high positive inputs and 0.0 for high negative inputs.

Parameters:

x (float) – The input value.

Returns:

The sigmoid activation for x.

Return type:

float
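The stair structure described above can be sketched in plain Python. This is a simplified illustration of the documented behaviour (symmetric knots around zero, sigmoid activation levels, overflow-safe sigmoid), not the package's actual implementation:

```python
import math

def sigmoid(x):
    # Clamp extremes to avoid math.exp overflow, as documented above.
    if x > 50:
        return 1.0
    if x < -50:
        return 0.0
    return 1.0 / (1.0 + math.exp(-x))

class StairObj:
    # Simplified sketch of the stair structure; details may differ
    # from the real class.
    def __init__(self, nb_bins, hiknot):
        self.nb_bins = nb_bins
        self.hiknot = hiknot
        self.lowknot = -hiknot  # symmetric range of activation around zero
        self.activate_knots()

    def activate_knots(self):
        # Evenly spaced knots and their sigmoid activation levels.
        dist = (self.hiknot - self.lowknot) / self.nb_bins
        self.knots = [self.lowknot + i * dist for i in range(self.nb_bins + 1)]
        self.eval_knots = [sigmoid(k) for k in self.knots]

    def funct(self, x):
        # Below the range: activation of the low knot; above: of the high knot;
        # inside: activation of the knot at the start of the current bin.
        if x < self.lowknot:
            return self.eval_knots[0]
        if x >= self.hiknot:
            return self.eval_knots[-1]
        dist = (self.hiknot - self.lowknot) / self.nb_bins
        return self.eval_knots[int((x - self.lowknot) / dist)]
```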

trainings.svmTrn module

trainings.svmTrn.get_and_check_parameters(init_args)

Processes and validates command-line arguments for training a model with SVM. This function cleans the input arguments by removing None values, ensuring no unnecessary arguments are passed to the parser. It initializes the argument parser with basic configurations and adds the various arguments required for the training process. It determines which arguments are required and defines their default values.

Parameters:

init_args (list) – A list of command-line arguments passed to the program.

Returns:

A namespace object containing all the arguments that have been parsed and validated.

Return type:

argparse.Namespace

trainings.svmTrn.svmTrn(args: str = None)

Trains an SVM model with data preprocessing that includes normalization and a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of SVM parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

  • output_roc : PNG file containing the ROC curve.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.svmTrn import svmTrn

svmTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --weights_outfile svm/weights.wts --stats_file svm/stats.txt --train_pred_outfile svm/predTrain.out --test_pred_outfile svm/predTest.out --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, SVM parameters, and options for output and for the staircase activation process.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.trnFun module

trainings.trnFun.compute_first_hidden_layer(step, input_data, k, nb_stairs, hiknot, weights_outfile=None, mu=None, sigma=None)

Compute the output of the first hidden layer in a neural network model, apply a staircase activation function, and optionally save weights.

This function normalizes the input data and applies a linear transformation based on provided or calculated mean (mu) and standard deviation (sigma). If in train mode, it calculates and optionally saves the weights and biases. The transformed data is then passed through a staircase activation function. The input data should be nbSamples x nbAttributes.

Parameters:
  • step (str) – 'train' for the training step, any other value for the testing step.

  • input_data (list[list[float]] or np.ndarray) – Input data to be processed.

  • k (float) – Scaling factor for normalization.

  • nb_stairs (int) – Number of stairs in the staircase activation function.

  • hiknot (float) – High knot value of the staircase activation function (the low knot is set to its opposite).

  • weights_outfile (str, optional) – File path to save the weights; defaults to None. Mandatory for the training step.

  • mu (np.ndarray, optional) – Mean for normalization, calculated if None. Defaults to None.

  • sigma (np.ndarray, optional) – Standard deviation for normalization, calculated if None. Defaults to None.

Returns:

Transformed data, and mu and sigma if in train mode.

Return type:

tuple (np.ndarray, np.ndarray, np.ndarray) or np.ndarray

Raises:

ValueError – If file operations fail.
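Under the assumption that the layer amounts to a z-score normalization scaled by k followed by the staircase activation (the weight-file writing is omitted here), the computation can be sketched as:

```python
import numpy as np

def staircase(x, nb_stairs, hiknot):
    # Piecewise-constant approximation of a sigmoid with nb_stairs steps
    # between -hiknot and +hiknot (simplified stand-in for StairObj).
    x = np.clip(x, -hiknot, hiknot - 1e-12)
    dist = 2.0 * hiknot / nb_stairs
    knots = -hiknot + np.floor((x + hiknot) / dist) * dist
    return 1.0 / (1.0 + np.exp(-knots))

def compute_first_hidden_layer(step, input_data, k, nb_stairs, hiknot,
                               mu=None, sigma=None):
    # Sketch: in train mode, compute mu and sigma from the data;
    # otherwise reuse the ones provided.
    data = np.asarray(input_data, dtype=float)
    if step == "train":
        mu = data.mean(axis=0)
        sigma = data.std(axis=0)
        sigma[sigma == 0] = 1.0  # avoid division by zero
    out = staircase(k * (data - mu) / sigma, nb_stairs, hiknot)
    return (out, mu, sigma) if step == "train" else out
```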

trainings.trnFun.compute_roc(estimator, output_roc, *args)

Compute and save the ROC curve for a given estimator, test and prediction data.

This function calculates the false positive rate (fpr) and true positive rate (tpr) for the provided test classes and predictions. It then generates a ROC curve plot with the computed AUC (Area Under Curve) score and saves it to the specified file.

Parameters:
  • estimator (str) – The name of the estimator used for generating the predictions.

  • output_roc (str) – The file path where the ROC curve plot will be saved.

  • args (tuple) – A variable-length argument list containing the test classes and test predictions. args[0] should be the test classes, and args[1] should be the test predictions.

Returns:

A list containing the false positive rate, true positive rate, and AUC score.

Return type:

list
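The fpr/tpr/AUC computation behind this function can be sketched with scikit-learn. The plot generation and saving are omitted here, and compute_roc_stats is an illustrative name, not the package's API:

```python
from sklearn.metrics import roc_curve, auc

def compute_roc_stats(test_classes, test_preds):
    # Positive-class scores vs. true binary labels; the real compute_roc
    # additionally draws the curve and saves it as a PNG.
    fpr, tpr, _ = roc_curve(test_classes, test_preds)
    return [fpr, tpr, auc(fpr, tpr)]
```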

trainings.trnFun.delete_file(file)

Delete file

Parameters:

file (str) – The name of the file to be deleted.

trainings.trnFun.get_attribute_file(attribute_file, nb_attributes, nb_classes=None)

Reads an attribute file and splits its content into two lists: attributes and classes. The first nb_attributes non-empty lines are stored in attributes, and the remaining non-empty lines in classes. Raises an error if the file does not contain at least nb_attributes non-empty lines, or if it does not contain exactly nb_attributes + nb_classes lines when nb_classes is specified.

Format of attribute_file: one attribute per line, optionally followed by one class per line.

Parameters:
  • attribute_file – Path to the file to be read.

  • nb_attributes – Number of non-empty lines to be included in attributes.

  • nb_classes – Expected number of class lines following the attributes (optional).

Returns:

A tuple of two lists: (attributes, classes).

trainings.trnFun.get_data(file_name, nb_attributes, nb_classes=0, keep_string=False)

Get data from a file and separate it into attributes and classes (if classes are present).

The file should contain one sample per line. Each number in line is separated by a space, a tab, a semicolon or a comma. Each sample can be in one of the following formats:

  1. Attributes only: Each line contains each float attribute.

  2. Attributes with Class ID: Each line contains all the float attributes followed by an integer class ID.

  3. Attributes with One-Hot Class Encoding: Each line contains all the float attributes followed by a one-hot encoding of the class.

    The number of elements in this encoding should match the total number of classes, with exactly one ‘1’ and the rest ‘0’s.

Parameters:
  • file_name (str) – The name of the file to read data from.

  • nb_attributes (int) – The number of attributes.

  • nb_classes (int, optional) – The number of classes, defaults to 0.

  • keep_string (bool) – Whether to keep the data in string format (allows non-numerical values).

Raises:

ValueError – If the file is not found, cannot be opened, or if the data format is incorrect.

Returns:

A tuple containing two lists of list of float, one for data (attributes) and one for classes. Each element in the data list is a list of floats representing the attributes of a sample. Each element in the classes list is an integer representing the class ID.

Return type:

(list[list[float]], list[int])
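A per-line parser covering the three formats above might look like this; parse_data_line is an illustrative helper, not part of the package:

```python
import re

def parse_data_line(line, nb_attributes, nb_classes=0):
    # Values may be separated by spaces, tabs, semicolons or commas.
    values = [v for v in re.split(r"[\s;,]+", line.strip()) if v]
    attributes = [float(v) for v in values[:nb_attributes]]
    rest = values[nb_attributes:]
    if not rest or nb_classes == 0:
        return attributes, None                  # format 1: attributes only
    if len(rest) == 1:
        return attributes, int(rest[0])          # format 2: integer class ID
    one_hot = [int(v) for v in rest]             # format 3: one-hot encoding
    if len(one_hot) != nb_classes or sum(one_hot) != 1:
        raise ValueError("badly formed one-hot class encoding")
    return attributes, one_hot.index(1)
```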

trainings.trnFun.get_data_class(file_name, nb_classes)

Get class data from file. The file can contain class data in two formats:

  1. Class ID format: Each line contains a single integer representing the class ID. The class ID must be a non-negative integer less than the number of classes.

  2. One-Hot Encoding format: Each line contains numbers separated by spaces, tabs, semicolons or commas, representing a one-hot encoding of the class. The number of elements in each line should match the total number of classes, with exactly one ‘1’ and the rest ‘0’s.

Parameters:
  • file_name (str) – The name of the file to read class data from.

  • nb_classes (int) – The number of classes.

Raises:

ValueError – If the file is not found, cannot be opened, or if the data format is incorrect.

Returns:

A list containing class data.

Return type:

list[int]

trainings.trnFun.get_data_pred(file_name, nb_classes)

Load predictions from a prediction file. The prediction file should contain one line per data sample, each line consisting of a series of numerical values separated by spaces, tabs, semicolons or commas, representing the prediction scores for each class. The number of values per line should match the specified number of classes.

Parameters:
  • file_name (str) – The path of the file to read data from.

  • nb_classes (int) – The expected number of float values per line.

Returns:

Predictions, a list of lists, where each inner list contains float values from one line of the file.

Return type:

list[list[float]]

Raises:

ValueError – If the file is not found, cannot be opened, if a line does not contain nb_classes floats, or if any value in a line is not a float.

trainings.trnFun.output_data(data, data_file)

Write the provided data to a specified file.

This function takes a list of lists (data) where each inner list represents a series of values, and writes these values to a file specified by data_file. Each inner list is written to a new line in the file with values separated by spaces.

Parameters:
  • data (list[list[float]] or list[list[int]]) – A list of lists, where each inner list contains values to be written to the file.

  • data_file (str) – The path of the file where the data will be saved.

Raises:

ValueError – If the specified file cannot be found or opened.

trainings.trnFun.output_pred(pred, pred_file, nb_classes)

Save the predictions in one-hot encoded format to a specified file.

This function takes a list of predicted class indices and converts them into one-hot encoded vectors, where each vector has a length equal to the number of classes (nb_classes). Each vector is a row in the output file, with ‘1’ in the position of the predicted class and ‘0’s elsewhere. The output is saved to a file specified by pred_file.

Parameters:
  • pred (list[int]) – A list of predicted class indices.

  • pred_file (str) – The path of the file where the one-hot encoded predictions will be saved.

  • nb_classes (int) – The total number of classes, determining the length of the one-hot encoded vectors.

Raises:

ValueError – If the specified file cannot be found or opened.
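A minimal sketch of the one-hot writer described above (illustrative, not the actual implementation):

```python
def output_pred(pred, pred_file, nb_classes):
    # Each predicted class index becomes a row of 0s with a single 1
    # at the predicted position.
    with open(pred_file, "w") as f:
        for class_id in pred:
            row = ["1" if i == class_id else "0" for i in range(nb_classes)]
            f.write(" ".join(row) + "\n")
```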

trainings.trnFun.output_stats(stats_file, acc_train, acc_test)

Write training and testing accuracy statistics to a specified file.

This function takes the training and testing accuracy values and writes them to the file specified by stats_file. The accuracies are written in percentage format.

Parameters:
  • stats_file (str) – The path of the file where the accuracy statistics will be saved.

  • acc_train (float) – The training accuracy percentage.

  • acc_test (float) – The testing accuracy percentage.

Raises:

ValueError – If the specified file cannot be found or opened.

trainings.trnFun.recurse(tree, node, parent_path, feature_names, output_rules_file, k_dict, from_grad_boost)

Recursively traverse the decision tree to generate rules.

This function walks through the decision tree from a given node down to its leaves. For each node, it constructs a rule as a string and writes the rules for leaf nodes to an output file.

Parameters:
  • tree (_tree.Tree) – The decision tree object.

  • node (int) – The current node index in the tree to start from.

  • parent_path (str) – The rule string constructed up to the current node.

  • feature_names (list[str]) – List of feature names used in the tree.

  • output_rules_file – Opened file object to write the rules.

  • k_dict (dict) – Dictionary holding the count of rules processed.

  • from_grad_boost (bool) – Boolean indicating if the tree is from gradient boosting.
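A simplified version of this traversal using scikit-learn's tree internals; unlike the real function, this sketch collects rule strings in a list instead of writing to a file and ignores the gradient-boosting case:

```python
from sklearn.tree import DecisionTreeClassifier, _tree

def recurse(tree, node, parent_path, feature_names, rules, k_dict):
    # Internal nodes extend the rule string; leaves emit a finished rule.
    if tree.feature[node] != _tree.TREE_UNDEFINED:
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        recurse(tree, tree.children_left[node],
                parent_path + f"{name} <= {threshold:.4f} ",
                feature_names, rules, k_dict)
        recurse(tree, tree.children_right[node],
                parent_path + f"{name} > {threshold:.4f} ",
                feature_names, rules, k_dict)
    else:
        k_dict["k"] += 1  # running count of emitted rules
        rules.append(f"Rule {k_dict['k']}: {parent_path}-> {tree.value[node].argmax()}")
```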

trainings.trnFun.trees_to_rules(trees, rules_file, from_grad_boost=False)

Convert a list of decision trees into human-readable rules and save them to a file.

This function takes a list of decision tree objects and converts each tree into a set of rules. These rules are then written to a specified output file.

Parameters:
  • trees (list[DecisionTreeClassifier or DecisionTreeRegressor]) – List of decision tree objects.

  • rules_file (str) – File path where the rules will be saved.

  • from_grad_boost (bool) – Boolean indicating if the trees are from gradient boosting.

Raises:

ValueError – If the specified file cannot be found or opened.

Module contents

trainings.cnnTrn(args: str = None)

Trains a convolutional neural network (CNN) model using the Keras library, with optional support for popular architectures like ResNet and VGG. It performs data preprocessing that includes resizing, normalization and a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. It accommodates various types of image datasets, including MNIST, CIFAR-10 and CIFAR-100, and allows for extensive customization through command-line arguments. Other data types can also be used.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of classes in the data, as well as the train and test datasets.

  • Validation data can either be specified directly or split from the training data based on a provided ratio.

  • If validation files are given and you want to use the Fidex algorithms later, you will have to use both the train and validation data given here as the train data and classes of Fidex.

  • It’s mandatory to specify the size of the original inputs as well as the number of channels (it should be 3 for RGB and 1 for B&W). The number of attributes is inferred from it.

  • It’s mandatory to choose a model. A large model, a small one, a VGG16 and a ResNet50 are available. You can add any other model you want by modifying the code.

  • It’s mandatory to specify the format of the data values: ‘normalized_01’ if the data are normalized between 0 and 1, ‘classic’ if they are between 0 and 255, and ‘other’ otherwise.

  • Data is reshaped into a 3-channel shape if it has only one channel and a VGG or ResNet model is used.

  • If Fidex is meant to be executed afterward for rule extraction, resizing the inputs beforehand to a smaller size is recommended, as the extraction takes a lot of time because of the number of parameters.

  • It is also possible to resize the inputs only for training with the model_input_size parameter. Training with smaller inputs will give worse results but will save a lot of time.

  • Parameters can be specified using the command line or a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.
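The preprocessing steps from the notes above (normalizing ‘classic’ 0-255 data and repeating a single channel for VGG16/ResNet50) can be sketched as follows; prepare_images and its parameters are illustrative, not cnnTrn's actual internals:

```python
import numpy as np

def prepare_images(flat_data, height, width, nb_channels, data_format,
                   needs_three_channels=False):
    # One flat sample per row, pixels given one after the other.
    x = np.asarray(flat_data, dtype=float).reshape(-1, height, width, nb_channels)
    if data_format == "classic":  # values in [0, 255]
        x = x / 255.0
    # VGG16/ResNet50 expect 3 channels, so grayscale data is repeated.
    if needs_three_channels and nb_channels == 1:
        x = np.repeat(x, 3, axis=-1)
    return x
```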

Outputs:

  • train_valid_pred_outfile : File containing the model’s train and validation (in this order) predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample (input/image) per line, with numbers separated either by spaces, tabs, semicolons or commas. Each pixel must be given one after the other. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.cnnTrn import cnnTrn

cnnTrn('--model small --train_data_file trainData.txt --train_class_file trainClass.txt --test_data_file testData.txt --test_class_file testClass.txt --original_input_size (28,28) --nb_channels 1 --data_format classic --nb_classes 10 --root_folder dimlp/datafiles/Mnist')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments or all arguments for the function, formatted like command-line input. This includes dataset selection, file paths, training parameters, and options for model architecture and output files.

Returns:

Returns 0 for successful execution, -1 for any errors encountered during the process.

trainings.computeRocCurve(args: str = None)

Computes and plots the Receiver Operating Characteristic (ROC) curve for a given set of test predictions and true class labels. The function supports various customizations through command-line arguments, including specifying input files, choosing the positive class index, and output options.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • The function is not compatible with SVM models directly due to the different process required for generating ROC curves for them.

  • It’s mandatory to specify the number of classes, the index of the positive class, and provide the test class labels and prediction scores.

  • Parameters can be specified using the command line or a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

Outputs:

  • stats_file : If specified, contains AUC scores.

  • output_roc : PNG file containing the ROC curve.

File formats:

  • Class file: This file should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

  • Prediction file: This file should contain the prediction scores for the test set, one sample per line, with scores (floats) for each class separated either by spaces, tabs, semicolons or commas.

Example of how to call the function:

from trainings.computeRocCurve import computeRocCurve

computeRocCurve('--test_class_file dataclass2Test.txt --test_pred_file predTest.out --positive_class_index 1 --output_roc roc_curve.png --stats_file stats.txt --root_folder dimlp/datafiles --nb_classes 2')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments or all arguments for the function, formatted like command-line input. This includes file paths, the positive class index, and options for the output and statistical analysis.

Returns:

Returns 0 for successful execution, -1 for any errors encountered during the process. Additionally, it returns an array containing interpolated false positive rates (FPR), true positive rates (TPR), and the area under the ROC curve (AUC) for further analysis or cross-validation purposes.

trainings.gradBoostTrn(args: str = None)

Trains a gradient boosting model, an ensemble of decision trees. The nodes of the trees represent the discriminating hyperplanes used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of gradient boosting parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because decision trees don’t need normalization.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • rules_outfile : File containing the model’s trained rules.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.gradBoostTrn import gradBoostTrn

gradBoostTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --stats_file gb/stats.txt --train_pred_outfile gb/predTrain.out --test_pred_outfile gb/predTest.out --rules_outfile gb/RF_rules.rls --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, gradient boosting parameters, and options for output.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.mlpTrn(args: str = None)

Trains an MLP model with data preprocessing that includes normalization and a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of MLP parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.mlpTrn import mlpTrn

mlpTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --weights_outfile mlp/weights.wts --stats_file mlp/stats.txt --train_pred_outfile mlp/predTrain.out --test_pred_outfile mlp/predTest.out --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')
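Since args may instead be the path to a JSON configuration file, the same call can be expressed through a config file. A sketch, assuming the JSON keys match the command-line flag names without the leading dashes (run the function with -h/--help to confirm the exact names):

```python
# Sketch of the JSON-configuration alternative. The key names below are an
# assumption: they mirror the command-line flags without the leading dashes.
import json, os, tempfile

config = {
    "train_data_file": "datanormTrain.txt",
    "train_class_file": "dataclass2Train.txt",
    "test_data_file": "datanormTest.txt",
    "test_class_file": "dataclass2Test.txt",
    "weights_outfile": "mlp/weights.wts",
    "nb_attributes": 16,
    "nb_classes": 2,
    "root_folder": "dimlp/datafiles",
}

config_path = os.path.join(tempfile.mkdtemp(), "mlp_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Per the docstring, args may then simply be the path to this file:
# mlpTrn(config_path)
```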

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, MLP parameters, and options for output and for the staircase activation process.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.normalization(args: str = None)

This function serves two primary purposes: to normalize data files and to denormalize rule files. It offers flexibility in the normalization process through various options.

Normalization can be performed in several ways:

  1. Using a normalization file (given as normalization_file) containing the normalization parameters, along with one or more data files.

  2. Providing data files directly; the first file is used to compute the mean/median and standard deviation, which are then applied to the other files.

  3. Supplying mean/median (mus) and standard deviations (sigmas) as lists, along with the data files.

In the last two cases, the indices of the attributes to normalize must be provided, and a normalization file is generated for future use.

Denormalization can also be done in multiple ways:

  1. Using a normalization file with one or more rule files.

  2. Directly providing the mean/median (mus) and standard deviations (sigmas) along with the rule files. The indices of the attributes to denormalize must be provided in this case.

The function generates new normalized and/or denormalized files.
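The arithmetic behind both operations is a z-score transform and its inverse, which is what the stored mean/std parameters suggest. A minimal sketch with hypothetical helper names and illustrative values:

```python
# Minimal sketch of the (de)normalization arithmetic, assuming a z-score
# transform; helper names and numeric values are illustrative only.
def normalize(x: float, mu: float, sigma: float) -> float:
    return (x - mu) / sigma

def denormalize(z: float, mu: float, sigma: float) -> float:
    return mu + sigma * z  # inverse of the transform above

# A rule threshold expressed in normalized units maps back to the original
# scale of the attribute (here, an illustrative mean 0.8307, std 0.0425):
print(denormalize(0.5, mu=0.8307, sigma=0.0425))
```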

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes in the data and the symbol representing missing data.

  • Choose whether to replace missing data or not.

  • If normalizing training data, it is advisable to normalize test/validation files simultaneously for consistency.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

When to use:

  • It’s recommended to normalize data before training with Dimlp and dimlpBT.

  • It’s not necessary to normalize data before training with cnnTrn, MLP and SVM because a normalization is done during the process.

  • It’s not necessary to normalize data before training with GradientBoosting and RandomForests because decision trees don’t need normalization.

Outputs:

  • output_normalization_file : File containing the mean and std of the normalized attributes.

  • output_data_files : Files containing the original data files normalized.

  • output_rule_files : Files containing the original rule files denormalized.

File formats:

  • Normalization file: Each line contains the mean/median and standard deviation for an attribute.

    Format: '2 : original mean: 0.8307, original std: 0.0425'

    Attribute indices (index 2 here) can be replaced with attribute names, in which case an attribute file is required.

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Rule files: Contain rules in Dimlp or Fidex format. Formats:

    Dimlp: 'Rule 1: (x2 > 0.785787) (x5 > 0.591247) (x8 < 0.443135) Class = 1 (187)'

    Fidex: 'X1>=0.414584 X10<0.507982 X5>=0.314835 X6>=0.356158 -> class 0'

    In both formats, attribute indices (e.g., X1, x2) and class identifiers can be replaced with attribute names and class names, respectively; in that case an attribute file is required.

  • Attribute file: Each line corresponds to an attribute’s name, with optional class names at the end. Names cannot contain spaces (use underscores instead).
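A sketch of producing and reading lines in the normalization-file format shown above. The exact spacing is inferred from the single example given, and the helper names are hypothetical:

```python
# Sketch: write and parse normalization-file lines in the format
# '2 : original mean: 0.8307, original std: 0.0425'.
# The exact spacing is inferred from that single example.
import re
from statistics import mean, pstdev

def format_line(index, values):
    return (f"{index} : original mean: {mean(values):.4f}, "
            f"original std: {pstdev(values):.4f}")

def parse_line(line):
    m = re.match(r"(\S+) : original mean: ([-\d.]+), original std: ([-\d.]+)", line)
    return m.group(1), float(m.group(2)), float(m.group(3))

line = format_line(2, [0.80, 0.83, 0.86])
print(line)         # 2 : original mean: 0.8300, original std: 0.0245
print(parse_line(line))
```

Per the format note above, the attribute index may be replaced by an attribute name, which the parser accepts as well.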

Examples of how to call the function:

from trainings.normalization import normalization

  • For data files: normalization('--data_files [datanormTrain.txt,datanormTest.txt] --normalization_indices [0,2,4] --nb_attributes 16 --missing_values NaN --root_folder dimlp/datafiles')

  • For rule files: normalization('--normalization_file normalization_stats.txt --rule_files globalRulesDatanorm.txt --nb_attributes 16 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths for the normalization/denormalization process and other options.

Returns:

Returns 0 for successful execution, -1 for errors.

trainings.randForestsTrn(args: str = None)

Trains a random forest model (an ensemble of decision trees). The nodes of the trees represent the discriminating hyperplanes used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of random forest parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because decision trees don’t need normalization.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • rules_outfile : File containing the model’s trained rules.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.randForestsTrn import randForestsTrn

randForestsTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --stats_file rf/stats.txt --train_pred_outfile rf/predTrain.out --test_pred_outfile rf/predTest.out --rules_outfile rf/RF_rules.rls --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, random forest parameters, and options for output.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.

trainings.svmTrn(args: str = None)

Trains an SVM model with data preprocessing that includes normalization and a staircase activation function that allows for the characterization of discriminating hyperplanes, which are used in Fidex. This allows us to then use Fidex for comprehensible rule extraction. The function offers a wide range of customization through command-line arguments, allowing for the specification of SVM parameters, output options, and more.

Notes:

  • Each file is located with respect to the root folder dimlpfidex or to the content of the root_folder parameter if specified.

  • It’s mandatory to specify the number of attributes and classes in the data, as well as the train and test datasets.

  • True train and test class labels must be provided, either within the data file or separately through a class file.

  • Parameters can be defined directly via the command line or through a JSON configuration file.

  • Providing no command-line arguments or using -h/--help displays usage instructions, detailing both required and optional parameters for user guidance.

  • It’s not necessary to normalize data before training because a normalization is done during the process.

Outputs:

  • train_pred_outfile : File containing the model’s train predictions.

  • test_pred_outfile : File containing the model’s test predictions.

  • weights_outfile : File containing the model’s trained weights.

  • stats_file : File containing train and test accuracy.

  • console_file : If specified, contains the console output.

  • output_roc : PNG file containing the ROC curve.

File formats:

  • Data files: These files should contain one sample per line, with numbers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Only attributes (floats).

    2. Attributes (floats) followed by an integer class ID.

    3. Attributes (floats) followed by one-hot encoded class.

  • Class files: These files should contain one class sample per line, with integers separated either by spaces, tabs, semicolons or commas. Supported formats:

    1. Integer class ID.

    2. One-hot encoded class.

Example of how to call the function:

from trainings.svmTrn import svmTrn

svmTrn('--train_data_file datanormTrain.txt --train_class_file dataclass2Train.txt --test_data_file datanormTest.txt --test_class_file dataclass2Test.txt --weights_outfile svm/weights.wts --stats_file svm/stats.txt --train_pred_outfile svm/predTrain.out --test_pred_outfile svm/predTest.out --nb_attributes 16 --nb_classes 2 --root_folder dimlp/datafiles')

Parameters:

args – A single string containing either the path to a JSON configuration file with all specified arguments, or all arguments for the function formatted like command-line input. This includes file paths, SVM parameters, and options for output and for the staircase activation process.

Returns:

Returns 0 for successful execution, -1 for errors encountered during the process.