Class DataSetFid

Defined in File common/cpp/src/dataSet.h

Class Documentation

class DataSetFid

The DataSetFid class handles the loading, parsing, and storage of datasets including data, predictions, and classes.

Public Functions

inline explicit DataSetFid(const std::string &name)

Construct a new DataSetFid object with a name.

DataSetFid(const std::string &name, const std::string &dataFile, const std::string &predFile, int nbAttributes, int nbClasses, double decisionThresh, int indexPositiveCl, const std::string &trueClassFile = "")

Construct a new DataSetFid object using separate data, prediction, and optional class files.

   - The data file should contain the attributes of each sample, and optionally, the class information can also be included in this file.
     If class information is included in the data file, there is no need to provide a separate class file.
   - The prediction file should contain the prediction scores for each class per sample.
   - The class file (optional) should be used only if class information is not included in the data file. It should contain either
     a single class ID or a series of integers in a one-hot encoding scheme per sample.

   Expected file formats:
   - Data file: Each line should contain a series of numerical values representing the attributes of a sample,
     optionally followed by the class information (either as a single integer for class ID or a one-hot encoded vector).
   - Prediction file: Each line should contain a series of numerical values.
   - Class file (optional): Only needed if class information is not included in the data file. Each line should contain either a single class ID
     or a series of integers in a one-hot encoding scheme.

   The numbers on a line are separated by a space, a comma (CSV), a semicolon (;) or a tab.
   The numbers of attributes and classes in the dataset are used to validate the format and content of the data and class files.
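The delimiter rule above can be illustrated with a short, self-contained sketch. `splitValues` is a hypothetical helper written for this example; it is not part of DataSetFid and does not claim to mirror the class's internal parsing.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of DataSetFid: splits one line into numbers,
// treating every character of `separators` as an equivalent delimiter.
std::vector<double> splitValues(const std::string &line,
                                const std::string &separators = " ,;\t") {
    assert(!separators.empty());  // an empty separator set is a caller bug
    std::vector<double> values;
    std::size_t start = line.find_first_not_of(separators);
    while (start != std::string::npos) {
        const std::size_t end = line.find_first_of(separators, start);
        // std::stod throws std::invalid_argument if the token does not start with a number.
        values.push_back(std::stod(line.substr(start, end - start)));
        start = line.find_first_not_of(separators, end);
    }
    return values;
}
```

Because consecutive separators are skipped, a line such as `1, 2` yields two values. Whether DataSetFid accepts mixed or repeated separators on one line is not stated in this documentation.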

Parameters:

name – The name of the dataSet.
dataFile – The data file name.
predFile – The prediction file name.
nbAttributes – The number of attributes.
nbClasses – The number of classes.
decisionThresh – The decision threshold applied when choosing the predicted class (-1 for no threshold).
indexPositiveCl – The index of the positive class to which the decision threshold applies (-1 if no threshold).
trueClassFile – The class file name (optional; defaults to an empty string).

DataSetFid(const std::string &name, const std::string &dataFile, int nbAttributes, int nbClasses, double decisionThresh, int indexPositiveCl)

Construct a new DataSetFid object using a single data file containing data, predictions, and optionally classes.

   The format of each sample in the file is as follows:
   - First Line: Contains data attributes. It may be followed by class information (either as an ID or in one-hot format).
   - Second Line: Contains prediction values.
   - Third Line (optional): Contains class information, present only if it was not included in the first line.
   Samples are separated from each other by a blank line.
   The values on a line are separated by a space, a comma (CSV), a semicolon (;) or a tab.

   The presence and format of class data (ID or one-hot) are inferred based on the structure of the lines in the file.
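As an illustration of the blank-line layout described above, the following sketch groups the lines of such a file into per-sample blocks. `SampleBlock` and `readSampleBlocks` are hypothetical names introduced for this example, and the code is an assumption about how the format can be read, not the class's actual implementation.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One sample: data line, prediction line, and optionally a class line.
using SampleBlock = std::vector<std::string>;

// Hypothetical reader, not part of DataSetFid: groups consecutive non-blank
// lines into blocks, using blank lines as sample separators.
std::vector<SampleBlock> readSampleBlocks(std::istream &in) {
    std::vector<SampleBlock> blocks;
    SampleBlock current;
    std::string line;
    while (std::getline(in, line)) {
        const bool blank = line.find_first_not_of(" \t\r") == std::string::npos;
        if (!blank) {
            current.push_back(line);
        } else if (!current.empty()) {
            blocks.push_back(current);  // a blank line closes the current sample
            current.clear();
        }
    }
    if (!current.empty()) blocks.push_back(current);  // last sample, no trailing blank line
    return blocks;
}

// Convenience overload for in-memory text.
std::vector<SampleBlock> readSampleBlocks(const std::string &text) {
    std::istringstream in(text);
    return readSampleBlocks(in);
}
```

Under this reading, a block of two lines holds data and predictions, and a block of three lines carries the class on its last line. A caller would still need to reject blocks of any other size.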

Parameters:

name – A string containing the name of the dataSet.
dataFile – The name of the data file containing data, predictions and, optionally, classes.
nbAttributes – The number of attributes.
nbClasses – The number of classes.
decisionThresh – The decision threshold applied when choosing the predicted class (-1 for no threshold).
indexPositiveCl – The index of the positive class to which the decision threshold applies (-1 if no threshold).

explicit DataSetFid(const std::string &name, const std::string &weightFile)

Construct a new DataSetFid object using a weight file.

This constructor is capable of handling both single and multiple network weight files.

Parameters:

name – The name of the dataSet.
weightFile – The name of the weight file.

void parseSingleNetwork(std::fstream &fileWts)

Parses a weight file containing a single network’s weights and stores them in the weights vector.

Parameters:

fileWts – Reference to the file stream opened for reading the weight file.

void parseMultipleNetworks(std::fstream &fileWts)

Parses a weight file containing multiple networks’ weights and stores them in the weights vector.

Parameters:

fileWts – Reference to the file stream opened for reading the weight file.

void setDataFromFile(const std::string &dataFile, int nbAttributes, int nbClasses)

Read data from dataFile and store it in datas, and in trueClasses if the file contains class information.

   The file should contain one sample per line. The numbers on a line are separated by a space, a comma (CSV), a semicolon (;) or a tab. Each sample can be in one of the following formats:
   1. Attributes only: Each line contains only the float attributes.
   2. Attributes with Class ID: Each line contains all the float attributes followed by an integer class ID.
   3. Attributes with One-Hot Class Encoding: Each line contains all the float attributes followed by a one-hot encoding of the class.
      The number of elements in this encoding should match the total number of classes, with exactly one '1' and the rest '0's.
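One way to tell the three formats apart is to count the values on a line. The sketch below does exactly that; `LineLayout` and `classifyDataLine` are hypothetical names for this example, and distinguishing the formats purely by value count is an assumption, since the documentation only says that nbAttributes and nbClasses are used for validation.

```cpp
#include <cassert>
#include <cstddef>

enum class LineLayout { AttributesOnly, AttributesWithClassId, AttributesWithOneHot, Invalid };

// Hypothetical classifier, not DataSetFid's code: infers the layout of a data
// line from the number of values it contains.
LineLayout classifyDataLine(std::size_t nbValues, int nbAttributes, int nbClasses) {
    assert(nbAttributes > 0 && nbClasses > 0);
    const std::size_t attrs = static_cast<std::size_t>(nbAttributes);
    const std::size_t classes = static_cast<std::size_t>(nbClasses);
    if (nbValues == attrs) return LineLayout::AttributesOnly;
    // Checked before the one-hot case, so nbClasses == 1 is read as a class ID.
    if (nbValues == attrs + 1) return LineLayout::AttributesWithClassId;
    if (nbValues == attrs + classes) return LineLayout::AttributesWithOneHot;
    return LineLayout::Invalid;
}
```

With 4 attributes and 3 classes, lines of 4, 5 and 7 values map to the three formats, and any other length is rejected.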

Parameters:

dataFile – A string representing the name of the data file. This file should contain data in one of the supported formats.
nbAttributes – The number of attributes.
nbClasses – The number of classes.

void setPredFromFile(const std::string &predFile, int nbClasses, double decisionThreshold = -1, int positiveClassIndex = -1)

Add predictions to the dataset using a prediction file.

   The prediction file should contain one line per data sample, each line consisting of a series of numerical values separated
   by a space, a comma(CSV), a semicolon(;) or a tab representing the prediction scores for each class.
   The number of values per line should match the specified number of classes.
   If a decision threshold is provided, the function uses it to determine the predicted class based on the threshold.
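The documentation does not spell out how the threshold changes the decision. A minimal sketch of one plausible rule follows: pick the positive class when its score reaches the threshold, otherwise fall back to the highest score. `chooseClass` is a hypothetical function, and this rule is an assumption for illustration, not a confirmed description of setPredFromFile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical decision rule (an assumption, not DataSetFid's confirmed logic):
// with a threshold, the positive class wins as soon as its score reaches it;
// otherwise the class with the highest score is chosen.
int chooseClass(const std::vector<double> &scores,
                double decisionThreshold = -1.0,
                int positiveClassIndex = -1) {
    assert(!scores.empty());
    if (decisionThreshold >= 0.0 && positiveClassIndex >= 0) {
        const std::size_t positive = static_cast<std::size_t>(positiveClassIndex);
        assert(positive < scores.size());
        if (scores[positive] >= decisionThreshold) return positiveClassIndex;
    }
    // std::max_element returns the first maximum, so ties go to the lowest index.
    const auto best = std::max_element(scores.begin(), scores.end());
    return static_cast<int>(std::distance(scores.begin(), best));
}
```

Under this rule, scores of 0.45 and 0.55 with a threshold of 0.4 on class 0 select class 0 even though class 1 has the higher score.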

Parameters:

predFile – The prediction file name.
nbClasses – The number of classes.
decisionThreshold – An optional decision threshold applied when choosing the predicted class (-1 for no threshold).
positiveClassIndex – An optional index of the positive class to which the decision threshold applies (-1 if no threshold).

void setClassFromFile(const std::string &classFile, int nbClasses)

Add classes from a specified file into the dataset.

   The class file can contain lines in different formats:
   1. Class ID format: Each line contains a single integer representing the class ID.
   2. One-hot format: Each line contains a sequence of integers in a one-hot encoding scheme,
      where exactly one value is 1 (indicating the class ID) and all others are 0.
   Each number in a line is separated by a space, a comma(CSV), a semicolon(;) or a tab.

   The function determines the format of each line based on the nbClasses parameter and the structure of the data in the line.
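The one-hot rule above (exactly one 1, all other values 0, length equal to nbClasses) can be checked as in the following sketch. `decodeOneHot` is a hypothetical helper for illustration. Returning -1 for invalid input is a choice made for this example, and the documentation does not say how setClassFromFile reports malformed lines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of DataSetFid: returns the class index encoded
// by a one-hot vector, or -1 if the vector is not a valid one-hot encoding.
int decodeOneHot(const std::vector<int> &encoding, int nbClasses) {
    assert(nbClasses > 0);
    if (encoding.size() != static_cast<std::size_t>(nbClasses)) return -1;
    int classId = -1;
    for (std::size_t i = 0; i < encoding.size(); ++i) {
        if (encoding[i] == 1) {
            if (classId != -1) return -1;  // more than one 1
            classId = static_cast<int>(i);
        } else if (encoding[i] != 0) {
            return -1;  // values other than 0 and 1 are not allowed
        }
    }
    return classId;  // still -1 when every value is 0
}
```

A line `0 1 0` with three classes decodes to class 1, while `1 1 0`, `0 0 0` and `0 2 0` are all rejected.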

Parameters:

classFile – The name of the class file. This file should contain class data in one of the supported formats (either class ID or one-hot encoded).
nbClasses – An int specifying the number of classes.

std::vector<std::vector<double>> &getDatas()

Return the samples’ data.

Returns: The samples’ data.

std::vector<int> &getClasses()

Return the classes of the samples.

Returns: The classes of the samples.

bool getHasClasses() const

Return whether the dataset contains classes.

Returns: Whether the dataset contains classes.

std::vector<int> &getPredictions()

Return the predictions of the samples.

Returns: The predictions of the samples.

std::vector<std::vector<double>> &getOutputValuesPredictions()

Return the prediction output values of the samples.

Returns: The prediction output values of the samples.

int getNbClasses() const

Return the number of classes in the dataset.

Returns: The number of classes in the dataset.

int getNbAttributes() const

Return the number of attributes in the dataset.

Returns: The number of attributes in the dataset.

int getNbSamples() const

Return the number of samples in the dataset.

Returns: The number of samples in the dataset.

void setAttributes(const std::string &attributesFile, int nbAttributes, int nbClasses = -1)

Add attributes, and optionally classes, from an attribute file to the dataset.

Parameters:

attributesFile – The attribute file name.
nbAttributes – The number of attributes.
nbClasses – The number of classes (optional).

std::vector<std::string> &getAttributeNames()

Return attribute names.

Returns: Attribute names.

std::vector<std::string> &getClassNames()

Return class names.

Returns: Class names.

bool getHasAttributeNames() const

Return whether the dataset contains attribute names.

Returns: Whether the dataset contains attribute names.

bool getHasClassNames() const

Return whether the dataset contains class names.

Returns: Whether the dataset contains class names.

std::vector<std::vector<std::vector<double>>> getWeights() const

Return the weights.

Returns: The weights.

std::vector<double> getInBiais(int netId) const

Return the biases of the first layer of the network identified by netId.

Returns: The biases of the first layer.

std::vector<double> getInWeights(int netId) const

Return the weights of the first layer of the network identified by netId.

Returns: The weights of the first layer.

int getNbNets() const

Return the number of training networks.

Returns: The number of training networks.