Developer Guide: How To Set Up And Run An Experiment

This tutorial describes how a developer may set up and execute an experiment using the Preference Learning Toolbox API. The tutorial is divided into 9 main steps (some of which are optional):

  1. Create the experiment
  2. Load the data
  3. Normalize the data (OPTIONAL)
  4. Set up a feature selection method (OPTIONAL)
  5. Set up the preference learning algorithm
  6. Set up an evaluation method (OPTIONAL)
  7. Run the experiment
  8. Save the results (OPTIONAL)
  9. Save the model (OPTIONAL)

Each step in the tutorial is accompanied by an example code snippet. For more detailed information on any of the methods and parameters used in any step of this tutorial, one may consult the corresponding section of the API Reference.

Step 1. Create the experiment

The first step in setting up an experiment is to instantiate the Experiment class.

In [ ]:
from pyplt.experiment import Experiment

exp = Experiment()

Step 2. Load the data

As explained in more detail in the How To Use section, the single file format should be used for problems where a total order of objects exists, whereas the dual file format should be used for problems where a partial order of objects exists.

(i.) Single file format:

The single file format requires only one Comma-Separated Value (CSV) file (with a .csv extension) containing the feature values for each sample in the dataset (one column per feature) and the rating for each sample (in the last column). The file is loaded via the exp.load_single_data() method, which requires that we specify the path from which the file is to be loaded. Optionally, one may also specify which separator the file uses and whether the file contains sample IDs or feature labels.

In [ ]:
exp.load_single_data("sample data sets\\single_synth.csv", has_ids=True, has_fnames=True)
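
For instance, if the dataset were stored in a semicolon-separated file, the separator could be specified explicitly. The following is a minimal sketch assuming a hypothetical file single_synth_semicolon.csv and assuming the keyword argument is named separator (see the API Reference for the exact signature):

In [ ]:
# hypothetical variant: load a semicolon-separated file
# (the file name and the 'separator' parameter name are assumptions)
exp.load_single_data("sample data sets\\single_synth_semicolon.csv",
                     has_ids=True, has_fnames=True, separator=";")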

Set up Rank Derivation Parameters (OPTIONAL)

In the single file format, one may optionally control how the pairwise preferences (ranks) are to be derived from the ratings in the dataset. This may be done via the exp.set_rank_derivation_params() method, through which values for the minimum distance margin (MDM) and memory parameters may be specified. In this example, we set the MDM to 0.0001 and the memory to 2.

In [ ]:
exp.set_rank_derivation_params(mdm=0.0001, memory=2)

(ii.) Dual file format:

The dual file format requires two CSV files: a file containing the objects (samples) and a file containing the pairwise preferences. The objects file is loaded via the exp.load_object_data() method whereas the ranks file is loaded via the exp.load_rank_data() method, both of which operate similarly to the exp.load_single_data() method explained in Step 2(i.) above.

In [ ]:
exp.load_object_data("sample data sets\\objects.csv", has_ids=True, has_fnames=True)
exp.load_rank_data("sample data sets\\ranks.csv", has_ids=False, has_fnames=False)

Step 3. Normalize the data (OPTIONAL)

Next, we may optionally control the way that the data is pre-processed prior to feature selection and preference learning. That is, we may specify if and how the data is normalized with respect to some or all of the features. Currently, PLT offers two normalization methods: MinMax and Z-Score. In this example, we MinMax-normalize the data to the range 0 to 1 with respect to two of the features in the dataset.

In [ ]:
from pyplt.util.enums import NormalizationType

exp.set_normalization([0, 3], NormalizationType.MIN_MAX)
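
Alternatively, the same features could be standardized to zero mean and unit variance via Z-Score normalization. The following is a minimal sketch assuming the enum member is named Z_SCORE (see the API Reference):

In [ ]:
# alternative sketch: Z-Score normalization of features 0 and 3
# (the Z_SCORE enum member name is an assumption)
exp.set_normalization([0, 3], NormalizationType.Z_SCORE)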

Step 4. Set up a feature selection method (OPTIONAL)

Another optional step is to choose whether or not feature selection is to be carried out prior to the preference learning phase and, if so, using which method. Currently, PLT offers only the Sequential Forward Selection (SFS) method. The class for this method is located within the respective module in the pyplt.fsmethods subpackage. Since SFS is a wrapper method, it constructs models using preference learning and uses their prediction accuracy to test how effective a subset of features is. Therefore, using SFS also requires us to select a preference learning algorithm and, optionally, an evaluation method. In this example, we use the SFS method with the RankSVM algorithm (with the RBF kernel and a gamma value of 1).

In [ ]:
from pyplt.fsmethods.sfs import SFS
from pyplt.plalgorithms.ranksvm import RankSVM
from pyplt.util.enums import KernelType

sfs = SFS()
sfs_algorithm = RankSVM(kernel=KernelType.RBF, gamma=1)
exp.set_fs_method(sfs)
exp.set_fs_algorithm(sfs_algorithm)

As set up so far, the SFS method will use the training accuracy resulting from the (RankSVM) training to measure how predictive a given feature set is. Optionally, we can instead set up an evaluation method so that each candidate feature set is assessed by the accuracy of the resulting RankSVM models on unseen data. In this example, we use the Holdout method with the default test proportion parameter (0.3).

In [ ]:
from pyplt.evaluation.holdout import HoldOut

sfs_evaluator = HoldOut(test_proportion=0.3)
exp.set_fs_evaluator(sfs_evaluator)

Step 5. Set up the preference learning algorithm

Whether or not feature selection is applied, we must next choose which algorithm is to be used to infer models of our data during preference learning (modelling). Currently, PLT offers the RankSVM and Backpropagation algorithms. The classes for these algorithms are located within the respective modules in the pyplt.plalgorithms subpackage. Note that if feature selection was applied, only the selected features are used in the modelling phase. In this example, we use the Backpropagation algorithm over a neural network with a hidden layer of 5 neurons (each layer using the sigmoid activation function), trained for 50 epochs.

In [ ]:
from pyplt.plalgorithms.backprop_tf import BackpropagationTF
from pyplt.util.enums import ActivationType

# ann_topology=[5, 1]: a hidden layer of 5 neurons followed by a single output
# neuron; one activation function is specified per layer
pl_algorithm = BackpropagationTF(ann_topology=[5, 1],
                                 activation_functions=[ActivationType.SIGMOID,
                                                       ActivationType.SIGMOID],
                                 epochs=50)
exp.set_pl_algorithm(pl_algorithm)
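
Alternatively, the RankSVM algorithm (instantiated as in Step 4) could be used for the modelling phase instead:

In [ ]:
# alternative: use RankSVM (with the RBF kernel, as in Step 4) for modelling
from pyplt.plalgorithms.ranksvm import RankSVM
from pyplt.util.enums import KernelType

pl_algorithm = RankSVM(kernel=KernelType.RBF, gamma=1)
exp.set_pl_algorithm(pl_algorithm)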

Step 6. Set up an evaluation method (OPTIONAL)

Optionally, we may choose to specify the evaluation method, i.e., the method with which the model(s) inferred via preference learning are tested. Currently, PLT offers the Holdout method and the K-Fold Cross Validation (KFCV) method. The classes for these methods are located within the respective modules in the pyplt.evaluation subpackage. In this example, we use the KFCV method with k=3 (i.e., 3-fold cross validation) such that three models are constructed, each over a different subset of the data, and the average accuracy of the models is considered.

In [ ]:
from pyplt.evaluation.cross_validation import KFoldCrossValidation

pl_evaluator = KFoldCrossValidation(k=3)
exp.set_pl_evaluator(pl_evaluator)
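
Alternatively, the Holdout method (already seen in Step 4) may be used here instead, for example with the same default test proportion:

In [ ]:
# alternative: evaluate the inferred model on a held-out portion of the data
from pyplt.evaluation.holdout import HoldOut

pl_evaluator = HoldOut(test_proportion=0.3)
exp.set_pl_evaluator(pl_evaluator)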

Step 7. Run the experiment

Now that we’ve set up all the experiment components (i.e., the dataset, the data normalization settings, the feature selection method, the preference learning algorithm, and the evaluation method), we can run the experiment by simply calling the exp.run() method. Note that here we may optionally choose to have the data shuffled prior to any pre-processing (normalization). In this example, we shuffle the data without specifying a random seed.

In [ ]:
exp.run(shuffle=True)
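
If reproducible shuffling is desired, a seed may also be passed. The following is a minimal sketch assuming the parameter is named random_state (see the API Reference for the exact signature):

In [ ]:
# hypothetical variant: shuffle the data reproducibly with a fixed seed
# (the 'random_state' parameter name is an assumption)
exp.run(shuffle=True, random_state=42)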

Step 8. Save the results (OPTIONAL)

Once the training (and testing) has been completed, we may choose to save the experiment report, which contains all of the experiment setup details and model performance metrics, in a human-readable Comma-Separated Value (.csv) file at the desired location.

In [ ]:
import time

t = time.time()
exp.save_exp_log(t, path="logs\\my_results.csv")

Step 9. Save the model (OPTIONAL)

Optionally, we may also choose to save the resulting model(s) in a human-readable Comma-Separated Value (.csv) file at the desired location. Since we used K-Fold Cross Validation in the evaluation step, a separate model is inferred for each fold; in this example, we save the model inferred over the second fold of the data (fold_idx=1, since fold indices start at 0).

In [ ]:
import time

t = time.time()
exp.save_model(t, fold_idx=1, path="logs\\my_model.csv")  # fold_idx=1 selects the second fold's model