SeleSup

About

A common problem in data classification is finding a good subset of instances with which to train a classifier. Training sets that represent the characteristics of each class well have a better chance of producing a successful predictor. Instance selection techniques remove examples from the data set so that classifiers are built faster and, in some cases, with better accuracy.

SeleSup, short for selection by suppression, is a simple and fast algorithm (O(n^2)) that reduces the cardinality of a training set by eliminating irrelevant instances. It mimics the self-regulation and suppression mechanisms found in the immune system: cells that are unable to neutralize danger tend to disappear from the organism. By analogy, data that are not relevant to training a classifier are eliminated from the training process.
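
The suppression analogy can be illustrated with a rough C++ sketch. This is not SeleSup's actual source code: the `Instance` type, the function names, the use of Euclidean distance, and the survival rule are all assumptions made for illustration. A fraction of the instances plays the role of white blood cells (WBCs), the remaining instances act as "pathogens", and a WBC is kept only if it is the nearest WBC to at least one pathogen of the same class. The nested loop over pathogens and WBCs is what makes a scheme of this kind quadratic in the worst case.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical instance type: a feature vector plus a class label.
struct Instance {
    std::vector<double> features;
    int label;
};

// Squared Euclidean distance between two feature vectors of equal length.
double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Illustrative suppression-style selection (an assumption, not SeleSup's code).
// The first `wbc_fraction` of `data` are treated as WBCs, the rest as pathogens.
// Each pathogen is presented to its nearest WBC; a WBC survives only if it is
// the nearest WBC of at least one pathogen that carries the same label.
std::vector<Instance> select_by_suppression(const std::vector<Instance>& data,
                                            double wbc_fraction) {
    assert(wbc_fraction > 0.0 && wbc_fraction <= 1.0);
    const std::size_t n_wbc =
        static_cast<std::size_t>(wbc_fraction * data.size());
    std::vector<bool> survives(n_wbc, false);

    for (std::size_t p = n_wbc; p < data.size() && n_wbc > 0; ++p) {
        std::size_t best = 0;
        double best_dist = std::numeric_limits<double>::infinity();
        for (std::size_t w = 0; w < n_wbc; ++w) {
            const double d =
                squared_distance(data[p].features, data[w].features);
            if (d < best_dist) {
                best_dist = d;
                best = w;
            }
        }
        if (data[best].label == data[p].label) {
            survives[best] = true;  // this WBC "neutralized" a pathogen
        }
    }

    // Suppressed WBCs (those that never matched a pathogen) are dropped.
    std::vector<Instance> selected;
    for (std::size_t w = 0; w < n_wbc; ++w) {
        if (survives[w]) {
            selected.push_back(data[w]);
        }
    }
    return selected;
}
```

In this sketch, calling `select_by_suppression(data, 0.9)` would correspond loosely to the default WBC fraction listed in the usage text; shuffling `data` beforehand changes which instances end up as WBCs.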

Running SeleSup

Prerequisites

To compile SeleSup, you need the following tools installed:

make and cmake
a standards-compliant C++ compiler (GCC is highly recommended)

Compiling

An out-of-source build is recommended. To do so, run the following from the root selesup directory:

 cd build
 cmake ..
 make

This should leave an executable file called selesup in the build directory.

Usage

Usage: selesup [OPTION] <dataset>

General options:
  -f, --wbc-fraction
     the fraction of instances used as WBCs [default = 0.9]
  -o file, --output-file file
     save selected instances in 'file' [default = stdout]
  -s <n>, --seed <n>
     seed for pseudo-random number generation [default = random]
  -r <n>, --random-sampling <n>
     do random sampling of size 'n' instead of SeleSup
  --shuffle
     shuffle the data set on-the-fly before running SeleSup
  -v, --verbose
     turn on the verbose mode
  --version
     print SeleSup version and exit
  -h, --help
     print this help and exit

Example

Selecting instances from the Iris data set:
```
./selesup -v --shuffle iris.dsff -o iris.dsff.ss
```
This command creates a reduced data set, iris.dsff.ss, by selecting instances from iris.dsff (assuming this file is in the current directory). The --shuffle option tells SeleSup to shuffle the contents of the input data set before running the algorithm itself, and -v turns on verbose mode.

License

SeleSup is licensed under the GNU General Public License (GPL), version 3 (June 2007) or later.

http://www.gnu.org/licenses/gpl.txt