About
One issue in data classification problems is finding an optimal subset of instances for training a classifier. Training sets that represent the characteristics of each class well are more likely to yield a successful predictor. Instance selection techniques remove examples from the data set so that classifiers are built faster and, in some cases, with better accuracy.
SeleSup, an acronym for selection by suppression, is a simple and fast algorithm (O(n^2)) that reduces the cardinality of training sets by eliminating irrelevant instances, mimicking the self-regulation and suppression mechanisms found in the immune system. Under self-regulation, cells unable to neutralize danger tend to disappear from the organism. By analogy, data that are not relevant to training a classifier are eliminated from the training process.
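The description above does not spell out the individual steps, so the following Python sketch is only one plausible reading of the suppression analogy, not SeleSup's actual implementation. It assumes that a fraction of the (shuffled) instances act as WBCs, that the remaining instances act as probes, that a WBC survives only if it is the nearest WBC to at least one probe of the same class, and that Euclidean distance measures affinity. The function name, the parameter names, and the term "probe" are all hypothetical.

```python
import math
import random


def select_by_suppression(instances, labels, wbc_fraction=0.9, seed=None):
    """Illustrative suppression-style instance selection.

    ASSUMPTION: this is a guess at the general idea, not SeleSup's code.
    Returns the sorted indices of the instances that survive.
    """
    rng = random.Random(seed)
    indices = list(range(len(instances)))
    rng.shuffle(indices)

    # Split the data: the first part plays the role of WBCs,
    # the rest are probes used to test them (hypothetical split rule).
    n_wbc = max(1, min(len(indices), int(len(indices) * wbc_fraction)))
    wbcs, probes = indices[:n_wbc], indices[n_wbc:]

    survivors = set()
    for p in probes:
        # Nearest WBC by Euclidean distance (assumed affinity measure).
        nearest = min(wbcs, key=lambda w: math.dist(instances[w], instances[p]))
        # A WBC that "recognizes" a probe of its own class is kept.
        if labels[nearest] == labels[p]:
            survivors.add(nearest)

    # WBCs that never recognized any probe are suppressed (dropped).
    return sorted(survivors)
```

Each probe is compared against every WBC, so the cost is proportional to the number of WBCs times the number of probes, which is consistent with the quadratic bound quoted above.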
Running SeleSup
Prerequisites
In order to compile SeleSup you must have installed the following tools:
Compiling
Building using an out-of-source approach is recommended. To do so, within the root selesup directory:
cd build
cmake ..
make
This should leave an executable file called selesup in the build directory.
Usage
Usage: selesup [OPTION] <dataset>

General options:
  -f, --wbc-fraction               the fraction of instances used as WBCs [default = 0.9]
  -o file, --output-file file      save selected instances in 'file' [default = stdout]
  -s <n>, --seed <n>               seed for pseudo-random number generation [default = random]
  -r <n>, --random-sampling <n>    do random sampling of size 'n' instead of SeleSup
      --shuffle                    shuffle the data set on-the-fly before running SeleSup
  -v, --verbose                    turn on the verbose mode
      --version                    print SeleSup version and exit
  -h, --help                       print this help and exit
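The -r option replaces SeleSup with plain random sampling, which is useful as a baseline when judging whether the selected instances are better than chance. A minimal Python sketch of such a baseline follows. It assumes sampling without replacement and a seedable generator; how selesup itself implements -r is not documented here, and the function name is hypothetical.

```python
import random


def random_sample(instances, n, seed=None):
    """Pick n instances uniformly at random, without replacement.

    ASSUMPTION: mirrors the idea behind '-r <n>' and '-s <n>',
    not the tool's actual implementation.
    """
    if n > len(instances):
        raise ValueError("sample size exceeds data set size")
    rng = random.Random(seed)
    # Sample positions rather than values, then restore the original order.
    picked = sorted(rng.sample(range(len(instances)), n))
    return [instances[i] for i in picked]
```

Fixing the seed makes the baseline reproducible, so that it can be compared fairly against a SeleSup run on the same data.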
Example
- Selecting instances from the Iris data set:
./selesup -v --shuffle iris.dsff -o iris.dsff.ss
This command creates a reduced data set, iris.dsff.ss, by selecting instances from iris.dsff (assuming this file is in the current directory). The --shuffle option tells SeleSup to shuffle the contents of the input data set before running the algorithm itself.
License
SeleSup is licensed under the GNU General Public License (GPL), Version 3 (June 2007) or later.