The Spider

What is it?

  • It's a library of objects in Matlab.
  • It is meant to handle (reasonably) large unsupervised, supervised or semi-supervised machine learning problems.
  • It aims to become a complete research/analysis toolbox: it includes training, testing, model selection, statistical tests, ...
  • (More visualization tools would also be nice, but they are not implemented yet.)
  • Objects can be plugged together: e.g., perform cross validation on the following system: greedy backward feature selection on a fast base algorithm, training an SVM on the selected features for each output of a one-against-the-rest multi-class system, and choosing all hyperparameters with a model selection method.

    What objects are in it?

    See the objects page for a full list of the objects in the spider so far.

    Examples: classification (svm, knn, c45,...), multi-class (voting methods, msvm,...), regression (svr, lms, ridge reg.,...), feature selection (0-norm, via svms, fisher, rfe,...), unsupervised (kmeans, hierarchical, spectral, one class, kpca,...), and more: normalization, cross validation, grid search model selection, wilcoxon test, ...

    How do you plug objects together?

  • Both algorithms and data are objects.
  • All objects have two methods: train and test.
  • A data object has an X (input) component and, usually, a Y (output) component.
  • The train and test methods take a data object and output another data object.
  • train can access the Y component, but test cannot.
  • In general, they transform the X component.
  • E.g., in supervised learning an object transforms the X to make it close to the Y.
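This train/test contract can be sketched as follows (a minimal sketch, assuming the data, svm, train and test objects/functions of the library behave as described above):

```matlab
% Sketch of the train/test contract.
X = rand(50) - 0.5;          % 50 examples drawn at random
Y = sign(sum(X, 2));         % labels derived from the inputs
d = data(X, Y);              % wrap inputs and outputs in a data object

[res, alg] = train(svm, d);  % train returns a result and the trained object
res2 = test(alg, d);         % test transforms X but may not access Y
```

The key point of the design is that res and res2 are themselves data objects, so they can be fed straight into the next object in a chain.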

  • a chain object is a list of objects, each passing its output as input to the next object in the list. Example: a feature selection algorithm feeds its output into an SVM. (Actually you can describe almost everything you want to do, from generating data to normalization of the data to training to measuring loss and then statistical testing, as a chain network.)
  • a param object trains/tests multiple versions of an object with different parameters. Example: train k-NN with differing k.
  • a group object puts several algorithms or data objects into a set.
    Example: training k-NN and SVM as a group produces a group of two outputs. You can train on groups of data or train groups of algorithms on data (or both).


  • some objects, such as get_mean (get the mean of several results) and wilcoxon (the Wilcoxon statistical test), take group objects as input.
    Example: get the mean loss of training an SVM on 10 different splits of the data.
  • a loss object takes data and produces a new data object which stores the loss between the original X and Y components.
  • a cv object takes data and produces a group object containing a data object for each tested fold.
  • a grid_sel object takes a group of algorithms and chooses the one with the best predicted generalization error (measured by another object, e.g., the cv object).

    Can I see a simple example?

    Supposing we wish to train an SVM on some simple data, we can do:

    X=rand(50)-0.5; Y=sign(sum(X,2)); d=data(X,Y) % make simple data

    [res alg]= train(svm,d) % train a support vector machine

    and that's it - the SVM is trained! To test it (here on the same data object; in practice you would pass new data):

    [res]= test(alg,d)
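The result can then be scored with the loss object described earlier (a sketch, assuming loss accepts the data object returned by test):

```matlab
% Sketch: measure the error of the trained SVM on a fresh data object.
Xnew = rand(50) - 0.5;       % new examples of the same dimensionality
Ynew = sign(sum(Xnew, 2));
dnew = data(Xnew, Ynew);

res = test(alg, dnew);       % alg is the trained svm from above
loss(res)                    % store the loss between the X and Y components
```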

    Can I see some more complicated examples?

    These examples use toy, a simple object for generating toy datasets:

  • loss of cross validated error:
    loss(train(cv(svm),toy))


  • mean of cross validated error:
    get_mean(train(cv(svm),toy))


  • to learn with different dimensionality artificial data:
    loss(train(chain({param(toy,'n',[5:5:20]) cv(svm)})))


  • train on 3 datasets:
    train(svm,group({toy toy toy}))


  • feature selection with 0-norm minimization which is then fed into an svm:
    get_mean(train(cv(chain({l0 svm})),toy))


  • learn the parameter k of knn by cross validation, and compare the cross validated error of this algorithm with that of knn for different fixed k (chosen a priori) and of svm:
    k=param(knn,'k',1:5); r=train(cv(group({k gridsel(k) svm})),toy); get_mean(r)


    Once again, why have you made this?

    We think the main reasons for such a library are:

    Sharing code: Once everything is an object, one can easily write another object and use all the other tools with it.

    Easier/faster analysis: Plug your dataset into an object and that's it. One can try combinations of pre-processing, feature selection and algorithms with different hyperparameters without writing new code to do this, just a short script.

    Making large scale analysis possible in Matlab: We have written the core objects so that they can work with large datasets in terms of memory and speed (caching, and removing the memory overhead of pass-by-value), but this is transparent to the user.

    Easier/faster research: Plug in a new object and compare it with existing ones. In the future, we plan to build benchmark objects: you pass in the new object you want to test, and the benchmark object generates problems, compares your algorithm with standard ones and produces results with significance tests.

    Building large systems: Building large systems requires a modular approach, and we have tried to design one. We believe that in the future of machine learning, building big systems that combine many techniques will become more important. Plugging objects together in a well thought out way is one way to do this.

    Framework for more formal analysis: We hope that with several standard tools in the spider "environment" it will be easier to perform statistical tests on the test error, and to be more precise with things like model selection by using objects to do this. It might also help prevent mistakes -- when you rewrite code you can easily introduce a bug.

    Different to other systems: We believe our approach is more object oriented than some other approaches we have seen -- every algorithm is an object with a train and a test method.

    How can I catch the spider?

    Go here!