F.A.Q.
======
1. What is it?
--------------
The spider is a set of learning algorithms that have been glued together within an object-oriented MATLAB environment. This allows various standard techniques to be applied when using algorithms, e.g. cross validation, statistical tests, plotting functions, etc.

Version 1.0 contains the following algorithms: SVM, SVR (regression), C4.5, k-NN, LDA, one-vs-rest, RFE, multiplicative update (zero-norm minimization), Golub, stability algorithms, one-class SVM, nu-SVM, and multi-class SVM. Other included functionality: hold-out testing, cross validation, evaluation with various loss functions, and the ability to glue algorithms together, e.g. finding features via correlation coefficients (Golub) and then training an SVM with these features. Finally, one can easily evaluate an algorithm or combination over many different hyperparameters.
2. Why have you made it?
------------------------
We think the main reasons for such a library are:

2.1. Sharing code

Once everything is an object, one can easily write another object and use all the other tools with it.
2.2. Easier/faster analysis

Plug your dataset into an object and that's it. One can try combinations of pre-processing, feature selection, and algorithms with different hyperparameters without writing new code to do this; just a short script.

Also, we have written the core objects so that they can work with large datasets in terms of memory and speed (caching and removing the memory overhead of pass-by-value), but this is transparent to the user.
2.3. Easier/faster research

Plug in a new object and compare it with existing ones.

We plan to build benchmark objects where you pass in the new object you want to test, and the object generates problems, compares your algorithm with standard ones, and produces results with significance tests.
2.4. Building large systems

To build large systems you need a modular approach, and we have thought about a good way to do this. In bioinformatics, large systems seem quite natural, e.g. in secondary structure prediction. Plugging objects together is an easy way to build them.
2.5. Framework for more formal analysis

We hope that with several standard tools in the spider "environment" it will be easier to perform statistical tests on the test error and to be more precise with things like model selection by using objects to do this. Hopefully it will also prevent mistakes -- when you rewrite code you can easily introduce a bug.
3. Is
it different from other systems?
--------------------------------------
Yes. It's more object-oriented -- every algorithm is an object with a train and a test method, and e.g. cross validation is an object as well. Also, we wanted a library in MATLAB which was powerful and not just for toy examples. Alex Smola is developing something as well, but it looks like it is only for kernel algorithms and not for building experiments and plugging objects together so easily, although he may eventually develop it to do that.
4. How can I have a look at it quickly?
----------------------------------------
a) Just unzip into the directory of your choice.

b) Start up MATLAB and execute "spider_init", which is in the spider directory -- this sets a path to the spider directories at MATLAB startup.

c) You can now run one of the demos, e.g. spider/demos/microarray/go.m.
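In full, a quick-start session from the MATLAB prompt might look like this (assuming you unzipped into a directory called "spider"; adjust the paths to wherever you unzipped):

    cd spider                 % the directory you unzipped into
    spider_init               % adds the spider directories to the path
    cd demos/microarray
    go                        % runs the microarray demo (go.m)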
5. Want to know what objects are available?
--------------------------------------------
Type "help spider" for a list. Sorry, not all the help has been written for the individual objects -- we plan to make these available using the MATLAB standard soon.
6. What's an object in MATLAB?
-------------------------------
It's a directory called something like "@knn" (e.g. for k-nearest neighbours) which includes all the methods of that object (which are M-files). Our training objects have a constructor, which sets default hyperparameters and initializes the model, and train and test methods.
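For example, such a directory might be laid out roughly as follows (the file roles are inferred from the @template object described in Q9; check @template for the exact contents):

    @knn/
        knn.m        % constructor: sets default hyperparameters, initializes the model
        training.m   % the training code, invoked via the generic train method
        testing.m    % the testing code, invoked via the generic test method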
7. Want to train and test an algorithm?
----------------------------------------
a) Prepare your data as a matrix of attributes (e.g. X) and a matrix of labels (e.g. Y) such that the rows are the examples.

b) Create a data object:

       d=data(X,Y)

c) Train an algorithm, e.g. svm, knn, or c45:

       [tr a]=train(svm,d)

   tr is the predictions, a is the model that you learnt.

d) Test the algorithm on new data d2:

       tst=test(a,d2,'class_loss')

   using the classification loss function.
Type "help train" and "help test" for a little more information.
8. How do I set hyperparameters of algorithms?
-----------------------------------------------
When an algorithm is initialized, the last parameter is the set of command strings, in a cell array, which set up hyperparameters. E.g. a=svm('C=1') sets hyperparameter C, and a=svm({'C=1','ridge=1e-10'}) sets two hyperparameters. This can also be written as a=svm('C=1;ridge=1e-10'), which is a bit easier -- you just separate the instructions with semicolons. Type "help svm" for a list of its hyperparameters.
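The two forms are interchangeable, and the resulting object trains like any other (here d is a data object as in Q7):

    a = svm({'C=1','ridge=1e-10'});   % cell-array form
    a = svm('C=1;ridge=1e-10');       % equivalent semicolon-separated form
    [tr a] = train(a,d);              % train with these hyperparameters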
9. Want to build your own object so it's easy to use with all the other objects and to share your code with everyone?
----------------------------------------------------------------------------------------------------------------------
Please do it! Take a look at the @template object, which is a simple svm object; copy this directory and just change the training.m and testing.m files and the constructor (which should have the same name as the directory) to make your new algorithm.
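A rough recipe, using a hypothetical algorithm called "mymethod" (the name is made up; copy the exact method signatures from the files in @template rather than writing them from scratch):

    % 1. copy spider/@template to spider/@mymethod
    % 2. rename the constructor template.m to mymethod.m and set your
    %    default hyperparameters in it
    % 3. rewrite training.m and testing.m with your algorithm's logic
    % 4. the generic machinery then works unchanged:
    d = data(X,Y);
    [tr a] = train(mymethod,d);       % your object behaves like any other
    tst = test(a,d2,'class_loss');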
10. What do you mean by plugging objects together?
----------------------------------------------------
Well, there are a few objects included that can explain that. Consider the "param" object. Initializing it with e.g. p=param(svm,'C',[0.1 1 10 100 1000]) is a quick and easy way to train an SVM with different values of C. You just type train(p,d), where d is the data.

Other building blocks include:

* "feed_fwd" -- allows the output of one object to be the input of another, e.g. a preprocessing object passing results to a feature ranking object which then passes results to a classifier, e.g. f=feed_fwd({preprocess rfe svm});

* "alg" -- a set of algorithms;

* "cv" -- for cross validation.

You can use all the building blocks together to create quite complicated constructions where you perform feature selection, preprocessing and so on, tuning different hyperparameters at each stage.
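For example, a minimal sketch (param and feed_fwd are called exactly as above; the cv constructor is assumed here to wrap an algorithm in the same way, so check "help cv" for its exact syntax):

    p = param(svm,'C',[0.1 1 10 100 1000]);  % one svm per value of C
    f = feed_fwd({preprocess rfe svm});      % chain: preprocess -> rfe -> svm
    [tr a] = train(f,d);                     % train the whole chain on data d
    r = train(cv(svm),d);                    % assumed: cv wraps any algorithm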
11. What are the upcoming things to be implemented?
-----------------------------------------------------
Here are some things:

* significance tests
* improve the user interface
* an object for graphs and other visualizations
* an object for turning results into latex tables/graphs!
* an object for making toy examples
* benchmark objects for comparing algorithms/researching
* model selection objects
* preprocessing objects
* subsampling object
* improve some code by implementing it in C
* ridge regression, Nadaraya-Watson, CART?
* multi-class Golub
* PCA
* ECOC, 1-vs-1?
* r2w2 gradient feature selection
* r2w2 kernel choosing algorithms
* kernel Fisher discriminant
12. What are the known "issues"?
---------------------------------
C4.5 is not implemented for Windows yet -- if you want to fix it, you are very welcome!