About

This website provides MATLAB code and some details about the Rate Adapting Poisson model. The model was published at ICML 2006. You can find the publication here.
The Rate Adapting Poisson (RAP) model is an undirected probabilistic graphical model suited to learning latent structure in count data. It can be used to find reduced-dimensionality representations of such data, which can subsequently be used by classification or retrieval algorithms. The ICML paper shows that, on some benchmark datasets, the RAP model generates better representations for these tasks than its directed counterpart, Probabilistic Latent Semantic Analysis (pLSI). For more details and a description of the algorithm, please have a look at the paper.
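To give a rough feeling for how such a model maps a count vector to a low-dimensional code, here is a minimal Python sketch. It is not the MATLAB implementation offered on this page. It assumes the usual harmonium form: each hidden unit is switched on with a logistic probability given the word counts, and each word has a Poisson rate that is log-linear in the hidden units. The hidden units are treated as binary here, which may be simpler than the parameterization in the paper. All names (hidden_probabilities, poisson_rates, weights, the biases) are made up for illustration.

```python
import math

def hidden_probabilities(counts, weights, hidden_bias):
    """Assumed form: p(h_j = 1 | x) = sigmoid(b_j + sum_i W[i][j] * x_i)."""
    probs = []
    for j, b in enumerate(hidden_bias):
        activation = b + sum(x * row[j] for x, row in zip(counts, weights))
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs

def poisson_rates(hidden, weights, visible_bias):
    """Assumed form: rate_i = exp(a_i + sum_j W[i][j] * h_j)."""
    return [math.exp(a + sum(w * h for w, h in zip(row, hidden)))
            for a, row in zip(visible_bias, weights)]

# Toy example with 3 words and 2 hidden units (all numbers invented).
counts = [2, 0, 1]
weights = [[0.5, -0.5], [0.0, 1.0], [-1.0, 0.5]]  # weights[i][j]: word i, unit j
hidden_bias = [0.0, 0.0]
visible_bias = [0.0, 0.0, 0.0]

# The hidden probabilities serve as the reduced representation of the document.
latent = hidden_probabilities(counts, weights, hidden_bias)
# Plugging the probabilities back in (instead of sampled states) is a simplification.
rates = poisson_rates(latent, weights, visible_bias)
```

In this sketch the vector latent plays the role of the reduced representation that would be passed on to a classifier or a retrieval system. The paper remains the reference for the exact model and its learning algorithm.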
Below you can see a picture showing a two-dimensional projection of text data.

Code

You can download my MATLAB implementation of the RAP model:

Matlab source for RAP

Just unzip the package and have a look at sampleRun.m to see how to use the code. The function har_learn.m also has extensive help text that may be useful. Please note that you need to install Tom Minka's Lightspeed package. Make sure you compile all files inside it! Besides this model, I also have other implementations which might be helpful. These can be found on my code website.

pLSI - probabilistic latent semantic analysis, including a version of the tempered EM algorithm.
ePCA - exponential family PCA.

And of course there is the wonderful Spider, with implementations of NMF, etc.

Datasets

Text Datasets

For the evaluation of the model we used several benchmark datasets for information retrieval. All of them were obtained from the website of Alessandro Moschitti. For more details and the original sources, please have a look at this website. These corpora were processed using the Rainbow toolbox from Andrew McCallum. In the version offered on this website, switches such as stemming and pruning are ignored without warning. To fix this, I replaced the file lex-simple.c with this version. You can download the preprocessed MATLAB data and the scripts used to generate it:

20Newsgroups - count data, dictionary (scripts used for creation). Details: 10000 words with the highest mutual information with the class variable, Porter stemming, pruned all words with < 3 characters, applied a stopword list.
Ohsumed (Medline) - count data, dictionary (scripts used for creation). Details: Porter stemming, pruned all words with < 3 characters, applied a stopword list, pruned words which occurred only once, pruned words which occur in only one document.
Reuters21578 - count data: train, test, dictionary (scripts used for creation). Details: Porter stemming, pruned all words with < 3 characters, applied a stopword list, pruned words which occurred only once, pruned words which occur in only one document.
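
The pruning rules listed above can be illustrated with a small Python sketch. This is not the Rainbow-based pipeline or the scripts offered for download. It only mimics the rules stated for Ohsumed and Reuters21578 on already tokenized text, and it leaves out Porter stemming and the mutual information selection. The function names, the toy documents and the one-word stopword list are invented for the example.

```python
from collections import Counter

def prune_vocabulary(documents, min_chars=3, stopwords=frozenset()):
    """Return the sorted vocabulary that survives the pruning rules."""
    total = Counter()     # occurrences of each word over the whole corpus
    doc_freq = Counter()  # number of documents that contain each word
    for tokens in documents:
        total.update(tokens)
        doc_freq.update(set(tokens))
    return sorted(
        w for w in total
        if len(w) >= min_chars   # prune words with < 3 characters
        and w not in stopwords   # apply the stopword list
        and total[w] > 1         # prune words which occurred only once
        and doc_freq[w] > 1      # prune words which occur in only one document
    )

def count_matrix(documents, vocabulary):
    """Build one row of word counts per document over the kept vocabulary."""
    index = {w: i for i, w in enumerate(vocabulary)}
    rows = []
    for tokens in documents:
        row = [0] * len(vocabulary)
        for t in tokens:
            if t in index:
                row[index[t]] += 1
        rows.append(row)
    return rows

# Toy corpus, assumed to be tokenized already.
docs = [
    ["poisson", "model", "of", "count", "data"],
    ["count", "data", "the", "model", "model"],
    ["rare", "rare", "the", "data"],
]
vocab = prune_vocabulary(docs, stopwords={"the"})
matrix = count_matrix(docs, vocab)
```

Here "rare" is dropped because it appears in only one document, "of" because it is too short, and "poisson" because it occurs only once. The resulting matrix of counts is the kind of input a count-data model such as RAP works on.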

Computer Vision Datasets

Caltech 4
Caltech 101

References

The Rate Adapting Poisson (RAP) model for Information Retrieval and Object Recognition - Peter V. Gehler, Alex D. Holub and Max Welling, ICML 2006
Exponential Family Harmoniums with an Application to Information Retrieval - Max Welling, Michal Rosen-Zvi and Geoffrey Hinton, NIPS 2004

Contact

pgehler@tuebingen.mpg.de