“Matlab Topic Modeling Toolbox”, by Mark Steyvers and Tom Griffiths
Datasets:
NIPS
Proceedings from 1988-2000
1,740 papers, 13,649 unique words, 2,301,375 word tokens
13 streams, ranging from 90 to 250 documents per stream
Reuters-21578
News from 26-FEB-1987 to 19-OCT-1987
10,337 documents; 12,112 unique words; 793,936 word tokens
30 streams (29 streams of 340 documents, 1 stream of 517 documents)
Baselines:
OLDAfixed: no memory
OLDA (ω(1)): short memory
Performance Evaluation
Measure: perplexity
Test set: documents of the next year or stream
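The perplexity measure can be illustrated with a minimal sketch, assuming the standard definition for topic models: the exponential of the negative average per-token log-likelihood on the held-out documents. The function name `perplexity` and the inputs `docs`, `theta`, and `phi` are hypothetical names chosen for illustration; the source does not specify how the test-set document-topic proportions are estimated, so they are simply passed in here.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of held-out documents under a topic model (illustrative sketch).

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = assumed proportion of topic k in document d
    phi   : phi[k][w]   = assumed probability of word w under topic k
    """
    num_topics = len(phi)
    log_likelihood = 0.0
    num_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Word probability is a mixture over topics: sum_k theta[d][k] * phi[k][w]
            p = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
            log_likelihood += math.log(p)
            num_tokens += 1
    # Lower perplexity means the model predicts the test stream better
    return math.exp(-log_likelihood / num_tokens)
```

As a sanity check, a model that assigns uniform probability to each of V words yields a perplexity of V, so values well below the vocabulary size indicate that the model has learned useful structure.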