Formerly, the site knowceans.org served as a code repository. This is the mirror site and may become the only one in the future. Most of this code is GPL or LGPL.
latent dirichlet allocation in java:
- lda-j (version 20050325) is a Java 1.5 port of David Blei's lda-c.
- LdaGibbsSampler.java, a working "hack" of the MCMC algorithm for LDA in one Java class.
- See primer on parameter estimation for text
- lda.odc, a WinBUGS script to run LDA and an author-topic model with Gibbs sampling.
- See WinBUGS.
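To give an idea of what a one-class Gibbs sampler for LDA involves, here is a minimal sketch of the standard collapsed Gibbs update. It is not the code of LdaGibbsSampler.java or lda-j; the class name TinyLdaGibbs, its fields and the symmetric hyperparameters alpha and beta are assumptions made for illustration only.

```java
import java.util.Random;

/** Illustrative collapsed Gibbs sampler for LDA (hypothetical class, not part of the packages above). */
final class TinyLdaGibbs {
    final int[][] docs;       // docs[m][n] = term index of the n-th word in document m
    final int V, K;           // vocabulary size, number of topics
    final double alpha, beta; // symmetric Dirichlet hyperparameters (assumed)
    final int[][] z;          // topic assignment of every word
    final int[][] nmk;        // document-topic counts
    final int[][] nkt;        // topic-term counts
    final int[] nk;           // number of words assigned to each topic
    final Random rnd;

    TinyLdaGibbs(int[][] docs, int V, int K, double alpha, double beta, long seed) {
        this.docs = docs; this.V = V; this.K = K;
        this.alpha = alpha; this.beta = beta;
        this.rnd = new Random(seed);
        z = new int[docs.length][];
        nmk = new int[docs.length][K];
        nkt = new int[K][V];
        nk = new int[K];
        // start from uniformly random topic assignments
        for (int m = 0; m < docs.length; m++) {
            z[m] = new int[docs[m].length];
            for (int n = 0; n < docs[m].length; n++) {
                int k = rnd.nextInt(K);
                z[m][n] = k;
                nmk[m][k]++; nkt[k][docs[m][n]]++; nk[k]++;
            }
        }
    }

    /** One sweep: resample the topic of every word given all other assignments. */
    void sweep() {
        double[] p = new double[K];
        for (int m = 0; m < docs.length; m++) {
            for (int n = 0; n < docs[m].length; n++) {
                int t = docs[m][n];
                int old = z[m][n];
                // remove the current word from the counts
                nmk[m][old]--; nkt[old][t]--; nk[old]--;
                // unnormalised full conditional p(z = k | rest)
                double sum = 0;
                for (int k = 0; k < K; k++) {
                    p[k] = (nmk[m][k] + alpha) * (nkt[k][t] + beta) / (nk[k] + V * beta);
                    sum += p[k];
                }
                // draw the new topic by inverting the cumulative sum
                double u = rnd.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (acc < u && k < K - 1) { k++; acc += p[k]; }
                z[m][n] = k;
                nmk[m][k]++; nkt[k][t]++; nk[k]++;
            }
        }
    }

    /** Point estimate of the topic-term distributions from the current counts. */
    double[][] phi() {
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++)
            for (int t = 0; t < V; t++)
                phi[k][t] = (nkt[k][t] + beta) / (nk[k] + V * beta);
        return phi;
    }
}
```

A typical use would be to construct the sampler from an array of term-index documents, call sweep() for a number of burn-in iterations, and then read phi(). Convergence checks, hyperparameter estimation and averaging over several samples are left out of this sketch.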
complex topic models in java:
- Example output of my "meta Gibbs sampler", a system to generate source code for topic models with Dirichlet-multinomial topic levels and complex interrelationships of latent variables.
- See my paper A generic approach to topic models (ECML 2009) for an explanation of generalised inference in complex topic models (alias "mixture networks").
- Example model: 5-level pachinko allocation:
- (1) 5-level PAM Gibbs sampler implements a 5-level pachinko allocation model.
- (2) 5-level PAM with independent samplers, runs much faster but is prone to overfitting.
- (3) 5-level PAM parallel sampling, simple parallelism over documents.
- (4) 5-level PAM parallel independent samplers, a version that combines (2) and (3).
- mixnet specification file (Note: the grammar has been revised compared to the paper above).
- See Li and McCallum's paper Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations (ICML 2006) for an explanation of pachinko allocation, a subset of the models covered by the meta-sampler.
- Example model 2: hierarchical pachinko allocation (HPAM):
- (1) hierarchical PAM (HPAM1), serial implementation
- (3) hierarchical PAM (HPAM1) with parallelisms
- HPAM1 mixnet specification file.
- (1) hierarchical PAM (HPAM2), serial implementation
- (2) hierarchical PAM (HPAM2) with independent samplers
- (3) hierarchical PAM (HPAM2) with parallelisms
- (4) hierarchical PAM (HPAM2) with parallel indep. samplers
- HPAM2 mixnet specification file.
- See Mimno, Li and McCallum's paper Mixtures of Hierarchical Topics with Pachinko Allocation (ICML 2007) for an explanation of hierarchical pachinko allocation.
- See javadoc (preview). More code and information forthcoming. All dependency classes of the example models are contained in knowceans-ilda.
NEW: hierarchical dirichlet processes in java:
- IldaGibbs.java, a simple Java implementation of the direct assignment scheme of the hierarchical Dirichlet process (HDP).
- knowceans-ilda (for the current version number, see readme.txt) contains IldaGibbs.java and all dependent classes, the corresponding finite model sampler LdaGibbs.java (supersedes LdaGibbsSampler.java), a simple method to generate synthetic data that may be visualised, the NIPS example corpus, as well as background documentation on the HDP and LDA models.
- See the readme.txt
- See the javadoc
- See "Infinite LDA" -- Implementing the HDP with minimum code complexity for background on the HDP and a draft description of the implementation.
- See npbayes-r21 by Yee Whye Teh for another implementation in Matlab/C.
- See the original paper by Teh et al.: Hierarchical Dirichlet Processes
markov graph clustering in java:
- knowceans-mcl (version 20060805) provides a Java implementation of Markov graph clustering (MCL), which finds hard clusters in a graph.
- See the javadoc.
- See Stijn van Dongen's PhD thesis (2000).
- See the faster (but much more complex) C implementation.
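The core MCL iteration alternates expansion (squaring the column-stochastic matrix) with inflation (element-wise powers followed by column re-normalisation). The following is a dense-matrix sketch of that loop under simplifying assumptions. It is not the knowceans-mcl code; the class TinyMcl, the convergence threshold and the cluster read-out are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative dense Markov clustering (hypothetical class, not the knowceans-mcl API). */
final class TinyMcl {

    /** Scales every column with a positive sum so that it sums to one. */
    static void normaliseColumns(double[][] m) {
        int n = m.length;
        for (int j = 0; j < n; j++) {
            double sum = 0;
            for (int i = 0; i < n; i++) sum += m[i][j];
            if (sum > 0) for (int i = 0; i < n; i++) m[i][j] /= sum;
        }
    }

    /** Runs expansion and inflation until the matrix stops changing or maxIter is reached. */
    static double[][] run(double[][] adjacency, double inflation, int maxIter) {
        int n = adjacency.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) {
            m[i] = adjacency[i].clone();
            m[i][i] = 1.0; // add self-loops, as is customary for MCL
        }
        normaliseColumns(m);
        for (int iter = 0; iter < maxIter; iter++) {
            // expansion: e = m * m
            double[][] e = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++) {
                    double mik = m[i][k];
                    if (mik == 0) continue;
                    for (int j = 0; j < n; j++) e[i][j] += mik * m[k][j];
                }
            // inflation: element-wise power, then re-normalise the columns
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) e[i][j] = Math.pow(e[i][j], inflation);
            normaliseColumns(e);
            // largest absolute change compared with the previous iterate
            double diff = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) diff = Math.max(diff, Math.abs(e[i][j] - m[i][j]));
            m = e;
            if (diff < 1e-9) break;
        }
        return m;
    }

    /** Reads clusters off the final matrix: each non-empty row lists the members of one cluster. */
    static List<Set<Integer>> clusters(double[][] m) {
        List<Set<Integer>> result = new ArrayList<>();
        for (int i = 0; i < m.length; i++) {
            Set<Integer> c = new TreeSet<>();
            for (int j = 0; j < m.length; j++) if (m[i][j] > 1e-6) c.add(j);
            if (!c.isEmpty() && !result.contains(c)) result.add(c);
        }
        return result;
    }
}
```

Calling TinyMcl.clusters(TinyMcl.run(adjacency, 2.0, 100)) on a symmetric 0/1 adjacency matrix returns the node sets found. The dense O(n^3) expansion is only practical for small graphs; sparse matrices and pruning of small entries are what make MCL usable on larger graphs.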
adaptive rejection sampling in java:
- arms-java (version 20060516) provides a Java port of the adaptive rejection Metropolis sampler (ARMS), which can sample from virtually any univariate distribution.
- See the javadoc.
- See the original C/Fortran implementation by Wally Gilks.
- See the CVS repository of the SourceForge project knowceans.
- Samplers and densities / likelihood functions of various probability distributions as well as a Java port of the Mersenne Twister random generator can be found in the package knowceans-tools.jar (see below).
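ARMS itself is too long to reproduce here, but the idea it builds on can be shown with plain rejection sampling. The sketch below is not ARMS and not the arms-java API: it uses a fixed uniform envelope on a bounded interval, whereas ARMS adapts its envelope to the density and adds a Metropolis step. The class SimpleRejectionSampler and its parameters are hypothetical.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Plain rejection sampling on [lo, hi) with a constant envelope (illustration only, not ARMS). */
final class SimpleRejectionSampler {

    /**
     * Draws one sample from the (possibly unnormalised) density.
     * The caller must guarantee 0 <= density(x) <= bound on [lo, hi),
     * and the density must be positive somewhere, otherwise the loop never ends.
     */
    static double sample(DoubleUnaryOperator density, double lo, double hi,
                         double bound, Random rnd) {
        while (true) {
            double x = lo + (hi - lo) * rnd.nextDouble(); // proposal from the uniform envelope
            double u = bound * rnd.nextDouble();          // uniform height under the envelope
            if (u < density.applyAsDouble(x)) return x;   // accept if the point lies under the density
        }
    }
}
```

For example, SimpleRejectionSampler.sample(x -> x, 0.0, 1.0, 1.0, new Random()) draws from the density proportional to x on the unit interval. The acceptance rate equals the area under the density divided by the area of the envelope, which is why adaptive schemes that tighten the envelope pay off for peaked densities.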
NEW: NIPS topic-modelling dataset
- nips-20110223.zip contains the NIPS0-12 data set in svmlight-style format.
- This data set contains the same 1740 documents by 2037 authors that Sam Roweis provided, in a format that may be read by the knowceans-tools or knowceans-ilda packages, plus semi-automatic extractions from volumes (13), class labels (50), corpus-internal citations (1287) and mentioned authors (21153).
- See the readme.txt
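For readers unfamiliar with the svmlight layout, each line conventionally holds a leading target or label followed by index:value pairs, with an optional comment after a # sign. The sketch below parses one such line. It is an assumption-based illustration, not the reader in knowceans-tools or knowceans-ilda; how the NIPS files actually use the fields is described in the readme.txt, and the class SvmlightLine is hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** One parsed line in svmlight-style format (hypothetical helper, not part of the packages above). */
final class SvmlightLine {
    final String label;                        // first token, kept as text
    final SortedMap<Integer, Double> features; // index -> value, sorted by index

    SvmlightLine(String label, SortedMap<Integer, Double> features) {
        this.label = label;
        this.features = features;
    }

    /** Parses "label index:value index:value ... # optional comment". */
    static SvmlightLine parse(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) line = line.substring(0, hash); // drop the trailing comment
        String[] tok = line.trim().split("\\s+");
        if (tok[0].isEmpty()) throw new IllegalArgumentException("empty line");
        SortedMap<Integer, Double> f = new TreeMap<>();
        for (int i = 1; i < tok.length; i++) {
            int colon = tok[i].indexOf(':');
            if (colon <= 0) throw new IllegalArgumentException("expected index:value, got " + tok[i]);
            f.put(Integer.parseInt(tok[i].substring(0, colon)),
                  Double.parseDouble(tok[i].substring(colon + 1)));
        }
        return new SvmlightLine(tok[0], f);
    }
}
```

In a bag-of-words corpus the indices would typically be term ids and the values term counts, so a document becomes a sparse map from term id to frequency.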
- NEW: knowceans citeseer-fetcher (version 20100406), simple Java code to construct a corpus from the OAI2 site of the CiteSeerX digital library. This rather quickly written code is assumed LGPL. It does not yet clean the high number of duplicates in the document titles and near-duplicates in the author names (for which I plan to add code later).
- See the javadoc.
- Some of the code requires the knowceans-tools package below.
- NEW: knowceans-tools (version 20100823), many Java helper classes I frequently use: command-line parser, runtime stop watch, some statistical distributions, estimators and samplers, helpers for vectors and matrices, perl-like regular expression usage (reduces Java coding), thread pool, special invertible, regex and many-to-many implementations of the Map interface, data output formatters specialised to command-line output (like histograms and dot-encoded numbers) and many more.
NEW: added text corpus handling classes and NIPS0-12 corpus in compatible svmlight format. (NOTE: the classes will soon be deleted from the package, see above for new location.)
java dataset manipulation:
some java basis classes: