org.knowceans.topics.simple
Class IldaGibbs

java.lang.Object
  extended by org.knowceans.topics.simple.IldaGibbs
All Implemented Interfaces:
ISimpleGibbs, ISimplePpx, ISimpleQueryGibbs

public class IldaGibbs
extends java.lang.Object
implements ISimpleGibbs, ISimpleQueryGibbs, ISimplePpx

LDA Gibbs sampler with nonparametric prior (HDP):

(m,k | alpha * tau | gamma), k->inf, (k,t | beta)

using Teh et al. (2006) approach for the direct assignment sampler, with modular LDA parametric sampler first published by Griffiths (2002) and explained in Heinrich (2005). For the original LDA paper, see Blei et al. (2002).

The general idea is to retain as much as possible of the standard LDA Gibbs sampler, which is possible by alternatingly sampling the finite case with K + 1 topics and resampling the topic weights taking into account the current assignments of data items to topics and pruning or expanding the topic set accordingly.

I tried to find the (subjectively) best tradeoff between simplicity and the JASA paper (Teh et al. 2006). Therefore I have only used the direct assignment method.

The implementation uses lists instead of primitive arrays, but for performance reasons, this may be changed to have a bound Kmax to allocate fixed-size arrays, similar to a truncated DP.

Caveats: (1) Performance is not a core criterion, and OOP encapsulation is ignored for compactness' sake. (2) Code still uses the likelihood function of LDA, and without the hyperparameter terms.

LICENSE: GPL3, see: http://www.gnu.org/licenses/gpl-3.0.html

References:

D.M. Blei, A. Ng, M.I. Jordan. Latent Dirichlet Allocation. NIPS, 2002

T. Griffiths. Gibbs sampling in the generative model of Latent Dirichlet Allocation. TR, 2002, www-psych.stanford.edu/~gruffydd/cogsci02/lda.ps

G. Heinrich. Parameter estimation for text analysis. TR, 2009, www.arbylon.net/publications/textest2.pdf

G. Heinrich. "Infinite LDA" -- implementing the HDP with minimum code complexity. TN2011/1, www.arbylon.net/publications/ilda.pdf

Y.W. Teh, M.I. Jordan, M.J. Beal, D.M. Blei. Hierarchical Dirichlet Processes. JASA, 101:1566-1581, 2006

Version:
0.94
Author:
(c) 2008-2011 Gregor Heinrich, gregor :: arbylon : net

Field Summary
 int ppstep
          step to increase the sampling array
 
Constructor Summary
IldaGibbs(int[][] w, int[][] wq, int K, int V, double alpha, double beta, double gamma, java.util.Random rand)
          parametrise gibbs sampler
 
Method Summary
 void init()
          initialise Markov chain
 void initq()
          initialise Markov chain for querying
static void main(java.lang.String[] args)
          test driver for mixture network Gibbs sampler
 void packTopics()
          reorders topics such that no gaps exist in the count arrays and topics are ordered with their counts descending.
 double ppx()
           
 void run(int niter)
          run Gibbs sampler
 void runq(int niter)
          query Gibbs sampler.
 java.lang.String toString()
          assemble a string of overview information.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ppstep

public final int ppstep
step to increase the sampling array

See Also:
Constant Field Values
Constructor Detail

IldaGibbs

public IldaGibbs(int[][] w,
                 int[][] wq,
                 int K,
                 int V,
                 double alpha,
                 double beta,
                 double gamma,
                 java.util.Random rand)
parametrise gibbs sampler

Parameters:
w - word tokens
wq - word tokens (testing)
K - initial number of topics: may be 0 if gamma > 0.
V - number of terms
alpha - node A precision (document DP)
gamma - node A precision (root DP), 0 for fixed K: plain LDA.
beta - node B hyperparam
rand - random number generator
Method Detail

main

public static void main(java.lang.String[] args)
test driver for mixture network Gibbs sampler

Parameters:
args -

init

public void init()
initialise Markov chain

Specified by:
init in interface ISimpleGibbs

initq

public void initq()
initialise Markov chain for querying

Specified by:
initq in interface ISimpleQueryGibbs

run

public void run(int niter)
run Gibbs sampler

Specified by:
run in interface ISimpleGibbs
Parameters:
niter - number of Gibbs iterations

runq

public void runq(int niter)
query Gibbs sampler. This assumes the standard LDA model as we know the dimensionality from the training set, therefore topics need to be pruned.

Specified by:
runq in interface ISimpleQueryGibbs
Parameters:
niter - number of Gibbs iterations

packTopics

public void packTopics()
reorders topics such that no gaps exist in the count arrays and topics are ordered with their counts descending. Removes any gap dimensions.


ppx

public double ppx()
Specified by:
ppx in interface ISimplePpx
Returns:
the perplexity of the last query sample.

toString

public java.lang.String toString()
assemble a string of overview information.

Overrides:
toString in class java.lang.Object