org.knowceans.corpus.analysis
Class VariationOfInformationAnalyser

java.lang.Object
  extended by org.knowceans.corpus.analysis.VariationOfInformationAnalyser
All Implemented Interfaces:
java.io.Serializable

public class VariationOfInformationAnalyser
extends java.lang.Object
implements java.io.Serializable

VariationOfInformationAnalyser (old: TopicCorrelationAnalyser) analyses the distance between the extracted topics and a priori categories.

Probabilities of hierarchical topics need special consideration. If a document has a category on level n in the N-level hierarchy, it should be possible to reflect this by including levels 1..n-1, too. If the option hierup is set, the categories are weighted equally with respect to the document rather than with respect to the topic: if a document has, for instance, a level-2 topic A and a level-1 topic B, the set of topics is expanded to the pdf {A, parent_of_A, B} / 3, as opposed to {.5 * A, .5 * parent_of_A, 1 * B} / 2. Further, the level below the current hierarchy level can be included in the same manner.
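The per-document weighting described above can be illustrated with a minimal sketch. The class HierUpSketch, the method expand and the parentOf map are hypothetical names introduced for illustration only; they are not part of this class, and the actual implementation may treat duplicates or unknown categories differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the equal per-document weighting (hierup). */
class HierUpSketch {

    /**
     * Expands each category with all of its ancestors and gives every
     * resulting entry the same weight. parentOf maps a category to its
     * parent; top-level categories have no entry (assumption of this sketch).
     */
    static Map<String, Double> expand(List<String> cats, Map<String, String> parentOf) {
        List<String> expanded = new ArrayList<>();
        for (String cat : cats) {
            // walk from the category up to the top of the hierarchy
            for (String c = cat; c != null; c = parentOf.get(c)) {
                expanded.add(c);
            }
        }
        Map<String, Double> pdf = new LinkedHashMap<>();
        double weight = 1.0 / expanded.size();
        for (String c : expanded) {
            // entries reached more than once accumulate weight; whether the
            // real class deduplicates instead is not stated in the docs
            pdf.merge(c, weight, Double::sum);
        }
        return pdf;
    }

    public static void main(String[] args) {
        // level-2 topic A (with parent parent_of_A) and level-1 topic B
        Map<String, String> parentOf = Map.of("A", "parent_of_A");
        System.out.println(expand(List.of("A", "B"), parentOf));
    }
}
```

With the example from the text, each of A, parent_of_A and B receives weight 1/3.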

Author:
heinrich
See Also:
Serialized Form

Nested Class Summary
 class VariationOfInformationAnalyser.DistMetric
          DistMetric is a container for the values of the metric.
 
Field Summary
(package private)  org.knowceans.map.HashMultiMap<java.lang.Integer,java.lang.Integer> catDocuments
          sparse transpose of docCategories
private  java.lang.String comment
           
(package private)  int[][] docCategories
          docCategories sparse matrix (will be )
static boolean doDebug
           
private  boolean hierdown
           
private  boolean hierup
           
private  boolean includeunknown
           
(package private)  IptcCategories iptc
          IPTC-Codes
(package private) static double log2
          basis of the logarithm
(package private)  int nCats
          number of categories
(package private)  int nDocs
          number of documents
(package private)  int nTopics
          number of topics
(package private)  int nValidDocs
          number of valid documents
private  java.lang.String outfile
           
private static long serialVersionUID
           
private  double sumPCatDoc
           
(package private)  double[][] theta
          the document--topic associations (theta)
 
Constructor Summary
VariationOfInformationAnalyser(java.lang.String docsfile, java.lang.String thetafile, java.lang.String outfile, java.lang.String comment, boolean hierup, boolean hierdown, boolean includeunknown)
          Creates a VariationOfInformationAnalyser (old: TopicCorrelationAnalyser).
 
Method Summary
private  void checkConsistency()
          checks whether the object has a consistent state.
(package private)  double entropy(double[] p)
          entropy of the distribution
private  void loadCategoryDists(java.lang.String file)
          creates a matrix of a priori probabilities for each document.
static void main(java.lang.String[] args)
           
 VariationOfInformationAnalyser.DistMetric metric()
          "Meila-metric" for a priori and a posteriori relationships.
private  double mutualInfo(double[] pcat, double[] ptopic)
          calculate mutual info for the two clusterings.
private  double mutualInfo(double[] pcat, double[] ptopic, double[][] pjoint)
          calculate mutual info for the two clusterings if pjoint is known.
 double mylog(double arg)
           
 double[] pCat()
          the probability of categories given any document, n_c * sum_d p(c|d)
(package private)  double pCatForDoc(int cat, int doc)
          returns the probability of a category given the document.
private  double[][] pJoint()
          calculate joint probability for the two clusterings.
 double[] pTopic()
          averaged distributions n_d * sum_d p(z|d)
private  void saveCatTopics(double[][] pjoint, double[] pcat, java.lang.String file)
          saves the 10 best topics for each category.
 void saveTopicCats(double[][] pjoint, double[] ptopic, java.lang.String file)
          saves the 10 best categories for each topic: p(r | s) = p(r, s) / p(s)
 void setOutfile(java.lang.String outfile)
           
 void setTheta(double[][] theta)
          Set the current value of theta.
 double sum(double[] v)
           
 double[][] transpose(double[][] mat)
          transpose the matrix
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

private static final long serialVersionUID
See Also:
Constant Field Values

doDebug

public static boolean doDebug

log2

static double log2
basis of the logarithm


docCategories

int[][] docCategories
docCategories sparse matrix (will be )


catDocuments

org.knowceans.map.HashMultiMap<java.lang.Integer,java.lang.Integer> catDocuments
sparse transpose of docCategories


nCats

int nCats
number of categories


nDocs

int nDocs
number of documents


nValidDocs

int nValidDocs
number of valid documents


nTopics

int nTopics
number of topics


theta

double[][] theta
the document--topic associations (theta)


iptc

IptcCategories iptc
IPTC-Codes


outfile

private java.lang.String outfile

comment

private java.lang.String comment

sumPCatDoc

private double sumPCatDoc

hierup

private boolean hierup

hierdown

private boolean hierdown

includeunknown

private boolean includeunknown
Constructor Detail

VariationOfInformationAnalyser

public VariationOfInformationAnalyser(java.lang.String docsfile,
                                      java.lang.String thetafile,
                                      java.lang.String outfile,
                                      java.lang.String comment,
                                      boolean hierup,
                                      boolean hierdown,
                                      boolean includeunknown)
Creates a VariationOfInformationAnalyser (old: TopicCorrelationAnalyser).

Parameters:
docsfile - docs file with category information
thetafile - theta.bin file, or null if only metric(double[][]) is used
outfile - for topic-category table or null
comment - comment in output files
hierup - whether hierarchical concepts (IPTC) aggregate their parents
hierdown - whether hierarchical concepts (IPTC) aggregate their children
includeunknown - whether unknown concept descriptors are considered valid concepts
Method Detail

main

public static void main(java.lang.String[] args)

setTheta

public void setTheta(double[][] theta)
Set the current value of theta.

Parameters:
theta - the document--topic associations

setOutfile

public void setOutfile(java.lang.String outfile)

metric

public VariationOfInformationAnalyser.DistMetric metric()
"Meila-metric" for a priori and a posteriori relationships.

D(X, Y) = H(X) + H(Y) - 2 I(X, Y), with the entropy H(X) = - sum_x p(x) log p(x) and the mutual information I(X, Y) = KL( p(x,y) || p(x)p(y) ), i.e. the KL divergence of the actual joint distribution from the distribution that treats x and y as independent.
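The distance above can be computed directly from a joint distribution, as in the following minimal sketch. The class and method names are hypothetical, the base-2 logarithm is an assumption (suggested only by the log2 field), and the marginals are derived from the joint rather than taken from pCat() and pTopic().

```java
/** Hypothetical sketch: D(X, Y) = H(X) + H(Y) - 2 I(X, Y) from a joint pdf. */
class VariationOfInformationSketch {

    /** Base-2 logarithm; the base is an assumption of this sketch. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** H(X) = - sum_x p(x) log p(x), with 0 log 0 taken as 0. */
    static double entropy(double[] p) {
        double h = 0;
        for (double px : p) {
            if (px > 0) {
                h -= px * log2(px);
            }
        }
        return h;
    }

    /** I(X, Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ). */
    static double mutualInfo(double[] px, double[] py, double[][] pxy) {
        double mi = 0;
        for (int x = 0; x < px.length; x++) {
            for (int y = 0; y < py.length; y++) {
                if (pxy[x][y] > 0) {
                    mi += pxy[x][y] * log2(pxy[x][y] / (px[x] * py[y]));
                }
            }
        }
        return mi;
    }

    /** Computes both marginals from the joint, then the distance. */
    static double distance(double[][] pxy) {
        double[] px = new double[pxy.length];
        double[] py = new double[pxy[0].length];
        for (int x = 0; x < px.length; x++) {
            for (int y = 0; y < py.length; y++) {
                px[x] += pxy[x][y];
                py[y] += pxy[x][y];
            }
        }
        return entropy(px) + entropy(py) - 2 * mutualInfo(px, py, pxy);
    }

    public static void main(String[] args) {
        // identical clusterings: all mass on the diagonal
        System.out.println(distance(new double[][] {{0.5, 0.0}, {0.0, 0.5}}));
        // independent clusterings: joint equals the product of marginals
        System.out.println(distance(new double[][] {{0.25, 0.25}, {0.25, 0.25}}));
    }
}
```

Identical clusterings give a distance of (numerically) zero, while independent clusterings give H(X) + H(Y), since their mutual information vanishes.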


checkConsistency

private void checkConsistency()
checks whether the object has a consistent state.


saveTopicCats

public void saveTopicCats(double[][] pjoint,
                          double[] ptopic,
                          java.lang.String file)
saves the 10 best categories for each topic: p(r | s) = p(r, s) / p(s)

Parameters:
pjoint - joint distribution p(c=r, z=s)
ptopic - topics distribution p(z=s)
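The ranking behind this method can be sketched as follows: condition the joint on a topic, p(r | s) = p(r, s) / p(s), and keep the categories with the largest values. The names below are hypothetical, the pjoint[cat][topic] layout is taken from the pJoint() return description, and the output file format is deliberately left out because it is not documented here.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of ranking categories for one topic by p(r | s). */
class TopCategoriesSketch {

    /**
     * Returns the indices of the k categories r with the largest
     * p(r | s) = p(r, s) / p(s) for topic s, best first.
     */
    static int[] topCategories(double[][] pjoint, double[] ptopic, int s, int k) {
        int nCats = pjoint.length;
        double[] pCond = new double[nCats];
        for (int r = 0; r < nCats; r++) {
            // guard against empty topics to avoid division by zero
            pCond[r] = ptopic[s] > 0 ? pjoint[r][s] / ptopic[s] : 0.0;
        }
        Integer[] order = new Integer[nCats];
        for (int r = 0; r < nCats; r++) {
            order[r] = r;
        }
        // sort category indices by descending conditional probability
        Arrays.sort(order, Comparator.comparingDouble((Integer r) -> pCond[r]).reversed());
        int n = Math.min(k, nCats);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    public static void main(String[] args) {
        double[][] pjoint = {{0.10, 0.20}, {0.30, 0.00}, {0.05, 0.35}};
        double[] ptopic = {0.45, 0.55};
        // the documented method keeps the 10 best; 2 are enough for this toy input
        System.out.println(Arrays.toString(topCategories(pjoint, ptopic, 0, 2)));
    }
}
```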

saveCatTopics

private void saveCatTopics(double[][] pjoint,
                           double[] pcat,
                           java.lang.String file)
saves the 10 best topics for each category. TODO: implement fully.

Parameters:
pjoint - joint distribution p(c=r, z=s)
pcat - categories distribution p(c=r)

pCat

public double[] pCat()
the probability of categories given any document, n_c * sum_d p(c|d)

Returns:
vector

pTopic

public double[] pTopic()
averaged distributions n_d * sum_d p(z|d)

Returns:
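As a rough illustration of the averaging, the sketch below assumes that the normaliser amounts to dividing sum_d p(z|d) by the number of documents, and that theta is laid out as theta[doc][topic]. Both points are assumptions, and the class and method names are hypothetical.

```java
/** Hypothetical sketch: topic distribution averaged over all documents. */
class AverageTopicSketch {

    /** Returns p(z) as the mean of the rows of theta[doc][topic]. */
    static double[] pTopic(double[][] theta) {
        int nDocs = theta.length;
        int nTopics = theta[0].length;
        double[] p = new double[nTopics];
        for (double[] row : theta) {
            for (int z = 0; z < nTopics; z++) {
                p[z] += row[z];
            }
        }
        // normalise by the number of documents (assumed normaliser)
        for (int z = 0; z < nTopics; z++) {
            p[z] /= nDocs;
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] theta = {{0.5, 0.5}, {1.0, 0.0}};
        System.out.println(java.util.Arrays.toString(pTopic(theta)));
    }
}
```

If every row of theta sums to one, the result sums to one as well, so it can be passed on as the topic marginal.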

pCatForDoc

double pCatForDoc(int cat,
                  int doc)
returns the probability of a category given the document.

Parameters:
cat -
doc -
Returns:

entropy

double entropy(double[] p)
entropy of the distribution

Parameters:
p -
Returns:

mutualInfo

private double mutualInfo(double[] pcat,
                          double[] ptopic)
calculate mutual info for the two clusterings. Does not store the joint distribution between the clusterings.

Parameters:
pcat - categories distribution p(c=r)
ptopic - topics distribution p(z=s)
Returns:

mutualInfo

private double mutualInfo(double[] pcat,
                          double[] ptopic,
                          double[][] pjoint)
calculate mutual info for the two clusterings if pjoint is known.

Parameters:
pcat - categories distribution p(c=r)
ptopic - topics distribution p(z=s)
pjoint - joint distribution p(c=r, z=s)
Returns:

pJoint

private double[][] pJoint()
calculate joint probability for the two clusterings.

Returns:
pjoint[cat][topic] with all cats

loadCategoryDists

private void loadCategoryDists(java.lang.String file)
creates a matrix of a priori probabilities for each document.

Parameters:
file -

mylog

public double mylog(double arg)

sum

public double sum(double[] v)

transpose

public double[][] transpose(double[][] mat)
transpose the matrix

Parameters:
mat -
Returns: