org.knowceans.corpus
Class NumCorpus

java.lang.Object
  extended by org.knowceans.corpus.NumCorpus
All Implemented Interfaces:
ICorpus, ISplitCorpus, ITermCorpus
Direct Known Subclasses:
LabelNumCorpus

public class NumCorpus
extends java.lang.Object
implements ICorpus, ITermCorpus, ISplitCorpus

Represents a corpus of documents, using numerical data only.

Author:
heinrich

Constructor Summary
NumCorpus()
           
NumCorpus(Document[] docs, int numTerms, int numWords)
           
NumCorpus(java.lang.String dataFilename)
           
NumCorpus(java.lang.String dataFilename, int readlimit)
          init the corpus with a reduced set of documents
 
Method Summary
 Document getDoc(int index)
           
 int[][] getDocParBounds()
          get array of paragraph start indices of the documents (term-based)
 Document[] getDocs()
           
 int[][][] getDocTermsFreqs()
          get array of document terms and frequencies
 int[][] getDocWordParBounds()
          get array of paragraph start indices of the documents (word-based)
 int[] getDocWords(int m, java.util.Random rand)
          Get the words of document m as a scrambled varseq.
 int[][] getDocWords(java.util.Random rand)
          Get the documents as bag-of-words vectors, i.e., for each document, a scrambled array of term indices is generated.
 int getNumDocs()
           
 int getNumTerms()
           
 int getNumTerms(int doc)
           
 int getNumWords()
           
 int getNumWords(int doc)
           
 int[][] getOrigDocIds()
          get the original ids of documents according to the corpus file read in.
 ICorpus getTestCorpus()
          return the test corpus split
 ICorpus getTrainCorpus()
          return the training corpus split
static void main(java.lang.String[] args)
          test corpus reading and splitting
 void mergeDocPars()
          merge document paragraphs into a single document each.
 void read(java.lang.String dataFilename)
          read a file in "pseudo-SVMlight" format.
 void reduce(int ndocs, java.util.Random rand)
          reduce the size of the corpus to a maximum of ndocs documents.
 void setDoc(int index, Document doc)
           
 void setDocs(Document[] documents)
           
 void split(int order, int split, java.util.Random rand)
          splits two child corpora of size 1/order off the original corpus, which itself is left unchanged (except for storing the splits).
 java.lang.String toString()
           
 void write(java.lang.String pathbase)
          write the corpus to a file.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NumCorpus

public NumCorpus(java.lang.String dataFilename)

NumCorpus

public NumCorpus(java.lang.String dataFilename,
                 int readlimit)
init the corpus with a reduced set of documents

Parameters:
dataFilename -
readlimit -

NumCorpus

public NumCorpus()

NumCorpus

public NumCorpus(Document[] docs,
                 int numTerms,
                 int numWords)
Method Detail

read

public void read(java.lang.String dataFilename)
read a file in "pseudo-SVMlight" format. The format is extended by a paragraph-aware version that repeats the pattern

nterms (term:freq){nterms}

for each paragraph in the document. This way, each paragraph

Parameters:
dataFilename -
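
The repeated pattern nterms (term:freq){nterms} can be illustrated with a small, self-contained parser for a single document line. This is only a sketch under stated assumptions: tokens are taken to be whitespace-separated, each line is taken to hold exactly one document, and the names ParagraphLineSketch, ParsedDoc and parseLine are hypothetical and not part of org.knowceans.corpus. The actual behavior of read may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParagraphLineSketch {

    /** Parsed content of one document line (hypothetical holder, not the library's Document). */
    static final class ParsedDoc {
        final int[] terms;     // term ids in file order
        final int[] freqs;     // frequency of each term, parallel to terms
        final int[] parStarts; // index into terms where each paragraph starts

        ParsedDoc(int[] terms, int[] freqs, int[] parStarts) {
            this.terms = terms;
            this.freqs = freqs;
            this.parStarts = parStarts;
        }
    }

    /** Parses "nterms (term:freq){nterms}" repeated once per paragraph (assumed layout). */
    static ParsedDoc parseLine(String line) {
        String[] tok = line.trim().split("\\s+");
        List<Integer> terms = new ArrayList<>();
        List<Integer> freqs = new ArrayList<>();
        List<Integer> parStarts = new ArrayList<>();
        int i = 0;
        while (i < tok.length) {
            int nterms = Integer.parseInt(tok[i++]); // paragraph header
            parStarts.add(terms.size());             // term-based paragraph start
            for (int k = 0; k < nterms; k++) {
                String[] pair = tok[i++].split(":");
                terms.add(Integer.parseInt(pair[0]));
                freqs.add(Integer.parseInt(pair[1]));
            }
        }
        return new ParsedDoc(toArray(terms), toArray(freqs), toArray(parStarts));
    }

    private static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Two paragraphs: the first with 2 terms, the second with 1 term.
        ParsedDoc d = parseLine("2 0:3 5:1 1 7:2");
        System.out.println(Arrays.toString(d.terms));     // [0, 5, 7]
        System.out.println(Arrays.toString(d.freqs));     // [3, 1, 2]
        System.out.println(Arrays.toString(d.parStarts)); // [0, 2]
    }
}
```

A document without paragraph structure is then simply the case of a single nterms header per line.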

getDocs

public Document[] getDocs()
Returns:

getDocTermsFreqs

public int[][][] getDocTermsFreqs()
get array of document terms and frequencies

Specified by:
getDocTermsFreqs in interface ITermCorpus
Returns:
docs[0 = terms, 1 = frequencies][m][t]
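
To make the documented layout docs[0 = terms, 1 = frequencies][m][t] concrete, the following sketch walks such an array and sums the frequencies of each document. The class and method names are hypothetical, and the array is built by hand rather than obtained from a NumCorpus instance.

```java
import java.util.Arrays;

public class TermsFreqsLayoutSketch {

    /** Sums the frequencies of each document m in an array laid out as docs[0|1][m][t]. */
    static int[] wordsPerDoc(int[][][] docs) {
        int[][] freqs = docs[1]; // index 1 holds the frequencies, index 0 the term ids
        int[] counts = new int[freqs.length];
        for (int m = 0; m < freqs.length; m++) {
            for (int t = 0; t < freqs[m].length; t++) {
                counts[m] += freqs[m][t];
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][][] docs = {
            { {0, 5}, {2} }, // terms of document 0 and document 1
            { {3, 1}, {4} }  // matching frequencies
        };
        System.out.println(Arrays.toString(wordsPerDoc(docs))); // [4, 4]
    }
}
```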

getDocParBounds

public int[][] getDocParBounds()
get array of paragraph start indices of the documents (term-based)

Returns:

getDocWordParBounds

public int[][] getDocWordParBounds()
get array of paragraph start indices of the documents (word-based)

Returns:

mergeDocPars

public void mergeDocPars()
merge document paragraphs into a single document each.


getDoc

public Document getDoc(int index)
Parameters:
index -
Returns:

getDocWords

public int[][] getDocWords(java.util.Random rand)
Get the documents as bag-of-words vectors, i.e., for each document, a scrambled array of term indices is generated.

Specified by:
getDocWords in interface ICorpus
Parameters:
rand - random number generator or null to use standard generator
Returns:

getNumWords

public int getNumWords()
Specified by:
getNumWords in interface ICorpus

getDocWords

public int[] getDocWords(int m,
                         java.util.Random rand)
Get the words of document m as a scrambled varseq. For paragraph-based documents, scrambles the paragraphs separately, preserving their boundaries.

Specified by:
getDocWords in interface ICorpus
Parameters:
m -
rand - random number generator or null to omit shuffling
Returns:

setDoc

public void setDoc(int index,
                   Document doc)
Parameters:
index -
doc -

getNumDocs

public int getNumDocs()
Specified by:
getNumDocs in interface ICorpus
Returns:

getNumTerms

public int getNumTerms()
Specified by:
getNumTerms in interface ICorpus
Returns:

getNumTerms

public int getNumTerms(int doc)

getNumWords

public int getNumWords(int doc)

setDocs

public void setDocs(Document[] documents)
Parameters:
documents -

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

reduce

public void reduce(int ndocs,
                   java.util.Random rand)
reduce the size of the corpus to a maximum of ndocs documents. This should be called directly after loading, as it only reduces the documents and count

Parameters:
ndocs -
rand -
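
How reduce selects the surviving documents is not stated here. A common approach, shown purely as an assumption, is to shuffle the document indices and keep the first min(ndocs, numDocs) of them. ReduceSketch and sampleDocIds are hypothetical names, and the sketch operates on plain indices rather than on Document objects.

```java
import java.util.Arrays;
import java.util.Random;

public class ReduceSketch {

    /** Returns at most ndocs distinct document indices drawn uniformly from 0..numDocs-1. */
    static int[] sampleDocIds(int numDocs, int ndocs, Random rand) {
        int[] ids = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            ids[i] = i;
        }
        // Fisher-Yates shuffle of the index array
        for (int i = numDocs - 1; i > 0; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = ids[i];
            ids[i] = ids[j];
            ids[j] = tmp;
        }
        // Keep everything if the corpus is already small enough.
        return Arrays.copyOf(ids, Math.min(ndocs, numDocs));
    }

    public static void main(String[] args) {
        System.out.println(sampleDocIds(10, 3, new Random(1)).length); // 3
        System.out.println(sampleDocIds(4, 10, new Random(1)).length); // 4
    }
}
```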

split

public void split(int order,
                  int split,
                  java.util.Random rand)
splits two child corpora of size 1/order off the original corpus, which itself is left unchanged (except for storing the splits). The corpora can be retrieved using getTrainCorpus and getTestCorpus after calling this method.

Specified by:
split in interface ISplitCorpus
Parameters:
order - number of partitions
split - 0-based split of corpus returned
rand - random source (null for reusing existing splits)
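
The roles of order and split can be sketched as a cross-validation style partition of document indices: permute the indices, cut them into order contiguous parts, treat part number split as the test set and the remainder as the training set. This is an assumed reading of the parameter descriptions, not the verified behavior of NumCorpus.split, and the sketch always requires a non-null generator. SplitSketch and splitIds are hypothetical names; the result is ordered [training documents, test documents].

```java
import java.util.Arrays;
import java.util.Random;

public class SplitSketch {

    /** Returns {trainIds, testIds} for fold `split` (0-based) out of `order` folds. */
    static int[][] splitIds(int numDocs, int order, int split, Random rand) {
        int[] perm = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            perm[i] = i;
        }
        // Fisher-Yates shuffle so that folds are random subsets
        for (int i = numDocs - 1; i > 0; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = perm[i];
            perm[i] = perm[j];
            perm[j] = tmp;
        }
        // Contiguous fold [from, to) of the permutation is the test set.
        int from = split * numDocs / order;
        int to = (split + 1) * numDocs / order;
        int[] test = Arrays.copyOfRange(perm, from, to);
        int[] train = new int[numDocs - test.length];
        System.arraycopy(perm, 0, train, 0, from);
        System.arraycopy(perm, to, train, from, numDocs - to);
        return new int[][] {train, test};
    }

    public static void main(String[] args) {
        int[][] ids = splitIds(10, 5, 2, new Random(1));
        System.out.println(ids[0].length); // 8 training documents
        System.out.println(ids[1].length); // 2 test documents
    }
}
```

Calling the sketch with the same seed and split = 0 .. order - 1 yields test sets that are disjoint and together cover all documents.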

getTrainCorpus

public ICorpus getTrainCorpus()
return the training corpus split

Specified by:
getTrainCorpus in interface ISplitCorpus
Returns:
the training corpus according to the last splitting operation

getTestCorpus

public ICorpus getTestCorpus()
return the test corpus split

Specified by:
getTestCorpus in interface ISplitCorpus
Returns:
the test corpus according to the last splitting operation

getOrigDocIds

public int[][] getOrigDocIds()
get the original ids of documents according to the corpus file read in. If never split, null.

Specified by:
getOrigDocIds in interface ISplitCorpus
Returns:
[training documents, test documents]

write

public void write(java.lang.String pathbase)
           throws java.io.IOException
write the corpus to a file. TODO: also write document titles and labels (in subclass)

Parameters:
pathbase -
Throws:
java.io.IOException

main

public static void main(java.lang.String[] args)
test corpus reading and splitting

Parameters:
args -