java.lang.Object
  org.knowceans.corpus.NumCorpus
public class NumCorpus
Represents a corpus of documents, using numerical data only.
Constructor Summary
Constructor | Description |
---|---|
NumCorpus() | |
NumCorpus(Document[] docs, int numTerms, int numWords) | |
NumCorpus(java.lang.String dataFilename) | |
NumCorpus(java.lang.String dataFilename, int readlimit) | init the corpus with a reduced set of documents |
Method Summary
Return type | Method | Description |
---|---|---|
Document | getDoc(int index) | |
int[][] | getDocParBounds() | get array of paragraph start indices of the documents (term-based) |
Document[] | getDocs() | |
int[][][] | getDocTermsFreqs() | get array of document terms and frequencies |
int[][] | getDocWordParBounds() | get array of paragraph start indices of the documents (word-based) |
int[] | getDocWords(int m, java.util.Random rand) | Get the words of document doc as a scrambled varseq. |
int[][] | getDocWords(java.util.Random rand) | Get the documents as bag-of-words vectors, i.e., per document, a scrambled array of term indices is generated. |
int | getNumDocs() | |
int | getNumTerms() | |
int | getNumTerms(int doc) | |
int | getNumWords() | |
int | getNumWords(int doc) | |
int[][] | getOrigDocIds() | get the original ids of documents according to the corpus file read in. |
ICorpus | getTestCorpus() | return the test corpus split |
ICorpus | getTrainCorpus() | return the training corpus split |
static void | main(java.lang.String[] args) | test corpus reading and splitting |
void | mergeDocPars() | merge document paragraphs into a single document each. |
void | read(java.lang.String dataFilename) | read a file in "pseudo-SVMlight" format. |
void | reduce(int ndocs, java.util.Random rand) | reduce the size of the corpus to ndocs maximum. |
void | setDoc(int index, Document doc) | |
void | setDocs(Document[] documents) | |
void | split(int order, int split, java.util.Random rand) | splits two child corpora of size 1/nsplit off the original corpus, which itself is left unchanged (except for storing the splits). |
java.lang.String | toString() | |
void | write(java.lang.String pathbase) | write the corpus to a file. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public NumCorpus(java.lang.String dataFilename)
public NumCorpus(java.lang.String dataFilename, int readlimit)
dataFilename -
readlimit -
public NumCorpus()
public NumCorpus(Document[] docs, int numTerms, int numWords)
Method Detail |
---|
public void read(java.lang.String dataFilename)
nterms (term:freq){nterms}
for each paragraph in the document. This way, each paragraph
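To make the entry layout `nterms (term:freq){nterms}` concrete, the following is a minimal sketch of a parser for a single such entry. It is not the code used by `read`: the class name `PseudoSvmLightLine`, the whitespace separator, and the returned `{terms, freqs}` layout are assumptions made for illustration, and how several paragraph entries are delimited within one document is not covered here.

```java
import java.util.Arrays;

/** Hypothetical helper: parses one entry of the form "nterms (term:freq){nterms}". */
class PseudoSvmLightLine {

    /** Returns {terms, freqs}; both arrays have length nterms. */
    static int[][] parse(String entry) {
        // Assumption: tokens are separated by whitespace.
        String[] tokens = entry.trim().split("\\s+");
        int nterms = Integer.parseInt(tokens[0]);
        if (tokens.length != nterms + 1) {
            throw new IllegalArgumentException(
                "expected " + nterms + " term:freq pairs, found " + (tokens.length - 1));
        }
        int[] terms = new int[nterms];
        int[] freqs = new int[nterms];
        for (int i = 0; i < nterms; i++) {
            String[] pair = tokens[i + 1].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("malformed pair: " + tokens[i + 1]);
            }
            terms[i] = Integer.parseInt(pair[0]); // term index
            freqs[i] = Integer.parseInt(pair[1]); // occurrences of that term
        }
        return new int[][] {terms, freqs};
    }

    public static void main(String[] args) {
        int[][] parsed = parse("3 0:2 5:1 7:4");
        System.out.println("terms = " + Arrays.toString(parsed[0]));
        System.out.println("freqs = " + Arrays.toString(parsed[1]));
    }
}
```

The leading count lets a reader validate each entry before converting the pairs, which is why the sketch rejects entries whose pair count does not match `nterms`.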
dataFilename -
public Document[] getDocs()
public int[][][] getDocTermsFreqs()
getDocTermsFreqs
in interface ITermCorpus
public int[][] getDocParBounds()
public int[][] getDocWordParBounds()
public void mergeDocPars()
public Document getDoc(int index)
index -
public int[][] getDocWords(java.util.Random rand)
getDocWords
in interface ICorpus
rand - random number generator, or null to use the standard generator
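The conversion from term frequencies to a scrambled bag of words can be sketched for a single document as follows. This is an illustration under stated assumptions, not the class's implementation: `BagOfWords` and `toShuffledWords` are hypothetical names, the document is given as parallel `terms` and `freqs` arrays, and a null `rand` skips shuffling (the convention documented for the single-document variant of `getDocWords`).

```java
import java.util.Random;

/** Hypothetical helper: expands term/frequency pairs into a shuffled word array. */
class BagOfWords {

    static int[] toShuffledWords(int[] terms, int[] freqs, Random rand) {
        // The word count of the document is the sum of its term frequencies.
        int numWords = 0;
        for (int f : freqs) {
            numWords += f;
        }
        // Repeat each term index as often as its frequency says.
        int[] words = new int[numWords];
        int pos = 0;
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < freqs[i]; j++) {
                words[pos++] = terms[i];
            }
        }
        // Fisher-Yates shuffle; skipped when no random source is given.
        if (rand != null) {
            for (int i = words.length - 1; i > 0; i--) {
                int j = rand.nextInt(i + 1);
                int tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        int[] words = toShuffledWords(new int[] {0, 5, 7}, new int[] {2, 1, 4}, new Random(42));
        System.out.println(java.util.Arrays.toString(words));
    }
}
```

Shuffling changes only the order of the word array; the multiset of term indices, and therefore the document's term frequencies, stays the same.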
public int getNumWords()
getNumWords
in interface ICorpus
public int[] getDocWords(int m, java.util.Random rand)
getDocWords
in interface ICorpus
m -
rand - random number generator, or null to omit shuffling
public void setDoc(int index, Document doc)
index -
doc -
public int getNumDocs()
getNumDocs
in interface ICorpus
public int getNumTerms()
getNumTerms
in interface ICorpus
public int getNumTerms(int doc)
public int getNumWords(int doc)
public void setDocs(Document[] documents)
documents -
public java.lang.String toString()
toString
in class java.lang.Object
public void reduce(int ndocs, java.util.Random rand)
ndocs -
rand -
public void split(int order, int split, java.util.Random rand)
split
in interface ISplitCorpus
order - number of partitions
split - 0-based split of corpus returned
rand - random source (null for reusing existing splits)
public ICorpus getTrainCorpus()
getTrainCorpus
in interface ISplitCorpus
public ICorpus getTestCorpus()
getTestCorpus
in interface ISplitCorpus
public int[][] getOrigDocIds()
getOrigDocIds
in interface ISplitCorpus
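One plausible reading of the `order` and `split` parameters is a cross-validation style partition: the documents are shuffled, cut into `order` parts, and part number `split` becomes the test set while the rest form the training set. The sketch below illustrates that reading only; it is an assumption, not the documented behaviour of `split`. The names `CorpusSplitSketch` and `splitIds` are hypothetical, the `{trainIds, testIds}` return layout is a guess at what `getOrigDocIds` might hold, and the reuse of stored splits for a null `rand` is not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: partitions document ids into a training and a test set. */
class CorpusSplitSketch {

    /** Returns {trainIds, testIds} as original document indices. */
    static int[][] splitIds(int numDocs, int order, int split, Random rand) {
        if (order < 1 || split < 0 || split >= order) {
            throw new IllegalArgumentException("need 0 <= split < order");
        }
        // Shuffle the document indices so that each part is a random sample.
        List<Integer> ids = new ArrayList<>();
        for (int m = 0; m < numDocs; m++) {
            ids.add(m);
        }
        Collections.shuffle(ids, rand);
        // Positions [from, to) of the shuffled list form the test part.
        int from = (int) ((long) split * numDocs / order);
        int to = (int) ((long) (split + 1) * numDocs / order);
        int[] test = new int[to - from];
        int[] train = new int[numDocs - test.length];
        int te = 0;
        int tr = 0;
        for (int i = 0; i < numDocs; i++) {
            if (i >= from && i < to) {
                test[te++] = ids.get(i);
            } else {
                train[tr++] = ids.get(i);
            }
        }
        return new int[][] {train, test};
    }

    public static void main(String[] args) {
        int[][] parts = splitIds(10, 5, 2, new Random(1));
        System.out.println("train size = " + parts[0].length + ", test size = " + parts[1].length);
    }
}
```

Keeping the original indices for both parts makes it possible to map documents in either child corpus back to their position in the file that was read.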
public void write(java.lang.String pathbase) throws java.io.IOException
pathbase -
java.io.IOException
public static void main(java.lang.String[] args)
args -