java.lang.Object org.knowceans.corpus.LuceneCorpus
public class LuceneCorpus
LuceneTermCorpus creates a TermCorpus interface around a Lucene index. For this, the Lucene index needs a stored field with some document identification (technically not necessarily unique) and a term vector field with the content. This implementation accesses the fields of the Lucene index directly (hence its name).
The corpus can split the vocabulary of the Lucene index by a document frequency (df) threshold: terms whose df reaches minDf go into the regular term index, the remaining terms into a separate low-df index.
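The df split described above can be illustrated without Lucene. The following is a minimal sketch, not the class's actual implementation: the names DfSplitSketch, SplitIndex and split are hypothetical, plain maps stand in for termIndex and termIndexLowDf, and both the `>=` comparison against minDf and the alphabetical id order are assumptions. It only shows the numbering scheme documented for the two indices, where low-df ids continue above the regular ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DfSplitSketch {

    /** Hypothetical stand-in for the pair termIndex / termIndexLowDf. */
    static final class SplitIndex {
        final Map<String, Integer> regular = new LinkedHashMap<>();
        final Map<String, Integer> lowDf = new LinkedHashMap<>();
    }

    /**
     * Split a term -> document frequency map by a threshold.
     * Assumption: a term with df >= minDf belongs to the regular index.
     */
    static SplitIndex split(Map<String, Integer> docFreq, int minDf) {
        SplitIndex idx = new SplitIndex();
        List<String> ignored = new ArrayList<>();
        // Iterate in sorted order so that the ids are deterministic (assumption).
        for (Map.Entry<String, Integer> e : new TreeMap<>(docFreq).entrySet()) {
            if (e.getValue() >= minDf) {
                idx.regular.put(e.getKey(), idx.regular.size());
            } else {
                ignored.add(e.getKey());
            }
        }
        // Low-df ids start at nTerms, so both indices share one id space.
        int nTerms = idx.regular.size();
        for (String term : ignored) {
            idx.lowDf.put(term, nTerms + idx.lowDf.size());
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new TreeMap<>();
        df.put("corpus", 5);
        df.put("index", 3);
        df.put("zygote", 1);
        SplitIndex idx = split(df, 2);
        System.out.println("regular: " + idx.regular);
        System.out.println("low-df:  " + idx.lowDf);
    }
}
```

Because the two id ranges do not overlap, a single term-document matrix over both indices remains possible.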
Field Summary | | |
---|---|---|
protected java.lang.String | contentField | Lucene index field to extract the corpus information from. |
protected java.lang.String | docNamesField | Lucene index field to read the document names from. |
protected java.util.ArrayList<java.lang.Integer> | emptyDocs | |
protected static int | INDEX_UNKNOWN | |
protected java.lang.String | indexpath | |
protected org.apache.lucene.index.IndexReader | ir | |
protected int | minDf | Minimum document frequency for terms allowed in the regular term index. |
protected int | nTerms | Number of terms in the regular term index. |
protected int | nTermsLowDf | Number of terms in the lowDf index. |
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> | termIndex | Index of term<->id. |
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> | termIndexLowDf | Index of lowDf term<->id; the ids lie above those of the termIndex, i.e., one term-document matrix could be created if necessary. |
private boolean | useLowDf | |
Constructor Summary | |
---|---|
LuceneCorpus(java.lang.String path, java.lang.String docNamesField) | Initialise the corpus with just access to the IndexReader. |
LuceneCorpus(java.lang.String indexPath, java.lang.String docNameField, java.lang.String contentField, int minDf, boolean useLowDf) | Create a term corpus from the index found at the path, using the content field for the terms and the docNameField for the document names. |
Method Summary | | |
---|---|---|
protected java.util.ArrayList<java.lang.String> | buildTermIndex(boolean useIgnored) | Create the term index from the Lucene index. |
protected void | buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms) | Create the hash map from the vector of low-df terms. |
protected void | extract() | Initialise the corpus by extracting the files from the index. |
java.util.Map<java.lang.Integer,java.lang.Integer> | getDocTerms(int doc) | Get the document terms as a frequency map id->frequency. |
private java.util.Vector<java.lang.Integer> | getDocWords(int doc, java.util.Random rand) | Get the words of document doc as a scrambled sequence. |
int[][] | getDocWords(java.util.Random rand) | Get the documents as bags of words, i.e., per document, a scrambled array of term indices is generated. |
int[] | getDocWords(java.lang.String string) | Get the words of an unknown document as a scrambled sequence. |
int | getNdocs() | Number of documents in the corpus. |
int | getNterms() | Number of terms in the corpus. |
int | getNwords(int doc) | |
boolean | isEmptyDoc(int doc) | Whether this document is empty after filtering. |
java.lang.String | lookup(int term) | Get the string for the particular index, either from the regular index or from the lowDf index. |
int | lookup(java.lang.String term) | Get the index of the particular term, either from the regular index or from the lowDf index; the latter results in an index >= nTerms. |
java.lang.String | lookupDoc(int doc) | Get the document name from its id. |
int | lookupDoc(java.lang.String docName) | Get the document index of the document with string id docName. |
private boolean | ok(java.lang.String string) | |
protected void | setupIndex(boolean useIgnored) | Creates term map and counts. |
void | writeCorpus(java.lang.String filebase) | Write the corpus to the file. |
void | writeDocList(java.lang.String file) | Write the document titles to a file (one doc per line). |
void | writeVocabulary(java.lang.String file, boolean sort) | Write the vocabulary to the file. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final int INDEX_UNKNOWN
protected java.lang.String indexpath
protected org.apache.lucene.index.IndexReader ir
protected java.util.ArrayList<java.lang.Integer> emptyDocs
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndex
protected org.knowceans.map.IBijectiveMap<java.lang.String,java.lang.Integer> termIndexLowDf
protected int minDf
protected int nTerms
protected int nTermsLowDf
protected java.lang.String contentField
protected java.lang.String docNamesField
private boolean useLowDf
Constructor Detail |
---|
public LuceneCorpus(java.lang.String indexPath, java.lang.String docNameField, java.lang.String contentField, int minDf, boolean useLowDf) throws java.io.IOException
The content field must have been indexed.
The doc names field must have been stored.
indexPath
- docNameField
- contentField
- minDf
- useLowDf
-
java.io.IOException
public LuceneCorpus(java.lang.String path, java.lang.String docNamesField) throws java.io.IOException
path
- docNamesField
-
java.io.IOException
Method Detail |
---|
protected void extract() throws java.io.IOException
java.io.IOException
protected void setupIndex(boolean useIgnored) throws java.io.IOException
useIgnored
-
java.io.IOException
protected java.util.ArrayList<java.lang.String> buildTermIndex(boolean useIgnored) throws java.io.IOException
useIgnored
-
java.io.IOException
private boolean ok(java.lang.String string)
string
-
protected void buildTermIndexLowDf(java.util.ArrayList<java.lang.String> ignoredTerms)
ignoredTerms
-
public java.lang.String lookupDoc(int doc)
ITermCorpus
lookupDoc
in interface ITermCorpus
public int lookupDoc(java.lang.String docName)
lookupDoc
in interface ITermCorpus
docName
-
public java.util.Map<java.lang.Integer,java.lang.Integer> getDocTerms(int doc)
ITermCorpus
getDocTerms
in interface ITermCorpus
public int[][] getDocWords(java.util.Random rand)
getDocWords
in interface ITermCorpus
rand
- random number generator or null to use standard generator
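How a frequency map such as the one returned by getDocTerms could be turned into a scrambled array of term indices is sketched below. This is an illustration under assumptions, not the library's code: BagOfWordsSketch and scramble are hypothetical names, and "standard generator" is interpreted here as a freshly created java.util.Random when rand is null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

public class BagOfWordsSketch {

    /**
     * Expand a term id -> frequency map into a shuffled sequence of term ids.
     * Each id appears as often as its frequency states.
     */
    static int[] scramble(Map<Integer, Integer> termFreqs, Random rand) {
        // Assumption: null means "use a default generator".
        Random generator = (rand != null) ? rand : new Random();
        List<Integer> words = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                words.add(e.getKey());
            }
        }
        // Shuffling discards word order, which is what a bag of words implies.
        Collections.shuffle(words, generator);
        int[] result = new int[words.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = words.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> termFreqs = new TreeMap<>();
        termFreqs.put(0, 2); // term 0 occurs twice
        termFreqs.put(3, 1); // term 3 occurs once
        int[] words = scramble(termFreqs, new Random(42));
        System.out.println(java.util.Arrays.toString(words));
    }
}
```

Passing a seeded Random makes the scrambled order reproducible across runs, which helps when comparing experiments.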
private java.util.Vector<java.lang.Integer> getDocWords(int doc, java.util.Random rand)
It seems that the getDocTerms... loop scales badly. Use LuceneMapCorpus for larger documents.
doc
- rand
- random number generator or null to use standard generator
public int[] getDocWords(java.lang.String string)
string
-
public java.lang.String lookup(int term)
lookup
in interface ITermCorpus
term
- term index
public int lookup(java.lang.String term)
lookup
in interface ITermCorpus
term
- string
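The two lookup directions over a regular and a lowDf index can be pictured with the following sketch. It is a simplified stand-in rather than the class's code: TwoTierLookupSketch is a hypothetical name, lists plus a hash map replace IBijectiveMap, the value -1 for INDEX_UNKNOWN is an assumption (the page does not state the constant's value), and returning null for an unknown id is likewise assumed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoTierLookupSketch {

    /** Hypothetical value; the actual constant's value is not documented here. */
    static final int INDEX_UNKNOWN = -1;

    private final List<String> regular;
    private final List<String> lowDf;
    private final Map<String, Integer> ids = new HashMap<>();

    TwoTierLookupSketch(List<String> regular, List<String> lowDf) {
        this.regular = regular;
        this.lowDf = lowDf;
        for (int i = 0; i < regular.size(); i++) {
            ids.put(regular.get(i), i);
        }
        // Low-df terms receive ids at or above nTerms (= regular.size()).
        for (int i = 0; i < lowDf.size(); i++) {
            ids.put(lowDf.get(i), regular.size() + i);
        }
    }

    /** Term string to id; low-df terms yield an id >= nTerms. */
    int lookup(String term) {
        Integer id = ids.get(term);
        return (id == null) ? INDEX_UNKNOWN : id;
    }

    /** Id to term string, checking the regular range first, then the low-df range. */
    String lookup(int id) {
        if (id >= 0 && id < regular.size()) {
            return regular.get(id);
        }
        int low = id - regular.size();
        if (low >= 0 && low < lowDf.size()) {
            return lowDf.get(low);
        }
        return null; // assumption: unknown ids map to null
    }

    public static void main(String[] args) {
        TwoTierLookupSketch index = new TwoTierLookupSketch(
                Arrays.asList("corpus", "index"), Arrays.asList("zygote"));
        System.out.println(index.lookup("zygote"));
        System.out.println(index.lookup(0));
    }
}
```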
public int getNdocs()
ITermCorpus
getNdocs
in interface ITermCorpus
public int getNterms()
ITermCorpus
getNterms
in interface ITermCorpus
public int getNwords(int doc)
public final boolean isEmptyDoc(int doc)
doc
-
public void writeDocList(java.lang.String file) throws java.io.IOException
file
-
java.io.IOException
public void writeCorpus(java.lang.String filebase) throws java.io.IOException
filebase
-
java.io.IOException
public void writeVocabulary(java.lang.String file, boolean sort) throws java.io.IOException
file
- sort
- sorts the vocabulary in alphabetical order
java.io.IOException