org.knowceans.corpus.parsers
Class TxtParser

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by org.knowceans.corpus.parsers.TxtParser
All Implemented Interfaces:
org.xml.sax.ContentHandler, org.xml.sax.DTDHandler, org.xml.sax.EntityResolver, org.xml.sax.ErrorHandler

public class TxtParser
extends org.xml.sax.helpers.DefaultHandler

TxtParser parses a set of plain text files into a TextCorpus, treating each file as one document. This implementation uses the "old" parser code rather than Lucene or similar libraries.

Author:
heinrich

Field Summary
private  java.util.Vector<SimpleDocument> allDocs
           
private  java.io.BufferedWriter bw
           
private  int nr
           
private  java.lang.String prevWord
           
private  Stemmer stem
           
private  StopWordFilter stop
           
 boolean useBigrams
           
 boolean useStemming
           
 boolean useUnigrams
           
private  java.lang.String xmlfile
           
 
Constructor Summary
TxtParser()
           
TxtParser(java.lang.String stoplist)
           
 
Method Summary
private  void closeOutfile()
           
 void configure(boolean useStemming, boolean useUnigrams, boolean useBigrams)
          Configures the parser.
private  boolean isValid(java.lang.String sourcefile)
          True for txt files.
static void main(java.lang.String[] argv)
           
private  void openOutfile()
           
 java.util.Vector<SimpleDocument> parse(java.lang.String file, int mindl)
          Opens the file and parses its content as one document.
private  java.util.Vector<SimpleDocument> parseDir(java.lang.String sourcefile, int mindl)
          Parses a directory by adding each XML file's content sequentially.
private  void parseString(java.lang.String file, java.lang.String string, int mindl)
          Parses the string.
private  int parseText(java.lang.String s, java.util.Vector<java.lang.String> words)
          Parses the given text and adds terms to the model.
private  java.lang.String removePunct(java.lang.String s)
          Removes all punctuation.
private  void setXmlOutput(java.lang.String xmlfile)
           
private  void writeText(java.lang.String file, int id, java.lang.String text)
           
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
characters, endDocument, endElement, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startDocument, startElement, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stop

private StopWordFilter stop

useStemming

public boolean useStemming

useBigrams

public boolean useBigrams

useUnigrams

public boolean useUnigrams

prevWord

private java.lang.String prevWord

nr

private int nr

stem

private Stemmer stem

allDocs

private java.util.Vector<SimpleDocument> allDocs

bw

private java.io.BufferedWriter bw

xmlfile

private java.lang.String xmlfile
Constructor Detail

TxtParser

public TxtParser()

TxtParser

public TxtParser(java.lang.String stoplist)
Parameters:
stoplist - stop word list used to set up the stop-word filter
Method Detail

main

public static void main(java.lang.String[] argv)

setXmlOutput

private void setXmlOutput(java.lang.String xmlfile)

configure

public void configure(boolean useStemming,
                      boolean useUnigrams,
                      boolean useBigrams)
Configures the parser.

Parameters:
useStemming - whether to stem terms
useUnigrams - whether to emit unigrams
useBigrams - whether to emit bigrams
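As an illustration of what the useUnigrams and useBigrams flags could control, here is a minimal sketch of unigram and bigram generation over a token sequence. The class NgramSketch, the method terms, and the underscore used to join bigrams are assumptions made for this example; they are not taken from TxtParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class NgramSketch {

    /**
     * Emits unigrams and/or bigrams for a token sequence.
     * Joining bigrams with "_" is an assumed convention.
     */
    static List<String> terms(List<String> tokens, boolean useUnigrams, boolean useBigrams) {
        List<String> out = new ArrayList<>();
        String prevWord = null; // plays the role a prevWord field might have
        for (String token : tokens) {
            if (useUnigrams) {
                out.add(token);
            }
            if (useBigrams && prevWord != null) {
                out.add(prevWord + "_" + token);
            }
            prevWord = token;
        }
        return out;
    }
}
```

With both flags set, each token after the first contributes two terms: the token itself and the bigram formed with its predecessor.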

parseDir

private java.util.Vector<SimpleDocument> parseDir(java.lang.String sourcefile,
                                                  int mindl)
Parses a directory by adding each XML file's content sequentially. Document IDs are taken from the docId tag inside the XML document.

Parameters:
sourcefile - path of the directory to parse
mindl - minimum document length
Returns:
the parsed documents
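A directory traversal of this kind can be sketched as follows. This is only an illustration under stated assumptions: the class ParseDirSketch and the method txtFiles are hypothetical, the case-insensitive ".txt" check is assumed, and sorting the entries is a choice made here to get a deterministic order, not something the source confirms.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class ParseDirSketch {

    /** Lists the regular files in dir whose names end in ".txt", in sorted order. */
    static List<File> txtFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return result;
        }
        Arrays.sort(entries); // File is Comparable; gives a stable traversal order
        for (File f : entries) {
            if (f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".txt")) {
                result.add(f);
            }
        }
        return result;
    }
}
```

Each returned file would then be handed to the per-file parsing step in turn.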

isValid

private boolean isValid(java.lang.String sourcefile)
True for txt files. Override in subclasses.

Parameters:
sourcefile - name of the file to check
Returns:
true if the file is a txt file
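A check like this is commonly done on the file extension. The sketch below assumes a case-insensitive ".txt" suffix test; the class TxtFileCheckSketch is hypothetical and the actual criterion used by TxtParser is not documented here.

```java
import java.util.Locale;

/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class TxtFileCheckSketch {

    /** Returns true if the name ends in ".txt", ignoring case (an assumed rule). */
    static boolean isValid(String sourcefile) {
        return sourcefile != null
                && sourcefile.toLowerCase(Locale.ROOT).endsWith(".txt");
    }
}
```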

parse

public java.util.Vector<SimpleDocument> parse(java.lang.String file,
                                              int mindl)
Opens the file and parses its content as one document.

Parameters:
file - the file to open and parse
mindl - minimum required document length
Returns:
the parsed documents
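To show how a minimum document length could be enforced, the following sketch reads a file as one document and discards it if it has too few tokens. Everything here is an assumption for illustration: the class MinLengthSketch, measuring length in whitespace-separated tokens, the UTF-8 encoding, and returning an empty list for documents that are too short.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class MinLengthSketch {

    /** Returns the whitespace-separated tokens, or an empty list if fewer than mindl. */
    static List<String> tokensIfLongEnough(String text, int mindl) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens.size() >= mindl ? tokens : new ArrayList<String>();
    }

    /** Reads the whole file as one document (UTF-8 assumed) and applies the length check. */
    static List<String> parseFile(String file, int mindl) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
        return tokensIfLongEnough(text, mindl);
    }
}
```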

parseString

private void parseString(java.lang.String file,
                         java.lang.String string,
                         int mindl)
Parses the string.

Parameters:
file - the file the string was read from
string - the text to parse
mindl - minimum document length

openOutfile

private void openOutfile()

writeText

private void writeText(java.lang.String file,
                       int id,
                       java.lang.String text)

closeOutfile

private void closeOutfile()

parseText

private int parseText(java.lang.String s,
                      java.util.Vector<java.lang.String> words)
Parses the given text and adds terms to the model. Stop-word filtering and stemming are applied here.

Parameters:
s - the text to parse
words - vector that receives the parsed terms
Returns:
the number of terms added to words
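The filtering described above can be sketched roughly as follows. This is not the TxtParser implementation: the class ParseTextSketch, the small stop list, and the extra useStemming argument are assumptions, and the stem method is a deliberately trivial placeholder rather than a real stemming algorithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.Vector;

/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class ParseTextSketch {

    /** Tiny example stop list; a real one would be loaded from a file. */
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    /** Placeholder stemmer: strips one trailing "s" from longer words. Not a real stemmer. */
    static String stem(String w) {
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Lower-cases and splits s, drops stop words, optionally stems, and appends to words. */
    static int parseText(String s, Vector<String> words, boolean useStemming) {
        int added = 0;
        for (String token : s.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || STOP.contains(token)) {
                continue; // skip empty tokens and stop words
            }
            words.add(useStemming ? stem(token) : token);
            added++;
        }
        return added;
    }
}
```

The return value counts only the terms that survive filtering, which matches the documented contract of returning the number of terms added to words.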

removePunct

private java.lang.String removePunct(java.lang.String s)
Removes all punctuation.

Parameters:
s - the string to clean
Returns:
the string with punctuation removed
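One common way to strip punctuation in Java is a regular expression replacement. The sketch below uses the POSIX class \p{Punct}, which covers ASCII punctuation only; whether TxtParser deletes punctuation or replaces it with whitespace is not stated, so deletion is an assumption here, as is the class name RemovePunctSketch.

```java
/** Hypothetical illustration; not part of org.knowceans.corpus.parsers. */
final class RemovePunctSketch {

    /** Deletes every ASCII punctuation character from s. */
    static String removePunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }
}
```

Note that deleting apostrophes and hyphens merges the surrounding letters, so "it's" becomes "its"; replacing punctuation with a space instead would split such words into separate tokens.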