org.knowceans.corpus.parsers.nips
Class NipsXmlReader

java.lang.Object
  extended by org.knowceans.corpus.parsers.nips.NipsXmlReader

public class NipsXmlReader
extends java.lang.Object

NipsXmlReader parses one XML file converted from NIPS PDF documents using XPDF-based pdftohtml

Author:
heinrich

Field Summary
(package private)  java.lang.String _bold
           
(package private)  java.lang.String _fontspec
           
(package private)  java.lang.String _italics
           
(package private)  java.lang.String _page
           
(package private)  java.lang.String _text
           
(package private)  java.util.regex.Pattern boldpattern
           
(package private)  java.util.HashMap<java.lang.String,java.lang.Double> fontsizes
           
(package private)  java.util.regex.Pattern fontspecpattern
           
private  boolean includerefs
          whether to include references in the text
private  boolean inrefs
           
(package private)  java.util.regex.Pattern italicspattern
           
(package private)  java.util.regex.Pattern pagepattern
           
(package private)  java.util.regex.Pattern textpattern
           
 
Constructor Summary
NipsXmlReader()
          initialise reader (e.g., compile regex patterns)
 
Method Summary
private  java.lang.String clean(java.lang.String in)
          cleans the string.
private  java.lang.String cleanUp(java.lang.String in)
           
 void extract(java.lang.String filename, NipsDocument doc)
           
private  java.util.Vector<java.lang.String> extractPage(java.lang.StringBuffer content)
          extracts the content of a page.
private  java.lang.String[] getHead(java.lang.StringBuffer content)
          extract title, authors and abstract from (the first page of a) document.
private  java.util.Vector<java.lang.String> getPages(java.lang.StringBuffer content)
           
static void main(java.lang.String[] args)
           
private  void processText(java.lang.StringBuffer content, NipsDocument doc)
           
private  java.lang.String replaceUmlauts(java.lang.String in)
          inserts umlauts. TODO: direct Unicode multi-character replacements
private  int setFonts(java.lang.StringBuffer content)
          sets the font sizes for the current page
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_page

java.lang.String _page

_fontspec

java.lang.String _fontspec

_text

java.lang.String _text

_bold

java.lang.String _bold

_italics

java.lang.String _italics

pagepattern

java.util.regex.Pattern pagepattern

fontspecpattern

java.util.regex.Pattern fontspecpattern

textpattern

java.util.regex.Pattern textpattern

boldpattern

java.util.regex.Pattern boldpattern

italicspattern

java.util.regex.Pattern italicspattern

fontsizes

java.util.HashMap<java.lang.String,java.lang.Double> fontsizes

inrefs

private boolean inrefs

includerefs

private boolean includerefs
whether to include references in the text

Constructor Detail

NipsXmlReader

public NipsXmlReader()
initialise reader (e.g., compile regex patterns)
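
The kind of initialisation described above can be sketched as follows. This is a minimal sketch, not the class's actual code: the regular expressions are assumptions based on the element names that pdftohtml's XML output uses (page, fontspec, text, b, i), and the class name PdfToHtmlPatterns and the helper countMatches are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the pattern fields of the reader; the real
// expressions compiled by NipsXmlReader are not documented on this page.
class PdfToHtmlPatterns {
    // One <page ...> ... </page> block; DOTALL lets '.' span line breaks.
    static final Pattern PAGE =
        Pattern.compile("<page\\b[^>]*>(.*?)</page>", Pattern.DOTALL);
    // <fontspec id="0" size="12" .../>: group 1 = font id, group 2 = size.
    static final Pattern FONTSPEC =
        Pattern.compile("<fontspec\\s+id=\"(\\d+)\"\\s+size=\"(-?\\d+(?:\\.\\d+)?)\"[^>]*/?>");
    // <text ... font="0">...</text>: group 1 = font id, group 2 = inner markup.
    static final Pattern TEXT =
        Pattern.compile("<text\\b[^>]*\\bfont=\"(\\d+)\"[^>]*>(.*?)</text>", Pattern.DOTALL);
    // Inline bold and italic runs inside a text element.
    static final Pattern BOLD = Pattern.compile("<b>(.*?)</b>", Pattern.DOTALL);
    static final Pattern ITALICS = Pattern.compile("<i>(.*?)</i>", Pattern.DOTALL);

    // Counts non-overlapping matches of a pattern in the given input.
    static int countMatches(Pattern p, CharSequence s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

Compiling the patterns once and reusing them avoids recompiling a regex for every page or text element.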

Method Detail

main

public static void main(java.lang.String[] args)

extract

public void extract(java.lang.String filename,
                    NipsDocument doc)
Parameters:
filename -
doc - document record (existing or will be created)

processText

private void processText(java.lang.StringBuffer content,
                         NipsDocument doc)
Parameters:
content -

getHead

private java.lang.String[] getHead(java.lang.StringBuffer content)
extract title, authors and abstract from (the first page of a) document.

Parameters:
content - the content that the heading data is extracted from. The extracted parts are stripped from it.
Returns:
title, author names (semicolon-separated), abstract.

cleanUp

private java.lang.String cleanUp(java.lang.String in)
Parameters:
in -
Returns:

setFonts

private int setFonts(java.lang.StringBuffer content)
sets the font sizes for the current page

Parameters:
content -
Returns:
end of last group
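
One way to read this contract is sketched below: every fontspec element found in the page is recorded in a map from font id to size, and the returned value is the offset just past the last match. This is an illustration under assumptions, not the original implementation: the regular expression, the class name FontSizeSketch, and the interpretation of "end of last group" as Matcher.end() are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a setFonts-style method.
class FontSizeSketch {
    // Assumed shape of a pdftohtml fontspec element: id first, then size.
    static final Pattern FONTSPEC =
        Pattern.compile("<fontspec\\s+id=\"(\\d+)\"\\s+size=\"(-?\\d+(?:\\.\\d+)?)\"[^>]*/?>");

    // Font id -> font size, mirroring the documented fontsizes field.
    final Map<String, Double> fontSizes = new HashMap<>();

    // Records all font sizes declared in the content and returns the offset
    // just past the last fontspec element (0 if none was found).
    int setFonts(CharSequence content) {
        Matcher m = FONTSPEC.matcher(content);
        int end = 0;
        while (m.find()) {
            fontSizes.put(m.group(1), Double.parseDouble(m.group(2)));
            end = m.end();
        }
        return end;
    }
}
```

Returning the end offset lets a caller continue scanning the page after the font declarations instead of re-reading them.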

replaceUmlauts

private java.lang.String replaceUmlauts(java.lang.String in)
inserts umlauts. TODO: direct Unicode multi-character replacements

Parameters:
in -
Returns:
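
The page does not say how umlauts appear in the converted text, so the following is only a sketch under a stated assumption: the diaeresis is extracted as a standalone character (U+00A8) directly before the vowel it belongs to. The class name UmlautSketch is hypothetical; java.text.Normalizer is the standard JDK class.

```java
import java.text.Normalizer;

// Hypothetical sketch; the input convention (diaeresis before the vowel)
// is an assumption, not taken from the NipsXmlReader source.
class UmlautSketch {
    static String replaceUmlauts(String in) {
        // Move the standalone diaeresis behind the vowel as a combining mark (U+0308).
        String combined = in.replaceAll("\u00A8([aouAOU])", "$1\u0308");
        // Compose vowel + combining mark into the single precomposed character.
        return Normalizer.normalize(combined, Normalizer.Form.NFC);
    }
}
```

Under that assumption, a string such as "M¨uller" would come back with the single precomposed character ü.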

clean

private java.lang.String clean(java.lang.String in)
cleans the string.

Parameters:
in -
Returns:

getPages

private java.util.Vector<java.lang.String> getPages(java.lang.StringBuffer content)
Parameters:
content -
Returns:

extractPage

private java.util.Vector<java.lang.String> extractPage(java.lang.StringBuffer content)
extracts the content of a page. Sections are recognised by boldface > 13pt

Parameters:
content -
Returns:
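
The section rule stated above (boldface larger than 13pt) can be illustrated as follows. This is a minimal sketch, assuming that each text element names its font by id, that the sizes were collected beforehand into a map, and that a heading is a text element consisting entirely of one bold run. The class name SectionSketch, the method sectionHeadings, and both regular expressions are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of heading detection on one page of pdftohtml XML.
class SectionSketch {
    // group 1 = font id, group 2 = inner markup of the text element.
    static final Pattern TEXT =
        Pattern.compile("<text\\b[^>]*\\bfont=\"(\\d+)\"[^>]*>(.*?)</text>", Pattern.DOTALL);
    // A text element whose whole content is a single bold run.
    static final Pattern BOLD_ONLY = Pattern.compile("\\s*<b>([^<]*)</b>\\s*");

    // Returns the text of every element that is bold and set larger than 13pt.
    static List<String> sectionHeadings(CharSequence page, Map<String, Double> fontSizes) {
        List<String> headings = new ArrayList<>();
        Matcher t = TEXT.matcher(page);
        while (t.find()) {
            Double size = fontSizes.get(t.group(1));
            Matcher b = BOLD_ONLY.matcher(t.group(2));
            if (size != null && size > 13.0 && b.matches()) {
                headings.add(b.group(1).trim());
            }
        }
        return headings;
    }
}
```

Text elements that are bold but small, or large but not bold, are ignored in this sketch, so a large plain title does not count as a section heading.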