net.java.sen.dictionary
Class Tokenizer

java.lang.Object
  extended by net.java.sen.dictionary.Tokenizer
Direct Known Subclasses:
JapaneseTokenizer

public abstract class Tokenizer
extends Object

A String Tokenizer

The Tokenizer uses a Dictionary to help decompose strings into potential morphemes.


Field Summary
protected  Node bosNode
          A Node representing a beginning-of-string
protected  Dictionary dictionary
          The Dictionary used to find possible morphemes
protected  Node eosNode
          A Node representing an end-of-string
protected  CToken unknownCToken
          A CToken representing an unknown morpheme
protected  String unknownPartOfSpeechDescription
          The part-of-speech code to use for unknown tokens
 
Constructor Summary
Tokenizer(Dictionary dictionary, String unknownPartOfSpeechDescription)
          Constructs a new Tokenizer that uses the specified Dictionary to find possible morphemes within a given string
 
Method Summary
 Node getBOSNode()
          Creates a unique beginning-of-string Node.
 Dictionary getDictionary()
          Returns the Dictionary used to find possible morphemes.
 Node getEOSNode()
          Creates a unique end-of-string Node.
 Node getUnknownNode(char[] surface, int start, int length, int span)
          Creates an "unknown morpheme" Node with the specified characteristics.
abstract  Node lookup(SentenceIterator iterator, char[] surface)
          Searches for possible morphemes from the given SentenceIterator.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

dictionary

protected final Dictionary dictionary
The Dictionary used to find possible morphemes


unknownCToken

protected final CToken unknownCToken
A CToken representing an unknown morpheme


bosNode

protected final Node bosNode
A Node representing a beginning-of-string


eosNode

protected final Node eosNode
A Node representing an end-of-string


unknownPartOfSpeechDescription

protected final String unknownPartOfSpeechDescription
The part-of-speech code to use for unknown tokens

Constructor Detail

Tokenizer

public Tokenizer(Dictionary dictionary,
                 String unknownPartOfSpeechDescription)
Constructs a new Tokenizer that uses the specified Dictionary to find possible morphemes within a given string

Parameters:
dictionary - The Dictionary to search within
unknownPartOfSpeechDescription - The part-of-speech code to use for unknown tokens
Method Detail

getDictionary

public Dictionary getDictionary()
Returns:
The Dictionary used to find possible morphemes

getBOSNode

public Node getBOSNode()
Creates a unique beginning-of-string Node. The Node returned by this method is freshly cloned and is not an alias of any other Node.

Returns:
A beginning-of-string Node

getEOSNode

public Node getEOSNode()
Creates a unique end-of-string Node. The Node returned by this method is freshly cloned and is not an alias of any other Node.

Returns:
An end-of-string Node
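
The "freshly cloned" guarantee of getBOSNode() and getEOSNode() can be illustrated with a minimal, self-contained sketch. It does not use the Sen classes: TemplateNode and CloneOnGetSketch are hypothetical stand-ins, and the label field is assumed purely for illustration. The sketch shows only the clone-on-get pattern, in which a caller may modify the returned Node without affecting later callers.

```java
// Hypothetical stand-in for net.java.sen.dictionary.Node (not the real class).
class TemplateNode implements Cloneable {
    String label; // assumed field, for illustration only

    TemplateNode(String label) {
        this.label = label;
    }

    @Override
    public TemplateNode clone() {
        try {
            return (TemplateNode) super.clone();
        } catch (CloneNotSupportedException e) {
            // Cannot happen: this class implements Cloneable.
            throw new AssertionError(e);
        }
    }
}

// Hypothetical stand-in for a Tokenizer that keeps one template Node
// and hands out a clone on every call.
class CloneOnGetSketch {
    private final TemplateNode bosTemplate = new TemplateNode("BOS");

    TemplateNode getBOSNode() {
        return bosTemplate.clone();
    }

    public static void main(String[] args) {
        CloneOnGetSketch tokenizer = new CloneOnGetSketch();
        TemplateNode first = tokenizer.getBOSNode();
        TemplateNode second = tokenizer.getBOSNode();

        first.label = "changed"; // mutate one copy only

        System.out.println(first != second); // true: distinct objects
        System.out.println(second.label);    // BOS: the other copy is untouched
    }
}
```

Cloning a template on each call, rather than returning a shared instance, lets each caller modify its Node without affecting any other caller.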

getUnknownNode

public Node getUnknownNode(char[] surface,
                           int start,
                           int length,
                           int span)
Creates an "unknown morpheme" Node with the specified characteristics. The Node returned by this method is freshly cloned and is not an alias of any other Node.

Parameters:
surface - The underlying surface of which the Node is part
start - The index within the surface of the first character of the Node
length - The length of the Node
span - The span of the Node
Returns:
The new "unknown morpheme" Node

lookup

public abstract Node lookup(SentenceIterator iterator,
                            char[] surface)
                     throws IOException
Searches for possible morphemes from the given SentenceIterator. The Node that is returned links through Node.rnext to a list of matches, which may be of varying lengths.

Parameters:
iterator - The iterator to search from
surface - The underlying character surface
Returns:
The head of a chain of Nodes representing the possible morphemes beginning at the position supplied by the iterator
Throws:
IOException
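
A caller of lookup typically walks the returned chain through rnext. The following self-contained sketch shows that traversal under stated assumptions: ChainNode is a hypothetical stand-in for Node, only the rnext link is taken from this page, and the start and length fields, the null-terminated chain, and the treatment of the head as the first match are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for net.java.sen.dictionary.Node (not the real class).
// Only rnext is named on this page; start and length are assumed fields.
class ChainNode {
    final int start;   // assumed: index of the candidate's first character
    final int length;  // assumed: number of characters in the candidate
    ChainNode rnext;   // link to the next candidate match

    ChainNode(int start, int length, ChainNode rnext) {
        this.start = start;
        this.length = length;
        this.rnext = rnext;
    }
}

class RnextChainSketch {
    // Collects the surface text of every candidate reachable through rnext.
    // Assumes the chain ends with a null rnext.
    static List<String> candidateSurfaces(ChainNode head, char[] surface) {
        List<String> result = new ArrayList<>();
        for (ChainNode node = head; node != null; node = node.rnext) {
            result.add(new String(surface, node.start, node.length));
        }
        return result;
    }

    public static void main(String[] args) {
        char[] surface = "nihongo".toCharArray();
        // Three candidates of varying lengths, all starting at index 0.
        ChainNode head =
            new ChainNode(0, 2, new ChainNode(0, 5, new ChainNode(0, 7, null)));
        System.out.println(candidateSurfaces(head, surface)); // [ni, nihon, nihongo]
    }
}
```

Because every candidate refers to the same underlying char[] surface by offset and length, the chain can hold matches of different lengths without copying the text until a String is actually needed.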


Copyright © 2012. All Rights Reserved.