public class SrxTextIterator extends AbstractTextIterator
1. A list of rule matchers is created from the SRX file and the language.
Each rule matcher is responsible for matching the before-break and
after-break regular expressions of one break rule.
2. Each rule matcher is matched against the text. If its rule is not found,
the rule matcher is removed from the list.
3. The rule matcher with the earliest break position in the text is selected.
4. The list of exception rules corresponding to that break rule is retrieved.
5. If none of the exception rules matches at the break position, the text
is split there and a new segment is created. In addition, all rule matchers
are moved so that they start after the end of the new segment
(which is the same as the break position of the matched rule).
6. All rule matchers whose break position lies before the break position
of the last matched rule are advanced until they pass it.
7. If no segment was found, the whole process is repeated.
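The steps above can be sketched with plain java.util.regex. This is a minimal illustration under stated assumptions, not the library's implementation: the class SegmentationSketch and its helpers rule, findBreak, matchesAt and segment are hypothetical names, each rule is compiled into a single zero-width pattern (before-break part as lookbehind, after-break part as lookahead), and one global exception list stands in for the per-rule exception lists described in step 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SegmentationSketch {

    /** Hypothetical helper: compiles a before/after pair into one zero-width pattern. */
    static Pattern rule(String before, String after) {
        return Pattern.compile("(?<=" + before + ")(?=" + after + ")");
    }

    /** Returns the first break position strictly after 'from', or -1 if the rule does not match. */
    static int findBreak(Pattern rule, CharSequence text, int from) {
        Matcher m = rule.matcher(text);
        return (from < text.length() && m.find(from + 1)) ? m.start() : -1;
    }

    /** Returns true if the rule matches exactly at position 'pos'. */
    static boolean matchesAt(Pattern rule, CharSequence text, int pos) {
        Matcher m = rule.matcher(text);
        return m.find(pos) && m.start() == pos;
    }

    static List<String> segment(CharSequence text, List<Pattern> breakRules,
                                List<Pattern> exceptionRules) {
        List<String> segments = new ArrayList<>();
        int start = 0; // start of the segment being built
        int from = 0;  // position after which break rules are searched
        while (start < text.length()) {
            // Steps 2-3: match every break rule and keep the earliest break position.
            int best = -1;
            for (Pattern r : breakRules) {
                int pos = findBreak(r, text, from);
                if (pos != -1 && (best == -1 || pos < best)) {
                    best = pos;
                }
            }
            if (best == -1) {
                // No rule matches any more: the rest of the text is the last segment.
                segments.add(text.subSequence(start, text.length()).toString());
                break;
            }
            // Steps 4-5: split only if no exception rule matches at the break position.
            boolean excepted = false;
            for (Pattern e : exceptionRules) {
                if (matchesAt(e, text, best)) {
                    excepted = true;
                    break;
                }
            }
            if (!excepted) {
                segments.add(text.subSequence(start, best).toString());
                start = best;
            }
            // Steps 6-7: continue searching after the position just examined.
            from = best;
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Pattern> breaks = Arrays.asList(rule("[.?!]", "\\s"));
        List<Pattern> exceptions = Arrays.asList(rule("Mr\\.", "\\s"));
        for (String s : segment("Mr. Smith left. He came back.", breaks, exceptions)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

In this sketch the whitespace after a break stays at the start of the following segment, because the break position sits between the before-break and after-break matches.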
In the streaming version of this algorithm, a character buffer is searched.
When the end of the buffer is reached, or the break position falls in the
margin (break position > buffer size - margin), and there is more text,
the buffer is moved forward in the text until it starts after the last found
segment. When this happens, the rule matchers are reinitialized and the text
is searched again.
The streaming version has the limitation that the read buffer must be at
least as long as any segment in the text.
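The buffer handling described above might look roughly like the following sketch. It is an illustration under assumptions, not the class's actual code: StreamingSketch, fill and the single breakRule pattern are hypothetical, exception rules are left out, a PushbackReader is used to detect whether more text remains, and java.io.UncheckedIOException stands in for the library's IORuntimeException.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamingSketch {
    private final PushbackReader reader;
    private final Pattern breakRule; // zero-width pattern marking break positions
    private final int margin;
    private final char[] buffer;
    private int length;  // number of valid characters in the buffer
    private int start;   // start of the current segment within the buffer
    private boolean eof; // true once the reader has no more text

    StreamingSketch(Reader reader, Pattern breakRule, int bufferLength, int margin) {
        this.reader = new PushbackReader(reader, 1);
        this.breakRule = breakRule;
        this.margin = margin;
        this.buffer = new char[bufferLength];
        fill();
    }

    /** Moves the buffer so it starts at the current segment, then reads more text. */
    private void fill() {
        System.arraycopy(buffer, start, buffer, 0, length - start);
        length -= start;
        start = 0;
        try {
            while (!eof && length < buffer.length) {
                int n = reader.read(buffer, length, buffer.length - length);
                if (n < 0) {
                    eof = true;
                } else {
                    length += n;
                }
            }
            if (!eof) {
                // Peek one character to learn whether more text follows the full buffer.
                int c = reader.read();
                if (c < 0) {
                    eof = true;
                } else {
                    reader.unread(c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next segment, or null when the text is exhausted. */
    String next() {
        while (true) {
            if (start >= length && eof) {
                return null;
            }
            // A fresh matcher per search plays the role of reinitializing the rule matchers.
            Matcher m = breakRule.matcher(CharBuffer.wrap(buffer, 0, length));
            int pos = (start < length && m.find(start + 1)) ? m.start() : -1;
            boolean inMargin = pos == -1 || pos > buffer.length - margin;
            if (inMargin && !eof) {
                if (start == 0) {
                    // The buffer already starts at the segment and cannot be moved further.
                    throw new IllegalStateException("Buffer too small to hold the segment");
                }
                fill();
                continue;
            }
            int end = (pos == -1) ? length : pos;
            String segment = new String(buffer, start, end - start);
            start = end;
            return segment;
        }
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile("(?<=\\.)(?=\\s)");
        StreamingSketch it = new StreamingSketch(
                new StringReader("One. Two. Three. Four."), rule, 12, 2);
        for (String s = it.next(); s != null; s = it.next()) {
            System.out.println("[" + s + "]");
        }
    }
}
```

With a buffer shorter than the longest segment, next() in this sketch throws IllegalStateException, which mirrors the limitation stated above.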
This algorithm uses lookbehind extensively, but Java does not permit
unbounded regular expressions in lookbehind, so some patterns are finitized.
For example, the pattern a* is changed to something like a{0,100}.

| Modifier and Type | Field and Description |
|---|---|
| static String | BUFFER_LENGTH_PARAMETER - Reader buffer size. |
| static int | DEFAULT_BUFFER_LENGTH - Default size of read buffer when using streaming version of this class. |
| static int | DEFAULT_MARGIN - Default margin size. |
| static int | DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH - Default max lookbehind construct length parameter. |
| static String | MARGIN_PARAMETER - Margin size. |
| static String | MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER - Maximum length of a regular expression construct that occurs in lookbehind. |
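
The maximum lookbehind construct length relates to the finitization mentioned in the class description, where a pattern such as a* becomes something like a{0,100}. The following is a naive sketch of such a rewrite, not the library's own routine: FinitizeSketch and finitize are hypothetical names, and the replacement ignores character classes, possessive or reluctant quantifiers, and escaped backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FinitizeSketch {

    /**
     * Replaces unbounded quantifiers with bounded ones:
     * "*" becomes {0,max}, "+" becomes {1,max}, and {n,} becomes {n,max}.
     * Quantifier characters directly preceded by a backslash are left alone.
     */
    static String finitize(String pattern, int max) {
        return pattern
                .replaceAll("(?<!\\\\)\\*", "{0," + max + "}")
                .replaceAll("(?<!\\\\)\\+", "{1," + max + "}")
                .replaceAll("(?<!\\\\)\\{(\\d+),\\}", "{$1," + max + "}");
    }

    public static void main(String[] args) {
        String finite = finitize("a*", 100);
        System.out.println(finite);

        // The bounded construct can be used inside a lookbehind.
        Pattern p = Pattern.compile("(?<=" + finite + ")b");
        Matcher m = p.matcher("aaab");
        System.out.println(m.find() ? "break candidate at " + m.start() : "no match");
    }
}
```

A larger bound accepts longer lookbehind matches at the cost of more matching work, which is presumably why the limit is exposed as a parameter.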
| Constructor and Description |
|---|
| SrxTextIterator(SrxDocument document, String languageCode, CharSequence text) - Creates text iterator with no additional parameters. |
| SrxTextIterator(SrxDocument document, String languageCode, CharSequence text, Map<String,Object> parameterMap) - Creates text iterator that obtains language rules from given document using given language code. |
| SrxTextIterator(SrxDocument document, String languageCode, Reader reader) - Creates streaming text iterator with no additional parameters. |
| SrxTextIterator(SrxDocument document, String languageCode, Reader reader, Map<String,Object> parameterMap) - Creates text iterator that obtains language rules from given document using given language code. |
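
The constructors that take a parameterMap accept a Map<String,Object> of additional segmentation parameters. How the class reads that map is not shown here, so the following is only a sketch of one plausible pattern, falling back to a default when a key is absent. ParameterMapSketch, intParameter, the key string "margin" and the default value 128 are all hypothetical and are not the values of the real constants.

```java
import java.util.HashMap;
import java.util.Map;

final class ParameterMapSketch {

    // Hypothetical stand-ins for a *_PARAMETER key and its DEFAULT_* value.
    static final String MARGIN_KEY = "margin";
    static final int DEFAULT_MARGIN_VALUE = 128;

    /** Returns the Integer stored under 'key', or 'defaultValue' if the key is absent. */
    static int intParameter(Map<String, Object> parameterMap, String key, int defaultValue) {
        Object value = parameterMap.get(key);
        if (value == null) {
            return defaultValue;
        }
        if (value instanceof Integer) {
            return (Integer) value;
        }
        throw new IllegalArgumentException("Parameter " + key + " must be an Integer");
    }

    public static void main(String[] args) {
        Map<String, Object> parameterMap = new HashMap<>();
        System.out.println(intParameter(parameterMap, MARGIN_KEY, DEFAULT_MARGIN_VALUE));

        parameterMap.put(MARGIN_KEY, 256);
        System.out.println(intParameter(parameterMap, MARGIN_KEY, DEFAULT_MARGIN_VALUE));
    }
}
```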
| Modifier and Type | Method and Description |
|---|---|
| boolean | hasNext() |
| String | next() - Finds the next segment in the text and returns it. |
Methods inherited from class AbstractTextIterator: remove, toString
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Iterator: forEachRemaining

public static final String MARGIN_PARAMETER
public static final String BUFFER_LENGTH_PARAMETER
public static final String MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER
public static final int DEFAULT_MARGIN
public static final int DEFAULT_BUFFER_LENGTH
public static final int DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH

public SrxTextIterator(SrxDocument document, String languageCode, CharSequence text, Map<String,Object> parameterMap)
Supported parameters: MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
text -
parameterMap - additional segmentation parameters

public SrxTextIterator(SrxDocument document, String languageCode, CharSequence text)
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
text -
See Also: SrxTextIterator(SrxDocument, String, CharSequence, Map)

public SrxTextIterator(SrxDocument document, String languageCode, Reader reader, Map<String,Object> parameterMap)
Supported parameters: BUFFER_LENGTH_PARAMETER, MARGIN_PARAMETER, MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
reader - reader from which the text is read
parameterMap - additional segmentation parameters

public SrxTextIterator(SrxDocument document, String languageCode, Reader reader)
Parameters:
document - SRX document
languageCode - language code of the text, used to retrieve the rules
reader - reader from which the text is read
See Also: SrxTextIterator(SrxDocument, String, Reader, Map)

public String next()
Throws:
IllegalStateException - if the buffer is too small to hold the segment
IORuntimeException - if an IO error occurs when reading the text

public boolean hasNext()
Copyright © 2018. All Rights Reserved.