public class SRXSegmenter extends Object implements ISegmenter
Implements the ISegmenter interface for SRX rules.

| Constructor | Description |
|---|---|
| SRXSegmenter() | Creates a new SRXSegmenter object. |

| Modifier and Type | Method | Description |
|---|---|---|
| protected void | addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule) | Adds a compiled rule to this segmenter. |
| boolean | cascade() | Indicates if cascading must be applied when selecting the rules for a given language pattern. |
| int | computeSegments(String text) | Calculates the segmentation of a given plain text string. |
| int | computeSegments(TextContainer container) | Calculates the segmentation of a given TextContainer object. |
| LocaleId | getLanguage() | Gets the language used to apply the rules. |
| Range | getNextSegmentRange(TextContainer container) | Computes the range of the next segment for a given TextContainer object. |
| List<Range> | getRanges() | Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer). |
| List<Integer> | getSplitPositions() | Gets the list of all the split positions in the text that was last segmented. |
| boolean | includeEndCodes() | Indicates if end codes should be included (see SRX implementation notes). |
| boolean | includeIsolatedCodes() | Indicates if isolated codes should be included (see SRX implementation notes). |
| boolean | includeStartCodes() | Indicates if start codes should be included (see SRX implementation notes). |
| boolean | oneSegmentIncludesAll() | Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right). |
| void | reset() | Resets the options to their defaults and clears the compiled rules. |
| boolean | segmentSubFlows() | Indicates if sub-flows must be segmented. |
| protected void | setCascade(boolean value) | Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern. |
| void | setIncludeEndCodes(boolean includeEndCodes) | |
| void | setIncludeIsolatedCodes(boolean includeIsolatedCodes) | |
| void | setIncludeStartCodes(boolean includeStartCodes) | |
| void | setLanguage(LocaleId languageCode) | Sets the locale used to apply the rules. |
| protected void | setMaskRule(String pattern) | Sets the pattern for the mask rule. |
| void | setOneSegmentIncludesAll(boolean oneSegmentIncludesAll) | |
| void | setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS) | Sets the options for this segmenter. |
| void | setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace) | Sets the options for this segmenter. |
| void | setSegmentSubFlows(boolean segmentSubFlows) | |
| void | setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace) | |
| void | setTrimCodes(boolean trimCodes) | |
| void | setTrimLeadingWS(boolean trimLeadingWS) | |
| void | setTrimTrailingWS(boolean trimTrailingWS) | |
| void | setUseJavaRegex(boolean useJavaRegex) | Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU). |
| boolean | treatIsolatedCodesAsWhitespace() | Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation. |
| boolean | trimLeadingWhitespaces() | Indicates if leading white-spaces should be left outside the segments. |
| boolean | trimTrailingWhitespaces() | Indicates if trailing white-spaces should be left outside the segments. |
| boolean | useJavaRegex() | Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU). |

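The method summary implies a call order: compute the segmentation first, then read the ranges (getRanges() is documented as returning null until ranges have been defined). The sketch below illustrates only that call order with a hypothetical stand-in class. WorkflowSketch, Span, and the single break pattern are assumptions made for illustration; they are not part of the Okapi API and do not reproduce how SRXSegmenter applies SRX rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in used only to show the call order; NOT the Okapi SRXSegmenter. */
final class WorkflowSketch {

    /** Assumed half-open [start, end) character range, standing in for Okapi's Range. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    // One illustrative break rule (an assumption): break on whitespace that follows '.', '!' or '?'.
    private static final Pattern BREAK = Pattern.compile("(?<=[.!?])\\s+");

    private List<Span> ranges; // stays null until computeSegments is called

    /** Returns the number of segments found, mirroring the int return type documented above. */
    int computeSegments(String text) {
        ranges = new ArrayList<>();
        int start = 0;
        Matcher m = BREAK.matcher(text);
        while (m.find()) {
            ranges.add(new Span(start, m.start()));
            start = m.end();
        }
        if (start < text.length()) {
            ranges.add(new Span(start, text.length()));
        }
        return ranges.size();
    }

    /** Returns the ranges from the last computeSegments call, or null if none was made. */
    List<Span> getRanges() {
        return ranges;
    }

    public static void main(String[] args) {
        WorkflowSketch seg = new WorkflowSketch();
        System.out.println(seg.getRanges()); // prints "null": nothing computed yet

        String text = "First sentence. Second one! Third?";
        int count = seg.computeSegments(text);
        System.out.println(count); // prints "3"
        for (Span s : seg.getRanges()) {
            System.out.println(text.substring(s.start, s.end));
        }
    }
}
```

With the real class, the same order would presumably apply: set the options and language, call one of the computeSegments overloads, then read getRanges() or getSplitPositions().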
public void reset()
Specified by: reset in interface ISegmenter

public void setOptions(boolean segmentSubFlows,
boolean includeStartCodes,
boolean includeEndCodes,
boolean includeIsolatedCodes,
boolean oneSegmentIncludesAll,
boolean trimLeadingWS,
boolean trimTrailingWS,
boolean useJavaRegex,
boolean useIcu4JBreakRules,
boolean treatIsolatedCodesAsWhitespace)
Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
useJavaRegex - true if the rules are for the Java regular expression engine, false if they are for ICU.
treatIsolatedCodesAsWhitespace - if true, the isolated code markers in codedText are converted to spaces so that they don't get in the way of the rules. If false, the codes are simply removed.

public void setOptions(boolean segmentSubFlows,
boolean includeStartCodes,
boolean includeEndCodes,
boolean includeIsolatedCodes,
boolean oneSegmentIncludesAll,
boolean trimLeadingWS,
boolean trimTrailingWS)
Specified by: setOptions in interface ISegmenter
Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.

public boolean oneSegmentIncludesAll()
Specified by: oneSegmentIncludesAll in interface ISegmenter

public boolean segmentSubFlows()
Specified by: segmentSubFlows in interface ISegmenter

public boolean cascade()
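This page does not spell out what cascading changes. As a rough illustration, and assuming the SRX 2.0 behavior in which every language map whose pattern matches the language code contributes its rules when cascading is on, while only the first matching map is used when it is off, rule selection could be sketched as follows. CascadeSketch, select, and the map contents are hypothetical names for this example and are not Okapi classes or methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of cascading rule selection; NOT Okapi code. */
final class CascadeSketch {

    /**
     * Returns the names of the rule sets to apply for a language code.
     * languageMaps maps a language pattern (a regular expression) to a rule-set name,
     * in declaration order.
     */
    static List<String> select(Map<String, String> languageMaps,
                               String languageCode,
                               boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : languageMaps.entrySet()) {
            if (Pattern.matches(entry.getKey(), languageCode)) {
                selected.add(entry.getValue());
                if (!cascade) {
                    break; // without cascading, stop at the first matching map
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the declaration order of the language maps.
        Map<String, String> maps = new LinkedHashMap<>();
        maps.put("[Ee][Nn].*", "English");
        maps.put(".*", "Default");

        System.out.println(select(maps, "en-US", true));  // prints "[English, Default]"
        System.out.println(select(maps, "en-US", false)); // prints "[English]"
        System.out.println(select(maps, "fr", false));    // prints "[Default]"
    }
}
```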

public boolean trimLeadingWhitespaces()
Specified by: trimLeadingWhitespaces in interface ISegmenter

public boolean trimTrailingWhitespaces()
Specified by: trimTrailingWhitespaces in interface ISegmenter

public boolean useJavaRegex()

public boolean treatIsolatedCodesAsWhitespace()
Specified by: treatIsolatedCodesAsWhitespace in interface ISegmenter

public void setUseJavaRegex(boolean useJavaRegex)
Parameters:
useJavaRegex - true if the rules should be treated as Java regular expressions, false for ICU.

public boolean includeStartCodes()
Specified by: includeStartCodes in interface ISegmenter

public boolean includeEndCodes()
Specified by: includeEndCodes in interface ISegmenter

public boolean includeIsolatedCodes()
Specified by: includeIsolatedCodes in interface ISegmenter

public int computeSegments(String text)
Specified by: computeSegments in interface ISegmenter
Parameters:
text - plain text to segment.

public int computeSegments(TextContainer container)
Specified by: computeSegments in interface ISegmenter
Parameters:
container - the object to segment.

public Range getNextSegmentRange(TextContainer container)
Specified by: getNextSegmentRange in interface ISegmenter
Parameters:
container - the text container in which to look for the next segment.

public List<Integer> getSplitPositions()
Gets the list of all the split positions in the text that was last segmented. Call ISegmenter.computeSegments(TextContainer) or ISegmenter.computeSegments(String) before calling this method.
A split position is the first character position of a new segment.
IMPORTANT: The positions returned here do not take into account any options for trimming leading and trailing white-spaces.
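Because split positions ignore the trimming options, a caller who wants trimmed segment boundaries has to apply the trimming separately (or read getRanges() instead). The following is a minimal sketch of that relationship, not the Okapi implementation. SplitPositionSketch and toRanges are hypothetical names, ranges are modeled as half-open {start, end} pairs, and dropping segments that become empty after trimming is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper relating split positions to trimmed ranges; NOT Okapi code. */
final class SplitPositionSketch {

    /**
     * Converts split positions (first character index of each new segment) into
     * half-open {start, end} ranges, optionally trimming white-space at either side.
     */
    static List<int[]> toRanges(String text, List<Integer> splitPositions,
                                boolean trimLeadingWS, boolean trimTrailingWS) {
        List<Integer> bounds = new ArrayList<>(splitPositions);
        bounds.add(text.length()); // the last segment ends at the end of the text

        List<int[]> ranges = new ArrayList<>();
        int segmentStart = 0;
        for (int segmentEnd : bounds) {
            int start = segmentStart;
            int end = segmentEnd;
            if (trimLeadingWS) {
                while (start < end && Character.isWhitespace(text.charAt(start))) {
                    start++;
                }
            }
            if (trimTrailingWS) {
                while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                    end--;
                }
            }
            if (start < end) { // assumption: segments left empty by trimming are dropped
                ranges.add(new int[] {start, end});
            }
            segmentStart = segmentEnd;
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "One. Two. ";
        List<Integer> splits = Arrays.asList(4); // the second segment starts at the space after "One."

        for (int[] r : toRanges(text, splits, false, false)) {
            System.out.println(Arrays.toString(r)); // prints "[0, 4]" then "[4, 10]"
        }
        for (int[] r : toRanges(text, splits, true, true)) {
            System.out.println(Arrays.toString(r)); // prints "[0, 4]" then "[5, 9]"
        }
    }
}
```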
Specified by: getSplitPositions in interface ISegmenter

public List<Range> getRanges()
Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
Specified by: getRanges in interface ISegmenter
Returns: Range objects, where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.

public LocaleId getLanguage()
Specified by: getLanguage in interface ISegmenter

public void setLanguage(LocaleId languageCode)
Specified by: setLanguage in interface ISegmenter
Parameters:
languageCode - code of the language to use to apply the rules.

protected void setCascade(boolean value)
Parameters:
value - true if cascading must be applied, false otherwise.

protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
Parameters:
compiledRule - the compiled rule to add.

protected void setMaskRule(String pattern)
Parameters:
pattern - the new pattern to use for the mask rule.

public void setSegmentSubFlows(boolean segmentSubFlows)
Specified by: setSegmentSubFlows in interface ISegmenter

public void setIncludeStartCodes(boolean includeStartCodes)
Specified by: setIncludeStartCodes in interface ISegmenter

public void setIncludeEndCodes(boolean includeEndCodes)
Specified by: setIncludeEndCodes in interface ISegmenter

public void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
Specified by: setIncludeIsolatedCodes in interface ISegmenter

public void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
Specified by: setOneSegmentIncludesAll in interface ISegmenter

public void setTrimLeadingWS(boolean trimLeadingWS)
Specified by: setTrimLeadingWS in interface ISegmenter

public void setTrimTrailingWS(boolean trimTrailingWS)
Specified by: setTrimTrailingWS in interface ISegmenter

public void setTrimCodes(boolean trimCodes)
Specified by: setTrimCodes in interface ISegmenter

public void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
Specified by: setTreatIsolatedCodesAsWhitespace in interface ISegmenter

Copyright © 2021. All rights reserved.