public class SRXDocument extends Object

| Modifier and Type | Field and Description |
|---|---|
| static String | ANYCODE: Marker for INLINECODE_PATTERN in the given pattern. |
| static String | DEFAULT_SRX_RULES |
| static String | INLINECODE_PATTERN: Represents the pattern for an inline code (both special characters). |
| static String | NOAUTO: Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES. |

| Constructor and Description |
|---|
| SRXDocument(): Creates an empty SRX document. |

| Modifier and Type | Method and Description |
|---|---|
| void | addLanguageMap(LanguageMap langMap): Adds a language map to this document. |
| void | addLanguageRule(String name, ArrayList<Rule> langRule): Adds a language rule to this SRX document. |
| boolean | cascade(): Indicates if cascading must be applied when selecting the rules for a given language pattern. |
| ISegmenter | compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter): Compiles all the language rules applicable for a given language code, and assigns them to a segmenter. |
| ISegmenter | compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter): Compiles a single language rule group and assigns it to a segmenter. |
| String | generateRuleRegex(Rule rule) |
| LinkedHashMap<String,ArrayList<Rule>> | getAllLanguageRules(): Gets a map of all the language rules in this document. |
| ArrayList<LanguageMap> | getAllLanguagesMaps(): Gets the list of all the language maps in this document. |
| String | getComments(): Gets the comments associated with this document. |
| String | getHeaderComments(): Gets the comments associated with the header of this document. |
| ArrayList<Rule> | getLanguageRules(String ruleName): Gets the list of rules for a given <languagerule> element. |
| String | getMaskRule(): Gets the current pattern of the mask rule. |
| String | getSampleLanguage(): Gets the current sample language code. |
| String | getSampleText(): Gets the current sample text. |
| String | getVersion(): Gets the version of this SRX document. |
| String | getWarning(): Gets the last warning that was issued while loading a document. |
| boolean | hasWarning(): Indicates if a warning was issued the last time a document was read. |
| boolean | includeEndCodes(): Indicates if end codes should be included (see SRX implementation notes). |
| boolean | includeIsolatedCodes(): Indicates if isolated codes should be included (see SRX implementation notes). |
| boolean | includeStartCodes(): Indicates if start codes should be included (see SRX implementation notes). |
| boolean | isModified(): Indicates if the document has been modified since the last load or save. |
| void | loadRules(CharSequence data): Loads an SRX document from a CharSequence object. |
| void | loadRules(InputStream inputStream): Loads an SRX document from an input stream. |
| void | loadRules(String pathOrURL): Loads an SRX document from a file. |
| boolean | oneSegmentIncludesAll(): Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right). |
| void | resetAll(): Resets the document to its default empty initial state. |
| void | saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo): Saves the current rules to an SRX rules document. |
| String | saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo): Saves the current rules to an SRX string. |
| boolean | segmentSubFlows(): Indicates if sub-flows must be segmented. |
| void | setCascade(boolean value): Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern. |
| void | setComments(String text): Sets the comments for this document. |
| void | setHeaderComments(String text): Sets the comments for the header of this document. |
| void | setIncludeEndCodes(boolean value): Sets the indicator that tells if end codes should be included or not. |
| void | setIncludeIsolatedCodes(boolean value): Sets the indicator that tells if isolated codes should be included or not. |
| void | setIncludeStartCodes(boolean value): Sets the indicator that tells if start codes should be included or not. |
| void | setMaskRule(String pattern): Sets the pattern for the mask rule. |
| void | setModified(boolean value): Sets the flag indicating if the document has been modified since the last load or save. |
| void | setOneSegmentIncludesAll(boolean value): Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right). |
| void | setSampleLanguage(String value): Sets the sample language code. |
| void | setSampleText(String value): Sets the sample text. |
| void | setSegmentSubFlows(boolean value): Sets the flag indicating if sub-flows must be segmented. |
| void | setTestOnSelectedGroup(boolean value): Sets the indicator on how to apply rules for samples. |
| void | setTreatIsolatedCodesAsWhitespace(boolean value): Sets the indicator if this document should treat isolated codes as whitespace when matching SRX rules. |
| void | setTrimLeadingWhitespaces(boolean value): Sets the indicator that tells if leading white-spaces should be left outside the segments. |
| void | setTrimTrailingWhitespaces(boolean value): Sets the indicator that tells if trailing white-spaces should be left outside the segments. |
| void | setUseICU4JBreakRules(boolean value): Sets the indicator that tells if this document uses ICU4J BreakIterator rules. |
| void | setUseJavaRegex(boolean value): Deprecated. |
| boolean | testOnSelectedGroup(): Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules. |
| boolean | treatIsolatedCodesAsWhitespace(): Indicates if this document should treat isolated codes as whitespace when matching SRX rules. |
| boolean | trimLeadingWhitespaces(): Indicates if leading white-spaces should be left outside the segments. |
| boolean | trimTrailingWhitespaces(): Indicates if trailing white-spaces should be left outside the segments. |
| boolean | useIcu4JBreakRules(): Indicates if this document uses ICU4J break rules. |
| boolean | useJavaRegex(): Deprecated. |
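
The cascade flag described for cascade() and setCascade(boolean) controls how rule groups are picked for a language code. The following is a minimal, self-contained sketch of that selection logic. It is not the Okapi implementation: the class CascadeSketch, the nested LangMap type, and selectRuleGroups are hypothetical names used only for illustration, and the sketch assumes that each language map pairs a regular expression on the language code with a rule-group name, tested in document order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CascadeSketch {

    /** Hypothetical stand-in for a language map: a regex on the language code plus a rule-group name. */
    static final class LangMap {
        final String pattern;
        final String ruleName;

        LangMap(String pattern, String ruleName) {
            this.pattern = pattern;
            this.ruleName = ruleName;
        }
    }

    /**
     * Returns the names of the rule groups selected for a language code.
     * Maps are tested in list order. With cascading, every matching map
     * contributes its rule group; without it, only the first match is used.
     */
    static List<String> selectRuleGroups(List<LangMap> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LangMap map : maps) {
            if (Pattern.matches(map.pattern, languageCode)) {
                selected.add(map.ruleName);
                if (!cascade) {
                    break; // no cascading: stop at the first matching map
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LangMap> maps = List.of(
                new LangMap("[Ff][Rr].*", "French"),
                new LangMap(".*", "Default"));

        System.out.println(selectRuleGroups(maps, "fr-ca", true));  // prints [French, Default]
        System.out.println(selectRuleGroups(maps, "fr-ca", false)); // prints [French]
        System.out.println(selectRuleGroups(maps, "de", true));     // prints [Default]
    }
}
```

In this model, the order of the language maps matters in both modes: it decides which single group wins when cascading is off, and the order in which groups are combined when it is on.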
public static final String DEFAULT_SRX_RULES
public static final String INLINECODE_PATTERN
public static final String ANYCODE
public static final String NOAUTO
public String getVersion()
public boolean hasWarning()
public String getWarning()
public String getHeaderComments()
public void setHeaderComments(String text)
text - the new comments; use null or an empty string to remove the comments.
public String getComments()
public void setComments(String text)
text - the new comments; use null or an empty string to remove the comments.
public void resetAll()
public LinkedHashMap<String,ArrayList<Rule>> getAllLanguageRules()
public ArrayList<Rule> getLanguageRules(String ruleName)
ruleName - the name of the <languagerule> element to query.
public ArrayList<LanguageMap> getAllLanguagesMaps()
public boolean segmentSubFlows()
public void setSegmentSubFlows(boolean value)
value - true if sub-flows must be segmented, false otherwise.
public boolean cascade()
public void setCascade(boolean value)
value - true if cascading must be applied, false otherwise.
public boolean oneSegmentIncludesAll()
public void setOneSegmentIncludesAll(boolean value)
value - true if a text with a single segment should include the whole text.
@Deprecated public boolean useJavaRegex()
@Deprecated public void setUseJavaRegex(boolean value)
value - true if the rules should be treated as Java regular expressions; false logs an error, because the ICU4J regex engine has been removed.
public boolean useIcu4JBreakRules()
public void setUseICU4JBreakRules(boolean value)
BreakIterator break positions are converted to SRX-like rules and used as default rules for all languages.
value - true if ICU4J rules should be used as defaults, false if no ICU4J rules should be used.
public boolean treatIsolatedCodesAsWhitespace()
public void setTreatIsolatedCodesAsWhitespace(boolean value)
value - true if isolated codes should be treated as whitespace.
public boolean trimLeadingWhitespaces()
public void setTrimLeadingWhitespaces(boolean value)
value - true if the leading white-spaces should be trimmed.
public boolean trimTrailingWhitespaces()
public void setTrimTrailingWhitespaces(boolean value)
value - true if the trailing white-spaces should be trimmed.
public boolean includeStartCodes()
public void setIncludeStartCodes(boolean value)
value - true if start codes should be included, false otherwise.
public boolean includeEndCodes()
public void setIncludeEndCodes(boolean value)
value - true if end codes should be included, false otherwise.
public boolean includeIsolatedCodes()
public void setIncludeIsolatedCodes(boolean value)
value - true if isolated codes should be included, false otherwise.
public String getMaskRule()
public void setMaskRule(String pattern)
pattern - the new pattern to use for the mask rule.
public String getSampleText()
public void setSampleText(String value)
value - the new sample text.
public String getSampleLanguage()
public void setSampleLanguage(String value)
value - the new sample language code.
public boolean testOnSelectedGroup()
public void setTestOnSelectedGroup(boolean value)
value - true to test using only a selected group of rules, false to test using all the rules matching a given language.
public boolean isModified()
public void setModified(boolean value)
value - true if the document has been changed, false otherwise.
public void addLanguageRule(String name, ArrayList<Rule> langRule)
name - name of the language rule to add.
langRule - language rule object to add.
public void addLanguageMap(LanguageMap langMap)
langMap - the language map object to add.
public ISegmenter compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter)
cascade() is true.
languageCode - the language code. The value should be a BCP-47 value (e.g. "de", "fr-ca").
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
public ISegmenter compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter)
ruleName - the name of the rule group to apply.
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
public void loadRules(CharSequence data)
data - the string containing the SRX document to load.
public void loadRules(String pathOrURL)
For SRXDocument.DEFAULT_SRX_RULES (the string "DEFAULT_SRX_RULES" in serialized parameters), this will load the (Okapi recommended) .srx file embedded in the library jar.
pathOrURL - the full path or URL of the document to load.
public void loadRules(InputStream inputStream)
inputStream - the input stream to read from.
public String saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo)
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
public void saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo)
rulesPath - the full path of the file where to save the rules.
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
Copyright © 2021. All rights reserved.