public abstract class AbstractMarkupFilter extends AbstractFilter
Abstract base class for building an IFilter around the Jericho parser. Jericho can parse non-well-formed
HTML, XHTML, XML and various server-side scripting languages such as PHP, Mason and Perl (all configurable from
Jericho). AbstractMarkupFilter takes care of the parser initialization and provides default handlers for each token
type returned by the parser.
Handling of translatable text, inline tags, and translatable and read-only attributes is configurable through a user-defined YAML file. See the Okapi HtmlFilter (with defaultConfiguration.yml) and OpenXml filters for examples.
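
The "default handler per token type" design described above can be pictured with a small, self-contained sketch. It does not use the Okapi or Jericho classes: `SketchToken`, `SketchTokenType`, `SketchMarkupFilter` and `CommentDroppingFilter` are hypothetical stand-ins, and the event strings are invented for illustration. The base class owns the dispatch loop and a subclass overrides only the handlers it needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical token kinds; the real parser distinguishes many more.
enum SketchTokenType { START_TAG, END_TAG, TEXT, COMMENT }

// Hypothetical stand-in for a parser token (not a Jericho type).
final class SketchToken {
    final SketchTokenType type;
    final String content;

    SketchToken(SketchTokenType type, String content) {
        this.type = type;
        this.content = content;
    }
}

// Base class: dispatches each token to a handler that has a default implementation.
abstract class SketchMarkupFilter {
    protected final List<String> events = new ArrayList<>();

    final List<String> process(List<SketchToken> tokens) {
        events.clear();
        for (SketchToken t : tokens) {
            switch (t.type) {
                case START_TAG: handleStartTag(t); break;
                case END_TAG:   handleEndTag(t);   break;
                case TEXT:      handleText(t);     break;
                case COMMENT:   handleComment(t);  break;
            }
        }
        return new ArrayList<>(events);
    }

    // Default handlers (assumed behavior, for illustration only).
    protected void handleStartTag(SketchToken t) { events.add("skeleton:" + t.content); }
    protected void handleEndTag(SketchToken t)   { events.add("skeleton:" + t.content); }
    protected void handleText(SketchToken t)     { events.add("text:" + t.content); }
    protected void handleComment(SketchToken t)  { events.add("skeleton:" + t.content); }
}

// A subclass changes one handler and inherits the rest.
class CommentDroppingFilter extends SketchMarkupFilter {
    @Override
    protected void handleComment(SketchToken t) {
        // Intentionally emit nothing for comments.
    }
}

class HandlerDispatchDemo {
    public static void main(String[] args) {
        List<SketchToken> tokens = Arrays.asList(
                new SketchToken(SketchTokenType.START_TAG, "<p>"),
                new SketchToken(SketchTokenType.TEXT, "Hi"),
                new SketchToken(SketchTokenType.COMMENT, "<!--x-->"),
                new SketchToken(SketchTokenType.END_TAG, "</p>"));
        System.out.println(new CommentDroppingFilter().process(tokens));
    }
}
```

A real subclass of AbstractMarkupFilter would override the protected handle... methods listed in the method summary instead of these stand-ins.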
Inherited fields: SUB_FILTER

| Constructor and Description |
|---|
| AbstractMarkupFilter(): Default constructor for AbstractMarkupFilter using the default AbstractMarkupEventBuilder. |
| AbstractMarkupFilter(AbstractMarkupEventBuilder eventBuilder): Constructor for AbstractMarkupFilter using the specified AbstractMarkupEventBuilder. |

| Modifier and Type | Method and Description |
|---|---|
| protected void | addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag): Add a Code to the current TextUnit. |
| protected void | addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag, boolean endCodeNow): Add a Code to the current TextUnit. |
| protected void | addFilterEvent(Event event) |
| protected void | addToDocumentPart(String part) |
| protected void | addToTextUnit(Code code) |
| protected void | addToTextUnit(Code code, boolean endCodeNow) |
| protected void | addToTextUnit(Code code, boolean endCodeNow, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders) |
| protected void | addToTextUnit(Code code, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders) |
| protected void | addToTextUnit(String text) |
| protected void | appendToFirstSkeletonPart(String text) |
| protected boolean | canStartNewTextUnit() |
| void | close(): Close the filter and all used resources. |
| protected AbstractMarkupEventBuilder | createEventBuilder(): Delayed initialization of the EventBuilder. |
| protected PropertyTextUnitPlaceholder | createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute) |
| protected List<PropertyTextUnitPlaceholder> | createPropertyTextUnitPlaceholders(net.htmlparser.jericho.StartTag startTag): For the given Jericho StartTag, parse out all the actionable attributes and store them as PropertyTextUnitPlaceholder. |
| protected String | detectEncoding(RawDocument input) |
| protected TextFragment.TagType | determineTagType(net.htmlparser.jericho.Tag tag): Filter-specific method for determining TextFragment.TagType. |
| protected void | endDocumentPart() |
| protected void | endFilter(): End the current filter processing and send the Ending Event. |
| protected void | endGroup(GenericSkeleton endMarker) |
| protected void | endTextUnit(GenericSkeleton endMarker) |
| StringBuilder | getBufferedWhiteSpace() |
| protected abstract TaggedFilterConfiguration | getConfig(): Get the current TaggedFilterConfiguration. |
| protected String | getCurrentDocName() |
| protected long | getDocumentPartId() |
| AbstractMarkupEventBuilder | getEventBuilder() |
| protected long | getGroupIdSequence() |
| protected net.htmlparser.jericho.Source | getParsedHeader(InputStream inputStream) |
| protected ExtractionRuleState | getRuleState() |
| protected long | getTextUnitId() |
| protected void | handleCdataSection(net.htmlparser.jericho.Tag tag): Handle CDATA sections. |
| protected void | handleCharacterEntity(net.htmlparser.jericho.CharacterEntityReference entity): Handle all character entities. |
| protected void | handleComment(net.htmlparser.jericho.Tag tag): Handle comments. |
| protected void | handleDocTypeDeclaration(net.htmlparser.jericho.Tag tag): Handle the XML doc type declaration (DTD). |
| protected void | handleDocumentPart(net.htmlparser.jericho.Tag tag): Handle anything else not classified by Jericho. |
| protected void | handleEndTag(net.htmlparser.jericho.EndTag endTag): Handle end tags, including empty tags. |
| protected void | handleMarkupDeclaration(net.htmlparser.jericho.Tag tag): Handle an XML markup declaration. |
| protected void | handleNumericEntity(net.htmlparser.jericho.NumericCharacterReference entity): Handle all numeric entities. |
| protected void | handleProcessingInstruction(net.htmlparser.jericho.Tag tag): Handle processing instructions. |
| protected void | handleServerCommon(net.htmlparser.jericho.Tag tag): Handle any recognized server tags (e.g., PHP, Mason). |
| protected void | handleServerCommonEscaped(net.htmlparser.jericho.Tag tag): Handle any recognized escaped server tags. |
| protected void | handleStartTag(net.htmlparser.jericho.StartTag startTag): Handle start tags. |
| protected void | handleText(CharSequence text): Handle all text (PCDATA). |
| protected void | handleXmlDeclaration(net.htmlparser.jericho.Tag tag): Handle an XML declaration. |
| boolean | hasNext(): Indicates if there is an event to process. |
| protected boolean | isBOM(): Does the input have a BOM? |
| protected boolean | isDocumentEncoding(): Does this document have a document encoding specified? |
| protected boolean | isInsideTextRun() |
| protected boolean | isPreserveWhitespace() |
| protected boolean | isUtf8Bom(): Does the input have a UTF-8 Byte Order Mark? |
| protected boolean | isUtf8Encoding(): Is the input encoded as UTF-8? |
| protected boolean | isWhiteSpace(CharSequence text) |
| Event | next(): Queue up Jericho tokens until we can build an Okapi Event and return it. |
| protected abstract String | normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag): Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language". |
| void | open(RawDocument input): Start a new IFilter using the supplied RawDocument. |
| void | open(RawDocument input, boolean generateSkeleton): Start a new IFilter using the supplied RawDocument. |
| protected Event | peekTempEvent() |
| protected Event | popTempEvent() |
| protected void | postProcessTextUnit(ITextUnit textUnit) |
| protected void | preProcess(net.htmlparser.jericho.Segment segment): Do any handling needed before the current Segment is processed. |
| protected void | setCurrentDocName(String currentDocName) |
| protected void | setDocumentPartId(long id) |
| protected void | setGroupIdSequence(long id) |
| void | setMimeType(String mimeType): Sets the input document mime type. |
| protected void | setPreserveWhitespace(boolean preserveWhitespace) |
| protected void | setTextUnitId(long id) |
| protected void | setTextUnitMimeType(String mimeType) |
| protected void | setTextUnitName(String name) |
| protected void | setTextUnitPreserveWhitespace(boolean preserveWhitespace) |
| protected void | setTextUnitTranslatable(boolean translatable) |
| protected void | setTextUnitType(String type) |
| protected void | startDocumentPart(String part, String name, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders) |
| protected void | startFilter(): Initialize the filter for every input and send the StartDocument Event. |
| protected void | startGroup(GenericSkeleton startMarker, String commonTagType) |
| protected void | startGroup(GenericSkeleton startMarker, String commonTagType, LocaleId locale, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders) |
| protected void | startTextUnit() |
| protected void | startTextUnit(GenericSkeleton startMarker) |
| protected void | startTextUnit(GenericSkeleton startMarker, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders) |
| protected void | startTextUnit(String text) |
| protected TaggedFilterConfiguration.RULE_TYPE | updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag) |
| protected void | updateStartTagRuleState(String tag, TaggedFilterConfiguration.RULE_TYPE ruleType, String idValue) |
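
The normalizeAttributeName entry in the table above says that some attribute names are mapped to Okapi standard names, giving charset to "encoding" and lang to "language" as examples. A minimal sketch of that kind of mapping follows. It is not the Okapi implementation: `AttributeNameSketch` and its one-argument `normalize` helper are hypothetical, the case-insensitive match is an assumption, and the real method also receives the attribute value and the Jericho Tag.

```java
// Hypothetical helper illustrating attribute-name normalization.
final class AttributeNameSketch {

    private AttributeNameSketch() {
    }

    // Maps the two names mentioned in the documentation; everything else passes through.
    static String normalize(String attrName) {
        if ("charset".equalsIgnoreCase(attrName)) { // case-insensitivity is an assumption
            return "encoding";
        }
        if ("lang".equalsIgnoreCase(attrName)) {
            return "language";
        }
        return attrName;
    }

    public static void main(String[] args) {
        System.out.println(normalize("charset"));
        System.out.println(normalize("lang"));
        System.out.println(normalize("href"));
    }
}
```

A concrete filter would presumably apply further format-specific rules inside its own normalizeAttributeName override; those rules are not described on this page.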
Methods inherited from class AbstractFilter: addConfiguration, addConfiguration, addConfiguration, addConfigurations, cancel, createEndFilterEvent, createFilterWriter, createSkeletonWriter, createStartFilterEvent, findConfiguration, getConfiguration, getConfigurations, getDisplayName, getDocumentId, getDocumentName, getEncoderManager, getEncoding, getFilterConfigurationMapper, getFilterWriter, getMimeType, getName, getNewlineType, getParameters, getParameters, getParametersClassName, getParentId, getSrcLoc, getTrgLoc, isCanceled, isGenerateSkeleton, isMultilingual, removeConfiguration, setDisplayName, setDocumentName, setEncoding, setFilterConfigurationMapper, setFilterWriter, setGenerateSkeleton, setMultilingual, setName, setNewlineType, setOptions, setParameters, setParentId, setSrcLoc, setTrgLoc

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.Iterator: forEachRemaining, remove

public AbstractMarkupFilter()
Default constructor for AbstractMarkupFilter using the default AbstractMarkupEventBuilder.

public AbstractMarkupFilter(AbstractMarkupEventBuilder eventBuilder)
Constructor for AbstractMarkupFilter using the specified AbstractMarkupEventBuilder.

protected abstract TaggedFilterConfiguration getConfig()
Get the current TaggedFilterConfiguration. A TaggedFilterConfiguration is the result of reading in a YAML configuration file and converting it into Java objects.
Returns: TaggedFilterConfiguration

public void close()
Close the filter and all used resources.
Specified by: close in interface AutoCloseable
Specified by: close in interface IFilter
Overrides: close in class AbstractFilter

protected net.htmlparser.jericho.Source getParsedHeader(InputStream inputStream)

protected String detectEncoding(RawDocument input)

public void open(RawDocument input)
Start a new IFilter using the supplied RawDocument.
Parameters: input - input to the IFilter (can be a CharSequence, URI or InputStream)

public void open(RawDocument input, boolean generateSkeleton)
Start a new IFilter using the supplied RawDocument.
Specified by: open in interface IFilter
Overrides: open in class AbstractFilter
Parameters:
input - input to the IFilter (can be a CharSequence, URI or InputStream)
generateSkeleton - true if the IFilter should store non-translatable blocks (aka skeleton), false otherwise.
Throws: OkapiBadFilterInputException, OkapiIOException

public boolean hasNext()
Description copied from interface IFilter: Indicates if there is an event to process.
Implementer note: The caller must be able to call this method several times without changing state.

public Event next()
Queue up Jericho tokens until we can build an Okapi Event and return it.

protected AbstractMarkupEventBuilder createEventBuilder()
Delayed initialization of the EventBuilder. This will be called when the filter is initialized if an EventBuilder was not previously passed to the constructor.

protected void startFilter()
Initialize the filter for every input and send the StartDocument Event.

protected void endFilter()
End the current filter processing and send the Ending Event.

protected void preProcess(net.htmlparser.jericho.Segment segment)
Do any handling needed before the current Segment is processed.
Parameters: segment

protected void postProcessTextUnit(ITextUnit textUnit)
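
The contract of hasNext() and next() above (tokens are queued until an event can be built, and hasNext() can be called repeatedly without changing state) can be illustrated with a self-contained sketch. `QueuedEventSketch` is a hypothetical class, events are plain strings, and the rule "a token starting with `<` is a tag, anything else is text" is a simplification made up for this example. It is not how AbstractMarkupFilter is implemented.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical iterator that buffers text tokens and emits one event per text run or tag.
final class QueuedEventSketch implements Iterator<String> {
    private final Iterator<String> tokens;
    private final Deque<String> ready = new ArrayDeque<>();
    private final StringBuilder pendingText = new StringBuilder();

    QueuedEventSketch(Iterator<String> tokens) {
        this.tokens = tokens;
    }

    // Read-only check: calling it any number of times leaves the iterator unchanged.
    @Override
    public boolean hasNext() {
        return !ready.isEmpty() || pendingText.length() > 0 || tokens.hasNext();
    }

    // Consume tokens until at least one complete event is queued, then return it.
    @Override
    public String next() {
        while (ready.isEmpty() && tokens.hasNext()) {
            String token = tokens.next();
            if (token.startsWith("<")) {
                flushText();               // a tag ends the current text run
                ready.add("TAG:" + token);
            } else {
                pendingText.append(token); // keep collecting text
            }
        }
        if (ready.isEmpty()) {
            flushText();                   // end of input: emit trailing text, if any
        }
        if (ready.isEmpty()) {
            throw new NoSuchElementException();
        }
        return ready.poll();
    }

    private void flushText() {
        if (pendingText.length() > 0) {
            ready.add("TEXT:" + pendingText);
            pendingText.setLength(0);
        }
    }

    public static void main(String[] args) {
        Iterator<String> it =
                new QueuedEventSketch(Arrays.asList("<p>", "Hello ", "world", "</p>").iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Keeping hasNext() free of side effects is the simplest way to satisfy the implementer note; an implementation that pre-fetches inside hasNext() would need to make that pre-fetch idempotent.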
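
protected void handleServerCommonEscaped(net.htmlparser.jericho.Tag tag)
Handle any recognized escaped server tags.
Parameters: tag

protected void handleServerCommon(net.htmlparser.jericho.Tag tag)
Handle any recognized server tags (e.g., PHP, Mason).
Parameters: tag

protected void handleMarkupDeclaration(net.htmlparser.jericho.Tag tag)
Handle an XML markup declaration.
Parameters: tag

protected void handleXmlDeclaration(net.htmlparser.jericho.Tag tag)
Handle an XML declaration.
Parameters: tag

protected void handleDocTypeDeclaration(net.htmlparser.jericho.Tag tag)
Handle the XML doc type declaration (DTD).
Parameters: tag

protected void handleProcessingInstruction(net.htmlparser.jericho.Tag tag)
Handle processing instructions.
Parameters: tag

protected void handleComment(net.htmlparser.jericho.Tag tag)
Handle comments.
Parameters: tag

protected void handleCdataSection(net.htmlparser.jericho.Tag tag)
Handle CDATA sections.
Parameters: tag

protected void handleText(CharSequence text)
Handle all text (PCDATA).
Parameters: text

protected boolean isWhiteSpace(CharSequence text)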

protected void handleNumericEntity(net.htmlparser.jericho.NumericCharacterReference entity)
Handle all numeric entities.
Parameters: entity - the numeric entity

protected void handleCharacterEntity(net.htmlparser.jericho.CharacterEntityReference entity)
Handle all character entities.
Parameters: entity - the character entity

protected void handleStartTag(net.htmlparser.jericho.StartTag startTag)
Handle start tags.
Parameters: startTag

protected void updateStartTagRuleState(String tag, TaggedFilterConfiguration.RULE_TYPE ruleType, String idValue)

protected TaggedFilterConfiguration.RULE_TYPE updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag)
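
This page does not say what handleNumericEntity does with a numeric character reference. As background only, the sketch below shows how such a reference (decimal `&#233;` or hexadecimal `&#xE9;`) maps to a character. `NumericEntitySketch` is a hypothetical helper that uses only the Java standard library; it does not use Jericho's NumericCharacterReference.

```java
// Hypothetical helper: decodes "&#NNN;" or "&#xHHH;" into the character it denotes.
final class NumericEntitySketch {

    private NumericEntitySketch() {
    }

    static String decode(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric character reference: " + reference);
        }
        String body = reference.substring(2, reference.length() - 1);
        boolean hex = body.startsWith("x") || body.startsWith("X");
        int codePoint = hex
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body);
        // Character.toChars handles code points outside the Basic Multilingual Plane.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(decode("&#233;"));
        System.out.println(decode("&#xE9;"));
    }
}
```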

protected void handleEndTag(net.htmlparser.jericho.EndTag endTag)
Handle end tags, including empty tags.
Parameters: endTag

protected void handleDocumentPart(net.htmlparser.jericho.Tag tag)
Handle anything else not classified by Jericho.
Parameters: tag

protected abstract String normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag)
Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language".
Parameters:
attrName - the attribute name
attrValue - the attribute value
tag - the Jericho Tag that contains the attribute

protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag)
Add a Code to the current TextUnit.
Parameters: tag - the Jericho Tag that is converted to an Okapi Code

protected TextFragment.TagType determineTagType(net.htmlparser.jericho.Tag tag)
Filter-specific method for determining TextFragment.TagType.
Parameters: tag - Jericho Tag (start or end tag)
Returns: TextFragment.TagType

protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag, boolean endCodeNow)
Add a Code to the current TextUnit.
Parameters:
tag - the Jericho Tag that is converted to an Okapi Code
endCodeNow - do we end the code now or delay so we can add more content to the code?

protected List<PropertyTextUnitPlaceholder> createPropertyTextUnitPlaceholders(net.htmlparser.jericho.StartTag startTag)
For the given Jericho StartTag, parse out all the actionable attributes and store them as PropertyTextUnitPlaceholder. PropertyTextUnitPlaceholder.PlaceholderAccessType values are set based on the filter configuration for each attribute, for the attribute name and value.
Parameters: startTag - Jericho StartTag
Returns: the PropertyTextUnitPlaceholder objects created from the StartTag

protected PropertyTextUnitPlaceholder createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute)
Parameters:
type - PropertyTextUnitPlaceholder.PlaceholderAccessType, one of TRANSLATABLE, READ_ONLY_PROPERTY, WRITABLE_PROPERTY
name - attribute name
value - attribute value
tag - Jericho Tag which contains the attribute
attribute - attribute as a Jericho Attribute
Returns: PropertyTextUnitPlaceholder representing the attribute

protected boolean isUtf8Encoding()
Is the input encoded as UTF-8?
Overrides: isUtf8Encoding in class AbstractFilter

protected boolean isUtf8Bom()
Does the input have a UTF-8 Byte Order Mark?
Overrides: isUtf8Bom in class AbstractFilter

protected boolean isBOM()
protected boolean isDocumentEncoding()
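
For readers unfamiliar with the check behind isUtf8Bom(), a UTF-8 Byte Order Mark is the three-byte sequence EF BB BF at the start of the input. The sketch below tests for it on a raw byte array. `BomSketch` is a hypothetical helper; how AbstractMarkupFilter itself obtains this information is not described on this page.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array starts with the UTF-8 BOM (EF BB BF).
final class BomSketch {

    private BomSketch() {
    }

    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        byte[] withoutBom = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom));
        System.out.println(hasUtf8Bom(withoutBom));
    }
}
```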
protected boolean isPreserveWhitespace()
protected void setPreserveWhitespace(boolean preserveWhitespace)
protected void setTextUnitPreserveWhitespace(boolean preserveWhitespace)
protected void addToDocumentPart(String part)
protected void addToTextUnit(String text)
protected void startTextUnit(String text)
protected void setTextUnitName(String name)
protected void setTextUnitType(String type)
protected void setTextUnitTranslatable(boolean translatable)
protected void setCurrentDocName(String currentDocName)
protected String getCurrentDocName()
protected boolean canStartNewTextUnit()
protected boolean isInsideTextRun()
protected void addToTextUnit(Code code, boolean endCodeNow)
protected void addToTextUnit(Code code)
protected void addToTextUnit(Code code, boolean endCodeNow, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void addToTextUnit(Code code, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void endDocumentPart()
protected void startDocumentPart(String part, String name, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void startGroup(GenericSkeleton startMarker, String commonTagType)
protected void startGroup(GenericSkeleton startMarker, String commonTagType, LocaleId locale, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void startTextUnit(GenericSkeleton startMarker)
protected void startTextUnit(GenericSkeleton startMarker, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void endTextUnit(GenericSkeleton endMarker)
protected void endGroup(GenericSkeleton endMarker)
protected void startTextUnit()
protected long getTextUnitId()
protected void setTextUnitId(long id)
protected long getGroupIdSequence()
protected void setGroupIdSequence(long id)
protected void setTextUnitMimeType(String mimeType)
protected long getDocumentPartId()
protected void setDocumentPartId(long id)
protected void appendToFirstSkeletonPart(String text)
protected void addFilterEvent(Event event)
protected Event popTempEvent()
protected Event peekTempEvent()
protected ExtractionRuleState getRuleState()
public AbstractMarkupEventBuilder getEventBuilder()
public void setMimeType(String mimeType)
Sets the input document mime type.
Overrides: setMimeType in class AbstractFilter
Parameters: mimeType - the new mime type

public StringBuilder getBufferedWhiteSpace()
Copyright © 2021. All rights reserved.