public class HtmlEntitySettings extends com.univocity.parsers.remote.RemoteEntitySettings<HtmlParsingContext,com.univocity.parsers.common.CommonParserSettings,HtmlParserSettings,HtmlLinkFollower> implements FieldDefinition
A HtmlEntitySettings object manages the configuration of a HTML entity. An entity has a name and one or more fields. These fields have paths to the elements that will have their data collected. In addition, a HtmlParserListener can be associated with an entity to notify the user of actions made by the HtmlParser.
See Also: HtmlEntityList, HtmlParser, HtmlParserListener, HtmlParsingContext

| Modifier and Type | Method and Description |
|---|---|
| PathStart | addField(String fieldName) Associates a regular field with an entity. |
| void | addField(String fieldName, String constantValue) Creates a field with a specified constant value. |
| PathStart | addPersistentField(String fieldName) Associates a persistent field with an entity. |
| RecordTriggerStart | addRecordTrigger() Returns a RecordTriggerStart that is used to specify a path that defines when rows should be created. |
| PathStart | addSilentField(String fieldName) Associates a “silent” field with an entity. |
| protected HtmlEntitySettings | clone() |
| HtmlLinkFollower | followLink(String fieldName, com.univocity.api.net.UrlReaderProvider urlReaderProvider) Creates a HtmlLinkFollower for a field with the name provided. |
| Set<String> | getFieldNames() |
| protected com.univocity.parsers.common.CommonParserSettings | getInternalSettings() |
| HtmlParserListener | getListener() Returns the HtmlParserListener associated with this HTML entity. |
| GroupStart | newGroup() Returns a GroupStart that allows for a Group to be defined. |
| PartialPathStart | newPath() Returns a PartialPathStart that is used to define a reusable path of HTML elements. |
| void | removeField(String fieldName) |
| void | setListener(HtmlParserListener listener) Associates a HtmlParserListener with this HTML entity. |
Inherited methods:
getEmptyValue, getNesting, getParentEntityList, getRemoteFollower, getRemoteFollowers, ignoreFollowingErrors, isColumnReorderingEnabled, isIgnoreFollowingErrors, setColumnReorderingEnabled, setEmptyValue, setNesting
createEmptyParserSettings, excludeFields, excludeFields, excludeIndexes, getEntityName, getErrorContentLength, getNullValue, getProcessor, getProcessorErrorHandler, getTrimLeadingWhitespaces, getTrimTrailingWhitespaces, isAutoConfigurationEnabled, isProcessorErrorHandlerDefined, runAutomaticConfiguration, selectFields, selectFields, selectIndexes, setAutoConfigurationEnabled, setErrorContentLength, setNullValue, setParent, setProcessor, setProcessorErrorHandler, setTrimLeadingWhitespaces, setTrimTrailingWhitespaces, toString, trimValues

public final PathStart addSilentField(String fieldName)
Description copied from interface: FieldDefinition
Associates a “silent” field with an entity. A silent field does not trigger new records when its value is overwritten, i.e. if the parser collects a value for a field that already contains data, and the field is silent, it won’t submit a new record. The parser will simply replace the previously collected value with the newly parsed value.
A RecordTrigger can be used to force new rows to be generated.
A usage example of silent fields can be shown with this HTML document:
<div>
<article class="feature">
<h1>first</h1>
<p>lorem</p>
<h1>second</h1>
</article>
</div>
To get the text of the p element along with the second header:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
entity.addSilentField("silent")
.match("h1")
.containedBy("article")
.getText();
entity.addField("text").match("article").match("p").getText();
The parser will return [second, lorem]. When the parser finishes parsing the p element, the row will actually be [first, lorem]. As soon as the parser finds the second h1 element, instead of creating a new row with this value, it will replace the “first” String with “second” generating the row [second, lorem].
If addField was used in this example instead of addSilentField, two rows would be produced: [first, lorem] and [second, null].
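The overwrite rule can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions, not the univocity implementation: the class `SilentFieldSketch` and its `collectRows` method are hypothetical, matches are replayed as plain (field index, value) pairs in document order, and a row is submitted only when a regular field is overwritten or the input ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the overwrite rule described above; hypothetical, not the library's code. */
public class SilentFieldSketch {

    /**
     * Replays matches in document order and returns the rows produced.
     * silent[i] tells whether field i is silent.
     */
    public static List<String> collectRows(boolean[] silent, int[] fieldOfMatch, String[] valueOfMatch) {
        List<String> rows = new ArrayList<>();
        String[] current = new String[silent.length];
        boolean hasData = false;
        for (int i = 0; i < fieldOfMatch.length; i++) {
            int field = fieldOfMatch[i];
            if (current[field] != null && !silent[field]) {
                // A regular field already holds data: submit the row and start a new one.
                rows.add(Arrays.toString(current));
                current = new String[silent.length];
            }
            // Empty fields and silent fields simply take the new value.
            current[field] = valueOfMatch[i];
            hasData = true;
        }
        if (hasData) {
            rows.add(Arrays.toString(current)); // flush the last row at end of input
        }
        return rows;
    }

    public static void main(String[] args) {
        // Matches from the HTML above, in document order: h1 "first", p "lorem", h1 "second".
        int[] fields = {0, 1, 0};
        String[] values = {"first", "lorem", "second"};

        // Field 0 silent, as with addSilentField: prints [[second, lorem]]
        System.out.println(collectRows(new boolean[]{true, false}, fields, values));
        // Field 0 regular, as with addField: prints [[first, lorem], [second, null]]
        System.out.println(collectRows(new boolean[]{false, false}, fields, values));
    }
}
```

The two calls differ only in the flag for field 0, which mirrors swapping addSilentField for addField in the example above.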
Specified by: addSilentField in interface FieldDefinition
Parameters: fieldName - name of the silent field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns: a PathStart, so that a path to the target HTML content to be captured can be defined

public final PathStart addField(String fieldName)
Description copied from interface: FieldDefinition
Associates a regular field with an entity. Regular fields are used by the parser to retain values for a row. When all values of a row are collected, the parser submits the row to the output, and clears all values collected for all fields. If the parser collects a value for a field that already contains data, the record will be submitted to the output and the incoming value will be associated with the given field in a new row.
For example, you could define a field called “headings” and then match h1 elements to get their text. When the parser runs, the text of the h1 elements found in the HTML document will be returned and made available in the field “headings”, e.g.:
HtmlEntityList entityList = new HtmlEntityList();
entityList.configureEntity("heading")
.addField("headings")
.match("h1")
.getText();
Specified by: addField in interface FieldDefinition
Parameters: fieldName - name of the field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns: a PathStart, so that a path to the target HTML content to be captured can be defined

public final PathStart addPersistentField(String fieldName)
Description copied from interface: FieldDefinition
Associates a persistent field with an entity. A persistent field is a field that retains its value until it is overwritten by the parser. When all values of a row are collected, the parser submits the row to the output, and clears the values collected for all fields, except the persistent ones, so they will be reused in subsequent records.
An example of using persistent fields can be explained by viewing this HTML:
<div id="55">
<article>
<h1>first</h1>
<p>lorem</p>
</article>
<article>
<h1>second</h1>
<p>ipsum</p>
</article>
</div>
In this example, we want to get two rows with three columns: [55, first, lorem] and [55, second, ipsum]. The value “55” in both records should come from the id of the div. The following rules can be defined to produce this output:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
entity.addPersistentField("persistentID").match("div").getAttribute("id");
entity.addField("title").match("h1").getText();
entity.addField("text").match("p").getText();
As the “persistentID” field was created as a persistent field, it will retain its value and the parser will reapply it into subsequent rows. If a regular FieldDefinition.addField(String) were used instead, the output would be [55, first, lorem] and [null, second, ipsum] as the div and its id would be matched once only.
NOTE: A persistent field is also “silent” and does not trigger new rows (see FieldDefinition.addSilentField(String)). If a persistent field’s path finds another match while processing the same record, the first value will be replaced by the new one, and no new records will be generated.
A RecordTrigger can be used to force new rows to be generated.
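The difference between the two outputs described above can be reproduced with a small stand-alone model. It is a sketch only, not the univocity implementation: `PersistentFieldSketch`, `Kind`, `collectRows` and `demo` are hypothetical names, matches are replayed as (field index, value) pairs in document order, and a row is submitted only when a regular field is overwritten or the input ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of persistent versus regular fields; hypothetical, not the library's code. */
public class PersistentFieldSketch {

    public enum Kind { REGULAR, PERSISTENT }

    public static List<String> collectRows(Kind[] kinds, int[] fieldOfMatch, String[] valueOfMatch) {
        List<String> rows = new ArrayList<>();
        String[] current = new String[kinds.length];
        boolean pending = false; // true once a regular field received data for the current row
        for (int i = 0; i < fieldOfMatch.length; i++) {
            int field = fieldOfMatch[i];
            if (kinds[field] == Kind.PERSISTENT) {
                current[field] = valueOfMatch[i]; // overwritten silently, never starts a new row
                continue;
            }
            if (current[field] != null) {
                rows.add(Arrays.toString(current));
                // Clear regular fields only; persistent values carry over to the next row.
                for (int f = 0; f < kinds.length; f++) {
                    if (kinds[f] == Kind.REGULAR) {
                        current[f] = null;
                    }
                }
            }
            current[field] = valueOfMatch[i];
            pending = true;
        }
        if (pending) {
            rows.add(Arrays.toString(current)); // flush the last row at end of input
        }
        return rows;
    }

    /** Replays the matches from the HTML above: the div id, then the h1 and p of each article. */
    public static List<String> demo(Kind idKind) {
        Kind[] kinds = {idKind, Kind.REGULAR, Kind.REGULAR};
        int[] fields = {0, 1, 2, 1, 2};
        String[] values = {"55", "first", "lorem", "second", "ipsum"};
        return collectRows(kinds, fields, values);
    }

    public static void main(String[] args) {
        System.out.println(demo(Kind.PERSISTENT)); // [[55, first, lorem], [55, second, ipsum]]
        System.out.println(demo(Kind.REGULAR));    // [[55, first, lorem], [null, second, ipsum]]
    }
}
```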
Specified by: addPersistentField in interface FieldDefinition
Parameters: fieldName - name of the persistent field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns: a PathStart, so that a path to the target HTML content to be captured can be defined

public final PartialPathStart newPath()
Returns a PartialPathStart that is used to define a reusable path of HTML elements. Fields can then be added to this path using FieldDefinition.addField(String) and others, which associates the field with this entity.
Example:
HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings items = entityList.configureEntity("items");
PartialPath path = items.newPath()
.match("table").id("productsTable")
.match("td").match("div").classes("productContainer");
//uses the path to add new fields to it and further element matching rules from the initial, common path.
path.addField("name").match("span").classes("prodName", "prodNameTro").getText();
path.addField("URL").match("a").childOf("div").classes("productPadding").getAttribute("href");
Returns: a PartialPathStart to specify the path of HTML elements

public final GroupStart newGroup()
Returns a GroupStart that allows for a Group to be defined. A Group demarcates a section of the HTML input that is allowed to be parsed. FieldPaths created from a group will only be executed inside this defined area, ignoring any HTML that exists outside of it. For example, say you wanted to extract the “hello” and “howdy” words from the following HTML:
<div class="parseMe">
<p>hello</p>
</div>
<p>howdy</p>
<h1>No Parsing Area</h1>
<p>don't parse me!</p>
The parsing rules, using groups, can be defined as:
HtmlEntityList entityList = new HtmlEntityList();
HtmlParserSettings settings = new HtmlParserSettings(entityList);
Group group = entityList.configureEntity("test")
.newGroup()
.startAt("div").classes("parseMe")
.endAt("h1");
group.addField("greeting").match("p").getText();
The parser will then ignore the "don't parse me" paragraph as the group restricts the parsing to the area defined from a div with class “parseMe” until an opening h1 tag.
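The effect of the start and end boundaries can be pictured with a small stand-alone model. This is a sketch under stated assumptions rather than the parser's actual logic: `GroupSketch`, `Element`, `sampleDocument` and `paragraphsInGroup` are hypothetical, and the HTML is reduced to a flat list of opening tags with their text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a group restricting where fields are collected; hypothetical, not the library's code. */
public class GroupSketch {

    /** Minimal stand-in for an opening tag, its class attribute and its text. */
    public static final class Element {
        final String tag;
        final String cssClass;
        final String text;

        Element(String tag, String cssClass, String text) {
            this.tag = tag;
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    /** The HTML from the example above, flattened in document order. */
    public static List<Element> sampleDocument() {
        return Arrays.asList(
                new Element("div", "parseMe", ""),
                new Element("p", null, "hello"),
                new Element("p", null, "howdy"),
                new Element("h1", null, "No Parsing Area"),
                new Element("p", null, "don't parse me!"));
    }

    /** Collects the text of p elements found between the group start and the group end. */
    public static List<String> paragraphsInGroup(List<Element> elements) {
        List<String> greetings = new ArrayList<>();
        boolean inside = false;
        for (Element e : elements) {
            if (!inside && e.tag.equals("div") && "parseMe".equals(e.cssClass)) {
                inside = true;            // corresponds to startAt("div").classes("parseMe")
            } else if (inside && e.tag.equals("h1")) {
                inside = false;           // corresponds to endAt("h1")
            } else if (inside && e.tag.equals("p")) {
                greetings.add(e.text);    // corresponds to addField("greeting").match("p").getText()
            }
        }
        return greetings;
    }

    public static void main(String[] args) {
        System.out.println(paragraphsInGroup(sampleDocument())); // [hello, howdy]
    }
}
```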
Returns: a GroupStart used to specify where the Group starts.

public final RecordTriggerStart addRecordTrigger()
Returns a RecordTriggerStart that is used to specify a path that defines when rows should be created.
See documentation in Trigger.addRecordTrigger() for a detailed explanation.
Returns: a RecordTriggerStart that defines the path for the trigger

public final Set<String> getFieldNames()
getFieldNames in class com.univocity.parsers.remote.RemoteEntitySettings<HtmlParsingContext,com.univocity.parsers.common.CommonParserSettings,HtmlParserSettings,HtmlLinkFollower>

public final void removeField(String fieldName)
removeField in class com.univocity.parsers.remote.RemoteEntitySettings<HtmlParsingContext,com.univocity.parsers.common.CommonParserSettings,HtmlParserSettings,HtmlLinkFollower>

public final void addField(String fieldName, String constantValue)
Description copied from interface: FieldDefinition
Creates a field with a specified constant value. An example of how to use this method can be shown with this HTML document:
<div>
<article>
<h1>first</h1>
<p>lorem</p>
</article>
<article>
<h1>second</h1>
<p>ipsum</p>
</article>
<article>
<h1>third</h1>
<p>lol</p>
</article>
</div>
And the following code:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
// creates a constant field
entity.addField("constant","cool article");
// regular fields
entity.addField("title").match("h1").getText();
entity.addField("content").match("p").getText();
When the parser runs, it will get the text from each article heading and p element. It will also attach the constant “cool article” to the first column of each row, producing:
[cool article, first, lorem]
[cool article, second, ipsum]
[cool article, third, lol]
Specified by: addField in interface FieldDefinition
Parameters: fieldName - name of the field to be created
constantValue - a constant value associated with the given field

public final void setListener(HtmlParserListener listener)
Associates a HtmlParserListener with this HTML entity. The listener methods will be triggered by the HtmlParser while it traverses the HTML structure to collect values for the fields of this entity. In essence, a HtmlParserListener provides information about events that occur during the parsing process.
Important: The listener methods are used in a concurrent environment. If you are using the same instance on multiple entities, make sure your listener implementation is thread-safe, or limit the number of threads to be used when parsing to 1 with HtmlParserSettings.setParserThreadCount(int).
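As an illustration of what “thread-safe” can mean for a shared listener, the sketch below keeps its state in concurrent collections so that callbacks from several threads do not lose updates. It does not use the real HtmlParserListener interface, whose methods are not listed on this page: `VisitListener`, `elementVisited`, `CountingListener` and `simulate` are hypothetical stand-ins, and the worker threads only imitate a parser firing events.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Sketch of a listener that is safe to share between threads; all names are hypothetical. */
public class ThreadSafeListenerSketch {

    /** Hypothetical callback; stands in for whatever methods the real listener declares. */
    public interface VisitListener {
        void elementVisited(String tagName);
    }

    /** Counts visits per tag name using concurrent structures instead of plain fields. */
    public static final class CountingListener implements VisitListener {
        private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

        @Override
        public void elementVisited(String tagName) {
            counts.computeIfAbsent(tagName, key -> new LongAdder()).increment();
        }

        public long count(String tagName) {
            LongAdder adder = counts.get(tagName);
            return adder == null ? 0 : adder.sum();
        }
    }

    /** Fires events at one shared listener from several threads and returns the total seen. */
    public static long simulate(int threads, int eventsPerThread) {
        CountingListener shared = new CountingListener();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < eventsPerThread; i++) {
                    shared.elementVisited("h1");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.count("h1");
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 1000)); // 4000
    }
}
```

With a plain HashMap and an int counter in place of the concurrent types, the same simulation could lose increments, which is the kind of problem the note above warns about.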
Parameters: listener - the HtmlParserListener to be used when the parser executes to collect values for the fields of this entity.

public final HtmlParserListener getListener()
Returns the HtmlParserListener associated with this HTML entity. The listener methods will be triggered by the HtmlParser while it traverses the HTML structure to collect values for the fields of this entity. In essence, a HtmlParserListener provides information about events that occur during the parsing process.
Important: The listener methods are used in a concurrent environment. If you are using the same instance on multiple entities, make sure your listener implementation is thread-safe, or limit the number of threads to be used when parsing to 1 with HtmlParserSettings.setParserThreadCount(int).
Returns: the HtmlParserListener to be used when the parser executes to collect values for the fields of this entity.

public HtmlLinkFollower followLink(String fieldName, com.univocity.api.net.UrlReaderProvider urlReaderProvider)
Creates a HtmlLinkFollower for a field with the name provided. The link follower will access the given UrlReaderProvider, and all values collected from this resource will be joined with the results of the current entity using the strategy defined by RemoteEntitySettings.getNesting().
A parametrized URL can be used here so values from each record produced by this entity can replace parameters in the URL. Use HtmlLinkFollower.assigning(java.lang.String, java.lang.Object) to replace the URL parameters.
Parameters: fieldName - the name of the field associated with HtmlLinkFollower
urlReaderProvider - the url that the HtmlLinkFollower will follow
Returns: the HtmlLinkFollower to allow for method chaining

protected final com.univocity.parsers.common.CommonParserSettings getInternalSettings()
getInternalSettings in class com.univocity.parsers.remote.RemoteEntitySettings<HtmlParsingContext,com.univocity.parsers.common.CommonParserSettings,HtmlParserSettings,HtmlLinkFollower>

protected HtmlEntitySettings clone()
clone in class com.univocity.parsers.remote.RemoteEntitySettings<HtmlParsingContext,com.univocity.parsers.common.CommonParserSettings,HtmlParserSettings,HtmlLinkFollower>

Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.