public class HtmlLinkFollower extends com.univocity.parsers.remote.RemoteFollower<HtmlEntitySettings,HtmlEntityList,HtmlParserSettings> implements FieldDefinition
A class that allows the addition of fields which are used by the HtmlParser to parse and return information from a linked page.
See Also: HtmlParser

| Modifier | Constructor and Description |
|---|---|
| protected | HtmlLinkFollower(HtmlEntitySettings parentEntitySettings): Creates a HtmlLinkFollower using parentEntitySettings as a basis for the settings |

| Modifier and Type | Method and Description |
|---|---|
| PathStart | addField(String fieldName): Associates a regular field with an entity. |
| void | addField(String fieldName, String constantValue): Creates a field with a specified value. |
| PathStart | addPersistentField(String fieldName): Associates a persistent field with an entity. |
| PathStart | addSilentField(String fieldName): Associates a “silent” field with an entity. |
| HtmlLinkFollower | assigning(String parameterName, Object parameterValue) |
| HtmlLinkFollower | assigning(String parameterName, com.univocity.api.common.ValueGetter<?> valueGetter) |
| HtmlPaginator | getPaginator() |
| GroupStart | newGroup(): Returns a GroupStart that allows for a Group to be defined. |
| PartialPathStart | newPath(): Returns a PartialPathStart that is used to define a reusable path of HTML elements. |

protected HtmlLinkFollower(HtmlEntitySettings parentEntitySettings)
Creates a HtmlLinkFollower using parentEntitySettings as a basis for the settings
Parameters:
parentEntitySettings - the parent entity settings to be used as a basis for this instance's settings

public PathStart addPersistentField(String fieldName)
Description copied from interface: FieldDefinition
Associates a persistent field with an entity. A persistent field is a field that retains its value until it is overwritten by the parser. When all values of a row are collected, the parser submits the row to the output and clears the values collected for all fields except the persistent ones, so they will be reused in subsequent records.
An example of using persistent fields can be explained by viewing this HTML:
<div id="55">
<article>
<h1>first</h1>
<p>lorem</p>
</article>
<article>
<h1>second</h1>
<p>ipsum</p>
</article>
</div>
In this example, we want to get two rows with three columns: [55, first, lorem] and [55, second, ipsum]. The value “55” in both records should come from the id of the div. The following rules can be defined to produce this output:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
entity.addPersistentField("persistentID").match("div").getAttribute("id");
entity.addField("title").match("h1").getText();
entity.addField("text").match("p").getText();
As the “persistentID” field was created as a persistent field, it will retain its value and the parser will reapply it into subsequent rows. If a regular FieldDefinition.addField(String) were used instead, the output would be [55, first, lorem] and [null, second, ipsum] as the div and its id would be matched once only.
NOTE: A persistent field is also “silent” and does not trigger new rows (see FieldDefinition.addSilentField(String)). If a persistent field’s path finds another match while processing the same record, the first value will be replaced by the new one, and no new records will be generated.
A RecordTrigger can be used to force new rows to be generated.
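The row-collection rule described above can be sketched in plain Java. This is a simplified, hypothetical model and not the parser's implementation: PersistentFieldModel and its collect method are illustrative names that do not exist in the univocity API. It assumes the matched values arrive in document order, that a row is submitted when a regular field would be overwritten, and that persistent values are carried over to the next row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the persistent-field rule; not part of the univocity API.
class PersistentFieldModel {

    // persistent[c] marks column c as persistent.
    // columns[i] / values[i] describe the i-th match, in document order.
    static List<List<String>> collect(boolean[] persistent, int[] columns, String[] values) {
        List<List<String>> rows = new ArrayList<>();
        String[] row = new String[persistent.length];
        for (int i = 0; i < values.length; i++) {
            int column = columns[i];
            if (row[column] != null && !persistent[column]) {
                // A regular field already holds data: submit the record,
                // then clear every value except the persistent ones.
                rows.add(Arrays.asList(row.clone()));
                for (int c = 0; c < row.length; c++) {
                    if (!persistent[c]) {
                        row[c] = null;
                    }
                }
            }
            // A persistent field is simply overwritten, without a new row.
            row[column] = values[i];
        }
        if (values.length > 0) {
            rows.add(Arrays.asList(row.clone())); // submit the last pending record
        }
        return rows;
    }
}
```

Feeding this model the matches from the HTML above (id, h1, p, h1, p) with the first column marked persistent yields [55, first, lorem] and [55, second, ipsum]. With no persistent columns, the second row becomes [null, second, ipsum].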
Specified by:
addPersistentField in interface FieldDefinition
Parameters:
fieldName - name of the persistent field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns:
a PathStart, so that a path to the target HTML content to be captured can be defined

public PathStart addSilentField(String fieldName)
Description copied from interface: FieldDefinition
Associates a “silent” field with an entity. A silent field does not trigger new records when values of a field are overwritten, i.e. if the parser collects a value for a field that already contains data, and the field is silent, it won’t submit a new record. The parser will simply replace the previously collected value with the newly parsed value.
A RecordTrigger can be used to force new rows to be generated.
A usage example of silent fields can be shown with this HTML document:
<div>
<article class="feature">
<h1>first</h1>
<p>lorem</p>
<h1>second</h1>
</article>
</div>
To get the text of the p element along with the second header:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
entity.addSilentField("silent")
.match("h1")
.containedBy("article")
.getText();
entity.addField("text").match("article").match("p").getText();
The parser will return [second, lorem]. When the parser finishes parsing the p element, the row will actually be [first, lorem]. As soon as the parser finds the second h1 element, instead of creating a new row with this value, it will replace the “first” String with “second” generating the row [second, lorem].
If addField were used in this example instead of addSilentField, two rows would be produced: [first, lorem] and [second, null].
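The difference between the two outcomes can be reproduced with a small plain-Java model. The sketch below is hypothetical (SilentFieldModel, accept, finish and demo are illustrative names, not univocity classes or methods) and assumes only the rule stated above: overwriting a regular field submits the current row, while overwriting a silent field replaces the value in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the silent-field rule; not part of the univocity API.
class SilentFieldModel {
    private final List<String> fieldNames;   // column order of each row
    private final Set<String> silentFields;  // fields that never trigger a new row
    private final Map<String, String> current = new HashMap<>();
    private final List<List<String>> rows = new ArrayList<>();

    SilentFieldModel(List<String> fieldNames, String... silentFields) {
        this.fieldNames = fieldNames;
        this.silentFields = new HashSet<>(Arrays.asList(silentFields));
    }

    // Called once per matched value, in document order.
    void accept(String field, String value) {
        if (current.containsKey(field) && !silentFields.contains(field)) {
            submit(); // a regular field is being overwritten: emit the record first
        }
        current.put(field, value); // a silent field is simply replaced
    }

    // Emits the pending row, if any, and returns everything collected.
    List<List<String>> finish() {
        if (!current.isEmpty()) {
            submit();
        }
        return rows;
    }

    private void submit() {
        List<String> row = new ArrayList<>();
        for (String name : fieldNames) {
            row.add(current.get(name)); // null when no value was collected
        }
        rows.add(row);
        current.clear();
    }

    // Replays the matches of the example document: h1, p, h1.
    static List<List<String>> demo(boolean silent) {
        List<String> names = Arrays.asList("silent", "text");
        SilentFieldModel model = silent
                ? new SilentFieldModel(names, "silent")
                : new SilentFieldModel(names);
        model.accept("silent", "first");
        model.accept("text", "lorem");
        model.accept("silent", "second");
        return model.finish();
    }
}
```

In this model, demo(true) produces the single row [second, lorem], and demo(false) produces [first, lorem] followed by [second, null], matching the two cases described above.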
Specified by:
addSilentField in interface FieldDefinition
Parameters:
fieldName - name of the silent field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns:
a PathStart, so that a path to the target HTML content to be captured can be defined

public void addField(String fieldName, String constantValue)
Description copied from interface: FieldDefinition
Creates a field with a specified value. The use of this method can be shown with this HTML document:
<div>
<article>
<h1>first</h1>
<p>lorem</p>
</article>
<article>
<h1>second</h1>
<p>ipsum</p>
</article>
<article>
<h1>third</h1>
<p>lol</p>
</article>
</div>
And the following code:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
// creates a constant field
entity.addField("constant","cool article");
// regular fields
entity.addField("title").match("h1").getText();
entity.addField("content").match("p").getText();
When the parser runs, it will get the text from each article heading and p element. It will also attach the constant “cool article” to the first column of each row, producing:
[cool article, first, lorem]
[cool article, second, ipsum]
[cool article, third, lol]
Specified by:
addField in interface FieldDefinition
Parameters:
fieldName - name of the field to be created
constantValue - a constant value associated with the given field

public PathStart addField(String fieldName)
Description copied from interface: FieldDefinition
Associates a regular field with an entity. Regular fields are used by the parser to retain values for a row. When all values of a row are collected, the parser submits the row to the output, and clears all values collected for all fields. If the parser collects a value for a field that already contains data, the record will be submitted to the output and the incoming value will be associated with the given field in a new row.
For example, you could define a field called “headings” and then match h1 elements to get their text. When the parser runs, the h1 elements found in the HTML document will be returned and be available in the field “headings”, e.g.:
HtmlEntityList entityList = new HtmlEntityList();
entityList.configureEntity("heading")
.addField("headings")
.match("h1")
.getText();
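The rule for regular fields (a value arriving for a field that already contains data submits the current record and starts a new row) can be modelled in a few lines of plain Java. This is a hypothetical sketch, not the parser's code: RegularFieldModel and collect are illustrative names, and the heading texts used in the usage note are invented sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the regular-field rule; not part of the univocity API.
class RegularFieldModel {

    // columns[i] / values[i] describe the i-th match, in document order.
    static List<List<String>> collect(int columnCount, int[] columns, String[] values) {
        List<List<String>> rows = new ArrayList<>();
        String[] row = new String[columnCount];
        for (int i = 0; i < values.length; i++) {
            int column = columns[i];
            if (row[column] != null) {
                // The field already contains data: submit the record
                // and continue with a fresh, empty row.
                rows.add(Arrays.asList(row));
                row = new String[columnCount];
            }
            row[column] = values[i];
        }
        if (values.length > 0) {
            rows.add(Arrays.asList(row)); // submit the last pending record
        }
        return rows;
    }
}
```

With a single “headings” column and three h1 matches, for instance "Intro", "Setup" and "Usage", the model returns three one-column rows, one per heading.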
Specified by:
addField in interface FieldDefinition
Parameters:
fieldName - name of the field to be created. If called more than once, a new PathStart will be returned, allowing multiple paths to be used to collect data into the same field.
Returns:
a PathStart, so that a path to the target HTML content to be captured can be defined

public HtmlLinkFollower assigning(String parameterName, Object parameterValue)
assigning in class com.univocity.parsers.remote.RemoteFollower<HtmlEntitySettings,HtmlEntityList,HtmlParserSettings>
Returns:
the HtmlLinkFollower, to allow for method chaining

public HtmlLinkFollower assigning(String parameterName, com.univocity.api.common.ValueGetter<?> valueGetter)
assigning in class com.univocity.parsers.remote.RemoteFollower<HtmlEntitySettings,HtmlEntityList,HtmlParserSettings>
Returns:
the HtmlLinkFollower, to allow for method chaining

public final HtmlPaginator getPaginator()
Returns the HtmlPaginator associated with the HtmlParserSettings of this HtmlEntityList
Returns:
the HtmlPaginator stored in the HtmlParserSettings of this HtmlEntityList

public final PartialPathStart newPath()
Returns a PartialPathStart that is used to define a reusable path of HTML elements. Fields can then be added to this path using FieldDefinition.addField(String) and related methods, which associate each field with this entity.
Example:
HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings items = entityList.configureEntity("items");
PartialPath path = items.newPath()
.match("table").id("productsTable")
.match("td").match("div").classes("productContainer");
//uses the path to add new fields to it and further element matching rules from the initial, common path.
path.addField("name").match("span").classes("prodName", "prodNameTro").getText();
path.addField("URL").match("a").childOf("div").classes("productPadding").getAttribute("href");
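One way to picture a reusable path is as a shared prefix from which each field's own path is derived. The sketch below illustrates that idea in plain Java under the assumption that extending the prefix never alters it. PathSketch, match and describe are hypothetical names, and the selector strings in the usage note are an informal notation, not library syntax. How the library represents paths internally is not stated in this documentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of a shared path prefix; not part of the univocity API.
final class PathSketch {
    private final List<String> steps;

    PathSketch() {
        this(Collections.<String>emptyList());
    }

    private PathSketch(List<String> steps) {
        this.steps = steps;
    }

    // Returns a new path with one more matching step; the receiver is left unchanged.
    PathSketch match(String selector) {
        List<String> extended = new ArrayList<>(steps);
        extended.add(selector);
        return new PathSketch(Collections.unmodifiableList(extended));
    }

    // Renders the steps from the outermost element to the innermost one.
    String describe() {
        return String.join(" > ", steps);
    }
}
```

Building a common prefix with match("table#productsTable"), match("td") and match("div.productContainer"), then calling match("span.prodName") and match("a") on it, gives two longer paths that both start with the same three steps, while the prefix itself still describes only those three steps.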
Returns:
a PartialPathStart to specify the path of HTML elements

public final GroupStart newGroup()
Returns a GroupStart that allows for a Group to be defined. A Group demarcates a section of the HTML input that is allowed to be parsed. FieldPaths created from a group will only be executed inside this defined area, ignoring any HTML that exists outside of it. For example, say you wanted to extract the “hello” and “howdy” words from the following HTML:
<div class="parseMe">
<p>hello</p>
</div>
<p>howdy</p>
<h1>No Parsing Area</h1>
<p>don't parse me!</p>
The parsing rules, using groups, can be defined as:
HtmlEntityList entityList = new HtmlEntityList();
HtmlParserSettings settings = new HtmlParserSettings(entityList);
Group group = entityList.configureEntity("test")
.newGroup()
.startAt("div").classes("parseMe")
.endAt("h1");
group.addField("greeting").match("p").getText();
The parser will then ignore the "don't parse me" paragraph as the group restricts the parsing to the area defined from a div with class “parseMe” until an opening h1 tag.
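The demarcation behaviour can be sketched without the library. The model below is hypothetical (GroupModel, Element, collect and demo are illustrative names, not univocity types) and flattens the document into a list of opening tags in document order. It assumes a group opens at the first element matching the start tag and class, and closes when the end tag is opened, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of group demarcation; not part of the univocity API.
class GroupModel {

    // One opening tag of the document, with its class attribute (may be null) and text.
    static final class Element {
        final String tag;
        final String cssClass;
        final String text;

        Element(String tag, String cssClass, String text) {
            this.tag = tag;
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    // Collects the text of every fieldTag element found inside the group.
    static List<String> collect(List<Element> document, String startTag, String startClass,
                                String endTag, String fieldTag) {
        List<String> values = new ArrayList<>();
        boolean inside = false;
        for (Element e : document) {
            if (!inside) {
                // The group opens at the start element; content before it is ignored.
                inside = e.tag.equals(startTag) && startClass.equals(e.cssClass);
            } else if (e.tag.equals(endTag)) {
                inside = false; // the opening end tag closes the group
            } else if (e.tag.equals(fieldTag)) {
                values.add(e.text);
            }
        }
        return values;
    }

    // Replays the example document from this section.
    static List<String> demo() {
        List<Element> document = Arrays.asList(
                new Element("div", "parseMe", ""),
                new Element("p", null, "hello"),
                new Element("p", null, "howdy"),
                new Element("h1", null, "No Parsing Area"),
                new Element("p", null, "don't parse me!"));
        return collect(document, "div", "parseMe", "h1", "p");
    }
}
```

Here demo() returns the values hello and howdy. The last paragraph is skipped because it comes after the h1 that closes the group.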
Returns:
a GroupStart used to specify where the Group starts.

Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.