public interface Group extends ElementFilter<Group>, ElementFilterStart<Group>, FieldDefinition
A group defines the boundaries where a given set of fields should be processed. The parser will only collect values from elements of a group if a matching condition is satisfied.
For example, given the following HTML:
<div id="55">
<h1>random text</h1>
<p>random paragraph</p>
<article>
<h1>first</h1>
<p>lorem</p>
</article>
<h1>random text 2</h1>
<p>random paragraph 2</p>
<article>
<h1>second</h1>
<p>ipsum</p>
</article>
</div>
To get all values of every h1 and p element inside an article, the following rules can be created:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("test");
Group group = entity
.newGroup()
.startAt("article")
.endAtClosing("article");
group.addField("title").match("h1").getText();
group.addField("text").match("p").getText();
In the example above, the group starts at every article element and ends when the parser finds a closing article element. Fields must be added to the group. The h1 and p elements outside of article are simply ignored by the parser.
The example above should produce the following two records:
[first, lorem]
[second, ipsum]
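The boundary behaviour described above can be illustrated with a small plain-Java simulation. This is only a sketch and does not use the parser library: the `GroupSketch` class, its `String[]{tag, text}` event model, and the choice to emit a record when the group closes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of group boundaries; NOT the parser's implementation.
final class GroupSketch {

    // Hypothetical event stream mirroring the sample HTML: {tag, text}.
    // A tag starting with "/" stands for a closing element.
    static List<String[]> sampleEvents() {
        List<String[]> events = new ArrayList<>();
        events.add(new String[] {"h1", "random text"});
        events.add(new String[] {"p", "random paragraph"});
        events.add(new String[] {"article", null});
        events.add(new String[] {"h1", "first"});
        events.add(new String[] {"p", "lorem"});
        events.add(new String[] {"/article", null});
        events.add(new String[] {"h1", "random text 2"});
        events.add(new String[] {"p", "random paragraph 2"});
        events.add(new String[] {"article", null});
        events.add(new String[] {"h1", "second"});
        events.add(new String[] {"p", "ipsum"});
        events.add(new String[] {"/article", null});
        return events;
    }

    // Collects [title, text] rows, but only while inside an article.
    static List<List<String>> collect(List<String[]> events) {
        List<List<String>> records = new ArrayList<>();
        String[] row = null; // non-null only while the group is open
        for (String[] event : events) {
            String tag = event[0];
            if (tag.equals("article")) {
                row = new String[2];              // group starts here
            } else if (tag.equals("/article")) {
                if (row != null) {
                    records.add(Arrays.asList(row)); // group ends here
                    row = null;
                }
            } else if (row != null) {
                if (tag.equals("h1")) row[0] = event[1]; // "title" field
                if (tag.equals("p")) row[1] = event[1];  // "text" field
            }
            // Elements seen while row == null are outside the group and ignored.
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(collect(sampleEvents())); // [[first, lorem], [second, ipsum]]
    }
}
```

The h1 and p events that arrive while no group is open never reach a field, which is why only the two article records appear.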
| Modifier and Type | Method and Description |
|---|---|
| RecordTriggerStart | addRecordTrigger() Starts the definition of a RecordTrigger, which is essentially a path to an element that, when found, makes the parser generate a record with all values accumulated so far. |
| T | copyPath() Copies the current path, allowing new matching rules to be added to a common path without changing the original one. |
Methods inherited from the parent interfaces:
attribute, attribute, childOf, classes, containedBy, containedBy, containing, containing, containing, filter, followedBy, followedBy, followedByText, followedImmediatelyBy, id, matchNext, not, parentOf, precededBy, precededBy, precededByText, precededImmediatelyBy, under, underHeader, withExactText, withExactTextMatchCase, withText, withTextMatchCase
downTo, downToFooter, upTo, upToHeader
match, match, match, matchCurrent, matchFirst, matchLast, select
addField, addField, addPersistentField, addSilentField

T copyPath()
Copies the current path, allowing new matching rules to be added to a common path without changing the original one.
Returns a PartialPath that allows the specification of a path and does not affect the path that it is built upon.

RecordTriggerStart addRecordTrigger()
Starts the definition of a RecordTrigger, which is essentially a path to an element that, when found, makes the parser generate a record with all values accumulated so far.
For example, assume you have to capture the email address and home address fields from an HTML document of customer
details. In the document, any of the customer's fields may or may not exist (i.e. be `null` when parsed).
An example of this HTML is shown below:
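```html
<table>
<tr><td>Email Address</td><td>bla@email.com</td></tr>
<tr><td>Home Address</td><td></td></tr>
<tr><td>Email Address</td><td></td></tr>
<tr><td>Home Address</td><td>123 real street</td></tr>
</table>
```
Consider the following rules, defined without a RecordTrigger: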
```java
HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings contact = entityList.configureEntity("contact");
contact.addField("emailAddress")
.match("td").precededBy("td").withText("Email Address").getText();
contact.addField("homeAddress")
.match("td").precededBy("td").withText("Home Address").getText();
```
This would produce one row with mixed results:
```
[bla@email.com, 123 real street]
```
Instead of the expected output:
```
[bla@email.com, null]
[null, 123 real street]
```
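The difference between the two outputs can be reproduced with a small plain-Java simulation. This is a sketch under stated assumptions and does not call the parser library: the `RecordTriggerSketch` class and its `parse` method are hypothetical, an empty cell is assumed to leave a field untouched, and the trigger is modelled as "flush the accumulated row whenever the trigger label is seen again".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of record triggering; NOT the parser's implementation.
final class RecordTriggerSketch {

    // Label/value pairs in document order; null stands for an empty cell.
    static String[][] sampleCells() {
        return new String[][] {
            {"Email Address", "bla@email.com"},
            {"Home Address", null},
            {"Email Address", null},
            {"Home Address", "123 real street"},
        };
    }

    // triggerLabel == null simulates a configuration with no trigger at all.
    static List<List<String>> parse(String[][] cells, String triggerLabel) {
        List<List<String>> rows = new ArrayList<>();
        String[] current = new String[2]; // [emailAddress, homeAddress]
        boolean seenAny = false;
        for (String[] cell : cells) {
            String label = cell[0];
            String value = cell[1];
            // Trigger found again: emit everything accumulated so far.
            if (triggerLabel != null && label.equals(triggerLabel) && seenAny) {
                rows.add(Arrays.asList(current));
                current = new String[2];
            }
            seenAny = true;
            int index = label.equals("Email Address") ? 0 : 1;
            if (value != null) {
                current[index] = value; // assumption: empty cells do not overwrite
            }
        }
        if (seenAny) {
            rows.add(Arrays.asList(current)); // emit the final row at end of input
        }
        return rows;
    }

    public static void main(String[] args) {
        // Without a trigger, values from both customers merge into one row.
        System.out.println(parse(sampleCells(), null));
        // [[bla@email.com, 123 real street]]

        // With a trigger on the first field of each customer, rows stay separate.
        System.out.println(parse(sampleCells(), "Email Address"));
        // [[bla@email.com, null], [null, 123 real street]]
    }
}
```

In the real API the trigger path would be defined through `addRecordTrigger()`; the simulation only shows why a boundary signal is needed when fields can be missing.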
The incorrect output is produced because there is no `RecordTrigger` to tell the parser when a record is complete.
Returns a `RecordTriggerStart` that defines the path for the trigger.
Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.