public interface ElementFilterStart<T extends ElementFilter<T>>
Provides the first step of an ElementFilter. Essentially, the ElementFilterStart defines which HTML element should be matched when the HtmlParser is run. Elements matched can be subsequently filtered using the rules available from ElementFilter, or have their data retrieved using the options provided by ContentReader.
This is the first step in creating a FieldPath for a field of an entity configured via HtmlEntitySettings. When the parser processes an HTML input, it applies the configured filtering rules to the elements whose tag names match the rules defined through the methods of this interface.
See Also: FieldPath, PartialPath, ElementFilter

| Modifier and Type | Method and Description |
|---|---|
| T | match(HtmlElementMatcher customHtmlElementMatcher) - Specifies what element the parser must match based on the return value supplied by the given HtmlElementMatcher. |
| T | match(String tagName) - Matches a given tag name at any distance from the current element. |
| T | match(String tagName, int occurrence) - Matches a given tag name and its occurrence index among neighboring nodes within the same parent. |
| T | matchCurrent() - Matches the current node defined in the path. |
| T | matchFirst(String tagName) - Matches the first occurrence of the given tag name among neighboring nodes within the same parent. |
| T | matchLast(String tagName) - Matches the last occurrence of the given tag name among neighboring nodes within the same parent. |
| T | select(String cssQuery) - Selects what HTML element the parser must match using a CSS query. |
T match(String tagName)
Matches a given tag name at any distance from the current element. Navigates through sibling and children nodes.
For example, to get the text of all span elements in an HTML document, simply write:
//Set up
HtmlEntityList htmlEntityList = new HtmlEntityList();
//Matching Rule
htmlEntityList.configureEntity("test")
.addField("allSpanElements")
.match("span")
.getText();
When the parser runs, it will match every span element found on the HTML document and return their text content.
Multiple matching rules can be chained together to create a more specific path. Given a sequence of elements to be matched, the parser traverses the HTML structure looking for the first element in the sequence.
Once the first element of the sequence is found, the parser will then look for the next element in the sequence. This next element must be in the hierarchy of the previous element found or be one of the next siblings of the previous element.
For example, consider the HTML below:
<div>
<h1>Bad Title</h1>
</div>
<article>
<h1>Good Title</h1>
</article>
<h1>Also Good Title</h1>
The code to only get the text from h1 elements inside or next to article elements is:
htmlEntityList.configureEntity("test")
.addField("headers")
.match("article")
.match("h1")
.getText();
The parser will capture “Good Title” and “Also Good Title”. This is because the matching rules will look for article elements, then h1 elements inside the corresponding article, and finally h1 elements following the article.
As this method returns an ElementFilter, the user can provide further matching rules to match the given element more precisely, or to match other elements associated with it. Very specific paths can be created to capture data from virtually anywhere in an HTML document.
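The chained rule described above can be modelled in plain Java. This is only an illustrative sketch of the stated behaviour, not the parser's implementation: `Node`, `matchChain` and the other names are hypothetical stand-ins, and the model assumes that the second rule considers descendants of the first match plus its following siblings, as the description says.

```java
import java.util.ArrayList;
import java.util.List;

public class ChainedMatchSketch {

    /** Minimal stand-in for an HTML element; not a type from the parser's API. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();
        Node parent;

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Collects, in document order, every node below root that has the given tag. */
    static void descendants(Node root, String tag, List<Node> out) {
        for (Node child : root.children) {
            if (child.tag.equals(tag)) {
                out.add(child);
            }
            descendants(child, tag, out);
        }
    }

    /** Collects the siblings that come after node and have the given tag. */
    static void followingSiblings(Node node, String tag, List<Node> out) {
        if (node.parent == null) {
            return;
        }
        List<Node> siblings = node.parent.children;
        for (int i = siblings.indexOf(node) + 1; i < siblings.size(); i++) {
            if (siblings.get(i).tag.equals(tag)) {
                out.add(siblings.get(i));
            }
        }
    }

    /** Models a two-step path: find `first`, then `second` inside it or after it. */
    static List<String> matchChain(Node root, String first, String second) {
        List<Node> firstMatches = new ArrayList<>();
        descendants(root, first, firstMatches);

        List<Node> secondMatches = new ArrayList<>();
        for (Node match : firstMatches) {
            descendants(match, second, secondMatches);
            followingSiblings(match, second, secondMatches);
        }

        List<String> texts = new ArrayList<>();
        for (Node node : secondMatches) {
            texts.add(node.text);
        }
        return texts;
    }

    /** Rebuilds the HTML of the example above under a hypothetical body node. */
    static List<String> demo() {
        Node body = new Node("body", "");
        body.add(new Node("div", "").add(new Node("h1", "Bad Title")));
        body.add(new Node("article", "").add(new Node("h1", "Good Title")));
        body.add(new Node("h1", "Also Good Title"));
        return matchChain(body, "article", "h1");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Good Title, Also Good Title]
    }
}
```

The h1 inside the div is skipped because no article precedes or contains it, which mirrors the result described for the example.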
Parameters:
tagName - tag name of the element that will be matched.
Returns:
an ElementFilter so that filtering rules over HTML elements with the given tag name can be defined
T match(String tagName, int occurrence)
Matches a given tag name and its occurrence index among neighboring nodes within the same parent.
For example, consider the following HTML table:
<table>
<tr>
<td>Email Address</td>
<td>bla@email.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>123 Some St</td>
</tr>
</table>
<table>
<tr>
<td>Email Address</td>
<td>some@one.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>456 Another St</td>
</tr>
</table>
To capture the contents under “Home Address”, one could write:
HtmlEntitySettings address = entityList.configureEntity("address");
address.addField("line1")
.match("tr", 2) //matches the second tr element of each table
.match("td", 2) //inside the previously matched tr, match the second occurrence of the td element
.getText(); //gets the text of the matched td element.
The example above should collect the values:
123 Some St
456 Another St
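The occurrence rule can be modelled in plain Java as follows. This is a hedged sketch of the counting described above, not the parser's code: `Node`, `nthChild` and `row` are hypothetical names, and the occurrence index is assumed to be 1-based and counted only among children of the same parent that share the tag name, as the example comments suggest.

```java
import java.util.ArrayList;
import java.util.List;

public class OccurrenceSketch {

    /** Minimal stand-in for an HTML element; not a type from the parser's API. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the n-th child (1-based) of parent with the given tag, or null if absent. */
    static Node nthChild(Node parent, String tag, int occurrence) {
        int seen = 0;
        for (Node child : parent.children) {
            if (child.tag.equals(tag) && ++seen == occurrence) {
                return child;
            }
        }
        return null;
    }

    /** Builds a two-cell table row. */
    static Node row(String label, String value) {
        return new Node("tr", "")
                .add(new Node("td", label))
                .add(new Node("td", value));
    }

    /** Applies "second tr, then second td" to each table independently. */
    static List<String> demo() {
        List<Node> tables = new ArrayList<>();
        tables.add(new Node("table", "")
                .add(row("Email Address", "bla@email.com"))
                .add(row("Home Address", "123 Some St")));
        tables.add(new Node("table", "")
                .add(row("Email Address", "some@one.com"))
                .add(row("Home Address", "456 Another St")));

        List<String> result = new ArrayList<>();
        for (Node table : tables) {
            Node tr = nthChild(table, "tr", 2);
            Node td = (tr == null) ? null : nthChild(tr, "td", 2);
            if (td != null) {
                result.add(td.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [123 Some St, 456 Another St]
    }
}
```

Because the count restarts for every parent, each table contributes its own second row, which is why both addresses are collected.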
Parameters:
tagName - tag name of the element that will be matched.
occurrence - occurrence index of elements whose tag name matches within a given parent node.
Returns:
an ElementFilter so that filtering rules over HTML elements with the given tag name can be defined
T matchFirst(String tagName)
Matches the first occurrence of the given tag name among neighboring nodes within the same parent.
For example, consider the following HTML table:
<table>
<tr>
<td>Email Address</td>
<td>bla@email.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>123 Some St</td>
</tr>
</table>
<table>
<tr>
<td>Email Address</td>
<td>some@one.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>456 Another St</td>
</tr>
</table>
To capture the contents under “Email Address”, one could write:
HtmlEntitySettings email = entityList.configureEntity("email");
email.addField("email")
.matchFirst("tr") //matches the first tr element of each table
.matchLast("td") //inside the previously matched tr, match the last occurrence of the td element
.getText(); //gets the text of the matched td element.
This will capture “bla@email.com” but not “123 Some St”, because that td is not inside the first tr matched in the table.
It will also capture “some@one.com”, because that td is inside the first tr of a different parent table.
Parameters:
tagName - tag name of the first element to be matched inside its parent.
Returns:
an ElementFilter so that filtering rules over HTML elements with the given tag name can be defined
T matchLast(String tagName)
Matches the last occurrence of the given tag name among neighboring nodes within the same parent.
For example, consider the following HTML table:
<table>
<tr>
<td>Email Address</td>
<td>bla@email.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>123 Some St</td>
</tr>
</table>
<table>
<tr>
<td>Email Address</td>
<td>some@one.com</td>
</tr>
<tr>
<td>Home Address</td>
<td>456 Another St</td>
</tr>
</table>
To capture the contents under “Home Address”, one could write:
HtmlEntitySettings address = entityList.configureEntity("address");
address.addField("line1")
.matchLast("tr") //matches the last tr element of each table
.matchLast("td") //inside the previously matched tr, match the last occurrence of the td element
.getText(); //gets the text of the matched td element.
This will capture “123 Some St” from the first table. It will also capture “456 Another St” as this td is inside a different parent table.
Parameters:
tagName - tag name of the last element to be matched inside its parent.
Returns:
an ElementFilter so that filtering rules over HTML elements with the given tag name can be defined
T select(String cssQuery)
Selects what HTML element the parser must match using a CSS query. The query is applied over the children of the current element in the path.
Matched elements may include the current element itself, or any of its children, depending on the query.
A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).
The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).
| Pattern | Matches | Example |
|---|---|---|
| * | any element | * |
| tag | elements with the given tag name | div |
| *|E | elements of type E in any namespace ns | *|name finds <fb:name> elements |
| ns|E | elements of type E in the namespace ns | fb|name finds <fb:name> elements |
| #id | elements with attribute ID of "id" | div#wrap, #logo |
| .class | elements with a class name of "class" | div.left, .result |
| [attr] | elements with an attribute named "attr" (with any value) | a[href], [title] |
| [^attrPrefix] | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | [^data-], div[^data-] |
| [attr=val] | elements with an attribute named "attr", and value equal to "val" | img[width=500], a[rel=nofollow] |
| [attr="val"] | elements with an attribute named "attr", and value equal to "val" | span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"] |
| [attr^=valPrefix] | elements with an attribute named "attr", and value starting with "valPrefix" | a[href^=http:] |
| [attr$=valSuffix] | elements with an attribute named "attr", and value ending with "valSuffix" | img[src$=.png] |
| [attr*=valContaining] | elements with an attribute named "attr", and value containing "valContaining" | a[href*=/search/] |
| [attr~=regex] | elements with an attribute named "attr", and value matching the regular expression | img[src~=(?i)\\.(png|jpe?g)] |
| | The above may be combined in any order | div.header[title] |
| Combinators | | |
| E F | an F element descended from an E element | div a, .logo h1 |
| E > F | an F direct child of E | ol > li |
| E + F | an F element immediately preceded by sibling E | li + li, div.head + div |
| E ~ F | an F element preceded by sibling E | h1 ~ p |
| E, F, G | all matching elements E, F, or G | a[href], div, h3 |
| Pseudo selectors | | |
| :lt(n) | elements whose sibling index is less than n | td:lt(3) finds the first 3 cells of each row |
| :gt(n) | elements whose sibling index is greater than n | td:gt(1) finds cells after skipping the first two |
| :eq(n) | elements whose sibling index is equal to n | td:eq(0) finds the first cell of each row |
| :has(selector) | elements that contain at least one element matching the selector | div:has(p) finds divs that contain p elements |
| :not(selector) | elements that do not match the selector. | div:not(.logo) finds all divs that do not have the "logo" class. div:not(:has(div)) finds divs that do not contain divs. |
| :contains(text) | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. | p:contains(univocity) finds p elements containing the text "univocity". |
| :matches(regex) | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. | td:matches(\\d+) finds table cells containing digits. div:matches((?i)login) finds divs containing the text, case insensitively. |
| :containsOwn(text) | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. | p:containsOwn(univocity) finds p elements with own text "univocity". |
| :matchesOwn(regex) | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | td:matchesOwn(\\d+) finds table cells directly containing digits. div:matchesOwn((?i)login) finds divs containing the text, case insensitively. |
| :containsData(data) | elements that contain the specified data. The contents of script and style elements, and comment nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants. | script:containsData(univocity) finds script elements containing the data "univocity". |
| | The above may be combined in any order and with other selectors | .light:contains(name):eq(0) |
| :matchText | treats text nodes as elements, and so allows you to match against and select text nodes. | p:matchText:firstChild with input One<br />Two will return one element with text "One". |
| Structural pseudo selectors | | |
| :root | The element that is the root of the document. In HTML, this is the html element | :root |
| :nth-child(an+b) | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and that have a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder) and selects the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition, :nth-child() can take odd and even as arguments: odd has the same meaning as 2n+1, and even has the same meaning as 2n. | tr:nth-child(2n+1) finds every odd row of a table. :nth-child(10n-1) finds the 9th, 19th, 29th, etc. element. li:nth-child(5) finds the 5th li |
| :nth-last-child(an+b) | elements that have an+b-1 siblings after them in the document tree. Otherwise like :nth-child() | tr:nth-last-child(-n+2) finds the last two rows of a table |
| :nth-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-of-type(2n+1) |
| :nth-last-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-last-of-type(2n+1) |
| :first-child | elements that are the first child of some other element. | div > p:first-child |
| :last-child | elements that are the last child of some other element. | ol > li:last-child |
| :first-of-type | elements that are the first sibling of their type in the list of children of their parent element | dl dt:first-of-type |
| :last-of-type | elements that are the last sibling of their type in the list of children of their parent element | tr > td:last-of-type |
| :only-child | elements that have a parent element and whose parent element has no other element children | |
| :only-of-type | elements that have a parent element and whose parent element has no other element children with the same expanded element name | |
| :empty | elements that have no children at all | |
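The an+b arithmetic used by the :nth-child family can be checked with a few lines of plain Java. This is a sketch of the standard CSS rule only (a 1-based index matches when a*n+b equals it for some integer n >= 0); `matchesNth` and `matching` are hypothetical helper names and are not part of the parser.

```java
import java.util.ArrayList;
import java.util.List;

public class NthChildSketch {

    /** True when some integer n >= 0 satisfies a*n + b == index (index is 1-based). */
    static boolean matchesNth(int a, int b, int index) {
        if (a == 0) {
            return index == b;
        }
        int diff = index - b;
        // diff must be a whole multiple of a, and the multiplier n must not be negative
        return diff % a == 0 && diff / a >= 0;
    }

    /** Lists the 1-based positions among `count` siblings selected by an+b. */
    static List<Integer> matching(int a, int b, int count) {
        List<Integer> positions = new ArrayList<>();
        for (int index = 1; index <= count; index++) {
            if (matchesNth(a, b, index)) {
                positions.add(index);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(matching(2, 1, 5));    // 2n+1  -> [1, 3, 5]
        System.out.println(matching(2, 0, 5));    // 2n    -> [2, 4]
        System.out.println(matching(-1, 2, 5));   // -n+2  -> [1, 2]
        System.out.println(matching(10, -1, 30)); // 10n-1 -> [9, 19, 29]
        System.out.println(matching(0, 5, 6));    // 5     -> [5]
    }
}
```

For :nth-last-child and :nth-last-of-type the same test applies with positions counted from the end, so -n+2 selects the last two siblings.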
Parameters:
cssQuery - the CSS-like query to be used for matching elements of interest.
Returns:
an ElementFilter so that filtering rules can be applied over the HTML elements matched with the CSS query
T match(HtmlElementMatcher customHtmlElementMatcher)
Specifies what element the parser must match based on the return value supplied by the given HtmlElementMatcher. When the parser runs it will invoke the HtmlElementMatcher.match(HtmlElement, HtmlElement) method for each node visited from the current matched node in path. If the result is true, then the node will be matched and the parser will proceed trying to match the next element in the matching sequence, if any.
Note: the lastMatchedElement parameter in HtmlElementMatcher.match(HtmlElement, HtmlElement) will always be null if you use this method at the beginning of a matching sequence. Also note that in this case all elements of the HTML tree will be sent to your HtmlElementMatcher so you’d likely want its implementation to execute as fast as possible.
Parameters:
customHtmlElementMatcher - the filter that will be used to determine if a visited HTML element should be matched
Returns:
an ElementFilter so that further filtering rules over HTML elements that were matched by the supplied HtmlElementMatcher can be specified
T matchCurrent()
Matches the current node defined in the path. Allows continuation of rules specified using a PartialPath, for example:
<table>
<tr class="footer_grid">
<td>
<a class="link_paging current_page" href="results/page_1.html">1</a>
<a class="link_paging" href="./results/page_2.html">2</a>
</td>
</tr>
</table>
PartialPath path = entity.newPath() //creates path to selected <a> with current page number
.match("tr").classes("footer_grid")
.match("a").classes("current_page");
// creates field from path. Matches current node to get its text ("1" in the example).
path.addField("pageNumber").matchCurrent().getText();
// expands the path to match the next <a> element
path = path.matchFirst("a").classes("link_paging");
// matches the current <a> element to get its text ("2" in the example)
path.addField("nextPageNumber").matchCurrent().getText();
// matches the current <a> element to get the value of its "href" attribute
path.addField("nextPageUrl").matchCurrent().getAttribute("href");
Returns:
an ElementFilter so that filtering rules over HTML elements specified so far can be defined
Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.