public interface HtmlElement
An HtmlElement contains information about an HTML element collected by the parser.
| Modifier and Type | Method | Description |
|---|---|---|
| String | attribute(String attributeName) | Returns an attribute value by its name as a String. |
| Set<String> | attributeNames() | Returns all the attribute names contained within this element as a set of String. |
| List<HtmlElement> | children() | Returns a copy of all children of this element in a list. |
| Set<String> | classes() | Returns the set of CSS classes of this element, or an empty set if the element has no class attribute defined. |
| boolean | containsElementInHierarchy(HtmlElement element) | Returns true if the specified element is a descendant of the current element. |
| String | data() | Gets the data content of this element and all its children. |
| FetchOutput | fetchResources(File saveFile, Charset encoding, FetchOptions fetchOptions) | Saves the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(File saveFile, FetchOptions fetchOptions) | Saves the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(com.univocity.api.io.FileProvider saveFile, FetchOptions fetchOptions) | Saves the element to a local file using the FileProvider, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(File saveFile, String encoding, FetchOptions fetchOptions) | Saves the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(String pathToFile, Charset encoding, FetchOptions fetchOptions) | Saves the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(String pathToFile, FetchOptions fetchOptions) | Saves the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| FetchOutput | fetchResources(String pathToFile, String encoding, FetchOptions fetchOptions) | Saves the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. |
| String | id() | Returns the id of this element or an empty String if the element does not have an id attribute. |
| Map<String,String[]> | inputValues() | Runs through the hierarchy of this element and collects the values of any input elements, including select lists, radio buttons and checkboxes. |
| Map<String,String> | inputValuesById() | Runs through the hierarchy of this element and collects the values of any input elements, including select lists, radio buttons and checkboxes. |
| boolean | isComment() | Returns true if this HtmlElement consists of comments, which should not render as text. |
| boolean | isData() | Returns true if this HtmlElement consists of data, i.e. content in comments or tags such as <style> or <script>, which should not render as text. |
| boolean | isText() | Returns true if this HtmlElement consists solely of text and false otherwise. |
| HtmlElement | nextSibling() | Returns the HtmlElement that is located just after this element. |
| HtmlElement | parent() | Returns the parent of this element. |
| HtmlElement | previousSibling() | Returns the HtmlElement that is located just before this element. |
| ElementPathStart | query() | Starts a matching sequence so chaining selector methods can be used to traverse the HtmlElement. |
| List<HtmlElement> | query(String cssQuery) | Searches for elements that match a CSS query, with the current HtmlElement as the starting context. |
| String | tagName() | Returns the HTML tag name associated with the element. |
| String | text() | Gets the combined text of this element and all its children. |
| Document | toW3CDocument() | Generates a W3C DOM document from the current HTML element. |
boolean isText()
Returns true if this HtmlElement consists solely of text and false otherwise. For example, consider <p class="highlight">cool text</p>. If this HtmlElement is the <p> element, isText() will return false. If this HtmlElement is just the text node containing “cool text”, isText() will return true.
Note: this method will return false if the HtmlElement contains text that is not meant to be rendered, such as comments and <script> tags. Use isData() to identify such elements.
Returns:
true if this element is just text, otherwise false
boolean isData()
Returns true if this HtmlElement consists of data, i.e. content in comments or tags such as <style> or <script>, which should not render as text.
Returns:
true if this element is just invisible text, otherwise false
boolean isComment()
Returns true if this HtmlElement consists of comments, i.e. free text between <!-- and -->, which should not render as text.
Returns:
true if this element is just comments, otherwise false
String tagName()
Returns the HTML tag name associated with the element. For instance the tag name of the element <span title="fan" id="electric"></span> would be “span”.
If the current HTML element is not a tag, i.e. it is a text, comment or data node, then "#text", "#comment" or "#data" will be returned, respectively.
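The special names described above lend themselves to a simple dispatch on the returned String. The following is a minimal sketch that works on plain strings only; the class NodeKinds and the method kindOf are hypothetical helpers for illustration and are not part of the library.

```java
final class NodeKinds {
    /** Maps a value such as the one returned by tagName() to a coarse node kind. */
    static String kindOf(String tagName) {
        switch (tagName) {
            case "#text":    return "text";
            case "#comment": return "comment";
            case "#data":    return "data";
            default:         return "tag"; // any other name is treated as a regular tag
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("span"));     // prints "tag"
        System.out.println(kindOf("#comment")); // prints "comment"
    }
}
```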
HtmlElement parent()
Returns the parent of this Element. A parent is defined as the element which directly contains the current element. For instance, given <div> <h1>header</h1> <p>text</p> </div>, the parent of <p> would be <div>. If there is no parent, null is returned.
Returns:
this HtmlElement’s parent, or null if no parent is available
List<HtmlElement> children()
Returns a copy of all children of this element in a list. If there are no children, it returns an empty list. For example, given <article> <h1>header</h1> <p>text</p> </article>, calling children() on the <article> returns a list of size 2 containing the <h1> element and the <p> element.
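To illustrate what "a copy" implies for callers, here is a small sketch using a hypothetical stand-in class named Node. It is not the library's implementation and only mirrors the behaviour described above, under the assumption that changes to the returned collection never reach the element itself.

```java
import java.util.ArrayList;
import java.util.List;

final class ChildrenCopyDemo {
    /** Hypothetical stand-in for an element; not the library's class. */
    static final class Node {
        private final String tagName;
        private final List<Node> children = new ArrayList<>();

        Node(String tagName) { this.tagName = tagName; }

        Node add(Node child) { children.add(child); return this; }

        String tagName() { return tagName; }

        /** Returns a fresh list, so callers cannot alter this node's own child list. */
        List<Node> children() { return new ArrayList<>(children); }
    }

    /** Builds the <article> example: one <h1> child followed by one <p> child. */
    static Node sampleArticle() {
        return new Node("article").add(new Node("h1")).add(new Node("p"));
    }

    public static void main(String[] args) {
        Node article = sampleArticle();
        List<Node> copy = article.children();
        copy.clear(); // empties only the copy
        System.out.println(article.children().size()); // prints 2
    }
}
```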
String attribute(String attributeName)
Returns an attribute value by its name as a String. If the element has no attributes or if the supplied attribute name does not exist within the element, then an empty String will be returned. For example, <div title="hello"></div><footer>feet</footer>, running attribute("title") on the <div> will return “hello”. Running the same method on <footer> will return "".
To get an absolute URL from a link that could be relative, prefix the attribute name with “abs:”. For example, running attribute("abs:href") on <a href="contact.html"> will return the absolute URL of contact.html, rather than simply returning the String “contact.html”.
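The effect of resolving a relative link can be reproduced with the JDK alone. The sketch below uses java.net.URI to resolve a link against a base URL; the base URL and the class AbsoluteUrlSketch are assumptions made for illustration, and the library may resolve links differently internally.

```java
import java.net.URI;

final class AbsoluteUrlSketch {
    /** Resolves a possibly relative link against the URL of the page it came from. */
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page location, used only for this example.
        String base = "https://www.example.com/pages/index.html";
        System.out.println(resolve(base, "contact.html"));
        // prints https://www.example.com/pages/contact.html
    }
}
```

A link that is already absolute is returned unchanged by this resolution step.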
Parameters:
attributeName - the name of the attribute
Returns:
the attribute value, or an empty String if the supplied attribute doesn’t exist.
Set<String> attributeNames()
Returns all the attribute names contained within this element as a set of String. Returns an empty set if there are no attributes.
String text()
Gets the combined text of this element and all its children. Whitespace is normalized so the only separation between words is a single ' ' character.
For example, given HTML <p>Hello <b>there</b> now!</p>, the call to p.text() returns "Hello there now!"
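The whitespace normalization described above can be approximated with a single regular expression. This is a sketch that assumes runs of whitespace collapse to one space and that leading and trailing whitespace is trimmed; WhitespaceSketch is a hypothetical helper, not the library's code.

```java
final class WhitespaceSketch {
    /** Collapses every run of whitespace into a single space and trims both ends. */
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello   \n there\tnow!")); // prints "Hello there now!"
    }
}
```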
String data()
Get the data content of this element and all its children. Data consists of textual content inside comments, or tags such as style or script, for example, where the contents should not render as text.
HtmlElement nextSibling()
Returns the HtmlElement that is located just after this element. Returns null if there is no next sibling. For instance, given <div> <h1>hello</h1> <p>text <span>saucepan</span> </p> </div>, the next sibling of <h1> is <p>. The next sibling of <p> is null.
Returns:
the next sibling HtmlElement, or null if no such element exists.
HtmlElement previousSibling()
Returns the HtmlElement that is located just before this element. Returns null if there is no previous sibling. For instance, given <div> <h1>hello</h1> <p>text</p> </div>, the previous sibling of <p> is <h1> and previous sibling of <h1> is null.
Returns:
the previous sibling HtmlElement, or null if no such element exists.
String id()
Returns the id of this element or an empty String if the element does not have an id attribute. For example, in <span id="test"></span> calling id() from the <span> element will return "test".
Returns:
the id as a String, if available
Set<String> classes()
Returns the set of CSS classes of this element, or an empty set if the element has no class attribute defined.
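As an illustration of how a class attribute maps to a set, the sketch below splits an attribute value on whitespace. It is an assumption about the general technique, not the library's code, and ClassNamesSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

final class ClassNamesSketch {
    /** Splits a class attribute value into distinct class names, keeping first-seen order. */
    static Set<String> parse(String classAttribute) {
        if (classAttribute == null || classAttribute.trim().isEmpty()) {
            return Collections.emptySet(); // no class attribute: empty set
        }
        return new LinkedHashSet<>(Arrays.asList(classAttribute.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(parse("highlight  big highlight")); // prints [highlight, big]
        System.out.println(parse(""));                         // prints []
    }
}
```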
boolean containsElementInHierarchy(HtmlElement element)
Returns true if the specified element is a descendant of the current element, and false otherwise. For example, in this HTML document:
<table>
<tr>
<td>
<span>First Row</span>
</td>
</tr>
<tr>
<td>Second Row</td>
</tr>
</table>
Writing table.containsElementInHierarchy(span) would return true as the <span> is a descendant of the <table>. Inverting the code to span.containsElementInHierarchy(table) would return false.
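The descendant check can be pictured as a walk up the parent chain of the candidate element. The sketch below uses a hypothetical stand-in class Node with a parent pointer; it illustrates the semantics described above and assumes an element is not considered its own descendant.

```java
final class HierarchySketch {
    /** Hypothetical stand-in for an element with a link to its parent. */
    static final class Node {
        final String tagName;
        final Node parent;

        Node(String tagName, Node parent) {
            this.tagName = tagName;
            this.parent = parent;
        }

        /** True if 'other' sits somewhere below this node in the tree. */
        boolean containsInHierarchy(Node other) {
            for (Node p = other.parent; p != null; p = p.parent) {
                if (p == this) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Node table = new Node("table", null);
        Node tr = new Node("tr", table);
        Node td = new Node("td", tr);
        Node span = new Node("span", td);
        System.out.println(table.containsInHierarchy(span)); // prints true
        System.out.println(span.containsInHierarchy(table)); // prints false
    }
}
```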
Parameters:
element - the element to find in the hierarchy of the current element.
Returns:
true if the specified element is a descendant of the current element.
List<HtmlElement> query(String cssQuery)
Searches for elements that match a CSS query, with the current HtmlElement as the starting context. Matched elements may include this HtmlElement, or any of its children.
el.query("a[href]") - finds links (a tags with href attributes)
el.query("a[href*=example.com]") - finds links pointing to example.com (loosely)
A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).
The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).
| Pattern | Matches | Example |
|---|---|---|
| * | any element | * |
| tag | elements with the given tag name | div |
| *\|E | elements of type E in any namespace | *\|name finds <fb:name> elements |
| ns\|E | elements of type E in the namespace ns | fb\|name finds <fb:name> elements |
| #id | elements with attribute ID of "id" | div#wrap, #logo |
| .class | elements with a class name of "class" | div.left, .result |
| [attr] | elements with an attribute named "attr" (with any value) | a[href], [title] |
| [^attrPrefix] | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | [^data-], div[^data-] |
| [attr=val] | elements with an attribute named "attr", and value equal to "val" | img[width=500], a[rel=nofollow] |
| [attr="val"] | elements with an attribute named "attr", and value equal to "val" | span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"] |
| [attr^=valPrefix] | elements with an attribute named "attr", and value starting with "valPrefix" | a[href^=http:] |
| [attr$=valSuffix] | elements with an attribute named "attr", and value ending with "valSuffix" | img[src$=.png] |
| [attr*=valContaining] | elements with an attribute named "attr", and value containing "valContaining" | a[href*=/search/] |
| [attr~=regex] | elements with an attribute named "attr", and value matching the regular expression | img[src~=(?i)\\.(png\|jpe?g)] |
| | The above may be combined in any order | div.header[title] |
| Combinators | | |
| E F | an F element descended from an E element | div a, .logo h1 |
| E > F | an F direct child of E | ol > li |
| E + F | an F element immediately preceded by sibling E | li + li, div.head + div |
| E ~ F | an F element preceded by sibling E | h1 ~ p |
| E, F, G | all matching elements E, F, or G | a[href], div, h3 |
| Pseudo selectors | | |
| :lt(n) | elements whose sibling index is less than n | td:lt(3) finds the first 3 cells of each row |
| :gt(n) | elements whose sibling index is greater than n | td:gt(1) finds cells after skipping the first two |
| :eq(n) | elements whose sibling index is equal to n | td:eq(0) finds the first cell of each row |
| :has(selector) | elements that contain at least one element matching the selector | div:has(p) finds divs that contain p elements |
| :not(selector) | elements that do not match the selector. | div:not(.logo) finds all divs that do not have the "logo" class. div:not(:has(div)) finds divs that do not contain divs. |
| :contains(text) | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. | p:contains(univocity) finds p elements containing the text "univocity". |
| :matches(regex) | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. | td:matches(\\d+) finds table cells containing digits. div:matches((?i)login) finds divs containing the text, case insensitively. |
| :containsOwn(text) | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. | p:containsOwn(univocity) finds p elements with own text "univocity". |
| :matchesOwn(regex) | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | td:matchesOwn(\\d+) finds table cells directly containing digits. div:matchesOwn((?i)login) finds divs containing the text, case insensitively. |
| :containsData(data) | elements that contain the specified data. The contents of script and style elements, and comment nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants. | script:containsData(univocity) finds script elements containing the data "univocity". |
| | The above may be combined in any order and with other selectors | .light:contains(name):eq(0) |
| :matchText | treats text nodes as elements, and so allows you to match against and select text nodes. | p:matchText:first-child with input <p>One<br />Two</p> will return one element with text "One". |
| Structural pseudo selectors | | |
| :root | The element that is the root of the document. In HTML, this is the html element | :root |
| :nth-child(an+b) | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and that have a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selects the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition to this, :nth-child() can take odd and even as arguments instead. odd has the same meaning as 2n+1, and even has the same meaning as 2n. | tr:nth-child(2n+1) finds every odd row of a table. :nth-child(10n-1) the 9th, 19th, 29th, etc, element. li:nth-child(5) the 5th li |
| :nth-last-child(an+b) | elements that have an+b-1 siblings after them in the document tree. Otherwise like :nth-child() | tr:nth-last-child(-n+2) the last two rows of a table |
| :nth-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-of-type(2n+1) |
| :nth-last-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-last-of-type(2n+1) |
| :first-child | elements that are the first child of some other element. | div > p:first-child |
| :last-child | elements that are the last child of some other element. | ol > li:last-child |
| :first-of-type | elements that are the first sibling of their type in the list of children of their parent element | dl dt:first-of-type |
| :last-of-type | elements that are the last sibling of their type in the list of children of their parent element | tr > td:last-of-type |
| :only-child | elements that have a parent element and whose parent element has no other element children | |
| :only-of-type | elements that have a parent element and whose parent element has no other element children with the same expanded element name | |
| :empty | elements that have no children at all | |
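
The an+b notation used by the :nth-child family can be checked with a little integer arithmetic. The following sketch tests whether a 1-based sibling index is selected by a given a and b, following the definition in the table (n is any integer greater than or equal to zero). NthChildSketch is a hypothetical helper and not how the library evaluates selectors.

```java
final class NthChildSketch {
    /** True if the 1-based index equals a*n + b for some integer n >= 0. */
    static boolean matches(int a, int b, int index) {
        if (a == 0) {
            return index == b; // plain :nth-child(b)
        }
        int diff = index - b;
        // diff must be a multiple of a, and the resulting n must not be negative
        return diff % a == 0 && diff / a >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matches(2, 1, 3));   // 2n+1 selects the 3rd child: true
        System.out.println(matches(2, 1, 4));   // 2n+1 skips the 4th child: false
        System.out.println(matches(10, -1, 19)); // 10n-1 selects the 19th child: true
        System.out.println(matches(-1, 2, 3));  // -n+2 stops after the 2nd child: false
    }
}
```

For :nth-last-child and :nth-last-of-type the same arithmetic applies, with the index counted from the end of the sibling list.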
Parameters:
cssQuery - a CSS-like query
Throws:
IllegalArgumentException - if the CSS query is invalid.
Document toW3CDocument()
Generates a W3C DOM document from the current HTML element.
The resulting document is guaranteed to have a <head> and a <body> element. If the current HTML element is the root of the HTML tree, it will be directly mapped to the output Document object. Otherwise, it will be added into automatically generated <head> or <body> elements, if it is not a <head> or <body>.
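Once a W3C Document is available, the standard org.w3c.dom API applies to it. The sketch below builds a stand-in Document by hand with <head> and <body> elements, as described above, and reads text from it; the helper names are hypothetical, and the Document actually produced by the library may differ in details such as namespaces.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class W3cDocumentSketch {
    /** Builds a small stand-in document: html > (head, body > p). */
    static Document sampleDocument() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("cool text");
        body.appendChild(p);
        html.appendChild(head);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    /** Returns the text content of the first element with the given tag name. */
    static String firstText(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(firstText(sampleDocument(), "p")); // prints "cool text"
    }
}
```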
FetchOutput fetchResources(com.univocity.api.io.FileProvider saveFile, FetchOptions fetchOptions)
Saves the element to a local file using the FileProvider, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
saveFile - provides the path of where to save the files
fetchOptions - options used to control the fetching of resources
Returns:
a FetchOutput instance with different options to access the downloaded contents.
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(File saveFile, FetchOptions fetchOptions)
Save the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
saveFile - provides the path of where to save the files
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(File saveFile, String encoding, FetchOptions fetchOptions)
Save the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
saveFile - provides the path of where to save the files
encoding - the desired character encoding for the destination file
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(File saveFile, Charset encoding, FetchOptions fetchOptions)
Save the element to a local file using the File, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
saveFile - provides the path of where to save the files
encoding - the desired character encoding for the destination file
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(String pathToFile, FetchOptions fetchOptions)
Save the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
pathToFile - the string path to the output file
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(String pathToFile, String encoding, FetchOptions fetchOptions)
Save the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
pathToFile - the string path to the output file
encoding - the desired character encoding for the destination file
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
FetchOutput fetchResources(String pathToFile, Charset encoding, FetchOptions fetchOptions)
Save the element to a local file at the path pathToFile, searching all child nodes for external resources (e.g. href, src) and saving them to local files. All references to those resources in the resulting HTML will point to the locally saved references.
Parameters:
pathToFile - the string path to the output file
encoding - the desired character encoding for the destination file
fetchOptions - options used to control the fetching of resources
See Also:
FetchOptions, FetchOutput
ElementPathStart query()
Starts a matching sequence so chaining selector methods can be used to traverse the HtmlElement.
Example:
HtmlElement root = HtmlParser.parse(new UrlReaderProvider("https://www.univocity.com/pages/about-parsers"));
String pageTitle = root.query().match("title").getText();
The String variable pageTitle would have the text inside the <title> element of the HTML page available at “https://www.univocity.com/pages/about-parsers”.
Returns:
an ElementPathStart
Map<String,String[]> inputValues()
Runs through the hierarchy of this element and collects the values of any input elements, including select lists, radio buttons and checkboxes. Inputs without a name attribute are ignored.
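Because each name maps to an array of values, callers usually need to handle several values per key. The sketch below iterates over a hand-built map of the same shape; the map contents and the class InputValuesSketch are assumptions for illustration and were not produced by the parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InputValuesSketch {
    /** Renders each input name with its values joined by commas. */
    static String describe(Map<String, String[]> values) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String[]> entry : values.entrySet()) {
            if (out.length() > 0) {
                out.append("; ");
            }
            out.append(entry.getKey()).append('=').append(String.join(",", entry.getValue()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-built sample with the same shape as the method's return type.
        Map<String, String[]> values = new LinkedHashMap<>();
        values.put("user", new String[] {"alice"});
        values.put("colors", new String[] {"red", "blue"}); // e.g. two checked checkboxes
        System.out.println(describe(values)); // prints user=alice; colors=red,blue
    }
}
```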
Map<String,String> inputValuesById()
Runs through the hierarchy of this element and collects the values of any input elements, including select lists, radio buttons and checkboxes. Inputs without an id attribute are ignored. If more than one element exists with the same ID, the value of the first element found will be considered.
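The "first element found wins" rule for duplicate IDs corresponds to inserting into a map only when the key is absent. The following sketch shows that rule on plain id/value pairs; FirstIdWinsSketch is a hypothetical helper, and the pairs stand in for inputs visited in document order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FirstIdWinsSketch {
    /** Collects {id, value} pairs in order, keeping the first value seen for each id. */
    static Map<String, String> collect(String[][] idValuePairs) {
        Map<String, String> byId = new LinkedHashMap<>();
        for (String[] pair : idValuePairs) {
            byId.putIfAbsent(pair[0], pair[1]); // later duplicates are ignored
        }
        return byId;
    }

    public static void main(String[] args) {
        String[][] inputs = {
            {"email", "a@example.com"},
            {"email", "b@example.com"}, // duplicate id, ignored
            {"name", "Ann"}
        };
        System.out.println(collect(inputs)); // prints {email=a@example.com, name=Ann}
    }
}
```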
Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.