public final class HtmlPaginator extends com.univocity.parsers.remote.Paginator<HtmlEntitySettings,HtmlPaginationContext>
Used by the HtmlParser to collect multiple pages of results in a website and to handle the files that have been downloaded for each page.
See Also: HtmlParser, PaginationContext, NextInputHandler

| Modifier | Constructor and Description |
|---|---|
| protected | HtmlPaginator(HtmlParserSettings parserSettings) Creates a new HtmlPaginator and sets the currentPageNumber to 0. |

| Modifier and Type | Method and Description |
|---|---|
| PathStart | addField(String fieldName) Creates a new field on this HtmlPaginator and returns a PathStart that allows the user to define a path to the field. |
| PathStart | addRequestParameter(String paramName) Creates a new request parameter and returns a PathStart that allows the user to define a path to the parameter. |
| protected HtmlEntitySettings | newEntitySettings(com.univocity.parsers.remote.RemoteParserSettings htmlParserSettings) Creates a new HtmlEntitySettings which will be used to create fields specifically for this HtmlPaginator. |
| PaginationGroupStart | newGroup() Creates a new PaginationGroup for this paginator. |
| PaginationPathStart | newPath() Returns a PartialPathStart that is used to define a reusable path of HTML elements. |
| PathStart | setCurrentPage() Creates a new field for the current page and returns a PathStart which can be used to define the path to the ‘current page’ element. |
| PathStart | setCurrentPageNumber() Creates a new field for the current page and returns a PathStart which can be used to define the path to the ‘current page’ element as a number. |
| PathStart | setNextPage() Creates a new field for the next page and returns a PathStart which can be used to define the path to the next page element. |
| PathStart | setNextPageNumber() Creates a new field for the next page number and returns a PathStart which can be used to define the path to the next page number element. |
| void | setRequestParameter(String paramName, String value) Associates a constant value with a request parameter. |
| void | setRequestParameterData(String fieldForParamName, Object value) Defines a request parameter name and data value to be used when requesting the next page. |
| void | toRequestParameter(String fieldForParamName, String fieldForParamValue) Assigns values captured for two fields declared in this HtmlPaginator to a request parameter. |

protected HtmlPaginator(HtmlParserSettings parserSettings)
Creates a new HtmlPaginator and sets the currentPageNumber to 0.
Parameters: parserSettings - the parser settings to use
protected final HtmlEntitySettings newEntitySettings(com.univocity.parsers.remote.RemoteParserSettings htmlParserSettings)
Creates a new HtmlEntitySettings which will be used to create fields specifically for this HtmlPaginator.
newEntitySettings in class com.univocity.parsers.remote.Paginator<HtmlEntitySettings,HtmlPaginationContext>
Returns: the HtmlEntitySettings to be used by this HtmlPaginator.
public final PathStart setCurrentPage()
Creates a new field for the current page and returns a PathStart which can be used to define the path to the ‘current page’ element. The current page is an HTML element that indicates which page among a series of pages is currently being parsed.
Returns: a PathStart used to define the path to the current page element
public final PathStart setCurrentPageNumber()
Creates a new field for the current page and returns a PathStart which can be used to define the path to the ‘current page’ element as a number. The current page is an HTML element that indicates which page among a series of pages is currently being parsed.
Returns: a PathStart used to define the path to the current page number
public final PathStart setNextPageNumber()
Creates a new field for the next page number and returns a PathStart which can be used to define the path to the next page number element. The next page number indicates that there are more pages after the current page. When the parser runs and completes the parsing of the page, it will read the next page number to decide whether to proceed and try to obtain the next page to parse. The parser will continue to access the next page until the next page number does not exist or the follow count set by Paginator.setFollowCount(int) is reached.
Returns: a PathStart used to define the path to the next page number
public final PathStart setNextPage()
Creates a new field for the next page and returns a PathStart which can be used to define the path to the next page element. The next page is an HTML element that changes the current page to the next page in the series. When the parser runs and completes the parsing of the page, it will ‘click’ on the next page element and process the result. The parser will continue to access the next page until a next page element does not exist or the follow count set by Paginator.setFollowCount(int) is reached.
An example of setting the next page can be demonstrated using this HTML:
<html>
<body>
<article>
<h1>Water: The Truth</h1>
<p>It's good for you!</p>
<a href="paginationTarget.html">Next Page</a>
</article>
</body>
</html>
Assume that the paginationTarget.html linked above contains the following HTML:
<html>
<body>
<article>
<h1>Bananas</h1>
<p>An excellent source of potassium</p>
</article>
</body>
</html>
You can get the text in the h1 and p elements from both pages with:
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings entity = entities.configureEntity("entity");
entity.addField("header").match("h1").containedBy("article").getText();
entity.addField("text").match("p").containedBy("article").getText();
entities.configurePaginator()
.setNextPage()
.match("a")
.containedBy("article")
.getAttribute("href");
When the parser runs, it will parse the first page, collecting the first row of data:
[Water: The Truth, It's good for you!]
The paginator will then run, accessing the link’s URL provided by the href attribute and opening the next page. This next page will then be parsed to collect the next row:
[Bananas, An excellent source of potassium]
As there is no a element on the second page with a link to the next, the paginator will be unable to run and the parsing will finish, returning the two records parsed from each page.
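The stopping rule described above (keep following the next page link until none is found or the follow count is reached) can be modelled in plain Java. This is only a sketch of the control flow, not the library's implementation: the Page class, the crawl method and the in-memory site map are hypothetical stand-ins for downloaded pages, and it assumes the follow count limits how many next page links are followed after the first page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FollowLoopSketch {

    /** Hypothetical stand-in for a downloaded page: one row of data plus an optional next page link. */
    static final class Page {
        final String header;
        final String text;
        final String nextHref; // null when the page has no next page element

        Page(String header, String text, String nextHref) {
            this.header = header;
            this.text = text;
            this.nextHref = nextHref;
        }
    }

    /**
     * Collects one row per page, following nextHref until it is missing
     * or until followCount links have been followed (assumed semantics).
     */
    static List<String[]> crawl(Map<String, Page> site, String startUrl, int followCount) {
        List<String[]> rows = new ArrayList<>();
        String url = startUrl;
        int followed = 0;
        while (url != null && site.containsKey(url)) {
            Page page = site.get(url);
            rows.add(new String[]{page.header, page.text});
            if (page.nextHref == null || followed >= followCount) {
                break; // no next page element, or the follow limit was reached
            }
            url = page.nextHref;
            followed++;
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Page> site = new HashMap<>();
        site.put("index.html", new Page("Water: The Truth", "It's good for you!", "paginationTarget.html"));
        site.put("paginationTarget.html", new Page("Bananas", "An excellent source of potassium", null));

        for (String[] row : crawl(site, "index.html", 10)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [Water: The Truth, It's good for you!]
        // [Bananas, An excellent source of potassium]
    }
}
```

With a follow count of 0 in this model, only the first page would be read, even though it links to a second one.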
Returns: a PathStart that is used to define the path to the next page element
public final PathStart addField(String fieldName)
Creates a new field on this HtmlPaginator and returns a PathStart that allows the user to define a path to the field. Use the toRequestParameter(java.lang.String, java.lang.String) method of this class to send the field name and its value in the body of the POST request to the next page.
Any fields added to the paginator and their values can be read from a NextInputHandler through its PaginationContext.
Parameters: fieldName - name of the new field
Returns: a PathStart that is used to define the path of the value for the given field
public final PathStart addRequestParameter(String paramName)
Creates a new request parameter and returns a PathStart that allows the user to define a path to the parameter. Request parameters are values set on a page, usually in hidden fields of a form element, that contain details about the page state. They are usually sent back to the server in a POST HttpRequest to validate the current session and return the next page of results.
If you assign the path to the request parameters you are interested in, the paginator will automatically set them for you when requesting the next page of results. Otherwise, you’d have to manually set the parameters using a NextInputHandler.
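To illustrate what collecting such parameters involves, here is a minimal plain-Java sketch that pulls hidden form fields out of a page and encodes them as a POST body. It does not use the library: the class and method names, the sample field names (sessionToken, page) and the simplified regular expression (which assumes the attribute order type, name, value) are all hypothetical and chosen only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiddenFieldParamsSketch {

    // Simplified pattern: assumes attributes appear as type, name, value, all double-quoted.
    private static final Pattern HIDDEN_INPUT = Pattern.compile(
            "<input\\s+type=\"hidden\"\\s+name=\"([^\"]*)\"\\s+value=\"([^\"]*)\"");

    /** Returns the name/value pairs of all hidden inputs, in document order. */
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = HIDDEN_INPUT.matcher(html);
        while (m.find()) {
            params.put(m.group(1), m.group(2));
        }
        return params;
    }

    /** Encodes the parameters as an application/x-www-form-urlencoded body. */
    static String toFormBody(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String html = "<form>"
                + "<input type=\"hidden\" name=\"sessionToken\" value=\"abc 123\"/>"
                + "<input type=\"hidden\" name=\"page\" value=\"2\"/>"
                + "</form>";
        System.out.println(toFormBody(hiddenFields(html)));
        // prints: sessionToken=abc+123&page=2
    }
}
```

Declaring the path with addRequestParameter is meant to spare you this kind of manual extraction.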
Parameters: paramName - name of the request parameter
Returns: a PathStart that is used to define the path of the value for the given parameter
public final void setRequestParameter(String paramName, String value)
Associates a constant value with a request parameter. Parameter values are submitted in POST requests to load the next page.
Parameters: paramName - the name that will be associated with the parameter, which will be sent in the request for the next page.
value - the value of the corresponding parameter
public final void toRequestParameter(String fieldForParamName, String fieldForParamValue)
Assigns values captured for two fields declared in this HtmlPaginator to a request parameter. For example, if fields “scriptName” and “scriptValue” have been defined in this paginator, and their values are collected as “thescript” and “myscript.js” respectively, the HttpRequest used to invoke the next page will have its HttpRequest.addDataParameter(String, Object) invoked with “thescript” as the parameter name and “myscript.js” as the parameter value.
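The mapping in this example can be sketched in plain Java as follows. The code only models the lookup (one captured field supplies the parameter name, the other supplies the parameter value); the static toRequestParameter helper and the capturedFields map are hypothetical and are not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToRequestParameterSketch {

    /**
     * Looks up the values captured for two fields and uses the first as the
     * request parameter name and the second as the request parameter value.
     */
    static Map<String, Object> toRequestParameter(Map<String, String> capturedFields,
                                                  String fieldForParamName,
                                                  String fieldForParamValue) {
        Map<String, Object> requestParams = new LinkedHashMap<>();
        String name = capturedFields.get(fieldForParamName);
        String value = capturedFields.get(fieldForParamValue);
        if (name != null) {
            // Skip the parameter when no name was captured on the page.
            requestParams.put(name, value);
        }
        return requestParams;
    }

    public static void main(String[] args) {
        // Values as they might be captured from the current page.
        Map<String, String> captured = new LinkedHashMap<>();
        captured.put("scriptName", "thescript");
        captured.put("scriptValue", "myscript.js");

        System.out.println(toRequestParameter(captured, "scriptName", "scriptValue"));
        // prints: {thescript=myscript.js}
    }
}
```

Note that the field names themselves ("scriptName", "scriptValue") never appear in the request; only their captured values do.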
Parameters: fieldForParamName - name of the field available in this paginator whose value will be used as the request parameter name.
fieldForParamValue - name of the field available in this paginator whose value will be used as the request parameter value.
public final void setRequestParameterData(String fieldForParamName, Object value)
Defines a request parameter name and data value to be used when requesting the next page.
Parameters: fieldForParamName - name of the request parameter to add to this paginator
value - the value associated with the given request parameter
public final PaginationPathStart newPath()
Returns a PartialPathStart that is used to define a reusable path of HTML elements. Fields can then be added to this path using FieldDefinition.addField(String) and similar methods, which associate the field with this entity.
Example:
HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings items = entityList.configureEntity("items");
PartialPath path = items.newPath()
.match("table").id("productsTable")
.match("td").match("div").classes("productContainer");
// Use the path to add new fields; each field extends the initial, common path with further element matching rules.
path.addField("name").match("span").classes("prodName", "prodNameTro").getText();
path.addField("URL").match("a").childOf("div").classes("productPadding").getAttribute("href");
Returns: a PartialPathStart to specify the path of HTML elements
public final PaginationGroupStart newGroup()
Creates a new PaginationGroup for this paginator. Refer to the Group documentation to learn more about how element groups are used.
Returns: a PaginationGroupStart, which is the first step in determining which element demarcates the start of a PaginationGroup.
Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.