public interface FieldContentTransform
Allows the content captured for a given field by a ContentReader to be transformed by a StringTransformation, either to clean up or convert values or to extract specific text from the original value.
Also provides operations for following links to, or downloading content from, absolute or relative URLs captured in the values of a field.
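As a rough illustration of the kind of clean-up a transformation might perform, the sketch below models a transformation as a plain java.util.function.Function over strings. It does not use the univocity API: FieldCleanupSketch, NORMALIZE_WHITESPACE and textAfter are hypothetical names chosen for this example, and how actual StringTransformation instances are configured is not shown on this page.

```java
import java.util.function.Function;

// Hypothetical stand-in for illustration only; NOT part of the univocity API.
final class FieldCleanupSketch {

    // Trims the value and collapses each run of whitespace into a single space.
    static final Function<String, String> NORMALIZE_WHITESPACE =
            value -> value == null ? null : value.trim().replaceAll("\\s+", " ");

    // Keeps only the text after the first occurrence of the marker, if the marker is present.
    static Function<String, String> textAfter(String marker) {
        return value -> {
            if (value == null) {
                return null;
            }
            int index = value.indexOf(marker);
            return index < 0 ? value : value.substring(index + marker.length());
        };
    }

    public static void main(String[] args) {
        // Chain the two steps, as a parser might do after capturing a raw field value.
        Function<String, String> cleanup = NORMALIZE_WHITESPACE.andThen(textAfter("$ "));
        System.out.println(cleanup.apply("  Price:   $ 19.99 \n")); // prints 19.99
    }
}
```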
| Modifier and Type | Method and Description |
|---|---|
| `void` | `download()` Specifies that the parser will download content from the URL in the HTML element defined by the path. |
| `void` | `download(com.univocity.api.net.HttpResponseReader contentReader)` Specifies that the parser will download content from the URL in the HTML element defined by the path. |
| `void` | `download(com.univocity.api.net.UrlReaderProvider baseUrlProvider)` Specifies that the parser will download content from the URL in the HTML element defined by the path. |
| `void` | `download(com.univocity.api.net.UrlReaderProvider baseUrlProvider, com.univocity.api.net.HttpResponseReader contentReader)` Specifies that the parser will download content from the URL in the HTML element defined by the path. |
| `HtmlLinkFollower` | `followLink()` Creates a HtmlLinkFollower that will parse linked pages, where the URL of each linked page is defined by the values retrieved by this field. |
| `HtmlLinkFollower` | `followLink(com.univocity.api.net.UrlReaderProvider urlReaderProvider)` Creates a HtmlLinkFollower that will parse linked pages, where the URL of each linked page is defined by inserting the value retrieved by this field into the supplied UrlReaderProvider as a parameter. |
| `T` | `transform(com.univocity.api.common.StringTransformation transformation)` Assigns a StringTransformation to the current field. |
void download()
Specifies that the parser will download content from the URL in the HTML element defined by the path. This is useful for downloading binary files such as images and videos linked with ‘src’ or ‘href’ attributes.
Content will be downloaded to the directory specified by RemoteParserSettings.setDownloadContentDirectory(File). If the download directory is not set, the content will be stored in a temporary directory.
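The fallback described above can be pictured with the following minimal sketch. It uses only the JDK and is not the library's implementation: DownloadDirectorySketch and resolveDownloadDirectory are hypothetical names, and the location and naming of the temporary directory actually used by the parser are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical stand-in for illustration only; NOT part of the univocity API.
final class DownloadDirectorySketch {

    // Returns the configured directory when one is set (creating it if needed),
    // otherwise a newly created temporary directory.
    static File resolveDownloadDirectory(File configured) {
        try {
            if (configured != null) {
                Files.createDirectories(configured.toPath());
                return configured;
            }
            // Assumption: the "downloads" prefix is arbitrary and chosen for this example.
            return Files.createTempDirectory("downloads").toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // No directory configured: content would go to a temporary directory.
        System.out.println(resolveDownloadDirectory(null));
    }
}
```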
void download(com.univocity.api.net.UrlReaderProvider baseUrlProvider)
Specifies that the parser will download content from the URL in the HTML element defined by the path. This is useful for downloading binary files such as images and videos linked by ‘src’ or ‘href’ attributes.
Content will be downloaded to the directory specified by RemoteParserSettings.setDownloadContentDirectory(File). If the download directory is not set, the content will be stored in a temporary directory.
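A base URL matters when the 'src' or 'href' value is relative, as is common in HTML saved to a local file. The sketch below shows standard URI resolution with java.net.URI; it is presumably similar in spirit to what the parser does with the base URL, but that is an assumption, and BaseUrlSketch and the example.com addresses are made up for this example.

```java
import java.net.URI;

// Hypothetical stand-in for illustration only; NOT part of the univocity API.
final class BaseUrlSketch {

    // Resolves a link found in an HTML attribute against the base URL of the page.
    static URI resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link);
    }

    public static void main(String[] args) {
        String base = "https://example.com/gallery/index.html";

        // Relative path: resolved against the directory of the base URL.
        System.out.println(resolve(base, "images/cat.png"));
        // prints https://example.com/gallery/images/cat.png

        // Root-relative path: resolved against the host of the base URL.
        System.out.println(resolve(base, "/static/logo.svg"));
        // prints https://example.com/static/logo.svg

        // Absolute URL: returned unchanged.
        System.out.println(resolve(base, "https://cdn.example.org/video.mp4"));
        // prints https://cdn.example.org/video.mp4
    }
}
```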
Parameters:
baseUrlProvider - the base URL and associated configuration to be used for downloading the content. Required for downloading content while parsing data from local files.
HtmlLinkFollower followLink()
Creates a HtmlLinkFollower that will parse linked pages, where the URL of each linked page is defined by the values retrieved by this field. Each URL returned by this field will be parsed by the associated HtmlLinkFollower. A HtmlLinkFollower is essentially a special type of HtmlEntityList that allows fields and entities to be added to it.
For each parsed row, the HtmlParser will combine it with the results from all associated link followers, using the nesting strategy defined by RemoteFollower.getNesting().
If a value parsed in this field is not a valid URL, the parsing process will stop unless RemoteFollower.ignoreFollowingErrors(boolean) has been set to true.
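The stop-or-continue behaviour for invalid URLs can be sketched as follows. This is only a model of the rule stated above, written against the JDK: LinkValidationSketch and collectLinks are hypothetical names, the ignoreErrors flag merely mirrors the role of RemoteFollower.ignoreFollowingErrors(boolean), and treating "valid" as "parses as an absolute URI" is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; NOT part of the univocity API.
final class LinkValidationSketch {

    // Collects the field values that are usable as links.
    // When ignoreErrors is false, the first invalid value stops processing with an exception.
    // When ignoreErrors is true, invalid values are skipped.
    static List<URI> collectLinks(List<String> fieldValues, boolean ignoreErrors) {
        List<URI> links = new ArrayList<>();
        for (String value : fieldValues) {
            try {
                URI uri = new URI(value);
                if (!uri.isAbsolute()) {
                    throw new URISyntaxException(value, "not an absolute URL");
                }
                links.add(uri);
            } catch (URISyntaxException e) {
                if (!ignoreErrors) {
                    throw new IllegalStateException("Cannot follow link: " + value, e);
                }
                // Errors are ignored: skip this value and continue with the next one.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "https://example.com/a", "not a url", "https://example.com/b");
        System.out.println(collectLinks(values, true).size()); // prints 2
    }
}
```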
Returns:
a HtmlLinkFollower which allows the addition of fields and entities to define what data from the linked page should be returned.
HtmlLinkFollower followLink(com.univocity.api.net.UrlReaderProvider urlReaderProvider)
Creates a HtmlLinkFollower that will parse linked pages, where the URL of each linked page is defined by inserting the value retrieved by this field into the supplied UrlReaderProvider as a parameter.
For example, if the current entity has a field named “query”, which retrieves the values ‘cat’ and ‘dog’, and the UrlReaderProvider provided here has the parameterized URL "https://www.google.com/search?q={query}", two pages will be parsed as the values replace the {query} parameter in the URL:
https://www.google.com/search?q=cat and https://www.google.com/search?q=dog
A HtmlLinkFollower is essentially a special type of HtmlEntityList that allows additional fields to be added to the current parent entity. It also allows more entities to be added to it. For each parsed row of the parent page, the HtmlParser will combine it with the rows of all associated link followers, using the nesting strategy defined by RemoteFollower.getNesting().
If a value parsed in this field is not a valid URL, the parsing process will stop unless RemoteFollower.ignoreFollowingErrors(boolean) has been set to true.
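The {query} substitution from the example above can be sketched with plain string replacement. This is not how UrlReaderProvider is implemented: UrlTemplateSketch and expand are hypothetical names, and values are inserted as-is here because this page does not say whether the library URL-encodes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; NOT part of the univocity API.
final class UrlTemplateSketch {

    // Replaces the {parameter} placeholder in the template with the given value.
    // Assumption: no URL encoding is applied to the value.
    static String expand(String template, String parameter, String value) {
        return template.replace("{" + parameter + "}", value);
    }

    // Produces one URL per value captured by the field.
    static List<String> expandAll(String template, String parameter, List<String> values) {
        List<String> urls = new ArrayList<>();
        for (String value : values) {
            urls.add(expand(template, parameter, value));
        }
        return urls;
    }

    public static void main(String[] args) {
        String template = "https://www.google.com/search?q={query}";
        for (String url : expandAll(template, "query", Arrays.asList("cat", "dog"))) {
            System.out.println(url);
        }
        // prints https://www.google.com/search?q=cat
        // prints https://www.google.com/search?q=dog
    }
}
```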
Parameters:
urlReaderProvider - the parameterized URL into which values parsed from this field will be inserted to obtain a linked page
Returns:
a HtmlLinkFollower which allows the addition of fields and entities to define what data from the linked page will be returned.
void download(com.univocity.api.net.HttpResponseReader contentReader)
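Specifies that the parser will download content from the URL in the HTML element defined by the path. The downloaded content will be processed by a HttpResponseReader, provided by the user.
Parameters:
contentReader - a user-provided callback to process the remote content.
void download(com.univocity.api.net.UrlReaderProvider baseUrlProvider, com.univocity.api.net.HttpResponseReader contentReader)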
Specifies that the parser will download content from the URL in the HTML element defined by the path. The downloaded content will be processed by a HttpResponseReader, provided by the user.
Parameters:
baseUrlProvider - the base URL and associated configuration to be used for downloading the content. Required for downloading content while parsing data from local files.
contentReader - a user-provided callback to process the remote content.
T transform(com.univocity.api.common.StringTransformation transformation)
Assigns a StringTransformation to the current field. Once the parser collects a value for the field, it will invoke Transformation.transform(Object) to modify it. The result of the transformation will be assigned to the field.
Parameters:
transformation - the transformation to be applied over the content parsed for a given field.
Copyright © 2018 uniVocity Software Pty Ltd. All rights reserved.