java.lang.Object
  com.univocity.parsers.common.EntityParserSettings<S,L,C>
    com.univocity.parsers.remote.RemoteParserSettings<S,L,C>

Type Parameters:
S - an internal configuration object that extends CommonParserSettings and is used to manage the configuration of elements shared with univocity-parsers.
L - the RemoteEntityList implementation supported by an EntityParserInterface.
C - the Context implementation that provides specific details about the parsing process performed by this parser.

public abstract class RemoteParserSettings<S extends com.univocity.parsers.common.CommonParserSettings,L extends RemoteEntityList,C extends com.univocity.parsers.common.Context>
extends com.univocity.parsers.common.EntityParserSettings<S,L,C>

Base configuration class of a parser that can connect to a remote location, obtain data to parse and produce records for one or more entities. The settings available in a RemoteParserSettings configure the remote content access and the parsing process, and provide default configuration options that individual implementations of RemoteEntitySettings can override.

Entities are managed from a RemoteEntityList implementation.

See Also:
DataTransfer, RemoteEntitySettings, RemoteEntityList, Paginator

| Field Summary | |
|---|---|
| protected Boolean | downloadBeforeParsingEnabled |
| protected Paginator | paginator |
| Fields inherited from class com.univocity.parsers.common.EntityParserSettings |
|---|
| entitiesToRead, entitiesToSkip, globalSettings |
| Constructor Summary | |
|---|---|
| RemoteParserSettings() | Creates a new configuration object for an implementation of EntityParserInterface, which will process an input to produce records for entities defined by a RemoteEntityList. |
| Method Summary | |
|---|---|
| void | clearFileNameParameters() - Clears all values from the filename pattern defined in setFileNamePattern(String). |
| protected RemoteParserSettings<S,L,C> | clone() |
| String | getBatchId() - Returns the custom batch ID to be used in the file name pattern specified by getFileNamePattern(). |
| abstract String | getDefaultFileExtension() - Returns the default file extension to use when saving files to the directory specified by getDownloadContentDirectory(), in case the file name pattern taken from getFileNamePattern() doesn't include a file extension. |
| com.univocity.api.io.FileProvider | getDownloadContentDirectory() - Returns the directory where downloaded content should be stored. |
| com.univocity.api.statistics.DownloadListener | getDownloadListener() - Returns the DownloadListener associated with the parser, which will receive updates on the progress of downloads made by the parser. |
| int | getDownloadThreads() - Returns the number of threads that will be used to download remote content. |
| String | getEmptyValue() - Returns the value to be used when the content parsed for a field of some record evaluates to an empty String. Defaults to null. |
| ExecutorService | getExecutorService() - Returns the ExecutorService to be used by the parser for managing the multiple threads that can be started. |
| Object | getFileNameParameter(String parameterName) - Gets the value of a parameter in the filename pattern defined in setFileNamePattern(String). |
| Set<String> | getFileNameParameters() |
| String | getFileNamePattern() - Gets the pattern that names of downloaded files should follow. |
| Nesting | getNesting() - Returns the nesting strategy to apply to rows associated with a "parent" row, such as results parsed from a link accessed by a RemoteFollower. |
| Paginator | getPaginator() - Returns the Paginator that handles multiple pages of remote content that need to be parsed. |
| String | getParseDate() - Returns the formatted parse date to associate with any downloaded files for future re-parsing. |
| long | getRemoteInterval() - Returns the minimum interval of time to wait between remote requests. |
| Charset | getTextEncoding() - Returns the character set to use when writing text/html files downloaded by the parser. Defaults to the system encoding if not provided. |
| void | ignoreFollowingErrors(boolean ignoreLinkFollowingErrors) - Configures the parser to ignore (or not) invalid, malformed or unavailable links when following URLs to collect additional data associated with a current result. |
| boolean | isColumnReorderingEnabled() - Identifies whether fields should be reordered when field selection methods of an entity's EntitySettings (such as EntitySettings.selectFields(String...)) are used. |
| boolean | isDownloadBeforeParsingEnabled() - Verifies whether the parser will download the remote content before parsing it. |
| boolean | isDownloadEnabled() - Flags whether remote downloads are enabled. |
| boolean | isDownloadOverwritingEnabled() - Returns a flag indicating whether the parser will overwrite content already downloaded. |
| boolean | isIgnoreFollowingErrors() - Returns a flag indicating whether the parser will ignore invalid, malformed or unavailable links when following URLs to collect additional data associated with a current result. |
| protected abstract Paginator | newPaginator(RemoteParserSettings parserSettings) - Creates an instance of a concrete implementation of Paginator. |
| void | setBatchId(String batchId) - Defines a custom batch ID to be used in the file name pattern specified by getFileNamePattern(). |
| void | setColumnReorderingEnabled(boolean columnReorderingEnabled) - Defines whether fields should be reordered when field selection methods of an entity's EntitySettings (such as EntitySettings.selectFields(String...)) are used. |
| void | setDownloadBeforeParsingEnabled(boolean downloadBeforeParsingEnabled) - Instructs the parser to download the remote content before parsing it. |
| void | setDownloadContentDirectory(File directory) - Configures the parser to store a local copy of the remote content in the filesystem. |
| void | setDownloadContentDirectory(String path) - Configures the parser to store a local copy of the remote content in the filesystem. |
| void | setDownloadEnabled(boolean downloadEnabled) - Enables/disables any remote download operation. |
| void | setDownloadListener(com.univocity.api.statistics.DownloadListener downloadListener) - Associates a DataTransfer with the parser, which will receive updates on the progress of downloads made by the parser. |
| void | setDownloadOverwritingEnabled(boolean downloadOverwritingEnabled) - Configures the parser to overwrite content already downloaded. |
| void | setDownloadThreads(int downloadThreads) - Sets the number of threads that will be used to download remote content. |
| void | setEmptyValue(String emptyValue) - Defines the value to be used when the content parsed for a field of some record evaluates to an empty String. Defaults to null. |
| void | setExecutorService(ExecutorService executorService) - Assigns an ExecutorService to the parser, which will be used to manage the multiple threads that can be started. |
| void | setFileNameParameter(String parameterName, Object parameterValue) - Sets the value of a parameter in the filename pattern defined in setFileNamePattern(String). |
| void | setFileNamePattern(String pattern) - Sets the pattern that names of downloaded files should follow. |
| void | setNesting(Nesting nesting) - Configures the nesting strategy to apply to rows associated with a "parent" row, such as results parsed from a link accessed by a RemoteFollower. |
| void | setPaginator(Paginator paginator) - Configures a Paginator to handle multiple pages of remote content that need to be parsed. |
| void | setParseDate(Calendar parseDate) - Defines a parse date to process historical files. |
| void | setParseDate(Date parseDate) - Defines a parse date to process historical files. |
| void | setParseDate(String parseDate) - Defines a parse date to process historical files. |
| void | setRemoteInterval(long remoteInterval) - Defines the minimum interval of time to wait between remote requests. |
| void | setTextEncoding(Charset encoding) - Defines the character set to use when writing text/html files downloaded by the parser. By default the system encoding is used. |
| void | setTextEncoding(String charsetName) - Defines the character set to use when writing text/html files downloaded by the parser. By default the system encoding is used. |
| Methods inherited from class java.lang.Object |
|---|
| equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected Paginator paginator
protected Boolean downloadBeforeParsingEnabled
| Constructor Detail |
|---|
public RemoteParserSettings()
Creates a new configuration object for an implementation of EntityParserInterface, which will process an input to produce records for entities defined by a RemoteEntityList. The RemoteEntityList is used to manage RemoteEntitySettings for each entity whose records will be parsed.
| Method Detail |
|---|
public final void setDownloadContentDirectory(String path)
Configures the parser to store a local copy of the remote content in the filesystem.
Parameters:
path - the path to the target directory. It can contain system variables enclosed within { and } (e.g. "{user.home}/Downloads"). Subdirectories that don't exist will be created if required.

public final void setDownloadContentDirectory(File directory)
Configures the parser to store a local copy of the remote content in the filesystem.
Parameters:
directory - the target directory. Subdirectories that don't exist will be created if required.

public final com.univocity.api.io.FileProvider getDownloadContentDirectory()
Returns the directory where downloaded content should be stored.
Returns:
a FileProvider pointing to the configured download content directory

public final void setFileNamePattern(String pattern)
Sets the pattern that names of downloaded files should follow. The pattern can contain the following parameters:

{page, <padding>} prints the current page number from the paginator. The number can be padded with leading zeroes if the optional padding number is provided.
Examples:
- /tmp/page{page, 4}: prints /tmp/page0001.html, /tmp/page0321.html, etc.
- /tmp/page{page}: prints /tmp/page1.html, /tmp/page2.html, /tmp/page543.html, etc.
- /tmp/page{page, 2}: prints /tmp/page01.html, /tmp/page89.html, /tmp/page289.html, etc.

{date, <mask>} prints the current date as a timestamp. A date mask can be provided to configure how the date should be displayed (refer to SimpleDateFormat for valid patterns).
Examples:
- /tmp/file_{date, yyyy-MMM-dd}: prints /tmp/file_2016-Dec-25.pdf, /tmp/file_2020-Feb-28.html, etc.
- /tmp/file_{date}: prints /tmp/file_23423423423.pdf, /tmp/file_234234324231.html, etc.

{$query} prints the value associated with the supplied query parameter located in the HTML page's URL.
Examples:
- /tmp/search_{$q} on an HTML page with URL 'http://google.com/search?q=cup': prints /tmp/search_cup.html

{entry, <padding>} prints the number of followers that have been parsed. It works the same way as {page}, but applies to link followers.
Examples:
- /tmp/file_{entry, 3}: prints /tmp/file_001, /tmp/file_014, etc.

{parent} prints the name of the "parent" file without the extension. The "parent" file is the one that the "parent" entity saved to. In the case of a link follower, the "parent" entity is the entity that parsed the row which triggered the link follower to start parsing.
Examples:
- {parent}/followedPage with a parent entity saving to the file /tmp/page_1.html would print /tmp/page_1/followedPage.html
- {parent}/file_{entry} with a parent entity saving to the file /tmp/page_4.html would print /tmp/page_4/file_1.html, /tmp/page_4/file_2.html, etc.

{batch} prints the custom batch ID provided by getBatchId().
Example:
- /tmp/{batch}/page_{page} where the batch ID is set to "abc" would print /tmp/abc/page_1.html

{url, <option>} prints part of the current URL being visited, the URL itself where each part is a directory, or a flattened representation of the URL. For example, given the relative URL "/Property/307634/EST6886/Springfield":
- /tmp/{url, 2}: prints the third section of the URL: /tmp/est6886.html
- /tmp/{url, flat}: prints /tmp/property_307634_est6886_springfield.html
- /tmp/{url}: prints /tmp/Property/307634/EST6886/Springfield.html

Defaults to file_{page}
Parameters:
pattern - the pattern used to generate file names for downloaded content.

public final String getFileNamePattern()
Gets the pattern that names of downloaded files should follow. The pattern supports the same parameters documented for setFileNamePattern(String): {page, <padding>}, {date, <mask>}, {$query}, {entry, <padding>}, {parent}, {batch} and {url, <option>}.
Defaults to file_{page}
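As an illustration of the file name pattern parameters, here is a minimal sketch in plain Java of how values for {page, <padding>}, {date, <mask>} and {url, flat} could be produced. It is not the parser's own implementation: the class and method names (FileNamePatternSketch, page, date, flatUrl) are hypothetical, and the use of Locale.ENGLISH for month names is an assumption made so that the output is predictable.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helper, not part of the univocity API.
class FileNamePatternSketch {

    // {page, <padding>}: zero-pads the page number to the given width (0 means no padding).
    static String page(int pageNumber, int padding) {
        if (padding <= 0) {
            return String.valueOf(pageNumber);
        }
        return String.format(Locale.ROOT, "%0" + padding + "d", pageNumber);
    }

    // {date, <mask>}: formats a date using a SimpleDateFormat mask.
    // Locale.ENGLISH is an assumption of this sketch.
    static String date(Date date, String mask) {
        return new SimpleDateFormat(mask, Locale.ENGLISH).format(date);
    }

    // {url, flat}: joins the sections of a relative URL with underscores, in lower case.
    static String flatUrl(String relativeUrl) {
        String trimmed = relativeUrl.replaceAll("^/+|/+$", "");
        return trimmed.replace('/', '_').toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println("/tmp/page" + page(1, 4) + ".html");   // /tmp/page0001.html
        System.out.println("/tmp/page" + page(289, 2) + ".html"); // /tmp/page289.html

        Date date = new GregorianCalendar(2016, Calendar.DECEMBER, 25).getTime();
        System.out.println("/tmp/file_" + date(date, "yyyy-MMM-dd") + ".pdf"); // /tmp/file_2016-Dec-25.pdf

        // /tmp/property_307634_est6886_springfield.html
        System.out.println("/tmp/" + flatUrl("/Property/307634/EST6886/Springfield") + ".html");
    }
}
```

Note that a padding smaller than the number of digits leaves the number untouched, which matches the documented /tmp/page289.html result for {page, 2}.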
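public final Charset getTextEncoding()
Returns the character set to use when writing text/html files downloaded by the parser. Defaults to the system encoding if not provided.

public final void setTextEncoding(Charset encoding)
Defines the character set to use when writing text/html files downloaded by the parser. By default the system encoding is used.
Parameters:
encoding - the encoding to use for writing downloaded files

public final void setTextEncoding(String charsetName)
Defines the character set to use when writing text/html files downloaded by the parser. By default the system encoding is used.
Parameters:
charsetName - the name of the charset to use for writing downloaded files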
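The two setTextEncoding overloads accept either a Charset or a charset name. The sketch below shows, using only the JDK, how a charset name maps to a Charset and what "system encoding" means in Java terms. Whether the parser resolves names through Charset.forName internally is an assumption; TextEncodingSketch and resolve are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the univocity API.
class TextEncodingSketch {

    // Resolves a charset name; falls back to the system default when no name is given.
    static Charset resolve(String charsetName) {
        return charsetName == null ? Charset.defaultCharset() : Charset.forName(charsetName);
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8").equals(StandardCharsets.UTF_8)); // true
        System.out.println(resolve(null)); // the system default, which varies per machine
    }
}
```

Charset.forName throws an exception for names the JVM does not recognise, so passing a Charset constant such as StandardCharsets.UTF_8 avoids that failure mode.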
public final void setFileNameParameter(String parameterName, Object parameterValue)
Sets the value of a parameter in the filename pattern defined in setFileNamePattern(String).
Parameters:
parameterName - the name of the parameter
parameterValue - the value of the parameter

public final Object getFileNameParameter(String parameterName)
Gets the value of a parameter in the filename pattern defined in setFileNamePattern(String).
Parameters:
parameterName - the name of the parameter to get

public final Set<String> getFileNameParameters()
Gets the parameters of the filename pattern defined in setFileNamePattern(String).

public final void clearFileNameParameters()
Clears all values from the filename pattern defined in setFileNamePattern(String).

public void setPaginator(Paginator paginator)
Configures a Paginator to handle multiple pages of remote content that need to be parsed.
Parameters:
paginator - a Paginator to be associated with the current RemoteParserSettings

public Paginator getPaginator()
Returns the Paginator that handles multiple pages of remote content that need to be parsed.
Returns:
the Paginator associated with the current RemoteParserSettings

protected abstract Paginator newPaginator(RemoteParserSettings parserSettings)
Creates an instance of a concrete implementation of Paginator.
Parameters:
parserSettings - the parser settings that should be used for the new paginator
Returns:
a new Paginator instance

public final String getEmptyValue()
Returns the value to be used when the content parsed for a field of some record evaluates to an empty String.
Defaults to null
Returns:
the value to be used instead of an empty String (i.e. "") when the content of a field is empty.

public final void setEmptyValue(String emptyValue)
Defines the value to be used when the content parsed for a field of some record evaluates to an empty String.
Defaults to null
Parameters:
emptyValue - the value to be used instead of an empty String (i.e. "") when the content of a field is empty.

public final boolean isColumnReorderingEnabled()
Identifies whether fields should be reordered when field selection methods of an entity's EntitySettings (such as EntitySettings.selectFields(String...)) are used.
When enabled, each parsed record will contain values only for the selected columns. The values will be ordered according to the selection.
When disabled, each parsed record will contain values for all columns, in their original sequence. Fields which were not selected will contain null values, as defined in EntitySettings.getNullValue().
Defaults to true
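The difference between the two column reordering modes can be sketched in plain Java as follows. This only mirrors the behaviour described above and is not the parser's implementation; ColumnReorderingSketch, project and its parameters are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the univocity API.
class ColumnReorderingSketch {

    // Builds the record a caller would see for the given field selection.
    static String[] project(String[] headers, String[] row, List<String> selection,
                            boolean reorderingEnabled, String nullValue) {
        if (reorderingEnabled) {
            // Only the selected columns, in the order of the selection.
            String[] out = new String[selection.size()];
            List<String> headerList = Arrays.asList(headers);
            for (int i = 0; i < out.length; i++) {
                int index = headerList.indexOf(selection.get(i));
                out[i] = index < 0 ? nullValue : row[index];
            }
            return out;
        }
        // All columns in their original order; unselected ones get the null value.
        String[] out = new String[headers.length];
        for (int i = 0; i < headers.length; i++) {
            out[i] = selection.contains(headers[i]) ? row[i] : nullValue;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] headers = {"a", "b", "c"};
        String[] row = {"1", "2", "3"};
        List<String> selection = Arrays.asList("c", "a");

        System.out.println(Arrays.toString(project(headers, row, selection, true, null)));  // [3, 1]
        System.out.println(Arrays.toString(project(headers, row, selection, false, null))); // [1, null, 3]
    }
}
```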
public final void setColumnReorderingEnabled(boolean columnReorderingEnabled)
Defines whether fields should be reordered when field selection methods of an entity's EntitySettings (such as EntitySettings.selectFields(String...)) are used.
When enabled, each parsed record will contain values only for the selected columns. The values will be ordered according to the selection.
When disabled, each parsed record will contain values for all columns, in their original sequence. Fields which were not selected will contain null values, as defined in EntitySettings.getNullValue().
Defaults to true
Parameters:
columnReorderingEnabled - the flag indicating whether or not selected fields should be reordered

public com.univocity.api.statistics.DownloadListener getDownloadListener()
Returns the DownloadListener associated with the parser, which will receive updates on the progress of downloads made by the parser.
Returns:
the listener associated with the parser; if none has been set, a NoopDataTransfer will be returned.

public void setDownloadListener(com.univocity.api.statistics.DownloadListener downloadListener)
Associates a DataTransfer with the parser, which will receive updates on the progress of downloads made by the parser.
Parameters:
downloadListener - the listener that should receive notifications regarding the progress of downloads performed by the parser.

public abstract String getDefaultFileExtension()
Returns the default file extension to use when saving files to the directory specified by getDownloadContentDirectory(), in case the file name pattern taken from getFileNamePattern() doesn't include a file extension.
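A rough sketch of the fallback described for getDefaultFileExtension() is shown below: the default extension is appended only when the generated file name has none. The helper name withExtension is hypothetical, and the sketch assumes the extension is supplied without a leading dot, which the documentation does not specify.

```java
// Hypothetical helper, not part of the univocity API.
class DefaultExtensionSketch {

    // Appends defaultExtension (assumed to have no leading dot) when fileName has no extension.
    static String withExtension(String fileName, String defaultExtension) {
        int lastSeparator = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        int lastDot = fileName.lastIndexOf('.');
        // A dot only counts as an extension marker if it appears in the last path segment.
        boolean hasExtension = lastDot > lastSeparator;
        return hasExtension ? fileName : fileName + "." + defaultExtension;
    }

    public static void main(String[] args) {
        System.out.println(withExtension("/tmp/page0001", "html"));             // /tmp/page0001.html
        System.out.println(withExtension("/tmp/file_2016-Dec-25.pdf", "html")); // /tmp/file_2016-Dec-25.pdf
    }
}
```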
public boolean isDownloadOverwritingEnabled()
Returns a flag indicating whether the parser will overwrite content already downloaded. Defaults to true.
Has no effect if isDownloadEnabled() evaluates to false.

public void setDownloadOverwritingEnabled(boolean downloadOverwritingEnabled)
Configures the parser to overwrite content already downloaded. Defaults to true.
Parameters:
downloadOverwritingEnabled - flag to enable or disable overwriting of downloaded content.

public boolean isDownloadBeforeParsingEnabled()
Verifies whether the parser will download the remote content before parsing it. If a download content directory has been defined with setDownloadContentDirectory(String), this method will always return true and the parser will download the remote content into the given directory. If no directory has been defined, the contents will be downloaded into a temporary directory.
Defaults to false

public void setDownloadBeforeParsingEnabled(boolean downloadBeforeParsingEnabled)
Instructs the parser to download the remote content before parsing it. If a download content directory has been defined with setDownloadContentDirectory(String), this method has no effect and the parser will download the remote content into the given directory. If this flag is set to true and no directory has been defined, the contents will be downloaded into a temporary directory.
Defaults to false
Parameters:
downloadBeforeParsingEnabled - flag that enables the parser to download remote content into a local file before parsing it.

public final void setDownloadThreads(int downloadThreads)
Sets the number of threads that will be used to download remote content.
Parameters:
downloadThreads - the maximum number of threads to be used for downloading content

public final int getDownloadThreads()
Returns the number of threads that will be used to download remote content.

public final Nesting getNesting()
Returns the nesting strategy to apply to rows associated with a "parent" row, such as results parsed from a link accessed by a RemoteFollower.
Defaults to the parent entity's RemoteEntitySettings.getNesting() or, if undefined, the getNesting() setting.

public final void setNesting(Nesting nesting)
Configures the nesting strategy to apply to rows associated with a "parent" row, such as results parsed from a link accessed by a RemoteFollower.
Defaults to the parent entity's RemoteEntitySettings.getNesting() or, if undefined, the getNesting() setting.
Parameters:
nesting - the nesting strategy to use when processing results associated with a parent row.

public final void setExecutorService(ExecutorService executorService)
Assigns an ExecutorService to the parser, which will be used to manage the multiple threads that can be started. These threads are used to parse/download data from a given input and any remote resources associated with it.
Defaults to: Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Parameters:
executorService - the executor service to be used by the parser for the creation of new threads.

public final ExecutorService getExecutorService()
Returns the ExecutorService to be used by the parser for managing the multiple threads that can be started. These threads are used to parse/download data from a given input and any remote resources associated with it.
Defaults to: Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
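For reference, the documented default executor can be reproduced with the JDK alone. The sketch below creates an equivalent pool, runs a few trivial tasks on it and shuts it down. ExecutorServiceSketch and its methods are hypothetical names, and the idea that a caller who supplies its own ExecutorService is also the one who shuts it down is an assumption, not something the documentation states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of the univocity API.
class ExecutorServiceSketch {

    // Builds a pool equivalent to the documented default.
    static ExecutorService newDefaultPool() {
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    // Submits taskCount trivial tasks (each returns its own index) and sums their results.
    static int sumTaskIndexes(ExecutorService executor, int taskCount) {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int index = i;
            Callable<Integer> task = () -> index;
            futures.add(executor.submit(task));
        }
        int sum = 0;
        try {
            for (Future<Integer> future : futures) {
                sum += future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        ExecutorService executor = newDefaultPool();
        try {
            System.out.println(sumTaskIndexes(executor, 4)); // 0 + 1 + 2 + 3 = 6
        } finally {
            // A fixed thread pool uses non-daemon threads, so it must be shut down for the JVM to exit.
            executor.shutdown();
        }
    }
}
```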
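public void ignoreFollowingErrors(boolean ignoreLinkFollowingErrors)
Configures the parser to ignore (or not) invalid, malformed or unavailable links when following URLs to collect additional data associated with a current result. If false, the parser will throw an Exception when attempting to follow a link that is invalid, malformed or unavailable. If true, the parser will simply ignore the error and proceed.
Defaults to true
Parameters:
ignoreLinkFollowingErrors - true if the parser should ignore errors when accessing a linked page, false otherwise.

public boolean isIgnoreFollowingErrors()
Returns a flag indicating whether the parser will ignore invalid, malformed or unavailable links when following URLs to collect additional data associated with a current result. Defaults to true.
Returns:
true if the parser is set to ignore errors when accessing a linked page

protected RemoteParserSettings<S,L,C> clone()
Overrides:
clone in class EntityParserSettings<S extends com.univocity.parsers.common.CommonParserSettings,L extends RemoteEntityList,C extends com.univocity.parsers.common.Context>

public final long getRemoteInterval()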
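Returns the minimum interval of time to wait between remote requests. This is relevant when RemoteFollowers are used.
Defaults to 15 ms
Returns:
the minimum time (in milliseconds) to wait between remote requests. Values <= 0 mean the internal RateLimiter is disabled.

public final void setRemoteInterval(long remoteInterval)
Defines the minimum interval of time to wait between remote requests. This is relevant when RemoteFollowers are used.
Defaults to 15 ms
Parameters:
remoteInterval - minimum time (in milliseconds) to wait between remote requests. Any value <= 0 will disable the internal RateLimiter.

public final void setParseDate(Calendar parseDate)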
Defines a parse date to process historical files. This applies when the pattern returned by getFileNamePattern() contains a date parameter, for example: "{date, yyyy-MMM-dd}/results_{page}.html". If the parse date is set to 2015-10-10, the parser will look for existing files under the directory named "2015-Oct-10" inside getDownloadContentDirectory().
If the parse date is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).
Parameters:
parseDate - the date to use for loading files downloaded in the past that will be re-parsed.

public final void setParseDate(Date parseDate)
Defines a parse date to process historical files. This applies when the pattern returned by getFileNamePattern() contains a date parameter, for example: "{date, yyyy-MMM-dd}/results_{page}.html". If the parse date is set to 2015-10-10, the parser will look for existing files under the directory named "2015-Oct-10" inside getDownloadContentDirectory().
If the parse date is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).
Parameters:
parseDate - the date to use for loading files downloaded in the past that will be re-parsed.

public final void setParseDate(String parseDate)
Defines a parse date to process historical files. This applies when the pattern returned by getFileNamePattern() contains a date parameter, for example: "{date, yyyy-MMM-dd}/results_{page}.html". If the parse date is set to "2015-Oct-10", the parser will look for existing files under the directory named "2015-Oct-10" inside getDownloadContentDirectory().
If the parse date is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).
Parameters:
parseDate - the formatted representation of the date to use for loading files downloaded in the past that will be re-parsed. Must match the date pattern used in getFileNamePattern().

public final String getParseDate()
Returns the formatted parse date to associate with any downloaded files for future re-parsing. If getFileNamePattern() contains a date parameter such as "{date, yyyy-MMM-dd}/results_{page}.html", any downloaded files will be stored under the directory named after the date. If the parse date is set manually to "2015-Oct-10", the parser will look for existing files under the directory named "2015-Oct-10" inside getDownloadContentDirectory(). If no format is defined, a String representing the time in milliseconds will be returned.
If no date has been set explicitly, the current date and time of the system will be used.
If the given parse date is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).
Returns:
a String representing the parse date.

public final String getBatchId()
Returns the custom batch ID to be used in the file name pattern specified by getFileNamePattern(). Used to process files stored locally.
If a {batch} parameter is not present in the pattern, the given batch ID will simply be ignored.
If the batch ID is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).

public final void setBatchId(String batchId)
Defines a custom batch ID to be used in the file name pattern specified by getFileNamePattern(). Used to process files stored locally.
If a {batch} parameter is not present in the pattern, the given batch ID will simply be ignored.
If the batch ID is not null, downloads will be disabled automatically unless explicitly enabled with setDownloadEnabled(true).
Parameters:
batchId - the user-specific batch ID

public final void setDownloadEnabled(boolean downloadEnabled)
Enables/disables any remote download operation. Make sure that isDownloadOverwritingEnabled() is set to false to prevent downloading and overwriting existing files.
Parameters:
downloadEnabled - flag indicating whether downloads are enabled.

public final boolean isDownloadEnabled()
Flags whether remote downloads are enabled; true by default. It's recommended to disable downloads when processing historical files offline to ensure no accidental download will occur and overwrite old files.
If enabled, when processing stored files any missing file that was not downloaded previously will be downloaded. Make sure that isDownloadOverwritingEnabled() is set to false to prevent downloading and overwriting existing files.