Type Parameters: IN - Type of the elements emitted by this sink

@PublicEvolving public class StreamingFileSink<IN> extends RichSinkFunction<IN> implements CheckpointedFunction, org.apache.flink.runtime.state.CheckpointListener, ProcessingTimeCallback

Sink that emits its input elements to FileSystem files within buckets. This is
integrated with the checkpointing mechanism to provide exactly-once semantics.
When creating the sink a basePath must be specified. The base directory contains
one directory for every bucket. The bucket directories themselves contain several part files,
with at least one for each parallel subtask of the sink which is writing data to that bucket.
These part files contain the actual output data.
The sink uses a BucketAssigner to determine the bucket directory, inside the base directory,
into which each element is written. The BucketAssigner can, for example, use time or
a property of the element to determine the bucket directory. The default BucketAssigner is a
DateTimeBucketAssigner, which creates one new bucket every hour. You can specify
a custom BucketAssigner using the withBucketAssigner(bucketAssigner) method of the builder, after calling
forRowFormat(Path, Encoder) or
forBulkFormat(Path, BulkWriter.Factory).
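
The hourly bucketing described above can be illustrated with a minimal plain-Java sketch. It does not use Flink's BucketAssigner or DateTimeBucketAssigner classes; the class name `HourlyBucketSketch`, the method `bucketId`, and the date pattern are assumptions chosen for illustration only.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Plain-Java model of time-based bucketing (hypothetical, not a Flink class). */
final class HourlyBucketSketch {
    // Hour-granularity pattern (an assumption): all timestamps that fall
    // into the same hour produce the same bucket directory name.
    private static final String PATTERN = "yyyy-MM-dd--HH";

    /** Maps a timestamp in epoch milliseconds to a bucket directory name. */
    static String bucketId(long epochMillis, ZoneId zone) {
        return DateTimeFormatter.ofPattern(PATTERN)
                .withZone(zone)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Two timestamps within the same hour share one bucket.
        System.out.println(bucketId(0L, ZoneOffset.UTC));
        System.out.println(bucketId(3_599_999L, ZoneOffset.UTC));
        // One millisecond later, a new hour and therefore a new bucket begins.
        System.out.println(bucketId(3_600_000L, ZoneOffset.UTC));
    }
}
```

A property-based assigner would follow the same shape, returning a value derived from the element instead of from a timestamp.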
The filenames of the part files can be defined using OutputFileConfig. This configuration contains
a part prefix and a part suffix, which are combined with the parallel subtask index of the sink
and a rolling counter. For example, for a prefix "prefix" and a suffix ".ext", a created file will have a name such as
"prefix-1-17.ext", meaning that it contains data from subtask 1 of the sink and is the 17th part file
created by that subtask.
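
The naming scheme from the example above can be written down as a small helper. This is a sketch of the pattern only, inferred from the "prefix-1-17.ext" example; `PartFileNameSketch` and `partFileName` are hypothetical names and are not part of OutputFileConfig.

```java
/** Plain-Java model of the part file naming pattern (hypothetical, not a Flink class). */
final class PartFileNameSketch {

    /**
     * Builds "<prefix>-<subtaskIndex>-<partCounter><suffix>".
     * The suffix is expected to carry its own leading dot, e.g. ".ext".
     */
    static String partFileName(String prefix, int subtaskIndex, long partCounter, String suffix) {
        return prefix + "-" + subtaskIndex + "-" + partCounter + suffix;
    }

    public static void main(String[] args) {
        // Subtask 1, rolling counter 17.
        System.out.println(partFileName("prefix", 1, 17, ".ext"));
    }
}
```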
Part files roll based on the user-specified RollingPolicy. By default, a DefaultRollingPolicy
is used for row-encoded sink output; an OnCheckpointRollingPolicy is used for bulk-encoded sink output.
In some scenarios, the open buckets are required to change based on time. In these cases, the user
can specify a bucketCheckInterval (1 minute by default); the sink then checks periodically and rolls
the part file if the specified rolling policy says so.
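
To make the idea of a rolling decision concrete, here is a minimal plain-Java sketch of a policy that rolls on size when an element is written and on age or inactivity when the periodic check fires. It is not Flink's RollingPolicy interface: the class `RollingPolicySketch`, its simplified method signatures, and all three thresholds are assumptions, and the text above does not state which conditions DefaultRollingPolicy actually applies.

```java
/** Plain-Java model of a size- and time-based rolling decision (hypothetical, not a Flink class). */
final class RollingPolicySketch {
    private final long maxPartSizeBytes;     // assumed size threshold
    private final long rolloverIntervalMs;   // assumed maximum age of a part file
    private final long inactivityIntervalMs; // assumed maximum idle time of a part file

    RollingPolicySketch(long maxPartSizeBytes, long rolloverIntervalMs, long inactivityIntervalMs) {
        this.maxPartSizeBytes = maxPartSizeBytes;
        this.rolloverIntervalMs = rolloverIntervalMs;
        this.inactivityIntervalMs = inactivityIntervalMs;
    }

    /** Checked when an element is written: roll once the part file exceeds the size limit. */
    boolean shouldRollOnEvent(long currentSizeBytes) {
        return currentSizeBytes > maxPartSizeBytes;
    }

    /** Checked on every periodic tick: roll if the file is too old or has been idle too long. */
    boolean shouldRollOnProcessingTime(long creationTimeMs, long lastUpdateTimeMs, long nowMs) {
        boolean tooOld = nowMs - creationTimeMs >= rolloverIntervalMs;
        boolean idle = nowMs - lastUpdateTimeMs >= inactivityIntervalMs;
        return tooOld || idle;
    }
}
```

In this model the periodic tick plays the role of bucketCheckInterval: without it, a part file that receives no further elements would never be re-evaluated.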
Part files can be in one of three states: in-progress, pending or finished.
The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once
semantics and fault-tolerance. The part file that is currently being written to is in-progress. Once
a part file is closed for writing it becomes pending. When a checkpoint is successful the currently
pending files will be moved to finished.
In case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
had when the last successful checkpoint occurred. To this end, when restoring, the restored files in pending
state are transferred into the finished state while any in-progress files are rolled back, so that
they do not contain data that arrived after the checkpoint from which we restore.
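
The part file lifecycle and the restore behaviour described above can be modelled as a small state machine. The following is a plain-Java sketch, not the sink's implementation: `PartFileLifecycleSketch` and all of its methods are hypothetical, file contents are reduced to a byte count, and the association of pending files with specific checkpoints is deliberately simplified to a single "last successful checkpoint".

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the in-progress / pending / finished lifecycle (hypothetical). */
final class PartFileLifecycleSketch {
    enum State { IN_PROGRESS, PENDING, FINISHED }

    /** Immutable record of one part file: its state and the number of bytes it holds. */
    static final class Part {
        final State state;
        final long length;
        Part(State state, long length) { this.state = state; this.length = length; }
    }

    private Map<String, Part> live = new HashMap<>();
    private Map<String, Part> lastCheckpoint = new HashMap<>();

    /** Appends bytes to an in-progress part file, creating it on first write. */
    void write(String name, long bytes) {
        Part p = live.getOrDefault(name, new Part(State.IN_PROGRESS, 0));
        if (p.state != State.IN_PROGRESS) {
            throw new IllegalStateException(name + " is closed for writing");
        }
        live.put(name, new Part(State.IN_PROGRESS, p.length + bytes));
    }

    /** Closing a part file for writing moves it from in-progress to pending. */
    void close(String name) {
        Part p = live.get(name);
        if (p == null || p.state != State.IN_PROGRESS) {
            throw new IllegalStateException(name + " is not in progress");
        }
        live.put(name, new Part(State.PENDING, p.length));
    }

    /** A successful checkpoint records the current view, then promotes pending files. */
    void checkpointSucceeded() {
        lastCheckpoint = new HashMap<>(live);
        live.replaceAll((n, p) -> p.state == State.PENDING ? new Part(State.FINISHED, p.length) : p);
    }

    /**
     * Restore after a failure: pending files of the checkpoint become finished,
     * in-progress files fall back to their checkpointed length, and files
     * created after the checkpoint disappear.
     */
    void restore() {
        Map<String, Part> restored = new HashMap<>();
        lastCheckpoint.forEach((n, p) ->
                restored.put(n, p.state == State.PENDING ? new Part(State.FINISHED, p.length) : p));
        live = restored;
    }

    State stateOf(String name) {
        Part p = live.get(name);
        return p == null ? null : p.state;
    }

    long lengthOf(String name) {
        Part p = live.get(name);
        return p == null ? -1 : p.length;
    }
}
```

The key property the model shows is that nothing written after the last successful checkpoint survives a restore, which is what prevents duplicated output when the input is replayed.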

| Modifier and Type | Class and Description |
|---|---|
| static class | StreamingFileSink.BulkFormatBuilder<IN,BucketID,T extends StreamingFileSink.BulkFormatBuilder<IN,BucketID,T>> A builder for configuring the sink for bulk-encoding formats, e.g. |
| static class | StreamingFileSink.RowFormatBuilder<IN,BucketID,T extends StreamingFileSink.RowFormatBuilder<IN,BucketID,T>> A builder for configuring the sink for row-wise encoding formats. |

Nested classes/interfaces inherited from interface SinkFunction: SinkFunction.Context<T>

| Modifier | Constructor and Description |
|---|---|
| protected | StreamingFileSink(StreamingFileSink.BulkFormatBuilder<IN,?,? extends org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder, long bucketCheckInterval) Creates a new StreamingFileSink that writes files in bulk-encoded format to the given base directory. |
| protected | StreamingFileSink(StreamingFileSink.RowFormatBuilder<IN,?,? extends org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder, long bucketCheckInterval) Creates a new StreamingFileSink that writes files in row-based format to the given base directory. |

| Modifier and Type | Method and Description |
|---|---|
| void | close() |
| static <IN> StreamingFileSink.BulkFormatBuilder<IN,String,? extends StreamingFileSink.BulkFormatBuilder<IN,String,?>> | forBulkFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.BulkWriter.Factory<IN> writerFactory) Creates the builder for a StreamingFileSink with bulk-encoding format. |
| static <IN> StreamingFileSink.RowFormatBuilder<IN,String,? extends StreamingFileSink.RowFormatBuilder<IN,String,?>> | forRowFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.Encoder<IN> encoder) Creates the builder for a StreamingFileSink with row-encoding format. |
| void | initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context) This method is called when the parallel function instance is created during distributed execution. |
| void | invoke(IN value, SinkFunction.Context context) Writes the given value to the sink. |
| void | notifyCheckpointComplete(long checkpointId) |
| void | onProcessingTime(long timestamp) This method is invoked with the timestamp for which the trigger was scheduled. |
| void | open(org.apache.flink.configuration.Configuration parameters) |
| void | snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context) This method is called when a snapshot for a checkpoint is requested. |

Methods inherited from class org.apache.flink.api.common.functions.AbstractRichFunction: getIterationRuntimeContext, getRuntimeContext, setRuntimeContext

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface SinkFunction: invoke

protected StreamingFileSink(StreamingFileSink.RowFormatBuilder<IN,?,? extends org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder, long bucketCheckInterval)

Creates a new StreamingFileSink that writes files in row-based format to the given base directory.

protected StreamingFileSink(StreamingFileSink.BulkFormatBuilder<IN,?,? extends org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.BucketsBuilder<IN,?,?>> bucketsBuilder, long bucketCheckInterval)

Creates a new StreamingFileSink that writes files in bulk-encoded format to the given base directory.

public static <IN> StreamingFileSink.RowFormatBuilder<IN,String,? extends StreamingFileSink.RowFormatBuilder<IN,String,?>> forRowFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.Encoder<IN> encoder)

Creates the builder for a StreamingFileSink with row-encoding format.
Type Parameters:
IN - the type of incoming elements
Parameters:
basePath - the base path where all the buckets are going to be created as sub-directories.
encoder - the Encoder to be used when writing elements in the buckets.
Returns:
The builder for the sink. To instantiate the sink, call StreamingFileSink.RowFormatBuilder.build() after specifying the desired parameters.

public static <IN> StreamingFileSink.BulkFormatBuilder<IN,String,? extends StreamingFileSink.BulkFormatBuilder<IN,String,?>> forBulkFormat(org.apache.flink.core.fs.Path basePath, org.apache.flink.api.common.serialization.BulkWriter.Factory<IN> writerFactory)

Creates the builder for a StreamingFileSink with bulk-encoding format.
Type Parameters:
IN - the type of incoming elements
Parameters:
basePath - the base path where all the buckets are going to be created as sub-directories.
writerFactory - the BulkWriter.Factory to be used when writing elements in the buckets.
Returns:
The builder for the sink. To instantiate the sink, call StreamingFileSink.BulkFormatBuilder.build() after specifying the desired parameters.

public void initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context) throws Exception

Description copied from interface: CheckpointedFunction
This method is called when the parallel function instance is created during distributed execution.
Specified by:
initializeState in interface CheckpointedFunction
Parameters:
context - the context for initializing the operator
Throws:
Exception

public void notifyCheckpointComplete(long checkpointId) throws Exception

Specified by:
notifyCheckpointComplete in interface org.apache.flink.runtime.state.CheckpointListener
Throws:
Exception

public void snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context) throws Exception

Description copied from interface: CheckpointedFunction
This method is called when a snapshot for a checkpoint is requested. This acts as a hook to the function to ensure that all state is exposed by means previously offered through FunctionInitializationContext when
the Function was initialized, or offered now by FunctionSnapshotContext itself.
Specified by:
snapshotState in interface CheckpointedFunction
Parameters:
context - the context for drawing a snapshot of the operator
Throws:
Exception

public void open(org.apache.flink.configuration.Configuration parameters) throws Exception

Specified by:
open in interface org.apache.flink.api.common.functions.RichFunction
Overrides:
open in class org.apache.flink.api.common.functions.AbstractRichFunction
Throws:
Exception

public void onProcessingTime(long timestamp) throws Exception

Description copied from interface: ProcessingTimeCallback
This method is invoked with the timestamp for which the trigger was scheduled.
If the triggering is delayed for whatever reason (trigger timer was blocked, JVM stalled due to a garbage collection), the timestamp supplied to this function will still be the original timestamp for which the trigger was scheduled.
Specified by:
onProcessingTime in interface ProcessingTimeCallback
Parameters:
timestamp - The timestamp for which the trigger event was scheduled.
Throws:
Exception

public void invoke(IN value, SinkFunction.Context context) throws Exception

Description copied from interface: SinkFunction
Writes the given value to the sink.
You have to override this method when implementing a SinkFunction; this is a
default method for backward compatibility with the old-style method only.
Specified by:
invoke in interface SinkFunction<IN>
Parameters:
value - The input record.
context - Additional context about the input record.
Throws:
Exception - This method may throw exceptions. Throwing an exception will cause the operation
to fail and may trigger recovery.

Copyright © 2014–2020 The Apache Software Foundation. All rights reserved.