Class DataflowRunner


  • public class DataflowRunner
    extends org.apache.beam.sdk.PipelineRunner<DataflowPipelineJob>
    A PipelineRunner that executes the operations in the pipeline by first translating them to the Dataflow representation using the DataflowPipelineTranslator and then submitting them to a Dataflow service for execution.

    Permissions

    When reading from a Dataflow source or writing to a Dataflow sink using DataflowRunner, the Google cloudservices account and the Google compute engine service account of the GCP project running the Dataflow Job will need access to the corresponding source/sink.

    Please see Google Cloud Dataflow Security and Permissions for more details.

    DataflowRunner now supports creating job templates using the --templateLocation option. If this option is set, the runner will generate a template instead of running the pipeline immediately.

    Example:

    
     --runner=DataflowRunner
     --templateLocation=gs://your-bucket/templates/my-template
     
    • Field Detail

      • UNSAFELY_ATTEMPT_TO_PROCESS_UNBOUNDED_DATA_IN_BATCH_MODE

        public static final java.lang.String UNSAFELY_ATTEMPT_TO_PROCESS_UNBOUNDED_DATA_IN_BATCH_MODE
        Experiment to "unsafely attempt to process unbounded data in batch mode".
        See Also:
        Constant Field Values
      • PROJECT_ID_REGEXP

        public static final java.lang.String PROJECT_ID_REGEXP
        Project IDs must contain lowercase letters, digits, or dashes. IDs must start with a letter and may not end with a dash. This regex isn't exact - this allows for patterns that would be rejected by the service, but this is sufficient for basic validation of project IDs.
        See Also:
        Constant Field Values
    • Method Detail

      • replaceGcsFilesWithLocalFiles

        public static java.util.List<java.lang.String> replaceGcsFilesWithLocalFiles​(java.util.List<java.lang.String> filesToStage)
        Replaces GCS file paths with local file paths by downloading the GCS files locally. This is useful when files need to be accessed locally before being staged to Dataflow.
        Parameters:
        filesToStage - List of file paths that may contain GCS paths (gs://) and local paths
        Returns:
        List of local file paths where any GCS paths have been downloaded locally
        Throws:
        java.lang.RuntimeException - if there are errors copying GCS files locally
      • fromOptions

        public static DataflowRunner fromOptions​(org.apache.beam.sdk.options.PipelineOptions options)
        Construct a runner from the provided options.
        Parameters:
        options - Properties that configure the runner.
        Returns:
        The newly created runner.
      • applySdkEnvironmentOverrides

        protected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline applySdkEnvironmentOverrides​(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline,
                                                                                                    DataflowPipelineOptions options)
      • resolveArtifacts

        protected org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline resolveArtifacts​(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline)
      • stageArtifacts

        protected java.util.List<com.google.api.services.dataflow.model.DataflowPackage> stageArtifacts​(org.apache.beam.model.pipeline.v1.RunnerApi.Pipeline pipeline)
      • hasExperiment

        public static boolean hasExperiment​(DataflowPipelineDebugOptions options,
                                            java.lang.String experiment)
        Returns true if the specified experiment is enabled, handling null experiments.
      • replaceV1Transforms

        protected void replaceV1Transforms​(org.apache.beam.sdk.Pipeline pipeline)
      • getTranslator

        public DataflowPipelineTranslator getTranslator()
        Returns the DataflowPipelineTranslator associated with this object.
      • setHooks

        public void setHooks​(DataflowRunnerHooks hooks)
        Sets callbacks to invoke during execution see DataflowRunnerHooks.
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object