Class AvroSource<T>

  • Type Parameters:
    T - The type of records to be read from the source.
    All Implemented Interfaces:
    java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData

    public class AvroSource<T>
    extends org.apache.beam.sdk.io.BlockBasedSource<T>
    Do not use in pipelines directly: most users should use AvroIO.Read.

    A FileBasedSource for reading Avro files.

    To read a PCollection of objects from one or more Avro files, use from(org.apache.beam.sdk.options.ValueProvider<java.lang.String>) to specify the path(s) of the files to read. The AvroSource that is returned will read objects of type GenericRecord with the schema(s) that were written at file creation. To further configure the AvroSource to read with a user-defined schema, or to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to return objects of the Avro-generated class specified).

    An AvroSource can be read from using the Read transform. For example:

    
     AvroSource<MyType> source = AvroSource.from(file.getPath()).withSchema(MyType.class);
     PCollection<MyType> records = pipeline.apply(Read.from(source));
     

    This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
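As a rough illustration of why block offsets can be computed from the file bytes alone, the sketch below does the arithmetic for one block. It assumes the block layout given in the Avro specification: an object count (Avro long), the byte size of the serialized objects (Avro long), the serialized objects, then the 16-byte sync marker. The class and method names (AvroBlockOffsetSketch, readLong, nextBlockStart) are hypothetical and are not part of Beam or Avro; this is not the AvroSource implementation.

```java
/** Hypothetical helper for illustration only; not part of Beam or Avro. */
final class AvroBlockOffsetSketch {
  static final int SYNC_MARKER_BYTES = 16;

  /** A decoded Avro long together with the number of bytes it occupied. */
  static final class VarLong {
    final long value;
    final int length;

    VarLong(long value, int length) {
      this.value = value;
      this.length = length;
    }
  }

  /** Decodes one Avro long (variable-length, zig-zag encoded) starting at pos. */
  static VarLong readLong(byte[] buf, int pos) {
    long raw = 0;
    int shift = 0;
    int i = pos;
    while (true) {
      if (i >= buf.length || shift > 63) {
        throw new IllegalArgumentException("truncated or overlong varint at " + pos);
      }
      int b = buf[i++] & 0xff;
      raw |= (long) (b & 0x7f) << shift; // low 7 bits carry data
      if ((b & 0x80) == 0) {
        break; // high bit clear: last byte of this value
      }
      shift += 7;
    }
    long value = (raw >>> 1) ^ -(raw & 1); // undo zig-zag encoding
    return new VarLong(value, i - pos);
  }

  /** Returns the offset at which the block after the one at blockStart begins. */
  static long nextBlockStart(byte[] file, int blockStart) {
    VarLong count = readLong(file, blockStart);
    VarLong size = readLong(file, blockStart + count.length);
    // header bytes + serialized objects + trailing sync marker
    return (long) blockStart + count.length + size.length + size.value + SYNC_MARKER_BYTES;
  }
}
```

Because both header fields are variable-length, the sketch has to track how many bytes each one consumed; that count is what a reader needs in order to report exact block offsets.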

    Avro Object Container files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, or snappy). Blocks are delineated from one another by a 16-byte sync marker.
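One way to picture the role of the sync marker: a reader that starts at an arbitrary byte offset can scan forward to the next occurrence of the marker and treat the byte after it as the start of a block. The following is a minimal sketch of that idea over an in-memory byte array. The names SyncMarkerScanSketch, indexOfSync, and nextBlockStartAfter are hypothetical, and the sketch makes no claim about how AvroSource.AvroReader actually performs the scan.

```java
/** Hypothetical helper for illustration only; not part of Beam or Avro. */
final class SyncMarkerScanSketch {
  static final int SYNC_MARKER_BYTES = 16;

  /** Returns the index of the first occurrence of sync in data at or after from, or -1. */
  static int indexOfSync(byte[] data, byte[] sync, int from) {
    if (sync.length != SYNC_MARKER_BYTES) {
      throw new IllegalArgumentException("sync marker must be 16 bytes");
    }
    for (int i = Math.max(from, 0); i + sync.length <= data.length; i++) {
      int j = 0;
      while (j < sync.length && data[i + j] == sync[j]) {
        j++;
      }
      if (j == sync.length) {
        return i; // all 16 bytes matched
      }
    }
    return -1;
  }

  /** Returns the offset just past the next sync marker (a block boundary), or -1 if none. */
  static int nextBlockStartAfter(byte[] data, byte[] sync, int from) {
    int at = indexOfSync(data, sync, from);
    return at < 0 ? -1 : at + sync.length;
  }
}
```

A naive byte-by-byte scan is used here for clarity; a real reader would work on a stream and would probably avoid rescanning bytes.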

    An AvroSource for a subrange of a single file contains records in the blocks such that the start offset of the block is greater than or equal to the start offset of the source and less than the end offset of the source.
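The ownership rule above can be written as a half-open interval test on the block's start offset. The sketch below uses hypothetical names (SubrangeOwnershipSketch, ownsBlock, countOwnedBlocks) and plain long offsets; it only illustrates the rule as stated and is not taken from the AvroSource code.

```java
/** Hypothetical helper for illustration only; not part of Beam. */
final class SubrangeOwnershipSketch {
  /** A source covering [sourceStart, sourceEnd) owns a block iff the block starts inside it. */
  static boolean ownsBlock(long sourceStart, long sourceEnd, long blockStart) {
    return blockStart >= sourceStart && blockStart < sourceEnd;
  }

  /** Counts how many of the given block start offsets fall in [sourceStart, sourceEnd). */
  static int countOwnedBlocks(long sourceStart, long sourceEnd, long[] blockStarts) {
    int owned = 0;
    for (long blockStart : blockStarts) {
      if (ownsBlock(sourceStart, sourceEnd, blockStart)) {
        owned++;
      }
    }
    return owned;
  }
}
```

Because the interval is closed at the start and open at the end, two adjacent subranges such as [0, 250) and [250, 500) never both claim a block that starts exactly at 250, so every block is read by exactly one source even when the block's bytes extend past the end offset of the source that owns it.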

    To use XZ-encoded Avro files, please include an explicit dependency on xz-1.8.jar, which has been marked as optional in the Maven sdk/pom.xml.

    
     <dependency>
       <groupId>org.tukaani</groupId>
       <artifactId>xz</artifactId>
       <version>1.8</version>
     </dependency>
     

    Permissions

    Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of corresponding PipelineRunners for more details.

    See Also:
    Serialized Form
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  AvroSource.AvroReader<T>
      A BlockBasedSource.BlockBasedReader for reading blocks from Avro files.
      static interface  AvroSource.DatumReaderFactory<T>  
      • Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BlockBasedSource

        org.apache.beam.sdk.io.BlockBasedSource.Block<T extends java.lang.Object>, org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T extends java.lang.Object>
      • Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource

        org.apache.beam.sdk.io.FileBasedSource.FileBasedReader<T extends java.lang.Object>
      • Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource

        org.apache.beam.sdk.io.OffsetBasedSource.OffsetBasedReader<T extends java.lang.Object>
      • Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BoundedSource

        org.apache.beam.sdk.io.BoundedSource.BoundedReader<T extends java.lang.Object>
      • Nested classes/interfaces inherited from class org.apache.beam.sdk.io.Source

        org.apache.beam.sdk.io.Source.Reader<T extends java.lang.Object>
    • Method Summary

      Modifier and Type Method Description
      org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile​(java.lang.String fileName, long start, long end)
      Deprecated.
      Used by Dataflow worker
      org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile​(org.apache.beam.sdk.io.fs.MatchResult.Metadata fileMetadata, long start, long end)  
      protected org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T> createSingleFileReader​(org.apache.beam.sdk.options.PipelineOptions options)  
      static AvroSource<org.apache.avro.generic.GenericRecord> from​(java.lang.String fileNameOrPattern)
      static AvroSource<org.apache.avro.generic.GenericRecord> from​(org.apache.beam.sdk.io.fs.MatchResult.Metadata metadata)  
      static AvroSource<org.apache.avro.generic.GenericRecord> from​(org.apache.beam.sdk.options.ValueProvider<java.lang.String> fileNameOrPattern)
      Reads from the given file name or pattern ("glob").
      org.apache.beam.sdk.coders.Coder<T> getOutputCoder()  
      void validate()  
      AvroSource<T> withCoder​(org.apache.beam.sdk.coders.Coder<T> coder)
      Specifies the coder for the result of the AvroSource.
      AvroSource<T> withDatumReaderFactory​(AvroSource.DatumReaderFactory<?> factory)
      Sets a custom AvroSource.DatumReaderFactory for reading.
      AvroSource<T> withEmptyMatchTreatment​(org.apache.beam.sdk.io.fs.EmptyMatchTreatment emptyMatchTreatment)  
      AvroSource<T> withMinBundleSize​(long minBundleSize)
      Sets the minimum bundle size.
      <X> AvroSource<X> withParseFn​(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.generic.GenericRecord,​X> parseFn, org.apache.beam.sdk.coders.Coder<X> coder)
      Reads GenericRecords of unspecified schema, maps them to instances of a custom type using the given parseFn, and encodes them using the given coder.
      <X> AvroSource<X> withSchema​(java.lang.Class<X> clazz)
      Reads files containing records of the given class.
      AvroSource<org.apache.avro.generic.GenericRecord> withSchema​(java.lang.String schema)
      Reads files containing records that conform to the given schema.
      AvroSource<org.apache.avro.generic.GenericRecord> withSchema​(org.apache.avro.Schema schema)
      • Methods inherited from class org.apache.beam.sdk.io.FileBasedSource

        createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString
      • Methods inherited from class org.apache.beam.sdk.io.OffsetBasedSource

        getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
      • Methods inherited from class org.apache.beam.sdk.io.Source

        getDefaultOutputCoder
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Method Detail

      • from

        public static AvroSource<org.apache.avro.generic.GenericRecord> from​(org.apache.beam.sdk.options.ValueProvider<java.lang.String> fileNameOrPattern)
        Reads from the given file name or pattern ("glob"). The returned source reads GenericRecord; to return a different type, configure it further by calling withSchema(java.lang.Class) or withParseFn.
      • from

        public static AvroSource<org.apache.avro.generic.GenericRecord> from​(org.apache.beam.sdk.io.fs.MatchResult.Metadata metadata)
      • withEmptyMatchTreatment

        public AvroSource<T> withEmptyMatchTreatment​(org.apache.beam.sdk.io.fs.EmptyMatchTreatment emptyMatchTreatment)
      • withSchema

        public AvroSource<org.apache.avro.generic.GenericRecord> withSchema​(java.lang.String schema)
        Reads files containing records that conform to the given schema.
      • withSchema

        public AvroSource<org.apache.avro.generic.GenericRecord> withSchema​(org.apache.avro.Schema schema)
      • withSchema

        public <X> AvroSource<X> withSchema​(java.lang.Class<X> clazz)
        Reads files containing records of the given class.
      • withParseFn

        public <X> AvroSource<X> withParseFn​(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.generic.GenericRecord,​X> parseFn,
                                             org.apache.beam.sdk.coders.Coder<X> coder)
        Reads GenericRecords of unspecified schema, maps them to instances of a custom type using the given parseFn, and encodes them using the given coder.
      • withMinBundleSize

        public AvroSource<T> withMinBundleSize​(long minBundleSize)
        Sets the minimum bundle size. Refer to OffsetBasedSource for a description of minBundleSize and its use.
      • withCoder

        public AvroSource<T> withCoder​(org.apache.beam.sdk.coders.Coder<T> coder)
        Specifies the coder for the result of the AvroSource.
      • validate

        public void validate()
        Overrides:
        validate in class org.apache.beam.sdk.io.FileBasedSource<T>
      • createForSubrangeOfFile

        @Deprecated
        public org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile​(java.lang.String fileName,
                                                                                  long start,
                                                                                  long end)
                                                                           throws java.io.IOException
        Deprecated.
        Used by the Dataflow worker. Do not introduce new usages. Do not delete without confirming that Dataflow ValidatesRunner tests pass.
        Throws:
        java.io.IOException
      • createForSubrangeOfFile

        public org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile​(org.apache.beam.sdk.io.fs.MatchResult.Metadata fileMetadata,
                                                                                  long start,
                                                                                  long end)
        Specified by:
        createForSubrangeOfFile in class org.apache.beam.sdk.io.BlockBasedSource<T>
      • createSingleFileReader

        protected org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T> createSingleFileReader​(org.apache.beam.sdk.options.PipelineOptions options)
        Specified by:
        createSingleFileReader in class org.apache.beam.sdk.io.BlockBasedSource<T>
      • getOutputCoder

        public org.apache.beam.sdk.coders.Coder<T> getOutputCoder()
        Overrides:
        getOutputCoder in class org.apache.beam.sdk.io.Source<T>