Class SimpleMetricsCachedBatchSerializer

Object
org.apache.spark.sql.columnar.SimpleMetricsCachedBatchSerializer
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, CachedBatchSerializer, scala.Serializable

public abstract class SimpleMetricsCachedBatchSerializer extends Object implements CachedBatchSerializer, org.apache.spark.internal.Logging
Provides basic filtering for CachedBatchSerializer implementations. The requirement to extend this is that all of the batches produced by your serializer are instances of SimpleMetricsCachedBatch. This does not calculate the metrics needed to be stored in the batches. That is up to each implementation. The metrics required are really just min and max values and those are optional especially for complex types. Because those metrics are simple and it is likely that compression will also be done on the data we thought it best to let each implementation decide on the most efficient way to calculate the metrics, possibly combining them with compression passes that might also be done across the data.
See Also:
  • Constructor Details

    • SimpleMetricsCachedBatchSerializer

      public SimpleMetricsCachedBatchSerializer()
  • Method Details

    • buildFilter

      public scala.Function2<Object,scala.collection.Iterator<CachedBatch>,scala.collection.Iterator<CachedBatch>> buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)
      Description copied from interface: CachedBatchSerializer
      Builds a function that can be used to filter batches prior to being decompressed. In most cases extending SimpleMetricsCachedBatchSerializer will provide the filter logic necessary. You will need to provide metrics for this to work. SimpleMetricsCachedBatch provides the APIs to hold those metrics and explains the metrics used, really just min and max. Note that this is intended to skip batches that are not needed, and the actual filtering of individual rows is handled later.
      Specified by:
      buildFilter in interface CachedBatchSerializer
      Parameters:
      predicates - the set of expressions to use for filtering.
      cachedAttributes - the schema/attributes of the data that is cached. This can be helpful if you don't store it with the data.
      Returns:
      a function that takes the partition id and the iterator of batches in the partition. It returns an iterator of batches that should be decompressed.