public abstract class SimpleMetricsCachedBatchSerializer
implements CachedBatchSerializer, org.apache.spark.internal.Logging
Provides basic filtering for CachedBatchSerializer implementations. To extend this class, all
of the batches produced by your serializer must be instances of SimpleMetricsCachedBatch.
This class does not calculate the metrics that need to be stored in the batches; that is up to
each implementation. The required metrics are just min and max values, and they are optional,
especially for complex types. Because these metrics are simple, and compression is likely to be
applied to the data as well, each implementation is left to decide the most efficient way to
calculate them, possibly combining the calculation with compression passes over the same data.
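To illustrate combining metric collection with an existing pass over the data, here is a minimal
sketch in Scala. The names (IntBatch, buildBatch) are hypothetical and not part of the Spark API;
the point is that min and max can be gathered in the same loop that already scans the values,
for example while encoding or compressing them.

```scala
// Hypothetical batch type: the values plus the min/max stats stored with them.
final case class IntBatch(values: Array[Int], min: Int, max: Int)

def buildBatch(values: Array[Int]): IntBatch = {
  var min = Int.MaxValue
  var max = Int.MinValue
  var i = 0
  while (i < values.length) { // single pass: stats are collected while the
    val v = values(i)         // data is already being scanned, so no extra
    if (v < min) min = v      // traversal is needed just for the metrics
    if (v > max) max = v
    i += 1
  }
  IntBatch(values, min, max)
}
```

A real implementation would store these stats in the batch's stats row and would handle nulls
and types where min/max are not meaningful (hence the metrics being optional).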
Builds a function that can be used to filter batches prior to being decompressed.
In most cases, extending SimpleMetricsCachedBatchSerializer provides the necessary filter
logic. You will need to provide metrics for this to work; SimpleMetricsCachedBatch provides
the APIs to hold those metrics and documents the metrics used, which are just min and max.
Note that this filter is intended to skip batches that are not needed; filtering of
individual rows is handled later.
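The batch-skipping idea can be sketched as follows. This is an illustrative Scala example, not
the Spark implementation: BatchStats, mightContain, and selectBatches are hypothetical names.
Given per-batch min/max stats and a predicate such as `value >= lower`, a batch whose max is
below the bound cannot contain any matching rows and can be skipped before decompression;
surviving batches still go through row-level filtering afterwards.

```scala
// Hypothetical per-batch stats, standing in for the min/max a serializer stores.
final case class BatchStats(min: Int, max: Int)

// Keep a batch only if its [min, max] range could satisfy `value >= lower`.
// This may keep batches with no matching rows (false positives are fine),
// but it never drops a batch that could contain a match.
def mightContain(stats: BatchStats, lower: Int): Boolean =
  stats.max >= lower

def selectBatches(batches: Seq[BatchStats], lower: Int): Seq[BatchStats] =
  batches.filter(mightContain(_, lower))
```

The filter only has to be conservative in one direction: skipping is an optimization, and any
rows in the batches it keeps are re-checked individually later.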