Package org.apache.spark.sql.columnar

Class SimpleMetricsCachedBatchSerializer

Object
    org.apache.spark.sql.columnar.SimpleMetricsCachedBatchSerializer

All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, CachedBatchSerializer, scala.Serializable

public abstract class SimpleMetricsCachedBatchSerializer
extends Object
implements CachedBatchSerializer, org.apache.spark.internal.Logging
Provides basic filtering for CachedBatchSerializer implementations.

To extend this class, every batch produced by your serializer must be an instance of SimpleMetricsCachedBatch.

This class does not calculate the metrics that are stored in the batches; that is up to each implementation. The required metrics are really just min and max values, and even those are optional, especially for complex types. Because the metrics are simple, and because the data will likely be compressed as well, each implementation is left to decide on the most efficient way to calculate them, possibly combining the calculation with compression passes over the same data.
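As a rough illustration, here is a minimal Scala sketch of a batch type that satisfies this requirement. The name MyCachedBatch and the byte-buffer payload are hypothetical, not part of the API:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.columnar.SimpleMetricsCachedBatch

// A hypothetical batch type for a custom serializer. The only hard requirement
// for extending SimpleMetricsCachedBatchSerializer is that every batch mixes in
// SimpleMetricsCachedBatch, i.e. exposes a `stats` InternalRow carrying the
// (optional) per-column metrics that buildFilter consults.
case class MyCachedBatch(
    numRows: Int,                // required by CachedBatch
    buffers: Array[Array[Byte]], // illustrative payload: one buffer per column
    stats: InternalRow)          // the min/max (and related) metrics row
  extends SimpleMetricsCachedBatch {
  // Illustrative size estimate; an implementation may compute this differently.
  override def sizeInBytes: Long = buffers.map(_.length.toLong).sum
}
```

How the min/max values are computed and written into `stats` is up to the implementation, for example during the same pass that compresses the column buffers.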
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors
SimpleMetricsCachedBatchSerializer()

Method Summary
Modifier and Type: scala.Function2<Object, scala.collection.Iterator<CachedBatch>, scala.collection.Iterator<CachedBatch>>
Method: buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)
Description: Builds a function that can be used to filter batches prior to being decompressed.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.sql.columnar.CachedBatchSerializer
convertCachedBatchToColumnarBatch, convertCachedBatchToInternalRow, convertColumnarBatchToCachedBatch, convertInternalRowToCachedBatch, supportsColumnarInput, supportsColumnarOutput, vectorTypes
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq
Constructor Details

SimpleMetricsCachedBatchSerializer
public SimpleMetricsCachedBatchSerializer()

Method Details

buildFilter
public scala.Function2<Object, scala.collection.Iterator<CachedBatch>, scala.collection.Iterator<CachedBatch>> buildFilter(scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> predicates, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Attribute> cachedAttributes)

Description copied from interface: CachedBatchSerializer
Builds a function that can be used to filter batches prior to being decompressed. In most cases, extending SimpleMetricsCachedBatchSerializer will provide the necessary filter logic; you need only supply the metrics for it to work. SimpleMetricsCachedBatch provides the APIs to hold those metrics and documents the metrics used, really just min and max. Note that this function is intended to skip batches that are not needed; the actual filtering of individual rows is handled later.

Specified by:
buildFilter in interface CachedBatchSerializer

Parameters:
predicates - the set of expressions to use for filtering.
cachedAttributes - the schema/attributes of the cached data. This can be helpful if you don't store the schema with the data.

Returns:
a function that takes the partition id and the iterator of batches in that partition, and returns an iterator of the batches that should be decompressed.
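For context, a hedged sketch of how the returned function might be applied per partition. The helper name, parameters, and RDD plumbing here are assumptions for illustration, not part of this API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.columnar.{CachedBatch, SimpleMetricsCachedBatchSerializer}

// Sketch: apply the batch-level filter to each partition before decompressing.
// `serializer`, `predicates`, `cachedAttributes`, and `cachedRDD` stand in for
// values an implementation would already have in scope.
def filterBatches(
    serializer: SimpleMetricsCachedBatchSerializer,
    predicates: Seq[Expression],
    cachedAttributes: Seq[Attribute],
    cachedRDD: RDD[CachedBatch]): RDD[CachedBatch] = {
  val filterFunc = serializer.buildFilter(predicates, cachedAttributes)
  // The returned function takes the partition id and that partition's batch
  // iterator, yielding only batches whose stats might satisfy the predicates.
  cachedRDD.mapPartitionsWithIndex { (partitionId, batches) =>
    filterFunc(partitionId, batches)
  }
}
```

Batches dropped here are never decompressed; any rows in the surviving batches that fail the predicates are filtered out by later stages of the scan.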