org.apache.spark.sql.connector.catalog.functions

trait AggregateFunction[S <: Serializable, R] extends BoundFunction

Interface for a function that produces a result value by aggregating over multiple input rows.

For each input row, Spark will call the update method, which should evaluate the row and update the aggregation state. The JVM type of result values produced by produceResult must be the type used by Spark's InternalRow API for the SQL data type returned by resultType(). Please refer to the class documentation of ScalarFunction for the mapping between DataType and the JVM type.

All implementations must support partial aggregation by implementing merge so that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the result.

Intermediate aggregation state must be Serializable so that state produced by parallel tasks can be serialized, shuffled, and then merged to produce a final result.
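
A minimal sketch of an implementation, assuming the function sums a single LongType column; the class name LongSum and the function name "long_sum" are illustrative, not part of this API:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.AggregateFunction
import org.apache.spark.sql.types.{DataType, LongType}

// Sums one LongType column. The intermediate state is a boxed java.lang.Long,
// which is Serializable; a null state means "no rows seen yet".
class LongSum extends AggregateFunction[java.lang.Long, java.lang.Long] {

  // Spark casts input values to these types before calling update.
  override def inputTypes(): Array[DataType] = Array(LongType)

  override def resultType(): DataType = LongType

  override def name(): String = "long_sum"

  // Start from null; every other method must then tolerate a null state.
  override def newAggregationState(): java.lang.Long = null

  override def update(state: java.lang.Long, input: InternalRow): java.lang.Long = {
    if (input.isNullAt(0)) {
      state // ignore null input values
    } else {
      val value = input.getLong(0)
      if (state == null) Long.box(value) else Long.box(state.longValue() + value)
    }
  }

  // Combine partial states produced by parallel tasks; either side may be null.
  override def merge(leftState: java.lang.Long, rightState: java.lang.Long): java.lang.Long = {
    if (leftState == null) rightState
    else if (rightState == null) leftState
    else Long.box(leftState.longValue() + rightState.longValue())
  }

  // java.lang.Long is the JVM type the InternalRow API uses for LongType values.
  override def produceResult(state: java.lang.Long): java.lang.Long = state
}

In practice such a function is exposed through an UnboundFunction registered in a FunctionCatalog; that wiring is omitted here.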

Annotations
@Evolving()
Source
AggregateFunction.java
Since
3.2.0

Linear Supertypes
BoundFunction, Function, Serializable, AnyRef, Any

Abstract Value Members

  1. abstract def inputTypes(): Array[DataType]

    Returns the required data types of the input values to this function.

    If the types returned differ from the types passed to UnboundFunction#bind(StructType), Spark will cast input values to the required data types. This allows implementations to delegate input value casting to Spark.

    returns

    an array of input value data types

    Definition Classes
    BoundFunction
  2. abstract def merge(leftState: S, rightState: S): S

    Merge two partial aggregation states.

    This is called to merge intermediate aggregation states that were produced by parallel tasks (a sketch of this call sequence follows this member list).

    leftState

    intermediate aggregation state

    rightState

    intermediate aggregation state

    returns

    combined aggregation state

  3. abstract def name(): String

    A name to identify this function. Implementations should provide a meaningful name, like the database and function name from the catalog.

    Definition Classes
    Function
  4. abstract def newAggregationState(): S

    Initialize state for an aggregation.

    This method is called one or more times for every group of values to initialize intermediate aggregation state. More than one intermediate aggregation state variable may be used when the aggregation is run in parallel tasks.

    Implementations that return null must support null state passed into all other methods.

    returns

    a state instance or null

  5. abstract def produceResult(state: S): R

    Produce the aggregation result based on intermediate state.

    state

    intermediate aggregation state

    returns

    a result value

  6. abstract def resultType(): DataType

    Returns the data type of values produced by this function.

    For example, a "plus" function may return IntegerType when it is bound to arguments that are also IntegerType.

    returns

    a data type for values produced by this function

    Definition Classes
    BoundFunction
  7. abstract def update(state: S, input: InternalRow): S

    Update the aggregation state with a new row.

    This is called for each row in a group to update an intermediate aggregation state.

    state

    intermediate aggregation state

    input

    an input row

    returns

    updated aggregation state
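
These members cooperate as sketched below. This is a conceptual driver only, not Spark's actual execution code: each parallel task folds its rows into a partial aggregation state, the partial states are merged, and the final state is turned into a result value.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.AggregateFunction

// Conceptual call sequence only, not Spark internals.
def runAggregate[S <: java.io.Serializable, R](
    func: AggregateFunction[S, R],
    partitions: Seq[Seq[InternalRow]]): R = {
  // One partial state per parallel task: newAggregationState, then update per row.
  val partialStates = partitions.map { rows =>
    rows.foldLeft(func.newAggregationState())((state, row) => func.update(state, row))
  }
  // Partial states are shuffled and merged; merge must tolerate null (empty) states.
  val finalState = partialStates.foldLeft(func.newAggregationState())(func.merge)
  func.produceResult(finalState)
}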

Concrete Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##: Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def canonicalName(): String

    Returns the canonical name of this function, used to determine if functions are equivalent.

    The canonical name is used to determine whether two functions are the same when loaded by different catalogs. For example, the same catalog implementation may be used by two environments, "prod" and "test". Functions produced by the catalogs may be equivalent, but loaded using different names, like "test.func_name" and "prod.func_name".

    Names returned by this function should be unique and unlikely to conflict with similar functions in other catalogs. For example, many catalogs may define a "bucket" function with a different implementation. Adding context, like "com.mycompany.bucket(string)", is recommended to avoid unintentional collisions (see the sketch after this member list).

    returns

    a canonical name for this function

    Definition Classes
    BoundFunction
  6. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.CloneNotSupportedException]) @IntrinsicCandidate() @native()
  7. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  8. def equals(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef → Any
  9. final def getClass(): Class[_ <: AnyRef]
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  10. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @IntrinsicCandidate() @native()
  11. def isDeterministic(): Boolean

    Returns whether this function result is deterministic.

    By default, functions are assumed to be deterministic. Functions that are not deterministic should override this method so that Spark can ensure the function runs only once for a given input.

    returns

    true if this function is deterministic, false otherwise

    Definition Classes
    BoundFunction
  12. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  13. def isResultNullable(): Boolean

    Returns whether the values produced by this function may be null.

    For example, a "plus" function may return false when it is bound to arguments that are always non-null, but true when either argument may be null.

    returns

    true if values produced by this function may be null, false otherwise

    Definition Classes
    BoundFunction
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @IntrinsicCandidate() @native()
  17. final def synchronized[T0](arg0: => T0): T0
    Definition Classes
    AnyRef
  18. def toString(): String
    Definition Classes
    AnyRef → Any
  19. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
  20. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException]) @native()
  21. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.InterruptedException])
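
The overridable BoundFunction members above can be tuned per implementation. A sketch building on the hypothetical LongSum from the class description (names are again illustrative):

// Only the overridable BoundFunction members change here.
class NamespacedLongSum extends LongSum {
  // Namespace the canonical name so that equivalent functions loaded through
  // different catalogs (e.g. "prod.long_sum" and "test.long_sum") are still
  // recognized as the same, and so the name is unlikely to collide with other
  // catalogs' functions.
  override def canonicalName(): String = "com.example.long_sum(bigint)"

  // produceResult returns null for an empty group (the state is still the
  // initial null), so results may be null; true is also the default.
  override def isResultNullable(): Boolean = true

  // Summing the same rows always yields the same value; true is the default,
  // shown only for completeness.
  override def isDeterministic(): Boolean = true
}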

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws(classOf[java.lang.Throwable]) @Deprecated
    Deprecated

    (Since version 9)
