pyspark.RDD.aggregateByKey¶

RDD.aggregateByKey(zeroValue: U, seqFunc: Callable[[U, V], U], combFunc: Callable[[U, U], U], numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, U]][source]¶

Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s, The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.

New in version 1.1.0.

Parameters

zeroValueU: the initial value for the accumulated result of each partition
seqFuncfunction: a function to merge a V into a U
combFuncfunction: a function to combine two U’s into a single one
numPartitionsint, optional: the number of partitions in new RDD
partitionFuncfunction, optional, default portable_hash: function to compute the partition index

Returns

RDD: a RDD containing the keys and the aggregated result for each key

See also

RDD.reduceByKey()
RDD.combineByKey()
RDD.foldByKey()
RDD.groupByKey()

Examples

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
>>> seqFunc = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combFunc = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sorted(rdd.aggregateByKey((0, 0), seqFunc, combFunc).collect())
[('a', (3, 2)), ('b', (1, 1))]

pyspark.RDD.aggregate

pyspark.RDD.barrier