pyspark.RDD.foldByKey

RDD.foldByKey(zeroValue: V, func: Callable[[V, V], V], numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, V]][source]

Merge the values for each key using an associative function “func” and a neutral “zeroValue” which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.).

New in version 1.1.0.

Parameters
zeroValueV

the initial value for the accumulated result of each partition

funcfunction

a function to combine two V’s into a single one

numPartitionsint, optional

the number of partitions in new RDD

partitionFuncfunction, optional, default portable_hash

function to compute the partition index

Returns
RDD

a RDD containing the keys and the aggregated result for each key

Examples

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> from operator import add
>>> sorted(rdd.foldByKey(0, add).collect())
[('a', 2), ('b', 1)]