pyspark.RDD.fold

RDD.fold(zeroValue: T, op: Callable[[T, T], T]) → T

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral “zero value.”

The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. The fold may be applied to each partition individually, and those per-partition results are then folded into the final result, rather than applying the fold to every element sequentially in some defined ordering. For functions that are not commutative, the result may therefore differ from that of a fold applied to a non-distributed collection.
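The partition-wise evaluation described above can be sketched in plain Python, with no Spark required; `simulate_fold` is a hypothetical helper that mirrors how fold seeds each partition with the zero value and then seeds the final merge with it once more:

```python
from functools import reduce
from operator import add

def simulate_fold(partitions, zero_value, op):
    # Fold each partition independently, seeding each with zero_value ...
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    # ... then fold the per-partition results, seeded with zero_value again.
    return reduce(op, per_partition, zero_value)

# With a true neutral element (0 for addition), the result matches a
# plain sequential fold:
print(simulate_fold([[1, 2], [3, 4, 5]], 0, add))  # 15

# A non-neutral zero value is folded in once per partition plus once at
# the end, so the result depends on how the data is partitioned:
print(simulate_fold([[1, 2], [3, 4, 5]], 1, add))  # 18, not 16
```

This is why the zero value should be a neutral element for op: 0 for addition, 1 for multiplication, or an empty list for concatenation.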

New in version 0.7.0.

Parameters
zeroValue : T

the initial value for the accumulated result of each partition; it should be a neutral element for op, since it is applied once for each partition and once more when combining the partition results

op : function

a function used to both accumulate results within a partition and combine results from different partitions

Returns
T

the aggregated result

Examples

>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
15
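
As a further sketch, assuming the same running SparkContext `sc`: because the zero value is applied once per partition and once more when merging, a non-neutral zero value shifts the result by (number of partitions + 1) times that value. With five elements spread over five partitions:

>>> sc.parallelize([1, 2, 3, 4, 5], 5).fold(10, add)
75

Here 75 = 15 (the sum of the elements) + 5 × 10 (one zero value per partition) + 10 (the final merge).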