pyspark.SparkContext.runJob

SparkContext.runJob(rdd: pyspark.rdd.RDD[T], partitionFunc: Callable[[Iterable[T]], Iterable[U]], partitions: Optional[Sequence[int]] = None, allowLocal: bool = False) → List[U]

Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.

If ‘partitions’ is not specified, this will run over all partitions.

New in version 1.1.0.

Parameters
rdd : RDD

target RDD to run tasks on

partitionFunc : function

a function to run on each partition of the RDD

partitions : list, optional

set of partitions to run on; some jobs may not want to compute on all partitions of the target RDD, e.g. for operations like first() (see the last example below)

allowLocal : bool, default False

this parameter has no effect

Returns
list

results of the specified partitions

Examples

>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
[0, 1, 4, 9, 16, 25]
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part], [0, 2], True)
[0, 1, 16, 25]
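
Because partitionFunc receives an iterator over a single partition's elements and returns an iterable, it can also aggregate within each partition. A minimal sketch, assuming the same active SparkContext bound to sc as in the examples above, that returns one sum per partition:

>>> myRDD = sc.parallelize(range(6), 3)
>>> # each partition ([0, 1], [2, 3], [4, 5]) is reduced to a single sum
>>> sc.runJob(myRDD, lambda part: [sum(part)])
[1, 5, 9]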
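
As noted under partitions, restricting the job to a subset of partitions can emulate a cheap first(). In the sketch below, again assuming an active sc, only partition 0 is computed and a single element is taken from it; the remaining partitions are never evaluated.

>>> myRDD = sc.parallelize(range(6), 3)
>>> # only partition 0 is evaluated; partitions 1 and 2 are left untouched
>>> sc.runJob(myRDD, lambda part: list(part)[:1], [0])
[0]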