pyspark.RDD.groupBy

RDD.groupBy(f: Callable[[T], K], numPartitions: Optional[int] = None, partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, Iterable[T]]]

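Return an RDD of grouped items. Each element of the result is a pair of a key computed by f and an iterable of the items that produced that key.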

New in version 0.7.0.

Parameters

f : function
    a function to compute the key
numPartitions : int, optional
    the number of partitions in new RDD
partitionFunc : function, optional, default portable_hash
    a function to compute the partition index
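
As a rough illustration of how numPartitions and partitionFunc interact, the sketch below assigns keys to partitions in plain Python, without Spark. It assumes the partition index is the value of partitionFunc taken modulo numPartitions. The names `assign_partition` and `toy_hash` are hypothetical and exist only for this example; `toy_hash` is a simplified stand-in for portable_hash.

```python
from typing import Callable, Dict, Hashable, List


def assign_partition(key: Hashable, num_partitions: int,
                     partition_func: Callable[[Hashable], int]) -> int:
    """Hypothetical helper: map a key to a partition index in [0, num_partitions)."""
    return partition_func(key) % num_partitions


def toy_hash(key: int) -> int:
    # Assumption: a stand-in for portable_hash that only handles ints.
    return key


keys = [0, 1, 2, 3, 4, 5]
layout: Dict[int, List[int]] = {}
for k in keys:
    # Record which keys end up in which partition.
    layout.setdefault(assign_partition(k, 2, toy_hash), []).append(k)

print(layout)  # {0: [0, 2, 4], 1: [1, 3, 5]}
```

All items that share a key land in the same partition, because the partition index depends only on the key.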

Returns

RDD: a new RDD of grouped items

See also

RDD.groupByKey()
pyspark.sql.DataFrame.groupBy()

Examples

>>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
>>> result = rdd.groupBy(lambda x: x % 2).collect()
>>> sorted([(x, sorted(y)) for (x, y) in result])
[(0, [2, 8]), (1, [1, 1, 3, 5])]
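
The grouping can be pictured as keying each element by f and then collecting the values for each key. The plain-Python sketch below mimics that behaviour on a local list. `local_group_by` is a hypothetical helper for illustration, not part of PySpark, and it ignores partitioning entirely.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K", bound=Hashable)


def local_group_by(items: Iterable[T], f: Callable[[T], K]) -> List[Tuple[K, List[T]]]:
    """Hypothetical local analogue of groupBy: bucket items by f(item)."""
    groups: Dict[K, List[T]] = defaultdict(list)
    for item in items:
        groups[f(item)].append(item)
    return list(groups.items())


result = local_group_by([1, 1, 2, 3, 5, 8], lambda x: x % 2)
# Sort keys and values before printing, as in the Spark example above.
print(sorted((k, sorted(v)) for k, v in result))
# [(0, [2, 8]), (1, [1, 1, 3, 5])]
```

As in the Spark example, the output is sorted before it is displayed, so the result does not depend on the order in which groups or items come back.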
