pyspark.RDD.repartitionAndSortWithinPartitions

RDD.repartitionAndSortWithinPartitions(numPartitions: Optional[int] = None, partitionFunc: Callable[[Any], int] = <function portable_hash>, ascending: bool = True, keyfunc: Callable[[Any], Any] = <function RDD.<lambda>>) → pyspark.rdd.RDD[Tuple[Any, Any]]

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition, because it can push the sorting down into the shuffle machinery.

New in version 1.2.0.

Parameters
numPartitions : int, optional

the number of partitions in the new RDD

partitionFunc : function, optional, default portable_hash

a function to compute the partition index from each key

ascending : bool, optional, default True

if True, sort the keys in ascending order; otherwise, sort them in descending order

keyfunc : function, optional, default identity mapping

a function applied to each key to compute the sort key

Returns
RDD

a new RDD, partitioned by partitionFunc and sorted by key within each partition

Examples

>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
>>> rdd2.glom().collect()
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
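
For intuition, the same layout can be produced by shuffling with partitionBy and then sorting each partition in a separate pass; the combined method above is the more efficient route. A minimal sketch reusing the rdd defined above (the tie order for the duplicate key 0 is shown as in the example):

>>> rdd.partitionBy(2, lambda x: x % 2).mapPartitions(
...     lambda part: iter(sorted(part, key=lambda kv: kv[0]))).glom().collect()
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]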
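
The keyfunc and ascending parameters are not exercised above; a further sketch, not from the original docs, sorting string keys case-insensitively in descending order within a single partition:

>>> rdd3 = sc.parallelize([("Banana", 2), ("apple", 1), ("Date", 4), ("cherry", 3)])
>>> rdd3.repartitionAndSortWithinPartitions(
...     1, ascending=False, keyfunc=lambda k: k.lower()).glom().collect()
[[('Date', 4), ('cherry', 3), ('Banana', 2), ('apple', 1)]]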