pyspark.RDD.sortByKey

RDD.sortByKey(ascending: Optional[bool] = True, numPartitions: Optional[int] = None, keyfunc: Callable[[Any], Any] = <function RDD.<lambda>>) → pyspark.rdd.RDD[Tuple[K, V]][source]

Sorts this RDD, which is assumed to consist of (key, value) pairs.

New in version 0.9.1.

Parameters
ascendingbool, optional, default True

sort the keys in ascending or descending order

numPartitionsint, optional

the number of partitions in new RDD

keyfuncfunction, optional, default identity mapping

a function to compute the key

Returns
RDD

a new RDD

Examples

>>> tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>>> sc.parallelize(tmp).sortByKey().first()
('1', 3)
>>> sc.parallelize(tmp).sortByKey(True, 1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
>>> sc.parallelize(tmp).sortByKey(True, 2).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
>>> tmp2 = [('Mary', 1), ('had', 2), ('a', 3), ('little', 4), ('lamb', 5)]
>>> tmp2.extend([('whose', 6), ('fleece', 7), ('was', 8), ('white', 9)])
>>> sc.parallelize(tmp2).sortByKey(True, 3, keyfunc=lambda k: k.lower()).collect()
[('a', 3), ('fleece', 7), ('had', 2), ('lamb', 5),...('white', 9), ('whose', 6)]