pyspark.RDD.keyBy

RDD.keyBy(f: Callable[[T], K]) → pyspark.rdd.RDD[Tuple[K, T]]

Creates (key, element) tuples from the elements of this RDD, computing each key by applying f to the element.

New in version 0.9.1.

Parameters

f (function): a function to compute the key for each element

Returns

RDD: an RDD of (key, element) tuples, where each key is the result of applying f to the element

See also

RDD.map()
RDD.keys()
RDD.values()

Examples

>>> rdd1 = sc.parallelize(range(0,3)).keyBy(lambda x: x*x)
>>> rdd2 = sc.parallelize(zip(range(0,5), range(0,5)))
>>> [(x, list(map(list, y))) for x, y in sorted(rdd1.cogroup(rdd2).collect())]
[(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]])]
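The keying step on its own can be sketched in plain Python, without a SparkContext. The helper name key_by is hypothetical and is not part of PySpark; it only mirrors the (f(x), x) pairing described above, on a local list instead of a distributed RDD.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K")


def key_by(f: Callable[[T], K], items: Iterable[T]) -> List[Tuple[K, T]]:
    """Hypothetical local stand-in for RDD.keyBy: pair each element with f(element)."""
    # The key comes first and the original element second, matching Tuple[K, T].
    return [(f(x), x) for x in items]


words = ["apple", "avocado", "banana"]
pairs = key_by(lambda w: w[0], words)
print(pairs)  # [('a', 'apple'), ('a', 'avocado'), ('b', 'banana')]
```

Assuming a live SparkContext sc, sc.parallelize(words).keyBy(lambda w: w[0]).collect() would be expected to return the same pairs.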
