pyspark.RDD.countByKey

RDD.countByKey() → Dict[K, int]

Count the number of elements for each key and return the result to the driver as a dictionary. The values are ignored; only the occurrences of each key are counted.

New in version 0.7.0.

Returns
dict

a dictionary of (key, count) pairs

Examples

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.countByKey().items())
[('a', 2), ('b', 1)]
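
The counting semantics can be illustrated without a Spark cluster. The following is a minimal pure-Python sketch, not the library's implementation: the helper name count_by_key and the representation of partitions as plain lists are assumptions made for illustration. It counts keys within each partition, then merges the partial counts into one dictionary, as a driver would.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Iterable, Tuple


def count_by_key(
    partitions: Iterable[Iterable[Tuple[Hashable, Any]]]
) -> Dict[Hashable, int]:
    """Hypothetical stand-in for RDD.countByKey over in-memory partitions."""
    total: Counter = Counter()
    for partition in partitions:
        # Count key occurrences within one partition; values are ignored.
        local = Counter(key for key, _value in partition)
        # Merge the per-partition counts into the overall result.
        total.update(local)
    return dict(total)


# Two partitions; the values (10, 1, 7) do not affect the counts.
partitions = [[("a", 10), ("b", 1)], [("a", 7)]]
print(sorted(count_by_key(partitions).items()))  # [('a', 2), ('b', 1)]
```

Because the full dictionary is materialized on the driver, this method is best suited to RDDs with a modest number of distinct keys. When the key set may be large, a distributed alternative such as rdd.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b) keeps the counts in an RDD instead.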