pyspark.RDD.subtractByKey

RDD.subtractByKey(other: pyspark.rdd.RDD[Tuple[K, Any]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, V]]

Return each (key, value) pair in self that has no pair with matching key in other.

New in version 0.9.1.

Parameters
other : RDD
    another RDD

numPartitions : int, optional
    the number of partitions in the new RDD

Returns
RDD
    an RDD with the pairs from this RDD whose keys are not in other

See also

RDD.subtract()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> rdd2 = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(rdd1.subtractByKey(rdd2).collect())
[('b', 4), ('b', 5)]
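
As a further sketch (assuming the same SparkContext sc and the rdd1/rdd2 defined above), subtractByKey drops pairs by key alone, while RDD.subtract() removes only exact (key, value) matches; the optional numPartitions argument sets the partitioning of the result:

>>> sorted(rdd1.subtract(rdd2).collect())  # no exact (key, value) match in rdd2, so nothing is removed
[('a', 1), ('a', 2), ('b', 4), ('b', 5)]
>>> result = rdd1.subtractByKey(rdd2, numPartitions=2)
>>> sorted(result.collect())
[('b', 4), ('b', 5)]
>>> result.getNumPartitions()
2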