pyspark.RDD.subtract

RDD.subtract(other: pyspark.rdd.RDD[T], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[T]

Return each value in self that is not contained in other.

New in version 0.9.1.

Parameters
other : RDD

another RDD

numPartitions : int, optional

the number of partitions in the new RDD

Returns
RDD

an RDD with the elements from this RDD that are not in other

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> rdd2 = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(rdd1.subtract(rdd2).collect())
[('a', 1), ('b', 4), ('b', 5)]
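
A further sketch (not part of the doctest above) showing the optional numPartitions argument; it assumes the same rdd1 and rdd2 and that the resulting RDD is repartitioned to the requested count:

>>> # request 2 partitions for the result (assumption: partition count follows numPartitions)
>>> result = rdd1.subtract(rdd2, numPartitions=2)
>>> result.getNumPartitions()
2
>>> sorted(result.collect())
[('a', 1), ('b', 4), ('b', 5)]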