pyspark.RDD.intersection

RDD.intersection(other: pyspark.rdd.RDD[T]) → pyspark.rdd.RDD[T]

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

New in version 1.0.0.

Parameters
other : RDD

another RDD

Returns
RDD

the intersection of this RDD and another one

Notes

This method performs a shuffle internally.
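
To illustrate why a shuffle is involved, here is a minimal plain-Python sketch that models the idea without Spark. It is not PySpark's actual implementation: the function name `intersection_by_hash_partition`, the `num_partitions` parameter, and the use of lists of lists to stand in for partitions are all assumptions made for illustration. Equal elements can start out in different partitions of the two inputs, so they are first routed to a common bucket by hash, and only then can each bucket be intersected on its own.

```python
def intersection_by_hash_partition(left_parts, right_parts, num_partitions=2):
    """Toy model of a shuffle-based intersection (hypothetical helper, not a Spark API).

    Each input is a list of partitions, and each partition is a plain list.
    """
    left_buckets = [set() for _ in range(num_partitions)]
    right_buckets = [set() for _ in range(num_partitions)]

    # "Shuffle" step: route every element to a bucket chosen by its hash,
    # so equal elements from both inputs land in the same bucket.
    for part in left_parts:
        for x in part:
            left_buckets[hash(x) % num_partitions].add(x)
    for part in right_parts:
        for x in part:
            right_buckets[hash(x) % num_partitions].add(x)

    # Each bucket is now intersected independently. Storing elements in sets
    # is what drops duplicates that were present in the inputs.
    return [sorted(l & r) for l, r in zip(left_buckets, right_buckets)]


# Inputs contain duplicates and are spread over two partitions each.
result = intersection_by_hash_partition(
    [[1, 1, 2], [3, 3, 4]],
    [[1, 2, 2], [3, 5]],
)
print(result)
```

In this model the elements 1, 2 and 3 each appear once in the combined output even though the inputs repeat them, which mirrors the duplicate-free result described above. The routing step is the part that corresponds to moving data between partitions.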

Examples

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]