pyspark.RDD.leftOuterJoin

RDD.leftOuterJoin(other: pyspark.rdd.RDD[Tuple[K, U]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, Tuple[V, Optional[U]]]]

Perform a left outer join of self and other.

For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.

Hash-partitions the resulting RDD into the given number of partitions.
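The join semantics described above can be sketched in plain Python, without a Spark cluster. This is an illustrative model only (the hypothetical helper `left_outer_join` is not part of the PySpark API): group the right side by key, then emit every (v, w) combination per key, or (v, None) when the key is absent on the right.

```python
from typing import Dict, List, Optional, Tuple


def left_outer_join(self_pairs, other_pairs):
    # Model of RDD.leftOuterJoin on plain lists of (key, value) pairs.
    # Group the right-hand side's values by key.
    other_by_key: Dict = {}
    for k, w in other_pairs:
        other_by_key.setdefault(k, []).append(w)

    result: List[Tuple] = []
    for k, v in self_pairs:
        if k in other_by_key:
            # One output pair per matching right-side value.
            for w in other_by_key[k]:
                result.append((k, (v, w)))
        else:
            # No match on the right: keep the left element with None.
            result.append((k, (v, None)))
    return result


print(sorted(left_outer_join([("a", 1), ("b", 4)], [("a", 2), ("a", 3)])))
# [('a', (1, 2)), ('a', (1, 3)), ('b', (4, None))]
```

Note that a key with multiple matches in other produces multiple output pairs, one per matching value.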

New in version 0.7.0.

Parameters
other : RDD

another RDD

numPartitions : int, optional

the number of partitions in the new RDD

Returns
RDD

an RDD containing all pairs of elements with matching keys, plus a pair (k, (v, None)) for each key in self with no match in other

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2)])
>>> sorted(rdd1.leftOuterJoin(rdd2).collect())
[('a', (1, 2)), ('b', (4, None))]