pyspark.sql.DataFrame.join

DataFrame.join(other: pyspark.sql.dataframe.DataFrame, on: Union[str, List[str], pyspark.sql.column.Column, List[pyspark.sql.column.Column], None] = None, how: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]

Joins with another DataFrame, using the given join expression.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
otherDataFrame

Right side of the join

onstr, list or Column, optional

a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.

howstr, optional

default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

Returns
DataFrame

Joined DataFrame.

Examples

The following performs a full outer join between df1 and df2.

>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import desc
>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name")
>>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])
>>> df3 = spark.createDataFrame([Row(age=2, name="Alice"), Row(age=5, name="Bob")])
>>> df4 = spark.createDataFrame([
...     Row(age=10, height=80, name="Alice"),
...     Row(age=5, height=None, name="Bob"),
...     Row(age=None, height=None, name="Tom"),
...     Row(age=None, height=None, name=None),
... ])

Inner join on columns (default)

>>> df.join(df2, 'name').select(df.name, df2.height).show()
+----+------+
|name|height|
+----+------+
| Bob|    85|
+----+------+
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).show()
+----+---+
|name|age|
+----+---+
| Bob|  5|
+----+---+

Outer join for both DataFrames on the ‘name’ column.

>>> df.join(df2, df.name == df2.name, 'outer').select(
...     df.name, df2.height).sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+
>>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Tom|    80|
|  Bob|    85|
|Alice|  NULL|
+-----+------+

Outer join for both DataFrams with multiple columns.

>>> df.join(
...     df3,
...     [df.name == df3.name, df.age == df3.age],
...     'outer'
... ).select(df.name, df3.age).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+