pyspark.sql.DataFrameWriter.sortBy

DataFrameWriter.sortBy(col: Union[str, List[str], Tuple[str, ...]], *cols: Optional[str]) → pyspark.sql.readwriter.DataFrameWriter[source]

Sorts the output in each bucket by the given columns on the file system.

New in version 2.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : str, tuple or list

a name of a column, or a list of names.

cols : str

additional names (optional). If col is a list it should be empty.

Examples

Write a DataFrame into a Parquet file in a sorted-bucketed manner, and read it back.

>>> # Write a DataFrame into a Parquet file in a sorted-bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS sorted_bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(1, "name").sortBy("age").mode(
...     "overwrite").saveAsTable("sorted_bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("sorted_bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE sorted_bucketed_table")