pyspark.sql.SparkSession

class pyspark.sql.SparkSession(sparkContext: pyspark.context.SparkContext, jsparkSession: Optional[py4j.java_gateway.JavaObject] = None, options: Dict[str, Any] = {})

The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the following builder pattern:

Changed in version 3.4.0: Supports Spark Connect.

builder

Examples

Create a Spark session.

>>> spark = (
...     SparkSession.builder
...         .master("local")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )

Create a Spark session with Spark Connect.

>>> spark = (
...     SparkSession.builder
...         .remote("sc://localhost")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )  

Methods

active()

Returns the active or default SparkSession for the current thread, returned by the builder.
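
For example, a minimal sketch assuming the session was created with the builder pattern shown above:

>>> spark = SparkSession.builder.getOrCreate()
>>> SparkSession.active() is spark
True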

addArtifact(*path[, pyfile, archive, file])

Add artifact(s) to the client session.

addArtifacts(*path[, pyfile, archive, file])

Add artifact(s) to the client session.
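
A hedged sketch for Spark Connect sessions; the file name my_module.py is hypothetical, and addArtifact above accepts the same arguments:

>>> spark.addArtifacts("my_module.py", pyfile=True)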

addTag(tag)

Add a tag to be assigned to all the operations started by this thread in this session.
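
A sketch of the tag lifecycle on a Spark Connect session, also illustrating getTags(), removeTag(), and clearTags() below; the tag name is illustrative:

>>> spark.addTag("nightly-etl")
>>> spark.getTags()
{'nightly-etl'}
>>> spark.removeTag("nightly-etl")
>>> spark.clearTags()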

clearTags()

Clear the current thread’s operation tags.

copyFromLocalToFs(local_path, dest_path)

Copy a file from the local file system to the cloud storage file system.
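
A hedged sketch for Spark Connect sessions; both paths are hypothetical, and the destination must be a file system the connected server can write to:

>>> spark.copyFromLocalToFs("/tmp/model.bin", "hdfs:///models/model.bin")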

createDataFrame(data[, schema, …])

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
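
For example, building a small DataFrame from a list of tuples with inferred types (assumes an existing session named spark):

>>> df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+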

getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder.
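
A minimal sketch, assuming an existing session named spark; when no session is active in the current thread this returns None:

>>> SparkSession.getActiveSession() is spark
True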

getTags()

Get the tags that are currently set to be assigned to all the operations started by this thread.

interruptAll()

Interrupt all operations of this session currently running on the connected server.

interruptOperation(op_id)

Interrupt an operation of this session with the given operationId.

interruptTag(tag)

Interrupt all operations of this session with the given operation tag.
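
A hedged sketch of the interrupt methods above, for Spark Connect sessions; both calls return the IDs of the operations they interrupted:

>>> spark.addTag("long-running")
>>> # ... start queries in this thread, then, e.g. from another thread:
>>> interrupted = spark.interruptTag("long-running")
>>> interrupted = spark.interruptAll()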

newSession()

Returns a new SparkSession as a new session, with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache.
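
For example, the new session is a distinct object but shares the underlying context:

>>> s2 = spark.newSession()
>>> s2 is spark
False
>>> s2.sparkContext is spark.sparkContext
True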

range(start[, end, step, numPartitions])

Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
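
For example, with start 1, end 7 (exclusive), and step 2:

>>> spark.range(1, 7, 2).collect()
[Row(id=1), Row(id=3), Row(id=5)]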

removeTag(tag)

Remove a tag previously added to be assigned to all the operations started by this thread in this session.

sql(sqlQuery[, args])

Returns a DataFrame representing the result of the given query.
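
For example, a parameterized query using the args mapping (the :threshold marker is bound from the dictionary):

>>> spark.sql(
...     "SELECT id FROM range(10) WHERE id > :threshold",
...     args={"threshold": 7},
... ).show()
+---+
| id|
+---+
|  8|
|  9|
+---+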

stop()

Stop the underlying SparkContext.

table(tableName)

Returns the specified table as a DataFrame.
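
For example, reading back a registered temporary view (the view name tbl is illustrative):

>>> spark.createDataFrame([(1,)], ["x"]).createOrReplaceTempView("tbl")
>>> spark.table("tbl").count()
1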

Attributes

builder

catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
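
A minimal sketch on a fresh local session:

>>> [db.name for db in spark.catalog.listDatabases()]
['default']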

client

Gives access to the Spark Connect client.

conf

Runtime configuration interface for Spark.
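
For example, setting and reading back an option at runtime (values round-trip as strings):

>>> spark.conf.set("spark.sql.shuffle.partitions", "8")
>>> spark.conf.get("spark.sql.shuffle.partitions")
'8'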

read

Returns a DataFrameReader that can be used to read data in as a DataFrame.
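
A minimal sketch; the path users.parquet is hypothetical:

>>> df = spark.read.parquet("users.parquet")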

readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
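
For example, using the built-in rate source, which generates rows continuously:

>>> sdf = spark.readStream.format("rate").load()
>>> sdf.isStreaming
True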

sparkContext

Returns the underlying SparkContext.

streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.
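
A sketch that starts a throwaway query against the memory sink and lists it (the query name rates is illustrative):

>>> q = (
...     spark.readStream.format("rate").load()
...         .writeStream.format("memory").queryName("rates").start()
... )
>>> [s.name for s in spark.streams.active]
['rates']
>>> q.stop()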

udf

Returns a UDFRegistration for UDF registration.
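
For example, registering a Python lambda for use in SQL (the name plus_one is illustrative; the wrapper returned by register is discarded here):

>>> _ = spark.udf.register("plus_one", lambda x: x + 1, "int")
>>> spark.sql("SELECT plus_one(41) AS v").show()
+---+
|  v|
+---+
| 42|
+---+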

udtf

Returns a UDTFRegistration for UDTF registration.
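
A hedged sketch of registering a simple table function; the class Ones and the SQL name ones are illustrative:

>>> from pyspark.sql.functions import udtf
>>> @udtf(returnType="n: int")
... class Ones:
...     def eval(self):
...         yield (1,)
...
>>> _ = spark.udtf.register("ones", Ones)
>>> spark.sql("SELECT * FROM ones()").show()
+---+
|  n|
+---+
|  1|
+---+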

version

The version of Spark on which this application is running.