pyspark.SparkContext¶

class pyspark.SparkContext(master: Optional[str] = None, appName: Optional[str] = None, sparkHome: Optional[str] = None, pyFiles: Optional[List[str]] = None, environment: Optional[Dict[str, Any]] = None, batchSize: int = 0, serializer: pyspark.serializers.Serializer = CloudPickleSerializer(), conf: Optional[pyspark.conf.SparkConf] = None, gateway: Optional[py4j.java_gateway.JavaGateway] = None, jsc: Optional[py4j.java_gateway.JavaObject] = None, profiler_cls: Type[pyspark.profiler.BasicProfiler] = <class 'pyspark.profiler.BasicProfiler'>, udf_profiler_cls: Type[pyspark.profiler.UDFBasicProfiler] = <class 'pyspark.profiler.UDFBasicProfiler'>, memory_profiler_cls: Type[pyspark.profiler.MemoryProfiler] = <class 'pyspark.profiler.MemoryProfiler'>)[source]¶

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster.

When you create a new SparkContext, at least the master and app name should be set, either through the named parameters here or through conf.

Parameters

masterstr, optional: Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
appNamestr, optional: A name for your job, to display on the cluster web UI.
sparkHomestr, optional: Location where Spark is installed on cluster nodes.
pyFileslist, optional: Collection of .zip or .py files to send to the cluster and add to PYTHONPATH. These can be paths on the local file system or HDFS, HTTP, HTTPS, or FTP URLs.
environmentdict, optional: A dictionary of environment variables to set on worker nodes.
batchSizeint, optional, default 0: The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size
serializerSerializer, optional, default CPickleSerializer: The serializer for RDDs.
confSparkConf, optional: An object setting Spark properties.
gatewayclass:py4j.java_gateway.JavaGateway, optional: Use an existing gateway and JVM, otherwise a new JVM will be instantiated. This is only used internally.
jscclass:py4j.java_gateway.JavaObject, optional: The JavaSparkContext instance. This is only used internally.
profiler_clstype, optional, default BasicProfiler: A class of custom Profiler used to do profiling
udf_profiler_clstype, optional, default UDFBasicProfiler: A class of custom Profiler used to do udf profiling

Notes

Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.

SparkContext instance is not supported to share across multiple processes out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing purpose.

Examples

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> sc2 = SparkContext('local', 'test2') 
Traceback (most recent call last):
    ...
ValueError: ...

Methods

`accumulator`(value[, accum_param])	Create an `Accumulator` with the given initial value, using a given `AccumulatorParam` helper object to define how to add values of the data type if provided.
`addArchive`(path)	Add an archive to be downloaded with this Spark job on every node.
`addFile`(path[, recursive])	Add a file to be downloaded with this Spark job on every node.
`addJobTag`(tag)	Add a tag to be assigned to all the jobs started by this thread.
`addPyFile`(path)	Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future.
`binaryFiles`(path[, minPartitions])	Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array.
`binaryRecords`(path, recordLength)	Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
`broadcast`(value)	Broadcast a read-only variable to the cluster, returning a `Broadcast` object for reading it in distributed functions.
`cancelAllJobs`()	Cancel all jobs that have been scheduled or are running.
`cancelJobGroup`(groupId)	Cancel active jobs for the specified group.
`cancelJobsWithTag`(tag)	Cancel active jobs that have the specified tag.
`clearJobTags`()	Clear the current thread’s job tags.
`dump_profiles`(path)	Dump the profile stats into directory path
`emptyRDD`()	Create an `RDD` that has no partitions or elements.
`getCheckpointDir`()	Return the directory where RDDs are checkpointed.
`getConf`()	Return a copy of this SparkContext’s configuration `SparkConf`.
`getJobTags`()	Get the tags that are currently set to be assigned to all the jobs started by this thread.
`getLocalProperty`(key)	Get a local property set in this thread, or null if it is missing.
`getOrCreate`([conf])	Get or instantiate a `SparkContext` and register it as a singleton object.
`hadoopFile`(path, inputFormatClass, keyClass, …)	Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`hadoopRDD`(inputFormatClass, keyClass, valueClass)	Read an ‘old’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
`newAPIHadoopFile`(path, inputFormatClass, …)	Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`newAPIHadoopRDD`(inputFormatClass, keyClass, …)	Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
`parallelize`(c[, numSlices])	Distribute a local Python collection to form an RDD.
`pickleFile`(name[, minPartitions])	Load an RDD previously saved using `RDD.saveAsPickleFile()` method.
`range`(start[, end, step, numSlices])	Create a new RDD of int containing elements from start to end (exclusive), increased by step every element.
`removeJobTag`(tag)	Remove a tag previously added to be assigned to all the jobs started by this thread.
`runJob`(rdd, partitionFunc[, partitions, …])	Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
`sequenceFile`(path[, keyClass, valueClass, …])	Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
`setCheckpointDir`(dirName)	Set the directory under which RDDs are going to be checkpointed.
`setInterruptOnCancel`(interruptOnCancel)	Set the behavior of job cancellation from jobs started in this thread.
`setJobDescription`(value)	Set a human readable description of the current job.
`setJobGroup`(groupId, description[, …])	Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
`setLocalProperty`(key, value)	Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
`setLogLevel`(logLevel)	Control our logLevel.
`setSystemProperty`(key, value)	Set a Java system property, such as spark.executor.memory.
`show_profiles`()	Print the profile stats to stdout
`sparkUser`()	Get SPARK_USER for user who is running SparkContext.
`statusTracker`()	Return `StatusTracker` object
`stop`()	Shut down the `SparkContext`.
`textFile`(name[, minPartitions, use_unicode])	Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
`union`(rdds)	Build the union of a list of RDDs.
`wholeTextFiles`(path[, minPartitions, …])	Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.

Attributes

`PACKAGE_EXTENSIONS`
`applicationId`	A unique identifier for the Spark application.
`defaultMinPartitions`	Default min number of partitions for Hadoop RDDs when not given by user
`defaultParallelism`	Default level of parallelism to use when not given by user (e.g.
`listArchives`	Returns a list of archive paths that are added to resources.
`listFiles`	Returns a list of file paths that are added to resources.
`resources`	Return the resource information of this `SparkContext`.
`startTime`	Return the epoch time when the `SparkContext` was started.
`uiWebUrl`	Return the URL of the SparkUI instance started by this `SparkContext`
`version`	The version of Spark on which this application is running.

Spark Core

pyspark.RDD