Spark Release 0.7.0

The Spark team is proud to release version 0.7.0, a new major release that brings several new features. Most notable are a Python API for Spark and an alpha of Spark Streaming. (Details on Spark Streaming can also be found in this technical report.) The release also adds numerous other improvements across the board. Overall, this is our biggest release to date, with 31 contributors, of whom 20 were external to Berkeley.

You can download Spark 0.7.0 as either a source package (4 MB tar.gz) or prebuilt package (60 MB tar.gz).

Python API

Spark 0.7 adds a Python API called PySpark that makes it possible to use Spark from Python, both in standalone programs and in interactive Python shells. It uses the standard CPython runtime, so your programs can call into native libraries like NumPy and SciPy. Like the Scala and Java APIs, PySpark will automatically ship functions from your main program, along with the variables they depend on, to the cluster. PySpark supports most Spark features, including RDDs, accumulators, broadcast variables, and HDFS input and output.

Spark Streaming Alpha

Spark Streaming is a new extension of Spark that adds near-real-time processing capability. It offers a simple and high-level API, where users can transform streams using parallel operations like map, filter, reduce, and new sliding window functions. It automatically distributes work over a cluster and provides efficient fault recovery with exactly-once semantics for transformations, without relying on costly transactions to an external system. Spark Streaming is described in more detail in these slides and our technical report. This release is our first alpha of Spark Streaming, with most of the functionality implemented and APIs in Java and Scala.
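
As a rough illustration, here is a minimal Scala sketch of a windowed word count in the spirit of the streaming API described above. It is not taken from the release itself: the constructor arguments, the socketTextStream input source, and the reduceByKeyAndWindow call follow later naming conventions and may differ slightly in this alpha.

    import spark.streaming.{Seconds, StreamingContext}
    import spark.streaming.StreamingContext._  // window and pair-stream operations

    // Count words arriving on a text socket over a 30-second sliding window.
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)        // assumed input source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKeyAndWindow(_ + _, Seconds(30)) // sliding window reduce
    counts.print()
    ssc.start()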

Memory Dashboard

Spark jobs now launch a web dashboard for monitoring the memory usage of each distributed dataset (RDD) in the program. Look for lines like this in your log:

15:08:44 INFO BlockManagerUI: Started BlockManager web UI at http://mbk.local:63814

You can also control which port to use through the spark.ui.port property.
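
Spark 0.7 is configured through Java system properties, so one way to do this is to set the property before the SparkContext is created. A minimal sketch (the port number and application name are arbitrary):

    // Set before creating the SparkContext, so the web UI binds to the chosen port.
    System.setProperty("spark.ui.port", "33000")
    val sc = new spark.SparkContext("local", "MyJob")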

Maven Build

Spark can now be built using Maven in addition to SBT. The Maven build enables easier publishing to repositories of your choice, simpler selection of Hadoop versions using Maven profiles (-Phadoop1 or -Phadoop2), and Debian packaging via mvn -Phadoop1,deb install.

New Operations

This release adds several new RDD transformations, including keys, values, keyBy, subtract, coalesce, and zip. It also adds SparkContext.hadoopConfiguration, which lets programs configure Hadoop input/output settings globally across operations. Finally, it adds the RDD.toDebugString() method, which prints an RDD's lineage graph for troubleshooting.
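
As a quick illustration, the Scala sketch below exercises these operations on small in-memory datasets; the dataset contents and application name are made up for the example.

    import spark.SparkContext
    import spark.SparkContext._  // implicit conversions for pair-RDD operations

    val sc = new SparkContext("local", "NewOpsExample")

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    pairs.keys.collect()                                   // "a", "b", "c"
    pairs.values.collect()                                 // 1, 2, 3

    val nums = sc.parallelize(1 to 10, 4)
    val byParity = nums.keyBy(_ % 2)                       // key each element by n % 2
    val lowHalf  = nums.subtract(sc.parallelize(6 to 10))  // 1 through 5
    val fewer    = nums.coalesce(2)                        // shrink to 2 partitions
    val zipped   = nums.zip(nums.map(_ * 10))              // (1,10), (2,20), ...

    println(fewer.toDebugString)                           // print the RDD's lineage graph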

EC2 Improvements

  • Spark will now read S3 credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, if set, making it easier to access Amazon S3.
  • This release fixes a bug with S3 access that left streams open when they were not fully read (for example, when calling RDD.first() or running a SQL query with a limit), causing nodes to hang.
  • The EC2 scripts now support both standalone and Mesos clusters, and launch Ganglia on the cluster.
  • Spark EC2 clusters can now be spread across multiple availability zones.

Other Improvements

  • Shuffle operations like groupByKey and reduceByKey now infer their level of parallelism from the number of partitions of the parent RDD (unless spark.default.parallelism is set).
  • Several performance improvements to shuffles.
  • The standalone deploy cluster now spreads jobs out across machines by default, leading to better data locality.
  • Better error reporting when jobs cannot be launched due to insufficient resources.
  • Standalone deploy web UI now includes JSON endpoints for querying cluster state.
  • Better support for the IBM JVM.
  • Default Hadoop version dependency updated to 1.0.4.
  • Improved failure handling and reporting of error messages.
  • Separate configuration for standalone cluster daemons and user applications.
  • Significant refactoring of the scheduler codebase to enable richer unit testing.
  • Several bug and performance fixes throughout.

Compatibility

This release is API-compatible with Spark 0.6 programs, but the following features changed slightly:

  • Parallel shuffle operations where you don't specify a level of parallelism use the number of partitions of the parent RDD instead of a constant default. However, if you set spark.default.parallelism, they will use that.
  • SparkContext.addFile, which distributes a file to worker nodes, is no longer guaranteed to put it in the executor's working directory. Instead, you can find the directory it used with SparkFiles.getRootDirectory, or get a particular file with SparkFiles.get, as shown in the sketch below. This change avoids cluttering the local directory when running in local mode.
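
A minimal Scala sketch of the new pattern (the file name and path here are hypothetical):

    import spark.{SparkContext, SparkFiles}

    val sc = new SparkContext("local", "AddFileExample")
    sc.addFile("/path/to/lookup.txt")           // hypothetical file to distribute

    // Resolve the distributed copy rather than assuming the working directory:
    val dir  = SparkFiles.getRootDirectory()    // directory containing added files
    val path = SparkFiles.get("lookup.txt")     // full path to this particular file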

Credits

Spark 0.7 was the work of many contributors from Berkeley and outside: 31 different contributors in total, of whom 20 were from outside Berkeley. Here are the people who contributed, along with the areas they worked on:

  • Mikhail Bautin -- Maven build
  • Denny Britz -- memory dashboard, streaming, bug fixes
  • Paul Cavallaro -- error reporting
  • Tathagata Das -- streaming (lead developer), 24/7 operation, bug fixes, docs
  • Thomas Dudziak -- Maven build, Hadoop 2 support
  • Harvey Feng -- bug fix
  • Stephen Haberman -- new RDD operations, configuration, S3 improvements, code cleanup, bug fixes
  • Tyson Hamilton -- JSON status endpoints
  • Mark Hamstra -- API improvements, docs
  • Michael Heuer -- docs
  • Shane Huang -- shuffle performance fixes
  • Andy Konwinski -- docs
  • Ryan LeCompte -- streaming
  • Haoyuan Li -- streaming
  • Richard McKinley -- build
  • Sean McNamara -- streaming
  • Lee Moon Soo -- bug fix
  • Fernand Pajot -- bug fix
  • Nick Pentreath -- Python API, examples
  • Andrew Psaltis -- bug fixes
  • Imran Rashid -- memory dashboard, bug fixes
  • Charles Reiss -- fault recovery fixes, code cleanup, testability, error reporting
  • Josh Rosen -- Python API (lead developer), EC2 scripts, bug fixes
  • Peter Sankauskas -- EC2 scripts
  • Prashant Sharma -- streaming
  • Shivaram Venkataraman -- EC2 scripts, optimizations
  • Patrick Wendell -- streaming, bug fixes, examples, docs
  • Reynold Xin -- optimizations, UI
  • Haitao Yao -- run scripts
  • Matei Zaharia -- streaming, fault recovery, Python API, code cleanup, bug fixes, docs
  • Eric Zhang -- examples

Thanks to everyone who contributed!

