What is Apache Spark?
Apache Spark is an open source cluster computing system that aims to make data analytics fast: both fast to run and fast to write.
To run programs faster, Spark provides a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data much faster than disk-based engines such as Hadoop MapReduce.
What can it do?
Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can run up to 100x faster than Hadoop MapReduce. You can also use Spark for general data processing; check out our example jobs.
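To illustrate the iterative case, here is a rough sketch in the spirit of Spark's logistic regression example. It assumes a SparkContext named `sc`, a hypothetical `parsePoint` helper that turns a text line into a labeled feature vector, and placeholder values for the input path, dimension `D`, and `ITERATIONS`; the details are illustrative, not a definitive API reference.

```scala
// Sketch: an iterative machine learning job over a cached dataset.
// cache() keeps the parsed points in cluster memory, so each of the
// ITERATIONS passes scans RAM instead of re-reading and re-parsing
// the input file from disk.
val points = sc.textFile("hdfs://...").map(parsePoint).cache()
var w = Vector.random(D)  // initial weight vector
for (i <- 1 to ITERATIONS) {
  // each pass scans the in-memory points to compute a gradient
  val gradient = points.map { p =>
    (1 / (1 + math.exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}
```

Because the expensive parsing work happens once and the data stays in memory, each subsequent iteration is limited by computation rather than disk I/O, which is where the large speedups over MapReduce come from.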
While Spark is a new engine, it can access any data source supported by Hadoop, making it easy to run over existing data.
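For example, Spark reads through Hadoop's storage APIs, so any URI scheme Hadoop understands (HDFS, S3, local files) can be handed to the same job. A minimal sketch, assuming a SparkContext `sc` and a hypothetical HDFS path:

```scala
// Sketch: pointing Spark at existing Hadoop data.
// The path below is a placeholder; anything Hadoop can read works.
val file = sc.textFile("hdfs://namenode:9000/path/to/logs")
val errorCount = file.filter(line => line.contains("ERROR")).count()
```

No data migration is needed: the job runs directly over files already stored in an existing Hadoop deployment.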
Who uses it?
Spark was initially created in the UC Berkeley AMPLab, but it is now used and developed at a wide array of companies. See our powered by page for a list of users, and our list of committers. In total, over 25 companies have contributed code to Spark. Spark is open source under the Apache License, so download it and try it out.
Apache Incubator notice
Apache Spark is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.