This guide documents the best way to make various types of contribution to Apache Spark, including what is required before submitting a code change.
Contributing to Spark doesn’t just mean writing code. Helping new users on the mailing list, testing releases, and improving documentation are also welcome. In fact, proposing significant code changes usually requires first gaining experience and credibility within the community by helping in other ways. This is also a guide to becoming an effective contributor.
So, this guide presents contributions roughly in the order new contributors should consider them if they intend to get involved long-term: build a track record of helping others, rather than just opening pull requests.
A great way to contribute to Spark is to help answer user questions on the user@spark.apache.org
mailing list or on StackOverflow. There are always many new Spark users; taking a few minutes to
help answer a question is a very valuable community service.
Contributors should subscribe to this list and follow it in order to keep up to date on what’s happening in Spark. Answering questions is an excellent and visible way to help the community, which also demonstrates your expertise.
See the Mailing Lists guide for guidelines about how to effectively participate in discussions on the mailing list, as well as forums like StackOverflow.
Spark’s release process is community-oriented, and members of the community can vote on new releases on the dev@spark.apache.org mailing list. Spark users are invited to subscribe to this list to receive announcements, test their workloads on newer releases, and provide feedback on any performance or correctness issues they find.
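For illustration, smoke-testing a release candidate can be as simple as downloading the staged binary tarball linked from the vote thread and running an existing workload against it; the file names below are placeholders:

```
# Placeholder tarball name taken from a vote announcement
tar xzf spark-4.0.0-rc1-bin-hadoop3.tgz
cd spark-4.0.0-rc1-bin-hadoop3

# Quick sanity checks before running real workloads
./bin/spark-submit --version
./bin/spark-shell    # then run a few representative queries
```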
Changes to Spark source code are proposed, reviewed and committed via GitHub pull requests (described later). Anyone can view and comment on active changes at https://github.com/apache/spark/pulls. Reviewing others’ changes is a good way to learn how the change process works and gain exposure to activity in various parts of the code. You can help by reviewing the changes and asking questions or pointing out issues – as simple as typos or small issues of style. See also https://spark-prs.appspot.com/ for a convenient way to view and filter open PRs.
To propose a change to release documentation (that is, docs that appear under https://spark.apache.org/docs/), edit the Markdown source files in Spark’s docs/ directory, whose README file shows how to build the documentation locally to test your changes. The process to propose a doc change is otherwise the same as the process for proposing code changes below.
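As a rough sketch of that local build (the authoritative steps live in docs/README.md; this assumes Ruby and Bundler are already installed):

```
cd docs
bundle install                          # install Jekyll and other gems

# SKIP_API=1 skips the slow API doc generation while previewing
SKIP_API=1 bundle exec jekyll serve --watch
# then open http://localhost:4000 to preview your changes
```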
To propose a change to the rest of the documentation (that is, docs that do not appear under https://spark.apache.org/docs/), similarly, edit the Markdown in the spark-website repository and open a pull request.
Just as Java and Scala applications can access a huge selection of libraries and utilities, none of which are part of Java or Scala themselves, Spark aims to support a rich ecosystem of libraries. Many new useful utilities or features belong outside of Spark rather than in the core. For example: language support probably has to be a part of core Spark, but useful machine learning algorithms can happily exist outside of MLlib.
To that end, large and independent new functionality is often rejected for inclusion in Spark itself, but can and should be hosted as a separate project and repository, and included in the spark-packages.org collection.
Ideally, bug reports are accompanied by a proposed code change to fix the bug. This isn’t always possible, as those who discover a bug may not have the experience to fix it. A bug may be reported by creating a JIRA but without creating a pull request (see below).
However, bug reports are only useful if they include enough information to understand, isolate, and ideally reproduce the bug. Simply encountering an error does not mean a bug should be reported; as noted below, search JIRA and search and inquire on the Spark user / dev mailing lists first. Unreproducible bugs, or simple error reports, may be closed.
It’s very helpful if the bug report describes how the bug was introduced, and by which commit, so that reviewers can easily understand the bug. This also helps committers decide how far the bug fix should be backported when the pull request is merged. The pull request to fix the bug should narrow down the problem to the root cause.
A performance regression is also a kind of bug. A pull request that fixes a performance regression must provide a benchmark proving that the problem is indeed fixed.
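If you don’t know which commit introduced a bug or regression, `git bisect` is one way to find it; a sketch, where the known-good revision and the reproduction step are placeholders:

```
# Binary-search history between a known-good release and a bad revision
git bisect start
git bisect bad HEAD
git bisect good v3.5.0        # placeholder: last version without the bug

# At each step, build and run whatever reproduces the bug, then mark
# the current revision with `git bisect good` or `git bisect bad`.
git bisect reset              # return to your original HEAD when done
```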
Note that data correctness/data loss bugs are very serious. Make sure the corresponding bug report JIRA ticket is labeled `correctness` or `data-loss`. If the bug report doesn’t get enough attention, please send an email to dev@spark.apache.org to draw more attention.
It is possible to propose new features as well. These are generally not helpful unless accompanied by detail, such as a design document and/or code change. Large new contributions should consider spark-packages.org first (see above), or be discussed on the mailing list first. Feature requests may be rejected, or closed after a long period of inactivity.
Given the sheer volume of issues raised in the Apache Spark JIRA, inevitably some issues are duplicates, or become obsolete and eventually fixed otherwise, or can’t be reproduced, or could benefit from more detail, and so on. It’s useful to help identify these issues and resolve them, either by advancing the discussion or even resolving the JIRA. Most contributors are able to directly resolve JIRAs. Use judgment in determining whether you are quite confident the issue should be resolved, although changes can be easily undone. If in doubt, just leave a comment on the JIRA.
When resolving JIRAs, observe a few useful conventions:

- Resolve as Fixed if there’s a change you can point to that resolved the issue
- Set Fix Version(s) if and only if the resolution is Fixed
- Set Assignee to the person who most contributed to the resolution, which is usually the person who opened the PR that resolved the issue
- For issues that can’t be reproduced against master as reported, resolve as Cannot Reproduce
- If the issue is the same as or a subset of another issue, resolve as Duplicate and link to the other issue
- If the issue seems clearly obsolete, resolve as Not a Problem; if it doesn’t make sense, resolve as Invalid; and if it’s coherent but there is clearly no support or interest in acting on it, resolve as Won’t Fix
Spark is an exceptionally busy project, with a new JIRA or pull request every few hours on average. Review can take hours or days of committer time. Everyone benefits if contributors focus on changes that are useful, clear, easy to evaluate, and already pass basic checks.
Sometimes, a contributor will already have a particular new change or bug in mind. If seeking
ideas, consult the list of starter tasks in JIRA, or ask the user@spark.apache.org
mailing list.
Before proceeding, contributors should evaluate if the proposed change is likely to be relevant, new and actionable:

- When in doubt about whether a code change is warranted, email user@spark.apache.org first about the possible change.
- Search the user@spark.apache.org and dev@spark.apache.org mailing list archives for related discussions. Often the problem has been discussed before, with a resolution that doesn’t require a code change, or one that records what kinds of changes will not be accepted as a resolution.
- Search JIRA for existing issues by typing `spark [search terms]` in the search box at the top right. If a logically similar issue already exists, contribute to the discussion on the existing JIRA and pull request first, instead of creating a new one.

It’s worth reemphasizing that changes to the core of Spark, or to highly complex and important modules like SQL and Catalyst, are more difficult to make correctly. They will be subjected to more scrutiny, and held to a higher standard of review than changes to less critical code.
While a rich set of algorithms is an important goal for MLlib, scaling the project requires that maintainability, consistency, and code quality come first. New algorithms should:

- Be widely known, used, and accepted (academic citations and concrete use cases can help justify this)
- Be highly scalable
- Be well documented
- Have APIs consistent with other algorithms in MLlib that accomplish the same thing
- Come with a reasonable expectation of developer support
- Have the `@Since` annotation on public classes, methods, and variables

Exceptions thrown in Spark should be associated with standardized and actionable error messages.
Error messages should answer the following questions:

- What went wrong?
- Why did it go wrong?
- How can users fix the problem?
For conventions to follow when writing error messages, see the error message guidelines.
Behavior changes are user-visible functional changes that appear in a new release through public APIs. The term ‘user’ here refers not only to those who write queries and/or develop Spark plugins, but also to those who deploy and/or manage Spark clusters. New features and bug fixes, such as correcting query results or schemas and failing unsupported queries that previously returned incorrect results, are considered behavior changes. However, performance improvements, code refactoring, and changes to unreleased APIs/features are not.
Everyone makes mistakes, including Spark developers. We will continue to fix defects in Spark as they arise. However, it is important to communicate behavior changes so that Spark users can prepare for version upgrades. If a PR introduces behavior changes, this should be explicitly mentioned in the PR description. If the behavior change may require additional user actions, it should be highlighted in the migration guide (docs/sql-migration-guide.md for the SQL component, and similar files for other components). Where possible, provide options to restore the previous behavior and mention these options in the error message.

These guidelines are not meant to be exhaustive. Anyone reviewing a PR can ask the PR author to add to the migration guide if they believe the change is risky and may disrupt users during an upgrade.
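As an illustration of a restore-previous-behavior option, such escape hatches are usually exposed as configuration flags that users can set at submit time; the flag name below is hypothetical (real ones live under namespaces such as spark.sql.legacy.* and are named in the migration guide):

```
# Hypothetical flag name: restores the pre-upgrade behavior for one change.
# The real flag, if any, is listed in the migration guide and error message.
./bin/spark-sql --conf spark.sql.legacy.exampleBehavior.enabled=true
```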
Before considering how to contribute code, it’s useful to understand how code is reviewed, and why changes may be rejected. See the detailed guide for code reviewers from Google’s Engineering Practices documentation. Simply put, changes that have many or large positives, and few negative effects or risks, are much more likely to be merged, and merged quickly. Risky and less valuable changes are very unlikely to be merged, and may be rejected outright rather than receive iterations of review.
Please review the preceding section before proposing a code change. This section documents how to do so.
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project’s open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project’s open source license and warrant that you have the legal authority to do so.
If you are interested in working with the newest under-development code or contributing to Apache Spark development, you can check out the master branch from Git:
```
# Master development branch
git clone https://github.com/apache/spark.git
```
Once you’ve downloaded Spark, you can find instructions for installing and building it on the documentation page.
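For reference, a common first build looks like this (Maven shown here; sbt also works, and the documentation page covers profiles and prerequisites):

```
# Build Spark and its assembly without running tests
./build/mvn -DskipTests clean package
```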
Generally, Spark uses JIRA to track logical issues, including bugs and improvements, and uses GitHub pull requests to manage the review and merge of specific code changes. That is, JIRAs are used to describe what should be fixed or changed, and high-level approaches, and pull requests describe how to implement that change in the project’s source code. For example, major design decisions are discussed in JIRA.
Trivial changes, where what should change is virtually the same as how it should change, do not require a JIRA; a pull request titled “Fix typos in Foo scaladoc” is a good example. When a JIRA is created, a few labels are valuable:

- `correctness`: a correctness issue
- `data-loss`: a data loss issue
- `release-notes`: the change’s effects need mention in release notes. The JIRA or pull request should include detail suitable for inclusion in release notes – see “Docs Text” below.
- `starter`: a small, simple change suitable for new contributors

If the change is large, consider inviting discussion on the issue at dev@spark.apache.org first before proceeding to implement it.

Before creating a pull request in Apache Spark, it is important to check that tests pass on your branch, because our GitHub Actions workflows automatically run tests for your pull request and every subsequent commit, and each run burdens the limited GitHub Actions resources of the Apache Spark repository. The steps below will take you through the process. When adding or updating tests as part of your change, include the relevant JIRA ID in the test name or a nearby comment, for example:
test("SPARK-12345: a short description of the test") {
...
@Test
public void testCase() {
// SPARK-12345: a short description of the test
...
def test_case(self):
# SPARK-12345: a short description of the test
...
test_that("SPARK-12345: a short description of the test", {
...
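While iterating, you can also run a targeted subset of tests before the full `./dev/run-tests` pass; a rough sketch, where the suite and module names are placeholders:

```
# Run a single Scala suite via sbt (suite name is a placeholder)
./build/sbt "core/testOnly *SomeSuite"

# Run selected PySpark test modules (see ./python/run-tests --help)
./python/run-tests --modules pyspark-core
```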
- Run all tests with `./dev/run-tests` to verify that the code still compiles, passes tests, and passes style checks. If style checks fail, review the Code Style Guide below.
- Open the pull request against the `master` branch of apache/spark. (Only in special cases would the PR be opened against other branches.) This will trigger the “On pull request*” workflows on the Spark repo, which look for successful workflow runs on your forked repository (and wait if one is still running).
- The PR title should be of the form `[SPARK-xxxx][COMPONENT] Title`, where `SPARK-xxxx` is the relevant JIRA number, `COMPONENT` is one of the PR categories shown at spark-prs.appspot.com, and Title may be the JIRA’s title or a more specific title describing the PR itself.
- If the pull request is still a work in progress, and so is not ready to be merged, but needs to be pushed to GitHub to facilitate review, then add `[WIP]` after the component.
- You can add `@username` in the PR description to immediately ping committers or contributors who have worked on the code being changed.
- If tests fail because of changes merged to `master` since you opened the pull request, resolve the conflicts by adding the upstream remote with `git remote add upstream https://github.com/apache/spark.git`, running `git fetch upstream` followed by `git rebase upstream/master`, resolving the conflicts by hand, and then pushing the result to your branch (see the sketch after this list).
- Note that a pull request can get so out of date with `master` that you will actually have to close it manually.
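That conflict-resolution flow looks like this end to end; a sketch, where the branch name is a placeholder:

```
# One-time setup: add the canonical repo as a remote named "upstream"
git remote add upstream https://github.com/apache/spark.git

# Rebase your branch onto the latest master
git fetch upstream
git rebase upstream/master
# ...resolve any conflicts by hand, then `git rebase --continue`...

# Rebasing rewrites history, so the push to your fork must be forced
git push --force-with-lease origin my-branch   # "my-branch" is a placeholder
```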
Please follow the style of the existing codebase. If you’re not sure about the right style for something, look at whether there are other examples in the code that use your feature. Feel free to ask on the dev@spark.apache.org list as well and/or ask committers.
The Apache Spark project follows the Apache Software Foundation Code of Conduct. The code of conduct applies to all spaces managed by the Apache Software Foundation, including IRC, all public and private mailing lists, issue trackers, wikis, blogs, Twitter, and any other communication channel used by our communities. A code of conduct which is specific to in-person events (i.e., conferences) is codified in the published ASF anti-harassment policy.
We expect this code of conduct to be honored by everyone who participates in the Apache community formally or informally, or claims any affiliation with the Foundation, in any Foundation-related activities and especially when representing the ASF, in any role.
This code is not exhaustive or complete. It serves to distill our common understanding of a collaborative, shared environment and goals. We expect it to be followed in spirit as much as in the letter, so that it can enrich all of us and the technical communities in which we participate.
For more information and specific guidelines, refer to the Apache Software Foundation Code of Conduct.