pyspark.pandas.groupby.GroupBy.sum

GroupBy.sum(numeric_only: Optional[bool] = True, min_count: int = 0) → FrameLike

Compute the sum of group values.

New in version 3.3.0.

Parameters
numeric_only : bool, default True

Include only float, int, and boolean columns. If None, will attempt to use everything, then use only numeric data. This parameter currently has no effect, since only numeric columns are supported here.

New in version 3.4.0.

min_count : int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present, the result will be NA.

New in version 3.4.0.

Notes

There is a behavior difference between pandas-on-Spark and pandas:

  • when there is a non-numeric aggregation column, it will be ignored even if numeric_only is False, as the sketch below illustrates.
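For instance (a minimal sketch of the behavior noted above, reusing the frame from the Examples section), passing numeric_only=False still drops the string column "D":

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True],
...                    "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})
>>> df.groupby("A").sum(numeric_only=False).sort_index()
   B  C
A
1  1  6
2  1  8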

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True],
...                    "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})
>>> df.groupby("A").sum().sort_index()
   B  C
A
1  1  6
2  1  8
>>> df.groupby("D").sum().sort_index()
   A  B   C
D
a  5  2  11
b  1  0   3
>>> df.groupby("D").sum(min_count=3).sort_index()
     A    B     C
D
a  5.0  2.0  11.0
b  NaN  NaN   NaN
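
Because min_count counts non-NA values rather than rows, a group can have enough rows and still fall short. A minimal sketch of this behavior (hypothetical data, not part of the upstream doctests):

>>> df = ps.DataFrame({"D": ["a", "a", "a", "b"],
...                    "C": [3.0, None, 4.0, 5.0]})
>>> df.groupby("D")["C"].sum(min_count=2).sort_index()
D
a    7.0
b    NaN
Name: C, dtype: float64

Group "a" has three rows but only two non-NA values, which still satisfies min_count=2; group "b" has only one and yields NaN.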