pyspark.pandas.DataFrame.groupby#

DataFrame.groupby(by, axis=0, as_index=True, dropna=True)[source]#

Group DataFrame or Series using one or more columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters

by: Series, label, or list of labels: Used to determine the groups for the groupby. If Series is passed, the Series or dict VALUES will be used to determine the groups. A label or list of labels may be passed to group by the columns in self.
axis: int, default 0 or ‘index’: Can only be set to 0 now.
as_index: bool, default True: For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
dropna: bool, default True: If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

Returns

DataFrameGroupBy or SeriesGroupBy: Depends on the calling object and returns groupby object that contains information about the groups.

See also

pyspark.pandas.groupby.GroupBy

Examples

>>> df = ps.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]},
...                   columns=['Animal', 'Max Speed'])
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

>>> df.groupby(['Animal']).mean().sort_index()  
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

>>> df.groupby(['Animal'], as_index=False).mean().sort_values('Animal')
... 
   Animal  Max Speed
...Falcon      375.0
...Parrot       25.0

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = ps.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5

>>> df.groupby(by=["b"], dropna=False).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4