class RelationalGroupedDataset extends AnyRef

A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).

The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean, sum for convenience.

- Annotations
 - @Stable()
- Source
 - RelationalGroupedDataset.scala
- Since
 2.0.0
- Note
 This class was named GroupedData in Spark 1.x.
Value Members

- final def !=(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any

- final def ##(): Int
- Definition Classes
 - AnyRef → Any

- final def ==(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any

- def agg(expr: Column, exprs: Column*): DataFrame

Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.

The available aggregate methods are defined in org.apache.spark.sql.functions.

  // Selects the age of the oldest employee and the aggregate expense for each department

  // Scala:
  import org.apache.spark.sql.functions._
  df.groupBy("department").agg(max("age"), sum("expense"))

  // Java:
  import static org.apache.spark.sql.functions.*;
  df.groupBy("department").agg(max("age"), sum("expense"));

Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To revert to that behavior, set the config variable spark.sql.retainGroupColumns to false.

  // Scala, 1.3.x:
  df.groupBy("department").agg($"department", max("age"), sum("expense"))

  // Java, 1.3.x:
  df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

- Annotations
 - @varargs()
- Since
 1.3.0
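
For instance, a quick sketch of flipping that setting at runtime (assuming spark is an active SparkSession and df has the columns shown above):

  // Assumed setup: `spark` is a SparkSession, `df` has "department" and "age" columns
  spark.conf.set("spark.sql.retainGroupColumns", "false")
  df.groupBy("department").agg(max("age"))  // output has only max(age), no department column
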
- def agg(exprs: Map[String, String]): DataFrame

(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

  // Selects the age of the oldest employee and the aggregate expense for each department
  import com.google.common.collect.ImmutableMap;
  df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));

- Since
 1.3.0

- def agg(exprs: Map[String, String]): DataFrame

(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

  // Selects the age of the oldest employee and the aggregate expense for each department
  df.groupBy("department").agg(Map(
    "age" -> "max",
    "expense" -> "sum"
  ))

- Since
 1.3.0

- def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

  // Selects the age of the oldest employee and the aggregate expense for each department
  df.groupBy("department").agg(
    "age" -> "max",
    "expense" -> "sum"
  )

- Since
 1.3.0

- def as[K, T](implicit arg0: Encoder[K], arg1: Encoder[T]): KeyValueGroupedDataset[K, T]

Returns a KeyValueGroupedDataset where the data is grouped by the grouping expressions of the current RelationalGroupedDataset.

- Since
 3.0.0
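
The upstream doc gives no example here; as a rough sketch, grouping by one string column yields a String key (the Employee case class, df, and column names are hypothetical):

  // Hypothetical: `df` has a string "department" column plus columns matching Employee
  case class Employee(name: String, department: String, salary: Long)
  import spark.implicits._

  val grouped = df.groupBy($"department").as[String, Employee]
  // grouped is a KeyValueGroupedDataset[String, Employee], so typed operators apply:
  grouped.mapGroups((dept, rows) => (dept, rows.map(_.salary).max))
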
- final def asInstanceOf[T0]: T0
- Definition Classes
 - Any

- def avg(colNames: String*): DataFrame

Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specific columns are given, only compute the mean values for them.

- Annotations
 - @varargs()
- Since
 1.3.0
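
A minimal usage sketch (df and the column names are assumptions, not from the upstream doc):

  // Assumed: `df` has a "department" column and numeric "salary" and "bonus" columns
  df.groupBy("department").avg("salary", "bonus")
  // resulting columns: department, avg(salary), avg(bonus)
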
- def clone(): AnyRef
- Attributes
 - protected[lang]
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... ) @native()

- def count(): DataFrame

Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.

- Since
 1.3.0
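
A one-line sketch (the column name is an assumption):

  // Assumed: `df` has a "department" column
  df.groupBy("department").count()  // resulting columns: department, count
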
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef

- def equals(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any

- def finalize(): Unit
- Attributes
 - protected[lang]
- Definition Classes
 - AnyRef
- Annotations
 - @throws( classOf[java.lang.Throwable] )

- final def getClass(): Class[_]
- Definition Classes
 - AnyRef → Any
- Annotations
 - @native()

- def hashCode(): Int
- Definition Classes
 - AnyRef → Any
- Annotations
 - @native()

- final def isInstanceOf[T0]: Boolean
- Definition Classes
 - Any

- def max(colNames: String*): DataFrame

Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specific columns are given, only compute the max values for them.

- Annotations
 - @varargs()
- Since
 1.3.0
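
A short sketch (column names assumed):

  // Assumed: `df` has a "department" column and a numeric "age" column
  df.groupBy("department").max("age")  // resulting columns: department, max(age)
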
- def mean(colNames: String*): DataFrame

Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When specific columns are given, only compute the average values for them.

- Annotations
 - @varargs()
- Since
 1.3.0
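
Since mean is an alias, a call mirrors the avg sketch above (column names assumed):

  df.groupBy("department").mean("salary")  // same output as avg("salary")
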
- def min(colNames: String*): DataFrame

Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specific columns are given, only compute the min values for them.

- Annotations
 - @varargs()
- Since
 1.3.0
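
A short sketch, mirroring max above (column names assumed):

  df.groupBy("department").min("age")  // resulting columns: department, min(age)
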
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef

- final def notify(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @native()

- final def notifyAll(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @native()

- def pivot(pivotColumn: Column, values: List[Any]): RelationalGroupedDataset

(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.

- pivotColumn
 the column to pivot.
- values
 List of values that will be translated to columns in the output DataFrame.
- Since
 2.4.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.

- def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset

Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.

  // Compute the sum of earnings for each year by course with each course as a separate column
  df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum($"earnings")

- pivotColumn
 the column to pivot.
- values
 List of values that will be translated to columns in the output DataFrame.
- Since
 2.4.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.
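
This overload also composes with the full agg family; a sketch with two aggregates per pivoted value (reusing the assumed columns above):

  import org.apache.spark.sql.functions.{avg, sum}
  df.groupBy($"year")
    .pivot($"course", Seq("dotNET", "Java"))
    .agg(sum($"earnings"), avg($"earnings"))
  // yields columns such as dotNET_sum(earnings) and dotNET_avg(earnings)
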
- def pivot(pivotColumn: Column): RelationalGroupedDataset

Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.

  // Or without specifying column values (less efficient)
  df.groupBy($"year").pivot($"course").sum($"earnings")

- pivotColumn
 the column to pivot.
- Since
 2.4.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.

- def pivot(pivotColumn: String, values: List[Any]): RelationalGroupedDataset

(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.

There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

  // Compute the sum of earnings for each year by course with each course as a separate column
  df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

  // Or without specifying column values (less efficient)
  df.groupBy("year").pivot("course").sum("earnings");

- pivotColumn
 Name of the column to pivot.
- values
 List of values that will be translated to columns in the output DataFrame.
- Since
 1.6.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.

- def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset

Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

  // Compute the sum of earnings for each year by course with each course as a separate column
  df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

  // Or without specifying column values (less efficient)
  df.groupBy("year").pivot("course").sum("earnings")

From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:

  df.groupBy("year")
    .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
    .agg(sum($"earnings"))

- pivotColumn
 Name of the column to pivot.
- values
 List of values that will be translated to columns in the output DataFrame.
- Since
 1.6.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.

- def pivot(pivotColumn: String): RelationalGroupedDataset

Pivots a column of the current DataFrame and performs the specified aggregation.

There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

  // Compute the sum of earnings for each year by course with each course as a separate column
  df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

  // Or without specifying column values (less efficient)
  df.groupBy("year").pivot("course").sum("earnings")

- pivotColumn
 Name of the column to pivot.
- Since
 1.6.0
- See also
 org.apache.spark.sql.Dataset.unpivot for the reverse operation, except for the aggregation.

- def sum(colNames: String*): DataFrame

Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specific columns are given, only compute the sum for them.

- Annotations
 - @varargs()
- Since
 1.3.0
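
A short sketch (column names assumed):

  df.groupBy("department").sum("expense")  // resulting columns: department, sum(expense)
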
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
 - AnyRef

- def toString(): String
- Definition Classes
 - RelationalGroupedDataset → AnyRef → Any

- final def wait(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... )

- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... )

- final def wait(arg0: Long): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... ) @native()