Class Dataset<T>
- All Implemented Interfaces:
Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations
are the ones that produce new Datasets, and actions are the ones that trigger computation and
return results. Example transformations include map, filter, select, and aggregate (groupBy).
Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked.
Internally, a Dataset represents a logical plan that describes the computation required to
produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan
and generates a physical plan for efficient execution in a parallel and distributed manner. To
explore the logical plan as well as optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is
required. The encoder maps the domain specific type T to Spark's internal type system. For
example, given a class Person with two fields, name (string) and age (int), an encoder is
used to tell Spark to generate code at runtime to serialize the Person object into a binary
structure. This binary structure often has a much lower memory footprint and is optimized
for efficiency in data processing (e.g. in a columnar format). To understand the internal
binary representation for data, use the schema function.
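For example, a minimal sketch of inspecting the encoded schema (assuming an active SparkSession named spark and import spark.implicits._ in scope):
case class Person(name: String, age: Int)
val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()   // Scala
ds.printSchema()   // prints the name and age fields as a tree
ds.schema          // the same information as a StructType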
There are typically two ways to create a Dataset. The most common way is by pointing Spark to
some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person] // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by mapping over the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
(MapFunction<Person, String>) p -> p.name, Encoders.STRING()); // Java
Dataset operations can also be untyped, through various domain-specific-language (DSL)
functions defined in: Dataset (this class), Column, and
functions. These operations are very similar to the operations
available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and the col method in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age"); // in Java
Note that the Column type can also be manipulated through its various
functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), people("gender"))
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
people.filter(people.col("age").gt(30))
.join(department, people.col("deptId").equalTo(department.col("id")))
.groupBy(department.col("name"), people.col("gender"))
.agg(avg(people.col("salary")), max(people.col("age")));
- Since:
- 1.6.0
Constructor Summary
Constructors: Dataset()
Method Summary
Methods, grouped by purpose:
Typed transformations: as, map, mapPartitions, flatMap, filter, distinct, dropDuplicates, dropDuplicatesWithinWatermark, except, exceptAll, intersect, intersectAll, union, unionByName, sample, randomSplit, randomSplitAsList, limit, offset, sort, sortWithinPartitions, orderBy, repartition, repartitionById, repartitionByRange, coalesce, joinWith, alias, to, withWatermark, observe, transform
Untyped (DataFrame) transformations: select, selectExpr, filter and where (SQL expression or Column), drop, withColumn, withColumns, withColumnRenamed, withColumnsRenamed, withMetadata, join, crossJoin, lateralJoin, groupBy, groupByKey, groupingSets, cube, rollup, agg, melt, unpivot, transpose, explode (deprecated), toDF, toJSON, apply, col, colRegex, metadataColumn, scalar, exists, hint, na, stat
Actions: collect, collectAsList, toLocalIterator, count, first, head, take, takeAsList, tail, show, reduce, foreach, foreachPartition, describe, summary, isEmpty
Schema, plans, and metadata: schema, printSchema, columns, dtypes, encoder, explain, inputFiles, queryExecution, sparkSession, storageLevel, isLocal, isStreaming, sameSemantics, semanticHash
Caching and checkpointing: cache, persist, unpersist, checkpoint, localCheckpoint
Temporary views: createTempView, createOrReplaceTempView, createGlobalTempView, createOrReplaceGlobalTempView, registerTempTable (deprecated; use createOrReplaceTempView)
Writing: write, writeStream, writeTo, mergeInto
RDD interop: rdd, javaRDD, toJavaRDD
-
Constructor Details
-
Dataset
public Dataset()
-
-
Method Details
-
agg
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
- Parameters:
expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
agg
public Dataset<Row> agg(scala.Tuple2<String, String> aggExpr, scala.collection.immutable.Seq<scala.Tuple2<String, String>> aggExprs)
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg("age" -> "max", "salary" -> "avg")
ds.groupBy().agg("age" -> "max", "salary" -> "avg")
- Parameters:
aggExpr- (undocumented)aggExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
agg
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
- Parameters:
exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
agg
(Java-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(Map("age" -> "max", "salary" -> "avg"))
ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
- Parameters:
exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
agg
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...)
ds.agg(max($"age"), avg($"salary"))
ds.groupBy().agg(max($"age"), avg($"salary"))
- Parameters:
expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
alias
Returns a new Dataset with an alias set. Same as as.
- Parameters:
alias- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
alias
(Scala-specific) Returns a new Dataset with an alias set. Same as as.
- Parameters:
alias- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
apply
Selects a column based on the column name and returns it as a Column.
- Parameters:
colName- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- The column name can also reference a nested column like a.b.
-
as
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (i.e. String, Int, etc.), then the first column of the DataFrame will be used.
If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
- Parameters:
evidence$1- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
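For example, a minimal sketch of the ordinal mapping (assuming a DataFrame whose first column is a string and whose second column is a long):
val pairs = df.as[(String, Long)]   // Scala; columns are bound to _1 and _2 by position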
-
as
Returns a new Dataset with an alias set.- Parameters:
alias- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
as
(Scala-specific) Returns a new Dataset with an alias set.- Parameters:
alias- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
cache
Persist this Dataset with the default storage level (MEMORY_AND_DISK).- Returns:
- (undocumented)
- Since:
- 1.6.0
-
checkpoint
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
- Returns:
- (undocumented)
- Since:
- 2.1.0
-
checkpoint
Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.
- Parameters:
eager- Whether to checkpoint this dataframe immediately- Returns:
- (undocumented)
- Since:
- 2.1.0
- Note:
- When checkpoint is used with eager = false, the final data that is checkpointed after the first action may be different from the data that was used during the job due to non-determinism of the underlying operation and retries. If checkpoint is used to achieve saving a deterministic snapshot of the data, eager = true should be used. Otherwise, it is only deterministic after the first execution, after the checkpoint was finalized.
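A minimal sketch of eager checkpointing (the checkpoint directory path is illustrative; assumes a SparkSession named spark):
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical path
val truncated = ds.checkpoint()   // eager by default; materializes ds and truncates its lineage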
-
coalesce
Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
- Parameters:
numPartitions- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
col
Selects a column based on the column name and returns it as a Column.
- Parameters:
colName- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- The column name can also reference a nested column like a.b.
-
colRegex
Selects a column based on the column name specified as a regex and returns it as a Column.
- Parameters:
colName- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.3.0
-
collect
Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For the Java API, use collectAsList().
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
collectAsList
Returns a Java list that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
columns
Returns all column names as an array.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
count
public abstract long count()
Returns the number of rows in the Dataset.
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
createGlobalTempView
Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary views are cross-session. Their lifetime is the lifetime of the Spark application, i.e. they will be automatically dropped when the application terminates. They are tied to a system-preserved database global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
- Parameters:
viewName- (undocumented)- Throws:
AnalysisException- if the view name is invalid or already exists- Since:
- 2.1.0
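A minimal sketch (the view name "people" is illustrative):
people.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people")               // must use the global_temp qualifier
spark.newSession().sql("SELECT * FROM global_temp.people")  // visible from other sessions of the same application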
-
createOrReplaceGlobalTempView
Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary views are cross-session. Their lifetime is the lifetime of the Spark application, i.e. they will be automatically dropped when the application terminates. They are tied to a system-preserved database global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
- Parameters:
viewName- (undocumented)- Since:
- 2.2.0
-
createOrReplaceTempView
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
- Parameters:
viewName- (undocumented)- Since:
- 2.0.0
-
createTempView
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Local temporary views are session-scoped. Their lifetime is the lifetime of the session that created them, i.e. they will be automatically dropped when the session terminates. They are not tied to any database, i.e. we can't use db1.view1 to reference a local temporary view.
- Parameters:
viewName- (undocumented)- Throws:
AnalysisException- if the view name is invalid or already exists- Since:
- 2.0.0
-
crossJoin
Explicit cartesian join with another DataFrame.
- Parameters:
right- Right side of the join operation.- Returns:
- (undocumented)
- Since:
- 2.1.0
- Note:
- Cartesian joins are very expensive without an extra filter that can be pushed down.
-
cube
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
cube
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
cube
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
ds.cube($"department", $"group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
cube
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
ds.cube("department", "group").avg()
// Compute the max age and average salary, cubed by department and gender.
ds.cube("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
describe
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.describe("age", "height").show()
// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
Use summary(java.lang.String...) for expanded statistics and control over which statistics to compute.
- Parameters:
cols- Columns to compute statistics on.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
describe
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.
ds.describe("age", "height").show()
// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
Use summary(java.lang.String...) for expanded statistics and control over which statistics to compute.
- Parameters:
cols- Columns to compute statistics on.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
distinct
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
Note that for a streaming Dataset, this method returns distinct rows only once regardless of the output mode, so the behavior may not be the same as DISTINCT in SQL against a streaming Dataset.
- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
-
drop
Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the column name(s).
This method can only be used to drop top level columns. The colName string is treated literally without further interpretation.
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
drop
Returns a new Dataset with columns dropped.
This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
- Parameters:
col- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
-
drop
Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the column name.
This method can only be used to drop top level columns. The colName string is treated literally without further interpretation.
Note: drop(colName) has different semantics from drop(col(colName)), for example:
1. Multiple columns have the same colName:
val df1 = spark.range(0, 2).withColumn("key1", lit(1))
val df2 = spark.range(0, 2).withColumn("key2", lit(2))
val df3 = df1.join(df2)
df3.show
// +---+----+---+----+
// | id|key1| id|key2|
// +---+----+---+----+
// |  0|   1|  0|   2|
// |  0|   1|  1|   2|
// |  1|   1|  0|   2|
// |  1|   1|  1|   2|
// +---+----+---+----+
df3.drop("id").show()
// output: the two 'id' columns are both dropped.
// +----+----+
// |key1|key2|
// +----+----+
// |   1|   2|
// |   1|   2|
// |   1|   2|
// |   1|   2|
// +----+----+
df3.drop(col("id")).show()
// ...AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous...
2. colName contains special characters, like a dot:
val df = spark.range(0, 2).withColumn("a.b.c", lit(1))
df.show()
// +---+-----+
// | id|a.b.c|
// +---+-----+
// |  0|    1|
// |  1|    1|
// +---+-----+
df.drop("a.b.c").show()
// +---+
// | id|
// +---+
// |  0|
// |  1|
// +---+
df.drop(col("a.b.c")).show() // no column matches the expression 'a.b.c'
// +---+-----+
// | id|a.b.c|
// +---+-----+
// |  0|    1|
// |  1|    1|
// +---+-----+
- Parameters:
colName- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
drop
Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the column name(s).
This method can only be used to drop top level columns. The colName string is treated literally without further interpretation.
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
drop
Returns a new Dataset with a column dropped.
This method can only be used to drop top level columns. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
Note: drop(col(colName)) has different semantics from drop(colName); please refer to Dataset#drop(colName: String).
- Parameters:
col- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
drop
Returns a new Dataset with columns dropped.
This method can only be used to drop top level columns. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
- Parameters:
col- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
-
dropDuplicates
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data that is too late, i.e. older than the watermark, will be dropped to avoid any possibility of duplicates.
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
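A minimal sketch (the "first_name" and "last_name" columns are illustrative):
people.dropDuplicates("first_name", "last_name")   // keeps one row per distinct (first_name, last_name) pair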
-
dropDuplicates
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data that is too late, i.e. older than the watermark, will be dropped to avoid any possibility of duplicates.
- (undocumented)
- Since:
- 2.0.0
-
dropDuplicates
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data that is too late, i.e. older than the watermark, will be dropped to avoid any possibility of duplicates.
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
dropDuplicates
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data that is too late, i.e. older than the watermark, will be dropped to avoid any possibility of duplicates.
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
dropDuplicates
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicate rows. You can use withWatermark(java.lang.String,java.lang.String) to limit how late the duplicate data can be, and the system will accordingly limit the state. In addition, data that is too late, i.e. older than the watermark, will be dropped to avoid any possibility of duplicates.
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
dropDuplicatesWithinWatermark
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within the watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic that events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark. Users are encouraged to set the delay threshold of the watermark longer than the maximum timestamp difference among duplicated events.
Note: data that is too late, i.e. older than the watermark, will be dropped.
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.5.0
-
dropDuplicatesWithinWatermark
Returns a new Dataset with duplicate rows removed, within the watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic that events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark. Users are encouraged to set the delay threshold of the watermark longer than the maximum timestamp difference among duplicated events.
Note: data that is too late, i.e. older than the watermark, will be dropped.
- Returns:
- (undocumented)
- Since:
- 3.5.0
-
dropDuplicatesWithinWatermark
public abstract Dataset<T> dropDuplicatesWithinWatermark(scala.collection.immutable.Seq<String> colNames)
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within the watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic that events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark. Users are encouraged to set the delay threshold of the watermark longer than the maximum timestamp difference among duplicated events.
Note: data that is too late, i.e. older than the watermark, will be dropped.
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.5.0
-
dropDuplicatesWithinWatermark
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within the watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic that events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark. Users are encouraged to set the delay threshold of the watermark longer than the maximum timestamp difference among duplicated events.
Note: data that is too late, i.e. older than the watermark, will be dropped.
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.5.0
-
dropDuplicatesWithinWatermark
public Dataset<T> dropDuplicatesWithinWatermark(String col1, scala.collection.immutable.Seq<String> cols)
Returns a new Dataset with duplicate rows removed, considering only the subset of columns, within the watermark.
This only works with a streaming Dataset, and the watermark for the input Dataset must be set via withWatermark(java.lang.String,java.lang.String).
For a streaming Dataset, this will keep all data across triggers as intermediate state to drop duplicated rows. The state is kept to guarantee the semantic that events are deduplicated as long as the time distance between the earliest and latest events is smaller than the delay threshold of the watermark. Users are encouraged to set the delay threshold of the watermark longer than the maximum timestamp difference among duplicated events.
Note: data that is too late, i.e. older than the watermark, will be dropped.
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.5.0
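A minimal sketch for a streaming Dataset (the "eventTime" and "id" columns are illustrative):
streamingDs
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicatesWithinWatermark("id")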
-
dtypes
Returns all column names and their data types as an array.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
encoder
Returns the Encoder<T> of this Dataset, which maps the domain-specific type T to Spark's internal type system.
-
except
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT DISTINCT in SQL.
- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
-
exceptAll
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL.
- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.4.0
- Note:
- Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T. Also, as standard in SQL, this function resolves columns by position (not by name).
-
exists
Return a Column object for an EXISTS subquery.
The exists method provides a way to create a boolean column that checks for the presence of related records in a subquery. When applied within a DataFrame, this method allows you to filter rows based on whether matching records exist in the related dataset. The resulting Column object can be used directly in filtering conditions or as a computed column.
- Returns:
- (undocumented)
- Since:
- 4.0.0
-
explain
Prints the plans (logical and physical) with a format specified by a given explain mode.
- Parameters:
mode - specifies the expected output format of plans.
simple: Print only a physical plan.
extended: Print both logical and physical plans.
codegen: Print a physical plan and generated code, if available.
cost: Print a logical plan and statistics, if available.
formatted: Split explain output into two sections: a physical plan outline and node details.
- Since:
- 3.0.0
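For instance:
people.explain("formatted")   // physical plan outline plus node details
people.explain("cost")        // logical plan with statistics, when available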
-
explain
public void explain(boolean extended)
Prints the plans (logical and physical) to the console for debugging purposes.
- Parameters:
extended - default false. If false, prints only the physical plan.
- Since:
- 1.6.0
-
explain
public void explain()
Prints the physical plan to the console for debugging purposes.
- Since:
- 1.6.0
-
explode
public abstract <A extends scala.Product> Dataset<Row> explode(scala.collection.immutable.Seq<Column> input, scala.Function1<Row, scala.collection.IterableOnce<A>> f, scala.reflect.api.TypeTags.TypeTag<A> evidence$3)
Deprecated. Use flatMap() or select() with functions.explode() instead. Since 2.0.0.
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode() or flatMap(). The following example uses these alternatives to count the number of books that contain a given word:
case class Book(title: String, words: String)
val ds: Dataset[Book]
val allWords = ds.select($"title", explode(split($"words", " ")).as("word"))
val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))
Using flatMap() this can similarly be exploded as:
ds.flatMap(_.words.split(" "))
- Parameters:
input- (undocumented)f- (undocumented)evidence$3- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
explode
public abstract <A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A, scala.collection.IterableOnce<B>> f, scala.reflect.api.TypeTags.TypeTag<B> evidence$4)
Deprecated. Use flatMap() or select() with functions.explode() instead. Since 2.0.0.
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():
ds.select(explode(split($"words", " ")).as("word"))
or flatMap():
ds.flatMap(_.words.split(" "))
- Parameters:
inputColumn- (undocumented)outputColumn- (undocumented)f- (undocumented)evidence$4- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
filter
Filters rows using the given condition.
// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)
- Parameters:
condition- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
filter
Filters rows using the given SQL expression.
peopleDs.filter("age > 15")
- Parameters:
conditionExpr- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
filter
(Scala-specific) Returns a new Dataset that only contains elements where func returns true.
- Parameters:
func- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
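For instance, a typed filter over the Person Dataset from the class-level examples (the predicate receives the domain object rather than a Row):
val adults = people.filter((p: Person) => p.age > 30)   // Scala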
-
filter
(Java-specific) Returns a new Dataset that only contains elements where func returns true.
- Parameters:
func- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
first
Returns the first row. Alias for head().- Returns:
- (undocumented)
- Since:
- 1.6.0
-
flatMap
public <U> Dataset<U> flatMap(scala.Function1<T, scala.collection.IterableOnce<U>> func, Encoder<U> evidence$7)
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
- Parameters:
func- (undocumented)evidence$7- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
flatMap
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.- Parameters:
f- (undocumented)encoder- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
foreach
Applies a function f to all rows.
- Parameters:
f- (undocumented)- Since:
- 1.6.0
-
foreach
(Java-specific) Runs func on each element of this Dataset.
- Parameters:
func- (undocumented)- Since:
- 1.6.0
-
foreachPartition
public abstract void foreachPartition(scala.Function1<scala.collection.Iterator<T>, scala.runtime.BoxedUnit> f)
Applies a function f to each partition of this Dataset.
- Parameters:
f- (undocumented)- Since:
- 1.6.0
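A minimal sketch (openConnection is a hypothetical helper standing in for per-partition setup such as a database or HTTP client):
people.foreachPartition { (rows: Iterator[Person]) =>
  val conn = openConnection()       // hypothetical: one connection per partition
  rows.foreach(p => conn.send(p))   // hypothetical send
  conn.close()
}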
-
foreachPartition
(Java-specific) Runs func on each partition of this Dataset.
- Parameters:
func- (undocumented)- Since:
- 1.6.0
-
groupBy
Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()
// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
groupBy
Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()

// Compute the max age and average salary, grouped by department and gender.
ds.groupBy("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
groupBy
Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

// Compute the average for all numeric columns grouped by department.
ds.groupBy($"department").avg()

// Compute the max age and average salary, grouped by department and gender.
ds.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
groupBy
Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns grouped by department.
ds.groupBy("department").avg()

// Compute the max age and average salary, grouped by department and gender.
ds.groupBy("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
groupByKey
public abstract <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T, K> func, Encoder<K> evidence$2) (Scala-specific) Returns aKeyValueGroupedDatasetwhere the data is grouped by the given keyfunc.- Parameters:
func- (undocumented)evidence$2- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
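A minimal sketch, assuming a hypothetical Dataset[Person]; count() on the resulting KeyValueGroupedDataset yields one row per key:

// Group typed records by a derived key (age) and count the rows per key.
val countsByAge: Dataset[(Int, Long)] = peopleDs.groupByKey(_.age).count()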
-
groupByKey
(Java-specific) Returns aKeyValueGroupedDatasetwhere the data is grouped by the given keyfunc.- Parameters:
func- (undocumented)encoder- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
groupingSets
public RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, Column... cols)
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

// Compute the average for all numeric columns grouped by the specified grouping sets.
ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()

// Compute the max age and average salary, grouped by the specified grouping sets.
ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
groupingSets- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 4.0.0
-
groupingSets
public abstract RelationalGroupedDataset groupingSets(scala.collection.immutable.Seq<scala.collection.immutable.Seq<Column>> groupingSets, scala.collection.immutable.Seq<Column> cols)
Create multi-dimensional aggregation for the current Dataset using the specified grouping sets, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

// Compute the average for all numeric columns grouped by the specified grouping sets.
ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").avg()

// Compute the max age and average salary, grouped by the specified grouping sets.
ds.groupingSets(Seq(Seq($"department", $"group"), Seq()), $"department", $"group").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
groupingSets- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 4.0.0
-
head
Returns the firstnrows.- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
-
head
Returns the first row.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
hint
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plans can be broadcast:

df1.join(df2.hint("broadcast"))

The following code specifies that this Dataset could be rebalanced with the given number of partitions:

df1.hint("rebalance", 10)
- Parameters:
name- the name of the hintparameters- the parameters of the hint, all the parameters should be aColumnorExpressionorSymbolor could be converted into aLiteral- Returns:
- (undocumented)
- Since:
- 2.2.0
-
hint
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plans can be broadcast:

df1.join(df2.hint("broadcast"))

The following code specifies that this Dataset could be rebalanced with the given number of partitions:

df1.hint("rebalance", 10)
- Parameters:
name- the name of the hintparameters- the parameters of the hint, all the parameters should be aColumnorExpressionorSymbolor could be converted into aLiteral- Returns:
- (undocumented)
- Since:
- 2.2.0
-
inputFiles
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
intersect
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent toINTERSECTin SQL.- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- Equality checking is performed directly on the encoded representation of the data and thus
is not affected by a custom
equals function defined on T.
-
intersectAll
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent toINTERSECT ALLin SQL.- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.4.0
- Note:
- Equality checking is performed directly on the encoded representation of the data and thus
is not affected by a custom
equals function defined on T. Also as standard in SQL, this function resolves columns by position (not by name).
-
isEmpty
public abstract boolean isEmpty()
Returns true if the Dataset is empty.
- Returns:
- (undocumented)
- Since:
- 2.4.0
-
isLocal
public abstract boolean isLocal()
Returns true if the collect and take methods can be run locally (without any Spark executors).
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
isStreaming
public abstract boolean isStreaming()
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.
- Returns:
- (undocumented)
- Since:
- 2.0.0
-
javaRDD
Returns the content of the Dataset as a JavaRDD of Ts.
- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- this is only supported in Classic.
-
join
Join with another DataFrame. Behaves as an INNER JOIN and requires a subsequent join predicate.
- Parameters:
right- Right side of the join operation.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
join
Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
- Parameters:
right- Right side of the join operation.usingColumn- Name of the column to join on. This column must exist on both sides.- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- If you perform a self-join using this function without aliasing the input
DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
join
(Java-specific) Inner equi-join with another DataFrame using the given columns. See the Scala-specific overload for more details.
- Parameters:
right- Right side of the join operation.usingColumns- Names of the columns to join on. These columns must exist on both sides.- Returns:
- (undocumented)
- Since:
- 3.4.0
-
join
(Scala-specific) Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
- Parameters:
right- Right side of the join operation.usingColumns- Names of the columns to join on. These columns must exist on both sides.- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- If you perform a self-join using this function without aliasing the input
DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
join
Equi-join with another DataFrame using the given column. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
- Parameters:
right- Right side of the join operation.usingColumn- Name of the column to join on. This column must exist on both sides.joinType- Type of join to perform. Defaultinner. Must be one of:inner,cross,outer,full,fullouter,full_outer,left,leftouter,left_outer,right,rightouter,right_outer,semi,leftsemi,left_semi,anti,leftanti,left_anti.- Returns:
- (undocumented)
- Since:
- 3.4.0
- Note:
- If you perform a self-join using this function without aliasing the input
DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
join
(Java-specific) Equi-join with another DataFrame using the given columns. See the Scala-specific overload for more details.
- Parameters:
right- Right side of the join operation.usingColumns- Names of the columns to join on. These columns must exist on both sides.joinType- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.- Returns:
- (undocumented)
- Since:
- 3.4.0
-
join
public abstract Dataset<Row> join(Dataset<?> right, scala.collection.immutable.Seq<String> usingColumns, String joinType)
(Scala-specific) Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
- Parameters:
right- Right side of the join operation.usingColumns- Names of the columns to join on. These columns must exist on both sides.joinType- Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, left_anti.- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- If you perform a self-join using this function without aliasing the input
DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
join
Inner join with another DataFrame, using the given join expression.

// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
- Parameters:
right- (undocumented)joinExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
join
Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
- Parameters:
right- Right side of the join.joinExprs- Join expression.joinType- Type of join to perform. Defaultinner. Must be one of:inner,cross,outer,full,fullouter,full_outer,left,leftouter,left_outer,right,rightouter,right_outer,semi,leftsemi,left_semi,anti,leftanti,left_anti.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
joinWith
public abstract <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, String joinType)
Joins this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relational join function, with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as for working with relational data where either side of the join has column names in common.
- Parameters:
other- Right side of the join.condition- Join expression.joinType- Type of join to perform. Defaultinner. Must be one of:inner,cross,outer,full,fullouter,full_outer,left,leftouter,left_outer,right,rightouter,right_outer.- Returns:
- (undocumented)
- Since:
- 1.6.0
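A minimal sketch, assuming hypothetical Dataset[Person] and Dataset[Dept] inputs; each result row keeps both joined objects:

// Typed join: the result nests the matching Person and Dept under _1 and _2.
val joined: Dataset[(Person, Dept)] =
  people.joinWith(departments, people("deptId") === departments("id"), "inner")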
-
joinWith
Uses an inner equi-join to join this Dataset with another Dataset, returning a Tuple2 for each pair where condition evaluates to true.
- Parameters:
other- Right side of the join.condition- Join expression.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
lateralJoin
Lateral join with another DataFrame. Behaves as a JOIN LATERAL.
- Parameters:
right- Right side of the join operation.- Returns:
- (undocumented)
- Since:
- 4.0.0
-
lateralJoin
Lateral join with another DataFrame. Behaves as a JOIN LATERAL.
- Parameters:
right- Right side of the join operation.joinExprs- Join expression.- Returns:
- (undocumented)
- Since:
- 4.0.0
-
lateralJoin
Lateral join with another DataFrame.
- Parameters:
right- Right side of the join operation.joinType- Type of join to perform. Defaultinner. Must be one of:inner,cross,left,leftouter,left_outer.- Returns:
- (undocumented)
- Since:
- 4.0.0
-
lateralJoin
Lateral join with another DataFrame.
- Parameters:
right- Right side of the join operation.joinExprs- Join expression.joinType- Type of join to perform. Defaultinner. Must be one of:inner,cross,left,leftouter,left_outer.- Returns:
- (undocumented)
- Since:
- 4.0.0
-
limit
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.
- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
localCheckpoint
Eagerly and locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage; although potentially faster, they are unreliable and may compromise job completion.
- Returns:
- (undocumented)
- Since:
- 2.3.0
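A minimal sketch of truncating lineage inside an iterative computation (the column and loop are illustrative):

import org.apache.spark.sql.functions.col

var df = spark.range(0, 1000).toDF("id")
for (_ <- 1 to 5) {
  // Each localCheckpoint() cuts the accumulated plan so it does not grow with the loop.
  df = df.withColumn("id", col("id") + 1).localCheckpoint()
}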
-
localCheckpoint
Locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage; although potentially faster, they are unreliable and may compromise job completion.
- Parameters:
eager- Whether to checkpoint this dataframe immediately- Returns:
- (undocumented)
- Since:
- 2.3.0
- Note:
- When checkpoint is used with eager = false, the final data that is checkpointed after the first action may be different from the data that was used during the job due to non-determinism of the underlying operation and retries. If checkpoint is used to achieve saving a deterministic snapshot of the data, eager = true should be used. Otherwise, it is only deterministic after the first execution, after the checkpoint was finalized.
-
localCheckpoint
Locally checkpoints a Dataset and returns the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage; although potentially faster, they are unreliable and may compromise job completion.
- Parameters:
eager- Whether to checkpoint this dataframe immediatelystorageLevel- StorageLevel with which to checkpoint the data.- Returns:
- (undocumented)
- Since:
- 4.0.0
- Note:
- When checkpoint is used with eager = false, the final data that is checkpointed after the first action may be different from the data that was used during the job due to non-determinism of the underlying operation and retries. If checkpoint is used to achieve saving a deterministic snapshot of the data, eager = true should be used. Otherwise, it is only deterministic after the first execution, after the checkpoint was finalized.
-
map
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
- Parameters:
func- (undocumented)evidence$5- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
map
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
- Parameters:
func- (undocumented)encoder- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
mapPartitions
public abstract <U> Dataset<U> mapPartitions(scala.Function1<scala.collection.Iterator<T>, scala.collection.Iterator<U>> func, Encoder<U> evidence$6)
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
- Parameters:
func- (undocumented)evidence$6- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
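A minimal sketch, assuming a Dataset[String] named namesDs and spark.implicits._ in scope for the result encoder:

// Transform each partition through a single iterator pass.
val upper: Dataset[String] = namesDs.mapPartitions(iter => iter.map(_.toUpperCase))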
-
mapPartitions
(Java-specific) Returns a new Dataset that contains the result of applying f to each partition.
- Parameters:
f- (undocumented)encoder- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
melt
public Dataset<Row> melt(Column[] ids, Column[] values, String variableColumnName, String valueColumnName)
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.
- Parameters:
ids- Id columnsvalues- Value columns to unpivotvariableColumnName- Name of the variable columnvalueColumnName- Name of the value column- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
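A minimal sketch, assuming a wide DataFrame with columns id, val1 and val2:

import org.apache.spark.sql.functions.col

// Unpivot val1/val2 into (id, variable, value) long format.
val long = df.melt(
  Array(col("id")),
  Array(col("val1"), col("val2")),
  "variable", "value")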
-
melt
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed. This is an alias for unpivot.
- Parameters:
ids- Id columnsvariableColumnName- Name of the variable columnvalueColumnName- Name of the value column- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
This is equivalent to calling Dataset#unpivot(Array, Array, String, String) where values is set to all non-id columns that exist in the DataFrame.
-
mergeInto
Merges a set of updates, insertions, and deletions based on a source table into a target table.Scala Examples:
spark.table("source") .mergeInto("target", $"source.id" === $"target.id") .whenMatched($"salary" === 100) .delete() .whenNotMatched() .insertAll() .whenNotMatchedBySource($"salary" === 100) .update(Map( "salary" -> lit(200) )) .merge()- Parameters:
table- (undocumented)condition- (undocumented)- Returns:
- (undocumented)
- Since:
- 4.0.0
-
metadataColumn
Selects a metadata column based on its logical column name, and returns it as a Column.
A metadata column can be accessed this way even if the underlying data source defines a data column with a conflicting name.
- Parameters:
colName- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.5.0
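A minimal sketch; "_metadata" is assumed here to be the metadata column exposed by Spark's built-in file sources:

import org.apache.spark.sql.functions.col

val df = spark.read.parquet("...")
// Select the data columns together with the hidden file metadata column.
val withMeta = df.select(col("*"), df.metadataColumn("_metadata"))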
-
na
Returns a DataFrameNaFunctions for working with missing data.

// Dropping rows containing any null values.
ds.na.drop()
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
observe
Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
- Parameters:
name- (undocumented)expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
observe
Observe (named) metrics through an org.apache.spark.sql.Observation instance. This method does not support streaming datasets.
A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get.

// Observe row count (rows) and highest id (maxid) in the Dataset while writing it
val observation = Observation("my_metrics")
val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
observed_ds.write.parquet("ds.parquet")
val metrics = observation.get
- Parameters:
observation- (undocumented)expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Throws:
IllegalArgumentException- If this is a streaming Dataset (this.isStreaming == true)- Since:
- 3.3.0
-
observe
public abstract Dataset<T> observe(String name, Column expr, scala.collection.immutable.Seq<Column> exprs) Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
- Parameters:
name- (undocumented)expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
observe
public abstract Dataset<T> observe(Observation observation, Column expr, scala.collection.immutable.Seq<Column> exprs)
Observe (named) metrics through an org.apache.spark.sql.Observation instance. This method does not support streaming datasets.
A user can retrieve the metrics by accessing org.apache.spark.sql.Observation.get.

// Observe row count (rows) and highest id (maxid) in the Dataset while writing it
val observation = Observation("my_metrics")
val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
observed_ds.write.parquet("ds.parquet")
val metrics = observation.get
- Parameters:
observation- (undocumented)expr- (undocumented)exprs- (undocumented)- Returns:
- (undocumented)
- Throws:
IllegalArgumentException- If this is a streaming Dataset (this.isStreaming == true)- Since:
- 3.3.0
-
offset
Returns a new Dataset by skipping the first n rows.
- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
-
orderBy
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
orderBy
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
orderBy
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
orderBy
Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
persist
Persist this Dataset with the default storage level (MEMORY_AND_DISK).- Returns:
- (undocumented)
- Since:
- 1.6.0
-
persist
Persist this Dataset with the given storage level.- Parameters:
newLevel- One of:MEMORY_ONLY,MEMORY_AND_DISK,MEMORY_ONLY_SER,MEMORY_AND_DISK_SER,DISK_ONLY,MEMORY_ONLY_2,MEMORY_AND_DISK_2, etc.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
printSchema
public void printSchema()Prints the schema to the console in a nice tree format.- Since:
- 1.6.0
-
printSchema
public void printSchema(int level) Prints the schema up to the given level to the console in a nice tree format.- Parameters:
level- (undocumented)- Since:
- 3.0.0
-
queryExecution
public abstract org.apache.spark.sql.execution.QueryExecution queryExecution() -
randomSplit
Randomly splits this Dataset with the provided weights.- Parameters:
weights- weights for splits, will be normalized if they don't sum to 1.seed- Seed for sampling.
For the Java API, use randomSplitAsList(double[], long).
- Returns:
- (undocumented)
- Since:
- 2.0.0
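A minimal sketch of a reproducible 80/20 split:

// The weights are normalized if they do not sum to 1; the seed makes the split repeatable.
val Array(train, test) = ds.randomSplit(Array(0.8, 0.2), seed = 42L)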
-
randomSplit
Randomly splits this Dataset with the provided weights.- Parameters:
weights- weights for splits, will be normalized if they don't sum to 1.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
randomSplitAsList
Returns a Java list that contains randomly split Dataset with the provided weights.- Parameters:
weights- weights for splits, will be normalized if they don't sum to 1.seed- Seed for sampling.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
rdd
Represents the content of the Dataset as an RDD of T.
- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- this is only supported in Classic.
-
reduce
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
- Parameters:
func- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
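A minimal sketch, assuming spark.implicits._ is in scope:

// Sum a Dataset[Int] with a commutative and associative function.
val total: Int = Seq(1, 2, 3).toDS().reduce(_ + _)  // 6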
-
reduce
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
- Parameters:
func- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
registerTempTable
Deprecated. Use createOrReplaceTempView(viewName) instead. Since 2.0.0.
Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.
- Parameters:
tableName- (undocumented)- Since:
- 1.6.0
-
repartition
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Parameters:
numPartitions- (undocumented)partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
repartition
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Parameters:
partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
repartition
Returns a new Dataset that has exactly numPartitions partitions.
- Parameters:
numPartitions- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
repartition
public Dataset<T> repartition(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Parameters:
numPartitions- (undocumented)partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
repartition
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Parameters:
partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
repartitionById
Repartition the Dataset into the given number of partitions using the specified partition ID expression.- Parameters:
numPartitions- the number of partitions to use.partitionIdExpr- the expression to be used as the partition ID. Must be an integer type.- Returns:
- (undocumented)
- Since:
- 4.1.0
-
repartitionByRange
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition.- Parameters:
numPartitions- (undocumented)partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.3.0
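A minimal sketch; rows are range-partitioned by the given column, but not sorted within each partition:

import org.apache.spark.sql.functions.col

val ranged = ds.repartitionByRange(10, col("age"))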
-
repartitionByRange
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition.- Parameters:
partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.3.0
-
repartitionByRange
public Dataset<T> repartitionByRange(int numPartitions, scala.collection.immutable.Seq<Column> partitionExprs)
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition.- Parameters:
numPartitions- (undocumented)partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.3.0
-
repartitionByRange
Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is range partitioned.
At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition.- Parameters:
partitionExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.3.0
-
rollup
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

// Compute the average for all numeric columns rolled up by department and group.
ds.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
rollup
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns rolled up by department and group.
ds.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
rollup
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

// Compute the average for all numeric columns rolled up by department and group.
ds.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
rollup
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns rolled up by department and group.
ds.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
ds.rollup("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
- Parameters:
col1- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sameSemantics
Returns true when the logical query plans inside both Datasets are equal and therefore return the same results.
- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.1.0
- Note:
- The equality comparison here is simplified by tolerating cosmetic differences such as attribute names. This API can compare both Datasets very quickly, but can still return false for Datasets that return the same results, for instance, from different plans. Such false negatives can be acceptable, e.g., when deciding whether to reuse a cache.
-
sample
Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.
- Parameters:
fraction- Fraction of rows to generate, range [0.0, 1.0].seed- Seed for sampling.- Returns:
- (undocumented)
- Since:
- 2.3.0
- Note:
- This is NOT guaranteed to provide exactly the fraction of the count of the given
Dataset.
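A minimal sketch; the result holds roughly, not exactly, 10% of the rows:

val tenPercent = ds.sample(0.1, seed = 42L)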
-
sample
Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.
- Parameters:
fraction- Fraction of rows to generate, range [0.0, 1.0].- Returns:
- (undocumented)
- Since:
- 2.3.0
- Note:
- This is NOT guaranteed to provide exactly the fraction of the count of the given
Dataset.
-
sample
Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
- Parameters:
withReplacement- Sample with replacement or not.fraction- Fraction of rows to generate, range [0.0, 1.0].seed- Seed for sampling.- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- This is NOT guaranteed to provide exactly the fraction of the count of the given
Dataset.
-
sample
Returns a new Dataset by sampling a fraction of rows, using a random seed.
- Parameters:
withReplacement- Sample with replacement or not.fraction- Fraction of rows to generate, range [0.0, 1.0].- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- This is NOT guaranteed to provide exactly the fraction of the total count of the given
Dataset.
-
scalar
Return a Column object for a SCALAR subquery containing exactly one row and one column.
The scalar() method is useful for extracting a Column object that represents a scalar value from a DataFrame, especially when the DataFrame results from an aggregation or single-value computation. This returned Column can then be used directly in select clauses or as a predicate in filters on the outer DataFrame, enabling dynamic data filtering and calculations based on scalar values.
- Returns:
- (undocumented)
- Since:
- 4.0.0
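A minimal sketch based on the description above, assuming a hypothetical people DataFrame with spark.implicits._ and functions.max in scope:

// The one-row, one-column aggregate is used as a scalar subquery in the outer filter.
val maxAge = people.agg(max($"age")).scalar()
val oldest = people.filter($"age" === maxAge)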
-
schema
Returns the schema of this Dataset.- Returns:
- (undocumented)
- Since:
- 1.6.0
-
select
Selects a set of column-based expressions.

ds.select($"colA", $"colB" + 1)
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
select
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
- Parameters:
col- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
select
Selects a set of column-based expressions.

ds.select($"colA", $"colB" + 1)
- Parameters:
cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
select
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

// The following two are equivalent:
ds.select("colA", "colB")
ds.select($"colA", $"colB")
- Parameters:
col- (undocumented)cols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
select
Returns a new Dataset by computing the given Column expression for each element.

val ds = Seq(1, 2, 3).toDS()
val newDS = ds.select(expr("value + 1").as[Int])
- Parameters:
c1- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
select
Returns a new Dataset by computing the given Column expressions for each element.
- Parameters:
c1- (undocumented)c2- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
select
public <U1,U2,U3> Dataset<scala.Tuple3<U1,U2,U3>> select(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3)
Returns a new Dataset by computing the given Column expressions for each element.
- Parameters:
c1- (undocumented)c2- (undocumented)c3- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
select
public <U1,U2,U3,U4> Dataset<scala.Tuple4<U1,U2,U3,U4>> select(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4)
Returns a new Dataset by computing the given Column expressions for each element.
- Parameters:
c1- (undocumented)c2- (undocumented)c3- (undocumented)c4- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
select
public <U1,U2,U3,U4,U5> Dataset<scala.Tuple5<U1,U2,U3,U4,U5>> select(TypedColumn<T, U1> c1, TypedColumn<T, U2> c2, TypedColumn<T, U3> c3, TypedColumn<T, U4> c4, TypedColumn<T, U5> c5)
Returns a new Dataset by computing the given Column expressions for each element.
- Parameters:
c1- (undocumented)c2- (undocumented)c3- (undocumented)c4- (undocumented)c5- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
selectExpr
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
- Parameters:
exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
selectExpr
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

// The following are equivalent:
ds.selectExpr("colA", "colB as newName", "abs(colC)")
ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
- Parameters:
exprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
semanticHash
public abstract int semanticHash()
Returns a hashCode of the logical query plan against this Dataset.
- Returns:
- (undocumented)
- Since:
- 3.1.0
- Note:
- Unlike the standard
hashCode, the hash is calculated against the query plan simplified by tolerating the cosmetic differences such as attribute names.
-
show
public void show(int numRows)
Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:

year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
- Parameters:
numRows- Number of rows to show- Since:
- 1.6.0
-
show
public void show()Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.- Since:
- 1.6.0
-
show
public void show(boolean truncate)
Displays the top 20 rows of the Dataset in a tabular form.
- Parameters:
truncate- Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right- Since:
- 1.6.0
-
show
public abstract void show(int numRows, boolean truncate)
Displays the Dataset in a tabular form. For example:

year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
- Parameters:
numRows- Number of rows to showtruncate- Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right- Since:
- 1.6.0
-
show
public void show(int numRows, int truncate)
Displays the Dataset in a tabular form. For example:

year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
- Parameters:
numRows- Number of rows to showtruncate- If set to more than 0, truncates strings totruncatecharacters and all cells will be aligned right.- Since:
- 1.6.0
-
show
public abstract void show(int numRows, int truncate, boolean vertical)
Displays the Dataset in a tabular form. For example:

year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521

If vertical is enabled, this command prints output rows vertically (one line per column value):

-RECORD 0-------------------
 year            | 1980
 month           | 12
 AVG('Adj Close) | 0.503218
 MAX('Adj Close) | 0.595103
-RECORD 1-------------------
 year            | 1981
 month           | 01
 AVG('Adj Close) | 0.523289
 MAX('Adj Close) | 0.570307
-RECORD 2-------------------
 year            | 1982
 month           | 02
 AVG('Adj Close) | 0.436504
 MAX('Adj Close) | 0.475256
-RECORD 3-------------------
 year            | 1983
 month           | 03
 AVG('Adj Close) | 0.410516
 MAX('Adj Close) | 0.442194
-RECORD 4-------------------
 year            | 1984
 month           | 04
 AVG('Adj Close) | 0.450090
 MAX('Adj Close) | 0.483521
- Parameters:
numRows- Number of rows to showtruncate- If set to more than 0, truncates strings totruncatecharacters and all cells will be aligned right.vertical- If set to true, prints output rows vertically (one line per column value).- Since:
- 2.3.0
-
sort
Returns a new Dataset sorted by the specified column, all in ascending order.

// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sort
Returns a new Dataset sorted by the given expressions. For example:

ds.sort($"col1", $"col2".desc)
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sort
Returns a new Dataset sorted by the specified column, all in ascending order.

// The following 3 are equivalent
ds.sort("sortcol")
ds.sort($"sortcol")
ds.sort($"sortcol".asc)
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sort
Returns a new Dataset sorted by the given expressions. For example:

ds.sort($"col1", $"col2".desc)
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sortWithinPartitions
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sortWithinPartitions
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sortWithinPartitions
public Dataset<T> sortWithinPartitions(String sortCol, scala.collection.immutable.Seq<String> sortCols)
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Parameters:
sortCol- (undocumented)sortCols- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sortWithinPartitions
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Parameters:
sortExprs- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
sparkSession
-
stat
Returns a DataFrameStatFunctions for working with statistic functions.

// Finding frequent items in column with name 'a'.
ds.stat.freqItems(Seq("a"))
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
storageLevel
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.- Returns:
- (undocumented)
- Since:
- 2.1.0
-
summary
Computes specified statistics for numeric and string columns. Available statistics are:- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g. 75%)
- count_distinct
- approx_count_distinct
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

ds.summary().show()
// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// 25%     24.0  176.0
// 50%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0

ds.summary("count", "min", "25%", "75%", "max").show()
// output:
// summary age   height
// count   10.0  10.0
// min     18.0  163.0
// 25%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0

To do a summary for specific columns first select them:

ds.select("age", "height").summary().show()

Specify statistics to output custom summaries:

ds.summary("count", "count_distinct").show()

The distinct count isn't included by default.
You can also run approximate distinct counts, which are faster:

ds.summary("count", "approx_count_distinct").show()

See also describe(java.lang.String...) for basic statistics.
- Parameters:
statistics- Statistics from above list to be computed.- Returns:
- (undocumented)
- Since:
- 2.3.0
-
summary
Computes specified statistics for numeric and string columns. Available statistics are:- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g. 75%)
- count_distinct
- approx_count_distinct
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

ds.summary().show()
// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// 25%     24.0  176.0
// 50%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0

ds.summary("count", "min", "25%", "75%", "max").show()
// output:
// summary age   height
// count   10.0  10.0
// min     18.0  163.0
// 25%     24.0  176.0
// 75%     32.0  180.0
// max     92.0  192.0

To do a summary for specific columns first select them:

ds.select("age", "height").summary().show()

Specify statistics to output custom summaries:

ds.summary("count", "count_distinct").show()

The distinct count isn't included by default.
You can also run approximate distinct counts, which are faster:

ds.summary("count", "approx_count_distinct").show()

See also describe(java.lang.String...) for basic statistics.
- Parameters:
statistics- Statistics from above list to be computed.- Returns:
- (undocumented)
- Since:
- 2.3.0
-
tail
Returns the last n rows in the Dataset.
Running tail requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.0.0
-
take
Returns the first n rows in the Dataset.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
takeAsList
Returns the first n rows in the Dataset as a list.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
- Parameters:
n- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
to
Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:- Reorder columns and/or inner fields by name to match the specified schema.
- Project away columns and/or inner fields that are not needed by the specified schema. Missing columns and/or inner fields (present in the specified schema but not input DataFrame) lead to failures.
- Cast the columns and/or inner fields to match the data types in the specified schema, if the types are compatible, e.g., numeric to numeric (error if overflows), but not string to int.
- Carry over the metadata from the specified schema, while the columns and/or inner fields still keep their own metadata if not overwritten by the specified schema.
- Fail if the nullability is not compatible. For example, the column and/or inner field is nullable but the specified schema requires them to be not nullable.
- Parameters:
schema- (undocumented)- Returns:
- (undocumented)
- Since:
- 3.4.0
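A minimal sketch reconciling a DataFrame to a narrower target schema (column names are illustrative):

import org.apache.spark.sql.types._

val target = StructType(Seq(
  StructField("name", StringType),
  StructField("age", LongType)))
// Reorders, projects and casts the input columns to match the target schema.
val reconciled = df.to(target)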
-
toDF
Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:

val rdd: RDD[(Int, String)] = ...
rdd.toDF()  // this implicit conversion creates a DataFrame with column names `_1` and `_2`
rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
toDF
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
toDF
Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This can be quite convenient in conversion from an RDD of tuples into a DataFrame with meaningful names. For example:

val rdd: RDD[(Int, String)] = ...
rdd.toDF()  // this implicit conversion creates a DataFrame with column names `_1` and `_2`
rdd.toDF("id", "name")  // this creates a DataFrame with column names "id" and "name"
- Parameters:
colNames- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
toJSON
Returns the content of the Dataset as a Dataset of JSON strings.- Returns:
- (undocumented)
- Since:
- 2.0.0
-
toJavaRDD
Returns the content of the Dataset as a JavaRDD of Ts.
- Returns:
- (undocumented)
- Since:
- 1.6.0
- Note:
- this is only supported in Classic.
-
toLocalIterator
Returns an iterator that contains all rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- this results in multiple Spark jobs, and if the input Dataset is the result of a wide transformation (e.g. join with different partitioners), the input Dataset should be cached first to avoid recomputing it.
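A minimal sketch; rows are fetched partition by partition rather than collected all at once:

val it = ds.toLocalIterator()
while (it.hasNext) {
  println(it.next())
}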
-
transform
Concise syntax for chaining custom transformations.

def featurize(ds: Dataset[T]): Dataset[U] = ...

ds
  .transform(featurize)
  .transform(...)
- Parameters:
t- (undocumented)- Returns:
- (undocumented)
- Since:
- 1.6.0
-
transpose
Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
Please note:
- All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
- The name of the column into which the original column names are transposed defaults to "key".
- null values in the index column are excluded from the column names for the transposed table, which are ordered in ascending order.

val df = Seq(("A", 1, 2), ("B", 3, 4)).toDF("id", "val1", "val2")
df.show()
// output:
// +---+----+----+
// | id|val1|val2|
// +---+----+----+
// |  A|   1|   2|
// |  B|   3|   4|
// +---+----+----+

df.transpose($"id").show()
// output:
// +----+---+---+
// | key|  A|  B|
// +----+---+---+
// |val1|  1|  3|
// |val2|  2|  4|
// +----+---+---+
// schema:
// root
//  |-- key: string (nullable = false)
//  |-- A: integer (nullable = true)
//  |-- B: integer (nullable = true)

df.transpose().show()
// output:
// +----+---+---+
// | key|  A|  B|
// +----+---+---+
// |val1|  1|  3|
// |val2|  2|  4|
// +----+---+---+
// schema:
// root
//  |-- key: string (nullable = false)
//  |-- A: integer (nullable = true)
//  |-- B: integer (nullable = true)
- Parameters:
indexColumn- The single column that will be treated as the index for the transpose operation. This column will be used to pivot the data, transforming the DataFrame such that the values of the indexColumn become the new columns in the transposed DataFrame.- Returns:
- (undocumented)
- Since:
- 4.0.0
-
transpose
Transposes a DataFrame, switching rows to columns. This function transforms the DataFrame such that the values in the first column become the new columns of the DataFrame.
This is equivalent to calling Dataset#transpose(Column) where indexColumn is set to the first column.
Please note:
- All columns except the index column must share a least common data type. Unless they are the same data type, all columns are cast to the nearest common data type.
- The name of the column into which the original column names are transposed defaults to "key".
- Non-"key" column names for the transposed table are ordered in ascending order.
- Returns:
- (undocumented)
- Since:
- 4.0.0
-
union
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct().
Also as standard in SQL, this function resolves columns by position (not by name):

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show

// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+

Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName(org.apache.spark.sql.Dataset<T>) to resolve columns by field name in the typed objects.
- Parameters:
other- (undocumented)- Returns:
- (undocumented)
- Since:
- 2.0.0
-
unionAll
Returns a new Dataset containing the union of rows in this Dataset and another Dataset. This is an alias for union.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct().

Also as standard in SQL, this function resolves columns by position (not by name).

- Parameters:
- other - (undocumented)
- Returns:
- (undocumented)
- Since:
- 2.0.0
-
unionByName
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct().

The difference between this function and union(org.apache.spark.sql.Dataset<T>) is that this function resolves columns by name (not by position):

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+

Note that this supports nested columns in struct and array types. Nested columns in map types are not currently supported.

- Parameters:
- other - (undocumented)
- Returns:
- (undocumented)
- Since:
- 2.3.0
-
unionByName
Returns a new Dataset containing the union of rows in this Dataset and another Dataset.

The difference between this function and union(org.apache.spark.sql.Dataset<T>) is that this function resolves columns by name (not by position).

When the parameter allowMissingColumns is true, the set of column names in this and the other Dataset can differ; missing columns will be filled with null. Further, the missing columns of this Dataset will be added at the end of the schema of the union result:

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
df1.unionByName(df2, true).show
// output: "col3" is missing on the left (df1) and is added at the end of the schema.
// +----+----+----+----+
// |col0|col1|col2|col3|
// +----+----+----+----+
// |   1|   2|   3|NULL|
// |   5|   4|NULL|   6|
// +----+----+----+----+

df2.unionByName(df1, true).show
// output: "col2" is missing on the left (df2) and is added at the end of the schema.
// +----+----+----+----+
// |col1|col0|col3|col2|
// +----+----+----+----+
// |   4|   5|   6|NULL|
// |   2|   1|NULL|   3|
// +----+----+----+----+

Note that this supports nested columns in struct and array types. With allowMissingColumns, missing nested columns of struct columns with the same name will also be filled with null values and added to the end of the struct. Nested columns in map types are not currently supported.

- Parameters:
- other - (undocumented)
- allowMissingColumns - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.1.0
-
unpersist
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

- Parameters:
- blocking - Whether to block until all blocks are deleted.
- Returns:
- (undocumented)
- Since:
- 1.6.0
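A minimal sketch of a typical cache/unpersist lifecycle; the ds Dataset is assumed to exist:

ds.persist()                     // mark ds for caching
ds.count()                       // an action materializes the cached blocks
ds.unpersist(blocking = true)    // remove the blocks, waiting until deletion completes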
-
unpersist
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.

- Returns:
- (undocumented)
- Since:
- 1.6.0
-
unpivot
public abstract Dataset<Row> unpivot(Column[] ids, Column[] values, String variableColumnName, String valueColumnName)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

This function is useful to massage a DataFrame into a format where some columns are identifier columns ("ids"), while all other columns ("values") are "unpivoted" to the rows, leaving just two non-id columns, named as given by variableColumnName and valueColumnName.

val df = Seq((1, 11, 12L), (2, 21, 22L)).toDF("id", "int", "long")
df.show()
// output:
// +---+---+----+
// | id|int|long|
// +---+---+----+
// |  1| 11|  12|
// |  2| 21|  22|
// +---+---+----+

df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value").show()
// output:
// +---+--------+-----+
// | id|variable|value|
// +---+--------+-----+
// |  1|     int|   11|
// |  1|    long|   12|
// |  2|     int|   21|
// |  2|    long|   22|
// +---+--------+-----+
// schema:
// root
//  |-- id: integer (nullable = false)
//  |-- variable: string (nullable = false)
//  |-- value: long (nullable = true)

When no "id" columns are given, the unpivoted DataFrame consists of only the "variable" and "value" columns.

All "value" columns must share a least common data type. Unless they are the same data type, all "value" columns are cast to the nearest common data type. For instance, types IntegerType and LongType are cast to LongType, while IntegerType and StringType do not have a common data type and unpivot fails with an AnalysisException.

- Parameters:
- ids - Id columns
- values - Value columns to unpivot
- variableColumnName - Name of the variable column
- valueColumnName - Name of the value column
- Returns:
- (undocumented)
- Since:
- 3.4.0
-
unpivot
public abstract Dataset<Row> unpivot(Column[] ids, String variableColumnName, String valueColumnName)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. This is the reverse of groupBy(...).pivot(...).agg(...), except for the aggregation, which cannot be reversed.

This is equivalent to calling Dataset#unpivot(Array, Array, String, String) where values is set to all non-id columns that exist in the DataFrame.

- Parameters:
- ids - Id columns
- variableColumnName - Name of the variable column
- valueColumnName - Name of the value column
- Returns:
- (undocumented)
- Since:
- 3.4.0
- See Also:
-
org.apache.spark.sql.Dataset.unpivot(Array, Array, String, String)
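As a sketch, using the df from the four-argument example above, the three-argument form unpivots every non-id column:

df.unpivot(Array($"id"), "variable", "value").show()
// equivalent to df.unpivot(Array($"id"), Array($"int", $"long"), "variable", "value")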
-
where
Filters rows using the given condition. This is an alias for filter.

// The following are equivalent:
peopleDs.filter($"age" > 15)
peopleDs.where($"age" > 15)

- Parameters:
- condition - (undocumented)
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
where
Filters rows using the given SQL expression.

peopleDs.where("age > 15")

- Parameters:
- conditionExpr - (undocumented)
- Returns:
- (undocumented)
- Since:
- 1.6.0
-
withColumn
Returns a new Dataset by adding a column or replacing the existing column that has the same name.

column's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.

- Parameters:
- colName - (undocumented)
- col - (undocumented)
- Returns:
- (undocumented)
- Since:
- 2.0.0
- Note:
- this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns, can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once (see the sketch below).
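A minimal sketch, assuming a people Dataset with name and age columns (illustrative only):

import org.apache.spark.sql.functions.{col, upper}

// Add (or replace) one derived column.
val withNext = people.withColumn("ageNextYear", col("age") + 1)

// Prefer a single select over many chained withColumn calls when adding several columns.
val withMany = people.select(
  col("*"),
  (col("age") + 1).as("ageNextYear"),
  upper(col("name")).as("nameUpper"))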
-
withColumnRenamed
Returns a new Dataset with a column renamed. This is a no-op if the schema doesn't contain existingName.

- Parameters:
- existingName - (undocumented)
- newName - (undocumented)
- Returns:
- (undocumented)
- Since:
- 2.0.0
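For instance, a sketch assuming a people Dataset with an age column:

val renamed = people.withColumnRenamed("age", "person_age")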
-
withColumns
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.

colsMap is a map of column name to column; each column must only refer to attributes supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

- Parameters:
- colsMap - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.3.0
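A minimal sketch, assuming a people Dataset with name and age columns (illustrative only):

import org.apache.spark.sql.functions.{col, upper}

val updated = people.withColumns(Map(
  "ageNextYear" -> (col("age") + 1),
  "nameUpper"   -> upper(col("name"))))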
-
withColumns
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that have the same names.

colsMap is a map of column name to column; each column must only refer to attributes supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

- Parameters:
- colsMap - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.3.0
-
withColumnsRenamed
public Dataset<Row> withColumnsRenamed(scala.collection.immutable.Map<String, String> colsMap) throws AnalysisException

(Scala-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn't contain the existing names.

colsMap is a map of existing column name to new column name.

- Parameters:
- colsMap - (undocumented)
- Returns:
- (undocumented)
- Throws:
- AnalysisException - if there are duplicate names in resulting projection
- Since:
- 3.4.0
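A minimal sketch, assuming a people Dataset with name and age columns (illustrative only):

val renamed = people.withColumnsRenamed(Map(
  "age"  -> "person_age",
  "name" -> "person_name"))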
-
withColumnsRenamed
(Java-specific) Returns a new Dataset with columns renamed. This is a no-op if the schema doesn't contain the existing names.

colsMap is a map of existing column name to new column name.

- Parameters:
- colsMap - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.4.0
-
withMetadata
Returns a new Dataset by updating an existing column with metadata.

- Parameters:
- columnName - (undocumented)
- metadata - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.3.0
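A minimal sketch, assuming a people Dataset with an age column; the metadata key is illustrative only:

import org.apache.spark.sql.types.MetadataBuilder

val ageMeta = new MetadataBuilder().putString("comment", "age in whole years").build()
val annotated = people.withMetadata("age", ageMeta)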
-
withWatermark
Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

Spark will use this watermark for several purposes:
- To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
- To minimize the amount of state that we need to keep for on-going aggregations, mapGroupsWithState and dropDuplicates operators.

The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.

- Parameters:
- eventTime - the name of the column that contains the event time of the row.
- delayThreshold - the minimum delay to wait for data to arrive late, relative to the latest record that has been processed, in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
- Returns:
- (undocumented)
- Since:
- 2.1.0
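A minimal streaming sketch, assuming an events streaming Dataset with an eventTime timestamp column (names are illustrative only):

import org.apache.spark.sql.functions.{col, window}

val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()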
-
write
Interface for saving the content of the non-streaming Dataset out into external storage.

- Returns:
- (undocumented)
- Since:
- 1.6.0
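A minimal sketch; the output path and format below are illustrative only:

df.write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/people_parquet")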
-
writeStream
Interface for saving the content of the streaming Dataset out into external storage.

- Returns:
- (undocumented)
- Since:
- 2.0.0
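A minimal sketch for a streaming Dataset; the sink, output mode, and checkpoint location below are illustrative only:

val query = streamingDf.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/console-sink")
  .start()
query.awaitTermination()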
-
writeTo
Create a write configuration builder for v2 sources.

This builder is used to configure and execute write operations. For example, to append to an existing table, run:

df.writeTo("catalog.db.table").append()

This can also be used to create or replace existing tables:

df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()

- Parameters:
- table - (undocumented)
- Returns:
- (undocumented)
- Since:
- 3.0.0
-