Computes the absolute value of a numeric value.
Computes the absolute value of a numeric value.
1.3.0
inverse cosine of columnName
, as if computed by java.lang.Math.acos
1.4.0
inverse cosine of e
in radians, as if computed by java.lang.Math.acos
1.4.0
Returns the date that is numMonths
after startDate
.
Returns the date that is numMonths
after startDate
.
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
The number of months to add to startDate
, can be negative to subtract months
A date, or null if startDate
was a string that could not be cast to a date
1.5.0
Aggregate function: returns the approximate number of distinct items in a group.
Aggregate function: returns the approximate number of distinct items in a group.
maximum estimation error allowed (default = 0.05)
2.1.0
Aggregate function: returns the approximate number of distinct items in a group.
Aggregate function: returns the approximate number of distinct items in a group.
maximum estimation error allowed (default = 0.05)
2.1.0
Aggregate function: returns the approximate number of distinct items in a group.
Aggregate function: returns the approximate number of distinct items in a group.
2.1.0
Aggregate function: returns the approximate number of distinct items in a group.
Aggregate function: returns the approximate number of distinct items in a group.
2.1.0
Creates a new array column.
Creates a new array column. The input columns must all have the same data type.
1.4.0
Creates a new array column.
Creates a new array column. The input columns must all have the same data type.
1.4.0
Returns null if the array is null, true if the array contains value
, and false otherwise.
Returns null if the array is null, true if the array contains value
, and false otherwise.
1.5.0
Removes duplicate values from the array.
Removes duplicate values from the array.
2.4.0
Returns an array of the elements in the first array but not in the second array, without duplicates.
Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined
2.4.0
Returns an array of the elements in the intersection of the given two arrays, without duplicates.
Returns an array of the elements in the intersection of the given two arrays, without duplicates.
2.4.0
Concatenates the elements of column
using the delimiter
.
Concatenates the elements of column
using the delimiter
.
2.4.0
Concatenates the elements of column
using the delimiter
.
Concatenates the elements of column
using the delimiter
. Null values are replaced with
nullReplacement
.
2.4.0
Returns the maximum value in the array.
Returns the maximum value in the array.
2.4.0
Returns the minimum value in the array.
Returns the minimum value in the array.
2.4.0
Locates the position of the first occurrence of the value in the given array as long.
Locates the position of the first occurrence of the value in the given array as long. Returns null if either of the arguments are null.
2.4.0
The position is not zero based, but 1 based index. Returns 0 if value could not be found in array.
Remove all elements that equal to element from the given array.
Remove all elements that equal to element from the given array.
2.4.0
Creates an array containing the left argument repeated the number of times given by the right argument.
Creates an array containing the left argument repeated the number of times given by the right argument.
2.4.0
Creates an array containing the left argument repeated the number of times given by the right argument.
Creates an array containing the left argument repeated the number of times given by the right argument.
2.4.0
Sorts the input array in ascending order.
Sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
2.4.0
Returns an array of the elements in the union of the given two arrays, without duplicates.
Returns an array of the elements in the union of the given two arrays, without duplicates.
2.4.0
Returns true
if a1
and a2
have at least one non-null element in common.
Returns true
if a1
and a2
have at least one non-null element in common. If not and both
the arrays are non-empty and any of them contains a null
, it returns null
. It returns
false
otherwise.
2.4.0
Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
2.4.0
Returns a sort expression based on ascending order of the column.
Returns a sort expression based on ascending order of the column.
df.sort(asc("dept"), desc("age"))
1.3.0
Returns a sort expression based on ascending order of the column, and null values return before non-null values.
Returns a sort expression based on ascending order of the column, and null values return before non-null values.
df.sort(asc_nulls_first("dept"), desc("age"))
2.1.0
Returns a sort expression based on ascending order of the column, and null values appear after non-null values.
Returns a sort expression based on ascending order of the column, and null values appear after non-null values.
df.sort(asc_nulls_last("dept"), desc("age"))
2.1.0
Computes the numeric value of the first character of the string column, and returns the result as an int column.
Computes the numeric value of the first character of the string column, and returns the result as an int column.
1.5.0
inverse sine of columnName
, as if computed by java.lang.Math.asin
1.4.0
inverse sine of e
in radians, as if computed by java.lang.Math.asin
1.4.0
inverse tangent of columnName
, as if computed by java.lang.Math.atan
1.4.0
inverse tangent of e
, as if computed by java.lang.Math.atan
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
coordinate on y-axis
coordinate on x-axis
the theta component of the point
(r, theta)
in polar coordinates that corresponds to the point
(x, y) in Cartesian coordinates,
as if computed by java.lang.Math.atan2
1.4.0
Aggregate function: returns the average of the values in a group.
Aggregate function: returns the average of the values in a group.
1.3.0
Aggregate function: returns the average of the values in a group.
Aggregate function: returns the average of the values in a group.
1.3.0
Computes the BASE64 encoding of a binary column and returns it as a string column.
Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
1.5.0
An expression that returns the string representation of the binary value of the given long column.
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
1.5.0
An expression that returns the string representation of the binary value of the given long column.
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
1.5.0
Computes bitwise NOT (~) of a number.
Computes bitwise NOT (~) of a number.
1.4.0
Marks a DataFrame as small enough for use in broadcast joins.
Marks a DataFrame as small enough for use in broadcast joins.
The following example marks the right DataFrame for broadcast hash join using joinKey
.
// left and right are DataFrames left.join(broadcast(right), "joinKey")
1.5.0
Round the value of e
to scale
decimal places with HALF_EVEN round mode
if scale
is greater than or equal to 0 or at integral part when scale
is less than 0.
Round the value of e
to scale
decimal places with HALF_EVEN round mode
if scale
is greater than or equal to 0 or at integral part when scale
is less than 0.
2.0.0
Returns the value of the column e
rounded to 0 decimal places with HALF_EVEN round mode.
Returns the value of the column e
rounded to 0 decimal places with HALF_EVEN round mode.
2.0.0
Call an user-defined function.
Call an user-defined function. Example:
import org.apache.spark.sql._ val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") val spark = df.sparkSession spark.udf.register("simpleUDF", (v: Int) => v * v) df.select($"id", callUDF("simpleUDF", $"value"))
1.5.0
Computes the cube-root of the given column.
Computes the cube-root of the given column.
1.4.0
Computes the cube-root of the given value.
Computes the cube-root of the given value.
1.4.0
Computes the ceiling of the given column.
Computes the ceiling of the given column.
1.4.0
Computes the ceiling of the given value.
Computes the ceiling of the given value.
1.4.0
Returns the first column that is not null, or null if all inputs are null.
Returns the first column that is not null, or null if all inputs are null.
For example, coalesce(a, b, c)
will return a if a is not null,
or b if a is null and b is not null, or c if both a and b are null but c is not null.
1.3.0
Returns a Column based on the given column name.
Returns a Column based on the given column name.
1.3.0
Aggregate function: returns a list of objects with duplicates.
Aggregate function: returns a list of objects with duplicates.
1.6.0
The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns a list of objects with duplicates.
Aggregate function: returns a list of objects with duplicates.
1.6.0
The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns a set of objects with duplicate elements eliminated.
Aggregate function: returns a set of objects with duplicate elements eliminated.
1.6.0
The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns a set of objects with duplicate elements eliminated.
Aggregate function: returns a set of objects with duplicate elements eliminated.
1.6.0
The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle.
Returns a Column based on the given column name.
Concatenates multiple input columns together into a single column.
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
1.5.0
Concatenates multiple input string columns together into a single string column, using the given separator.
Concatenates multiple input string columns together into a single string column, using the given separator.
1.5.0
Convert a number in a string column from one base to another.
Convert a number in a string column from one base to another.
1.5.0
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
1.6.0
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
1.6.0
angle in radians
cosine of the angle, as if computed by java.lang.Math.cos
1.4.0
angle in radians
cosine of the angle, as if computed by java.lang.Math.cos
1.4.0
hyperbolic angle
hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh
1.4.0
hyperbolic angle
hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh
1.4.0
Aggregate function: returns the number of items in a group.
Aggregate function: returns the number of items in a group.
1.3.0
Aggregate function: returns the number of items in a group.
Aggregate function: returns the number of items in a group.
1.3.0
Aggregate function: returns the number of distinct items in a group.
Aggregate function: returns the number of distinct items in a group.
1.3.0
Aggregate function: returns the number of distinct items in a group.
Aggregate function: returns the number of distinct items in a group.
1.3.0
Aggregate function: returns the population covariance for two columns.
Aggregate function: returns the population covariance for two columns.
2.0.0
Aggregate function: returns the population covariance for two columns.
Aggregate function: returns the population covariance for two columns.
2.0.0
Aggregate function: returns the sample covariance for two columns.
Aggregate function: returns the sample covariance for two columns.
2.0.0
Aggregate function: returns the sample covariance for two columns.
Aggregate function: returns the sample covariance for two columns.
2.0.0
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
1.5.0
Window function: returns the cumulative distribution of values within a window partition, i.e.
Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
N = total number of rows in the partition cumeDist(x) = number of values before (and including) x / N
1.6.0
Returns the current date as a date column.
Returns the current date as a date column.
1.5.0
Returns the current timestamp as a timestamp column.
Returns the current timestamp as a timestamp column.
1.5.0
Returns the date that is days
days after start
Returns the date that is days
days after start
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
The number of days to add to start
, can be negative to subtract days
A date, or null if start
was a string that could not be cast to a date
1.5.0
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
See java.text.SimpleDateFormat for valid date and time format patterns
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A pattern dd.MM.yyyy
would return a string like 18.03.1993
A string, or null if dateExpr
was a string that could not be cast to a timestamp
1.5.0
IllegalArgumentException
if the format
pattern is invalid
Use specialized functions like year whenever possible as they benefit from a specialized implementation.
Returns the date that is days
days before start
Returns the date that is days
days before start
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
The number of days to subtract from start
, can be negative to add days
A date, or null if start
was a string that could not be cast to a date
1.5.0
Returns timestamp truncated to the unit specified by the format.
Returns timestamp truncated to the unit specified by the format.
For example, date_tunc("2018-11-19 12:01:19", "year")
returns 2018-01-01 00:00:00
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A timestamp, or null if timestamp
was a string that could not be cast to a timestamp
or format
was an invalid value
2.3.0
Returns the number of days from start
to end
.
Returns the number of days from start
to end
.
Only considers the date part of the input. For example:
dateddiff("2018-01-10 00:00:00", "2018-01-09 23:59:59") // returns 1
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
An integer, or null if either end
or start
were strings that could not be cast to
a date. Negative if end
is before start
1.5.0
Extracts the day of the month as an integer from a given date/timestamp/string.
Extracts the day of the month as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Extracts the day of the week as an integer from a given date/timestamp/string.
Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
An integer, or null if the input was a string that could not be cast to a date
2.3.0
Extracts the day of the year as an integer from a given date/timestamp/string.
Extracts the day of the year as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.
1.5.0
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
angle in radians
angle in degrees, as if computed by java.lang.Math.toDegrees
2.1.0
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
angle in radians
angle in degrees, as if computed by java.lang.Math.toDegrees
2.1.0
Window function: returns the rank of rows within a window partition, without any gaps.
Window function: returns the rank of rows within a window partition, without any gaps.
The difference between rank and dense_rank is that denseRank leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank would give me sequential numbers, making the person that came in third place (after the ties) would register as coming in fifth.
This is equivalent to the DENSE_RANK function in SQL.
1.6.0
Returns a sort expression based on the descending order of the column.
Returns a sort expression based on the descending order of the column.
df.sort(asc("dept"), desc("age"))
1.3.0
Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
df.sort(asc("dept"), desc_nulls_first("age"))
2.1.0
Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
df.sort(asc("dept"), desc_nulls_last("age"))
2.1.0
Returns element of array at given index in value if column is array.
Returns element of array at given index in value if column is array. Returns value for the given key in value if column is map.
2.4.0
Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.
1.5.0
Computes the exponential of the given column.
Computes the exponential of the given column.
1.4.0
Computes the exponential of the given value.
Computes the exponential of the given value.
1.4.0
Creates a new row for each element in the given array or map column.
Creates a new row for each element in the given array or map column.
1.3.0
Creates a new row for each element in the given array or map column.
Creates a new row for each element in the given array or map column. Unlike explode, if the array/map is null or empty then null is produced.
2.2.0
Computes the exponential of the given column minus one.
Computes the exponential of the given column minus one.
1.4.0
Computes the exponential of the given value minus one.
Computes the exponential of the given value minus one.
1.4.0
Parses the expression string into the column that it represents, similar to Dataset#selectExpr.
Parses the expression string into the column that it represents, similar to Dataset#selectExpr.
// get the number of words of each length df.groupBy(expr("length(word)")).count()
Computes the factorial of the given value.
Computes the factorial of the given value.
1.5.0
Aggregate function: returns the first value of a column in a group.
Aggregate function: returns the first value of a column in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
1.3.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the first value in a group.
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
1.3.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the first value of a column in a group.
Aggregate function: returns the first value of a column in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
2.0.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the first value in a group.
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
2.0.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Creates a single array from an array of arrays.
Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
2.4.0
Computes the floor of the given column.
Computes the floor of the given column.
1.4.0
Computes the floor of the given value.
Computes the floor of the given value.
1.4.0
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.
1.5.0
Formats the arguments in printf-style and returns the result as a string column.
Formats the arguments in printf-style and returns the result as a string column.
1.5.0
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
of StructType
s with the specified schema.
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
of StructType
s with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
options to control how the json is parsed. accepts the same options and the json data source.
2.4.0
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
of StructType
s with the specified schema.
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
of StructType
s with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
2.4.0
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string as a json string, it could be a JSON format string or a DDL-formatted string.
2.3.0
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string as a json string. In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for the schema.
2.1.0
Parses a column containing a JSON string into a MapType
with StringType
as keys type,
StructType
or ArrayType
with the specified schema.
Parses a column containing a JSON string into a MapType
with StringType
as keys type,
StructType
or ArrayType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
2.2.0
Parses a column containing a JSON string into a StructType
with the specified schema.
Parses a column containing a JSON string into a StructType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
2.1.0
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
(Java-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
options to control how the json is parsed. accepts the same options and the json data source.
2.2.0
(Java-specific) Parses a column containing a JSON string into a StructType
with the
specified schema.
(Java-specific) Parses a column containing a JSON string into a StructType
with the
specified schema. Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
options to control how the json is parsed. accepts the same options and the json data source.
2.1.0
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
(Scala-specific) Parses a column containing a JSON string into a MapType
with StringType
as keys type, StructType
or ArrayType
with the specified schema.
Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
options to control how the json is parsed. accepts the same options and the json data source.
2.2.0
(Scala-specific) Parses a column containing a JSON string into a StructType
with the
specified schema.
(Scala-specific) Parses a column containing a JSON string into a StructType
with the
specified schema. Returns null
, in the case of an unparseable string.
a string column containing JSON data.
the schema to use when parsing the json string
options to control how the json is parsed. Accepts the same options as the json data source.
2.1.0
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
See java.text.SimpleDateFormat for valid date and time format patterns
A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
A date time pattern that the input will be formatted to
A string, or null if ut
was a string that could not be cast to a long or f
was
an invalid date time pattern
1.5.0
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
A string, or null if the input was a string that could not be cast to a long
1.5.0
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone.
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
2.4.0
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone.
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A string detailing the time zone that the input should be adjusted to, such as
Europe/London
, PST
or GMT+5
A timestamp, or null if ts
was a string that could not be cast to a timestamp or
tz
was an invalid value
1.5.0
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.
1.6.0
Returns the greatest value of the list of column names, skipping null values.
Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
1.5.0
Returns the greatest value of the list of values, skipping null values.
Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
1.5.0
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
2.0.0
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
2.0.0
Aggregate function: returns the level of grouping, equals to
Aggregate function: returns the level of grouping, equals to
(grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn)
2.0.0
The list of columns should match with grouping columns exactly.
Aggregate function: returns the level of grouping, equals to
Aggregate function: returns the level of grouping, equals to
(grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn)
2.0.0
The list of columns should match with grouping columns exactly, or empty (means all the grouping columns).
Calculates the hash code of given columns, and returns the result as an int column.
Calculates the hash code of given columns, and returns the result as an int column.
2.0.0
Computes hex value of the given column.
Computes hex value of the given column.
1.5.0
Extracts the hours as an integer from a given date/timestamp/string.
Extracts the hours as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
Computes sqrt(a^{2} + b^{2})
without intermediate overflow or underflow.
1.4.0
Returns a new string column by converting the first letter of each word to uppercase.
Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.
For example, "hello world" will become "Hello World".
1.5.0
Creates a string column for the file name of the current Spark task.
Creates a string column for the file name of the current Spark task.
1.6.0
Locate the position of the first occurrence of substr column in the given string.
Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
1.5.0
The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str.
Return true iff the column is NaN.
Return true iff the column is NaN.
1.6.0
Return true iff the column is null.
Return true iff the column is null.
1.6.0
Creates a new row for a json column according to the given field names.
Creates a new row for a json column according to the given field names.
1.6.0
Aggregate function: returns the kurtosis of the values in a group.
Aggregate function: returns the kurtosis of the values in a group.
1.6.0
Aggregate function: returns the kurtosis of the values in a group.
Aggregate function: returns the kurtosis of the values in a group.
1.6.0
Window function: returns the value that is offset
rows before the current row, and
defaultValue
if there is less than offset
rows before the current row.
Window function: returns the value that is offset
rows before the current row, and
defaultValue
if there is less than offset
rows before the current row. For example,
an offset
of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
1.4.0
Window function: returns the value that is offset
rows before the current row, and
defaultValue
if there is less than offset
rows before the current row.
Window function: returns the value that is offset
rows before the current row, and
defaultValue
if there is less than offset
rows before the current row. For example,
an offset
of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
1.4.0
Window function: returns the value that is offset
rows before the current row, and
null
if there is less than offset
rows before the current row.
Window function: returns the value that is offset
rows before the current row, and
null
if there is less than offset
rows before the current row. For example,
an offset
of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
1.4.0
Window function: returns the value that is offset
rows before the current row, and
null
if there is less than offset
rows before the current row.
Window function: returns the value that is offset
rows before the current row, and
null
if there is less than offset
rows before the current row. For example,
an offset
of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
1.4.0
Aggregate function: returns the last value of the column in a group.
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
1.3.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the last value in a group.
Aggregate function: returns the last value in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
1.3.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the last value of the column in a group.
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
2.0.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Aggregate function: returns the last value in a group.
Aggregate function: returns the last value in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
2.0.0
The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
Returns the last day of the month which the given date belongs to.
Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date, or null if the input was a string that could not be cast to a date
1.5.0
Window function: returns the value that is offset
rows after the current row, and
defaultValue
if there is less than offset
rows after the current row.
Window function: returns the value that is offset
rows after the current row, and
defaultValue
if there is less than offset
rows after the current row. For example,
an offset
of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
1.4.0
Window function: returns the value that is offset
rows after the current row, and
defaultValue
if there is less than offset
rows after the current row.
Window function: returns the value that is offset
rows after the current row, and
defaultValue
if there is less than offset
rows after the current row. For example,
an offset
of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
1.4.0
Window function: returns the value that is offset
rows after the current row, and
null
if there is less than offset
rows after the current row.
Window function: returns the value that is offset
rows after the current row, and
null
if there is less than offset
rows after the current row. For example,
an offset
of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
1.4.0
Window function: returns the value that is offset
rows after the current row, and
null
if there is less than offset
rows after the current row.
Window function: returns the value that is offset
rows after the current row, and
null
if there is less than offset
rows after the current row. For example,
an offset
of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
1.4.0
Returns the least value of the list of column names, skipping null values.
Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
1.5.0
Returns the least value of the list of values, skipping null values.
Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
1.5.0
Computes the character length of a given string or number of bytes of a binary string.
Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
1.5.0
Computes the Levenshtein distance of the two given string columns.
Computes the Levenshtein distance of the two given string columns.
1.5.0
Creates a Column of literal value.
Locate the position of the first occurrence of substr in a string column, after position pos.
Locate the position of the first occurrence of substr in a string column, after position pos.
1.5.0
The position is not zero based, but 1 based index. returns 0 if substr could not be found in str.
Locate the position of the first occurrence of substr.
Locate the position of the first occurrence of substr.
1.5.0
The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str.
Returns the first argument-base logarithm of the second argument.
Returns the first argument-base logarithm of the second argument.
1.4.0
Returns the first argument-base logarithm of the second argument.
Returns the first argument-base logarithm of the second argument.
1.4.0
Computes the natural logarithm of the given column.
Computes the natural logarithm of the given column.
1.4.0
Computes the natural logarithm of the given value.
Computes the natural logarithm of the given value.
1.4.0
Computes the logarithm of the given value in base 10.
Computes the logarithm of the given value in base 10.
1.4.0
Computes the logarithm of the given value in base 10.
Computes the logarithm of the given value in base 10.
1.4.0
Computes the natural logarithm of the given column plus one.
Computes the natural logarithm of the given column plus one.
1.4.0
Computes the natural logarithm of the given value plus one.
Computes the natural logarithm of the given value plus one.
1.4.0
Computes the logarithm of the given value in base 2.
Computes the logarithm of the given value in base 2.
1.5.0
Computes the logarithm of the given column in base 2.
Computes the logarithm of the given column in base 2.
1.5.0
Converts a string column to lower case.
Converts a string column to lower case.
1.3.0
Left-pad the string column with pad to a length of len.
Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
1.5.0
Trim the specified character string from left end for the specified string column.
Trim the specified character string from left end for the specified string column.
2.3.0
Trim the spaces from left end for the specified string value.
Trim the spaces from left end for the specified string value.
1.5.0
Creates a new map column.
Creates a new map column. The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't be null. The value columns must all have the same data type.
2.0
Returns the union of all the given maps.
Returns the union of all the given maps.
2.4.0
Creates a new map column.
Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. All elements in the array for key should not be null.
2.4
Returns a map created from the given array of entries.
Returns a map created from the given array of entries.
2.4.0
Returns an unordered array containing the keys of the map.
Returns an unordered array containing the keys of the map.
2.3.0
Returns an unordered array containing the values of the map.
Returns an unordered array containing the values of the map.
2.3.0
Aggregate function: returns the maximum value of the column in a group.
Aggregate function: returns the maximum value of the column in a group.
1.3.0
Aggregate function: returns the maximum value of the expression in a group.
Aggregate function: returns the maximum value of the expression in a group.
1.3.0
Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.
Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.
1.5.0
Aggregate function: returns the average of the values in a group.
Aggregate function: returns the average of the values in a group. Alias for avg.
1.4.0
Aggregate function: returns the average of the values in a group.
Aggregate function: returns the average of the values in a group. Alias for avg.
1.4.0
Aggregate function: returns the minimum value of the column in a group.
Aggregate function: returns the minimum value of the column in a group.
1.3.0
Aggregate function: returns the minimum value of the expression in a group.
Aggregate function: returns the minimum value of the expression in a group.
1.3.0
Extracts the minutes as an integer from a given date/timestamp/string.
Extracts the minutes as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
A column expression that generates monotonically increasing 64-bit integers.
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame
with two partitions, each with 3 records.
This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
1.6.0
Extracts the month as an integer from a given date/timestamp/string.
Extracts the month as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Returns number of months between dates end
and start
.
Returns number of months between dates end
and start
. If roundOff
is set to true, the
result is rounded off to 8 digits; it is not rounded otherwise.
2.4.0
Returns number of months between dates start
and end
.
Returns number of months between dates start
and end
.
A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
For example:
months_between("2017-11-14", "2017-07-14") // returns 4.0 months_between("2017-01-01", "2017-01-10") // returns 0.29032258 months_between("2017-06-01", "2017-06-16 12:00:00") // returns -0.5
A date, timestamp or string. If a string, the data must be in a format that can
be cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date, timestamp or string. If a string, the data must be in a format that can
cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A double, or null if either end
or start
were strings that could not be cast to a
timestamp. Negative if end
is before start
1.5.0
Returns col1 if it is not NaN, or col2 if col1 is NaN.
Returns col1 if it is not NaN, or col2 if col1 is NaN.
Both inputs should be floating point columns (DoubleType or FloatType).
1.5.0
Unary minus, i.e.
Unary minus, i.e. negate the expression.
// Select the amount column and negates all values. // Scala: df.select( -df("amount") ) // Java: df.select( negate(df.col("amount")) );
1.3.0
Returns the first date which is later than the value of the date
column that is on the
specified day of the week.
Returns the first date which is later than the value of the date
column that is on the
specified day of the week.
For example, next_day('2015-07-27', "Sunday")
returns 2015-08-02 because that is the first
Sunday after 2015-07-27.
A date, timestamp or string. If a string, the data must be in a format that
can be cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
A date, or null if date
was a string that could not be cast to a date or if
dayOfWeek
was an invalid value
1.5.0
Inversion of boolean expression, i.e.
Inversion of boolean expression, i.e. NOT.
// Scala: select rows that are not active (isActive === false) df.filter( !df("isActive") ) // Java: df.filter( not(df.col("isActive")) );
1.3.0
Window function: returns the ntile group id (from 1 to n
inclusive) in an ordered window
partition.
Window function: returns the ntile group id (from 1 to n
inclusive) in an ordered window
partition. For example, if n
is 4, the first quarter of the rows will get value 1, the second
quarter will get 2, the third quarter will get 3, and the last quarter will get 4.
This is equivalent to the NTILE function in SQL.
1.4.0
Window function: returns the relative rank (i.e.
Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
This is computed by:
(rank of row in its partition - 1) / (number of rows in the partition - 1)
This is equivalent to the PERCENT_RANK function in SQL.
1.6.0
Returns the positive value of dividend mod divisor.
Returns the positive value of dividend mod divisor.
1.5.0
Creates a new row for each element with position in the given array or map column.
Creates a new row for each element with position in the given array or map column.
2.1.0
Creates a new row for each element with position in the given array or map column.
Creates a new row for each element with position in the given array or map column. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.
2.2.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Returns the value of the first argument raised to the power of the second argument.
Returns the value of the first argument raised to the power of the second argument.
1.4.0
Extracts the quarter as an integer from a given date/timestamp/string.
Extracts the quarter as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
angle in degrees
angle in radians, as if computed by java.lang.Math.toRadians
2.1.0
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
angle in degrees
angle in radians, as if computed by java.lang.Math.toRadians
2.1.0
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
1.4.0
The function is non-deterministic in general case.
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
1.4.0
The function is non-deterministic in general case.
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
1.4.0
The function is non-deterministic in general case.
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
1.4.0
The function is non-deterministic in general case.
Window function: returns the rank of rows within a window partition.
Window function: returns the rank of rows within a window partition.
The difference between rank and dense_rank is that dense_rank leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank would give me sequential numbers, making the person that came in third place (after the ties) would register as coming in fifth.
This is equivalent to the RANK function in SQL.
1.4.0
Extract a specific group matched by a Java regex, from the specified string column.
Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.
1.5.0
Replace all substrings of the specified string value that match regexp with rep.
Replace all substrings of the specified string value that match regexp with rep.
2.1.0
Replace all substrings of the specified string value that match regexp with rep.
Replace all substrings of the specified string value that match regexp with rep.
1.5.0
Repeats a string column n times, and returns it as a new string column.
Repeats a string column n times, and returns it as a new string column.
1.5.0
Returns a reversed string or an array with reverse order of elements.
Returns a reversed string or an array with reverse order of elements.
1.5.0
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
1.4.0
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
1.4.0
Round the value of e
to scale
decimal places with HALF_UP round mode
if scale
is greater than or equal to 0 or at integral part when scale
is less than 0.
Round the value of e
to scale
decimal places with HALF_UP round mode
if scale
is greater than or equal to 0 or at integral part when scale
is less than 0.
1.5.0
Returns the value of the column e
rounded to 0 decimal places with HALF_UP round mode.
Returns the value of the column e
rounded to 0 decimal places with HALF_UP round mode.
1.5.0
Window function: returns a sequential number starting at 1 within a window partition.
Window function: returns a sequential number starting at 1 within a window partition.
1.6.0
Right-pad the string column with pad to a length of len.
Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
1.5.0
Trim the specified character string from right end for the specified string column.
Trim the specified character string from right end for the specified string column.
2.3.0
Trim the spaces from right end for the specified string value.
Trim the spaces from right end for the specified string value.
1.5.0
Parses a JSON string and infers its schema in DDL format.
Parses a JSON string and infers its schema in DDL format.
a string literal containing a JSON string.
2.4.0
Parses a JSON string and infers its schema in DDL format.
Parses a JSON string and infers its schema in DDL format.
a JSON string.
2.4.0
Extracts the seconds as an integer from a given date/timestamp/string.
Extracts the seconds as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a timestamp
1.5.0
Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1.
Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1.
2.4.0
Generate a sequence of integers from start to stop, incrementing by step.
Generate a sequence of integers from start to stop, incrementing by step.
2.4.0
Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.
Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.
1.5.0
Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.
Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.
column to compute SHA-2 on.
one of 224, 256, 384, or 512.
1.5.0
Shift the given value numBits left.
Shift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value.
1.5.0
(Signed) shift the given value numBits right.
(Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.
1.5.0
Unsigned shift the given value numBits right.
Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.
1.5.0
Returns a random permutation of the given array.
Returns a random permutation of the given array.
2.4.0
The function is non-deterministic.
Computes the signum of the given column.
Computes the signum of the given column.
1.4.0
Computes the signum of the given value.
Computes the signum of the given value.
1.4.0
angle in radians
sine of the angle, as if computed by java.lang.Math.sin
1.4.0
angle in radians
sine of the angle, as if computed by java.lang.Math.sin
1.4.0
hyperbolic angle
hyperbolic sine of the given value, as if computed by java.lang.Math.sinh
1.4.0
hyperbolic angle
hyperbolic sine of the given value, as if computed by java.lang.Math.sinh
1.4.0
Returns length of array or map.
Returns length of array or map.
1.5.0
Aggregate function: returns the skewness of the values in a group.
Aggregate function: returns the skewness of the values in a group.
1.6.0
Aggregate function: returns the skewness of the values in a group.
Aggregate function: returns the skewness of the values in a group.
1.6.0
Returns an array containing all the elements in x
from index start
(or starting from the
end if start
is negative) with the specified length
.
Returns an array containing all the elements in x
from index start
(or starting from the
end if start
is negative) with the specified length
.
the array column to be sliced
the starting index
the length of the slice
2.4.0
Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements.
Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order.
1.5.0
Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements.
Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array.
1.5.0
Returns the soundex code for the specified expression.
Returns the soundex code for the specified expression.
1.5.0
Partition ID.
Partition ID.
1.6.0
This is non-deterministic because it depends on data partitioning and task scheduling.
Splits str around pattern (pattern is a regular expression).
Splits str around pattern (pattern is a regular expression).
1.5.0
Pattern is a string representation of the regular expression.
Computes the square root of the specified float value.
Computes the square root of the specified float value.
1.5.0
Computes the square root of the specified float value.
Computes the square root of the specified float value.
1.3.0
Aggregate function: alias for stddev_samp
.
Aggregate function: alias for stddev_samp
.
1.6.0
Aggregate function: alias for stddev_samp
.
Aggregate function: alias for stddev_samp
.
1.6.0
Aggregate function: returns the population standard deviation of the expression in a group.
Aggregate function: returns the population standard deviation of the expression in a group.
1.6.0
Aggregate function: returns the population standard deviation of the expression in a group.
Aggregate function: returns the population standard deviation of the expression in a group.
1.6.0
Aggregate function: returns the sample standard deviation of the expression in a group.
Aggregate function: returns the sample standard deviation of the expression in a group.
1.6.0
Aggregate function: returns the sample standard deviation of the expression in a group.
Aggregate function: returns the sample standard deviation of the expression in a group.
1.6.0
Creates a new struct column that composes multiple input columns.
Creates a new struct column that composes multiple input columns.
1.4.0
Creates a new struct column.
Creates a new struct column.
If the input column is a column in a DataFrame
, or a derived column expression
that is named (i.e. aliased), its name would be retained as the StructField's name,
otherwise, the newly generated StructField's name would be auto generated as
col
with a suffix index + 1
, i.e. col1, col2, col3, ...
1.4.0
Substring starts at pos
and is of length len
when str is String type or
returns the slice of byte array that starts at pos
in byte and is of length len
when str is Binary type
Substring starts at pos
and is of length len
when str is String type or
returns the slice of byte array that starts at pos
in byte and is of length len
when str is Binary type
1.5.0
The position is not zero based, but 1 based index.
Returns the substring from string str before count occurrences of the delimiter delim.
Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything the left of the final delimiter (counting from left) is returned. If count is negative, every to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
Aggregate function: returns the sum of all values in the given column.
Aggregate function: returns the sum of all values in the given column.
1.3.0
Aggregate function: returns the sum of all values in the expression.
Aggregate function: returns the sum of all values in the expression.
1.3.0
Aggregate function: returns the sum of distinct values in the expression.
Aggregate function: returns the sum of distinct values in the expression.
1.3.0
Aggregate function: returns the sum of distinct values in the expression.
Aggregate function: returns the sum of distinct values in the expression.
1.3.0
angle in radians
tangent of the given value, as if computed by java.lang.Math.tan
1.4.0
angle in radians
tangent of the given value, as if computed by java.lang.Math.tan
1.4.0
hyperbolic angle
hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh
1.4.0
hyperbolic angle
hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh
1.4.0
Converts the column into a DateType
with a specified format
Converts the column into a DateType
with a specified format
See java.text.SimpleDateFormat for valid date and time format patterns
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date time pattern detailing the format of e
when e
is a string
A date, or null if e
was a string that could not be cast to a date or fmt
was an
invalid format
2.2.0
Converts the column into DateType
by casting rules to DateType
.
Converts the column into DateType
by casting rules to DateType
.
1.5.0
Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
Throws an exception, in the case of an unsupported type.
a column containing a struct, an array or a map.
2.1.0
(Java-specific) Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
(Java-specific) Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
Throws an exception, in the case of an unsupported type.
a column containing a struct, an array or a map.
options to control how the struct column is converted into a json string. accepts the same options and the json data source.
2.1.0
(Scala-specific) Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
(Scala-specific) Converts a column containing a StructType
, ArrayType
or
a MapType
into a JSON string with the specified schema.
Throws an exception, in the case of an unsupported type.
a column containing a struct, an array or a map.
options to control how the struct column is converted into a json string. accepts the same options and the json data source.
2.1.0
Converts time string with the given pattern to timestamp.
Converts time string with the given pattern to timestamp.
See java.text.SimpleDateFormat for valid date and time format patterns
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss
A date time pattern detailing the format of s
when s
is a string
A timestamp, or null if s
was a string that could not be cast to a timestamp or
fmt
was an invalid format
2.2.0
Converts to a timestamp by casting rules to TimestampType
.
Converts to a timestamp by casting rules to TimestampType
.
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A timestamp, or null if the input was a string that could not be cast to a timestamp
2.2.0
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC.
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
2.4.0
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC.
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a timestamp, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A string detailing the time zone that the input belongs to, such as Europe/London
,
PST
or GMT+5
A timestamp, or null if ts
was a string that could not be cast to a timestamp or
tz
was an invalid value
1.5.0
Translate any character in the src by a character in replaceString.
Translate any character in the src by a character in replaceString.
The characters in replaceString correspond to the characters in matchingString.
The translate will happen when any character in the string matches the character
in the matchingString
.
1.5.0
Trim the specified character from both ends for the specified string column.
Trim the specified character from both ends for the specified string column.
2.3.0
Trim the spaces from both ends for the specified string column.
Trim the spaces from both ends for the specified string column.
1.5.0
Returns date truncated to the unit specified by the format.
Returns date truncated to the unit specified by the format.
For example, trunc("2018-11-19 12:01:19", "year")
returns 2018-01-01
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date, or null if date
was a string that could not be cast to a date or format
was an invalid value
1.5.0
Creates a Column of literal value.
Creates a Column of literal value.
The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map.
2.2.0
Defines a deterministic user-defined function (UDF) using a Scala closure.
Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant,
the caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
A closure in Scala
The output data type of the UDF
2.0.0
Defines a Java UDF10 instance as user-defined function (UDF).
Defines a Java UDF10 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF9 instance as user-defined function (UDF).
Defines a Java UDF9 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF8 instance as user-defined function (UDF).
Defines a Java UDF8 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF7 instance as user-defined function (UDF).
Defines a Java UDF7 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF6 instance as user-defined function (UDF).
Defines a Java UDF6 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF5 instance as user-defined function (UDF).
Defines a Java UDF5 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF4 instance as user-defined function (UDF).
Defines a Java UDF4 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF3 instance as user-defined function (UDF).
Defines a Java UDF3 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF2 instance as user-defined function (UDF).
Defines a Java UDF2 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF1 instance as user-defined function (UDF).
Defines a Java UDF1 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Java UDF0 instance as user-defined function (UDF).
Defines a Java UDF0 instance as user-defined function (UDF).
The caller must specify the output data type, and there is no automatic input type coercion.
By default the returned UDF is deterministic. To change it to nondeterministic, call the
API UserDefinedFunction.asNondeterministic()
.
2.3.0
Defines a Scala closure of 10 arguments as user-defined function (UDF).
Defines a Scala closure of 10 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 9 arguments as user-defined function (UDF).
Defines a Scala closure of 9 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 8 arguments as user-defined function (UDF).
Defines a Scala closure of 8 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 7 arguments as user-defined function (UDF).
Defines a Scala closure of 7 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 6 arguments as user-defined function (UDF).
Defines a Scala closure of 6 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 5 arguments as user-defined function (UDF).
Defines a Scala closure of 5 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 4 arguments as user-defined function (UDF).
Defines a Scala closure of 4 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 3 arguments as user-defined function (UDF).
Defines a Scala closure of 3 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 2 arguments as user-defined function (UDF).
Defines a Scala closure of 2 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 1 arguments as user-defined function (UDF).
Defines a Scala closure of 1 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Defines a Scala closure of 0 arguments as user-defined function (UDF).
Defines a Scala closure of 0 arguments as user-defined function (UDF).
The data types are automatically inferred based on the Scala closure's
signature. By default the returned UDF is deterministic. To change it to
nondeterministic, call the API UserDefinedFunction.asNondeterministic()
.
1.3.0
Decodes a BASE64 encoded string column and returns it as a binary column.
Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
1.5.0
Inverse of hex.
Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of number.
1.5.0
Converts time string with given pattern to Unix timestamp (in seconds).
Converts time string with given pattern to Unix timestamp (in seconds).
See java.text.SimpleDateFormat for valid date and time format patterns
A date, timestamp or string. If a string, the data must be in a format that can be
cast to a date, such as yyyy-MM-dd
or yyyy-MM-dd HH:mm:ss.SSSS
A date time pattern detailing the format of s
when s
is a string
A long, or null if s
was a string that could not be cast to a date or p
was
an invalid format
1.5.0
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
A date, timestamp or string. If a string, the data must be in the
yyyy-MM-dd HH:mm:ss
format
A long, or null if the input was a string not of the correct format
1.5.0
Returns the current Unix timestamp (in seconds) as a long.
Returns the current Unix timestamp (in seconds) as a long.
1.5.0
All calls of unix_timestamp
within the same query return the same value
(i.e. the current timestamp is calculated at the start of query evaluation).
Converts a string column to upper case.
Converts a string column to upper case.
1.3.0
Aggregate function: returns the population variance of the values in a group.
Aggregate function: returns the population variance of the values in a group.
1.6.0
Aggregate function: returns the population variance of the values in a group.
Aggregate function: returns the population variance of the values in a group.
1.6.0
Aggregate function: returns the unbiased variance of the values in a group.
Aggregate function: returns the unbiased variance of the values in a group.
1.6.0
Aggregate function: returns the unbiased variance of the values in a group.
Aggregate function: returns the unbiased variance of the values in a group.
1.6.0
Aggregate function: alias for var_samp
.
Aggregate function: alias for var_samp
.
1.6.0
Aggregate function: alias for var_samp
.
Aggregate function: alias for var_samp
.
1.6.0
Extracts the week number as an integer from a given date/timestamp/string.
Extracts the week number as an integer from a given date/timestamp/string.
A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601
An integer, or null if the input was a string that could not be cast to a date
1.5.0
Evaluates a list of conditions and returns one of multiple possible result expressions.
Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.
// Example: encoding gender string column into integer. // Scala: people.select(when(people("gender") === "male", 0) .when(people("gender") === "female", 1) .otherwise(2)) // Java: people.select(when(col("gender").equalTo("male"), 0) .when(col("gender").equalTo("female"), 1) .otherwise(2))
1.4.0
Generates tumbling time windows given a timestamp specifying column.
Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"time", "1 minute"), $"stockId") .agg(mean("price"))
The windows will look like:
09:00:00-09:01:00 09:01:00-09:02:00 09:02:00-09:03:00 ...
For a streaming query, you may use the function current_timestamp
to generate windows on
processing time.
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.
A string specifying the width of the window, e.g. 10 minutes
,
1 second
. Check org.apache.spark.unsafe.types.CalendarInterval
for
valid duration identifiers.
2.0.0
Bucketize rows into one or more time windows given a timestamp specifying column.
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"time", "1 minute", "10 seconds"), $"stockId") .agg(mean("price"))
The windows will look like:
09:00:00-09:01:00 09:00:10-09:01:10 09:00:20-09:01:20 ...
For a streaming query, you may use the function current_timestamp
to generate windows on
processing time.
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.
A string specifying the width of the window, e.g. 10 minutes
,
1 second
. Check org.apache.spark.unsafe.types.CalendarInterval
for
valid duration identifiers. Note that the duration is a fixed length of
time, and does not vary over time according to a calendar. For example,
1 day
always means 86,400,000 milliseconds, not a calendar day.
A string specifying the sliding interval of the window, e.g. 1 minute
.
A new window will be generated every slideDuration
. Must be less than
or equal to the windowDuration
. Check
org.apache.spark.unsafe.types.CalendarInterval
for valid duration
identifiers. This duration is likewise absolute, and does not vary
according to a calendar.
2.0.0
Bucketize rows into one or more time windows given a timestamp specifying column.
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The following example takes the average stock price for a one minute window every 10 seconds starting 5 seconds after the hour:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType df.groupBy(window($"time", "1 minute", "10 seconds", "5 seconds"), $"stockId") .agg(mean("price"))
The windows will look like:
09:00:05-09:01:05 09:00:15-09:01:15 09:00:25-09:01:25 ...
For a streaming query, you may use the function current_timestamp
to generate windows on
processing time.
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.
A string specifying the width of the window, e.g. 10 minutes
,
1 second
. Check org.apache.spark.unsafe.types.CalendarInterval
for
valid duration identifiers. Note that the duration is a fixed length of
time, and does not vary over time according to a calendar. For example,
1 day
always means 86,400,000 milliseconds, not a calendar day.
A string specifying the sliding interval of the window, e.g. 1 minute
.
A new window will be generated every slideDuration
. Must be less than
or equal to the windowDuration
. Check
org.apache.spark.unsafe.types.CalendarInterval
for valid duration
identifiers. This duration is likewise absolute, and does not vary
according to a calendar.
The offset with respect to 1970-01-01 00:00:00 UTC with which to start
window intervals. For example, in order to have hourly tumbling windows that
start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide
startTime
as 15 minutes
.
2.0.0
Extracts the year as an integer from a given date/timestamp/string.
Extracts the year as an integer from a given date/timestamp/string.
An integer, or null if the input was a string that could not be cast to a date
1.5.0
(Since version 2.1.0) Use approx_count_distinct
1.3.0
(Since version 2.1.0) Use approx_count_distinct
1.3.0
(Since version 2.1.0) Use approx_count_distinct
1.3.0
(Since version 2.1.0) Use approx_count_distinct
1.3.0
This function has been deprecated in Spark 2.4.
This function has been deprecated in Spark 2.4. See SPARK-25842 for more information.
(Since version 2.4.0) Use Window.currentRow
2.3.0
A column expression that generates monotonically increasing 64-bit integers.
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame
with two partitions, each with 3 records.
This expression would return the following IDs:
0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
(Since version 2.0.0) Use monotonically_increasing_id()
1.4.0
(Since version 2.1.0) Use degrees
1.4.0
(Since version 2.1.0) Use degrees
1.4.0
(Since version 2.1.0) Use radians
1.4.0
(Since version 2.1.0) Use radians
1.4.0
This function has been deprecated in Spark 2.4.
This function has been deprecated in Spark 2.4. See SPARK-25842 for more information.
(Since version 2.4.0) Use Window.unboundedFollowing
2.3.0
This function has been deprecated in Spark 2.4.
This function has been deprecated in Spark 2.4. See SPARK-25842 for more information.
(Since version 2.4.0) Use Window.unboundedPreceding
2.3.0
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the
functions.expr()
API and calling them through a SQL expression string. You can find the entire list of functions at SQL API documentation.As an example,
isnan
is a function that is defined here. You can useisnan(col("myCol"))
to invoke theisnan
function. This way the programming language's compiler ensuresisnan
exists and is of the proper form. You can also useexpr("isnan(myCol)")
function to invoke the same function. In this case, Spark itself will ensureisnan
exists when it analyzes the query.regr_count
is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, useexpr("regr_count(yCol, xCol)")
.1.3.0