final class DataFrameStatFunctions extends AnyRef
Statistic functions for DataFrames.
- Annotations
 - @Stable()
 - Source
 - DataFrameStatFunctions.scala
 - Since
 1.4.0
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
- final def ##(): Int
- Definition Classes
 - AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
- def approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]
Calculates the approximate quantiles of numerical columns of a DataFrame.
- cols
 the names of the numerical columns
- probabilities
 a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
- relativeError
 the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
- returns
 the approximate quantiles at the given probabilities of each column
- Since
 2.2.0
- Note
 null and NaN values will be ignored in numerical columns before calculation. For columns only containing null or NaN values, an empty array is returned.
- See also
 approxQuantile(col: String, probabilities: Array[Double], relativeError: Double) for a detailed description.
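A minimal usage sketch (the DataFrame and its column names are hypothetical; assumes a SparkSession named spark is in scope):

    val df = spark.createDataFrame(Seq((23, 50000.0), (41, 72000.0), (35, 61000.0), (29, 43000.0)))
      .toDF("age", "income")
    // Quartiles of each column, within 5% relative error
    val q = df.stat.approxQuantile(Array("age", "income"), Array(0.25, 0.5, 0.75), 0.05)
    // q(0) holds the quartiles of "age"; q(1) those of "income"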
- def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]
Calculates the approximate quantiles of a numerical column of a DataFrame.
The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely:

    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)

This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna.
- col
 the name of the numerical column
- probabilities
 a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
- relativeError
 the relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
- returns
 the approximate quantiles at the given probabilities
- Since
 2.0.0
- Note
 null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned.
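A minimal usage sketch (hypothetical data; assumes a SparkSession named spark is in scope):

    val df = spark.range(0, 1000).toDF("value")
    // Median and 95th percentile, allowing 1% relative error
    val Array(median, p95) = df.stat.approxQuantile("value", Array(0.5, 0.95), 0.01)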
- final def asInstanceOf[T0]: T0
- Definition Classes
 - Any
- def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
- col
 the column over which the filter is built
- expectedNumItems
 expected number of items which will be put into the filter
- numBits
 expected number of bits of the filter
- Since
 2.0.0
- def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
- colName
 name of the column over which the filter is built
- expectedNumItems
 expected number of items which will be put into the filter
- numBits
 expected number of bits of the filter
- Since
 2.0.0
- def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
- col
 the column over which the filter is built
- expectedNumItems
 expected number of items which will be put into the filter
- fpp
 expected false positive probability of the filter
- Since
 2.0.0
- def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
- colName
 name of the column over which the filter is built
- expectedNumItems
 expected number of items which will be put into the filter
- fpp
 expected false positive probability of the filter
- Since
 2.0.0
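A minimal usage sketch covering both sizing styles (hypothetical data; assumes a SparkSession named spark is in scope):

    val df = spark.range(0, 1000000).toDF("id")
    // Size the filter by an explicit bit budget...
    val byBits = df.stat.bloomFilter("id", expectedNumItems = 1000000L, numBits = 8388608L)
    // ...or by a target false positive probability
    val byFpp = df.stat.bloomFilter("id", expectedNumItems = 1000000L, fpp = 0.03)
    byFpp.mightContain(42L)  // true: a Bloom filter has no false negatives
    byFpp.mightContain(-1L)  // usually false; true with probability roughly fpp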
- def clone(): AnyRef
- Attributes
 - protected[lang]
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... ) @native()
- def corr(col1: String, col2: String): Double
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
- col1
 the name of the column
- col2
 the name of the column to calculate the correlation against
- returns
 The Pearson Correlation Coefficient as a Double.

    val df = sc.parallelize(0 until 10).toDF("id")
      .withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2")
    res1: Double = 0.613...

- Since
 1.4.0
- def corr(col1: String, col2: String, method: String): Double
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
- col1
 the name of the column
- col2
 the name of the column to calculate the correlation against
- method
 the correlation method; currently only "pearson" is supported
- returns
 The Pearson Correlation Coefficient as a Double.

    val df = sc.parallelize(0 until 10).toDF("id")
      .withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2", "pearson")
    res1: Double = 0.613...

- Since
 1.4.0
- def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
- col
 the column over which the sketch is built
- eps
 relative error of the sketch
- confidence
 confidence of the sketch
- seed
 random seed
- returns
 a CountMinSketch over the given column
- Since
 2.0.0
- def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
- col
 the column over which the sketch is built
- depth
 depth of the sketch
- width
 width of the sketch
- seed
 random seed
- returns
 a CountMinSketch over the given column
- Since
 2.0.0
- def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
- colName
 name of the column over which the sketch is built
- eps
 relative error of the sketch
- confidence
 confidence of the sketch
- seed
 random seed
- returns
 a CountMinSketch over column colName
- Since
 2.0.0
- def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
- colName
 name of the column over which the sketch is built
- depth
 depth of the sketch
- width
 width of the sketch
- seed
 random seed
- returns
 a CountMinSketch over column colName
- Since
 2.0.0
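A minimal usage sketch (hypothetical data; assumes a SparkSession named spark is in scope):

    val df = spark.range(0, 10000).toDF("id")
    // Relative error 1% with 99% confidence; depth and width are derived from eps and confidence
    val cms = df.stat.countMinSketch("id", eps = 0.01, confidence = 0.99, seed = 42)
    cms.estimateCount(7L)  // approximate count of rows with id = 7; may overestimate, never undercounts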
- def cov(col1: String, col2: String): Double
Calculate the sample covariance of two numerical columns of a DataFrame.
- col1
 the name of the first column
- col2
 the name of the second column
- returns
 the covariance of the two columns.

    val df = sc.parallelize(0 until 10).toDF("id")
      .withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.cov("rand1", "rand2")
    res1: Double = 0.065...

- Since
 1.4.0
- def crosstab(col1: String, col2: String): DataFrame
Computes a pair-wise frequency table of the given columns, also known as a contingency table. The first column of each row will be the distinct values of col1, and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
- col1
 The name of the first column. Distinct items will make the first item of each row.
- col2
 The name of the second column. Distinct items will make the column names of the DataFrame.
- returns
 A DataFrame containing the contingency table.

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
      .toDF("key", "value")
    val ct = df.stat.crosstab("key", "value")
    ct.show()
    +---------+---+---+---+
    |key_value|  1|  2|  3|
    +---------+---+---+---+
    |        2|  2|  0|  1|
    |        1|  1|  1|  0|
    |        3|  0|  1|  1|
    +---------+---+---+---+

- Since
 1.4.0
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef
- def equals(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
- def finalize(): Unit
- Attributes
 - protected[lang]
- Definition Classes
 - AnyRef
- Annotations
 - @throws( classOf[java.lang.Throwable] )
- def freqItems(cols: Seq[String]): DataFrame
(Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- cols
 the names of the columns to search frequent items in
- returns
 A Local DataFrame with the Array of frequent items for each column.
- Since
 1.4.0
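A minimal usage sketch with the default support (hypothetical data; assumes a SparkSession named spark is in scope):

    val df = spark.createDataFrame(Seq((1, "a"), (1, "b"), (1, "a"), (2, "c"))).toDF("num", "str")
    // Items whose frequency exceeds the 1% default support, per column
    df.stat.freqItems(Seq("num", "str")).show()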
- def freqItems(cols: Seq[String], support: Double): DataFrame
(Scala-specific) Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- cols
 the names of the columns to search frequent items in
- support
 the minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
- returns
 A Local DataFrame with the Array of frequent items for each column.

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    | [1,-1.0]|
    |   ...    |
    +----------+

- Since
 1.4.0
- def freqItems(cols: Array[String]): DataFrame
Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- cols
 the names of the columns to search frequent items in
- returns
 A Local DataFrame with the Array of frequent items for each column.
- Since
 1.4.0
- def freqItems(cols: Array[String], support: Double): DataFrame
Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
- cols
 the names of the columns to search frequent items in
- support
 the minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
- returns
 A Local DataFrame with the Array of frequent items for each column.

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    | [1,-1.0]|
    |   ...    |
    +----------+

- Since
 1.4.0
- final def getClass(): Class[_]
- Definition Classes
 - AnyRef → Any
- Annotations
 - @native()
- def hashCode(): Int
- Definition Classes
 - AnyRef → Any
- Annotations
 - @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
 - Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef
- final def notify(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @native()
- final def notifyAll(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @native()
- def sampleBy[T](col: Column, fractions: java.util.Map[T, Double], seed: Long): DataFrame
(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
- T
 stratum type
- col
 column that defines strata
- fractions
 sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
- seed
 random seed
- returns
 a new DataFrame that represents the stratified sample
- Since
 3.0.0
- def sampleBy[T](col: Column, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
- T
 stratum type
- col
 column that defines strata
- fractions
 sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
- seed
 random seed
- returns
 a new DataFrame that represents the stratified sample. The stratified sample can be performed over multiple columns:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.struct
    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
      ("Alice", 10))).toDF("name", "age")
    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+

- Since
 3.0.0
- def sampleBy[T](col: String, fractions: java.util.Map[T, Double], seed: Long): DataFrame
(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
- T
 stratum type
- col
 column that defines strata
- fractions
 sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
- seed
 random seed
- returns
 a new DataFrame that represents the stratified sample
- Since
 1.5.0
- def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
- T
 stratum type
- col
 column that defines strata
- fractions
 sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
- seed
 random seed
- returns
 a new DataFrame that represents the stratified sample

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
      (3, 3))).toDF("key", "value")
    val fractions = Map(1 -> 1.0, 3 -> 0.5)
    df.stat.sampleBy("key", fractions, 36L).show()
    +---+-----+
    |key|value|
    +---+-----+
    |  1|    1|
    |  1|    2|
    |  3|    2|
    +---+-----+

- Since
 1.5.0
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
 - AnyRef
- def toString(): String
- Definition Classes
 - AnyRef → Any
- final def wait(): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
 - AnyRef
- Annotations
 - @throws( ... ) @native()