Class DataFrameStatFunctions

Object
org.apache.spark.sql.DataFrameStatFunctions

public final class DataFrameStatFunctions extends Object
Statistic functions for DataFrames.

Since:
1.4.0
  • Method Summary

    Modifier and Type
    Method
    Description
    double[][]
    approxQuantile(String[] cols, double[] probabilities, double relativeError)
    Calculates the approximate quantiles of numerical columns of a DataFrame.
    double[]
    approxQuantile(String col, double[] probabilities, double relativeError)
    Calculates the approximate quantiles of a numerical column of a DataFrame.
    bloomFilter(String colName, long expectedNumItems, double fpp)
    Builds a Bloom filter over a specified column.
    bloomFilter(String colName, long expectedNumItems, long numBits)
    Builds a Bloom filter over a specified column.
    bloomFilter(Column col, long expectedNumItems, double fpp)
    Builds a Bloom filter over a specified column.
    bloomFilter(Column col, long expectedNumItems, long numBits)
    Builds a Bloom filter over a specified column.
    double
    corr(String col1, String col2)
    Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
    double
    corr(String col1, String col2, String method)
    Calculates the correlation of two columns of a DataFrame.
    countMinSketch(String colName, double eps, double confidence, int seed)
    Builds a Count-min Sketch over a specified column.
    countMinSketch(String colName, int depth, int width, int seed)
    Builds a Count-min Sketch over a specified column.
    countMinSketch(Column col, double eps, double confidence, int seed)
    Builds a Count-min Sketch over a specified column.
    countMinSketch(Column col, int depth, int width, int seed)
    Builds a Count-min Sketch over a specified column.
    double
    cov(String col1, String col2)
    Calculate the sample covariance of two numerical columns of a DataFrame.
    crosstab(String col1, String col2)
    Computes a pair-wise frequency table of the given columns.
    freqItems(String[] cols)
    Finding frequent items for columns, possibly with false positives.
    freqItems(String[] cols, double support)
    Finding frequent items for columns, possibly with false positives.
    freqItems(scala.collection.Seq<String> cols)
    (Scala-specific) Finding frequent items for columns, possibly with false positives.
    freqItems(scala.collection.Seq<String> cols, double support)
    (Scala-specific) Finding frequent items for columns, possibly with false positives.
    <T> Dataset<Row>
    sampleBy(String col, Map<T,Double> fractions, long seed)
    Returns a stratified sample without replacement based on the fraction given on each stratum.
    <T> Dataset<Row>
    sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed)
    Returns a stratified sample without replacement based on the fraction given on each stratum.
    <T> Dataset<Row>
    sampleBy(Column col, Map<T,Double> fractions, long seed)
    (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
    <T> Dataset<Row>
    sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed)
    Returns a stratified sample without replacement based on the fraction given on each stratum.

    Methods inherited from class java.lang.Object

    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • approxQuantile

      public double[] approxQuantile(String col, double[] probabilities, double relativeError)
      Calculates the approximate quantiles of a numerical column of a DataFrame.

      The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,

      
         floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
       

      This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first present in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.

      Parameters:
      col - the name of the numerical column
      probabilities - a list of quantile probabilities Each number must belong to [0, 1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
      relativeError - The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
      Returns:
      the approximate quantiles at the given probabilities

      Since:
      2.0.0
      Note:
      null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned.

    • approxQuantile

      public double[][] approxQuantile(String[] cols, double[] probabilities, double relativeError)
      Calculates the approximate quantiles of numerical columns of a DataFrame.
      Parameters:
      cols - the names of the numerical columns
      probabilities - a list of quantile probabilities Each number must belong to [0, 1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
      relativeError - The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
      Returns:
      the approximate quantiles at the given probabilities of each column

      Since:
      2.2.0
      See Also:
      • approxQuantile(col:Str* approxQuantile) for detailed description.

      Note:
      null and NaN values will be ignored in numerical columns before calculation. For columns only containing null or NaN values, an empty array is returned.

    • bloomFilter

      public BloomFilter bloomFilter(String colName, long expectedNumItems, double fpp)
      Builds a Bloom filter over a specified column.

      Parameters:
      colName - name of the column over which the filter is built
      expectedNumItems - expected number of items which will be put into the filter.
      fpp - expected false positive probability of the filter.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • bloomFilter

      public BloomFilter bloomFilter(Column col, long expectedNumItems, double fpp)
      Builds a Bloom filter over a specified column.

      Parameters:
      col - the column over which the filter is built
      expectedNumItems - expected number of items which will be put into the filter.
      fpp - expected false positive probability of the filter.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • bloomFilter

      public BloomFilter bloomFilter(String colName, long expectedNumItems, long numBits)
      Builds a Bloom filter over a specified column.

      Parameters:
      colName - name of the column over which the filter is built
      expectedNumItems - expected number of items which will be put into the filter.
      numBits - expected number of bits of the filter.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • bloomFilter

      public BloomFilter bloomFilter(Column col, long expectedNumItems, long numBits)
      Builds a Bloom filter over a specified column.

      Parameters:
      col - the column over which the filter is built
      expectedNumItems - expected number of items which will be put into the filter.
      numBits - expected number of bits of the filter.
      Returns:
      (undocumented)
      Since:
      2.0.0
    • corr

      public double corr(String col1, String col2, String method)
      Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.

      Parameters:
      col1 - the name of the column
      col2 - the name of the column to calculate the correlation against
      method - (undocumented)
      Returns:
      The Pearson Correlation Coefficient as a Double.

      
          val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
            .withColumn("rand2", rand(seed=27))
          df.stat.corr("rand1", "rand2")
          res1: Double = 0.613...
       

      Since:
      1.4.0
    • corr

      public double corr(String col1, String col2)
      Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.

      Parameters:
      col1 - the name of the column
      col2 - the name of the column to calculate the correlation against
      Returns:
      The Pearson Correlation Coefficient as a Double.

      
          val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
            .withColumn("rand2", rand(seed=27))
          df.stat.corr("rand1", "rand2", "pearson")
          res1: Double = 0.613...
       

      Since:
      1.4.0
    • countMinSketch

      public CountMinSketch countMinSketch(String colName, int depth, int width, int seed)
      Builds a Count-min Sketch over a specified column.

      Parameters:
      colName - name of the column over which the sketch is built
      depth - depth of the sketch
      width - width of the sketch
      seed - random seed
      Returns:
      a CountMinSketch over column colName
      Since:
      2.0.0
    • countMinSketch

      public CountMinSketch countMinSketch(String colName, double eps, double confidence, int seed)
      Builds a Count-min Sketch over a specified column.

      Parameters:
      colName - name of the column over which the sketch is built
      eps - relative error of the sketch
      confidence - confidence of the sketch
      seed - random seed
      Returns:
      a CountMinSketch over column colName
      Since:
      2.0.0
    • countMinSketch

      public CountMinSketch countMinSketch(Column col, int depth, int width, int seed)
      Builds a Count-min Sketch over a specified column.

      Parameters:
      col - the column over which the sketch is built
      depth - depth of the sketch
      width - width of the sketch
      seed - random seed
      Returns:
      a CountMinSketch over column colName
      Since:
      2.0.0
    • countMinSketch

      public CountMinSketch countMinSketch(Column col, double eps, double confidence, int seed)
      Builds a Count-min Sketch over a specified column.

      Parameters:
      col - the column over which the sketch is built
      eps - relative error of the sketch
      confidence - confidence of the sketch
      seed - random seed
      Returns:
      a CountMinSketch over column colName
      Since:
      2.0.0
    • cov

      public double cov(String col1, String col2)
      Calculate the sample covariance of two numerical columns of a DataFrame.
      Parameters:
      col1 - the name of the first column
      col2 - the name of the second column
      Returns:
      the covariance of the two columns.

      
          val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
            .withColumn("rand2", rand(seed=27))
          df.stat.cov("rand1", "rand2")
          res1: Double = 0.065...
       

      Since:
      1.4.0
    • crosstab

      public Dataset<Row> crosstab(String col1, String col2)
      Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.

      Parameters:
      col1 - The name of the first column. Distinct items will make the first item of each row.
      col2 - The name of the second column. Distinct items will make the column names of the DataFrame.
      Returns:
      A DataFrame containing for the contingency table.

      
          val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
            .toDF("key", "value")
          val ct = df.stat.crosstab("key", "value")
          ct.show()
          +---------+---+---+---+
          |key_value|  1|  2|  3|
          +---------+---+---+---+
          |        2|  2|  0|  1|
          |        1|  1|  1|  0|
          |        3|  0|  1|  1|
          +---------+---+---+---+
       

      Since:
      1.4.0
    • freqItems

      public Dataset<Row> freqItems(String[] cols, double support)
      Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in here, proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

      Parameters:
      cols - the names of the columns to search frequent items in.
      support - The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
      Returns:
      A Local DataFrame with the Array of frequent items for each column.

      
          val rows = Seq.tabulate(100) { i =>
            if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
          }
          val df = spark.createDataFrame(rows).toDF("a", "b")
          // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
          // "a" and "b"
          val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
          freqSingles.show()
          +-----------+-------------+
          |a_freqItems|  b_freqItems|
          +-----------+-------------+
          |    [1, 99]|[-1.0, -99.0]|
          +-----------+-------------+
          // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
          val pairDf = df.select(struct("a", "b").as("a-b"))
          val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
          freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
          +----------+
          |   freq_ab|
          +----------+
          |  [1,-1.0]|
          |   ...    |
          +----------+
       

      Since:
      1.4.0
    • freqItems

      public Dataset<Row> freqItems(String[] cols)
      Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in here, proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

      Parameters:
      cols - the names of the columns to search frequent items in.
      Returns:
      A Local DataFrame with the Array of frequent items for each column.

      Since:
      1.4.0
    • freqItems

      public Dataset<Row> freqItems(scala.collection.Seq<String> cols, double support)
      (Scala-specific) Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in here, proposed by Karp, Schenker, and Papadimitriou.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

      Parameters:
      cols - the names of the columns to search frequent items in.
      support - (undocumented)
      Returns:
      A Local DataFrame with the Array of frequent items for each column.

      
          val rows = Seq.tabulate(100) { i =>
            if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
          }
          val df = spark.createDataFrame(rows).toDF("a", "b")
          // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
          // "a" and "b"
          val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
          freqSingles.show()
          +-----------+-------------+
          |a_freqItems|  b_freqItems|
          +-----------+-------------+
          |    [1, 99]|[-1.0, -99.0]|
          +-----------+-------------+
          // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
          val pairDf = df.select(struct("a", "b").as("a-b"))
          val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
          freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
          +----------+
          |   freq_ab|
          +----------+
          |  [1,-1.0]|
          |   ...    |
          +----------+
       

      Since:
      1.4.0
    • freqItems

      public Dataset<Row> freqItems(scala.collection.Seq<String> cols)
      (Scala-specific) Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in here, proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.

      This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

      Parameters:
      cols - the names of the columns to search frequent items in.
      Returns:
      A Local DataFrame with the Array of frequent items for each column.

      Since:
      1.4.0
    • sampleBy

      public <T> Dataset<Row> sampleBy(String col, scala.collection.immutable.Map<T,Object> fractions, long seed)
      Returns a stratified sample without replacement based on the fraction given on each stratum.
      Parameters:
      col - column that defines strata
      fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
      seed - random seed
      Returns:
      a new DataFrame that represents the stratified sample

      
          val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
            (3, 3))).toDF("key", "value")
          val fractions = Map(1 -> 1.0, 3 -> 0.5)
          df.stat.sampleBy("key", fractions, 36L).show()
          +---+-----+
          |key|value|
          +---+-----+
          |  1|    1|
          |  1|    2|
          |  3|    2|
          +---+-----+
       

      Since:
      1.5.0
    • sampleBy

      public <T> Dataset<Row> sampleBy(String col, Map<T,Double> fractions, long seed)
      Returns a stratified sample without replacement based on the fraction given on each stratum.
      Parameters:
      col - column that defines strata
      fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
      seed - random seed
      Returns:
      a new DataFrame that represents the stratified sample

      Since:
      1.5.0
    • sampleBy

      public <T> Dataset<Row> sampleBy(Column col, scala.collection.immutable.Map<T,Object> fractions, long seed)
      Returns a stratified sample without replacement based on the fraction given on each stratum.
      Parameters:
      col - column that defines strata
      fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
      seed - random seed
      Returns:
      a new DataFrame that represents the stratified sample

      The stratified sample can be performed over multiple columns:

      
          import org.apache.spark.sql.Row
          import org.apache.spark.sql.functions.struct
      
          val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
            ("Alice", 10))).toDF("name", "age")
          val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
          df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
          +-----+---+
          | name|age|
          +-----+---+
          | Nico|  8|
          |Alice| 10|
          +-----+---+
       

      Since:
      3.0.0
    • sampleBy

      public <T> Dataset<Row> sampleBy(Column col, Map<T,Double> fractions, long seed)
      (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
      Parameters:
      col - column that defines strata
      fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
      seed - random seed
      Returns:
      a new DataFrame that represents the stratified sample

      Since:
      3.0.0