Class ChiSquareTest

Object
org.apache.spark.ml.stat.ChiSquareTest

public class ChiSquareTest extends Object
Chi-square hypothesis testing for categorical data.

See Wikipedia for more information on the Chi-squared test.

  • Constructor Details

    • ChiSquareTest

      public ChiSquareTest()
  • Method Details

    • test

      public static Dataset<Row> test(Dataset<Row> dataset, String featuresCol, String labelCol)
      Conduct Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.

      The null hypothesis, tested separately for each feature, is that the feature and the label are statistically independent.
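      The contingency-matrix step described above can be sketched for a single feature in plain Java. This is an illustrative sketch only, not Spark's internal implementation: the class ChiSquareSketch and all of its methods are hypothetical names introduced here. It assumes non-empty input and omits the p-value, which requires a chi-squared distribution function that the JDK does not provide.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Spark): Pearson's chi-squared statistic for one feature. */
final class ChiSquareSketch {

    /** Maps each distinct value to a dense index (0, 1, 2, ...) in sorted order. */
    static Map<Double, Integer> indexDistinct(double[] values) {
        Map<Double, Integer> index = new TreeMap<>();
        for (double v : values) {
            index.put(v, 0);
        }
        int next = 0;
        for (Map.Entry<Double, Integer> e : index.entrySet()) {
            e.setValue(next++);
        }
        return index;
    }

    /** Builds the contingency matrix: rows are feature categories, columns are label categories. */
    static long[][] contingency(double[] feature, double[] label) {
        if (feature.length == 0 || feature.length != label.length) {
            throw new IllegalArgumentException("feature and label must be non-empty and equal in length");
        }
        Map<Double, Integer> rows = indexDistinct(feature);
        Map<Double, Integer> cols = indexDistinct(label);
        long[][] counts = new long[rows.size()][cols.size()];
        for (int i = 0; i < feature.length; i++) {
            counts[rows.get(feature[i])][cols.get(label[i])]++;
        }
        return counts;
    }

    /** Sums (observed - expected)^2 / expected over all cells; assumes every row and column sum is positive. */
    static double statistic(long[][] observed) {
        int r = observed.length;
        int c = observed[0].length;
        double[] rowSum = new double[r];
        double[] colSum = new double[c];
        double total = 0.0;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double chi2 = 0.0;
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                // Expected count under the null hypothesis of independence.
                double expected = rowSum[i] * colSum[j] / total;
                double diff = observed[i][j] - expected;
                chi2 += diff * diff / expected;
            }
        }
        return chi2;
    }

    /** Degrees of freedom: (feature categories - 1) * (label categories - 1). */
    static int degreesOfFreedom(long[][] observed) {
        return (observed.length - 1) * (observed[0].length - 1);
    }

    public static void main(String[] args) {
        double[] feature = {0, 0, 0, 0, 1, 1, 1, 1};
        double[] label   = {0, 0, 0, 1, 0, 1, 1, 1};
        long[][] table = contingency(feature, label);
        // Prints: statistic=2.0, degreesOfFreedom=1
        System.out.println("statistic=" + statistic(table)
                + ", degreesOfFreedom=" + degreesOfFreedom(table));
    }
}
```

      In this made-up sample, the matrix is [[3, 1], [1, 3]] and every expected count is 2, so each of the four cells contributes 0.5 to the statistic. A larger statistic for a given number of degrees of freedom is stronger evidence against independence.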

      Parameters:
      dataset - DataFrame of categorical labels and categorical features. Real-valued features are treated as categorical, with each distinct value forming its own category.
      featuresCol - Name of features column in dataset, of type Vector (VectorUDT)
      labelCol - Name of label column in dataset, of any numerical type
      Returns:
      DataFrame containing the test result for every feature against the label. This DataFrame will contain a single Row with the following fields:
      - pValues: Vector
      - degreesOfFreedom: Array[Int]
      - statistics: Vector
      Each of these fields has one value per feature.
    • test

      public static Dataset<Row> test(Dataset<Row> dataset, String featuresCol, String labelCol, boolean flatten)
      Parameters:
      dataset - DataFrame of categorical labels and categorical features. Real-valued features are treated as categorical, with each distinct value forming its own category.
      featuresCol - Name of features column in dataset, of type Vector (VectorUDT)
      labelCol - Name of label column in dataset, of any numerical type
      flatten - If false, the returned DataFrame contains a single Row; otherwise, it contains one Row per feature.
      Returns:
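      DataFrame containing the test result for every feature against the label: a single Row if flatten is false, otherwise one Row per feature.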