object Statistics
API for statistical functions in MLlib.
- Annotations
 - @Since( "1.1.0" )
 - Source
 - Statistics.scala
 
Value Members
final def !=(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
 
final def ##(): Int
- Definition Classes
 - AnyRef → Any
 
final def ==(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
 
final def asInstanceOf[T0]: T0
- Definition Classes
 - Any
 
def chiSqTest(data: JavaRDD[LabeledPoint]): Array[ChiSqTestResult]
Java-friendly version of chiSqTest().
- Annotations
 - @Since( "1.5.0" )
 
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult]
Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.
- data
 an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.
- returns
 an array containing the ChiSqTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.
- Annotations
 - @Since( "1.1.0" )
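
The conversion of (feature, label) pairs into a contingency matrix can be sketched in plain Scala. This is an illustration only, not MLlib's implementation: the helper name contingencyTable is hypothetical, and the orientation (rows are distinct feature values, columns are distinct labels, both in ascending order) is an assumption of the sketch.

```scala
// Plain-Scala sketch: count how often each (feature value, label) pair occurs.
// Hypothetical helper, not part of the Statistics API.
def contingencyTable(pairs: Seq[(Double, Double)]): Array[Array[Double]] = {
  // Every distinct value is treated as its own category.
  val featureValues = pairs.map(_._1).distinct.sorted
  val labelValues = pairs.map(_._2).distinct.sorted
  val counts = pairs.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
  featureValues.map { f =>
    // Pairs that never occur contribute a count of 0.
    labelValues.map(l => counts.getOrElse((f, l), 0.0)).toArray
  }.toArray
}

// Four (feature, label) observations for a single feature.
val table = contingencyTable(Seq((0.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)))
```

A real-valued feature with many distinct values would produce one row per value here, which is why the inputs are expected to be categorical.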
 
def chiSqTest(observed: Matrix): ChiSqTestResult
Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum to 0.
- observed
 The contingency matrix (containing either counts or relative frequencies).
- returns
 ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
- Annotations
 - @Since( "1.1.0" )
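
To make the computation concrete, the following plain-Scala sketch derives the chi-squared statistic and degrees of freedom from a contingency matrix. It is not MLlib's implementation: the function name independenceStatistic and the use of nested arrays instead of Matrix are assumptions, and the p-value is left out because it needs a chi-squared distribution function that the Scala standard library does not provide.

```scala
// Plain-Scala sketch of Pearson's chi-squared independence statistic.
// Hypothetical helper, not part of the Statistics API.
def independenceStatistic(observed: Array[Array[Double]]): (Double, Int) = {
  val numRows = observed.length
  val numCols = observed.head.length
  val rowSums = observed.map(_.sum)
  val colSums = (0 until numCols).map(j => observed.map(_(j)).sum)
  val total = rowSums.sum
  require(observed.forall(_.forall(_ >= 0.0)), "entries must be non-negative")
  require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0),
    "no row or column may sum to 0")

  val statistic = (for {
    i <- 0 until numRows
    j <- 0 until numCols
  } yield {
    // Expected cell value under independence of rows and columns.
    val expected = rowSums(i) * colSums(j) / total
    val diff = observed(i)(j) - expected
    diff * diff / expected
  }).sum

  // Degrees of freedom for an r x c table: (r - 1) * (c - 1).
  (statistic, (numRows - 1) * (numCols - 1))
}

val independence = independenceStatistic(Array(Array(10.0, 20.0), Array(20.0, 10.0)))
```

In this 2 x 2 example every expected cell is 15, so each cell contributes 25 / 15 to the statistic.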
 
def chiSqTest(observed: Vector): ChiSqTestResult
Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
- observed
 Vector containing the observed categorical counts/relative frequencies.
- returns
 ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
- Annotations
 - @Since( "1.1.0" )
 - Note
 observed cannot contain negative values.

def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult
Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
- observed
 Vector containing the observed categorical counts/relative frequencies.
- expected
 Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
- returns
 ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
- Annotations
 - @Since( "1.1.0" )
 - Note
 The two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
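
The rescaling of expected and the resulting statistic can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not MLlib's code: goodnessOfFitStatistic is a hypothetical name, plain arrays stand in for Vector, and the p-value is omitted.

```scala
// Plain-Scala sketch of the chi-squared goodness of fit statistic.
// Hypothetical helper, not part of the Statistics API.
def goodnessOfFitStatistic(observed: Array[Double], expected: Array[Double]): (Double, Int) = {
  require(observed.length == expected.length, "vectors must have the same size")
  require(observed.forall(_ >= 0.0), "observed cannot contain negative values")
  require(expected.forall(_ > 0.0), "expected cannot contain nonpositive values")

  // Rescale expected so that it sums to the same total as observed.
  val scale = observed.sum / expected.sum
  val statistic = observed.zip(expected).map { case (o, e) =>
    val scaled = e * scale
    (o - scaled) * (o - scaled) / scaled
  }.sum

  // Degrees of freedom: number of categories minus one.
  (statistic, observed.length - 1)
}

// Equal expected weights reproduce the test against the uniform distribution.
val gof = goodnessOfFitStatistic(Array(10.0, 20.0, 30.0), Array(1.0, 1.0, 1.0))
```

Here the expected weights sum to 3 while the observations sum to 60, so each expected value is rescaled to 20 before the comparison.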

def clone(): AnyRef
- Attributes
 - protected[lang]
 - Definition Classes
 - AnyRef
 - Annotations
 - @throws( ... ) @native()
 
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary
Computes column-wise summary statistics for the input RDD[Vector].
- X
 an RDD[Vector] for which column-wise summary statistics are to be computed.
- returns
 MultivariateStatisticalSummary object containing column-wise summary statistics.
- Annotations
 - @Since( "1.1.0" )
 
def corr(x: JavaRDD[Double], y: JavaRDD[Double], method: String): Double
Java-friendly version of corr().
- Annotations
 - @Since( "1.4.1" )
 
def corr(x: RDD[Double], y: RDD[Double], method: String): Double
Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.
- x
 RDD[Double] of the same cardinality as y.
- y
 RDD[Double] of the same cardinality as x.
- method
 String specifying the method to use for computing correlation. Supported: pearson (default), spearman
- returns
 A Double containing the correlation between the two input RDD[Double]s using the specified method.
- Annotations
 - @Since( "1.1.0" )
 - Note
 The two input RDDs need to have the same number of partitions and the same number of elements in each partition.

def corr(x: JavaRDD[Double], y: JavaRDD[Double]): Double
Java-friendly version of corr().
- Annotations
 - @Since( "1.4.1" )
 
def corr(x: RDD[Double], y: RDD[Double]): Double
Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.
- x
 RDD[Double] of the same cardinality as y.
- y
 RDD[Double] of the same cardinality as x.
- returns
 A Double containing the Pearson correlation between the two input RDD[Double]s
- Annotations
 - @Since( "1.1.0" )
 - Note
 The two input RDDs need to have the same number of partitions and the same number of elements in each partition.
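
The NaN result for zero-variance input follows directly from the Pearson formula, as this plain-Scala sketch on local sequences shows. It is not MLlib's distributed implementation, and the name pearson is hypothetical.

```scala
// Plain-Scala sketch of the Pearson correlation coefficient.
// Hypothetical helper, not part of the Statistics API.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length, "inputs must have the same cardinality")
  val n = x.length
  val meanX = x.sum / n
  val meanY = y.sum / n
  // Sums of centered products; the common 1/n factors cancel in the ratio.
  val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  val varX = x.map(a => (a - meanX) * (a - meanX)).sum
  val varY = y.map(b => (b - meanY) * (b - meanY)).sum
  // With zero variance this evaluates 0.0 / 0.0, which is NaN in IEEE 754 arithmetic.
  cov / math.sqrt(varX * varY)
}

val perfect = pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
val degenerate = pearson(Seq(1.0, 1.0, 1.0), Seq(2.0, 4.0, 6.0))
```

The first call uses perfectly linear data; the second has a constant x, so both the numerator and the denominator are zero.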

def corr(X: RDD[Vector], method: String): Matrix
Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
- X
 an RDD[Vector] for which the correlation matrix is to be computed.
- method
 String specifying the method to use for computing correlation. Supported: pearson (default), spearman
- returns
 Correlation matrix comparing columns in X.
- Annotations
 - @Since( "1.1.0" )
 - Note
 For Spearman, a rank correlation, we need to create an RDD[Double] for each column and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
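
The rank-then-correlate idea behind Spearman can be sketched on two local columns in plain Scala. This is only an illustration of why sorting is needed, not MLlib's RDD-based code: the names ranks, pearsonCorrelation, and spearman are hypothetical, and giving tied values the average of their positions is an assumption of this sketch.

```scala
// Plain-Scala sketch of Spearman correlation: rank each column, then apply Pearson.
// Hypothetical helpers, not part of the Statistics API.
def ranks(values: Seq[Double]): Seq[Double] = {
  // Sorting is the costly step: it is needed once per column.
  val sorted = values.zipWithIndex.sortBy(_._1)
  val result = Array.ofDim[Double](values.length)
  var i = 0
  while (i < sorted.length) {
    // Find the run i..j of tied values.
    var j = i
    while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
    // 1-based average rank over positions i..j (assumed tie handling).
    val averageRank = (i + j) / 2.0 + 1.0
    (i to j).foreach(k => result(sorted(k)._2) = averageRank)
    i = j + 1
  }
  result.toSeq
}

def pearsonCorrelation(x: Seq[Double], y: Seq[Double]): Double = {
  val meanX = x.sum / x.length
  val meanY = y.sum / y.length
  val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
  val varX = x.map(a => (a - meanX) * (a - meanX)).sum
  val varY = y.map(b => (b - meanY) * (b - meanY)).sum
  cov / math.sqrt(varX * varY)
}

def spearman(x: Seq[Double], y: Seq[Double]): Double =
  pearsonCorrelation(ranks(x), ranks(y))

val tiedRanks = ranks(Seq(10.0, 20.0, 20.0, 30.0))
// y is a monotone (cubic) function of x, so the ranks agree exactly.
val monotone = spearman(Seq(1.0, 2.0, 3.0, 4.0), Seq(1.0, 8.0, 27.0, 64.0))
```

Because every column must be sorted separately, an uncached distributed input would be recomputed for each column, which is the cost the note above warns about.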

def corr(X: RDD[Vector]): Matrix
Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with 0 covariance produce NaN entries in the correlation matrix.
- X
 an RDD[Vector] for which the correlation matrix is to be computed.
- returns
 Pearson correlation matrix comparing columns in X.
- Annotations
 - @Since( "1.1.0" )
 
final def eq(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef
 
def equals(arg0: Any): Boolean
- Definition Classes
 - AnyRef → Any
 
def finalize(): Unit
- Attributes
 - protected[lang]
 - Definition Classes
 - AnyRef
 - Annotations
 - @throws( classOf[java.lang.Throwable] )
 
final def getClass(): Class[_]
- Definition Classes
 - AnyRef → Any
 - Annotations
 - @native()
 
def hashCode(): Int
- Definition Classes
 - AnyRef → Any
 - Annotations
 - @native()
 
final def isInstanceOf[T0]: Boolean
- Definition Classes
 - Any
 
def kolmogorovSmirnovTest(data: JavaDoubleRDD, distName: String, params: Double*): KolmogorovSmirnovTestResult
Java-friendly version of kolmogorovSmirnovTest().
- Annotations
 - @Since( "1.5.0" ) @varargs()
 
def kolmogorovSmirnovTest(data: RDD[Double], distName: String, params: Double*): KolmogorovSmirnovTestResult
Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution (distName = "norm"), taking as parameters the mean and standard deviation.
- data
 an RDD[Double] containing the sample of data to test
- distName
 a String name for a theoretical distribution
- params
 Double* specifying the parameters to be used for the theoretical distribution
- returns
 org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
- Annotations
 - @Since( "1.5.0" ) @varargs()
 
def kolmogorovSmirnovTest(data: RDD[Double], cdf: (Double) ⇒ Double): KolmogorovSmirnovTestResult
Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. By comparing the largest difference between the empirical cumulative distribution of the sample data and the theoretical distribution, we can provide a test for the null hypothesis that the sample data comes from that theoretical distribution.
- data
 an RDD[Double] containing the sample of data to test
- cdf
 a Double => Double function to calculate the theoretical CDF at a given value
- returns
 org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult object containing test statistic, p-value, and null hypothesis.
- Annotations
 - @Since( "1.5.0" )
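
The "largest difference" that the test is built on can be sketched in plain Scala for a local sample. This is not MLlib's distributed implementation: ksStatistic and uniformCdf are hypothetical names, the uniform distribution on [0, 1] is chosen only for illustration, and the p-value computation is omitted.

```scala
// Plain-Scala sketch of the two-sided KS statistic for a local sample.
// Hypothetical helper, not part of the Statistics API.
def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
  require(sample.nonEmpty, "sample must not be empty")
  val sorted = sample.sorted
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    val theoretical = cdf(x)
    // The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides of the jump.
    math.max(theoretical - i / n, (i + 1) / n - theoretical)
  }.max
}

// Hypothetical theoretical distribution: uniform on [0, 1].
val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))
val ksDistance = ksStatistic(Seq(0.1, 0.4, 0.7), uniformCdf)
```

For this sample the largest gap occurs at 0.7, where the empirical CDF reaches 1 while the uniform CDF is 0.7.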
 
final def ne(arg0: AnyRef): Boolean
- Definition Classes
 - AnyRef
 
final def notify(): Unit
- Definition Classes
 - AnyRef
 - Annotations
 - @native()
 
final def notifyAll(): Unit
- Definition Classes
 - AnyRef
 - Annotations
 - @native()
 
final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
 - AnyRef
 
def toString(): String
- Definition Classes
 - AnyRef → Any
 
final def wait(): Unit
- Definition Classes
 - AnyRef
 - Annotations
 - @throws( ... )
 
final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
 - AnyRef
 - Annotations
 - @throws( ... )
 
final def wait(arg0: Long): Unit
- Definition Classes
 - AnyRef
 - Annotations
 - @throws( ... ) @native()