org.apache.spark.ml.clustering.GaussianMixture

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, GaussianMixtureParams, Params, HasAggregationDepth, HasFeaturesCol, HasMaxIter, HasPredictionCol, HasProbabilityCol, HasSeed, HasTol, HasWeightCol, DefaultParamsWritable, Identifiable, MLWritable, scala.Serializable

public class GaussianMixture extends Estimator<GaussianMixtureModel> implements GaussianMixtureParams, DefaultParamsWritable

Gaussian Mixture clustering.

This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions, each with an associated "mixing" weight that specifies its contribution to the composite.
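In standard notation, the composite distribution can be written as follows. The symbols are not defined on this page and are introduced here only for illustration: w_j is the mixing weight, \mu_j the mean, and \Sigma_j the covariance matrix of the j-th Gaussian.

```latex
p(x) = \sum_{j=1}^{k} w_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad w_j \ge 0, \quad \sum_{j=1}^{k} w_j = 1
```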

Given a set of sample points, this class maximizes the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than the convergence tolerance (tol) or until the maximum number of iterations (maxIter) is reached. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
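The iteration described above can be sketched without Spark. The following is a minimal, hypothetical illustration for one-dimensional data, not Spark's implementation: the class EmSketch, its method names, and the variance floor of 1e-6 are assumptions made for this example. It alternates an E-step and an M-step and stops when the log-likelihood changes by less than tol or after maxIter iterations.

```java
import java.util.Arrays;

/** Hypothetical, Spark-free sketch of EM for a 1-D mixture of k Gaussians. */
final class EmSketch {

    /** Density of a univariate normal distribution with the given mean and variance. */
    static double pdf(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * var)) / Math.sqrt(2.0 * Math.PI * var);
    }

    /**
     * Runs EM, updating weights w, means mu and variances var in place.
     * Returns the log-likelihood computed at each iteration.
     */
    static double[] fit(double[] xs, double[] w, double[] mu, double[] var,
                        int maxIter, double tol) {
        int n = xs.length;
        int k = w.length;
        double[][] resp = new double[n][k];
        double[] history = new double[maxIter];
        double prev = Double.NEGATIVE_INFINITY;
        int iter = 0;
        while (iter < maxIter) {
            // E-step: responsibility of each component for each point,
            // accumulating the log-likelihood under the current parameters.
            double logLik = 0.0;
            for (int i = 0; i < n; i++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = w[j] * pdf(xs[i], mu[j], var[j]);
                    total += resp[i][j];
                }
                for (int j = 0; j < k; j++) {
                    resp[i][j] /= total;
                }
                logLik += Math.log(total);
            }
            history[iter++] = logLik;

            // Stop once the log-likelihood changes by less than tol.
            if (Math.abs(logLik - prev) < tol) {
                break;
            }
            prev = logLik;

            // M-step: re-estimate weight, mean and variance of each component.
            for (int j = 0; j < k; j++) {
                double nj = 0.0;
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    nj += resp[i][j];
                    sum += resp[i][j] * xs[i];
                }
                double mean = sum / nj;
                double ss = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = xs[i] - mean;
                    ss += resp[i][j] * d * d;
                }
                w[j] = nj / n;
                mu[j] = mean;
                var[j] = Math.max(ss / nj, 1e-6); // assumed floor to avoid a zero variance
            }
        }
        return Arrays.copyOf(history, iter);
    }
}
```

The returned history makes the convergence behaviour visible: each entry should be no smaller than the previous one, which is the sense in which the process converges. A different starting point can still end at a different local optimum.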

See Also:

Serialized Form

Note:

This algorithm is limited in the number of features it can handle, since it must store a covariance matrix whose size is quadratic in the number of features. Even when the number of features does not exceed this limit, the algorithm may perform poorly on high-dimensional data, because high-dimensional data (a) is difficult to cluster at all (based on statistical/theoretical arguments) and (b) causes numerical issues with Gaussian distributions.
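To give a rough sense of the quadratic growth, the arithmetic below assumes one dense d-by-d covariance matrix of 8-byte doubles per component. The storage layout and the values of d and k are assumptions for illustration; this page does not state how the matrices are stored.

```latex
\begin{aligned}
\text{entries per covariance matrix} &= d^2, \qquad
\text{bytes for } k \text{ components} \approx 8\,k\,d^2 \\
d = 1{,}000,\ k = 10 &: \quad 8 \cdot 10 \cdot 10^{6} = 8 \times 10^{7}\ \text{bytes} \approx 80\ \text{MB} \\
d = 10{,}000,\ k = 10 &: \quad 8 \cdot 10 \cdot 10^{8} = 8 \times 10^{9}\ \text{bytes} \approx 8\ \text{GB}
\end{aligned}
```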

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.SparkShellLoggingFilter

Constructor Summary

| Constructor | Description |
| --- | --- |
| GaussianMixture() | |
| GaussianMixture(String uid) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| final IntParam | aggregationDepth() | Param for suggested depth for treeAggregate (>= 2). |
| GaussianMixture | copy(ParamMap extra) | Creates a copy of this instance with the same UID and some extra params. |
| final Param<String> | featuresCol() | Param for features column name. |
| GaussianMixtureModel | fit(Dataset<?> dataset) | Fits a model to the input data. |
| final IntParam | k() | Number of independent Gaussians in the mixture model. |
| static GaussianMixture | load(String path) | |
| final IntParam | maxIter() | Param for maximum number of iterations (>= 0). |
| final Param<String> | predictionCol() | Param for prediction column name. |
| final Param<String> | probabilityCol() | Param for column name for predicted class conditional probabilities. |
| static MLReader<GaussianMixture> | read() | |
| final LongParam | seed() | Param for random seed. |
| GaussianMixture | setAggregationDepth(int value) | |
| GaussianMixture | setFeaturesCol(String value) | |
| GaussianMixture | setK(int value) | |
| GaussianMixture | setMaxIter(int value) | |
| GaussianMixture | setPredictionCol(String value) | |
| GaussianMixture | setProbabilityCol(String value) | |
| GaussianMixture | setSeed(long value) | |
| GaussianMixture | setTol(double value) | |
| GaussianMixture | setWeightCol(String value) | |
| final DoubleParam | tol() | Param for the convergence tolerance for iterative algorithms (>= 0). |
| StructType | transformSchema(StructType schema) | Check transform validity and derive the output schema from the input schema. |
| String | uid() | An immutable unique ID for the object and its derivatives. |
| final Param<String> | weightCol() | Param for weight column name. |

Methods inherited from class org.apache.spark.ml.Estimator
fit, fit, fit, fit

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write

Methods inherited from interface org.apache.spark.ml.clustering.GaussianMixtureParams
getK, validateAndTransformSchema

Methods inherited from interface org.apache.spark.ml.param.shared.HasAggregationDepth
getAggregationDepth

Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
getFeaturesCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasMaxIter
getMaxIter

Methods inherited from interface org.apache.spark.ml.param.shared.HasPredictionCol
getPredictionCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasProbabilityCol
getProbabilityCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasSeed
getSeed

Methods inherited from interface org.apache.spark.ml.param.shared.HasTol
getTol

Methods inherited from interface org.apache.spark.ml.param.shared.HasWeightCol
getWeightCol

Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Constructor Details
- GaussianMixture
  
  public GaussianMixture(String uid)
- GaussianMixture
  
  public GaussianMixture()
Method Details
- load
  
  public static GaussianMixture load(String path)
- read
  
  public static MLReader<GaussianMixture> read()
- k
  
  public final IntParam k()
  
  Description copied from interface: GaussianMixtureParams
  
  Number of independent Gaussians in the mixture model. Must be greater than 1. Default: 2.
  
  Specified by:
  
  k in interface GaussianMixtureParams
  
  Returns:
  
  (undocumented)
- aggregationDepth
  
  public final IntParam aggregationDepth()
  
  Description copied from interface: HasAggregationDepth
  
  Param for suggested depth for treeAggregate (>= 2).
  
  Specified by:
  
  aggregationDepth in interface HasAggregationDepth
  
  Returns:
  
  (undocumented)
- tol
  
  public final DoubleParam tol()
  
  Description copied from interface: HasTol
  
  Param for the convergence tolerance for iterative algorithms (>= 0).
  
  Specified by:
  
  tol in interface HasTol
  
  Returns:
  
  (undocumented)
- probabilityCol
  
  public final Param<String> probabilityCol()
  
  Description copied from interface: HasProbabilityCol
  
  Param for column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.
  
  Specified by:
  
  probabilityCol in interface HasProbabilityCol
  
  Returns:
  
  (undocumented)
- weightCol
  
  public final Param<String> weightCol()
  
  Description copied from interface: HasWeightCol
  
  Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.
  
  Specified by:
  
  weightCol in interface HasWeightCol
  
  Returns:
  
  (undocumented)
- predictionCol
  
  public final Param<String> predictionCol()
  
  Description copied from interface: HasPredictionCol
  
  Param for prediction column name.
  
  Specified by:
  
  predictionCol in interface HasPredictionCol
  
  Returns:
  
  (undocumented)
- seed
  
  public final LongParam seed()
  
  Description copied from interface: HasSeed
  
  Param for random seed.
  
  Specified by:
  
  seed in interface HasSeed
  
  Returns:
  
  (undocumented)
- featuresCol
  
  public final Param<String> featuresCol()
  
  Description copied from interface: HasFeaturesCol
  
  Param for features column name.
  
  Specified by:
  
  featuresCol in interface HasFeaturesCol
  
  Returns:
  
  (undocumented)
- maxIter
  
  public final IntParam maxIter()
  
  Description copied from interface: HasMaxIter
  
  Param for maximum number of iterations (>= 0).
  
  Specified by:
  
  maxIter in interface HasMaxIter
  
  Returns:
  
  (undocumented)
- uid
  
  public String uid()
  
  Description copied from interface: Identifiable
  
  An immutable unique ID for the object and its derivatives.
  
  Specified by:
  
  uid in interface Identifiable
  
  Returns:
  
  (undocumented)
- copy
  
  public GaussianMixture copy(ParamMap extra)
  
  Description copied from interface: Params
  
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
  
  Specified by:
  
  copy in interface Params
  
  Specified by:
  
  copy in class Estimator<GaussianMixtureModel>
  
  Parameters:
  
  extra - (undocumented)
  
  Returns:
  
  (undocumented)
- setFeaturesCol
  
  public GaussianMixture setFeaturesCol(String value)
- setPredictionCol
  
  public GaussianMixture setPredictionCol(String value)
- setProbabilityCol
  
  public GaussianMixture setProbabilityCol(String value)
- setWeightCol
  
  public GaussianMixture setWeightCol(String value)
- setK
  
  public GaussianMixture setK(int value)
- setMaxIter
  
  public GaussianMixture setMaxIter(int value)
- setTol
  
  public GaussianMixture setTol(double value)
- setSeed
  
  public GaussianMixture setSeed(long value)
- setAggregationDepth
  
  public GaussianMixture setAggregationDepth(int value)
- fit
  
  public GaussianMixtureModel fit(Dataset<?> dataset)
  
  Description copied from class: Estimator
  
  Fits a model to the input data.
  
  Specified by:
  
  fit in class Estimator<GaussianMixtureModel>
  
  Parameters:
  
  dataset - (undocumented)
  
  Returns:
  
  (undocumented)
- transformSchema
  
  public StructType transformSchema(StructType schema)
  
  Description copied from class: PipelineStage
  
  Check transform validity and derive the output schema from the input schema.
  We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks that do not depend on other parameters are handled by Param.validate().
  A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
  
  Specified by:
  
  transformSchema in class PipelineStage
  
  Parameters:
  
  schema - (undocumented)
  
  Returns:
  
  (undocumented)
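
The relationship between the probabilityCol and predictionCol outputs described above can be illustrated with a small Spark-free sketch. It assumes one-dimensional data and assumes that the prediction is the index of the component with the largest membership probability. The class MembershipSketch and its method names are hypothetical and are not part of the Spark API.

```java
/** Hypothetical, Spark-free sketch: membership probabilities and hard assignment in 1-D. */
final class MembershipSketch {

    /** Density of a univariate normal distribution with the given mean and variance. */
    static double pdf(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * var)) / Math.sqrt(2.0 * Math.PI * var);
    }

    /** Probability of each component given x; the entries sum to 1. */
    static double[] probability(double x, double[] w, double[] mu, double[] var) {
        double[] p = new double[w.length];
        double total = 0.0;
        for (int j = 0; j < w.length; j++) {
            p[j] = w[j] * pdf(x, mu[j], var[j]);
            total += p[j];
        }
        for (int j = 0; j < p.length; j++) {
            p[j] /= total;
        }
        return p;
    }

    /** Index of the most probable component; ties go to the lowest index. */
    static int prediction(double[] probability) {
        int best = 0;
        for (int j = 1; j < probability.length; j++) {
            if (probability[j] > probability[best]) {
                best = j;
            }
        }
        return best;
    }
}
```

For example, with two equally weighted components centred at -2 and 3 and unit variances, a point at 2.9 is assigned almost entirely to the second component, while a point at 0.5 lies exactly halfway between the means and receives probability 0.5 for each.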
