org.apache.spark.mllib.clustering.BisectingKMeans

All Implemented Interfaces: org.apache.spark.internal.Logging

public class BisectingKMeans extends Object implements org.apache.spark.internal.Logging

A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. It then iteratively finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
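The priority rule in the last sentence can be illustrated with a small, Spark-free sketch. It assumes that each bisection adds exactly one leaf, so at most k minus the current number of leaves can be split on a level, and that "larger" refers to the number of points in a cluster. The class and method names (`BisectPrioritySketch`, `chooseClustersToSplit`) are hypothetical and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not Spark's implementation.
public final class BisectPrioritySketch {
    private BisectPrioritySketch() {}

    /**
     * Picks which divisible bottom-level clusters to bisect.
     * Assumption: every bisection replaces one leaf with two, adding one leaf,
     * so no more than (k - currentLeaves) clusters may be split.
     * Returns the sizes of the chosen clusters, largest first.
     */
    public static List<Long> chooseClustersToSplit(
            List<Long> divisibleSizes, int currentLeaves, int k) {
        int budget = Math.max(0, k - currentLeaves);
        List<Long> sorted = new ArrayList<>(divisibleSizes);
        sorted.sort(Collections.reverseOrder()); // larger clusters get higher priority
        int count = Math.min(budget, sorted.size());
        return new ArrayList<>(sorted.subList(0, count));
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 300L, 120L, 75L);
        // Four leaves, k = 6: only two more leaves are allowed.
        System.out.println(chooseClustersToSplit(sizes, 4, 6));  // prints [300, 120]
        // Four leaves, k = 10: every divisible cluster can be split.
        System.out.println(chooseClustersToSplit(sizes, 4, 10)); // prints [300, 120, 75, 40]
    }
}
```

With four divisible leaves and k = 6, only the two largest clusters are bisected. The smaller ones stay as leaves.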

param: k the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.

param: maxIterations the max number of k-means iterations to split clusters (default: 20)

param: minDivisibleClusterSize the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1)

param: seed a random seed (default: hash value of the class name)
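The two readings of minDivisibleClusterSize can be made concrete with a minimal sketch. It converts the parameter into an absolute point-count threshold, given the total number of input points. Rounding up with `Math.ceil` is an assumption of this sketch, as are the names `MinDivisibleSizeSketch`, `minPoints`, and `isLargeEnough`. The real algorithm may apply further divisibility conditions that are not modeled here.

```java
// Hypothetical illustration only; this is not Spark's implementation.
public final class MinDivisibleSizeSketch {
    private MinDivisibleSizeSketch() {}

    /**
     * Interprets minDivisibleClusterSize as a point count when it is >= 1.0
     * and as a proportion of totalPoints when it is < 1.0.
     * Assumption: fractional results are rounded up.
     */
    public static long minPoints(double minDivisibleClusterSize, long totalPoints) {
        if (minDivisibleClusterSize >= 1.0) {
            return (long) Math.ceil(minDivisibleClusterSize);
        }
        return (long) Math.ceil(minDivisibleClusterSize * totalPoints);
    }

    /** True if a cluster of the given size meets the size threshold. */
    public static boolean isLargeEnough(
            long clusterSize, double minDivisibleClusterSize, long totalPoints) {
        return clusterSize >= minPoints(minDivisibleClusterSize, totalPoints);
    }

    public static void main(String[] args) {
        System.out.println(minPoints(1.0, 10_000L));   // prints 1 (the default)
        System.out.println(minPoints(25.0, 10_000L));  // prints 25 (absolute count)
        System.out.println(minPoints(0.25, 10_000L));  // prints 2500 (25% of the points)
        System.out.println(isLargeEnough(2_499L, 0.25, 10_000L)); // prints false
    }
}
```

Under this reading, the default value of 1 lets any non-empty cluster qualify on size, while a value such as 0.25 restricts splitting to clusters holding at least a quarter of the data.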

See Also:

Steinbach, Karypis, and Kumar, A comparison of document clustering techniques, KDD Workshop on Text Mining, 2000.

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

Constructor

Description

BisectingKMeans()

Constructs an instance with the default configuration.
Method Summary

Modifier and Type

Method

Description

String

getDistanceMeasure()

Gets the distance measure used by the algorithm.

int

getK()

Gets the desired number of leaf clusters.

int

getMaxIterations()

Gets the max number of k-means iterations to split clusters.

double

getMinDivisibleClusterSize()

Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.

long

getSeed()

Gets the random seed.

BisectingKMeansModel

run(JavaRDD<Vector> data)

Java-friendly version of run().

BisectingKMeansModel

run(RDD<Vector> input)

Runs the bisecting k-means algorithm.

BisectingKMeans

setDistanceMeasure(String distanceMeasure)

Sets the distance measure used by the algorithm.

BisectingKMeans

setK(int k)

Sets the desired number of leaf clusters (default: 4).

BisectingKMeans

setMaxIterations(int maxIterations)

Sets the max number of k-means iterations to split clusters (default: 20).

BisectingKMeans

setMinDivisibleClusterSize(double minDivisibleClusterSize)

Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).

BisectingKMeans

setSeed(long seed)

Sets the random seed (default: hash value of the class name).

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
- BisectingKMeans
  
  public BisectingKMeans()
  
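  Constructs an instance with the default configuration: k = 4, maxIterations = 20, minDivisibleClusterSize = 1, and a seed derived from the hash value of the class name.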
Method Details
- setK
  
  public BisectingKMeans setK(int k)
  
  Sets the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
  
  Parameters:
  
  k - the desired number of leaf clusters
  
  Returns:
  
  this BisectingKMeans instance, so that setter calls can be chained
- getK
  
  public int getK()
  
  Gets the desired number of leaf clusters.
  
  Returns:
  
  the desired number of leaf clusters
- setMaxIterations
  
  public BisectingKMeans setMaxIterations(int maxIterations)
  
  Sets the max number of k-means iterations to split clusters (default: 20).
  
  Parameters:
  
  maxIterations - the max number of k-means iterations used to split a cluster
  
  Returns:
  
  this BisectingKMeans instance, so that setter calls can be chained
- getMaxIterations
  
  public int getMaxIterations()
  
  Gets the max number of k-means iterations to split clusters.
  
  Returns:
  
  the max number of k-means iterations used to split a cluster
- setMinDivisibleClusterSize
  
  public BisectingKMeans setMinDivisibleClusterSize(double minDivisibleClusterSize)
  
  Sets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1).
  
  Parameters:
  
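  minDivisibleClusterSize - a point count if greater than or equal to 1.0, or a proportion of all points if less than 1.0
  
  Returns:
  
  this BisectingKMeans instance, so that setter calls can be chained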
- getMinDivisibleClusterSize
  
  public double getMinDivisibleClusterSize()
  
  Gets the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster.
  
  Returns:
  
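  the minimum size of a divisible cluster, as a point count (if greater than or equal to 1.0) or a proportion of points (if less than 1.0)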
- setSeed
  
  public BisectingKMeans setSeed(long seed)
  
  Sets the random seed (default: hash value of the class name).
  
  Parameters:
  
  seed - the random seed
  
  Returns:
  
  this BisectingKMeans instance, so that setter calls can be chained
- getSeed
  
  public long getSeed()
  
  Gets the random seed.
  
  Returns:
  
  the random seed
- getDistanceMeasure
  
  public String getDistanceMeasure()
  
  Gets the distance measure used by the algorithm.
  
  Returns:
  
  the name of the distance measure
- setDistanceMeasure
  
  public BisectingKMeans setDistanceMeasure(String distanceMeasure)
  
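  Sets the distance measure used by the algorithm. Supported values are "euclidean" (the default) and "cosine".
  
  Parameters:
  
  distanceMeasure - the name of the distance measure
  
  Returns:
  
  this BisectingKMeans instance, so that setter calls can be chained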
- run
  
  public BisectingKMeansModel run(RDD<Vector> input)
  
  Runs the bisecting k-means algorithm.
  
  Parameters:
  
  input - RDD of vectors
  
  Returns:
  
  the trained bisecting k-means model
- run
  
  public BisectingKMeansModel run(JavaRDD<Vector> data)
  
  Java-friendly version of run().
  
  Parameters:
  
  data - RDD of vectors, as a JavaRDD
  
  Returns:
  
  the trained bisecting k-means model
