org.apache.spark.ml.evaluation.CosineSilhouette

public class CosineSilhouette extends Object

The algorithm which is implemented in this object, instead, is an efficient and parallel implementation of the Silhouette using the cosine distance measure. The cosine distance measure is defined as 1 - s where s is the cosine similarity between two points.

The total distance of the point X to the points $C_{i}$ belonging to the cluster $\Gamma$ is:

$$ \sum\limits_{i=1}^N d(X, C_{i} ) = \sum\limits_{i=1}^N \Big( 1 - \frac{\sum\limits_{j=1}^D x_{j}c_{ij} }{ \|X\|\|C_{i}\|} \Big) = \sum\limits_{i=1}^N 1 - \sum\limits_{i=1}^N \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \frac{c_{ij}}{\|C_{i}\|} = N - \sum\limits_{j=1}^D \frac{x_{j}}{\|X\|} \Big( \sum\limits_{i=1}^N \frac{c_{ij}}{\|C_{i}\|} \Big) $$

where $x_{j}$ is the j-th dimension of the point X and $c_{ij}$ is the j-th dimension of the i-th point in cluster $\Gamma$ .

Then, we can define the vector:

$$ \xi_{X} : \xi_{X i} = \frac{x_{i}}{\|X\|}, i = 1, ..., D $$

which can be precomputed for each point and the vector

$$ \Omega_{\Gamma} : \Omega_{\Gamma i} = \sum\limits_{j=1}^N \xi_{C_{j}i}, i = 1, ..., D $$

which can be precomputed too for each cluster $\Gamma$ by its points $C_{i}$ .

With these definitions, the numerator becomes:

$$ N - \sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j} $$

Thus the average distance of a point X to the points of the cluster $\Gamma$ is:

$$ 1 - \frac{\sum\limits_{j=1}^D \xi_{X j} \Omega_{\Gamma j}}{N} $$

In the implementation, the precomputed values for the clusters are distributed among the worker nodes via broadcasted variables, because we can assume that the clusters are limited in number.

The main strengths of this algorithm are the low computational complexity and the intrinsic parallelism. The precomputed information for each point and for each cluster can be computed with a computational complexity which is O(N/W), where N is the number of points in the dataset and W is the number of worker nodes. After that, every point can be analyzed independently from the others.

For every point we need to compute the average distance to all the clusters. Since the formula above requires O(D) operations, this phase has a computational complexity which is O(C*D*N/W) where C is the number of clusters (which we assume quite low), D is the number of dimensions, N is the number of points in the dataset and W is the number of worker nodes.

Constructor Summary

Constructors

Constructor

Description

CosineSilhouette()
Method Summary

Modifier and Type

Method

Description

static scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>>

computeClusterStats(Dataset<Row> df, String featuresCol, String predictionCol, String weightCol)

The method takes the input dataset and computes the aggregated values about a cluster which are needed by the algorithm.

static double

computeSilhouetteCoefficient(Broadcast<scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>>> broadcastedClustersMap, Vector normalizedFeatures, double clusterId, double weight)

It computes the Silhouette coefficient for a point.

static double

computeSilhouetteScore(Dataset<?> dataset, String predictionCol, String featuresCol, String weightCol)

Compute the Silhouette score of the dataset using the cosine distance measure.

static double

overallScore(Dataset<Row> df, Column scoreColumn, Column weightColumn)

static double

pointSilhouetteCoefficient(scala.collection.immutable.Set<Object> clusterIds, double pointClusterId, double weightSum, double weight, scala.Function1<Object,Object> averageDistanceToCluster)

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- CosineSilhouette
  
  public CosineSilhouette()
Method Details
- computeClusterStats
  
  public static scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>> computeClusterStats(Dataset<Row> df, String featuresCol, String predictionCol, String weightCol)
  
  The method takes the input dataset and computes the aggregated values about a cluster which are needed by the algorithm.
  
  Parameters:
  
  df - The DataFrame which contains the input data
  
  featuresCol - The name of the column which contains the feature vector of the point.
  
  predictionCol - The name of the column which contains the predicted cluster id for the point.
  
  weightCol - The name of the column which contains the instance weight.
  
  Returns:
  
  A Map which associates each cluster id to a its statistics (i.e. the precomputed values N and $\Omega_{\Gamma}$ ).
- computeSilhouetteCoefficient
  
  public static double computeSilhouetteCoefficient(Broadcast<scala.collection.immutable.Map<Object,scala.Tuple2<Vector,Object>>> broadcastedClustersMap, Vector normalizedFeatures, double clusterId, double weight)
  
  It computes the Silhouette coefficient for a point.
  
  Parameters:
  
  broadcastedClustersMap - A map of the precomputed values for each cluster.
  
  normalizedFeatures - The Vector representing the normalized features of the current point.
  
  clusterId - The id of the cluster the current point belongs to.
  
  weight - The instance weight of the current point.
  
  Returns:
  
  (undocumented)
- computeSilhouetteScore
  
  public static double computeSilhouetteScore(Dataset<?> dataset, String predictionCol, String featuresCol, String weightCol)
  
  Compute the Silhouette score of the dataset using the cosine distance measure.
  
  Parameters:
  
  dataset - The input dataset (previously clustered) on which compute the Silhouette.
  
  predictionCol - The name of the column which contains the predicted cluster id for the point.
  
  featuresCol - The name of the column which contains the feature vector of the point.
  
  weightCol - The name of the column which contains the instance weight.
  
  Returns:
  
  The average of the Silhouette values of the clustered data.
- pointSilhouetteCoefficient
  
  public static double pointSilhouetteCoefficient(scala.collection.immutable.Set<Object> clusterIds, double pointClusterId, double weightSum, double weight, scala.Function1<Object,Object> averageDistanceToCluster)
- overallScore
  
  public static double overallScore(Dataset<Row> df, Column scoreColumn, Column weightColumn)

Class CosineSilhouette

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

CosineSilhouette

Method Details

computeClusterStats

computeSilhouetteCoefficient

computeSilhouetteScore

pointSilhouetteCoefficient

overallScore