BisectingKMeansModel

class pyspark.mllib.clustering.BisectingKMeansModel(java_model: JavaObject)

A clustering model derived from the bisecting k-means method.

New in version 2.0.0.

Examples

>>> from numpy import array
>>> from pyspark.mllib.clustering import BisectingKMeans
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
>>> bskm = BisectingKMeans()
>>> model = bskm.train(sc.parallelize(data, 2), k=4)
>>> p = array([0.0, 0.0])
>>> model.predict(p)
0
>>> model.k
4
>>> model.computeCost(p)
0.0

Methods

call(name, *a)

Call the named method of java_model with the given arguments.

computeCost(x)

Return the bisecting k-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

predict(x)

Find the cluster that each of the points belongs to in this model.

Attributes

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

k

Get the number of clusters.

Methods Documentation

call(name: str, *a: Any) → Any

Call the named method of java_model with the given arguments.

computeCost(x: Union[VectorLike, pyspark.rdd.RDD[VectorLike]]) → float

Return the bisecting k-means cost (sum of squared distances of points to their nearest center) for this model on the given data. If given an RDD of points, return the sum of the costs over all points.

New in version 2.0.0.

Parameters
x : pyspark.mllib.linalg.Vector or pyspark.RDD

A data point (or RDD of points) for which to compute the cost. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).
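The cost described above can be illustrated without a Spark cluster. The following is a minimal NumPy sketch, not the library's implementation: the helper name `nearest_center_cost` and the hard-coded `centers` array are assumptions made for illustration, and it treats the centers as a flat list, whereas the model may assign points through its internal cluster hierarchy, so values can differ from `computeCost` on real models.

```python
import numpy as np


def nearest_center_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center.

    Hypothetical helper: accepts one point (1-D) or many points (2-D).
    """
    points = np.atleast_2d(np.asarray(points, dtype=float))
    centers = np.asarray(centers, dtype=float)
    # Pairwise differences with shape (n_points, n_centers, n_features).
    diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Squared distances with shape (n_points, n_centers).
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Keep the smallest squared distance per point, then sum over points.
    return float(np.sum(sq_dists.min(axis=1)))


# Assumed centers, chosen to match the four points in the example data.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

single_cost = nearest_center_cost([0.0, 0.0], centers)
batch_cost = nearest_center_cost([[0.5, 0.5], [9.0, 9.0]], centers)
```

A point that coincides with a center contributes zero, which is consistent with the `0.0` shown in the class example. For a batch, the per-point costs are added into a single float.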

predict(x: Union[VectorLike, pyspark.rdd.RDD[VectorLike]]) → Union[int, pyspark.rdd.RDD[int]]

Find the cluster that each of the points belongs to in this model.

New in version 2.0.0.

Parameters
x : pyspark.mllib.linalg.Vector or pyspark.RDD

A data point (or RDD of points) for which to determine the cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).

Returns
int or pyspark.RDD of int

Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.
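The single-point versus collection behaviour of the return value can be mimicked locally. This is a hedged sketch under stated assumptions: `nearest_center_index` and the `centers` array are hypothetical, a plain list stands in for an RDD, and the flat nearest-center lookup is a simplification that may not match how the model assigns points internally.

```python
import numpy as np


def nearest_center_index(points, centers):
    """Return the index of the nearest center for one point or for many.

    Hypothetical helper: a 1-D input yields an int, a 2-D input yields a
    list of ints (standing in for an RDD of ints).
    """
    points_arr = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    batch = np.atleast_2d(points_arr)
    # Squared distances with shape (n_points, n_centers).
    diffs = batch[:, np.newaxis, :] - centers[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    indices = sq_dists.argmin(axis=1)
    if points_arr.ndim == 1:
        # Single point in, single index out.
        return int(indices[0])
    return [int(i) for i in indices]


# Assumed centers, chosen to match the four points in the example data.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

one_index = nearest_center_index([0.0, 0.0], centers)
many_indices = nearest_center_index([[1.0, 1.0], [8.5, 9.5]], centers)
```

With a real model, the collection case would be an RDD passed to `predict`, and the resulting indices would be retrieved with an action such as `collect()`.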

Attributes Documentation

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

New in version 2.0.0.

k

Get the number of clusters.

New in version 2.0.0.