StreamingKMeans

class pyspark.mllib.clustering.StreamingKMeans(k: int = 2, decayFactor: float = 1.0, timeUnit: str = 'batches')[source]

Provides methods to set k, decayFactor, timeUnit to configure the KMeans algorithm for fitting and predicting on incoming dstreams. More details on how the centroids are updated are provided under the docs of StreamingKMeansModel.

New in version 1.5.0.

Parameters
kint, optional

Number of clusters. (default: 2)

decayFactorfloat, optional

Forgetfulness of the previous centroids. (default: 1.0)

timeUnitstr, optional

Can be “batches” or “points”. If points, then the decay factor is raised to the power of number of new points and if batches, then decay factor will be used as is. (default: “batches”)

Methods

latestModel()

Return the latest model

predictOn(dstream)

Make predictions on a dstream.

predictOnValues(dstream)

Make predictions on a keyed dstream.

setDecayFactor(decayFactor)

Set decay factor.

setHalfLife(halfLife, timeUnit)

Set number of batches after which the centroids of that particular batch has half the weightage.

setInitialCenters(centers, weights)

Set initial centers.

setK(k)

Set number of clusters.

setRandomCenters(dim, weight, seed)

Set the initial centers to be random samples from a gaussian population with constant weights.

trainOn(dstream)

Train the model on the incoming dstream.

Methods Documentation

latestModel() → Optional[pyspark.mllib.clustering.StreamingKMeansModel][source]

Return the latest model

New in version 1.5.0.

predictOn(dstream: DStream[VectorLike]) → DStream[int][source]

Make predictions on a dstream. Returns a transformed dstream object

New in version 1.5.0.

predictOnValues(dstream: DStream[Tuple[T, VectorLike]]) → DStream[Tuple[T, int]][source]

Make predictions on a keyed dstream. Returns a transformed dstream object.

New in version 1.5.0.

setDecayFactor(decayFactor: float)pyspark.mllib.clustering.StreamingKMeans[source]

Set decay factor.

New in version 1.5.0.

setHalfLife(halfLife: float, timeUnit: str)pyspark.mllib.clustering.StreamingKMeans[source]

Set number of batches after which the centroids of that particular batch has half the weightage.

New in version 1.5.0.

setInitialCenters(centers: List[VectorLike], weights: List[float]) → StreamingKMeans[source]

Set initial centers. Should be set before calling trainOn.

New in version 1.5.0.

setK(k: int)pyspark.mllib.clustering.StreamingKMeans[source]

Set number of clusters.

New in version 1.5.0.

setRandomCenters(dim: int, weight: float, seed: int)pyspark.mllib.clustering.StreamingKMeans[source]

Set the initial centers to be random samples from a gaussian population with constant weights.

New in version 1.5.0.

trainOn(dstream: DStream[VectorLike]) → None[source]

Train the model on the incoming dstream.

New in version 1.5.0.