package clustering
Type Members
-
class
BisectingKMeans extends Logging
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. Iteratively, it finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
- Annotations
- @Since( "1.6.0" )
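The level-by-level bisecting procedure described for BisectingKMeans can be illustrated with a minimal, single-machine sketch. It is not Spark's implementation: it assumes one-dimensional points, treats a cluster as divisible when it holds at least two distinct values, and uses a few rounds of plain 2-means for each split. The names `BisectingSketch`, `divisible`, `bisect`, and `run` are hypothetical.

```scala
object BisectingSketch {
  type Cluster = Vector[Double]

  // Assumption: a cluster is divisible if it holds at least two distinct values.
  def divisible(c: Cluster): Boolean = c.distinct.size > 1

  // Split one divisible cluster in two with a few rounds of 1-D 2-means,
  // starting from the minimum and maximum values as the two centers.
  def bisect(c: Cluster, rounds: Int = 10): (Cluster, Cluster) = {
    var lo = c.min
    var hi = c.max
    for (_ <- 1 to rounds) {
      val (left, right) = c.partition(x => math.abs(x - lo) <= math.abs(x - hi))
      lo = left.sum / left.size
      hi = right.sum / right.size
    }
    c.partition(x => math.abs(x - lo) <= math.abs(x - hi))
  }

  // Bisect the bottom level repeatedly until there are k leaves
  // or no bottom-level cluster is divisible.
  def run(points: Cluster, k: Int): Vector[Cluster] = {
    var done = Vector.empty[Cluster] // leaves that are no longer on the bottom level
    var active = Vector(points)      // clusters on the current bottom level
    while (done.size + active.size < k && active.exists(divisible)) {
      // Each bisection adds exactly one leaf, so this many splits are still allowed.
      val budget = k - done.size - active.size
      // Larger clusters get higher priority when the budget is too small for all.
      val toSplit = active.filter(divisible).sortBy(c => -c.size).take(budget)
      val (split, kept) = active.partition(c => toSplit.exists(_ eq c))
      done ++= kept
      active = split.flatMap { c => val (a, b) = bisect(c); Vector(a, b) }
    }
    done ++ active
  }
}
```

With nine points in three visible groups and k = 3, the first level splits the whole set in two, and the second level has budget for only one more split, so the larger of the two clusters is bisected.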
-
class
BisectingKMeansModel extends Serializable with Saveable with Logging
Clustering model produced by BisectingKMeans. The prediction is done level-by-level from the root node to a leaf node, and at each node among its children the closest to the input point is selected.
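The root-to-leaf prediction described above can be sketched as a short recursive descent. This is an illustration under stated assumptions, not the model's internal code: the `Node` type, its fields, and the use of squared Euclidean distance are all hypothetical choices made for the example.

```scala
object BisectingPredictSketch {
  // Hypothetical tree node: a center plus zero or more children (leaves have none).
  final case class Node(id: Int, center: Vector[Double], children: Vector[Node] = Vector.empty)

  // Squared Euclidean distance (assumed distance measure for this sketch).
  def sqDist(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Walk down from the given node, stepping into the closest child at each level,
  // and return the id of the leaf that is reached.
  @annotation.tailrec
  def predict(node: Node, point: Vector[Double]): Int =
    if (node.children.isEmpty) node.id
    else predict(node.children.minBy(c => sqDist(c.center, point)), point)
}
```

Note that the descent is greedy: a point is assigned to the leaf reached through the closest child at every level, which is not necessarily the leaf whose center is globally closest.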
- Annotations
- @Since( "1.6.0" )
-
class
DistributedLDAModel extends LDAModel
Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.
- Annotations
- @Since( "1.3.0" )
-
final
class
EMLDAOptimizer extends LDAOptimizer
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Currently, the underlying implementation uses Expectation-Maximization (EM), implemented according to the Asuncion et al. (2009) paper referenced below.
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. This class implements their "smoothed" LDA model.
- Paper which clearly explains several algorithms, including EM: Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
- Annotations
- @Since( "1.4.0" )
-
class
GaussianMixture extends Serializable
This class performs expectation maximization for multivariate Gaussian Mixture Models (GMMs). A GMM represents a composite distribution of independent Gaussian distributions with associated "mixing" weights that specify each component's contribution to the composite.
Given a set of sample points, this class will maximize the log-likelihood for a mixture of k Gaussians, iterating until the log-likelihood changes by less than convergenceTol, or until it has reached the max number of iterations. While this process is generally guaranteed to converge, it is not guaranteed to find a global optimum.
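The stopping rule described above (iterate until the log-likelihood changes by less than convergenceTol or the maximum number of iterations is reached) can be illustrated with a small expectation-maximization loop. This is a sketch only: it is restricted to one-dimensional data, whereas the class handles multivariate Gaussians, and the names `GmmEmSketch`, `Component`, `emStep`, and `fit` are hypothetical.

```scala
object GmmEmSketch {
  // One mixture component: mixing weight, mean, and variance (1-D only).
  final case class Component(weight: Double, mean: Double, variance: Double)

  def gaussianPdf(x: Double, mean: Double, variance: Double): Double =
    math.exp(-(x - mean) * (x - mean) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

  def logLikelihood(data: Seq[Double], mix: Seq[Component]): Double =
    data.map(x => math.log(mix.map(c => c.weight * gaussianPdf(x, c.mean, c.variance)).sum)).sum

  // One EM step: compute responsibilities (E), then re-estimate parameters (M).
  def emStep(data: Seq[Double], mix: Seq[Component]): Seq[Component] = {
    val resp = data.map { x =>
      val p = mix.map(c => c.weight * gaussianPdf(x, c.mean, c.variance))
      val total = p.sum
      p.map(_ / total)
    }
    mix.indices.map { j =>
      val r = resp.map(_(j))
      val nj = r.sum
      val mean = data.zip(r).map { case (x, w) => x * w }.sum / nj
      val variance = data.zip(r).map { case (x, w) => w * (x - mean) * (x - mean) }.sum / nj
      // Assumption: a small variance floor keeps a component from collapsing.
      Component(nj / data.size, mean, math.max(variance, 1e-6))
    }
  }

  // Returns the fitted mixture and the log-likelihood recorded before and after each step.
  def fit(data: Seq[Double], init: Seq[Component],
          convergenceTol: Double, maxIterations: Int): (Seq[Component], Vector[Double]) = {
    var mix = init
    var history = Vector(logLikelihood(data, mix))
    var converged = false
    var iter = 0
    while (iter < maxIterations && !converged) {
      mix = emStep(data, mix)
      val ll = logLikelihood(data, mix)
      converged = math.abs(ll - history.last) < convergenceTol
      history :+= ll
      iter += 1
    }
    (mix, history)
  }
}
```

Because the result depends on the initial components, different values of `init` can converge to different local optima, which is the behavior the description warns about.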
- Annotations
- @Since( "1.3.0" )
- Note
This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data, because high-dimensional data (a) is difficult to cluster at all (based on statistical/theoretical arguments) and (b) causes numerical issues with Gaussian distributions.
-
class
GaussianMixtureModel extends Serializable with Saveable
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective mean and covariance for each Gaussian distribution i=1..k.
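In the notation above, the density of such a mixture can be written in the standard form below. The symbol $\mathcal{N}$ for the multivariate Gaussian density and the constraint on the weights are the usual conventions, stated here as assumptions rather than taken from this page.

$$ \begin{align} p(x) &= \sum_{i=1}^{k} w(i) \, \mathcal{N}\big(x \mid mu(i), sigma(i)\big) \\ \sum_{i=1}^{k} w(i) &= 1, \quad w(i) \ge 0 \end{align} $$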
- Annotations
- @Since( "1.3.0" )
-
class
KMeans extends Serializable with Logging
K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al).
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.
- Annotations
- @Since( "0.8.0" )
-
class
KMeansModel extends Saveable with Serializable with PMMLExportable
A clustering model for K-means. Each point belongs to the cluster with the closest center.
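The nearest-center rule can be sketched in a few lines of plain Scala. This is an illustration, not the model's own code: it assumes squared Euclidean distance, and the names `NearestCenterSketch`, `predict`, and `cost` are hypothetical.

```scala
object NearestCenterSketch {
  // Squared Euclidean distance (assumed distance measure for this sketch).
  def sqDist(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to the point.
  def predict(centers: IndexedSeq[Vector[Double]], point: Vector[Double]): Int =
    centers.indices.minBy(i => sqDist(centers(i), point))

  // Sum of squared distances from each point to its assigned center.
  def cost(centers: IndexedSeq[Vector[Double]], points: Seq[Vector[Double]]): Double =
    points.map(p => sqDist(centers(predict(centers, p)), p)).sum
}
```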
- Annotations
- @Since( "0.8.0" )
-
class
LDA extends Logging
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept
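The distinction between terms and tokens can be made concrete with a toy corpus. The sketch below assumes naive lower-casing and whitespace tokenization, and the names `LdaTermsSketch`, `tokens`, `vocabulary`, and `termCounts` are hypothetical; it only illustrates the terminology and does not call LDA.

```scala
object LdaTermsSketch {
  // Tokens: every instance of a word in the document, in order.
  def tokens(doc: String): Vector[String] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // Vocabulary: the distinct terms across all documents, sorted for a stable index.
  def vocabulary(docs: Seq[String]): Vector[String] =
    docs.flatMap(tokens).distinct.sorted.toVector

  // Term-count vector: how many tokens of each vocabulary term the document contains.
  def termCounts(doc: String, vocab: Vector[String]): Vector[Int] = {
    val counts = tokens(doc).groupBy(identity).map { case (t, ts) => t -> ts.size }
    vocab.map(t => counts.getOrElse(t, 0))
  }
}
```

For the document "the cat saw the cat" there are five tokens but only three distinct terms. A topic, in this vocabulary, would be a vector of non-negative probabilities over the terms that sums to 1.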
References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
- Annotations
- @Since( "1.3.0" )
-
abstract
class
LDAModel extends Saveable
Latent Dirichlet Allocation (LDA) model.
This abstraction permits different underlying representations, including local and distributed data structures.
- Annotations
- @Since( "1.3.0" )
-
trait
LDAOptimizer extends AnyRef
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can hold optimizer-specific parameters for users to set.
- Annotations
- @Since( "1.4.0" )
-
class
LocalLDAModel extends LDAModel with Serializable
Local LDA model. This model stores only the inferred topics.
- Annotations
- @Since( "1.3.0" )
-
final
class
OnlineLDAOptimizer extends LDAOptimizer with Logging
An online optimizer for LDA. The Optimizer implements the Online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration, and updates the term-topic distribution adaptively.
Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
- Annotations
- @Since( "1.4.0" )
-
class
PowerIterationClustering extends Serializable
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
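The embedding step mentioned in the abstract can be sketched on a small dense matrix. This is an illustration under stated assumptions rather than the distributed implementation: the similarity matrix is row-normalized, the vector is rescaled by its L1 norm after every step, and the iteration simply stops after a fixed number of steps. The names `PowerIterationSketch`, `normalize`, and `embed` are hypothetical.

```scala
object PowerIterationSketch {
  // Row-normalize a similarity matrix A into W = D^-1 A.
  // Assumption: every row has a positive sum.
  def normalize(a: Vector[Vector[Double]]): Vector[Vector[Double]] =
    a.map { row => val d = row.sum; row.map(_ / d) }

  // Truncated power iteration: v <- W v / ||W v||_1, for a fixed number of steps.
  // The result is a one-dimensional embedding with one value per data point.
  def embed(w: Vector[Vector[Double]], v0: Vector[Double], steps: Int): Vector[Double] = {
    val norm0 = v0.map(x => math.abs(x)).sum
    var v = v0.map(_ / norm0)
    for (_ <- 1 to steps) {
      val wv = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
      val norm = wv.map(x => math.abs(x)).sum
      v = wv.map(_ / norm)
    }
    v
  }
}
```

Points in the same cluster end up with nearly equal embedding values, so a simple one-dimensional clustering of the result (for example with k-means) can recover the groups.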
- Annotations
- @Since( "1.3.0" )
-
class
PowerIterationClusteringModel extends Saveable with Serializable
Model produced by PowerIterationClustering.
- Annotations
- @Since( "1.3.0" )
-
class
StreamingKMeans extends Logging with Serializable
StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data. See KMeansModel for details on algorithm and update rules.
Use a builder pattern to construct a streaming k-means analysis in an application, like:
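// Configure the analysis with the builder-style setters.
val model = new StreamingKMeans()
  .setDecayFactor(0.5)
  .setK(3)
  .setRandomCenters(5, 100.0)
// trainOn returns Unit, so it is called separately; DStream stands for the training stream.
model.trainOn(DStream)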
- Annotations
- @Since( "1.2.0" )
-
class
StreamingKMeansModel extends KMeansModel with Logging
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
$$ \begin{align} c_{t+1} &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t * a + m_t] \\ n_{t+1} &= n_t * a + m_t \end{align} $$
Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.
The decay factor a scales the contribution of the clusters as estimated thus far: a is applied as a discount weight to the existing estimate when new data arrives. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
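The update rule for a single cluster can be sketched as a pure function. The sketch assumes the new centroid is normalized by the updated weight n_t * a + m_t, so that it is a weighted average of the old centroid and the batch centroid; this matches the stated behavior for a=0 and a=1. The names `StreamingUpdateSketch` and `update` are hypothetical.

```scala
object StreamingUpdateSketch {
  // c: previous centroid c_t, n: weight n_t accumulated so far,
  // x: centroid x_t of the current batch, m: number of batch points m_t,
  // a: decay factor. Returns (c_{t+1}, n_{t+1}).
  def update(c: Vector[Double], n: Double, x: Vector[Double], m: Double,
             a: Double): (Vector[Double], Double) = {
    val newWeight = n * a + m
    if (newWeight == 0.0) (c, 0.0) // nothing to average; keep the old centroid
    else {
      val center = c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / newWeight }
      (center, newWeight)
    }
  }
}
```

With a = 1 the old centroid keeps its full weight, and with a = 0 the result is exactly the batch centroid.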
Decay can optionally be specified by a half life and associated time unit. The time unit can either be a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to the data from t is 0.5. The definition remains the same whether the time unit is given as batches or points.
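Under this definition, a decay factor a applied once per time unit must satisfy a^h = 0.5, which gives a = exp(ln(0.5) / h). The sketch below works through that conversion; the names `HalfLifeSketch`, `decayFactor`, and `discount` are hypothetical, and the per-time-unit compounding is an assumption drawn from the definition above.

```scala
object HalfLifeSketch {
  // Decay factor per time unit that halves the weight after `halfLife` units.
  def decayFactor(halfLife: Double): Double =
    math.exp(math.log(0.5) / halfLife)

  // Discount applied to data that is `elapsed` time units old.
  def discount(a: Double, elapsed: Double): Double =
    math.pow(a, elapsed)
}
```

For example, a half life of 5 batches gives a decay factor of roughly 0.87 per batch, and the discount after 5 batches is 0.5.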
- Annotations
- @Since( "1.2.0" )
Value Members
-
object
BisectingKMeansModel extends Loader[BisectingKMeansModel] with Serializable
- Annotations
- @Since( "2.0.0" )
-
object
DistanceMeasure extends Serializable
- Annotations
- @Since( "2.4.0" )
-
object
DistributedLDAModel extends Loader[DistributedLDAModel]
Distributed model fitted by LDA. This type of model is currently only produced by Expectation-Maximization (EM).
This model stores the inferred topics, the full training dataset, and the topic distribution for each training document.
- Annotations
- @Since( "1.5.0" )
-
object
GaussianMixtureModel extends Loader[GaussianMixtureModel] with Serializable
- Annotations
- @Since( "1.4.0" )
-
object
KMeans extends Serializable
Top-level methods for calling K-means clustering.
- Annotations
- @Since( "0.8.0" )
-
object
KMeansModel extends Loader[KMeansModel] with Serializable
- Annotations
- @Since( "1.4.0" )
-
object
LocalLDAModel extends Loader[LocalLDAModel] with Serializable
Local (non-distributed) model fitted by LDA.
This model stores the inferred topics only; it does not store info about the training dataset.
- Annotations
- @Since( "1.5.0" )
-
object
PowerIterationClustering extends Logging with Serializable
- Annotations
- @Since( "1.3.0" )
-
object
PowerIterationClusteringModel extends Loader[PowerIterationClusteringModel] with Serializable
- Annotations
- @Since( "1.4.0" )