public class LDA
extends Object
implements org.apache.spark.internal.Logging
Terminology:
- "word" = "term": an element of the vocabulary
- "token": instance of a term appearing in a document
- "topic": multinomial distribution over words representing some concept

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
| Constructor and Description | 
|---|
| LDA() Constructs an LDA instance with default parameters. | 
| Modifier and Type | Method and Description | 
|---|---|
| double | getAlpha() Alias for getDocConcentration | 
| Vector | getAsymmetricAlpha() Alias for getAsymmetricDocConcentration | 
| Vector | getAsymmetricDocConcentration() Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). | 
| double | getBeta() Alias for getTopicConcentration | 
| int | getCheckpointInterval() Period (in iterations) between checkpoints. | 
| double | getDocConcentration() Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). | 
| int | getK() Number of topics to infer, i.e., the number of soft cluster centers. | 
| int | getMaxIterations() Maximum number of iterations allowed. | 
| LDAOptimizer | getOptimizer() LDAOptimizer used to perform the actual calculation | 
| long | getSeed() Random seed for cluster initialization. | 
| double | getTopicConcentration() Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. | 
| LDAModel | run(JavaPairRDD<Long,Vector> documents) Java-friendly version of run() | 
| LDAModel | run(RDD<scala.Tuple2<Object,Vector>> documents) Learn an LDA model using the given dataset. | 
| LDA | setAlpha(double alpha) Alias for setDocConcentration() | 
| LDA | setAlpha(Vector alpha) Alias for setDocConcentration() | 
| LDA | setBeta(double beta) Alias for setTopicConcentration() | 
| LDA | setCheckpointInterval(int checkpointInterval) Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). | 
| LDA | setDocConcentration(double docConcentration) Replicates a Double docConcentration to create a symmetric prior. | 
| LDA | setDocConcentration(Vector docConcentration) Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). | 
| LDA | setK(int k) Set the number of topics to infer, i.e., the number of soft cluster centers. | 
| LDA | setMaxIterations(int maxIterations) Set the maximum number of iterations allowed. | 
| LDA | setOptimizer(LDAOptimizer optimizer) LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) | 
| LDA | setOptimizer(String optimizerName) Set the LDAOptimizer used to perform the actual calculation by algorithm name. | 
| LDA | setSeed(long seed) Set the random seed for cluster initialization. | 
| LDA | setTopicConcentration(double topicConcentration) Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. | 
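Methods inherited from class Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.internal.Logging: $init$, initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, initLock, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log__$eq, org$apache$spark$internal$Logging$$log_, uninitialize
public int getK()
Number of topics to infer, i.e., the number of soft cluster centers.
public LDA setK(int k)
Set the number of topics to infer, i.e., the number of soft cluster centers.
k - (undocumented)
public Vector getAsymmetricDocConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").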
This is the parameter to a Dirichlet distribution.
public double getDocConcentration()
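Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.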
public LDA setDocConcentration(Vector docConcentration)
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to a singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be of length k. (default = Vector(-1) = automatic)

Optimizer-specific parameter settings:
- EM
  - Currently only supports symmetric distributions, so all values in the vector should be the same.
  - Values should be greater than 1.0
  - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Values should be greater than or equal to 0
  - default = uniformly (1.0 / k), following the implementation from here.
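The defaulting and replication rules for docConcentration described above can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the class `DocConcentrationSketch`, the method `resolve`, and the `em` flag are hypothetical names introduced here, and a plain `double[]` stands in for `Vector`.

```java
import java.util.Arrays;

// Illustrative helper only; this class is hypothetical and not part of Spark.
final class DocConcentrationSketch {

    /**
     * Expands a user-supplied docConcentration into a length-k vector.
     *
     * @param docConcentration a singleton vector, or a vector of length k
     * @param k                number of topics
     * @param em               true for the EM optimizer, false for Online
     */
    static double[] resolve(double[] docConcentration, int k, boolean em) {
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            if (t == -1.0) {
                // Automatic setting: (50 / k) + 1 for EM, 1.0 / k for Online.
                t = em ? (50.0 / k) + 1.0 : 1.0 / k;
            }
            double[] alpha = new double[k];
            Arrays.fill(alpha, t); // replicate t to a vector of length k
            return alpha;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException(
                "docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 4, true)));
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 4, false)));
        System.out.println(Arrays.toString(resolve(new double[] {0.5}, 3, false)));
    }
}
```

With k = 10, the automatic EM value works out to 50/10 + 1 = 6.0 per topic, while the automatic Online value is 1/10 = 0.1 per topic.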
docConcentration - (undocumented)
public LDA setDocConcentration(double docConcentration)
Replicates a Double docConcentration to create a symmetric prior.
docConcentration - (undocumented)
public Vector getAsymmetricAlpha()
Alias for getAsymmetricDocConcentration
public double getAlpha()
Alias for getDocConcentration
public LDA setAlpha(Vector alpha)
Alias for setDocConcentration()
alpha - (undocumented)
public LDA setAlpha(double alpha)
Alias for setDocConcentration()
alpha - (undocumented)
public double getTopicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
public LDA setTopicConcentration(double topicConcentration)
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)

Optimizer-specific parameter settings:
- EM
  - Value should be greater than 1.0
  - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
- Online
  - Value should be greater than or equal to 0
  - default = (1.0 / k), following the implementation from here.

topicConcentration - (undocumented)
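As a rough illustration of the topicConcentration rules above, the sketch below resolves the automatic value and checks the stated ranges. The class `TopicConcentrationSketch` and its `resolve` method are hypothetical, and throwing an exception for out-of-range values is an assumption of this sketch rather than documented Spark behavior.

```java
// Illustrative helper only; this class is hypothetical and not part of Spark.
final class TopicConcentrationSketch {

    /**
     * Resolves the scalar topicConcentration for a symmetric Dirichlet prior.
     *
     * @param topicConcentration user-supplied value, or -1 for automatic
     * @param k                  number of topics
     * @param em                 true for the EM optimizer, false for Online
     */
    static double resolve(double topicConcentration, int k, boolean em) {
        if (topicConcentration == -1.0) {
            // Automatic setting: 0.1 + 1 for EM, 1.0 / k for Online.
            return em ? 0.1 + 1.0 : 1.0 / k;
        }
        if (em && topicConcentration <= 1.0) {
            // Assumption: reject values outside the documented EM range.
            throw new IllegalArgumentException("EM expects a value greater than 1.0");
        }
        if (!em && topicConcentration < 0.0) {
            // Assumption: reject values outside the documented Online range.
            throw new IllegalArgumentException("Online expects a value >= 0");
        }
        return topicConcentration;
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0, 20, true));  // automatic, EM
        System.out.println(resolve(-1.0, 20, false)); // automatic, Online
        System.out.println(resolve(1.5, 20, true));   // explicit value kept as-is
    }
}
```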
public double getBeta()
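Alias for getTopicConcentration
public LDA setBeta(double beta)
Alias for setTopicConcentration()
beta - (undocumented)
public int getMaxIterations()
Maximum number of iterations allowed.
public LDA setMaxIterations(int maxIterations)
Set the maximum number of iterations allowed.
maxIterations - (undocumented)
public long getSeed()
Random seed for cluster initialization.
public LDA setSeed(long seed)
Set the random seed for cluster initialization.
seed - (undocumented)
public int getCheckpointInterval()
Period (in iterations) between checkpoints.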
public LDA setCheckpointInterval(int checkpointInterval)
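Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). If the checkpoint directory is not set in the SparkContext, this setting is ignored. (default = 10)
checkpointInterval - (undocumented)
See also: SparkContext.setCheckpointDir(java.lang.String)
public LDAOptimizer getOptimizer()
LDAOptimizer used to perform the actual calculation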
public LDA setOptimizer(LDAOptimizer optimizer)
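LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
optimizer - (undocumented)
public LDA setOptimizer(String optimizerName)
Set the LDAOptimizer used to perform the actual calculation by algorithm name.
optimizerName - (undocumented)
public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
Learn an LDA model using the given dataset.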
documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
public LDAModel run(JavaPairRDD<Long,Vector> documents)
Java-friendly version of run()
documents - (undocumented)
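To make the expected input shape for run() concrete, the following sketch builds a term-count ("bag of words") array over a fixed vocabulary and checks the document ID constraints. It uses only the Java standard library: `BagOfWordsSketch` and its methods are hypothetical names, a `double[]` stands in for `Vector`, and silently dropping out-of-vocabulary tokens is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative helper only; this class is hypothetical and not part of Spark.
final class BagOfWordsSketch {

    /** Assigns each distinct term a position 0..n-1 in the count vector. */
    static Map<String, Integer> indexVocabulary(List<String> vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (String term : vocabulary) {
            index.putIfAbsent(term, index.size());
        }
        return index;
    }

    /** Counts tokens per term; the result length equals the vocabulary size. */
    static double[] termCounts(List<String> tokens, Map<String, Integer> vocabIndex) {
        double[] counts = new double[vocabIndex.size()];
        for (String token : tokens) {
            Integer idx = vocabIndex.get(token);
            if (idx != null) {
                counts[idx] += 1.0; // one more token of this term
            }
            // Assumption: tokens outside the vocabulary are dropped.
        }
        return counts;
    }

    /** Checks that document IDs are unique and greater than or equal to 0. */
    static void requireValidIds(long[] ids) {
        Set<Long> seen = new HashSet<>();
        for (long id : ids) {
            if (id < 0) {
                throw new IllegalArgumentException("negative document ID: " + id);
            }
            if (!seen.add(id)) {
                throw new IllegalArgumentException("duplicate document ID: " + id);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = indexVocabulary(Arrays.asList("spark", "topic", "model"));
        List<String> tokens = Arrays.asList("topic", "model", "topic", "lda");
        requireValidIds(new long[] {0L, 1L, 2L});
        System.out.println(Arrays.toString(termCounts(tokens, vocab)));
    }
}
```

In this sketch, "topic" is one term that appears as two tokens, which matches the term/token distinction in the terminology: the vector length is the number of terms, and the entries sum to the number of in-vocabulary tokens.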