org.apache.spark.mllib.clustering.LDA

All Implemented Interfaces: org.apache.spark.internal.Logging

public class LDA extends Object implements org.apache.spark.internal.Logging

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

Terminology:
- "word" = "term": an element of the vocabulary
- "token": an instance of a term appearing in a document
- "topic": a multinomial distribution over words representing some concept
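
To make the terminology concrete, the following sketch turns the tokens of one document into a term count vector over a fixed vocabulary. The names `TermCounts` and `countTerms` and the sample vocabulary are hypothetical and not part of Spark. Skipping tokens that are missing from the vocabulary is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermCounts {
    /** Counts how often each vocabulary term occurs among the tokens of one document. */
    static double[] countTerms(List<String> vocabulary, List<String> tokens) {
        // Map each term to its position in the vocabulary.
        Map<String, Integer> termIndex = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            termIndex.put(vocabulary.get(i), i);
        }
        // One slot per term; each token increments the slot of its term.
        double[] counts = new double[vocabulary.size()];
        for (String token : tokens) {
            Integer index = termIndex.get(token);
            if (index != null) { // assumption: out-of-vocabulary tokens are skipped
                counts[index] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("spark", "topic", "model");
        List<String> tokens = List.of("topic", "model", "topic", "spark", "topic");
        // Five tokens, three distinct terms.
        System.out.println(Arrays.toString(countTerms(vocabulary, tokens))); // prints [1.0, 3.0, 1.0]
    }
}
```

The length of the resulting array equals the vocabulary size, which is the shape of the "bag of words" vectors that run() expects.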

References:
- Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

See Also:

Latent Dirichlet allocation (Wikipedia)

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| LDA() | Constructs an LDA instance with default parameters. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double | getAlpha() | Alias for getDocConcentration() |
| Vector | getAsymmetricAlpha() | Alias for getAsymmetricDocConcentration() |
| Vector | getAsymmetricDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| double | getBeta() | Alias for getTopicConcentration() |
| int | getCheckpointInterval() | Period (in iterations) between checkpoints. |
| double | getDocConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| int | getK() | Number of topics to infer, i.e., the number of soft cluster centers. |
| int | getMaxIterations() | Maximum number of iterations allowed. |
| LDAOptimizer | getOptimizer() | LDAOptimizer used to perform the actual calculation |
| long | getSeed() | Random seed for cluster initialization. |
| double | getTopicConcentration() | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
| LDAModel | run(JavaPairRDD<Long,Vector> documents) | Java-friendly version of run() |
| LDAModel | run(RDD<scala.Tuple2<Object,Vector>> documents) | Learn an LDA model using the given dataset. |
| LDA | setAlpha(double alpha) | Alias for setDocConcentration() |
| LDA | setAlpha(Vector alpha) | Alias for setDocConcentration() |
| LDA | setBeta(double beta) | Alias for setTopicConcentration() |
| LDA | setCheckpointInterval(int checkpointInterval) | Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). |
| LDA | setDocConcentration(double docConcentration) | Replicates a Double docConcentration to create a symmetric prior. |
| LDA | setDocConcentration(Vector docConcentration) | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| LDA | setK(int k) | Set the number of topics to infer, i.e., the number of soft cluster centers. |
| LDA | setMaxIterations(int maxIterations) | Set the maximum number of iterations allowed. |
| LDA | setOptimizer(String optimizerName) | Set the LDAOptimizer used to perform the actual calculation by algorithm name. |
| LDA | setOptimizer(LDAOptimizer optimizer) | LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer) |
| LDA | setSeed(long seed) | Set the random seed for cluster initialization. |
| LDA | setTopicConcentration(double topicConcentration) | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Constructor Details
- LDA
  
  public LDA()
  
  Constructs an LDA instance with default parameters.
Method Details
- getK
  
  public int getK()
  
  Number of topics to infer, i.e., the number of soft cluster centers.
  
  Returns:
  
  The number of topics (k)
- setK
  
  public LDA setK(int k)
  
  Set the number of topics to infer, i.e., the number of soft cluster centers. (default = 10)
  
  Parameters:
  
  k - Number of topics to infer
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- getAsymmetricDocConcentration
  
  public Vector getAsymmetricDocConcentration()
  
  Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
  This is the parameter to a Dirichlet distribution.
  
  Returns:
  
  The docConcentration vector
- getDocConcentration
  
  public double getDocConcentration()
  
  Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
  This method assumes the Dirichlet distribution is symmetric and can be described by a single Double parameter. It should fail if docConcentration is asymmetric.
  
  Returns:
  
  The docConcentration value of the symmetric prior
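
The symmetric-prior requirement described for getDocConcentration can be illustrated with a minimal sketch. `SymmetricPriorSketch` and `symmetricValue` are hypothetical names, the vector is modeled as a plain `double[]`, and the exception types are assumptions. This is not Spark's implementation.

```java
final class SymmetricPriorSketch {
    /** Returns the value shared by all entries, or fails if the entries differ. */
    static double symmetricValue(double[] docConcentration) {
        if (docConcentration.length == 0) {
            throw new IllegalArgumentException("docConcentration must not be empty");
        }
        double first = docConcentration[0];
        for (double value : docConcentration) {
            if (value != first) {
                // An asymmetric prior cannot be summarized by one number.
                throw new IllegalStateException("docConcentration is asymmetric");
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(symmetricValue(new double[] {0.5, 0.5, 0.5})); // prints 0.5
        try {
            symmetricValue(new double[] {0.5, 0.7, 0.5});
        } catch (IllegalStateException e) {
            System.out.println("asymmetric prior rejected");
        }
    }
}
```

For an asymmetric prior, getAsymmetricDocConcentration() is the accessor that returns the full vector.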
- setDocConcentration
  
  public LDA setDocConcentration(Vector docConcentration)
  
  Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
  This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
  If set to a singleton vector Vector(-1), then docConcentration is set automatically. If set to singleton vector Vector(t) where t != -1, then t is replicated to a vector of length k during LDAOptimizer.initialize(). Otherwise, the docConcentration vector must be length k. (default = Vector(-1) = automatic)
  Optimizer-specific parameter settings:
  - EM
    - Currently only supports symmetric distributions, so all values in the vector should be the same.
    - Values should be greater than 1.0
    - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
  - Online
    - Values should be greater than or equal to 0
    - default = uniformly (1.0 / k), following the implementation from here.
  
  Parameters:
  
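  docConcentration - Vector(-1) for automatic selection, a singleton Vector(t) to be replicated to length k, or a vector of length k
  
  Returns:
  
  This LDA instance, so that setter calls can be chained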
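
The rules for expanding docConcentration can be sketched as follows. This is an illustration of the documented behavior, not Spark's code: `DocConcentrationSketch`, `resolve`, and `automaticDefault` are hypothetical names, the vector is modeled as a `double[]`, and the optimizer is identified by the strings "em" and "online".

```java
import java.util.Arrays;

final class DocConcentrationSketch {
    /** Expands a user-supplied docConcentration into a vector of length k. */
    static double[] resolve(double[] docConcentration, int k, String optimizer) {
        if (docConcentration.length == 1) {
            double t = docConcentration[0];
            if (t == -1.0) {
                // Vector(-1): choose the documented optimizer-specific default.
                t = automaticDefault(k, optimizer);
            }
            // Vector(t): replicate t to a vector of length k.
            double[] replicated = new double[k];
            Arrays.fill(replicated, t);
            return replicated;
        }
        if (docConcentration.length != k) {
            throw new IllegalArgumentException("docConcentration must have length 1 or k = " + k);
        }
        return docConcentration.clone();
    }

    /** Documented defaults: (50 / k) + 1 for EM, 1.0 / k for Online. */
    static double automaticDefault(int k, String optimizer) {
        switch (optimizer) {
            case "em":
                return (50.0 / k) + 1.0;
            case "online":
                return 1.0 / k;
            default:
                throw new IllegalArgumentException("unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve(new double[] {-1.0}, 10, "em")));
        System.out.println(Arrays.toString(resolve(new double[] {0.5}, 3, "online")));
    }
}
```

With k = 10 and the EM optimizer, the automatic value is (50 / 10) + 1 = 6.0 for every topic.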
- setDocConcentration
  
  public LDA setDocConcentration(double docConcentration)
  
  Replicates a Double docConcentration to create a symmetric prior.
  
  Parameters:
  
  docConcentration - Value used for every entry of the symmetric prior
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- getAsymmetricAlpha
  
  public Vector getAsymmetricAlpha()
  
  Alias for getAsymmetricDocConcentration()
  
  Returns:
  
  The docConcentration vector
- getAlpha
  
  public double getAlpha()
  
  Alias for getDocConcentration()
  
  Returns:
  
  The docConcentration value of the symmetric prior
- setAlpha
  
  public LDA setAlpha(Vector alpha)
  
  Alias for setDocConcentration()
  
  Parameters:
  
  alpha - The docConcentration vector
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- setAlpha
  
  public LDA setAlpha(double alpha)
  
  Alias for setDocConcentration()
  
  Parameters:
  
  alpha - Value used for every entry of the symmetric docConcentration prior
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- getTopicConcentration
  
  public double getTopicConcentration()
  
  Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
  This is the parameter to a symmetric Dirichlet distribution.
  
  Returns:
  
  The topicConcentration value
  
  Note:
  
  The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
- setTopicConcentration
  
  public LDA setTopicConcentration(double topicConcentration)
  
  Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
  This is the parameter to a symmetric Dirichlet distribution.
  If set to -1, then topicConcentration is set automatically. (default = -1 = automatic)
  Optimizer-specific parameter settings:
  - EM
    - Value should be greater than 1.0
    - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
  - Online
    - Value should be greater than or equal to 0
    - default = (1.0 / k), following the implementation from here.
  
  Parameters:
  
  topicConcentration - Concentration value for the symmetric prior, or -1 to have it set automatically
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
  
  Note:
  
  The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
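
The automatic choice of topicConcentration described for setTopicConcentration can be sketched in a few lines. `TopicConcentrationSketch` and `resolve` are hypothetical names, and the optimizer is identified by the strings "em" and "online". This illustrates the documented defaults only and is not Spark's code.

```java
final class TopicConcentrationSketch {
    /** Returns the value to use, replacing -1 with the documented default. */
    static double resolve(double topicConcentration, int k, String optimizer) {
        if (topicConcentration != -1.0) {
            return topicConcentration; // an explicit value is used unchanged
        }
        switch (optimizer) {
            case "em":
                return 0.1 + 1.0; // small smoothing plus the +1 adjustment for EM
            case "online":
                return 1.0 / k;
            default:
                throw new IllegalArgumentException("unknown optimizer: " + optimizer);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0, 20, "em"));
        System.out.println(resolve(-1.0, 20, "online"));
        System.out.println(resolve(0.3, 20, "online"));
    }
}
```

Unlike docConcentration, this prior is always symmetric, so a single double is enough to describe it.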
- getBeta
  
  public double getBeta()
  
  Alias for getTopicConcentration()
  
  Returns:
  
  The topicConcentration value
- setBeta
  
  public LDA setBeta(double beta)
  
  Alias for setTopicConcentration()
  
  Parameters:
  
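  beta - Concentration value for the symmetric prior on topics' distributions over terms
  
  Returns:
  
  This LDA instance, so that setter calls can be chained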
- getMaxIterations
  
  public int getMaxIterations()
  
  Maximum number of iterations allowed.
  
  Returns:
  
  The maximum number of iterations
- setMaxIterations
  
  public LDA setMaxIterations(int maxIterations)
  
  Set the maximum number of iterations allowed. (default = 20)
  
  Parameters:
  
  maxIterations - Maximum number of iterations allowed
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- getSeed
  
  public long getSeed()
  
  Random seed for cluster initialization.
  
  Returns:
  
  The random seed
- setSeed
  
  public LDA setSeed(long seed)
  
  Set the random seed for cluster initialization.
  
  Parameters:
  
  seed - Random seed for cluster initialization
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- getCheckpointInterval
  
  public int getCheckpointInterval()
  
  Period (in iterations) between checkpoints.
  
  Returns:
  
  The number of iterations between checkpoints
- setCheckpointInterval
  
  public LDA setCheckpointInterval(int checkpointInterval)
  
  Sets the checkpoint interval (greater than or equal to 1), or disables checkpointing (-1). For example, 10 means that the cache will be checkpointed every 10 iterations. Checkpointing helps with recovery when nodes fail. It also helps eliminate temporary shuffle files on disk, which can be important when LDA is run for many iterations. If the checkpoint directory is not set in SparkContext, this setting is ignored. (default = 10)
  
  Parameters:
  
  checkpointInterval - Number of iterations between checkpoints (greater than or equal to 1), or -1 to disable checkpointing
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
  
  See Also:
  
  SparkContext.setCheckpointDir(java.lang.String)
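
The effect of the checkpoint interval can be pictured with a small schedule sketch. `CheckpointScheduleSketch` and `shouldCheckpoint` are hypothetical names. Counting iterations from 1 and checkpointing on exact multiples of the interval are assumptions made for illustration. The optimizer's actual bookkeeping may differ.

```java
final class CheckpointScheduleSketch {
    /** Decides whether a checkpoint would follow the given 1-based iteration. */
    static boolean shouldCheckpoint(int iteration, int checkpointInterval, boolean checkpointDirSet) {
        if (checkpointInterval != -1 && checkpointInterval < 1) {
            throw new IllegalArgumentException("checkpointInterval must be -1 or >= 1");
        }
        if (!checkpointDirSet || checkpointInterval == -1) {
            // No checkpoint directory, or checkpointing disabled: nothing happens.
            return false;
        }
        // Assumption: checkpoints fall on multiples of the interval.
        return iteration % checkpointInterval == 0;
    }

    public static void main(String[] args) {
        for (int iteration = 1; iteration <= 30; iteration++) {
            if (shouldCheckpoint(iteration, 10, true)) {
                System.out.println("checkpoint after iteration " + iteration);
            }
        }
    }
}
```

Under these assumptions, 30 iterations with the default interval of 10 produce checkpoints after iterations 10, 20, and 30, and none at all when no checkpoint directory is set.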
- getOptimizer
  
  public LDAOptimizer getOptimizer()
  
  LDAOptimizer used to perform the actual calculation
  
  Returns:
  
  The LDAOptimizer currently set
- setOptimizer
  
  public LDA setOptimizer(LDAOptimizer optimizer)
  
  LDAOptimizer used to perform the actual calculation (default = EMLDAOptimizer)
  
  Parameters:
  
  optimizer - LDAOptimizer to use
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- setOptimizer
  
  public LDA setOptimizer(String optimizerName)
  
  Set the LDAOptimizer used to perform the actual calculation by algorithm name. Currently "em" and "online" are supported.
  
  Parameters:
  
  optimizerName - Name of the algorithm, either "em" or "online"
  
  Returns:
  
  This LDA instance, so that setter calls can be chained
- run
  
  public LDAModel run(RDD<scala.Tuple2<Object,Vector>> documents)
  
  Learn an LDA model using the given dataset.
  
  Parameters:
  
  documents - RDD of documents, which are term (word) count vectors paired with IDs. The term count vectors are "bags of words" with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and greater than or equal to 0.
  
  Returns:
  
  Inferred LDA model
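
The input constraints listed for documents (unique IDs, IDs greater than or equal to 0, and one fixed vocabulary size) can be checked before training. The sketch below works on plain arrays rather than an RDD. `DocumentCheckSketch` and `validate` are hypothetical names, and the error handling is an assumption of this example, not behavior of run().

```java
import java.util.HashSet;
import java.util.Set;

final class DocumentCheckSketch {
    /**
     * Checks the documented constraints and returns the vocabulary size.
     * ids[i] is the ID of the document whose term counts are termCounts[i].
     */
    static int validate(long[] ids, double[][] termCounts) {
        if (ids.length == 0 || ids.length != termCounts.length) {
            throw new IllegalArgumentException("need one ID per document and at least one document");
        }
        Set<Long> seen = new HashSet<>();
        int vocabSize = termCounts[0].length;
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] < 0) {
                throw new IllegalArgumentException("negative document ID: " + ids[i]);
            }
            if (!seen.add(ids[i])) { // add returns false for a repeated ID
                throw new IllegalArgumentException("duplicate document ID: " + ids[i]);
            }
            if (termCounts[i].length != vocabSize) {
                throw new IllegalArgumentException("document " + ids[i] + " has a different vocabulary size");
            }
        }
        return vocabSize;
    }

    public static void main(String[] args) {
        long[] ids = {0L, 1L, 2L};
        double[][] termCounts = {{1, 0, 2}, {0, 3, 1}, {2, 2, 0}};
        System.out.println("vocabulary size: " + validate(ids, termCounts)); // prints vocabulary size: 3
    }
}
```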
- run
  
  public LDAModel run(JavaPairRDD<Long,Vector> documents)
  
  Java-friendly version of run()
  
  Parameters:
  
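  documents - JavaPairRDD of document IDs paired with term (word) count vectors, subject to the same requirements as for run(RDD)
  
  Returns:
  
  Inferred LDA model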
