Object

org.apache.spark.mllib.clustering.LDAModel

org.apache.spark.mllib.clustering.DistributedLDAModel

All Implemented Interfaces: Saveable

public class DistributedLDAModel extends LDAModel

Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| scala.Tuple2<int[],double[]>[] | describeTopics(int maxTermsPerTopic) | Return the topics described by weighted terms. |
| Vector | docConcentration() | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| JavaRDD<scala.Tuple3<Long,int[],int[]>> | javaTopicAssignments() | |
| JavaPairRDD<Long,Vector> | javaTopicDistributions() | Java-friendly version of topicDistributions() |
| JavaRDD<scala.Tuple3<Long,int[],double[]>> | javaTopTopicsPerDocument(int k) | Java-friendly version of topTopicsPerDocument(int) |
| int | k() | Number of topics |
| static DistributedLDAModel | load(SparkContext sc, String path) | |
| double | logLikelihood() | |
| double | logPrior() | |
| void | save(SparkContext sc, String path) | Save this model to the given path. |
| LocalLDAModel | toLocal() | Convert model to a local model. |
| scala.Tuple2<long[],double[]>[] | topDocumentsPerTopic(int maxDocumentsPerTopic) | Return the top documents for each topic |
| RDD<scala.Tuple3<Object,int[],int[]>> | topicAssignments() | |
| double | topicConcentration() | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
| RDD<scala.Tuple2<Object,Vector>> | topicDistributions() | For each document in the training set, return the distribution over topics for that document ("theta_doc"). |
| Matrix | topicsMatrix() | Inferred topics, where each topic is represented by a distribution over terms. |
| RDD<scala.Tuple3<Object,int[],double[]>> | topTopicsPerDocument(int k) | For each document, return the top k weighted topics for that document and their weights. |
| int | vocabSize() | Vocabulary size (number of terms in the vocabulary) |

Methods inherited from class org.apache.spark.mllib.clustering.LDAModel
describeTopics

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- load
  
  public static DistributedLDAModel load(SparkContext sc, String path)
- k
  
  public int k()
  
  Description copied from class: LDAModel
  
  Number of topics
  
  Specified by:
  
  k in class LDAModel
- vocabSize
  
  public int vocabSize()
  
  Description copied from class: LDAModel
  
  Vocabulary size (number of terms in the vocabulary)
  
  Specified by:
  
  vocabSize in class LDAModel
- docConcentration
  
  public Vector docConcentration()
  
  Description copied from class: LDAModel
  
  Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
  This is the parameter to a Dirichlet distribution.
  
  Specified by:
  
  docConcentration in class LDAModel
  
  Returns:
  
  (undocumented)
- topicConcentration
  
  public double topicConcentration()
  
  Description copied from class: LDAModel
  
  Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
  This is the parameter to a symmetric Dirichlet distribution.
  
  Specified by:
  
  topicConcentration in class LDAModel
  
  Returns:
  
  (undocumented)
- toLocal
  
  public LocalLDAModel toLocal()
  
  Convert model to a local model. The local model stores the inferred topics but not the topic distributions for training documents.
  
  Returns:
  
  (undocumented)
- topicsMatrix
  
  public Matrix topicsMatrix()
  
  Description copied from class: LDAModel
  
  Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
  
  Specified by:
  
  topicsMatrix in class LDAModel
  
  Returns:
  
  (undocumented)
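
The vocabSize x k layout described for topicsMatrix can be illustrated with a plain array. The sketch below is not Spark code: TopicsMatrixSketch, weight, and topicColumn are hypothetical names, and the values are made up. It assumes the matrix values are held in column-major order, which is how Matrix.toArray() is documented to return them; verify this against your Spark version.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: reads topics out of a
// column-major array standing in for the vocabSize x k topics matrix.
final class TopicsMatrixSketch {

    // Entry (term, topic) of a column-major vocabSize x k matrix.
    static double weight(double[] values, int vocabSize, int term, int topic) {
        return values[topic * vocabSize + term];
    }

    // Copies column `topic`, i.e. the term weights of one topic.
    static double[] topicColumn(double[] values, int vocabSize, int topic) {
        int start = topic * vocabSize;
        return Arrays.copyOfRange(values, start, start + vocabSize);
    }

    public static void main(String[] args) {
        int vocabSize = 3; // made-up size for illustration
        // Two topics stored column after column: topic 0, then topic 1.
        double[] values = {0.7, 0.2, 0.1, 0.1, 0.3, 0.6};
        System.out.println(Arrays.toString(topicColumn(values, vocabSize, 1)));
        System.out.println(weight(values, vocabSize, 2, 1));
    }
}
```

Because no ordering of topics is guaranteed, column indices should only be compared within a single model, not across separately trained models.
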
- describeTopics
  
  public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
  
  Description copied from class: LDAModel
  
  Return the topics described by weighted terms.
  
  Specified by:
  
  describeTopics in class LDAModel
  
  Parameters:
  
  maxTermsPerTopic - Maximum number of terms to collect for each topic.
  
  Returns:
  
  Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
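
To make the "pair of matching arrays" returned by describeTopics concrete, here is a minimal sketch of how one topic's pair might be turned into readable text. It does not call Spark: DescribeTopicsSketch, formatTopic, and the vocab array are hypothetical, and the two arrays stand in for the two halves of one scala.Tuple2<int[],double[]> element.

```java
// Hypothetical helper, not part of Spark: formats one topic from the
// (term indices, term weights) pair of matching arrays.
final class DescribeTopicsSketch {

    static String formatTopic(int[] termIndices, double[] termWeights, String[] vocab) {
        if (termIndices.length != termWeights.length) {
            throw new IllegalArgumentException("arrays must have matching lengths");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < termIndices.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            // Position i in both arrays refers to the same term.
            sb.append(vocab[termIndices[i]]).append('=').append(termWeights[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] vocab = {"spark", "topic", "model"}; // made-up vocabulary
        int[] termIndices = {2, 0};                   // already sorted by decreasing weight
        double[] termWeights = {0.5, 0.25};
        System.out.println(formatTopic(termIndices, termWeights, vocab));
    }
}
```

The term indices refer to positions in whatever vocabulary was used to build the training vectors, so the caller has to keep that mapping; the model itself only stores indices.
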
- topDocumentsPerTopic
  
  public scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic)
  
  Return the top documents for each topic
  
  Parameters:
  
  maxDocumentsPerTopic - Maximum number of documents to collect for each topic.
  
  Returns:
  
  Array over topics. Each element is represented as a pair of matching arrays: (IDs for the documents, weights of the topic in these documents). For each topic, documents are sorted in order of decreasing topic weights.
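
The ordering described for topDocumentsPerTopic can be sketched for a single topic as a sort by decreasing weight followed by truncation. This is an illustration only, not how Spark computes the result on a distributed dataset: TopDocumentsSketch and topDocIds are hypothetical names, and the document IDs and weights are made up.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: picks the documents in which
// one topic has the largest weight, in order of decreasing weight.
final class TopDocumentsSketch {

    static long[] topDocIds(long[] docIds, double[] topicWeights, int maxDocuments) {
        // Sort positions rather than values so IDs and weights stay paired.
        Integer[] order = new Integer[docIds.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(topicWeights[b], topicWeights[a]));

        int n = Math.max(0, Math.min(maxDocuments, order.length));
        long[] top = new long[n];
        for (int i = 0; i < n; i++) {
            top[i] = docIds[order[i]];
        }
        return top;
    }

    public static void main(String[] args) {
        long[] docIds = {10L, 11L, 12L, 13L};        // made-up document IDs
        double[] weights = {0.1, 0.6, 0.05, 0.25};   // weight of one topic in each document
        System.out.println(Arrays.toString(topDocIds(docIds, weights, 2)));
    }
}
```
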
- topicAssignments
  
  public RDD<scala.Tuple3<Object,int[],int[]>> topicAssignments()
- javaTopicAssignments
  
  public JavaRDD<scala.Tuple3<Long,int[],int[]>> javaTopicAssignments()
- logLikelihood
  
  public double logLikelihood()
- logPrior
  
  public double logPrior()
- topicDistributions
  
  public RDD<scala.Tuple2<Object,Vector>> topicDistributions()
  
  For each document in the training set, return the distribution over topics for that document ("theta_doc").
  
  Returns:
  
  RDD of (document ID, topic distribution) pairs
- javaTopicDistributions
  
  public JavaPairRDD<Long,Vector> javaTopicDistributions()
  
  Java-friendly version of topicDistributions()
  
  Returns:
  
  (undocumented)
- topTopicsPerDocument
  
  public RDD<scala.Tuple3<Object,int[],double[]>> topTopicsPerDocument(int k)
  
  For each document, return the top k weighted topics for that document and their weights.
  
  Parameters:
  
  k - Number of top-weighted topics to return for each document.
  
  Returns:
  
  RDD of (doc ID, topic indices, topic weights)
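
For a single document, the (topic indices, topic weights) part of the result can be pictured as a top-k selection over that document's topic distribution. The sketch below assumes the distribution is available as a plain double[]; TopTopicsSketch, topTopicIndices, and weightsAt are hypothetical names, and the selection strategy is chosen for readability rather than taken from Spark.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: selects the k largest entries
// of one document's topic distribution.
final class TopTopicsSketch {

    // Indices of the k largest weights, in order of decreasing weight.
    static int[] topTopicIndices(double[] topicDistribution, int k) {
        int n = Math.max(0, Math.min(k, topicDistribution.length));
        int[] top = new int[n];
        boolean[] used = new boolean[topicDistribution.length];
        for (int rank = 0; rank < n; rank++) {
            int best = -1;
            for (int t = 0; t < topicDistribution.length; t++) {
                if (!used[t] && (best < 0 || topicDistribution[t] > topicDistribution[best])) {
                    best = t;
                }
            }
            used[best] = true;
            top[rank] = best;
        }
        return top;
    }

    // Weights matching the selected indices, position by position.
    static double[] weightsAt(double[] topicDistribution, int[] indices) {
        double[] weights = new double[indices.length];
        for (int i = 0; i < indices.length; i++) {
            weights[i] = topicDistribution[indices[i]];
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] theta = {0.1, 0.6, 0.05, 0.25}; // made-up distribution over 4 topics
        int[] indices = topTopicIndices(theta, 2);
        System.out.println(Arrays.toString(indices));
        System.out.println(Arrays.toString(weightsAt(theta, indices)));
    }
}
```

Note that the parameter k here is the number of topics to return per document, which is distinct from k(), the total number of topics in the model.
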
- javaTopTopicsPerDocument
  
  public JavaRDD<scala.Tuple3<Long,int[],double[]>> javaTopTopicsPerDocument(int k)
  
  Java-friendly version of topTopicsPerDocument(int)
  
  Parameters:
  
  k - Number of top-weighted topics to return for each document.
  
  Returns:
  
  (undocumented)
- save
  
  public void save(SparkContext sc, String path)
  
  Description copied from interface: Saveable
  
  Save this model to the given path.
  This saves:
  - human-readable (JSON) model metadata to path/metadata/
  - Parquet formatted data to path/data/
  The model may be loaded using Loader.load.
  
  Parameters:
  
  sc - Spark context used to save model data.
  
  path - Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.
