Class DistributedLDAModel
Object
org.apache.spark.mllib.clustering.LDAModel
org.apache.spark.mllib.clustering.DistributedLDAModel
- All Implemented Interfaces:
Saveable
Distributed LDA model.
This model stores the inferred topics, the full training dataset, and the topic distributions.
-
Method Summary
Modifier and Type, Method, Description
scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic) - Return the topics described by weighted terms.
docConcentration() - Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
javaTopicDistributions() - Java-friendly version of topicDistributions()
javaTopTopicsPerDocument(int k) - Java-friendly version of topTopicsPerDocument(int)
int k() - Number of topics
static DistributedLDAModel load(SparkContext sc, String path)
double logLikelihood()
double logPrior()
void save(SparkContext sc, String path) - Save this model to the given path.
toLocal() - Convert model to a local model.
scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic) - Return the top documents for each topic
double topicConcentration() - Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
topicDistributions() - For each document in the training set, return the distribution over topics for that document ("theta_doc").
topicsMatrix() - Inferred topics, where each topic is represented by a distribution over terms.
topTopicsPerDocument(int k) - For each document, return the top k weighted topics for that document and their weights.
int vocabSize() - Vocabulary size (number of terms in the vocabulary)
Methods inherited from class org.apache.spark.mllib.clustering.LDAModel
describeTopics
-
Method Details
-
load
-
k
public int k()
Description copied from class: LDAModel
Number of topics
-
vocabSize
public int vocabSize()
Description copied from class: LDAModel
Vocabulary size (number of terms in the vocabulary)
-
docConcentration
Description copied from class: LDAModel
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution.
- Specified by:
docConcentration in class LDAModel
- Returns:
- (undocumented)
-
topicConcentration
public double topicConcentration()
Description copied from class: LDAModel
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
- Specified by:
topicConcentration in class LDAModel
- Returns:
- (undocumented)
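For orientation, the two concentration parameters can be read as the Dirichlet priors of the standard LDA generative model. The sketch below uses the usual textbook notation, which is an assumption on top of this page: $\theta_d$ is document $d$'s distribution over the k topics, $\phi_t$ (a symbol not used elsewhere here) is topic $t$'s distribution over the vocabSize terms, and $\beta$ is repeated once per term because the topic prior is symmetric.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & \alpha &\in \mathbb{R}_{>0}^{k} \\
\phi_t   &\sim \mathrm{Dirichlet}(\beta, \ldots, \beta), & \beta &> 0
\end{align*}
```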
-
toLocal
Convert model to a local model. The local model stores the inferred topics but not the topic distributions for training documents.
- Returns:
- (undocumented)
-
topicsMatrix
Description copied from class: LDAModel
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
- Specified by:
topicsMatrix in class LDAModel
- Returns:
- (undocumented)
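To make the column-per-topic layout concrete, here is a minimal plain-Java sketch that pulls one topic's term weights out of a flat array. It assumes the matrix values are available as a double[] in column-major order, which is an assumption about the storage layout rather than something stated on this page. The names TopicColumns and topicColumn and the sample numbers are hypothetical and are not part of the Spark API.

```java
import java.util.Arrays;

final class TopicColumns {
    /**
     * Returns column `topic` of a vocabSize x k matrix whose values are
     * stored column by column (assumed layout) in a flat array.
     */
    static double[] topicColumn(double[] values, int vocabSize, int k, int topic) {
        if (values.length != vocabSize * k) {
            throw new IllegalArgumentException("expected vocabSize * k values");
        }
        if (topic < 0 || topic >= k) {
            throw new IllegalArgumentException("topic index out of range");
        }
        // Column `topic` occupies one contiguous run of vocabSize entries.
        return Arrays.copyOfRange(values, topic * vocabSize, (topic + 1) * vocabSize);
    }

    public static void main(String[] args) {
        // Hypothetical model with 3 terms and 2 topics.
        double[] values = {0.5, 0.3, 0.2,   // topic 0
                           0.1, 0.1, 0.8};  // topic 1
        System.out.println(Arrays.toString(topicColumn(values, 3, 2, 1)));
        // prints [0.1, 0.1, 0.8]
    }
}
```

Because no ordering of topics is guaranteed, a column index identifies a topic only within one trained model.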
-
describeTopics
public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
Description copied from class: LDAModel
Return the topics described by weighted terms.
- Specified by:
describeTopics in class LDAModel
- Parameters:
maxTermsPerTopic - Maximum number of terms to collect for each topic.
- Returns:
- Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
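The shape of one returned element, a pair of matching arrays sorted by decreasing weight, can be illustrated without Spark. The following is a minimal sketch for a single topic, assuming its term weights are already in a double[] indexed by term. DescribeTopicSketch, topTermIndices, and weightsAt are hypothetical helpers written for this illustration, and this is not the library's implementation.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DescribeTopicSketch {
    /** Indices of the highest-weighted terms, in order of decreasing weight. */
    static int[] topTermIndices(double[] termWeights, int maxTermsPerTopic) {
        Integer[] order = new Integer[termWeights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Stable sort by weight, largest first; ties keep ascending index order.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> termWeights[i]).reversed());
        int n = Math.max(0, Math.min(maxTermsPerTopic, order.length));
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    /** Weights matching the given term indices, position by position. */
    static double[] weightsAt(double[] termWeights, int[] termIndices) {
        double[] weights = new double[termIndices.length];
        for (int i = 0; i < termIndices.length; i++) {
            weights[i] = termWeights[termIndices[i]];
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] topic = {0.1, 0.6, 0.3}; // hypothetical weights for terms 0, 1, 2
        int[] terms = topTermIndices(topic, 2);
        System.out.println(Arrays.toString(terms) + " " + Arrays.toString(weightsAt(topic, terms)));
        // prints [1, 2] [0.6, 0.3]
    }
}
```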
-
topDocumentsPerTopic
public scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic)
Return the top documents for each topic.
- Parameters:
maxDocumentsPerTopic - Maximum number of documents to collect for each topic.
- Returns:
- Array over topics. Each element is represented as a pair of matching arrays: (IDs for the documents, weights of the topic in these documents). For each topic, documents are sorted in order of decreasing topic weights.
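When there are many documents, keeping only the best few per topic is commonly done with a bounded min-heap rather than a full sort. The sketch below shows that technique for a single topic in plain Java; it returns only the document IDs. It assumes the document IDs and the topic's weight in each document are given as matching arrays. TopDocumentsSketch and topDocumentIds are hypothetical names, and nothing here is claimed about how the library computes its result.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class TopDocumentsSketch {
    /** IDs of the documents with the largest topic weight, in decreasing weight order. */
    static long[] topDocumentIds(long[] docIds, double[] topicWeights, int maxDocumentsPerTopic) {
        if (docIds.length != topicWeights.length) {
            throw new IllegalArgumentException("docIds and topicWeights must have the same length");
        }
        int n = Math.max(0, Math.min(maxDocumentsPerTopic, docIds.length));
        if (n == 0) {
            return new long[0]; // PriorityQueue needs an initial capacity of at least 1
        }
        // Min-heap over array positions, ordered by weight, holding at most n entries.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(n, Comparator.comparingDouble((Integer i) -> topicWeights[i]));
        for (int i = 0; i < docIds.length; i++) {
            heap.add(i);
            if (heap.size() > n) {
                heap.poll(); // evict the currently smallest weight
            }
        }
        // The heap yields ascending weights, so fill the result from the back.
        long[] top = new long[n];
        for (int pos = n - 1; pos >= 0; pos--) {
            top[pos] = docIds[heap.poll()];
        }
        return top;
    }

    public static void main(String[] args) {
        long[] docIds = {10L, 20L, 30L, 40L};   // hypothetical document IDs
        double[] weights = {0.2, 0.9, 0.5, 0.1}; // hypothetical weights of one topic
        System.out.println(java.util.Arrays.toString(topDocumentIds(docIds, weights, 2)));
        // prints [20, 30]
    }
}
```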
-
topicAssignments
-
javaTopicAssignments
-
logLikelihood
public double logLikelihood()
-
logPrior
public double logPrior()
-
topicDistributions
For each document in the training set, return the distribution over topics for that document ("theta_doc").
- Returns:
- RDD of (document ID, topic distribution) pairs
-
javaTopicDistributions
Java-friendly version of topicDistributions()
- Returns:
- (undocumented)
-
topTopicsPerDocument
For each document, return the top k weighted topics for that document and their weights.
- Parameters:
k - (undocumented)
- Returns:
- RDD of (doc ID, topic indices, topic weights)
-
javaTopTopicsPerDocument
Java-friendly version of topTopicsPerDocument(int)
- Parameters:
k - (undocumented)
- Returns:
- (undocumented)
-
save
Description copied from interface: Saveable
Save this model to the given path.
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
- Parameters:
sc - Spark context used to save model data.
path - Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.
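Since saving fails when the target directory already exists, a caller may want to check the target first. The sketch below derives the two sub-directories named above and performs that check with java.nio.file. It assumes the path is on the local filesystem, which will not hold for paths on a distributed filesystem. SaveLayoutSketch, its methods, and the sample path are hypothetical and are not part of the Spark API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SaveLayoutSketch {
    /** Location of the JSON metadata under the model directory. */
    static Path metadataDir(Path modelDir) {
        return modelDir.resolve("metadata");
    }

    /** Location of the Parquet data under the model directory. */
    static Path dataDir(Path modelDir) {
        return modelDir.resolve("data");
    }

    /** True when nothing exists yet at the target (local filesystem only). */
    static boolean targetIsFree(Path modelDir) {
        return !Files.exists(modelDir);
    }

    public static void main(String[] args) {
        Path modelDir = Paths.get("models", "lda-model"); // hypothetical target
        System.out.println("metadata: " + metadataDir(modelDir));
        System.out.println("data:     " + dataDir(modelDir));
        System.out.println("free:     " + targetIsFree(modelDir));
    }
}
```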
-