Class DistributedLDAModel

Object
org.apache.spark.mllib.clustering.LDAModel
org.apache.spark.mllib.clustering.DistributedLDAModel
All Implemented Interfaces:
Saveable

public class DistributedLDAModel extends LDAModel
Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions.
  • Method Details

    • load

      public static DistributedLDAModel load(SparkContext sc, String path)
    • k

      public int k()
      Description copied from class: LDAModel
      Number of topics
      Specified by:
      k in class LDAModel
    • vocabSize

      public int vocabSize()
      Description copied from class: LDAModel
      Vocabulary size (number of distinct terms in the vocabulary)
      Specified by:
      vocabSize in class LDAModel
    • docConcentration

      public Vector docConcentration()
      Description copied from class: LDAModel
      Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").

      This is the parameter to a Dirichlet distribution.

      Specified by:
      docConcentration in class LDAModel
      Returns:
      The concentration parameter vector ("alpha") of the document-topic prior.
    • topicConcentration

      public double topicConcentration()
      Description copied from class: LDAModel
      Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.

      This is the parameter to a symmetric Dirichlet distribution.

      Specified by:
      topicConcentration in class LDAModel
      Returns:
      The concentration parameter ("beta" or "eta") of the topic-term prior.
    • toLocal

      public LocalLDAModel toLocal()
      Convert model to a local model. The local model stores the inferred topics but not the topic distributions for training documents.
      Returns:
      A LocalLDAModel holding the inferred topics.
    • topicsMatrix

      public Matrix topicsMatrix()
      Description copied from class: LDAModel
      Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
      Specified by:
      topicsMatrix in class LDAModel
      Returns:
      A vocabSize x k matrix in which each column is a topic.
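      A minimal sketch of reading one topic out of this matrix, in plain Java with no Spark dependency. It assumes the matrix has been flattened with Matrix.toArray(), which yields the values in column-major order. The class TopicColumns, the method topicColumn, and the sample numbers are hypothetical.

```java
/** Sketch: read one topic out of a vocabSize x k matrix stored column by column. */
final class TopicColumns {
    private TopicColumns() {}

    /**
     * Returns the term weights of one topic.
     *
     * @param columnMajor matrix values in column-major order (assumed layout)
     * @param vocabSize   number of rows, i.e. terms in the vocabulary
     * @param topic       zero-based column index of the topic
     */
    static double[] topicColumn(double[] columnMajor, int vocabSize, int topic) {
        if (vocabSize <= 0 || columnMajor.length % vocabSize != 0) {
            throw new IllegalArgumentException("array length must be a multiple of vocabSize");
        }
        int k = columnMajor.length / vocabSize; // number of topics (columns)
        if (topic < 0 || topic >= k) {
            throw new IllegalArgumentException("topic index out of range: " + topic);
        }
        double[] column = new double[vocabSize];
        // Column `topic` is the contiguous slice starting at topic * vocabSize.
        System.arraycopy(columnMajor, topic * vocabSize, column, 0, vocabSize);
        return column;
    }

    public static void main(String[] args) {
        // Hypothetical 3-term, 2-topic matrix; the numbers are made up.
        double[] values = {0.5, 0.3, 0.2, 0.1, 0.1, 0.8};
        System.out.println(java.util.Arrays.toString(topicColumn(values, 3, 1)));
    }
}
```

      Since the ordering of topics is not guaranteed, a column index identifies a topic only within one model instance.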
    • describeTopics

      public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
      Description copied from class: LDAModel
      Return the topics described by weighted terms.

      Specified by:
      describeTopics in class LDAModel
      Parameters:
      maxTermsPerTopic - Maximum number of terms to collect for each topic.
      Returns:
      Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
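      The matching arrays are usually joined with a vocabulary to make a topic readable. The following plain-Java sketch shows that step for a single topic. It assumes the caller already holds the two arrays from one tuple and a String[] vocabulary indexed by term index; TopicLabels, describe, and the sample data are hypothetical and not part of Spark.

```java
import java.util.Locale;

/** Sketch: turn one (term indices, term weights) pair into a readable label. */
final class TopicLabels {
    private TopicLabels() {}

    /**
     * @param termIndices term indices of one topic, in decreasing weight order
     * @param termWeights weights matching termIndices position by position
     * @param vocabulary  hypothetical lookup table from term index to term text
     */
    static String describe(int[] termIndices, double[] termWeights, String[] vocabulary) {
        if (termIndices.length != termWeights.length) {
            throw new IllegalArgumentException("index and weight arrays must have equal length");
        }
        StringBuilder label = new StringBuilder();
        for (int i = 0; i < termIndices.length; i++) {
            if (i > 0) {
                label.append(", ");
            }
            // Position i in both arrays refers to the same term.
            label.append(vocabulary[termIndices[i]])
                 .append('=')
                 .append(String.format(Locale.ROOT, "%.3f", termWeights[i]));
        }
        return label.toString();
    }

    public static void main(String[] args) {
        String[] vocabulary = {"spark", "java", "topic"}; // made-up vocabulary
        System.out.println(describe(new int[]{2, 0}, new double[]{0.5, 0.25}, vocabulary));
    }
}
```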
    • topDocumentsPerTopic

      public scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic)
      Return the top documents for each topic.

      Parameters:
      maxDocumentsPerTopic - Maximum number of documents to collect for each topic.
      Returns:
      Array over topics. Each element is represented as a pair of matching arrays: (IDs for the documents, weights of the topic in these documents). For each topic, documents are sorted in order of decreasing topic weight.
    • topicAssignments

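      public RDD<scala.Tuple3<Object,int[],int[]>> topicAssignments()
      Return the top topic for each (document, term) pair, that is, the most likely topic generating each term in each document.
      Returns:
      RDD of (doc ID, term indices, topic indices), where the two arrays match position by position. Terms not present in a document are omitted.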
    • javaTopicAssignments

      public JavaRDD<scala.Tuple3<Long,int[],int[]>> javaTopicAssignments()
      Java-friendly version of topicAssignments().
    • logLikelihood

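      public double logLikelihood()
      Log likelihood of the observed tokens in the training set, given the current parameter estimates. This excludes the prior; for that, use logPrior().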
    • logPrior

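      public double logPrior()
      Log probability of the current parameter estimate (topics and per-document topic distributions) under the priors given by docConcentration and topicConcentration.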
    • topicDistributions

      public RDD<scala.Tuple2<Object,Vector>> topicDistributions()
      For each document in the training set, return the distribution over topics for that document ("theta_doc").

      Returns:
      RDD of (document ID, topic distribution) pairs
    • javaTopicDistributions

      public JavaPairRDD<Long,Vector> javaTopicDistributions()
      Java-friendly version of topicDistributions().
      Returns:
      JavaPairRDD of (document ID, topic distribution) pairs
    • topTopicsPerDocument

      public RDD<scala.Tuple3<Object,int[],double[]>> topTopicsPerDocument(int k)
      For each document, return the top k weighted topics for that document and their weights.
      Parameters:
      k - Maximum number of topics to return for each document.
      Returns:
      RDD of (doc ID, topic indices, topic weights)
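      The selection described above can be sketched on a single document's topic distribution in plain Java. This illustrates the idea only and is not Spark's implementation. TopTopics, topTopicIndices, and the sample weights are hypothetical, and ties are broken here by lower topic index, which Spark does not promise.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

/** Sketch: pick the k heaviest topics from one document's topic distribution. */
final class TopTopics {
    private TopTopics() {}

    /**
     * @param distribution weight of each topic in the document, indexed by topic
     * @param k            maximum number of topics to keep
     * @return topic indices in order of decreasing weight, at most k of them
     */
    static int[] topTopicIndices(double[] distribution, int k) {
        return IntStream.range(0, distribution.length)
                .boxed()
                // Heaviest topic first; the stable sort keeps index order on ties.
                .sorted(Comparator.comparingDouble((Integer i) -> distribution[i]).reversed())
                .limit(Math.max(0, k))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        double[] theta = {0.1, 0.6, 0.3}; // made-up distribution over 3 topics
        for (int topic : topTopicIndices(theta, 2)) {
            System.out.println("topic " + topic + " weight " + theta[topic]);
        }
    }
}
```

      The weights array in the returned tuple corresponds to looking up each selected index in the distribution, as the loop in main does.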
    • javaTopTopicsPerDocument

      public JavaRDD<scala.Tuple3<Long,int[],double[]>> javaTopTopicsPerDocument(int k)
      Java-friendly version of topTopicsPerDocument(int).
      Parameters:
      k - Maximum number of topics to return for each document.
      Returns:
      JavaRDD of (doc ID, topic indices, topic weights)
    • save

      public void save(SparkContext sc, String path)
      Description copied from interface: Saveable
      Save this model to the given path.

      This saves:
      - human-readable (JSON) model metadata to path/metadata/
      - Parquet-formatted data to path/data/

      The model may be loaded using Loader.load.

      Parameters:
      sc - Spark context used to save model data.
      path - Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.