Class IDF

Object
org.apache.spark.mllib.feature.IDF

public class IDF extends Object
Inverse document frequency (IDF). The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.

This implementation supports filtering out terms that do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that appear in fewer than minDocFreq documents, the IDF is set to 0, resulting in TF-IDF values of 0. The document frequency is also 0 for such terms.

param: minDocFreq minimum number of documents in which a term must appear to avoid being filtered out
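
The formula and the minDocFreq filter described above can be illustrated with a minimal plain-Java sketch. It is not the Spark implementation: the class name IdfSketch, the method idf, and its parameters are hypothetical, and it assumes that "log" in the formula means the natural logarithm and that per-term document counts have already been computed.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of org.apache.spark.mllib.
final class IdfSketch {

    /**
     * Applies idf = log((m + 1) / (d(t) + 1)) to each term.
     *
     * @param docFreq    d(t) for each term index t (assumed precomputed)
     * @param numDocs    m, the total number of documents
     * @param minDocFreq terms with d(t) below this threshold get an IDF of 0
     */
    static double[] idf(long[] docFreq, long numDocs, int minDocFreq) {
        double[] result = new double[docFreq.length]; // initialised to 0.0
        for (int t = 0; t < docFreq.length; t++) {
            if (docFreq[t] >= minDocFreq) {
                // Assumption: natural logarithm.
                result[t] = Math.log((numDocs + 1.0) / (docFreq[t] + 1.0));
            }
            // Otherwise the term is filtered out and its IDF stays 0.0.
        }
        return result;
    }

    public static void main(String[] args) {
        // Three terms appearing in 4, 2 and 1 of m = 4 documents.
        long[] docFreq = {4, 2, 1};
        System.out.println(Arrays.toString(idf(docFreq, 4, 2)));
    }
}
```

In this sketch a term that appears in every document gets an IDF of log(1) = 0 even without filtering, while the term with d(t) = 1 is zeroed only because it falls below minDocFreq = 2.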

  • Constructor Details

    • IDF

      public IDF(int minDocFreq)
    • IDF

      public IDF()
  • Method Details

    • minDocFreq

      public int minDocFreq()
    • fit

      public IDFModel fit(RDD<Vector> dataset)
      Computes the inverse document frequency.
      Parameters:
      dataset - an RDD of term frequency vectors
      Returns:
      an IDFModel holding the computed inverse document frequencies
    • fit

      public IDFModel fit(JavaRDD<Vector> dataset)
      Computes the inverse document frequency.
      Parameters:
      dataset - a JavaRDD of term frequency vectors
      Returns:
      an IDFModel holding the computed inverse document frequencies