Class HashingTF

Object
org.apache.spark.mllib.feature.HashingTF
All Implemented Interfaces:
Serializable, scala.Serializable

public class HashingTF extends Object implements scala.Serializable
Maps a sequence of terms to their term frequencies using the hashing trick.

param: numFeatures number of features (default: 2^20)
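
The hashing trick described above can be sketched without Spark. This is a minimal illustration, not HashingTF's implementation: the class `HashingTrickSketch` and its methods are hypothetical, and `Object.hashCode()` stands in for the hash function that HashingTF actually uses, so the indices will differ from what Spark produces.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashingTrickSketch {
    // Hypothetical helper: hash a term into a bucket in [0, numFeatures).
    // Assumption: hashCode() is a stand-in, not the hash HashingTF uses.
    static int bucketOf(Object term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Count how many terms of the document land in each bucket.
    // Only non-empty buckets are stored, mirroring a sparse vector.
    static Map<Integer, Double> termFrequencies(Iterable<?> document, int numFeatures) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (Object term : document) {
            tf.merge(bucketOf(term, numFeatures), 1.0, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> document = List.of("a", "b", "a", "c");
        // 1 << 20 matches the documented default of 2^20 features.
        System.out.println(termFrequencies(document, 1 << 20));
        // prints {97=2.0, 98=1.0, 99=1.0}
    }
}
```

No vocabulary is built or stored: the index of a term depends only on its hash and on numFeatures. The trade-off is that distinct terms can collide in the same bucket, which becomes more likely as numFeatures shrinks.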

  • Constructor Details

    • HashingTF

      public HashingTF(int numFeatures)
    • HashingTF

      public HashingTF()
  • Method Details

    • numFeatures

      public int numFeatures()
    • setBinary

      public HashingTF setBinary(boolean value)
      If true, the term frequency vector is binary: every non-zero term count is set to 1 (default: false).
      Parameters:
      value - whether to produce a binary term frequency vector
      Returns:
      this HashingTF instance
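
      The difference between counts and binary output can be illustrated with a small stand-alone sketch. The class `BinaryTfSketch` is hypothetical and uses `String.hashCode()` as a stand-in hash, so it shows the effect of the flag rather than Spark's actual indices.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BinaryTfSketch {
    // Hypothetical stand-in for a hashed term frequency computation.
    // Assumption: String.hashCode() replaces the hash HashingTF uses.
    static Map<Integer, Double> termFrequencies(List<String> document, int numFeatures, boolean binary) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String term : document) {
            int index = Math.floorMod(term.hashCode(), numFeatures);
            if (binary) {
                tf.put(index, 1.0);                 // presence only
            } else {
                tf.merge(index, 1.0, Double::sum);  // occurrence count
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> document = List.of("a", "b", "a");
        System.out.println(termFrequencies(document, 1024, false)); // prints {97=2.0, 98=1.0}
        System.out.println(termFrequencies(document, 1024, true));  // prints {97=1.0, 98=1.0}
    }
}
```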
    • setHashAlgorithm

      public HashingTF setHashAlgorithm(String value)
      Sets the hash algorithm used to map a term to an integer index (default: murmur3).
      Parameters:
      value - name of the hash algorithm
      Returns:
      this HashingTF instance
    • indexOf

      public int indexOf(Object term)
      Returns the index of the input term.
      Parameters:
      term - the input term
      Returns:
      the index of the term
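
      One way such an index can be derived is sketched below. It assumes the index is the term's hash reduced to a non-negative remainder modulo numFeatures; the documentation above does not spell this out, and the class `IndexOfSketch` with its `hashCode()`-based hash is hypothetical.

```java
public class IndexOfSketch {
    // Assumption: index = non-negative remainder of the hash modulo numFeatures.
    // hashCode() is a stand-in for the hash HashingTF uses.
    static int indexOf(Object term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        Integer term = -7; // Integer.hashCode() is the value itself
        // The % operator keeps the sign of the dividend, so it can go negative.
        System.out.println(term.hashCode() % 16); // prints -7
        // floorMod always lands in [0, numFeatures).
        System.out.println(indexOf(term, 16));    // prints 9
        // Distinct terms can share an index (a collision).
        System.out.println(indexOf(1, 16) == indexOf(17, 16)); // prints true
    }
}
```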
    • transform

      public Vector transform(scala.collection.Iterable<?> document)
      Transforms the input document into a sparse term frequency vector.
      Parameters:
      document - the input document, as a collection of terms
      Returns:
      the sparse term frequency vector for the document
    • transform

      public Vector transform(Iterable<?> document)
      Transforms the input document into a sparse term frequency vector (Java version).
      Parameters:
      document - the input document, as a collection of terms
      Returns:
      the sparse term frequency vector for the document
    • transform

      public <D extends scala.collection.Iterable<?>> RDD<Vector> transform(RDD<D> dataset)
      Transforms each document in the input dataset into a term frequency vector.
      Parameters:
      dataset - an RDD of documents, each a collection of terms
      Returns:
      an RDD of term frequency vectors, one per document
    • transform

      public <D extends Iterable<?>> JavaRDD<Vector> transform(JavaRDD<D> dataset)
      Transforms each document in the input dataset into a term frequency vector (Java version).
      Parameters:
      dataset - a JavaRDD of documents, each a collection of terms
      Returns:
      a JavaRDD of term frequency vectors, one per document