org.apache.spark.mllib.feature.Word2Vec

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, scala.Serializable

public class Word2Vec extends Object implements scala.Serializable, org.apache.spark.internal.Logging

Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.

This implementation uses the skip-gram model and trains it with the hierarchical softmax method. The variable names in the implementation match those of the original C implementation.

For the original C implementation, see https://code.google.com/p/word2vec/. For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".

See Also:

Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary

Constructors

- Word2Vec()
Method Summary

- <S extends Iterable<String>> Word2VecModel fit(JavaRDD<S> dataset)
  Computes the vector representation of each word in the vocabulary (Java version).
- <S extends scala.collection.Iterable<String>> Word2VecModel fit(RDD<S> dataset)
  Computes the vector representation of each word in the vocabulary.
- Word2Vec setLearningRate(double learningRate)
  Sets the initial learning rate (default: 0.025).
- Word2Vec setMaxSentenceLength(int maxSentenceLength)
  Sets the maximum length (in words) of each sentence in the input data.
- Word2Vec setMinCount(int minCount)
  Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
- Word2Vec setNumIterations(int numIterations)
  Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions.
- Word2Vec setNumPartitions(int numPartitions)
  Sets the number of partitions (default: 1).
- Word2Vec setSeed(long seed)
  Sets the random seed (default: a random long integer).
- Word2Vec setVectorSize(int vectorSize)
  Sets the vector size (default: 100).
- Word2Vec setWindowSize(int window)
  Sets the window of words (default: 5).

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq

Constructor Details
- Word2Vec
  
  public Word2Vec()
Method Details
- fit
  
  public <S extends scala.collection.Iterable<String>> Word2VecModel fit(RDD<S> dataset)
  
  Computes the vector representation of each word in the vocabulary.
  
  Parameters:
  
  dataset - an RDD of sentences, where each sentence is expressed as an iterable collection of words
  
  Returns:
  
  a Word2VecModel
- fit
  
  public <S extends Iterable<String>> Word2VecModel fit(JavaRDD<S> dataset)
  
  Computes the vector representation of each word in the vocabulary (Java version).
  
  Parameters:
  
  dataset - a JavaRDD of sentences, where each sentence is expressed as an iterable collection of words
  
  Returns:
  
  a Word2VecModel
- setLearningRate
  
  public Word2Vec setLearningRate(double learningRate)
  
  Sets the initial learning rate (default: 0.025).
  
  Parameters:
  
  learningRate - the initial learning rate
  
  Returns:
  
  this Word2Vec instance
- setMaxSentenceLength
  
  public Word2Vec setMaxSentenceLength(int maxSentenceLength)
  
  Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold is divided into chunks of at most maxSentenceLength words (default: 1000).
  
  Parameters:
  
  maxSentenceLength - the maximum number of words in each sentence chunk
  
  Returns:
  
  this Word2Vec instance
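
The chunking rule described for setMaxSentenceLength can be sketched in plain Java. This is only an illustration of the documented behavior, not Spark's code: the class `SentenceChunker` and its `chunk` method are hypothetical names, and the sketch assumes chunks are consecutive and non-overlapping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: illustrates the documented chunking rule only.
final class SentenceChunker {

    // Splits a sentence into consecutive chunks of at most maxSentenceLength words.
    static List<List<String>> chunk(List<String> sentence, int maxSentenceLength) {
        if (maxSentenceLength <= 0) {
            throw new IllegalArgumentException("maxSentenceLength must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < sentence.size(); start += maxSentenceLength) {
            int end = Math.min(start + maxSentenceLength, sentence.size());
            // Copy the sublist so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(sentence.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("a", "b", "c", "d", "e", "f", "g");
        // With a limit of 3, seven words become chunks of 3, 3, and 1 words.
        System.out.println(chunk(sentence, 3)); // prints [[a, b, c], [d, e, f], [g]]
    }
}
```

With the default of 1000, only sentences longer than 1000 words would be split under this rule.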
- setMinCount
  
  public Word2Vec setMinCount(int minCount)
  
  Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
  
  Parameters:
  
  minCount - the minimum number of occurrences a token needs to be included in the vocabulary
  
  Returns:
  
  this Word2Vec instance
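
The effect of minCount on the vocabulary can be illustrated with a small stand-alone sketch. It assumes tokens are counted across the whole corpus and kept when their count is at least minCount; `VocabularyFilter` and `buildVocabulary` are hypothetical names used for illustration and do not exist in Spark.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark: shows how a minCount threshold prunes a vocabulary.
final class VocabularyFilter {

    // Counts tokens across all sentences and keeps those seen at least minCount times.
    static Map<String, Integer> buildVocabulary(List<List<String>> sentences, int minCount) {
        // TreeMap keeps the tokens sorted, which makes the printed result deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // Drop every token whose corpus-wide count is below the threshold.
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("the", "cat", "sat"),
                List.of("the", "cat", "ran"),
                List.of("the", "dog"));
        // "sat", "ran", and "dog" each occur once and are dropped when minCount is 2.
        System.out.println(buildVocabulary(sentences, 2)); // prints {cat=2, the=3}
    }
}
```

On a small corpus the default of 5 can remove most tokens, so a lower value may be worth trying when the resulting vocabulary is unexpectedly small.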
- setNumIterations
  
  public Word2Vec setNumIterations(int numIterations)
  
  Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions.
  
  Parameters:
  
  numIterations - the number of iterations
  
  Returns:
  
  this Word2Vec instance
- setNumPartitions
  
  public Word2Vec setNumPartitions(int numPartitions)
  
  Sets the number of partitions (default: 1). Use a small number for accuracy.
  
  Parameters:
  
  numPartitions - the number of partitions
  
  Returns:
  
  this Word2Vec instance
- setSeed
  
  public Word2Vec setSeed(long seed)
  
  Sets the random seed (default: a random long integer).
  
  Parameters:
  
  seed - the random seed
  
  Returns:
  
  this Word2Vec instance
- setVectorSize
  
  public Word2Vec setVectorSize(int vectorSize)
  
  Sets the vector size (default: 100).
  
  Parameters:
  
  vectorSize - the vector size
  
  Returns:
  
  this Word2Vec instance
- setWindowSize
  
  public Word2Vec setWindowSize(int window)
  
  Sets the size of the window of words (default: 5).
  
  Parameters:
  
  window - the window size, in words
  
  Returns:
  
  this Word2Vec instance
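
To make the role of the window in a skip-gram model more concrete, the following sketch lists (center word, context word) pairs for one sentence. It is a simplified illustration under stated assumptions, not Spark's training loop: it assumes a fixed symmetric window on both sides of each word, whereas an actual implementation may sample the effective window differently. `SkipGramPairs` and `pairs` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: enumerates skip-gram training pairs
// under the simplifying assumption of a fixed symmetric window.
final class SkipGramPairs {

    // Returns {center, context} pairs for every word and each neighbor within the window.
    static List<String[]> pairs(List<String> sentence, int window) {
        List<String[]> result = new ArrayList<>();
        for (int center = 0; center < sentence.size(); center++) {
            // Clamp the window to the sentence boundaries.
            int from = Math.max(0, center - window);
            int to = Math.min(sentence.size() - 1, center + window);
            for (int context = from; context <= to; context++) {
                if (context != center) {
                    result.add(new String[] {sentence.get(center), sentence.get(context)});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // With window = 1, each word is paired only with its immediate neighbors.
        for (String[] pair : pairs(List.of("a", "b", "c", "d"), 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A larger window pairs each word with more distant neighbors, which increases the number of training pairs per sentence.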
