Class RegexTokenizer

org.apache.spark.ml.UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
    org.apache.spark.ml.feature.RegexTokenizer

All Implemented Interfaces: Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable

public class RegexTokenizer extends UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer> implements DefaultParamsWritable

A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is false). An optional parameter also allows filtering out tokens shorter than a minimum length. It returns an array of strings, which can be empty.
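
The two modes described above can be illustrated without Spark. The following is a minimal sketch that uses only java.util.regex; it is not Spark's implementation. The class name TokenizeModesSketch and the tokenize helper are hypothetical names introduced for illustration, and the sketch assumes that splitting behaves like Pattern.split and that matching behaves like repeated Matcher.find calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the two tokenization modes; not Spark code.
public class TokenizeModesSketch {

    // gaps = true: the pattern describes delimiters, so the text is split on it.
    // gaps = false: the pattern describes tokens, so every match is collected.
    public static List<String> tokenize(String text, String pattern,
                                        boolean gaps, int minTokenLength) {
        Pattern regex = Pattern.compile(pattern);
        List<String> candidates = new ArrayList<>();
        if (gaps) {
            for (String piece : regex.split(text)) {
                candidates.add(piece);
            }
        } else {
            Matcher matcher = regex.matcher(text);
            while (matcher.find()) {
                candidates.add(matcher.group());
            }
        }
        // Keep only the tokens that meet the minimum length.
        List<String> tokens = new ArrayList<>();
        for (String token : candidates) {
            if (token.length() >= minTokenLength) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "spark ml, regex tokenizer";
        // Split on runs of non-word characters.
        System.out.println(tokenize(text, "\\W+", true, 1));
        // Match runs of word characters directly.
        System.out.println(tokenize(text, "\\w+", false, 1));
        // An input with no matching tokens yields an empty list.
        System.out.println(tokenize("!!!", "\\w+", false, 1));
    }
}
```

In this sketch both modes produce the same tokens for the sample text because the two patterns are complements of each other; in general, the choice depends on whether the delimiters or the tokens are easier to describe with a regex.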

See Also:

Serialized Form

Nested Class Summary

Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter

Constructor Summary

- RegexTokenizer()
- RegexTokenizer(String uid)

Method Summary

- RegexTokenizer copy(ParamMap extra)
  Creates a copy of this instance with the same UID and some extra params.
- BooleanParam gaps()
  Indicates whether regex splits on gaps (true) or matches tokens (false).
- boolean getGaps()
- int getMinTokenLength()
- String getPattern()
- boolean getToLowercase()
- static RegexTokenizer load(String path)
- IntParam minTokenLength()
  Minimum token length, greater than or equal to 0.
- Param<String> pattern()
  Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false.
- static MLReader<T> read()
- RegexTokenizer setGaps(boolean value)
- RegexTokenizer setMinTokenLength(int value)
- RegexTokenizer setPattern(String value)
- RegexTokenizer setToLowercase(boolean value)
- final BooleanParam toLowercase()
  Indicates whether to convert all characters to lowercase before tokenizing.
- String toString()
- String uid()
  An immutable unique ID for the object and its derivatives.

Methods inherited from class org.apache.spark.ml.UnaryTransformer
inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema

Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform

Methods inherited from class org.apache.spark.ml.PipelineStage
params

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write

Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
getInputCol

Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol

Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext

Methods inherited from interface org.apache.spark.ml.util.MLWritable
save

Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn

Constructor Details
- RegexTokenizer
  
  public RegexTokenizer(String uid)
- RegexTokenizer
  
  public RegexTokenizer()
Method Details
- load
  
  public static RegexTokenizer load(String path)
- read
  
  public static MLReader<T> read()
- uid
  
  public String uid()
  
  Description copied from interface: Identifiable
  
  An immutable unique ID for the object and its derivatives.
  
  Specified by:
  
  uid in interface Identifiable
  
  Returns:
  
  (undocumented)
- minTokenLength
  
  public IntParam minTokenLength()
  
  Minimum token length, greater than or equal to 0. Default: 1, which avoids returning empty strings.
  
  Returns:
  
  (undocumented)
- setMinTokenLength
  
  public RegexTokenizer setMinTokenLength(int value)
- getMinTokenLength
  
  public int getMinTokenLength()
- gaps
  
  public BooleanParam gaps()
  
  Indicates whether the regex splits on gaps (true) or matches tokens (false). Default: true.
  
  Returns:
  
  (undocumented)
- setGaps
  
  public RegexTokenizer setGaps(boolean value)
- getGaps
  
  public boolean getGaps()
- pattern
  
  public Param<String> pattern()
  
  Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false. Default: "\\s+"
  
  Returns:
  
  (undocumented)
- setPattern
  
  public RegexTokenizer setPattern(String value)
- getPattern
  
  public String getPattern()
- toLowercase
  
  public final BooleanParam toLowercase()
  
  Indicates whether to convert all characters to lowercase before tokenizing. Default: true
  
  Returns:
  
  (undocumented)
- setToLowercase
  
  public RegexTokenizer setToLowercase(boolean value)
- getToLowercase
  
  public boolean getToLowercase()
- copy
  
  public RegexTokenizer copy(ParamMap extra)
  
  Description copied from interface: Params
  
  Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
  
  Specified by:
  
  copy in interface Params
  
  Overrides:
  
  copy in class UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
  
  Parameters:
  
  extra - (undocumented)
  
  Returns:
  
  (undocumented)
- toString
  
  public String toString()
  
  Specified by:
  
  toString in interface Identifiable
  
  Overrides:
  
  toString in class Object
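
The defaults documented above (pattern "\\s+", gaps true, toLowercase true, minTokenLength 1) interact in a way that a short example may make clearer. The following is a minimal sketch that uses only java.util.regex; it is not Spark's implementation. The class name DefaultParamsSketch and the splitOnGaps helper are hypothetical, the sketch assumes that split mode behaves like Pattern.split, and it lowercases with Locale.ROOT for determinism, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration of the documented defaults; not Spark code.
public class DefaultParamsSketch {

    // Documented default pattern: one or more whitespace characters.
    static final String DEFAULT_PATTERN = "\\s+";

    // Optionally lowercases the text, splits it on the default pattern
    // (the gaps = true mode), then drops tokens shorter than minTokenLength.
    public static List<String> splitOnGaps(String text, boolean toLowercase,
                                           int minTokenLength) {
        // Locale.ROOT is an assumption made here for reproducible output.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(DEFAULT_PATTERN).split(input)) {
            if (piece.length() >= minTokenLength) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "  Hello Spark";
        // Leading whitespace makes split produce an empty first piece;
        // a minimum length of 1 drops it.
        System.out.println(splitOnGaps(text, true, 1));
        // With a minimum length of 0 the empty piece is kept.
        System.out.println(splitOnGaps(text, true, 0));
        // Disabling lowercasing preserves the original case.
        System.out.println(splitOnGaps(text, false, 1));
    }
}
```

Under these assumptions, a minimum length of 0 lets the empty leading piece through, which is consistent with the documented reason for defaulting minTokenLength to 1.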
