Package org.apache.spark.ml.feature
Class RegexTokenizer
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Transformer
org.apache.spark.ml.UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
org.apache.spark.ml.feature.RegexTokenizer
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, Params, HasInputCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
public class RegexTokenizer
extends UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
implements DefaultParamsWritable
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split
the text (default) or by repeatedly matching the regex (if
gaps is false).
An optional parameter also allows filtering tokens by a minimum length.
It returns an array of strings, which may be empty.
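The two modes described above can be sketched in plain Java with java.util.regex. This is an illustration of the documented behavior, not Spark's implementation: the class name RegexTokenizerSketch, the tokenize helper, and the use of Locale.ROOT for lowercasing are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; it does not use or depend on Spark.
final class RegexTokenizerSketch {

    static List<String> tokenize(String text, String pattern, boolean gaps,
                                 int minTokenLength, boolean toLowercase) {
        // Lowercasing happens before the pattern is applied.
        // Locale.ROOT is an assumption of this sketch.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Pattern regex = Pattern.compile(pattern);
        List<String> tokens = new ArrayList<>();

        if (gaps) {
            // gaps = true: the pattern describes the delimiters between tokens.
            for (String piece : regex.split(input)) {
                if (piece.length() >= minTokenLength) {
                    tokens.add(piece);
                }
            }
        } else {
            // gaps = false: the pattern describes the tokens themselves.
            Matcher matcher = regex.matcher(input);
            while (matcher.find()) {
                String match = matcher.group();
                if (match.length() >= minTokenLength) {
                    tokens.add(match);
                }
            }
        }
        return tokens;
    }
}
```

In this sketch, tokenize("Hello  Spark ML", "\\s+", true, 1, true) yields [hello, spark, ml], and tokenize("Hello, Spark ML!", "\\w+", false, 1, true) yields the same three tokens by matching words instead of splitting on whitespace.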
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter -
Constructor Summary
Constructors -
Method Summary
Modifier and Type        Method                         Description
RegexTokenizer           copy(ParamMap extra)           Creates a copy of this instance with the same UID and some extra params.
BooleanParam             gaps()                         Indicates whether regex splits on gaps (true) or matches tokens (false).
boolean                  getGaps()
int                      getMinTokenLength()
String                   getPattern()
boolean                  getToLowercase()
static RegexTokenizer    load(String path)
IntParam                 minTokenLength()               Minimum token length, greater than or equal to 0.
Param<String>            pattern()                      Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false.
static MLReader<T>       read()
RegexTokenizer           setGaps(boolean value)
RegexTokenizer           setMinTokenLength(int value)
RegexTokenizer           setPattern(String value)
RegexTokenizer           setToLowercase(boolean value)
final BooleanParam       toLowercase()                  Indicates whether to convert all characters to lowercase before tokenizing.
String                   toString()
String                   uid()                          An immutable unique ID for the object and its derivatives.
Methods inherited from class org.apache.spark.ml.UnaryTransformer
inputCol, outputCol, setInputCol, setOutputCol, transform, transformSchema
Methods inherited from class org.apache.spark.ml.Transformer
transform, transform, transform
Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasInputCol
getInputCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, MDC, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
-
Constructor Details
-
RegexTokenizer
-
RegexTokenizer
public RegexTokenizer()
-
-
Method Details
-
load
-
read
-
uid
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Specified by:
uid in interface Identifiable
- Returns:
- (undocumented)
-
minTokenLength
Minimum token length, greater than or equal to 0. Default: 1, to avoid returning empty strings.
- Returns:
- (undocumented)
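Why the default of 1 avoids empty strings can be seen with a small plain-Java sketch. It assumes tokens are produced by splitting with java.util.regex.Pattern and then filtered by length; the class MinTokenLengthSketch and its splitAndFilter helper are hypothetical names used only for illustration, not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-alone sketch; it does not use or depend on Spark.
final class MinTokenLengthSketch {

    // Split on the delimiter pattern, then keep only tokens whose length
    // is at least minTokenLength.
    static List<String> splitAndFilter(String text, String delimiter, int minTokenLength) {
        return Arrays.stream(Pattern.compile(delimiter).split(text))
                .filter(token -> token.length() >= minTokenLength)
                .collect(Collectors.toList());
    }
}
```

For the input " a bb" (note the leading space) and the delimiter "\\s+", a minimum length of 0 keeps the empty leading piece that the split produces, giving three tokens: "", "a", and "bb". A minimum of 1 drops the empty string and gives "a" and "bb"; a minimum of 2 leaves only "bb".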
-
setMinTokenLength
-
getMinTokenLength
public int getMinTokenLength()
-
gaps
Indicates whether regex splits on gaps (true) or matches tokens (false). Default: true
- Returns:
- (undocumented)
-
setGaps
-
getGaps
public boolean getGaps()
-
pattern
Regex pattern used to match delimiters if gaps() is true or tokens if gaps() is false. Default: "\\s+"
- Returns:
- (undocumented)
-
setPattern
-
getPattern
-
toLowercase
Indicates whether to convert all characters to lowercase before tokenizing. Default: true
- Returns:
- (undocumented)
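Because lowercasing is applied before tokenizing, a case-sensitive pattern sees only the lowercased text. The following plain-Java sketch illustrates that consequence under the assumption that tokens are collected by repeated matching, as when gaps is false; LowercaseOrderSketch, matchTokens, and the choice of Locale.ROOT are hypothetical and not taken from Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; it does not use or depend on Spark.
final class LowercaseOrderSketch {

    // Optionally lowercase the text first, then collect every match of tokenPattern.
    static List<String> matchTokens(String text, String tokenPattern, boolean toLowercase) {
        // Locale.ROOT is an assumption of this sketch.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        Matcher matcher = Pattern.compile(tokenPattern).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

With the text "Spark ML" and the pattern "[A-Z]+", lowercasing first leaves nothing to match, so the result is an empty list. With lowercasing turned off, the same call returns "S" and "ML". A pattern that relies on uppercase characters therefore needs toLowercase set to false.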
-
setToLowercase
-
getToLowercase
public boolean getToLowercase()
-
copy
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().
- Specified by:
copy in interface Params
- Overrides:
copy in class UnaryTransformer<String,scala.collection.immutable.Seq<String>,RegexTokenizer>
- Parameters:
extra - (undocumented)
- Returns:
- (undocumented)
-
toString
- Specified by:
toString in interface Identifiable
- Overrides:
toString in class Object
-