Object
  org.apache.spark.ml.PipelineStage
    org.apache.spark.ml.Estimator<VectorIndexerModel>
      org.apache.spark.ml.feature.VectorIndexer
public class VectorIndexer
:: Experimental ::
Class for indexing categorical feature columns in a dataset of Vector.
This has 2 usage modes:
- Automatically identify categorical features (default behavior)
  - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter.
  - Set maxCategories to the maximum number of categories any categorical feature should have.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.
- Index all features, if all features are categorical
  - If maxCategories is set to be very large, then this will build an index of unique values for all features.
  - Warning: This can cause problems if features are continuous, since this will collect ALL unique values to the driver.
  - E.g.: Feature 0 has unique values {-1.0, 0.0}, and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories >= 3, then both features will be declared categorical.
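The maxCategories rule described above can be illustrated without Spark. The following is a minimal plain-Java sketch, not the library's implementation: the class `MaxCategoriesSketch` and the helper `isCategorical` are hypothetical names, rows are assumed to be non-empty dense arrays of equal length, and a feature is treated as categorical when its number of distinct values is at most maxCategories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Spark.
class MaxCategoriesSketch {

    // Returns one flag per feature: true if the feature has at most
    // maxCategories distinct values (categorical), false otherwise (continuous).
    // Assumes rows is non-empty and every row has the same length.
    static boolean[] isCategorical(double[][] rows, int maxCategories) {
        int numFeatures = rows[0].length;

        // One set of distinct values per feature.
        List<Set<Double>> distinct = new ArrayList<>();
        for (int j = 0; j < numFeatures; j++) {
            distinct.add(new TreeSet<>());
        }
        for (double[] row : rows) {
            for (int j = 0; j < numFeatures; j++) {
                distinct.get(j).add(row[j]);
            }
        }

        boolean[] categorical = new boolean[numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            categorical[j] = distinct.get(j).size() <= maxCategories;
        }
        return categorical;
    }

    public static void main(String[] args) {
        // Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
        double[][] rows = {
            {-1.0, 1.0},
            { 0.0, 3.0},
            {-1.0, 5.0}
        };
        System.out.println(Arrays.toString(isCategorical(rows, 2))); // [true, false]
        System.out.println(Arrays.toString(isCategorical(rows, 3))); // [true, true]
    }
}
```

With the example data from the list above, maxCategories = 2 marks only feature 0 as categorical, while maxCategories = 3 marks both. The sketch keeps every distinct value in memory, which is the same cost the warning above refers to when continuous features are indexed with a very large maxCategories.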
This returns a model which can transform categorical features to use 0-based indices.
Index stability:
- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes value 0, then this is guaranteed to map value 0 to index 0. This maintains vector sparsity.
- More stability may be added in the future.
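One way to picture the zero-to-zero guarantee and the 0-based transform is the plain-Java sketch below. It is an assumption-laden illustration rather than Spark's code: `CategoryIndexSketch`, `buildIndex`, and `transform` are hypothetical names, the non-zero values are ordered by sorting (the documentation above does not promise any particular order for them), and an unseen category is rejected with an exception because no option for unknown categories is described.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Spark.
class CategoryIndexSketch {

    // Builds a value -> index map for one categorical feature.
    // If 0.0 occurs among the values it always receives index 0, so zero
    // entries of a sparse vector remain zero after indexing.
    static Map<Double, Integer> buildIndex(double[] values) {
        TreeSet<Double> sorted = new TreeSet<>();
        for (double v : values) {
            sorted.add(v == 0.0 ? 0.0 : v); // treat -0.0 as 0.0
        }
        Map<Double, Integer> index = new LinkedHashMap<>();
        if (sorted.remove(0.0)) {
            index.put(0.0, 0);
        }
        // Remaining values get the next free indices (sorted order is an assumption).
        for (double v : sorted) {
            index.put(v, index.size());
        }
        return index;
    }

    // Replaces each categorical feature value with its index;
    // features without an entry in categoryMaps are left unchanged.
    static double[] transform(double[] vector, Map<Integer, Map<Double, Integer>> categoryMaps) {
        double[] out = vector.clone();
        for (Map.Entry<Integer, Map<Double, Integer>> e : categoryMaps.entrySet()) {
            int feature = e.getKey();
            double key = vector[feature] == 0.0 ? 0.0 : vector[feature];
            Integer idx = e.getValue().get(key);
            if (idx == null) {
                throw new IllegalArgumentException(
                    "Unseen category " + key + " for feature " + feature);
            }
            out[feature] = idx;
        }
        return out;
    }

    public static void main(String[] args) {
        // Feature 0 is categorical with values {-1.0, 0.0}; feature 1 is continuous.
        Map<Integer, Map<Double, Integer>> maps = new HashMap<>();
        maps.put(0, buildIndex(new double[] {-1.0, 0.0, -1.0}));

        double[] a = transform(new double[] {-1.0, 3.0}, maps);
        double[] b = transform(new double[] {0.0, 5.0}, maps);
        System.out.println(a[0] + ", " + a[1]); // 1.0, 3.0
        System.out.println(b[0] + ", " + b[1]); // 0.0, 5.0
    }
}
```

Placing 0.0 first means a zero input stays zero in the output, which is why sparse vectors stay sparse after indexing.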
TODO: Future extensions: The following functionality is planned for the future:
- Preserve metadata in transform; if a feature's metadata is already present, do not recompute.
- Specify certain features to not index, either via a parameter or via existing metadata.
- Add warning if a categorical feature has only 1 category.
- Add option for allowing unknown categories.
Nested Class Summary | |
---|---|
static class | VectorIndexer.CategoryStats: Helper class for tracking unique values for each feature. |
Constructor Summary | |
---|---|
VectorIndexer() | |
VectorIndexer(String uid) | |
Method Summary | |
---|---|
VectorIndexer | copy(ParamMap extra): Creates a copy of this instance with the same UID and some extra params. |
VectorIndexerModel | fit(DataFrame dataset): Fits a model to the input data. |
int | getMaxCategories() |
IntParam | maxCategories(): Threshold for the number of values a categorical feature can take. |
VectorIndexer | setInputCol(String value) |
VectorIndexer | setMaxCategories(int value) |
VectorIndexer | setOutputCol(String value) |
StructType | transformSchema(StructType schema): :: DeveloperApi :: |
String | uid() |
Methods inherited from class org.apache.spark.ml.Estimator |
---|
fit, fit, fit, fit |
Methods inherited from class Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.spark.ml.param.Params |
---|
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, setDefault, shouldOwn, validateParams |
Methods inherited from interface org.apache.spark.Logging |
---|
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning |
Constructor Detail |
---|
public VectorIndexer(String uid)
public VectorIndexer()
Method Detail |
---|
public String uid()
public VectorIndexer setMaxCategories(int value)
public VectorIndexer setInputCol(String value)
public VectorIndexer setOutputCol(String value)
public VectorIndexerModel fit(DataFrame dataset)
Description copied from class: Estimator
Specified by: fit in class Estimator<VectorIndexerModel>
Parameters: dataset - (undocumented)
public StructType transformSchema(StructType schema)
Description copied from class: PipelineStage
Derives the output schema from the input schema.
Specified by: transformSchema in class PipelineStage
Parameters: schema - (undocumented)
public VectorIndexer copy(ParamMap extra)
Description copied from interface: Params
Specified by: copy in interface Params
Specified by: copy in class Estimator<VectorIndexerModel>
Parameters: extra - (undocumented)
See Also: defaultCopy()
public IntParam maxCategories()
Threshold for the number of values a categorical feature can take. (default = 20)
public int getMaxCategories()