Package org.apache.spark.ml.feature
Class ChiSqSelector
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Estimator<T>
org.apache.spark.ml.feature.ChiSqSelector
- All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging, SelectorParams, Params, HasFeaturesCol, HasLabelCol, HasOutputCol, DefaultParamsWritable, Identifiable, MLWritable
Deprecated.
Use UnivariateFeatureSelector instead. Since 3.1.1.
Chi-Squared feature selection, which selects categorical features to use for predicting a
categorical label.
The selector supports different selection methods:
numTopFeatures, percentile, fpr,
fdr, fwe.
- numTopFeatures chooses a fixed number of top features according to a chi-squared test.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-values are below a threshold, thus controlling the false
positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-values are below a threshold. The threshold is scaled by
1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features
set to 50.
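To make the underlying test concrete, the following is a minimal sketch of the Pearson chi-squared statistic for one categorical feature against a categorical label. It is illustrative only and is not Spark's implementation: the class name ChiSquaredStatisticSketch, its methods, and the sample counts are hypothetical. Turning the statistic into a p-value requires the chi-squared distribution, which is left out here.

```java
// Illustrative sketch only; not Spark's implementation.
final class ChiSquaredStatisticSketch {

    /**
     * Pearson chi-squared statistic for a contingency table in which
     * observed[i][j] counts the rows with feature value i and label j.
     */
    static double statistic(long[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSums[i] += observed[i][j];
                colSums[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double chi2 = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count if feature and label were independent.
                double expected = rowSums[i] * colSums[j] / total;
                if (expected > 0.0) { // skip cells in empty rows or columns
                    double diff = observed[i][j] - expected;
                    chi2 += diff * diff / expected;
                }
            }
        }
        return chi2;
    }

    /** Degrees of freedom of the test: (rows - 1) * (columns - 1). */
    static int degreesOfFreedom(long[][] observed) {
        return (observed.length - 1) * (observed[0].length - 1);
    }

    public static void main(String[] args) {
        long[][] observed = {{10, 20}, {20, 10}}; // hypothetical counts
        System.out.println("chi2 = " + statistic(observed)
            + ", dof = " + degreesOfFreedom(observed));
    }
}
```

A larger statistic for the same degrees of freedom corresponds to a smaller p-value, which is the quantity the selection methods rank and threshold.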
Nested Class Summary
Nested classes/interfaces inherited from interface org.apache.spark.internal.Logging
org.apache.spark.internal.Logging.LogStringContext, org.apache.spark.internal.Logging.SparkShellLoggingFilter
Constructor Summary
Constructors
Method Summary
Modifier and Type | Method | Description
ChiSqSelector | copy(ParamMap extra) | Deprecated. Creates a copy of this instance with the same UID and some extra params.
final DoubleParam | fdr() | The upper bound of the expected false discovery rate.
final Param<String> | featuresCol() | Param for features column name.
ChiSqSelectorModel | fit(Dataset<?> dataset) | Deprecated. Fits a model to the input data.
final DoubleParam | fpr() | The highest p-value for features to be kept.
final DoubleParam | fwe() | The upper bound of the expected family-wise error rate.
final Param<String> | labelCol() | Param for label column name.
static ChiSqSelector | load(String path) | Deprecated.
final IntParam | numTopFeatures() | Number of features that selector will select, ordered by ascending p-value.
final Param<String> | outputCol() | Param for output column name.
final DoubleParam | percentile() | Percentile of features that selector will select, ordered by ascending p-value.
static MLReader<T> | read() | Deprecated.
final Param<String> | selectorType() | The selector type.
ChiSqSelector | setFdr(double value) | Deprecated.
ChiSqSelector | setFeaturesCol(String value) | Deprecated.
ChiSqSelector | setFpr(double value) | Deprecated.
ChiSqSelector | setFwe(double value) | Deprecated.
ChiSqSelector | setLabelCol(String value) | Deprecated.
ChiSqSelector | setNumTopFeatures(int value) | Deprecated.
ChiSqSelector | setOutputCol(String value) | Deprecated.
ChiSqSelector | setPercentile(double value) | Deprecated.
ChiSqSelector | setSelectorType(String value) | Deprecated.
StructType | transformSchema(StructType schema) | Deprecated. Check transform validity and derive the output schema from the input schema.
String | uid() | Deprecated. An immutable unique ID for the object and its derivatives.
Methods inherited from class org.apache.spark.ml.PipelineStage
params
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.ml.util.DefaultParamsWritable
write
Methods inherited from interface org.apache.spark.ml.param.shared.HasFeaturesCol
getFeaturesCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasLabelCol
getLabelCol
Methods inherited from interface org.apache.spark.ml.param.shared.HasOutputCol
getOutputCol
Methods inherited from interface org.apache.spark.ml.util.Identifiable
toString
Methods inherited from interface org.apache.spark.internal.Logging
initializeForcefully, initializeLogIfNecessary, initializeLogIfNecessary, initializeLogIfNecessary$default$2, isTraceEnabled, log, logBasedOnLevel, logDebug, logDebug, logDebug, logDebug, logError, logError, logError, logError, logInfo, logInfo, logInfo, logInfo, logName, LogStringContext, logTrace, logTrace, logTrace, logTrace, logWarning, logWarning, logWarning, logWarning, org$apache$spark$internal$Logging$$log_, org$apache$spark$internal$Logging$$log__$eq, withLogContext
Methods inherited from interface org.apache.spark.ml.util.MLWritable
save
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, estimateMatadataSize, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, onParamChange, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
Methods inherited from interface org.apache.spark.ml.feature.SelectorParams
getFdr, getFpr, getFwe, getNumTopFeatures, getPercentile, getSelectorType
Constructor Details
ChiSqSelector
Deprecated.

ChiSqSelector
public ChiSqSelector()
Deprecated.
Method Details
load
Deprecated.

read
Deprecated.

uid
Deprecated.
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
- Returns:
(undocumented)

setNumTopFeatures
Deprecated.

setPercentile
Deprecated.

setFpr
Deprecated.

setFdr
Deprecated.

setFwe
Deprecated.

setSelectorType
Deprecated.

setFeaturesCol
Deprecated.

setOutputCol
Deprecated.

setLabelCol
Deprecated.

fit
Deprecated.
Description copied from class: Estimator
Fits a model to the input data.
- Parameters:
dataset - (undocumented)
- Returns:
(undocumented)

transformSchema
Deprecated.
Description copied from class: PipelineStage
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first conduct verification on schema change and parameter validity, including complex parameter interaction checks.
- Parameters:
schema - (undocumented)
- Returns:
(undocumented)

copy
Deprecated.
Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. Subclasses should implement this method and set the return type properly. See defaultCopy().

fdr
Description copied from interface: SelectorParams
The upper bound of the expected false discovery rate. Only applicable when selectorType = "fdr". Default value is 0.05.
- Specified by:
fdr in interface SelectorParams
- Returns:
(undocumented)
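
As a rough illustration of the step-up rule behind the fdr selector, the sketch below applies the Benjamini-Hochberg procedure to an array of p-values. It is a minimal sketch under stated assumptions, not Spark's code: the class BenjaminiHochbergSketch, the select method, the sample p-values, and the use of <= in the comparison are all chosen here for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; not Spark's implementation.
final class BenjaminiHochbergSketch {

    /** Returns the indices of the selected features in ascending index order. */
    static int[] select(double[] pValues, double fdr) {
        int n = pValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Rank features by ascending p-value.
        Arrays.sort(order, Comparator.comparingDouble(i -> pValues[i]));

        // Find the largest rank k (1-based) with p_(k) <= fdr * k / n.
        int lastPassing = -1;
        for (int rank = 0; rank < n; rank++) {
            if (pValues[order[rank]] <= fdr * (rank + 1) / n) {
                lastPassing = rank;
            }
        }

        // Keep every feature up to and including that rank.
        int[] selected = new int[lastPassing + 1];
        for (int k = 0; k <= lastPassing; k++) {
            selected[k] = order[k];
        }
        Arrays.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        double[] pValues = {0.001, 0.03, 0.035, 0.9}; // hypothetical p-values
        System.out.println(Arrays.toString(select(pValues, 0.05)));
    }
}
```

With the hypothetical p-values in main, the second-ranked feature misses its own threshold but is still kept because a later rank passes. That is what separates the step-up rule from a plain per-feature cutoff.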

featuresCol
Description copied from interface: HasFeaturesCol
Param for features column name.
- Specified by:
featuresCol in interface HasFeaturesCol
- Returns:
(undocumented)

fpr
Description copied from interface: SelectorParams
The highest p-value for features to be kept. Only applicable when selectorType = "fpr". Default value is 0.05.
- Specified by:
fpr in interface SelectorParams
- Returns:
(undocumented)

fwe
Description copied from interface: SelectorParams
The upper bound of the expected family-wise error rate. Only applicable when selectorType = "fwe". Default value is 0.05.
- Specified by:
fwe in interface SelectorParams
- Returns:
(undocumented)
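
The fpr and fwe selectors differ only in the cutoff applied to each p-value: fpr compares against the threshold directly, while fwe first scales it by 1/numFeatures. The following sketch shows that difference on plain arrays. The class ThresholdSelectionSketch, its method names, and the sample p-values are hypothetical, and this is not Spark's implementation.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch only; not Spark's implementation.
final class ThresholdSelectionSketch {

    /** Indices of features whose p-value is strictly below the threshold. */
    static int[] selectBelow(double[] pValues, double threshold) {
        return IntStream.range(0, pValues.length)
            .filter(i -> pValues[i] < threshold)
            .toArray();
    }

    /** fpr: compare each p-value against the threshold as given. */
    static int[] selectFpr(double[] pValues, double fpr) {
        return selectBelow(pValues, fpr);
    }

    /** fwe: scale the threshold by 1/numFeatures before comparing. */
    static int[] selectFwe(double[] pValues, double fwe) {
        return selectBelow(pValues, fwe / pValues.length);
    }

    public static void main(String[] args) {
        double[] pValues = {0.001, 0.02, 0.04, 0.2}; // hypothetical p-values
        System.out.println("fpr: " + Arrays.toString(selectFpr(pValues, 0.05)));
        System.out.println("fwe: " + Arrays.toString(selectFwe(pValues, 0.05)));
    }
}
```

Because the fwe cutoff shrinks as the number of features grows, it never keeps more features than fpr does at the same nominal threshold.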

labelCol
Description copied from interface: HasLabelCol
Param for label column name.
- Specified by:
labelCol in interface HasLabelCol
- Returns:
(undocumented)

numTopFeatures
Description copied from interface: SelectorParams
Number of features that selector will select, ordered by ascending p-value. If the number of features is less than numTopFeatures, then this will select all features. Only applicable when selectorType = "numTopFeatures". The default value of numTopFeatures is 50.
- Specified by:
numTopFeatures in interface SelectorParams
- Returns:
(undocumented)

outputCol
Description copied from interface: HasOutputCol
Param for output column name.
- Specified by:
outputCol in interface HasOutputCol
- Returns:
(undocumented)

percentile
Description copied from interface: SelectorParams
Percentile of features that selector will select, ordered by ascending p-value. Only applicable when selectorType = "percentile". Default value is 0.1.
- Specified by:
percentile in interface SelectorParams
- Returns:
(undocumented)
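
Both numTopFeatures and percentile rank features by ascending p-value and keep a prefix of that ranking; they differ only in how the prefix length is chosen. The sketch below illustrates this on a plain array. The class TopFeaturesSketch and its methods are hypothetical, and converting the percentile to a count by truncating percentile * numFeatures is an assumption made for this example rather than something the text above states.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; not Spark's implementation.
final class TopFeaturesSketch {

    /** Indices of the numTopFeatures smallest p-values, in ascending index order. */
    static int[] selectTop(double[] pValues, int numTopFeatures) {
        int n = pValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Rank features by ascending p-value.
        Arrays.sort(order, Comparator.comparingDouble(i -> pValues[i]));

        // Fewer features than requested: keep all of them.
        int k = Math.min(Math.max(numTopFeatures, 0), n);
        int[] selected = new int[k];
        for (int j = 0; j < k; j++) {
            selected[j] = order[j];
        }
        Arrays.sort(selected);
        return selected;
    }

    /** Keeps a fraction of all features (assumed here: count is truncated). */
    static int[] selectPercentile(double[] pValues, double percentile) {
        return selectTop(pValues, (int) (percentile * pValues.length));
    }

    public static void main(String[] args) {
        double[] pValues = {0.3, 0.01, 0.2, 0.05}; // hypothetical p-values
        System.out.println(Arrays.toString(selectTop(pValues, 2)));
        System.out.println(Arrays.toString(selectPercentile(pValues, 0.5)));
    }
}
```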

selectorType
Description copied from interface: SelectorParams
The selector type. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
- Specified by:
selectorType in interface SelectorParams
- Returns:
(undocumented)
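
The "Only applicable when selectorType = ..." notes in the parameter descriptions imply that each selector type is governed by exactly one parameter of the same name. The lookup below records that pairing together with the documented default values. It is a hypothetical helper written for this page (SelectorTypeDefaultsSketch is not part of Spark), intended only as a compact summary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative lookup only; the class and method names are hypothetical.
final class SelectorTypeDefaultsSketch {

    // Selector type -> default value of the one parameter that applies to it,
    // as stated in the parameter descriptions.
    private static final Map<String, Double> DEFAULTS = new LinkedHashMap<>();
    static {
        DEFAULTS.put("numTopFeatures", 50.0);
        DEFAULTS.put("percentile", 0.1);
        DEFAULTS.put("fpr", 0.05);
        DEFAULTS.put("fdr", 0.05);
        DEFAULTS.put("fwe", 0.05);
    }

    /** True if the name is one of the five supported selector types. */
    static boolean isSupported(String selectorType) {
        return DEFAULTS.containsKey(selectorType);
    }

    /** Documented default of the parameter that governs the given selector type. */
    static double defaultFor(String selectorType) {
        Double value = DEFAULTS.get(selectorType);
        if (value == null) {
            throw new IllegalArgumentException("Unsupported selectorType: " + selectorType);
        }
        return value;
    }

    public static void main(String[] args) {
        for (String type : DEFAULTS.keySet()) {
            System.out.println(type + " -> " + defaultFor(type));
        }
    }
}
```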