pyspark.sql.plot.core.PySparkPlotAccessor.kde

PySparkPlotAccessor.kde(bw_method, column=None, ind=None, **kwargs)

Generate a Kernel Density Estimate plot using Gaussian kernels.

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels and includes automatic bandwidth determination.
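As a rough illustration of what a Gaussian KDE computes, the sketch below evaluates the estimate in plain Python. It is not PySpark's implementation: the helper name `gaussian_kde` is hypothetical, and it assumes the bandwidth acts as the standard deviation of each Gaussian kernel.

```python
import math

def gaussian_kde(samples, bandwidth, points):
    """Evaluate a Gaussian kernel density estimate at each value in points.

    Hypothetical helper for illustration only. Assumes `bandwidth` is the
    standard deviation of the Gaussian kernel centred on each sample.
    """
    n = len(samples)
    # Normalising constant so the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in points:
        # Sum the (unnormalised) Gaussian kernel over every sample.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        densities.append(norm * total)
    return densities

lengths = [5.1, 4.9, 7.0, 6.4, 5.9]
step = 0.06
grid = [3.0 + step * i for i in range(101)]  # 101 points from 3.0 to 9.0
density = gaussian_kde(lengths, 0.3, grid)
```

A smaller bandwidth produces a spikier curve that follows individual samples; a larger one smooths them together.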

Parameters
bw_method: int or float

The method used to calculate the estimator bandwidth. See KernelDensity in PySpark for more information.

column: str or list of str, optional

Column name or list of names to be used for creating the kde plot. If None (default), all numeric columns will be used.

ind: list of float, NumPy array or integer, optional

Evaluation points for the estimated PDF. If None (default), 1000 equally spaced points are used. If ind is a list or NumPy array, the KDE is evaluated at the points passed. If ind is an integer, that number of equally spaced points is used.
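The three accepted forms of ind can be pictured with the sketch below. The helper `resolve_ind` is hypothetical, and the choice to pad the data range by half its width on each side is an assumption made for illustration, not something this page states.

```python
def resolve_ind(ind, lo, hi, default_count=1000):
    """Hypothetical helper: turn `ind` into explicit evaluation points.

    `lo` and `hi` stand for the minimum and maximum of the plotted column.
    """
    if ind is None:
        count = default_count          # default: 1000 points
    elif isinstance(ind, int):
        count = ind                    # integer: that many points
    else:
        # A list or array of points is used exactly as given.
        return [float(x) for x in ind]
    # Assumption: extend the data range by half its width on each side.
    span = hi - lo
    start, stop = lo - 0.5 * span, hi + 0.5 * span
    if count == 1:
        return [start]
    step = (stop - start) / (count - 1)
    return [start + step * i for i in range(count)]

points = resolve_ind(100, lo=4.9, hi=7.0)
```

Passing explicit points is useful when several plots must share the same x-axis grid.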

**kwargs: optional

Additional keyword arguments.

Returns
plotly.graph_objs.Figure

Examples

>>> data = [(5.1, 3.5, 0), (4.9, 3.0, 0), (7.0, 3.2, 1), (6.4, 3.2, 1), (5.9, 3.0, 2)]
>>> columns = ["length", "width", "species"]
>>> df = spark.createDataFrame(data, columns)
>>> df.plot.kde(bw_method=0.3, ind=100)  
>>> df.plot.kde(column=["length", "width"], bw_method=0.3, ind=100)  
>>> df.plot.kde(column="length", bw_method=0.3, ind=100)