pyspark.sql.functions.hll_sketch_agg

pyspark.sql.functions.hll_sketch_agg(col: ColumnOrName, lgConfigK: Union[int, pyspark.sql.column.Column, None] = None) → pyspark.sql.column.Column

Aggregate function: returns the updatable binary representation of a DataSketches HllSketch built from the values in col and configured with the lgConfigK argument.

New in version 3.5.0.

Parameters
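col : Column or str or int

The column whose values are aggregated into the sketch.

lgConfigK : int or Column, optional

The log-base-2 of K, where K is the number of buckets or slots for the HllSketch. If omitted, the DataSketches default of 12 is used.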
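As a rough illustration of what lgConfigK controls, the sketch below computes the bucket count K = 2 ** lgConfigK in plain Python, without Spark. The helper names hll_bucket_count and textbook_relative_error are hypothetical, and the 1.04 / sqrt(K) figure is the textbook HyperLogLog estimate, used here as an assumption; it is not a guarantee made by Spark or DataSketches, whose estimators may behave differently.

```python
import math


def hll_bucket_count(lg_config_k: int) -> int:
    """Number of buckets K implied by lgConfigK (K = 2 ** lgConfigK)."""
    return 1 << lg_config_k


def textbook_relative_error(lg_config_k: int) -> float:
    """Textbook HyperLogLog relative standard error, about 1.04 / sqrt(K).

    Assumption: this is the classic approximation, not a Spark guarantee.
    """
    return 1.04 / math.sqrt(hll_bucket_count(lg_config_k))


# Larger lgConfigK means more buckets and a smaller expected error.
for lg_k in (8, 12, 16):
    k = hll_bucket_count(lg_k)
    err = textbook_relative_error(lg_k)
    print(f"lgConfigK={lg_k:2d}  K={k:6d}  approx. relative error={err:.2%}")
```

Each increment of lgConfigK doubles the number of buckets, so the choice is a trade-off between sketch size and estimate accuracy.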

Returns
Column

The binary representation of the HllSketch.

Examples

>>> from pyspark.sql.functions import col, hll_sketch_agg, hll_sketch_estimate, lit
>>> df = spark.createDataFrame([1, 2, 2, 3], "INT")
>>> df1 = df.agg(hll_sketch_estimate(hll_sketch_agg("value")).alias("distinct_cnt"))
>>> df1.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df2 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg("value", lit(12))
... ).alias("distinct_cnt"))
>>> df2.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df3 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg(col("value"), lit(12))).alias("distinct_cnt"))
>>> df3.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+