pyspark.sql.functions.tuple_union_agg_integer

pyspark.sql.functions.tuple_union_agg_integer(col, lgNomEntries=None, mode=None)

Aggregate function: returns the compact binary representation of the DataSketches TupleSketch formed by taking the union of the integer TupleSketch objects in the input column.

New in version 4.2.0.

Parameters
col : Column or column name

The column containing binary TupleSketch representations.

lgNomEntries : Column or int, optional

The base-2 logarithm of the nominal number of entries. Must be between 4 and 26; defaults to 12.

mode : Column or str, optional

The summary mode: “sum” (default), “min”, “max”, or “alwaysone”.
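The two optional parameters can be illustrated with a small pure-Python model. This is only a conceptual sketch, not how Spark or DataSketches implement the union: a plain dict (key to integer summary) stands in for a sketch, sampling is ignored, and the names `nominal_entries` and `toy_union` are hypothetical helpers invented for this illustration. Treating "alwaysone" as "every key reports 1" is likewise an assumption of the model.

```python
import operator

def nominal_entries(lg_nom_entries=12):
    """Nominal entry count implied by lgNomEntries, i.e. 2 ** lgNomEntries."""
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 1 << lg_nom_entries

# One combiner per documented summary mode ("alwaysone" is handled separately).
_COMBINE = {"sum": operator.add, "min": min, "max": max}

def toy_union(sketches, mode="sum"):
    """Merge dicts of key -> integer summary, combining shared keys by `mode`."""
    if mode != "alwaysone" and mode not in _COMBINE:
        raise ValueError(f"unknown mode: {mode!r}")
    merged = {}
    for sketch in sketches:
        for key, summary in sketch.items():
            if mode == "alwaysone":
                # Assumption of this toy model: the summary is pinned to 1.
                merged[key] = 1
            elif key in merged:
                merged[key] = _COMBINE[mode](merged[key], summary)
            else:
                merged[key] = summary
    return merged

a = {1: 10, 2: 20}
b = {2: 5, 3: 30}
print(nominal_entries())       # 4096 for the default lgNomEntries=12
print(toy_union([a, b]))       # {1: 10, 2: 25, 3: 30}
print(toy_union([a, b], "min"))  # {1: 10, 2: 5, 3: 30}
```

In this model the mode only matters for keys present in more than one input (key 2 above); the number of distinct keys in the result is the same for every mode.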

Returns
Column

The binary representation of the merged TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df1 = spark.createDataFrame([(1, 10), (2, 20)], ["key", "value"])
>>> df1 = df1.agg(sf.tuple_sketch_agg_integer("key", "value").alias("sketch"))
>>> df2 = spark.createDataFrame([(3, 30), (4, 40)], ["key", "value"])
>>> df2 = df2.agg(sf.tuple_sketch_agg_integer("key", "value").alias("sketch"))
>>> df3 = df1.union(df2)
>>> df3.agg(sf.tuple_sketch_estimate_integer(sf.tuple_union_agg_integer("sketch"))).show()
+-----------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_union_agg_integer(sketch, 12, sum))|
+-----------------------------------------------------------------------+
|                                                                    4.0|
+-----------------------------------------------------------------------+
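The example above relies on the defaults. As a hedged sketch of passing the optional arguments explicitly, the helper below wraps the same calls; `merged_distinct_estimate` and its parameter names are hypothetical and not part of PySpark, and the function assumes a Spark version that provides `tuple_union_agg_integer` and `tuple_sketch_estimate_integer` (4.2.0 or later, per the note above).

```python
def merged_distinct_estimate(df, sketch_col="sketch", lg_nom_entries=12, mode="sum"):
    """Union the integer TupleSketches in `sketch_col` and estimate distinct keys.

    `df` is expected to hold one binary TupleSketch per row in `sketch_col`.
    """
    # Imported lazily so that defining this helper does not itself require Spark.
    from pyspark.sql import functions as sf

    # Positional arguments follow the documented signature:
    # tuple_union_agg_integer(col, lgNomEntries=None, mode=None)
    merged = sf.tuple_union_agg_integer(sketch_col, lg_nom_entries, mode)
    return df.agg(sf.tuple_sketch_estimate_integer(merged).alias("estimate"))
```

With the `df3` built in the example above, `merged_distinct_estimate(df3, lg_nom_entries=14, mode="max").show()` would merge the two sketches using a larger nominal size and the "max" summary mode.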