pyspark.sql.functions.udtf

pyspark.sql.functions.udtf(cls: Optional[Type] = None, *, returnType: Union[pyspark.sql.types.StructType, str], useArrow: Optional[bool] = None) → Union[pyspark.sql.udtf.UserDefinedTableFunction, Callable[[Type], pyspark.sql.udtf.UserDefinedTableFunction]]

Creates a user-defined table function (UDTF).

New in version 3.5.0.

Parameters
cls : class

the Python user-defined table function handler class.

returnType : pyspark.sql.types.StructType or str

the return type of the user-defined table function. The value can be either a pyspark.sql.types.StructType object or a DDL-formatted struct type string.

useArrow : bool or None, optional

whether to use Arrow to optimize the (de)serializations. When it’s set to None, the Spark config “spark.sql.execution.pythonUDTF.arrow.enabled” is used.

Notes

User-defined table functions (UDTFs) are considered non-deterministic by default. Use asDeterministic() to mark a function as deterministic. E.g.:

>>> from pyspark.sql.functions import udtf
>>> class PlusOne:
...     def eval(self, a: int):
...         yield a + 1,
...
>>> plus_one = udtf(PlusOne, returnType="r: int").asDeterministic()

Use “yield” to produce one row of the UDTF result relation at a time; the “eval” method may yield as many times as needed for each input row. In the context of a lateral join, each result row is associated with the most recent input row consumed by the “eval” method.
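For example, an “eval” method can yield several rows for a single input row. Below is a minimal sketch (the Repeat class name is illustrative, not part of the API):

>>> from pyspark.sql.functions import lit, udtf
>>> @udtf(returnType="n: int")
... class Repeat:
...     def eval(self, x: int):
...         # Emit x result rows for this input row; each yield is a 1-tuple.
...         for n in range(x):
...             yield n,
...
>>> Repeat(lit(3)).show()
+---+
|  n|
+---+
|  0|
|  1|
|  2|
+---+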

User-defined table functions are considered opaque to the optimizer by default. As a result, operations such as filters from WHERE clauses or limits from LIMIT/OFFSET clauses that appear after the UDTF call execute on the UDTF’s result relation rather than being pushed down. By the same token, any relation forwarded as input to a UDTF plans as a full table scan unless filtering or other logic is written explicitly in a table subquery surrounding the provided input relation, as sketched below.
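As a sketch of that last point, filtering can be written explicitly in a table subquery that feeds a lateral join, so the UDTF only consumes the filtered input. This reuses the plus_one UDTF defined above; the registered name “plus_one” is an assumption for illustration:

>>> _ = spark.udtf.register("plus_one", plus_one)
>>> spark.sql(
...     "SELECT * FROM (SELECT id FROM range(10) WHERE id > 7) AS t, "
...     "LATERAL plus_one(t.id)"
... ).show()
+---+---+
| id|  r|
+---+---+
|  8|  9|
|  9| 10|
+---+---+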

User-defined table functions do not accept keyword arguments on the calling side.
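For example, the plus_one UDTF defined above must be invoked with positional arguments only; a minimal sketch (a keyword call such as plus_one(a=lit(1)) would raise a TypeError):

>>> from pyspark.sql.functions import lit
>>> plus_one(lit(1)).show()
+---+
|  r|
+---+
|  2|
+---+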

Examples

Implement the UDTF class and create a UDTF:

>>> from typing import Any
>>> class TestUDTF:
...     def eval(self, *args: Any):
...         yield "hello", "world"
...
>>> from pyspark.sql.functions import udtf
>>> test_udtf = udtf(TestUDTF, returnType="c1: string, c2: string")
>>> test_udtf().show()
+-----+-----+
|   c1|   c2|
+-----+-----+
|hello|world|
+-----+-----+

A UDTF can also be created using the decorator syntax:

>>> @udtf(returnType="c1: int, c2: int")
... class PlusOne:
...     def eval(self, x: int):
...         yield x, x + 1
...
>>> from pyspark.sql.functions import lit
>>> PlusOne(lit(1)).show()
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+

Arrow optimization can be explicitly enabled when creating UDTFs:

>>> @udtf(returnType="c1: int, c2: int", useArrow=True)
... class ArrowPlusOne:
...     def eval(self, x: int):
...         yield x, x + 1
...
>>> ArrowPlusOne(lit(1)).show()
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
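
When useArrow is None, Arrow can instead be enabled for all UDTFs through the Spark configuration named above. A minimal sketch, assuming the session allows setting the config at runtime:

>>> spark.conf.set("spark.sql.execution.pythonUDTF.arrow.enabled", "true")
>>> @udtf(returnType="c1: int, c2: int")
... class ConfigPlusOne:
...     def eval(self, x: int):
...         # useArrow is unset here, so the config above takes effect.
...         yield x, x + 1
...
>>> ConfigPlusOne(lit(1)).show()
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+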