pyspark.sql.DataFrame.inputFiles

DataFrame.inputFiles() → List[str][source]

Returns a best-effort snapshot of the files that compose this DataFrame. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

New in version 3.1.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns
list

List of file paths.

Examples

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a single-row DataFrame into a JSON file
...     spark.createDataFrame(
...         [{"age": 100, "name": "Hyukjin Kwon"}]
...     ).repartition(1).write.json(d, mode="overwrite")
...
...     # Read the JSON file as a DataFrame.
...     df = spark.read.format("json").load(d)
...
...     # Returns the number of input files.
...     len(df.inputFiles())
1