pyspark.SparkContext.addArchive

SparkContext.addArchive(path)

Add an archive to be downloaded with this Spark job on every node. The path can be a local file, a file in HDFS (or another Hadoop-supported filesystem), or an HTTP, HTTPS, or FTP URI.

To access the archive's contents in Spark jobs, call SparkFiles.get() with the archive's filename to find the location where it was downloaded and unpacked. The given path must end in one of .zip, .tar, .tar.gz, .tgz, or .jar.
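The extension rule above can be checked before the archive is submitted. The following is a minimal sketch, assuming the five suffixes listed are the complete set and that matching is case-sensitive; `is_supported_archive` and `SUPPORTED_SUFFIXES` are hypothetical names for illustration and are not part of PySpark.

```python
# Suffixes taken from the description above; treated here as exhaustive (an assumption).
SUPPORTED_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".jar")


def is_supported_archive(path: str) -> bool:
    """Return True if `path` ends with one of the documented archive suffixes.

    Hypothetical helper: this only inspects the string, it does not open
    the file or contact any filesystem.
    """
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return path.endswith(SUPPORTED_SUFFIXES)
```

A caller could run this check on a path and skip or report unsupported files before passing the rest to `sc.addArchive`.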

New in version 3.3.0.

Parameters

path : str
    Can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.

See also

SparkContext.listArchives()
SparkFiles.get()

Notes

A path can be added only once. Subsequent additions of the same path are ignored. This API is experimental.

Examples

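Create two zip archives, each containing a text file with the content ‘100’, add them to the Spark context, and read the text file from the first archive inside a job.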

>>> import os
>>> import tempfile
>>> import zipfile
>>> from pyspark import SparkFiles

>>> with tempfile.TemporaryDirectory(prefix="addArchive") as d:
...     path = os.path.join(d, "test.txt")
...     with open(path, "w") as f:
...         _ = f.write("100")
...
...     zip_path1 = os.path.join(d, "test1.zip")
...     with zipfile.ZipFile(zip_path1, "w", zipfile.ZIP_DEFLATED) as z:
...         z.write(path, os.path.basename(path))
...
...     zip_path2 = os.path.join(d, "test2.zip")
...     with zipfile.ZipFile(zip_path2, "w", zipfile.ZIP_DEFLATED) as z:
...         z.write(path, os.path.basename(path))
...
...     sc.addArchive(zip_path1)
...     arch_list1 = sorted(sc.listArchives)
...
...     sc.addArchive(zip_path2)
...     arch_list2 = sorted(sc.listArchives)
...
...     # Add zip_path2 a second time; this addition is ignored.
...     sc.addArchive(zip_path2)
...     arch_list3 = sorted(sc.listArchives)
...
...     def func(iterator):
...         with open("%s/test.txt" % SparkFiles.get("test1.zip")) as f:
...             mul = int(f.readline())
...             return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()

>>> arch_list1
['file:/.../test1.zip']
>>> arch_list2
['file:/.../test1.zip', 'file:/.../test2.zip']
>>> arch_list3
['file:/.../test1.zip', 'file:/.../test2.zip']
>>> collected
[100, 200, 300, 400]
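The same pattern should carry over to the other listed formats. The sketch below builds a .tar.gz archive with the standard library and wraps the Spark steps in a function. It assumes that `sc` is an active SparkContext and that the archive is unpacked under its own filename, as the zip example above suggests; `build_tar_gz` and `multiply_with_archive` are hypothetical helper names, not part of PySpark.

```python
import os
import tarfile
import tempfile


def build_tar_gz(directory: str, value: str = "100") -> str:
    """Write `value` to test.txt and pack it into test.tar.gz inside `directory`."""
    txt_path = os.path.join(directory, "test.txt")
    with open(txt_path, "w") as f:
        f.write(value)
    archive_path = os.path.join(directory, "test.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the file at the archive root so it unpacks as <root>/test.txt.
        tar.add(txt_path, arcname="test.txt")
    return archive_path


def multiply_with_archive(sc, archive_path, values):
    """Ship the archive with the job and scale `values` by the number stored in it.

    Assumption: `sc` is an active SparkContext supplied by the caller.
    """
    from pyspark import SparkFiles  # imported lazily so the builder works without Spark

    sc.addArchive(archive_path)
    name = os.path.basename(archive_path)

    def func(iterator):
        # Assumption: SparkFiles.get(name) is the directory the archive was unpacked into.
        with open(os.path.join(SparkFiles.get(name), "test.txt")) as f:
            mul = int(f.readline())
        return [x * mul for x in iterator]

    return sc.parallelize(values).mapPartitions(func).collect()


# Local check that needs no Spark: build the archive and list its members.
with tempfile.TemporaryDirectory(prefix="addArchive") as d:
    with tarfile.open(build_tar_gz(d)) as tar:
        members = tar.getnames()
```

With an active context, calling `multiply_with_archive(sc, archive_path, [1, 2, 3, 4])` while the archive still exists on disk would be expected to mirror the `collected` result of the zip example, under the assumptions stated above.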