pyspark.SparkContext.binaryRecords

SparkContext.binaryRecords(path: str, recordLength: int) → pyspark.rdd.RDD[bytes]

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.

New in version 1.3.0.

Parameters

path (str): Directory containing the input data files
recordLength (int): Length of each record, in bytes; the input is split into records of this length

Returns

RDD: RDD of records, each represented as a bytes object of length recordLength

See also

SparkContext.binaryFiles()

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary file
...     with open(os.path.join(d, "1.bin"), "w") as f:
...         for i in range(3):
...             _ = f.write("%04d" % i)
...
...     # Write another file
...     with open(os.path.join(d, "2.bin"), "w") as f:
...         for i in [-1, -2, -10]:
...             _ = f.write("%04d" % i)
...
...     collected = sorted(sc.binaryRecords(d, 4).collect())

>>> collected
[b'-001', b'-002', b'-010', b'0000', b'0001', b'0002']
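
Because each element of the returned RDD is a raw bytes object, numeric records usually need to be decoded before use. The following is a minimal sketch of that step in plain Python, without a Spark cluster. The record layout (one little-endian int32 followed by one little-endian float64) and the helper names `encode_record`, `decode_record`, and `split_records` are assumptions made for illustration; they are not part of the PySpark API.

```python
import struct

# Hypothetical record layout (an assumption for this sketch):
# one little-endian int32 followed by one little-endian float64.
RECORD_FORMAT = "<id"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # 4 + 8 = 12 bytes


def encode_record(key, value):
    """Pack one (int, float) pair into a fixed-length binary record."""
    return struct.pack(RECORD_FORMAT, key, value)


def decode_record(record):
    """Unpack one fixed-length binary record into an (int, float) tuple."""
    return struct.unpack(RECORD_FORMAT, record)


def split_records(data, record_length):
    """Split a flat byte string into consecutive fixed-length records."""
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of the record length")
    return [
        data[i:i + record_length]
        for i in range(0, len(data), record_length)
    ]


# Build a flat binary blob, then recover the original values from it.
rows = [(1, 0.5), (2, 1.25), (-3, 2.0)]
blob = b"".join(encode_record(k, v) for k, v in rows)
decoded = [decode_record(r) for r in split_records(blob, RECORD_LENGTH)]
```

With an active SparkContext `sc`, the same decoding could be applied per record with `sc.binaryRecords(path, RECORD_LENGTH).map(decode_record)`, provided the files were written with that layout.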
