org.apache.spark.sql.ForeachWriter<T>

All Implemented Interfaces: Serializable

public abstract class ForeachWriter<T> extends Object implements Serializable

The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.

A single instance of this class is responsible for all the data generated by a single task in a query. In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.
Any implementation of this class must be serializable because each task will get a fresh serialized-deserialized copy of the provided object. Hence, it is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done after the open(...) method has been called, which signifies that the task is ready to generate data.
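
The recommendation to initialize inside open can be illustrated with a minimal sketch, assuming plain Java serialization stands in for the serialized-deserialized copy each task receives. BufferingWriter, its target field, isConnected, and roundTrip are hypothetical names; the class mirrors the ForeachWriter method signatures but does not extend the real class, so it runs without Spark. The in-memory list is only a stand-in for a real connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a ForeachWriter<String> implementation (not the real class).
class BufferingWriter implements Serializable {
    private static final long serialVersionUID = 1L;

    // Serializable configuration travels with the object.
    private final String target;

    // Stand-in for a connection: transient, so it is null after deserialization.
    private transient List<String> connection;

    BufferingWriter(String target) {
        this.target = target;
    }

    boolean open(long partitionId, long epochId) {
        connection = new ArrayList<>(); // initialize here, on the task side
        return true;
    }

    void process(String value) {
        connection.add(target + ":" + value);
    }

    void close(Throwable errorOrNull) {
        connection = null; // release the stand-in resource
    }

    boolean isConnected() {
        return connection != null;
    }

    // Serializes and deserializes the writer, mimicking the copy a task would get.
    static BufferingWriter roundTrip(BufferingWriter writer) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(writer);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (BufferingWriter) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the copy returned by roundTrip has no connection until open is called, which is why a resource created before serialization would not be usable in the task.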

The lifecycle of the methods is as follows.

   For each partition with `partitionId`:
       For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
           Method `open(partitionId, epochId)` is called.
           If `open` returns true:
                For each row in the partition and batch/epoch, method `process(row)` is called.
           Method `close(errorOrNull)` is called with error (if any) seen while processing rows.
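
The call order above can be sketched as a small driver, written in plain Java so that it runs without Spark. SketchWriter, RecordingWriter, and LifecycleSketch.runPartition are hypothetical names introduced for illustration and are not part of the Spark API. Rethrowing the processing error after close is an assumption of this sketch, not something this page states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the three ForeachWriter methods (not the real class).
abstract class SketchWriter<T> {
    abstract boolean open(long partitionId, long epochId);
    abstract void process(T value);
    abstract void close(Throwable errorOrNull);
}

// Records every call so the order can be inspected.
class RecordingWriter extends SketchWriter<String> {
    final List<String> calls = new ArrayList<>();
    private final boolean accept;

    RecordingWriter(boolean accept) {
        this.accept = accept;
    }

    @Override
    boolean open(long partitionId, long epochId) {
        calls.add("open(" + partitionId + "," + epochId + ")");
        return accept;
    }

    @Override
    void process(String value) {
        calls.add("process(" + value + ")");
    }

    @Override
    void close(Throwable errorOrNull) {
        calls.add("close(" + (errorOrNull == null ? "null" : "error") + ")");
    }
}

class LifecycleSketch {
    // Drives one partition of one batch/epoch through the call order listed above.
    static <T> void runPartition(SketchWriter<T> writer, long partitionId,
                                 long epochId, List<T> rows) {
        // If open throws, the exception propagates and close is never reached.
        boolean shouldProcess = writer.open(partitionId, epochId);
        try {
            if (shouldProcess) {
                for (T row : rows) {
                    writer.process(row);
                }
            }
        } catch (RuntimeException | Error e) {
            writer.close(e); // close receives the error seen while processing rows
            throw e;         // assumption of this sketch: the failure is rethrown
        }
        writer.close(null);  // no error: close receives null
    }
}
```

With a RecordingWriter that accepts the partition and rows "a" and "b", the recorded calls are open, process, process, close. With one that returns false from open, only open and close are recorded.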

Important points to note:

Spark does not guarantee the same output for a given (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). For example, the source may provide a different number of partitions for some reason, or a Spark optimization may change the number of partitions. Refer to SPARK-28650 for more details. If you need deduplication on output, try foreachBatch instead.
The close() method is called whenever the open() method returns successfully (irrespective of the return value), unless the JVM crashes in the middle.

Scala example:


   import org.apache.spark.sql.ForeachWriter

   datasetOfString.writeStream.foreach(new ForeachWriter[String] {

     def open(partitionId: Long, epochId: Long): Boolean = {
       // open connection
       true // process this partition and epoch
     }

     def process(record: String): Unit = {
       // write string to connection
     }

     def close(errorOrNull: Throwable): Unit = {
       // close the connection
     }
   })

Java example:


  import org.apache.spark.sql.ForeachWriter;

  datasetOfString.writeStream().foreach(new ForeachWriter<String>() {

    @Override
    public boolean open(long partitionId, long epochId) {
      // open connection
      return true; // process this partition and epoch
    }

    @Override
    public void process(String value) {
      // write string to connection
    }

    @Override
    public void close(Throwable errorOrNull) {
      // close the connection
    }
  });

Since:

2.0.0

See Also:

Serialized Form

Constructor Summary

| Constructor | Description |
| --- | --- |
| ForeachWriter() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| abstract void | close(Throwable errorOrNull) | Called when processing of one partition of new data stops on the executor side. |
| abstract boolean | open(long partitionId, long epochId) | Called when starting to process one partition of new data in the executor. |
| abstract void | process(T value) | Called to process the data on the executor side. |

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ForeachWriter
  
  public ForeachWriter()
Method Details
- close
  
  public abstract void close(Throwable errorOrNull)
  Called when processing of one partition of new data stops on the executor side. This is guaranteed to be called whether open returns true or false. However, close won't be called in the following cases:
  
  - The JVM crashes without throwing a Throwable.
  - open throws a Throwable.
  Parameters:
  
  errorOrNull - the error thrown while processing data, or null if there was no error.
- open
  
  public abstract boolean open(long partitionId, long epochId)
  
  Called when starting to process one partition of new data in the executor. See the class docs for more information on how to use the partitionId and epochId.
  
  Parameters:
  
  partitionId - the partition id.
  
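  epochId - the id of the batch/epoch of streaming data. As the class docs note, (partitionId, epochId) cannot be relied on for output deduplication.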
  
  Returns:
  
  true if the corresponding partition and epoch id should be processed. false indicates the partition should be skipped.
- process
  
  public abstract void process(T value)
  
  Called to process the data on the executor side. This method will be called only if open returns true.
  
  Parameters:
  
  value - the row to process.
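
As an illustration of how the errorOrNull parameter of close might be used, here is a minimal sketch of a writer that stages rows in process and publishes them only when close receives null. StagingWriter and its committed list are hypothetical; the class mirrors the ForeachWriter method signatures but does not extend the real class, and the in-memory list stands in for an external storage system.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer that stages rows and publishes them in close (not the real class).
class StagingWriter {
    // Stand-in for an external sink; a real writer would talk to a storage system.
    final List<String> committed = new ArrayList<>();

    // Rows received since the last open; null when no partition is open.
    private List<String> staged;

    boolean open(long partitionId, long epochId) {
        staged = new ArrayList<>(); // start a fresh staging area for this partition
        return true;
    }

    void process(String value) {
        staged.add(value);
    }

    void close(Throwable errorOrNull) {
        if (errorOrNull == null && staged != null) {
            committed.addAll(staged); // no error: publish everything staged
        }
        staged = null; // on error the staged rows are simply discarded
    }
}
```

The design choice here is that nothing becomes visible in the sink until the whole partition has been processed without error. Because close is not called when the JVM crashes or when open throws, a real implementation could not rely on close alone for cleanup.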
