@InterfaceStability.Evolving
public interface DataSourceReader

A data source reader that is returned by ReadSupport.createReader(DataSourceOptions) or ReadSupportWithSchema.createReader(StructType, DataSourceOptions). It can mix in various query optimization interfaces to speed up the data scan. The actual scan logic is delegated to the DataReaderFactory instances returned by createDataReaderFactories().
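To make the contract concrete, here is a minimal sketch of such a reader. The types below (`DataReaderFactory`, `StructTypeLike`, `SimpleDataSourceReader`, `InMemoryReader`) are hypothetical, simplified stand-ins for the real interfaces in `org.apache.spark.sql.sources.v2.reader`, so the example is self-contained:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: the real DataReaderFactory creates DataReaders
// that perform the per-partition scan.
interface DataReaderFactory<T> { }

// Hypothetical stand-in for Spark's StructType schema object.
interface StructTypeLike { String[] fieldNames(); }

// The shape of DataSourceReader: report a schema, then hand out the
// factories that do the actual reading.
interface SimpleDataSourceReader {
    StructTypeLike readSchema();
    List<DataReaderFactory<Object[]>> createDataReaderFactories();
}

class InMemoryReader implements SimpleDataSourceReader {
    public StructTypeLike readSchema() {
        // The schema this reader will actually produce.
        return () -> new String[] { "id", "value" };
    }

    public List<DataReaderFactory<Object[]>> createDataReaderFactories() {
        // Typically one factory per partition of the underlying data.
        return Arrays.asList(new DataReaderFactory<Object[]>() { },
                             new DataReaderFactory<Object[]>() { });
    }
}
```

The number of returned factories determines the parallelism of the scan: each factory is sent to an executor and produces one reader for its partition.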
There are mainly 3 kinds of query optimizations:
1. Operator push-down. E.g., filter push-down, required-columns push-down (aka column pruning), etc. Names of these interfaces start with `SupportsPushDown`.
2. Information reporting. E.g., statistics reporting, ordering reporting, etc. Names of these interfaces start with `SupportsReporting`.
3. Special scans. E.g., columnar scan, unsafe-row scan, etc. Names of these interfaces start with `SupportsScan`. Note that a reader should implement at most one of the special scans; if more than one is implemented, only one of them is respected, according to the priority list from high to low: `SupportsScanColumnarBatch`, `SupportsScanUnsafeRow`.
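As an illustration of the first kind, a reader that supports column pruning receives the required columns before the scan and narrows its reported schema accordingly. The types below are hypothetical simplifications of the real `SupportsPushDownRequiredColumns` mix-in, kept self-contained for the sketch:

```java
// Hypothetical stand-in for the SupportsPushDownRequiredColumns mix-in
// (the real one lives in org.apache.spark.sql.sources.v2.reader and
// works with StructType rather than raw field names).
interface SupportsColumnPruning {
    void pruneColumns(String[] requiredColumns);
    String[] readSchemaFields();
}

class PruningReader implements SupportsColumnPruning {
    // Physical schema of the underlying storage.
    private String[] fields = { "id", "name", "value" };

    // Spark calls this before the scan with only the columns the query needs.
    public void pruneColumns(String[] requiredColumns) {
        this.fields = requiredColumns.clone();
    }

    // The reported schema then reflects the pruning, which is why
    // readSchema() may differ from the physical schema.
    public String[] readSchemaFields() {
        return fields;
    }
}
```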
If an exception is thrown while applying any of these query optimizations, the action fails and no Spark job is submitted.
Spark first applies all operator push-down optimizations that this data source supports. Then Spark collects the information this data source reports for further optimizations. Finally, Spark issues the scan request and does the actual data reading.

Modifier and Type | Method and Description
---|---
`java.util.List<DataReaderFactory<Row>>` | `createDataReaderFactories()` Returns a list of reader factories.
`StructType` | `readSchema()` Returns the actual schema of this data source reader, which may be different from the physical schema of the underlying storage, as column pruning or other optimizations may happen.
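The three-phase order Spark follows (push-down, then information collection, then the scan) can be sketched with a toy driver. All types here are hypothetical simplifications, not the real planner API:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// A toy driver illustrating the sequencing: push-downs are applied first,
// reported information is collected next, and the scan runs last.
class ScanPlanner {
    interface Reader {
        void pushFilters(List<String> filters); // stand-in for SupportsPushDown* mix-ins
        long estimatedRows();                   // stand-in for statistics reporting
        List<String> scan();                    // stand-in for the factory-driven scan
    }

    static List<String> plan(Reader reader) {
        reader.pushFilters(Arrays.asList("id > 10")); // 1. operator push-down
        long rows = reader.estimatedRows();           // 2. information reporting
        if (rows == 0) {
            return Collections.emptyList();           //    e.g. skip an empty source
        }
        return reader.scan();                         // 3. issue the actual scan
    }
}
```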