- ORC Implementation
- Vectorized Reader
- Schema Merging
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Data Source Option
Apache ORC is a columnar format which has more advanced features like native zstd compression, bloom filter and columnar encryption.
Spark supports two ORC implementations (
hive) which is controlled by
Two implementations share most functionalities with different design goals.
nativeimplementation is designed to follow Spark’s data source behavior like
hiveimplementation is designed to follow Hive’s behavior and uses Hive SerDe.
For example, historically,
native implementation handles
CHAR/VARCHAR with Spark’s native
hive implementation handles it via Hive
CHAR/VARCHAR. The query results are different. Since Spark 3.1.0, SPARK-33480 removes this difference by supporting
CHAR/VARCHAR from Spark-side.
native implementation supports a vectorized ORC reader and has been the default ORC implementaion since Spark 2.3.
The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause
USING ORC) when
spark.sql.orc.impl is set to
spark.sql.orc.enableVectorizedReader is set to
For nested data types (array, map and struct), vectorized reader is disabled by default. Set
true to enable vectorized reader for these types.
For the Hive ORC serde tables (e.g., the ones created using the clause
USING HIVE OPTIONS (fileFormat 'ORC')),
the vectorized reader is used when
spark.sql.hive.convertMetastoreOrc is also set to
true, and is turned on by default.
Like Protocol Buffer, Avro, and Thrift, ORC also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple ORC files with different but mutually compatible schemas. The ORC data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default . You may enable it by
- setting data source option
truewhen reading ORC files, or
- setting the global SQL option
Spark supports both Hadoop 2 and 3. Since Spark 3.2, you can take advantage of Zstandard compression in ORC files on both Hadoop versions. Please see Zstandard for the benefits.
You can control bloom filters and dictionary encodings for ORC data sources. The following ORC example will create bloom filter and use dictionary encoding only for
favorite_color. To find more detailed information about the extra ORC options, visit the official Apache ORC websites.
Since Spark 3.2, columnar encryption is supported for ORC tables with Apache ORC 1.6. The following example is using Hadoop KMS as a key provider with the given location. Please visit Apache Hadoop KMS for the detail.
Hive metastore ORC table conversion
When reading from Hive metastore ORC tables and inserting to Hive metastore ORC tables, Spark SQL will try to use its own ORC support instead of Hive SerDe for better performance. For CTAS statement, only non-partitioned Hive metastore ORC tables are converted. This behavior is controlled by the
spark.sql.hive.convertMetastoreOrc configuration, and is turned on by default.
|Property Name||Default||Meaning||Since Version|
The name of ORC implementation. It can be one of
Enables vectorized orc decoding in
Enables vectorized orc decoding in
When true, the ORC data source merges schemas collected from all data files, otherwise the schema is picked from a random data file.
||true||When set to false, Spark SQL will use the Hive SerDe for ORC tables instead of the built in support.||2.0.0|
Data Source Option
Data source options of ORC can be set via:
OPTIONSclause at CREATE TABLE USING DATA_SOURCE
||sets whether we should merge schemas collected from all ORC part-files. This will override
||compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, zstd and lz4). This will override
Other generic options can be found in Generic File Source Options.