Read modes
Every loader has modes that decide what to do when Spark comes across a malformed row or file.
There are 3 read modes:
PERMISSIVE
- The default mode. When a corrupt record is read, all column values are set to null and the malformed row's raw value is placed in a string column called _corrupt_record.
DROPMALFORMED
- Removes malformed records: it skips over malformed records and loads only the healthy ones.
FAILFAST
- Raises an exception and terminates immediately on the first malformed record.
The default read mode is PERMISSIVE.
While using PERMISSIVE mode, you can override the name of the corrupt record column using the option columnNameOfCorruptRecord, or globally via the spark.sql.columnNameOfCorruptRecord configuration (changing the default column name is not recommended, to maintain common programming standards).
...
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord","corrupt_record")
...
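To make the three behaviors concrete, here is a small plain-Python sketch (not Spark itself) that mimics how each read mode treats a malformed row against an expected two-column schema. The function name, schema, and sample data are illustrative only.

```python
# Plain-Python illustration of Spark's read-mode semantics (not Spark code).
# Each "row" is expected to have exactly two fields: id (int) and name (str).

def parse_rows(lines, mode="PERMISSIVE"):
    rows = []
    for line in lines:
        parts = line.split(",")
        try:
            if len(parts) != 2:
                raise ValueError("wrong number of fields")
            rows.append({"id": int(parts[0]), "name": parts[1],
                         "_corrupt_record": None})
        except ValueError:
            if mode == "PERMISSIVE":
                # All columns set to null; raw line kept in _corrupt_record.
                rows.append({"id": None, "name": None, "_corrupt_record": line})
            elif mode == "DROPMALFORMED":
                continue  # silently skip the bad row
            elif mode == "FAILFAST":
                raise RuntimeError(f"Malformed record: {line!r}")
    return rows

data = ["1,alice", "oops", "2,bob"]

print(len(parse_rows(data, "PERMISSIVE")))     # 3 rows, one with _corrupt_record set
print(len(parse_rows(data, "DROPMALFORMED")))  # 2 rows, the bad one is dropped
```

The key difference is visible in the row counts: PERMISSIVE preserves every input row (moving the bad one into the corrupt-record column), DROPMALFORMED shrinks the result, and FAILFAST never returns at all.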
How to set read mode
Read mode is set using the option() method (unlike the writer, the DataFrameReader has no mode() method). Example:
sales_df = (
spark.read
.format("parquet")
.option("mode", "FAILFAST")
.load("data/sales/2022/01/*.parquet")
)
Write modes
Every writer has modes to decide what to do when data already exists in the destination.
There are 4 write modes:
append
- Creates new files without touching existing data (appends to existing data).
overwrite
- If data files already exist, removes the existing data files and writes new ones.
error or errorIfExists
- If data files already exist, stops execution and throws an exception.
ignore
- Writes data if and only if the target directory is empty. If data files already exist, silently ignores the write operation (it does nothing).
The default write mode is errorIfExists.
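The four behaviors above can be sketched in plain Python. This is an illustration of the semantics only, not Spark code: the destination directory is modeled as a dict mapping file names to contents.

```python
# Plain-Python illustration of Spark's write-mode semantics (not Spark code).
# The "destination" is modeled as a dict of {file name: contents}.

def write(dest, new_files, mode="errorIfExists"):
    if mode == "append":
        dest.update(new_files)          # keep existing files, add new ones
    elif mode == "overwrite":
        dest.clear()                    # remove existing files first
        dest.update(new_files)
    elif mode in ("error", "errorIfExists"):
        if dest:                        # any existing data stops the write
            raise FileExistsError("destination already contains data")
        dest.update(new_files)
    elif mode == "ignore":
        if not dest:                    # write only if the destination is empty
            dest.update(new_files)      # otherwise do nothing, silently
    return dest

existing = {"part-0000": "old"}
print(write(dict(existing), {"part-0001": "new"}, "append"))     # both files
print(write(dict(existing), {"part-0001": "new"}, "overwrite"))  # only the new file
print(write(dict(existing), {"part-0001": "new"}, "ignore"))     # unchanged
```

Note how ignore and errorIfExists differ only in how they react to existing data: one does nothing silently, the other fails loudly.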
How to set write mode
There are 2 ways to set write mode:
- Using the option() method to set write mode. Example:
(
df.write
.format("parquet")
.option("mode", "append")
.option("path", "/data/myparquetfiles/")
.save()
)
- Using the mode() method to set write mode. Example:
(
df.write
.format("parquet")
.mode("append")
.option("path", "/data/myparquetfiles/")
.save()
)