Cast string column to Date or DateTime using to_date() or to_timestamp()

By Niraj Zade | 2024-01-28 | Tags: guide


There are two time formats we deal with: Date and DateTime (timestamp).

In PySpark, we use:

  1. to_date() to cast to a Date
  2. to_timestamp() to cast to a DateTime (timestamp), with up to microsecond precision

to_date()

Documentation link - pyspark.sql.functions.to_date

Parameters:

  • col: Column or str - input column of values to convert.
  • format: str (optional) - format string used to parse the values.

Returns: Column. Date value as pyspark.sql.types.DateType.

BEST PRACTICE: The format parameter is optional, but it is best practice to specify it whenever you use to_date() (or any time-based casting function), even when you're relying on the default format.
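To see the pitfall, here's a minimal sketch, assuming a running SparkSession named spark and default (non-ANSI) settings: when format is omitted, Spark falls back on its default parsing, and a string in any other layout quietly becomes null (with ANSI mode enabled, it raises an error instead).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "28/01/2024" doesn't match the default yyyy-MM-dd layout, so with
# default (non-ANSI) settings the cast silently returns null
df = spark.createDataFrame([("28/01/2024",)], ["date_str"])
df.select(F.to_date(F.col("date_str")).alias("date")).show()

The examples below always pass format explicitly.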

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# create dataframe for the example
df = spark.createDataFrame([("2024-01-28 10:30:00",)], ["datetime_str"])

# cast string col "datetime_str" to date col "date"
df_casted = df.select(
    F.to_date(F.col("datetime_str"), format="yyyy-MM-dd HH:mm:ss")
    .alias("date")
)

# show the dataframe
df_casted.show()

Output:

+----------+
|      date|
+----------+
|2024-01-28|
+----------+
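The format string can describe whatever layout the source data actually uses. A small sketch, assuming a running SparkSession named spark, parsing a day-first string:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "dd/MM/yyyy" describes a day-first layout like "28/01/2024"
df = spark.createDataFrame([("28/01/2024",)], ["date_str"])
df.select(
    F.to_date(F.col("date_str"), format="dd/MM/yyyy").alias("date")
).show()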

to_timestamp()

Documentation link - pyspark.sql.functions.to_timestamp

Parameters:

  • col: Column or str - input column of values to convert.
  • format: str (optional) - format string used to parse the values.

Returns: Column. Timestamp value as pyspark.sql.types.TimestampType.

BEST PRACTICE: The format parameter is optional, but it is best practice to specify it whenever you use to_timestamp() (or any time-based casting function), even when you're relying on the default format.

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# create dataframe for the example
df = spark.createDataFrame([("2023-01-28 10:30:00",)], ["datetime_str"])

# cast string col "datetime_str" to timestamp col "datetime"
format_string = "yyyy-MM-dd HH:mm:ss"
df_casted = df.select(
    F.to_timestamp(F.col("datetime_str"), format=format_string)
    .alias("datetime")
)

# show the dataframe
df_casted.show()

Output:

+-------------------+
|           datetime|
+-------------------+
|2023-01-28 10:30:00|
+-------------------+
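The "up to microsecond precision" claim from the start of this article can be exercised with the fraction-of-second symbol S (see the table below). A minimal sketch, assuming a running SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# six S symbols parse microseconds, the finest precision TimestampType stores
df = spark.createDataFrame([("2023-01-28 10:30:00.123456",)], ["datetime_str"])
df.select(
    F.to_timestamp(F.col("datetime_str"), "yyyy-MM-dd HH:mm:ss.SSSSSS")
    .alias("datetime")
).show(truncate=False)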

All date & time formatters

NOTE:

PySpark uses Java datetime format patterns, not Python datetime format codes.

This is a list of all the formatter symbols you can use to compose the format string for your data:

| Symbol | Meaning                      | Presentation | Examples                                      |
|--------|------------------------------|--------------|-----------------------------------------------|
| G      | era                          | text         | AD; Anno Domini                               |
| y      | year                         | year         | 2020; 20                                      |
| D      | day-of-year                  | number(3)    | 189                                           |
| M/L    | month-of-year                | month        | 7; 07; Jul; July                              |
| d      | day-of-month                 | number(2)    | 28                                            |
| Q/q    | quarter-of-year              | number/text  | 3; 03; Q3; 3rd quarter                        |
| E      | day-of-week                  | text         | Tue; Tuesday                                  |
| F      | aligned day of week in month | number(1)    | 3                                             |
| a      | am-pm-of-day                 | am-pm        | PM                                            |
| h      | clock-hour-of-am-pm (1-12)   | number(2)    | 12                                            |
| K      | hour-of-am-pm (0-11)         | number(2)    | 0                                             |
| k      | clock-hour-of-day (1-24)     | number(2)    | 0                                             |
| H      | hour-of-day (0-23)           | number(2)    | 0                                             |
| m      | minute-of-hour               | number(2)    | 30                                            |
| s      | second-of-minute             | number(2)    | 55                                            |
| S      | fraction-of-second           | fraction     | 978                                           |
| V      | time-zone ID                 | zone-id      | America/Los_Angeles; Z; -08:30                |
| z      | time-zone name               | zone-name    | Pacific Standard Time; PST                    |
| O      | localized zone-offset        | offset-O     | GMT+8; GMT+08:00; UTC-08:00                   |
| X      | zone-offset 'Z' for zero     | offset-X     | Z; -08; -0830; -08:30; -083015; -08:30:15     |
| x      | zone-offset                  | offset-x     | +0000; -08; -0830; -08:30; -083015; -08:30:15 |
| Z      | zone-offset                  | offset-Z     | +0000; -0800; -08:00                          |
| '      | escape for text              | delimiter    |                                               |
| ''     | single quote                 | literal      | '                                             |
| [      | optional section start       |              |                                               |
| ]      | optional section end         |              |                                               |

Source - Documentation - Datetime Patterns for Formatting and Parsing
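To show how these symbols compose, here's a sketch (assuming a running SparkSession named spark) that combines the zone-offset symbol X with a quoted 'T' literal to parse an ISO-8601 style string:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 'T' is a quoted literal; XXX parses offsets like +05:30.
# The parsed instant is displayed in the session time zone.
df = spark.createDataFrame([("2024-01-28T10:30:00+05:30",)], ["ts_str"])
df.select(
    F.to_timestamp(F.col("ts_str"), "yyyy-MM-dd'T'HH:mm:ssXXX")
    .alias("datetime")
).show()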

This section took more time to create than expected. The only place you get a list of these formatters is a URL buried in a runtime exception's message:

SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to recognize pattern in the DateTimeFormatter.
1) You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html.
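If you hit this exception on a pattern carried over from Spark 2.x, the first suggestion from the message looks like this in PySpark. Note that this is a session-wide switch back to the old Java SimpleDateFormat behavior, so prefer fixing the pattern with the table above when you can:

# restore pre-Spark-3.0 datetime parsing behavior for the whole session
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")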