Cast string to Date or DateTime using to_date() and to_timestamp() - work in progress

By Niraj Zade | 2024-01-28 | Tags: guide


There are two time types we deal with - Date and DateTime (timestamp).

In PySpark, we use:

  1. to_date() for generating a Date
  2. to_timestamp() for generating a DateTime (timestamp), up to microsecond precision.

to_date()

Documentation link - pyspark.sql.functions.to_date

Parameters:

  • col: Column or str - column values to convert.
  • format: str (optional parameter) - format string used to convert timestamp values.

Returns: Column. Date value as pyspark.sql.types.DateType type.

The format parameter is optional, but it is best practice to specify it whenever you use to_date() (or any time-based casting function).

Example:

from pyspark.sql import functions as F

# create dataframe
df = spark.createDataFrame([('2024-01-28 10:30:00',)], ['datetime_str'])

# cast string col "datetime_str" to datetime col "datetime"
df_casted = df.select(F.to_date(F.col("datetime_str"), 'yyyy-MM-dd HH:mm:ss').alias('date'))

# show the dataframe
df_casted.show()

Output:

+----------+
|      date|
+----------+
|2024-01-28|
+----------+

to_timestamp()

Documentation link - pyspark.sql.functions.to_timestamp

Parameters:

  • col: Column or str - input column of values to convert.
  • format: str (optional parameter) - format string used to convert timestamp values.

Returns: Column. Timestamp value as pyspark.sql.types.TimestampType type.

The format parameter is optional, but it is best practice to specify it whenever you use to_timestamp() (or any time-based casting function).

Example:

from pyspark.sql import functions as F

# create dataframe
df = spark.createDataFrame([('2023-01-28 10:30:00',)], ['datetime_str'])

# cast string col "datetime_str" to datetime col "datetime"
df_casted = df.select(F.to_timestamp(F.col("datetime_str"), 'yyyy-MM-dd HH:mm:ss').alias('datetime'))

# show the dataframe
df_casted.show()

Output:

+-------------------+
|           datetime|
+-------------------+
|2023-01-28 10:30:00|
+-------------------+

All date & time formatters

WARNING:

PySpark doesn't use Python's datetime format codes. It has its own set of pattern symbols.

This is the list of all formatters you can use to build the format string for your data:

| Symbol | Meaning                       | Presentation | Examples                                   |
|--------|-------------------------------|--------------|--------------------------------------------|
| G      | era                           | text         | AD; Anno Domini                            |
| y      | year                          | year         | 2020; 20                                   |
| D      | day-of-year                   | number(3)    | 189                                        |
| M/L    | month-of-year                 | month        | 7; 07; Jul; July                           |
| d      | day-of-month                  | number(2)    | 28                                         |
| Q/q    | quarter-of-year               | number/text  | 3; 03; Q3; 3rd quarter                     |
| E      | day-of-week                   | text         | Tue; Tuesday                               |
| F      | aligned day of week in month  | number(1)    | 3                                          |
| a      | am-pm-of-day                  | am-pm        | PM                                         |
| h      | clock-hour-of-am-pm (1-12)    | number(2)    | 12                                         |
| K      | hour-of-am-pm (0-11)          | number(2)    | 0                                          |
| k      | clock-hour-of-day (1-24)      | number(2)    | 0                                          |
| H      | hour-of-day (0-23)            | number(2)    | 0                                          |
| m      | minute-of-hour                | number(2)    | 30                                         |
| s      | second-of-minute              | number(2)    | 55                                         |
| S      | fraction-of-second            | fraction     | 978                                        |
| V      | time-zone ID                  | zone-id      | America/Los_Angeles; Z; -08:30             |
| z      | time-zone name                | zone-name    | Pacific Standard Time; PST                 |
| O      | localized zone-offset         | offset-O     | GMT+8; GMT+08:00; UTC-08:00;               |
| X      | zone-offset 'Z' for zero      | offset-X     | Z; -08; -0830; -08:30; -083015; -08:30:15; |
| x      | zone-offset                   | offset-x     | +0000; -08; -0830; -08:30; -083015; -08:30:15; |
| Z      | zone-offset                   | offset-Z     | +0000; -0800; -08:00;                      |
| '      | escape for text               | delimiter    |                                            |
| ''     | single quote                  | literal      | '                                          |
| [      | optional section start        |              |                                            |
| ]      | optional section end          |              |                                            |

Source - Documentation - Datetime Patterns for Formatting and Parsing

This section took more time to create than expected. The only way I found this list of formatters was through a URL printed in a runtime exception's message:

SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to recognize pattern in the DateTimeFormatter.
1) You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html.

Year

Month

Day

Hour

Minute

Second

Fraction of second (up to microsecond)


© pySparkGuide.com 2024 | Website was autogenerated on 2024-04-24

Brought to you by Niraj Zade - Website, Linkedin

~ whoever owns storage, owns computing ~