Tested on spark 3.5
There are 3 ways through which we control a window:
- Creating an ordered or un-ordered window by specifying or not specifying the ordering of the window (
ORDER BY
) - Control the nature of window growth using
ROWS BETWEEN
orRANGE BETWEEN
- Control the window bounds using
UNBOUNDED PRECEDING
,UNBOUNDED FOLLOWING
,CURRENT ROW
For this article, I'm assuming that you know the keywords in the bullet points above.
We have a lot of ways to specify a window. However, the PySpark SQL engine only accepts specific kinds of windows for specific window functions. We cannot use any specification of window for any function.
Note - When we create an ordered window, the window frame specification will default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
rank, dense_rank, percent_rank, row_number, ntile
These functions:
- Only work with ordered windows
- Only work with
ROWS BETWEEN
. Won't work withRANGE BETWEEN
. - Only work if the window bounds are set to
UNBOUNDED PRECEDING AND CURRENT ROW
# Only acceptable combination
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
If you use some other combination, spark will throw an exception:
AnalysisException: Window Frame ... must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()).
So, as an example, this is the only window specification that will work with these functions:
acceptable_window = (
Window
.partitionBy(F.col("col_partition"))
.orderBy(F.col("col_order"))
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
lag, lead
These are the most restrictive window functions
These functions:
- Only work with ordered windows
You cannot even specify rowsBetween() or rangeBetween(), even if it matches the window requirement of the lag and lead functions.
AnalysisException: Cannot specify window frame for lag function.
So, as an example, this is the only window specification that will work with these functions:
acceptable_window = (
Window
.partitionBy(F.col("col_partition"))
.orderBy(F.col("col_order"))
)
avg, sum, min, max
These functions accept any and every kind of window:
- Accept both ordered or un-ordered
- Accept both kinds of window expansion -
ROWS BETWEEN
orRANGE BETWEEN
- Accept any combination of
UNBOUNDED PRECEDING
,UNBOUNDED FOLLOWING
,CURRENT ROW
I didn't find this information in almost all of the window function related articles I found online. So I hope this article fills up a hole in the internet.
That's it. Enjoy.