Note on window restrictions for window functions

By Niraj Zade | 2024-04-16 | Tags: theory


Tested on Spark 3.5

There are 3 ways to control a window:

  • Make the window ordered or un-ordered by specifying (or not specifying) its ordering (ORDER BY)
  • Control how the window frame grows using ROWS BETWEEN or RANGE BETWEEN
  • Control the frame bounds using UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW

For this article, I'm assuming that you know the keywords in the bullet points above.

This gives us a lot of ways to specify a window. However, the Spark SQL engine only accepts specific kinds of windows for specific window functions. Not every window specification is valid for every function.

Note - When we create an ordered window without an explicit frame, the frame specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

rank, dense_rank, percent_rank, row_number, ntile

These functions:

  • Only work with ordered windows
  • Only work with ROWS BETWEEN. Won't work with RANGE BETWEEN.
  • Only work if the window bounds are set to UNBOUNDED PRECEDING AND CURRENT ROW
# Only acceptable combination
.rowsBetween(Window.unboundedPreceding, Window.currentRow)

If you use any other combination, Spark will throw an exception:

AnalysisException: Window Frame ... must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()).

So, as an example, this is the only window specification that will work with these functions:

acceptable_window = (
    Window
    .partitionBy(F.col("col_partition"))
    .orderBy(F.col("col_order"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

lag, lead

These are the most restrictive window functions.

These functions:

  • Only work with ordered windows

You cannot specify rowsBetween() or rangeBetween() at all - not even a frame that matches what the lag and lead functions require. If you do, Spark throws an exception:

AnalysisException: Cannot specify window frame for lag function.

So, as an example, this is the only window specification that will work with these functions:

acceptable_window = (
    Window
    .partitionBy(F.col("col_partition"))
    .orderBy(F.col("col_order"))
)

avg, sum, min, max

These functions accept every kind of window:

  • Accept both ordered and un-ordered windows
  • Accept both kinds of frame growth - ROWS BETWEEN or RANGE BETWEEN
  • Accept any combination of UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW

I couldn't find this information in almost any of the window function articles available online. So I hope this article fills a hole in the internet.

That's it. Enjoy.


© pySparkGuide.com 2024 | Website was autogenerated on 2024-04-24

Brought to you by Niraj Zade - Website, Linkedin

~ whoever owns storage, owns computing ~