python - Optimisation of window aggregations: Pushing per-element expressions out of the window aggregation - Stack Overflow


I want to understand the performance implications of elementwise transformations on rolling window aggregation. Consider the following two versions of a rolling aggregation (of floating values):

I)

X = frame.rolling(index_column="date", group_by="group", period="360d").agg(
    pl.col("value").sin().sum().alias("sin(value)"),
    pl.col("value").cos().sum().alias("cos(value)"),
    pl.col("value").sum()
)

II)

Y = frame.with_columns(
    pl.col("value").sin().alias("sin(value)"),
    pl.col("value").cos().alias("cos(value)")
).rolling(index_column="date", group_by="group", period="360d").agg(
    pl.col("sin(value)").sum(),
    pl.col("cos(value)").sum(),
    pl.col("value").sum())

Naively, I'd expect the second version to be universally faster than the first, since by design it avoids redundantly recomputing sin(value) and cos(value) for each window (and group).

I was, however, surprised to find that both versions have almost identical runtimes across different sizes of the group and time dimensions. How is that possible? Is Polars automagically pushing the elementwise transformations (sin and cos) out of the rolling window aggregation?

In addition, for a large number of dates the second version can even be slower than the first, cf. image below.

Can anyone help me understand what is going on here?

Full code for the experiment is below:

import datetime
import itertools
import time

import numpy as np
import polars as pl
import polars.testing

def run_experiment():
    start = datetime.date.fromisoformat("1991-01-01")
    result = {"num_dates": [], "num_groups": [], "version1": [], "version2": [], }
    for n_dates in [1000, 2000, 5000, 10000]:
        end = start + datetime.timedelta(days=(n_dates - 1))
        dates = pl.date_range(start, end, eager=True)
        for m_groups in [10, 20, 50, 100, 200, 500, 1000]:
            groups = [f"g_{i + 1}" for i in range(m_groups)]
            groups_, dates_ = list(zip(*itertools.product(groups, dates)))

            frame = pl.from_dict({"group": groups_, "date": dates_, "value": np.random.rand(n_dates * m_groups)})

            t0 = time.time()
            X = frame.rolling(index_column="date", group_by="group", period="360d").agg(
                pl.col("value").sin().sum().alias("sin(value)"),
                pl.col("value").cos().sum().alias("cos(value)"),
                pl.col("value").sum()
            )
            t1 = time.time() - t0

            t0 = time.time()
            Y = frame.with_columns(
                pl.col("value").sin().alias("sin(value)"),
                pl.col("value").cos().alias("cos(value)")
            ).rolling(index_column="date", group_by="group", period="360d").agg(
                pl.col("sin(value)").sum(),
                pl.col("cos(value)").sum(),
                pl.col("value").sum()
            )
            t2 = time.time() - t0
            polars.testing.assert_frame_equal(X, Y)

            result["num_dates"].append(n_dates)
            result["num_groups"].append(m_groups)
            result["version1"].append(t1)
            result["version2"].append(t2)

    return pl.from_dict(result)
