Cumulative Elementwise Sum by Python Polars - Stack Overflow

I have a weight vector:

weight_vec = pl.Series("weights", [0.125, 0.0625, 0.03125])

I also have a DataFrame containing up to m variables. For simplicity, we will use only two variables:

df = pl.DataFrame(
    {
        "row_index": [0, 1, 2, 3, 4],
        "var1": [1, 2, 3, 4, 5],
        "var2": [6, 7, 8, 9, 10],
    }
)

The size (number of observations) for these variables can be very large (tens of millions of rows).

I would like to:

  • For each variable and each observation x_i, where i is the row index [0,...,4], I want to replace x_i with the sumproduct of the weight vector and the n values starting at the current one, [x_i, ..., x_{i+n-1}]. n is the length of the given weight vector, and it varies between weight vector definitions.

    Numerically, the value of var1 at row index 0 is the sumproduct of [x_0, x_1, x_2] and the weight vector. When the row index approaches the end, so that fewer than n values remain (i.e., max index - row index + 1 < n), the value should be None.

  • We can assume that the height of the DataFrame is always greater than or equal to the length of the weight vector, so there is at least one valid result.

The resulting DataFrame should look like this:

shape: (5, 3)
┌───────────┬─────────┬─────────┐
│ row_index ┆ var1    ┆ var2    │
│ ---       ┆ ---     ┆ ---     │
│ i64       ┆ f64     ┆ f64     │
╞═══════════╪═════════╪═════════╡
│ 0         ┆ 0.34375 ┆ 1.4375  │
│ 1         ┆ 0.5625  ┆ 1.65625 │
│ 2         ┆ 0.78125 ┆ 1.875   │
│ 3         ┆ null    ┆ null    │
│ 4         ┆ null    ┆ null    │
└───────────┴─────────┴─────────┘

Numeric Calculations:

  • x_0_var1: (0.125 * 1 + 0.0625 * 2 + 0.03125 * 3 = 0.34375)
  • x_2_var2: (0.125 * 8 + 0.0625 * 9 + 0.03125 * 10 = 1.875)
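
The rule described above can be pinned down with a small, dependency-free reference implementation. This is only a sketch for checking expected values on small inputs, not the memory-efficient Polars solution being asked for; the function name forward_weighted_sum is made up for illustration.

```python
def forward_weighted_sum(values, weights):
    """Weighted sum of each forward window; None where the window runs off the end."""
    n = len(weights)
    out = []
    for i in range(len(values)):
        window = values[i:i + n]  # current value plus the next n - 1 values
        if len(window) < n:
            out.append(None)      # not enough values left for a full window
        else:
            out.append(sum(w * x for w, x in zip(weights, window)))
    return out


weights = [0.125, 0.0625, 0.03125]
var1_expected = forward_weighted_sum([1, 2, 3, 4, 5], weights)
var2_expected = forward_weighted_sum([6, 7, 8, 9, 10], weights)
print(var1_expected)  # [0.34375, 0.5625, 0.78125, None, None]
print(var2_expected)  # [1.4375, 1.65625, 1.875, None, None]
```

Both lists match the var1 and var2 columns of the expected DataFrame.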

I am looking for a memory efficient, vectorized Polars operation to achieve such results.

Asked Mar 9 at 0:33 by Kevin Li; edited Mar 9 at 8:59 by jqurious.
  • Are you able to update the question with some example working for row index 1 and 2 in the "var1" column? E.g., for row index 0, something like (1 * 0.125) + (2 * 0.0625) + (3 * 0.03125) = 0.34375. I've tried various ways (product then sum, sum then product), but am not able to reach your expected output. – Henry Harbeck Commented Mar 9 at 4:37
  • @HenryHarbeck I think every result uses the row index to get values from [1, 2, 3, 4, 5]: the first uses [1,2,3] (var1[0:0+3]), the second uses [2,3,4] (var1[1:1+3]), the third uses [3,4,5] (var1[2:2+3]). But you are right, OP should describe it better and show more calculations. – furas Commented Mar 9 at 12:23
  • I don't know Polars but maybe it can use numpy - like this (np.array([0.125, 0.0625, 0.03125]) * [1, 2, 3]).sum() or more like (np.array(weight_vec) * row[row_index:row_index+3]).sum() – furas Commented Mar 9 at 12:30
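
The NumPy idea from the last comment can be written without a Python-level loop. The following is a minimal sketch, assuming NumPy 1.20 or later for sliding_window_view; the function name forward_weighted_sum_np is hypothetical, and NaN stands in for null because a NumPy float array has no null value.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def forward_weighted_sum_np(values, weights):
    values = np.asarray(values, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    n = weights.size
    # Start with NaN everywhere; the last n - 1 positions stay NaN.
    out = np.full(values.size, np.nan)
    # Row i of the view is values[i:i+n]; the view itself does not copy the data.
    windows = sliding_window_view(values, n)
    out[: windows.shape[0]] = windows @ weights
    return out


res = forward_weighted_sum_np([1, 2, 3, 4, 5], [0.125, 0.0625, 0.03125])
print(res)
```

As in the question, this assumes the input has at least as many values as there are weights; sliding_window_view raises an error otherwise.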

2 Answers

Here is a solution that uses rolling.

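import numpy as np
import polars as pl

weight_vec_len: int = weight_vec.len()
period = f"{weight_vec_len}i"  # window length in rows, e.g. "3i"

# offset="-1i" starts each window just before the current row, so the window
# covers the current row and the next weight_vec_len - 1 rows.
df.rolling("row_index", period=period, offset="-1i").agg(
    pl.col(r"^var\d$")
    # Pad the short windows at the end of the frame with NaN so dot() lines up.
    .extend_constant(np.nan, weight_vec_len - pl.len())
    .dot(weight_vec)
    # The padded windows evaluate to NaN; turn those into nulls.
    .fill_nan(None)
    .name.keep()
)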
shape: (5, 3)
┌───────────┬─────────┬─────────┐
│ row_index ┆ var1    ┆ var2    │
│ ---       ┆ ---     ┆ ---     │
│ i64       ┆ f64     ┆ f64     │
╞═══════════╪═════════╪═════════╡
│ 0         ┆ 0.34375 ┆ 1.4375  │
│ 1         ┆ 0.5625  ┆ 1.65625 │
│ 2         ┆ 0.78125 ┆ 1.875   │
│ 3         ┆ null    ┆ null    │
│ 4         ┆ null    ┆ null    │
└───────────┴─────────┴─────────┘
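
To see why a period of 3 rows with an offset of -1 row selects the current row and the next two, the window bounds can be worked out by hand. The sketch below assumes the window for row index t is the interval (t + offset, t + offset + period], open on the left and closed on the right, which is my reading of rolling's default closed="right" setting; window_rows is a hypothetical helper, not a Polars function.

```python
def window_rows(t, period, offset, last_index):
    """Row indices inside the interval (t + offset, t + offset + period]."""
    lo = t + offset           # exclusive lower bound
    hi = t + offset + period  # inclusive upper bound
    return [i for i in range(lo + 1, hi + 1) if 0 <= i <= last_index]


for t in range(5):
    print(t, window_rows(t, period=3, offset=-1, last_index=4))
# 0 [0, 1, 2]
# 1 [1, 2, 3]
# 2 [2, 3, 4]
# 3 [3, 4]
# 4 [4]
```

Rows 3 and 4 get windows shorter than the weight vector, which is why the answer pads them with NaN before taking the dot product.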

This has to use a when/then to create nulls when there isn't enough forward data but otherwise it should be pretty good.

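(
    df.with_columns(
        # Keep a result only where at least len(weight_vec) rows remain,
        # counting the current one; all other rows become null.
        pl.when((pl.col("row_index").max() - pl.col("row_index") + 1) >= weight_vec.shape[0])
        .then((
            # One Array per row holding [x_i, x_i+1, ..., x_i+n-1].
            pl.concat_arr(
                pl.col(col).shift(-x) for x in range(weight_vec.shape[0])
            ) * weight_vec.reshape((1, -1))  # elementwise products with the weights
            ).arr.sum()
        )
        for col in ["var1", "var2"]
    )
)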
shape: (5, 3)
┌───────────┬─────────┬─────────┐
│ row_index ┆ var1    ┆ var2    │
│ ---       ┆ ---     ┆ ---     │
│ i64       ┆ f64     ┆ f64     │
╞═══════════╪═════════╪═════════╡
│ 0         ┆ 0.34375 ┆ 1.4375  │
│ 1         ┆ 0.5625  ┆ 1.65625 │
│ 2         ┆ 0.78125 ┆ 1.875   │
│ 3         ┆ null    ┆ null    │
│ 4         ┆ null    ┆ null    │
└───────────┴─────────┴─────────┘

The way this works is that for each row, it creates an Array of the future rows' values up to the size of the weights. It also reshapes the weights Series into an array and multiplies them together to create an array of products. Lastly, it takes the sum of that Array. All of that is wrapped in a then for only the rows where there are at least as many future values as there are weights.
