python - Create a uniform dataset in Polars with cross joins - Stack Overflow

I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be filled with null.

Currently, I use the following approach with sequential cross joins:

from collections.abc import Sequence

import polars as pl


def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit
    if len(index_cols) == 1:
        return df

    # Get unique values of the first index column
    uniform_df = df.select(index_cols[0]).unique(maintain_order=True)

    # Cross join with other unique index columns
    for i in range(1, len(index_cols)):
        unique_index_values = df.select(index_cols[i]).unique(maintain_order=True)
        uniform_df = uniform_df.join(unique_index_values, how="cross")

    # Left join with the original DataFrame to preserve existing values
    return uniform_df.join(df, on=index_cols, how="left")

Here is an example:

df = pl.from_repr('''
┌─────┬─────┬─────┬───────┐
│ g1  ┆ g2  ┆ g3  ┆ value │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ A   ┆ 1   ┆ 1   ┆ 10    │
│ A   ┆ 1   ┆ 2   ┆ 20    │
│ B   ┆ 2   ┆ 1   ┆ 30    │
│ B   ┆ 2   ┆ 2   ┆ 40    │
└─────┴─────┴─────┴───────┘
''')

uniform_df = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(uniform_df)
# ┌─────┬─────┬─────┬───────┐
# │ g1  ┆ g2  ┆ g3  ┆ value │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ str ┆ i64 ┆ i64 ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ A   ┆ 1   ┆ 1   ┆ 10    │
# │ A   ┆ 1   ┆ 2   ┆ 20    │
# │ A   ┆ 2   ┆ 1   ┆ null  │
# │ A   ┆ 2   ┆ 2   ┆ null  │
# │ B   ┆ 1   ┆ 1   ┆ null  │
# │ B   ┆ 1   ┆ 2   ┆ null  │
# │ B   ┆ 2   ┆ 1   ┆ 30    │
# │ B   ┆ 2   ┆ 2   ┆ 40    │
# └─────┴─────┴─────┴───────┘

Any suggestions for making this more graceful and efficient?

Edit: @Dean MacGregor & @orlp Thank you for your answers. All approaches show comparable performance (+/-10%), with @Dean MacGregor's proposal consistently being slightly faster than the others. After testing on multiple setups, I found that the main bottleneck seems to be the final join process, combining the uniform multiindex with the original dataset, rather than building the Cartesian product beforehand. This suggests that speed and peak memory consumption remain similar regardless of how the Cartesian product is computed, especially as dataset sizes grow.

Both proposals enable lazy operations, which could be useful depending on the use case. Since there is no dedicated method for computing the Cartesian product, all options seem to be valid.
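
For reference, a minimal sketch of the kind of timing comparison I ran (the data sizes and shape here are made up for illustration, not the actual benchmark):

import time

import numpy as np
import polars as pl

# Hypothetical test data: three index columns with incomplete coverage
rng = np.random.default_rng(0)
n = 1_000_000
df_large = pl.DataFrame({
    "g1": rng.integers(0, 100, n),
    "g2": rng.integers(0, 100, n),
    "g3": rng.integers(0, 100, n),
    "value": rng.random(n),
}).unique(subset=["g1", "g2", "g3"])

start = time.perf_counter()
result = ensure_uniform(df_large, ["g1", "g2", "g3"])
print(f"{result.shape}: {time.perf_counter() - start:.3f}s")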

  • Just to note that @DeanMacGregor's answer can also run under lazy: ensure_uniform(df.lazy(), ["g1","g2","g3"]).collect() - Only operations that would result in a "dynamic schema" are invalid for LazyFrames. (e.g. pivot) – jqurious Commented Mar 13 at 14:43
  • @jqurious thank you. I will add this to the remark above. – Olibarer Commented Mar 13 at 16:07
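
To illustrate the first comment, the function can stay lazy end to end (a sketch, assuming ensure_uniform is written so that it accepts both frame types, as in the answers below):

# Pass a LazyFrame in: the unique/join operations are only recorded...
lazy_result = ensure_uniform(df.lazy(), ["g1", "g2", "g3"])
# ...and executed when the result is collected.
print(lazy_result.collect())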

2 Answers


I think your approach is sensible; it can, however, be done lazily:

from functools import reduce

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    if len(index_cols) == 1:
        return df

    lf = df.lazy()
    # One LazyFrame of unique values per index column
    uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
    # Chain cross joins into the full Cartesian product, still lazily
    product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
    # Left join the original data back on, keeping the product's row order
    out = product.join(lf, on=index_cols, how="left", maintain_order="left")
    return out.collect()
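
For readers unfamiliar with reduce: for three index columns, the product line unrolls to the same chain of cross joins as the loop in the question:

# What reduce builds for index_cols = ["g1", "g2", "g3"]
product = (
    lf.select("g1").unique(maintain_order=True)
    .join(lf.select("g2").unique(maintain_order=True), how="cross")
    .join(lf.select("g3").unique(maintain_order=True), how="cross")
)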

I know this came up before but I can't find the old question. From memory, I think the fastest solution was to implode all the unique columns and then explode them one at a time to create a df where every unique combination exists. Then you join that to your actual df to get the final output.

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit
    if len(index_cols) == 1:
        return df

    # One row whose columns are lists of each column's unique values
    uniform_df = df.select(pl.col(index_cols).unique().implode())
    # Explode one column at a time; the remaining list columns are
    # repeated for every element, building up the Cartesian product
    for col in index_cols:
        uniform_df = uniform_df.explode(col)
    return uniform_df.join(df, on=index_cols, how="left", maintain_order="left")

print(ensure_uniform(df, ["g1", "g2", "g3"]))
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g1  ┆ g2  ┆ g3  ┆ value │
│ --- ┆ --- ┆ --- ┆ ---   │
│ str ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ A   ┆ 1   ┆ 1   ┆ 10    │
│ A   ┆ 1   ┆ 2   ┆ 20    │
│ A   ┆ 2   ┆ 1   ┆ null  │
│ A   ┆ 2   ┆ 2   ┆ null  │
│ B   ┆ 1   ┆ 1   ┆ null  │
│ B   ┆ 1   ┆ 2   ┆ null  │
│ B   ┆ 2   ┆ 1   ┆ 30    │
│ B   ┆ 2   ┆ 2   ┆ 40    │
└─────┴─────┴─────┴───────┘
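
To see why the explode loop yields the full Cartesian product, it helps to print the intermediate frame (shown for the example df; note that unique() without maintain_order=True does not guarantee the element order inside the lists):

print(df.select(pl.col(["g1", "g2", "g3"]).unique().implode()))
# shape: (1, 3)
# ┌────────────┬───────────┬───────────┐
# │ g1         ┆ g2        ┆ g3        │
# │ ---        ┆ ---       ┆ ---       │
# │ list[str]  ┆ list[i64] ┆ list[i64] │
# ╞════════════╪═══════════╪═══════════╡
# │ ["A", "B"] ┆ [1, 2]    ┆ [1, 2]    │
# └────────────┴───────────┴───────────┘
# Each explode unnests one list column and repeats the other columns
# for every element, so after all three explodes each combination
# appears exactly once: 2 * 2 * 2 = 8 rows.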

I probably wouldn't argue this is more graceful, since it still requires repeated reassignment for the explodes, but I think it is faster. Give it a try with your data.
