I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be filled with null.
Currently, I use the following approach with sequential cross joins:
import polars as pl
from typing import Sequence

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit: a single index column is uniform by definition
    if len(index_cols) == 1:
        return df
    # Get the unique values of the first index column
    uniform_df = df.select(index_cols[0]).unique(maintain_order=True)
    # Cross join with the unique values of each remaining index column
    for i in range(1, len(index_cols)):
        unique_index_values = df.select(index_cols[i]).unique(maintain_order=True)
        uniform_df = uniform_df.join(unique_index_values, how="cross")
    # Left join with the original DataFrame to preserve existing values
    return uniform_df.join(df, on=index_cols, how="left")
Here is an example:
df = pl.from_repr('''
┌─────┬─────┬─────┬───────┐
│ g1 ┆ g2 ┆ g3 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═══════╡
│ A ┆ 1 ┆ 1 ┆ 10 │
│ A ┆ 1 ┆ 2 ┆ 20 │
│ B ┆ 2 ┆ 1 ┆ 30 │
│ B ┆ 2 ┆ 2 ┆ 40 │
└─────┴─────┴─────┴───────┘
''')
uniform_df = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(uniform_df)
# ┌─────┬─────┬─────┬───────┐
# │ g1 ┆ g2 ┆ g3 ┆ value │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═══════╡
# │ A ┆ 1 ┆ 1 ┆ 10 │
# │ A ┆ 1 ┆ 2 ┆ 20 │
# │ A ┆ 2 ┆ 1 ┆ null │
# │ A ┆ 2 ┆ 2 ┆ null │
# │ B ┆ 1 ┆ 1 ┆ null │
# │ B ┆ 1 ┆ 2 ┆ null │
# │ B ┆ 2 ┆ 1 ┆ 30 │
# │ B ┆ 2 ┆ 2 ┆ 40 │
# └─────┴─────┴─────┴───────┘
Any suggestions for making this more graceful and efficient?
Edit: @Dean MacGregor & @orlp, thank you for your answers. All approaches show comparable performance (±10%), with @Dean MacGregor's proposal consistently being slightly faster than the others. After testing on multiple setups, I found that the main bottleneck is the final join that combines the uniform multi-index with the original dataset, rather than building the Cartesian product beforehand. As a result, speed and peak memory consumption remain similar regardless of how the Cartesian product is computed, especially as dataset sizes grow.
Both proposals support lazy evaluation, which can be useful depending on the use case. Since there is no dedicated method for computing the Cartesian product, all options seem equally valid.
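For reference, a minimal sketch of the kind of harness used for these timings (the synthetic data, sizes, and seed here are illustrative assumptions, not the actual benchmark setup):

import time
import numpy as np
import polars as pl

# Illustrative synthetic data: three index columns with many missing combinations.
rng = np.random.default_rng(42)
n = 1_000_000
df = pl.DataFrame({
    "g1": rng.integers(0, 100, n),
    "g2": rng.integers(0, 100, n),
    "g3": rng.integers(0, 100, n),
    "value": rng.random(n),
}).unique(subset=["g1", "g2", "g3"], maintain_order=True)

start = time.perf_counter()
out = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(f"{out.height:,} rows in {time.perf_counter() - start:.3f}s")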
I think your approach is sensible; it can, however, be done lazily:
from functools import reduce

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    if len(index_cols) == 1:
        return df
    lf = df.lazy()
    # One small LazyFrame of unique values per index column
    uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
    # Fold the cross joins into a single lazy Cartesian product
    product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
    out = product.join(lf, on=index_cols, how="left", maintain_order="left")
    return out.collect()
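If the surrounding pipeline is already lazy, the same idea can accept and return a LazyFrame so that collection happens once, at the end of the whole query. A sketch of that variant (the name ensure_uniform_lazy is mine, not part of the original answer):

def ensure_uniform_lazy(lf: pl.LazyFrame, index_cols: Sequence[str]) -> pl.LazyFrame:
    if len(index_cols) == 1:
        return lf
    uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
    product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
    # Defer collection to the caller so this composes with other lazy operations
    return product.join(lf, on=index_cols, how="left", maintain_order="left")

result = ensure_uniform_lazy(df.lazy(), ["g1", "g2", "g3"]).collect()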
I know this came up before, but I can't find the old question. From memory, I think the fastest solution was to implode all the unique columns and then explode them one at a time, creating a df in which every unique combination exists. Then you join that to your actual df to get the final output.
def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit
    if len(index_cols) == 1:
        return df
    # Implode the unique values of each index column into single-row lists
    uniform_df = df.select(pl.col(index_cols).unique().implode())
    # Explode one column at a time to materialize every combination
    for col in index_cols:
        uniform_df = uniform_df.explode(col)
    return uniform_df.join(df, on=index_cols, how="left", maintain_order="left")

print(ensure_uniform(df, ["g1", "g2", "g3"]))
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g1 ┆ g2 ┆ g3 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═══════╡
│ A ┆ 1 ┆ 1 ┆ 10 │
│ A ┆ 1 ┆ 2 ┆ 20 │
│ A ┆ 2 ┆ 1 ┆ null │
│ A ┆ 2 ┆ 2 ┆ null │
│ B ┆ 1 ┆ 1 ┆ null │
│ B ┆ 1 ┆ 2 ┆ null │
│ B ┆ 2 ┆ 1 ┆ 30 │
│ B ┆ 2 ┆ 2 ┆ 40 │
└─────┴─────┴─────┴───────┘
I probably wouldn't argue this is more graceful, since you still need repeated assignments for the explodes, but I think it is faster. Give it a try with your data.
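If the repeated assignments bother you, the explode loop can also be folded with functools.reduce, mirroring the cross-join version above (a purely stylistic variant with the same behavior):

from functools import reduce

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    if len(index_cols) == 1:
        return df
    imploded = df.select(pl.col(index_cols).unique().implode())
    # Each explode step multiplies out one column's unique values
    uniform_df = reduce(lambda acc, col: acc.explode(col), index_cols, imploded)
    return uniform_df.join(df, on=index_cols, how="left", maintain_order="left")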
Comment from jqurious: this also works lazily, e.g. ensure_uniform(df.lazy(), ["g1", "g2", "g3"]).collect(). Only operations that would result in a "dynamic schema" (e.g. pivot) are invalid for LazyFrames.