I am working with Polars and need to ensure that my dataset contains all possible combinations of unique values in certain index columns. If a combination is missing in the original data, it should be filled with null.
Currently, I use the following approach with sequential cross joins:
import polars as pl
from typing import Sequence

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit: a single index column is uniform by definition
    if len(index_cols) == 1:
        return df
    # Get the unique values of the first index column
    uniform_df = df.select(index_cols[0]).unique(maintain_order=True)
    # Cross join with the unique values of each remaining index column
    for i in range(1, len(index_cols)):
        unique_index_values = df.select(index_cols[i]).unique(maintain_order=True)
        uniform_df = uniform_df.join(unique_index_values, how="cross")
    # Left join with the original DataFrame to preserve existing values
    return uniform_df.join(df, on=index_cols, how="left")
Here is an example:
df = pl.from_repr('''
┌─────┬─────┬─────┬───────┐
│ g1 ┆ g2 ┆ g3 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═══════╡
│ A ┆ 1 ┆ 1 ┆ 10 │
│ A ┆ 1 ┆ 2 ┆ 20 │
│ B ┆ 2 ┆ 1 ┆ 30 │
│ B ┆ 2 ┆ 2 ┆ 40 │
└─────┴─────┴─────┴───────┘
''')
uniform_df = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(uniform_df)
# ┌─────┬─────┬─────┬───────┐
# │ g1 ┆ g2 ┆ g3 ┆ value │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═══════╡
# │ A ┆ 1 ┆ 1 ┆ 10 │
# │ A ┆ 1 ┆ 2 ┆ 20 │
# │ A ┆ 2 ┆ 1 ┆ null │
# │ A ┆ 2 ┆ 2 ┆ null │
# │ B ┆ 1 ┆ 1 ┆ null │
# │ B ┆ 1 ┆ 2 ┆ null │
# │ B ┆ 2 ┆ 1 ┆ 30 │
# │ B ┆ 2 ┆ 2 ┆ 40 │
# └─────┴─────┴─────┴───────┘
Any suggestions for making this more graceful and efficient?
Edit: @Dean MacGregor & @orlp, thank you for your answers. All approaches show comparable performance (±10%), with @Dean MacGregor's proposal consistently being slightly faster than the others. After testing on multiple setups, I found that the main bottleneck is the final join that combines the uniform multi-index with the original dataset, rather than building the Cartesian product beforehand. As a result, speed and peak memory consumption remain similar regardless of how the Cartesian product is computed, especially as dataset sizes grow.
Both proposals support lazy evaluation, which can be useful depending on the use case. Since there is no dedicated method for computing the Cartesian product, all options seem equally valid.
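For reference, a minimal sketch of the kind of harness used for these timings (the synthetic data, sizes, and seed here are illustrative assumptions, not the actual benchmark setup):

import time
import numpy as np
import polars as pl

# Illustrative synthetic data: three index columns with many missing combinations.
rng = np.random.default_rng(42)
n = 1_000_000
df = pl.DataFrame({
    "g1": rng.integers(0, 100, n),
    "g2": rng.integers(0, 100, n),
    "g3": rng.integers(0, 100, n),
    "value": rng.random(n),
}).unique(subset=["g1", "g2", "g3"], maintain_order=True)

start = time.perf_counter()
out = ensure_uniform(df, index_cols=["g1", "g2", "g3"])
print(f"{out.height:,} rows in {time.perf_counter() - start:.3f}s")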
I think your approach is sensible; it can, however, be done lazily:
from functools import reduce

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    if len(index_cols) == 1:
        return df
    lf = df.lazy()
    # One small LazyFrame of unique values per index column
    uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
    # Fold the cross joins into a single lazy Cartesian product
    product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
    out = product.join(lf, on=index_cols, how="left", maintain_order="left")
    return out.collect()
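If the surrounding pipeline is already lazy, the same idea can accept and return a LazyFrame so that collection happens once, at the end of the whole query. A sketch of that variant (the name ensure_uniform_lazy is mine, not part of the original answer):

def ensure_uniform_lazy(lf: pl.LazyFrame, index_cols: Sequence[str]) -> pl.LazyFrame:
    if len(index_cols) == 1:
        return lf
    uniques = [lf.select(col).unique(maintain_order=True) for col in index_cols]
    product = reduce(lambda a, b: a.join(b, how="cross"), uniques)
    # Defer collection to the caller so this composes with other lazy operations
    return product.join(lf, on=index_cols, how="left", maintain_order="left")

result = ensure_uniform_lazy(df.lazy(), ["g1", "g2", "g3"]).collect()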
I know this came up before, but I can't find the old question. From memory, I think the fastest solution was to implode all the unique columns and then explode them one at a time, creating a df in which every unique combination exists. Then you join that to your actual df to get the final output.
def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    # Quick exit
    if len(index_cols) == 1:
        return df
    # Implode the unique values of each index column into single-row lists
    uniform_df = df.select(pl.col(index_cols).unique().implode())
    # Explode one column at a time to materialize every combination
    for col in index_cols:
        uniform_df = uniform_df.explode(col)
    return uniform_df.join(df, on=index_cols, how="left", maintain_order="left")

print(ensure_uniform(df, ["g1", "g2", "g3"]))
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g1 ┆ g2 ┆ g3 ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═══════╡
│ A ┆ 1 ┆ 1 ┆ 10 │
│ A ┆ 1 ┆ 2 ┆ 20 │
│ A ┆ 2 ┆ 1 ┆ null │
│ A ┆ 2 ┆ 2 ┆ null │
│ B ┆ 1 ┆ 1 ┆ null │
│ B ┆ 1 ┆ 2 ┆ null │
│ B ┆ 2 ┆ 1 ┆ 30 │
│ B ┆ 2 ┆ 2 ┆ 40 │
└─────┴─────┴─────┴───────┘
I probably wouldn't argue this is more graceful, since you still need repeated assignments for the explodes, but I think it is faster. Give it a try with your data.
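If the repeated assignments bother you, the explode loop can also be folded with functools.reduce, mirroring the cross-join version above (a purely stylistic variant with the same behavior):

from functools import reduce

def ensure_uniform(df: pl.DataFrame, index_cols: Sequence[str]) -> pl.DataFrame:
    if len(index_cols) == 1:
        return df
    imploded = df.select(pl.col(index_cols).unique().implode())
    # Each explode step multiplies out one column's unique values
    uniform_df = reduce(lambda acc, col: acc.explode(col), index_cols, imploded)
    return uniform_df.join(df, on=index_cols, how="left", maintain_order="left")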
Comment from jqurious: this also works lazily, e.g. ensure_uniform(df.lazy(), ["g1", "g2", "g3"]).collect(). Only operations that would result in a "dynamic schema" (e.g. pivot) are invalid for LazyFrames.