I am trying to run a custom function on a lazy dataframe on a row-by-row basis. Function itself does not matter, so I'm using softmax as a stand-in. All that matters about it is that it is not computable via pl expressions.
I get about this far:
import polars as pl
import numpy as np

def softmax(t):
    a = np.exp(np.array(t))
    return tuple(a / np.sum(a))
ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()
cols = ['a','b','c']
redict = { f'column_{i}':c for i,c in enumerate(cols) }
ldf.select(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict)).collect()
However, if I want to get a resulting lazy df that contains columns other than cols (such as id), I get stuck, because
ldf.with_columns(pl.col(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict))).collect()
no longer works, because pl.col(cols).map_batches is done column-by-column...
This does not seem like it would be an uncommon use case, so I'm wondering if I'm missing something.
asked Mar 3 at 14:52 by velochy, edited Mar 3 at 14:53 by jqurious

1 comment: FWIW polars is very resistant to row-by-row operations and the apis are in my experience correspondingly limited – 2e0byo, Mar 3 at 14:55
2 Answers
Polars is pretty averse to row-by-row operations. Generally, if you need them, I'd suggest unpivoting (formerly, “melting”) and computing over the id column.
ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect()
shape: (9, 3)
┌─────┬──────────┬──────────┐
│ id ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════╪══════════╪══════════╡
│ 1 ┆ a ┆ 0.290461 │
│ 2 ┆ a ┆ 0.249143 │
│ 3 ┆ a ┆ 0.322043 │
│ 1 ┆ b ┆ 0.35477 │
│ 2 ┆ b ┆ 0.249143 │
│ 3 ┆ b ┆ 0.322043 │
│ 1 ┆ c ┆ 0.35477 │
│ 2 ┆ c ┆ 0.501713 │
│ 3 ┆ c ┆ 0.355913 │
└─────┴──────────┴──────────┘
If you need this back in wide format, you can pivot the resulting DataFrame.
ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect().pivot("variable", index="id")
shape: (3, 4)
┌─────┬──────────┬──────────┬──────────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪══════════╡
│ 1 ┆ 0.290461 ┆ 0.35477 ┆ 0.35477 │
│ 2 ┆ 0.249143 ┆ 0.249143 ┆ 0.501713 │
│ 3 ┆ 0.322043 ┆ 0.322043 ┆ 0.355913 │
└─────┴──────────┴──────────┴──────────┘
I actually found a relatively nice solution that just takes advantage of batches being materialized in memory.
import polars as pl
import numpy as np

def softmax(ar):
    a = np.exp(ar)
    # keepdims=True so the per-row sums broadcast against the 2D input
    return a / np.sum(a, axis=-1, keepdims=True)

def apply_npf_on_pl_df(df, cols, npf):
    # run the numpy function on the selected columns and write them back
    df[cols] = npf(df[cols].to_numpy())
    return df
ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()
cols = ['a','b','c']
redict = { f'column_{i}':c for i,c in enumerate(cols) }
ldf.map_batches(lambda bdf: apply_npf_on_pl_df(bdf,cols,softmax)).collect()
This is likely not ideal if there are a lot of other columns (the whole batch is materialized either way), but for my use case (with very few additional columns) this looks pretty efficient.