python - Built-in Pandas function for cell-level apply with indexcolumn awareness?

I am cleaning historical data for forecasting for exponential smoothing. I have data at the US county level (i.e., second-level administrative division), but there are a lot of zero values (due to low volume) that lead to issues with the forecasting model.

Since the data is highly seasonal, what I am doing is checking each county's data for each year. If a particular year's data contains zeroes, I am replacing the county's data for that year with an adjusted dataset that applies the state-level seasonality to the county-level annual volume.

After much trial and error to avoid iteration that led me down paths of nested apply functions with reset indices (e.g., df.apply(lambda x: x.reset_index().apply(lambda y: [calculation])), I eventually wrote the data cleanup using iteration that calculates the seasonality and then multiplies the seasonal data by a dataframe that has the annual volume stored in monthly columns:

# Initialize empty seasonality df with the index and column values from the source data
cty_season = pd.DataFrame(index=cty_data.index, columns=cty_data.columns)

# Iterate through the index and columns to populate each value
for idx in cty_season.index:
   for col in cty_season.columns:
      cty_season.loc[idx,col] = [calculation referring to helper dfs with identical indices and columns]

# Combine seasonality data with sales totals to get revised dataset
cty_adj = cty_season * cty_annual

Is there a way to do this more efficiently or in a way that is more "pandic" (or whatever the Pandas equivalent to pythonic is)? The only thing that comes to mind is breaking apart the columns so that each year is a separate row, which might allow a more straightforward apply statement since the replacement is being done year by year.

Edit: here is an example of the data cleaning procedure. As I suggest above, for this specific use case, the answer is probably to split off each year into a separate row. However, I've run into this scenario in other use cases that might not have that same solution. One difference in this code is that I normally pivot sales records to create the dataframe, so the NaN values are already in the df instead of me having to replace the 0s with NaNs in this example.

import pandas as pd
import numpy as np

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(data,
                  index=pd.MultiIndex.from_product([['County 1','County 2','County 3'],['A','B','C']],names=['County','Product']),
                  columns=pd.Series(['Y1Q1','Y1Q2','Y1Q3','Y1Q4','Y2Q1','Y2Q2','Y2Q3','Y2Q4'],name='Quarter'))

# Roll up totals by product
tot_df = df.groupby('Product').sum()

# Find out how many non-zero data points there should be per year
# (this is done to allow for YTD analysis instead of assuming each year should have 4 quarterly points or 12 monthly points)
# There is an assumption that the data doesn't have any zeroes at the total level
tot_values = tot_df.apply(lambda x: x.groupby(x.index.str[:2]).count(),axis=1)

# Calculate seasonality/share of year for each product, each year
tot_season = tot_df.apply(lambda x: x.reset_index().apply(lambda y: y[x.name]/x[x.index.str[:2]==y.Quarter[:2]].sum(),axis=1),axis=1)
tot_season.columns = tot_df.columns

# Look for zeroes to determine if the data for a particular product and year can be used
cty_valid = df.replace({0:np.nan}).apply(lambda x: x.groupby(x.index.str[:2]).count().eq(tot_values.loc[x.name[-1]]),axis=1)

# Total up annual numbers by county/product.
# These numbers are repeated at the quarterly level so that the annual data
# can be directly multiplied with the county seasonality to be generated
cty_annual = df.apply(lambda x: x.groupby(x.index.str[:2]).sum(),axis=1)
cty_annual.columns = [x + 'Q1' for x in cty_annual.columns]
cty_annual = cty_annual.reindex(columns=df.columns).ffill(axis=1)

# Create a dataframe with the needed index and columns
cty_season = pd.DataFrame(index=df.index,columns=df.columns)

# Iterate through each county/product and period combination to populate the dataframe
for idx in cty_season.index:
    for col in cty_season.columns:
        # Use the actual seasonality (period sales / annual sales) if the year has non-zero values for that product/county.
        # If not, use the seasonality calculated at the total level for that product
        cty_season.loc[idx,col] = df.loc[idx,col]/cty_annual.loc[idx,col] if cty_valid.loc[idx,col[:2]] else tot_season.loc[idx[-1],col]

# Multiply the seasonality df with the annual sales df to get an adjusted sales history.
cty_adj = cty_season * cty_annual

# Initialize empty seasonality df with the index and column values from the source data
cty_season = pd.DataFrame(index=cty_data.index, columns=cty_data.columns)

# Iterate through the index and columns to populate each value
for idx in cty_season.index:
   for col in cty_season.columns:
      cty_season.loc[idx,col] = [calculation referring to helper dfs with identical indices and columns]

# Combine seasonality data with sales totals to get revised dataset
cty_adj = cty_season * cty_annual

import pandas as pd
import numpy as np

data = [[73,  0,  0, 22,  0, 34,  5, 46],
        [51, 12, 77,  0, 19,  3,  0, 34],
        [73, 44,  1, 72,  0, 56, 21,  3],
        [ 3, 74,  2, 24,  4, 60,  8, 39],
        [70,  0, 36, 50,  3,  1, 59,  1],
        [14, 37, 26, 27, 87, 58, 95,  2],
        [ 4,  1, 17, 34, 25,  1,  1,  2],
        [ 0,  0,  0,  4, 18,  1,  8,  0],
        [42, 27, 41, 15, 67,  2, 25,  6]]

df = pd.DataFrame(data,
                  index=pd.MultiIndex.from_product([['County 1','County 2','County 3'],['A','B','C']],names=['County','Product']),
                  columns=pd.Series(['Y1Q1','Y1Q2','Y1Q3','Y1Q4','Y2Q1','Y2Q2','Y2Q3','Y2Q4'],name='Quarter'))

# Roll up totals by product
tot_df = df.groupby('Product').sum()

# Find out how many non-zero data points there should be per year
# (this is done to allow for YTD analysis instead of assuming each year should have 4 quarterly points or 12 monthly points)
# There is an assumption that the data doesn't have any zeroes at the total level
tot_values = tot_df.apply(lambda x: x.groupby(x.index.str[:2]).count(),axis=1)

# Calculate seasonality/share of year for each product, each year
tot_season = tot_df.apply(lambda x: x.reset_index().apply(lambda y: y[x.name]/x[x.index.str[:2]==y.Quarter[:2]].sum(),axis=1),axis=1)
tot_season.columns = tot_df.columns

# Look for zeroes to determine if the data for a particular product and year can be used
cty_valid = df.replace({0:np.nan}).apply(lambda x: x.groupby(x.index.str[:2]).count().eq(tot_values.loc[x.name[-1]]),axis=1)

# Total up annual numbers by county/product.
# These numbers are repeated at the quarterly level so that the annual data
# can be directly multiplied with the county seasonality to be generated
cty_annual = df.apply(lambda x: x.groupby(x.index.str[:2]).sum(),axis=1)
cty_annual.columns = [x + 'Q1' for x in cty_annual.columns]
cty_annual = cty_annual.reindex(columns=df.columns).ffill(axis=1)

# Create a dataframe with the needed index and columns
cty_season = pd.DataFrame(index=df.index,columns=df.columns)

# Iterate through each county/product and period combination to populate the dataframe
for idx in cty_season.index:
    for col in cty_season.columns:
        # Use the actual seasonality (period sales / annual sales) if the year has non-zero values for that product/county.
        # If not, use the seasonality calculated at the total level for that product
        cty_season.loc[idx,col] = df.loc[idx,col]/cty_annual.loc[idx,col] if cty_valid.loc[idx,col[:2]] else tot_season.loc[idx[-1],col]

# Multiply the seasonality df with the annual sales df to get an adjusted sales history.
cty_adj = cty_season * cty_annual

Share Improve this question edited Nov 22, 2024 at 18:42 asked Nov 22, 2024 at 15:06 Carl Peterson 1068 bronze badges

2 I believe the term is Pandastic (satire). You can just say pandas-idiomatic (less fun). – EuanG Commented Nov 22, 2024 at 15:21
1 Please edit your question to include a minimal reproducible example so that readers can run your code to answer your question. – ouroboros1 Commented Nov 22, 2024 at 15:23
1 I suspect yes, there is a better way without iterating to handle this scenario. Please create a mock input dataset and with the expected output and the function you want to apply. – Scott Boston Commented Nov 22, 2024 at 16:50
please take a look at my answer – amance Commented Nov 26, 2024 at 18:03

Add a comment |

1 Answer 1

Sorted by: Reset to default 0

It would make calculations a bit easier if we first rearrange the data to make it long rather than wide. Next we create a new column for each new element rather than a separate df with each step.

# wide to long
df2 = df.stack().rename('values').reset_index()
# recreate tot_df in a new column
df2['prod_quart_tot'] = df2.groupby(['Product', 'Quarter'])['values'].transform('sum')
# create year column
df2['year'] = df2['Quarter'].str[:2]
# recreate tot_values
df2['tot_values'] = df2.groupby(['year', 'Product', 'County'])['prod_quart_tot'].transform(lambda x: x.gt(0).count())
# in between step needed for tot_seas
df2['prod_year_tot'] = df2.groupby(['Product', 'year'])['values'].transform('sum')
# recreate tot_seas
df2['tot_seas'] = df2['prod_quart_tot']/df2['prod_year_tot']
# recreate cty_valid
df2['cty_valid'] = df2.groupby(['County', 'Product', 'year'])['values'].transform(lambda x: 0 not in x.values)
# recreate cty_annual
df2['cty_annual'] = df2.groupby(['County', 'Product', 'year'])['values'].transform('sum')
# recreate cty_season
df2['cty_season'] = np.where(df2['cty_valid'], df2['values'].div(df2['cty_annual']), df2['tot_seas'])
# recreate cty_adj
df2['cty_adj'] = df2['cty_season'].mul(df2['cty_annual'])

# final values in original format
df_out = df2.set_index(['County', 'Product', 'Quarter'])['cty_adj'].unstack()

#check if df_out matches cty_adj
print(df_out.eq(cty_adj).all().all())

True

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1742289952a4416000.html

python - Built-in Pandas function for cell-level apply with indexcolumn awareness? - Stack Overflow

1 Answer 1

发表回复

评论列表（0条）

联系我们

400-800-8888

python - Built-in Pandas function for cell-level apply with indexcolumn awareness? - Stack Overflow

1 Answer 1

相关推荐