python - Explode Dataframe and add new columns with specific values based on a condition

I have a dataframe with 6 columns: 'Name', 'A', 'B', 'C', 'Val', 'Category'

It looks like this:

Name   A     B     C   Val   Category
 x    1.1   0     0.2  NA    NA
 y    0     0.1   0    NA    NA
 z    0.5   0.1   0.3  NA    NA

I want to expand the dataframe such that for each value that is not 0 in columns 'A', 'B', 'C' you get an extra row. The column 'Val' is assigned the non-zero value that led to the expansion and the 'Category' is arbitrarily based on where the value came from.

The result should look like this:

Name   A    B     C    Val   Category
 x    1.1   0     0.2  1.1   first
 x    1.1   0     0.2  0.2   third
 y    0     0.1   0    0.1   second
 z    0.5   0.1   0.3  0.5   first
 z    0.5   0.1   0.3  0.1   second
 z    0.5   0.1   0.3  0.3   third

This is probably the wrong approach, but since I only have three columns, I thought I could repeat every row 3 times using repeat on the index, then loop over the rows in steps of 3, applying three functions to assign the target and category to each row, and finally drop the rows where the target is 0.

def targeta(row):
    target = row
    val = 'first'
    return target, val

def targetb(row):
    target = row
    val = 'second'
    return target, val

def targetc(row):
    target = row
    val = 'third'
    return target, val

df_repeat = df.loc[df.index.repeat(3)]

for i in range(1,len(df_repeat)-3,3):
    df_repeat.iloc[i][['Target','Category']]=targeta(df_repeat.iloc[i]['A'])
    df_repeat.iloc[i+1][['Target','Category']]=targetb(df_repeat.iloc[i+1]['B'])
    df_repeat.iloc[i+2][['Target','Category']]=targetc(df_repeat.iloc[i+2]['C'])

I only got to this point and realized I am getting an empty dataframe. Any suggestions on what to do?

Asked Feb 12 at 20:57 by Bell; edited Feb 12 at 21:36 by ouroboros1.

5 Answers

You could replace the 0s with NaNs, rename the columns to your categories, reshape to long with stack, and join back to the original to duplicate the rows:

out = (df
       .drop(columns=['Val', 'Category'])
       .join(df[['A', 'B', 'C']]
             .set_axis(['first', 'second', 'third'], axis=1)
             .rename_axis(columns='Category')
             .replace(0, pd.NA)
             .stack()
             .rename('Val')
             .reset_index(-1)
            )
       )

Output:

  Name    A    B    C Category  Val
0    x  1.1  0.0  0.2    first  1.1
0    x  1.1  0.0  0.2    third  0.2
1    y  0.0  0.1  0.0   second  0.1
2    z  0.5  0.1  0.3    first  0.5
2    z  0.5  0.1  0.3   second  0.1
2    z  0.5  0.1  0.3    third  0.3

Here's one approach:

tmp = (df[[*'ABC']].where(lambda x: x != 0)
       .stack()
       .rename_axis([None, 'Category'])
       .reset_index(1, name='Val')
       .iloc[:, [1,0]]
       )

out = (df.iloc[:, :4]
       .merge(tmp, 
              left_index=True, 
              right_index=True, 
              how='right')
       )

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1        A
0    x  1.1  0.0  0.2  0.2        C
1    y  0.0  0.1  0.0  0.1        B
2    z  0.5  0.1  0.3  0.5        A
2    z  0.5  0.1  0.3  0.1        B
2    z  0.5  0.1  0.3  0.3        C

Explanation / intermediates

Select the relevant columns (['A','B','C']) and apply df.where to replace 0 with np.nan (or indeed: use df.replace) and df.stack. Note that NaN values will be dropped.

df[[*'ABC']].where(lambda x: x != 0).stack()

0  A    1.1
   C    0.2
1  B    0.1
2  A    0.5
   B    0.1
   C    0.3
dtype: float64

Next, use Series.rename_axis to rename level 1 to 'Category' and move it into a column via Series.reset_index. Passing name='Val' names the values column 'Val'.
Then use df.iloc to swap the column order:

tmp

   Val Category
0  1.1        A
0  0.2        C
1  0.1        B
2  0.5        A
2  0.1        B
2  0.3        C

Ignoring ['Val', 'Category'] from the original df, apply df.merge on the indices with how='right'.

To customize the categories, you can use Series.map:

out['Category'] = out['Category'].map({'A': 'first', ...})

Or, since your columns are in alphabetical order, you can use pd.Categorical with Categorical.rename_categories.

pd.Categorical(out['Category'])

['A', 'C', 'B', 'A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']

Hence:

out['Category'] = (pd.Categorical(out['Category'])
                   .rename_categories(['first', 'second', 'third']))
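
For the Series.map variant mentioned above, here is a minimal sketch with the mapping written out in full. It assumes the labels 'first', 'second' and 'third' from the question's expected output correspond to columns A, B and C in that order; the names cat and labels are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for out['Category'] after the merge step
cat = pd.Series(['A', 'C', 'B', 'A', 'B', 'C'], name='Category')

# Assumed mapping from source column to label (taken from the question's expected output)
labels = {'A': 'first', 'B': 'second', 'C': 'third'}

mapped = cat.map(labels)
print(mapped.tolist())
# ['first', 'third', 'second', 'first', 'second', 'third']
```

Any value missing from the dictionary would come back as NaN, so the mapping should cover every source column.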

Data used

import pandas as pd
import numpy as np

data = {'Name': {0: 'x', 1: 'y', 2: 'z'}, 
        'A': {0: 1.1, 1: 0.0, 2: 0.5}, 
        'B': {0: 0.0, 1: 0.1, 2: 0.1}, 
        'C': {0: 0.2, 1: 0.0, 2: 0.3}, 
        'Val': {0: np.nan, 1: np.nan, 2: np.nan}, 
        'Category': {0: np.nan, 1: np.nan, 2: np.nan}
        }
df = pd.DataFrame(data)

Using numpy indexing:

nms = np.array(['first', 'second', 'third'])
d = df[['A', 'B', 'C']]
row, col = np.argwhere(d.ne(0).to_numpy()).T  # ne(0) keeps every non-zero value, including negatives
res = df.iloc[row].assign(Val = d.to_numpy()[row, col], Category = nms[col])
res

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1    first
0    x  1.1  0.0  0.2  0.2    third
1    y  0.0  0.1  0.0  0.1   second
2    z  0.5  0.1  0.3  0.5    first
2    z  0.5  0.1  0.3  0.1   second
2    z  0.5  0.1  0.3  0.3    third
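
To see what the indexing step does, the intermediates can be printed. This is only a sketch on the sample data from the question; the frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'A': [1.1, 0.0, 0.5],
                   'B': [0.0, 0.1, 0.1],
                   'C': [0.2, 0.0, 0.3]})

d = df[['A', 'B', 'C']].to_numpy()

# argwhere returns one (row, column) pair per non-zero cell, in row-major
# order; transposing splits the pairs into two parallel position arrays
row, col = np.argwhere(d != 0).T
print(row)  # [0 0 1 2 2 2]
print(col)  # [0 2 1 0 1 2]

# Indexing with both arrays picks out the matching values
print(d[row, col])  # [1.1 0.2 0.1 0.5 0.1 0.3]
```

row is then used to repeat the original rows with df.iloc, and col to look up the category label for each repeated row.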

Categories without word-based representations of numbers

Another possible solution:

cols = ['Val', 'Category']

df[cols] = df.loc[:, 'A':'C'].replace(0, np.nan).apply(
    lambda x: pd.Series(
        [x.dropna(), (1 + np.where(x.notna())[0])]), 
    axis=1)

df.explode(cols)

The steps are:

The code selects columns A to C from the dataframe and replaces all 0 values with NaN values using the replace method.
Then, it applies a lambda function to each row using the apply method. Inside the lambda function, the dropna method is used to remove any NaN values from the row. The notna method is then used to identify the positions of the non-NaN values, and these positions are incremented by 1.
The results are assigned to the columns Val and Category.
Finally, the explode method is used to transform each list-like element in the Val and Category columns into individual rows.

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1        1
0    x  1.1  0.0  0.2  0.2        3
1    y  0.0  0.1  0.0  0.1        2
2    z  0.5  0.1  0.3  0.5        1
2    z  0.5  0.1  0.3  0.1        2
2    z  0.5  0.1  0.3  0.3        3
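
The steps described above can also be sketched with plain Python lists per row instead of a row-wise apply. This is a variant under stated assumptions, not the code from the answer: it rebuilds the sample frame, treats exactly 0 as "missing", and relies on multi-column explode, which needs pandas 1.3 or later. The names vals and res are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'A': [1.1, 0.0, 0.5],
                   'B': [0.0, 0.1, 0.1],
                   'C': [0.2, 0.0, 0.3]})

vals = df.loc[:, 'A':'C'].to_numpy()

# One list per row: the non-zero values and their 1-based column positions
df['Val'] = pd.Series([r[r != 0].tolist() for r in vals], index=df.index)
df['Category'] = pd.Series([(np.flatnonzero(r) + 1).tolist() for r in vals],
                           index=df.index)

# explode turns each list element into its own row; the two columns are
# exploded together, so their lists must have the same length in every row
res = df.explode(['Val', 'Category'])
print(res)
```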

Categories with word-based representations of numbers

If we really need the categories as words, we can use the inflect library as follows:

# pip install inflect
import inflect

p = inflect.engine()

df.explode(cols).assign(
    Category=lambda x: x['Category']
    .map(lambda n: p.number_to_words(p.ordinal(n))))

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1    first
0    x  1.1  0.0  0.2  0.2    third
1    y  0.0  0.1  0.0  0.1   second
2    z  0.5  0.1  0.3  0.5    first
2    z  0.5  0.1  0.3  0.1   second
2    z  0.5  0.1  0.3  0.3    third

Method 1:

import pandas as pd

data = {
    'Name': ['x', 'y', 'z'],
    'A': [1.1, 0, 0.5],
    'B': [0, 0.1, 0.1],
    'C': [0.2, 0, 0.3],
}
df = pd.DataFrame(data)
print(df)
'''
  Name    A    B    C
0    x  1.1  0.0  0.2
1    y  0.0  0.1  0.0
2    z  0.5  0.1  0.3
'''
# All columns except 'Name'
value_vars = [col for col in df.columns if col != 'Name'] 
'''
['A', 'B', 'C']
'''
category_map = {col: f"{i+1}" for i, col in enumerate(value_vars)}
'''
{'A': '1', 'B': '2', 'C': '3'}
'''

res = (
    pd.melt(df, id_vars=['Name'], value_vars=value_vars,
            var_name='Category', value_name='Val')
    .query('Val != 0')
    .assign(Category=lambda x: x['Category'].map(category_map))
    .reset_index(drop=True)
)

print(res)
'''
  Name Category  Val
0    x        1  1.1
1    z        1  0.5
2    y        2  0.1
3    z        2  0.1
4    x        3  0.2
5    z        3  0.3
'''
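
The melt-based result above drops the A, B and C columns that the question's expected output keeps. One possible way to retain them, sketched here under the assumption that the row index of df is unique, is to melt with ignore_index=False (available since pandas 1.1) and join the long result back onto the original frame. The names long and res are illustrative, and Category holds the source column name rather than a custom label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'A': [1.1, 0.0, 0.5],
                   'B': [0.0, 0.1, 0.1],
                   'C': [0.2, 0.0, 0.3]})

long = (
    df.melt(id_vars='Name', value_vars=['A', 'B', 'C'],
            var_name='Category', value_name='Val',
            ignore_index=False)      # keep the original row labels
      .query('Val != 0')             # drop the zero entries
)

# Joining on the index repeats each original row once per non-zero value
res = df.join(long[['Category', 'Val']])
print(res)
```

A row whose A, B and C are all 0 would survive the join with NaN in both new columns, so add a dropna on Val if such rows should disappear.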

Method 2 (better):

import pandas as pd
import numpy as np

data = {
    'Name': np.array(['x', 'y', 'z']),
    'A': np.array([1.1, 0, 0.5]),
    'B': np.array([0, 0.1, 0.1]),
    'C': np.array([0.2, 0, 0.3]),
}
df = pd.DataFrame(data)
print(df)
'''
  Name    A    B    C
0    x  1.1  0.0  0.2
1    y  0.0  0.1  0.0
2    z  0.5  0.1  0.3
'''

valueVars = df.columns[df.columns != 'Name']
#Index(['A', 'B', 'C'], dtype='object')

categoryLabels = np.array([f"{i + 1}" for i in range(len(valueVars))])
#['1' '2' '3']

namesExpanded = np.repeat(df['Name'].values,len(valueVars))
#['x' 'x' 'x' 'y' 'y' 'y' 'z' 'z' 'z']

categoriesExpanded = np.tile(categoryLabels, len(df))  # one full set of labels per row
#['1' '2' '3' '1' '2' '3' '1' '2' '3']

valuesExpanded = df[valueVars].values.ravel()
#[1.1 0.  0.2 0.  0.1 0.  0.5 0.1 0.3]

mask = valuesExpanded != 0 

df1 = pd.DataFrame({
    'Name': namesExpanded[mask],
    'Category': categoriesExpanded[mask],
    'Val': valuesExpanded[mask],
})

print(df1)
'''
  Name Category  Val
0    x        1  1.1
1    x        3  0.2
2    y        2  0.1
3    z        1  0.5
4    z        2  0.1
5    z        3  0.3
'''
