python - Explode Dataframe and add new columns with specific values based on a condition - Stack Overflow

I have a dataframe with 6 columns: 'Name', 'A', 'B', 'C', 'Val', 'Category'

It looks like this:

Name   A     B     C   Val   Category
 x    1.1   0     0.2  NA    NA
 y    0     0.1   0    NA    NA
 z    0.5   0.1   0.3  NA    NA

I want to expand the dataframe such that for each value that is not 0 in columns 'A', 'B', 'C' you get an extra row. The column 'Val' is assigned the non-zero value that led to the expansion and the 'Category' is arbitrarily based on where the value came from.

The result should look like this:

Name   A    B     C    Val   Category
 x    1.1   0     0.2  1.1   first
 x    1.1   0     0.2  0.2   third
 y    0     0.1   0    0.1   second
 z    0.5   0.1   0.3  0.5   first
 z    0.5   0.1   0.3  0.1   second
 z    0.5   0.1   0.3  0.3   third

This is probably the wrong approach, but since I only have three columns, I thought I would repeat every row 3 times using the repeat function on the index, then loop through the rows in steps of 3, applying three functions to assign the target and category on all rows, and finally drop the rows where the target is 0.

def targeta(row):
    target = row
    val = 'first'
    return target, val

def targetb(row):
    target = row
    val = 'second'
    return target, val

def targetc(row):
    target = row
    val = 'third'
    return target, val

df_repeat = df.loc[df.index.repeat(3)]

for i in range(1,len(df_repeat)-3,3):
    df_repeat.iloc[i][['Target','Category']]=targeta(df_repeat.iloc[i]['A'])
    df_repeat.iloc[i+1][['Target','Category']]=targetb(df_repeat.iloc[i+1]['B'])
    df_repeat.iloc[i+2][['Target','Category']]=targetc(df_repeat.iloc[i+2]['C'])

I only got to this point and realized I am getting an empty dataframe. Any suggestions on what to do?

asked Feb 12 at 20:57 by Bell, edited Feb 12 at 21:36 by ouroboros1

5 Answers

You could replace the 0s with NaNs, rename the columns to your categories, reshape to long with stack, and join back to the original to duplicate the rows:

out = (df
       .drop(columns=['Val', 'Category'])
       .join(df[['A', 'B', 'C']]
             .set_axis(['first', 'second', 'third'], axis=1)
             .rename_axis(columns='Category')
             .replace(0, pd.NA)
             .stack()
             .rename('Val')
             .reset_index(-1)
            )
       )

Output:

  Name    A    B    C Category  Val
0    x  1.1  0.0  0.2    first  1.1
0    x  1.1  0.0  0.2    third  0.2
1    y  0.0  0.1  0.0   second  0.1
2    z  0.5  0.1  0.3    first  0.5
2    z  0.5  0.1  0.3   second  0.1
2    z  0.5  0.1  0.3    third  0.3
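
To make the intermediate step visible, here is a minimal, self-contained sketch of the same idea. It assumes the sample data from the question, uses np.nan instead of pd.NA, and calls dropna() explicitly so that it does not rely on how stack treats missing values in a given pandas version. The name `long` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (without the empty Val/Category columns)
df = pd.DataFrame({
    'Name': ['x', 'y', 'z'],
    'A': [1.1, 0.0, 0.5],
    'B': [0.0, 0.1, 0.1],
    'C': [0.2, 0.0, 0.3],
})

# Long-format table: one row per non-zero cell, indexed by the original row label
long = (df[['A', 'B', 'C']]
        .set_axis(['first', 'second', 'third'], axis=1)
        .replace(0, np.nan)
        .stack()
        .dropna()
        .rename_axis([None, 'Category'])
        .rename('Val')
        .reset_index(level=1))

# Joining on the index duplicates each original row once per non-zero value
out = df.join(long)
print(out)
```

Because `long` has repeated index labels, the index join is what produces the extra rows.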

Here's one approach:

tmp = (df[[*'ABC']].where(lambda x: x != 0)
       .stack()
       .rename_axis([None, 'Category'])
       .reset_index(1, name='Val')
       .iloc[:, [1,0]]
       )

out = (df.iloc[:, :4]
       .merge(tmp, 
              left_index=True, 
              right_index=True, 
              how='right')
       )

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1        A
0    x  1.1  0.0  0.2  0.2        C
1    y  0.0  0.1  0.0  0.1        B
2    z  0.5  0.1  0.3  0.5        A
2    z  0.5  0.1  0.3  0.1        B
2    z  0.5  0.1  0.3  0.3        C

Explanation / intermediates

  • Select the relevant columns (['A','B','C']) and apply df.where to replace 0 with np.nan (or indeed: use df.replace) and df.stack. Note that NaN values will be dropped.
df[[*'ABC']].where(lambda x: x != 0).stack()

0  A    1.1
   C    0.2
1  B    0.1
2  A    0.5
   B    0.1
   C    0.3
dtype: float64
  • Next, use Series.rename_axis to rename level 1 to 'Category' and turn it into a column via Series.reset_index. Passing name='Val' names the column that holds the values.
  • Add df.iloc to switch the order:
tmp

   Val Category
0  1.1        A
0  0.2        C
1  0.1        B
2  0.5        A
2  0.1        B
2  0.3        C
  • Leaving out the 'Val' and 'Category' columns of the original df (df.iloc[:, :4]), apply df.merge on the indices with how='right'.

To customize the categories, you can use Series.map:

out['Category'] = out['Category'].map({'A': 'first', ...})

Or, since your columns are in alphabetical order, you can use pd.Categorical with Categorical.rename_categories.

pd.Categorical(out['Category'])

['A', 'C', 'B', 'A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']

Hence:

out['Category'] = (pd.Categorical(out['Category'])
                   .rename_categories(['first', 'second', 'third']))
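
As a small sketch of the two renaming options side by side: the mapping A → first, B → second, C → third is an assumption taken from the expected output in the question, and the series `cat` simply stands in for out['Category'].

```python
import pandas as pd

# Stand-in for out['Category'] as produced above
cat = pd.Series(['A', 'C', 'B', 'A', 'B', 'C'], name='Category')

# Option 1: explicit dictionary lookup
labels = {'A': 'first', 'B': 'second', 'C': 'third'}
via_map = cat.map(labels)

# Option 2: positional renaming of the (sorted) categories
via_categorical = pd.Series(
    pd.Categorical(cat).rename_categories(['first', 'second', 'third']),
    name='Category',
)

print(via_map.tolist())
print(via_categorical.tolist())
```

The dictionary is the safer choice when the column names are not in sorted order, since rename_categories assigns the new names by position in the sorted category list.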

Data used

import pandas as pd
import numpy as np

data = {'Name': {0: 'x', 1: 'y', 2: 'z'}, 
        'A': {0: 1.1, 1: 0.0, 2: 0.5}, 
        'B': {0: 0.0, 1: 0.1, 2: 0.1}, 
        'C': {0: 0.2, 1: 0.0, 2: 0.3}, 
        'Val': {0: np.nan, 1: np.nan, 2: np.nan}, 
        'Category': {0: np.nan, 1: np.nan, 2: np.nan}
        }
df = pd.DataFrame(data)

Using numpy indexing:

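import numpy as np

nms = np.array(['first', 'second', 'third'])
d = df[['A', 'B', 'C']]
# Row and column positions of every non-zero cell, in row-major order
row, col = np.argwhere(d.ne(0).to_numpy()).T
# Repeat the matching rows, then attach each value and its category label
res = df.iloc[row].assign(Val=d.to_numpy()[row, col], Category=nms[col])
res

  Name    A    B    C  Val Category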
0    x  1.1  0.0  0.2  1.1    first
0    x  1.1  0.0  0.2  0.2    third
1    y  0.0  0.1  0.0  0.1   second
2    z  0.5  0.1  0.3  0.5    first
2    z  0.5  0.1  0.3  0.1   second
2    z  0.5  0.1  0.3  0.3    third

Categories without word-based representations of numbers

Another possible solution:

cols = ['Val', 'Category']

df[cols] = df.loc[:, 'A':'C'].replace(0, np.nan).apply(
    lambda x: pd.Series(
        [x.dropna(), (1 + np.where(x.notna())[0])]), 
    axis=1)

df.explode(cols)

The steps are:

  • The code selects columns A to C from the dataframe and replaces all 0 values with NaN values using the replace method.

  • Then, it applies a lambda function to each row using the apply method. Inside the lambda function, the dropna method is used to remove any NaN values from the row. The notna method is then used to identify the positions of the non-NaN values, and these positions are incremented by 1.

  • The results are assigned to the columns Val and Category.

  • Finally, the explode method is used to transform each list-like element in the Val and Category columns into individual rows.

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1        1
0    x  1.1  0.0  0.2  0.2        3
1    y  0.0  0.1  0.0  0.1        2
2    z  0.5  0.1  0.3  0.5        1
2    z  0.5  0.1  0.3  0.1        2
2    z  0.5  0.1  0.3  0.3        3
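
The per-row step described above can be sketched on a single row. This is only an illustration: the one-row frame `demo` is hypothetical, and exploding several columns at once assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

# One row of the A..C block, with zeros already replaced by NaN
row = pd.Series({'A': 1.1, 'B': 0.0, 'C': 0.2}).replace(0, np.nan)

vals = row.dropna().tolist()                         # surviving values
positions = (1 + np.where(row.notna())[0]).tolist()  # their 1-based column positions

# A one-row frame holding list-like cells, as the apply step produces
demo = pd.DataFrame({'Name': ['x'], 'Val': [vals], 'Category': [positions]})

# explode turns each pair of list elements into its own row
exploded = demo.explode(['Val', 'Category'])
print(exploded)
```

Both lists must have the same length in every row, which holds here because they are derived from the same non-NaN mask.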

Categories with word-based representations of numbers

If we really need the categories as words, we can use the inflect library as follows:

# pip install inflect
import inflect

p = inflect.engine()

df.explode(cols).assign(
    Category=lambda x: x['Category']
    .map(lambda n: p.number_to_words(p.ordinal(n))))

Output:

  Name    A    B    C  Val Category
0    x  1.1  0.0  0.2  1.1    first
0    x  1.1  0.0  0.2  0.2    third
1    y  0.0  0.1  0.0  0.1   second
2    z  0.5  0.1  0.3  0.5    first
2    z  0.5  0.1  0.3  0.1   second
2    z  0.5  0.1  0.3  0.3    third

Method 1:

import pandas as pd

data = {
    'Name': ['x', 'y', 'z'],
    'A': [1.1, 0, 0.5],
    'B': [0, 0.1, 0.1],
    'C': [0.2, 0, 0.3],
}
df = pd.DataFrame(data)
print(df)
'''
  Name    A    B    C
0    x  1.1  0.0  0.2
1    y  0.0  0.1  0.0
2    z  0.5  0.1  0.3
'''
# All columns except 'Name'
value_vars = [col for col in df.columns if col != 'Name'] 
'''
['A', 'B', 'C']
'''
category_map = {col: f"{i+1}" for i, col in enumerate(value_vars)}
'''
{'A': '1', 'B': '2', 'C': '3'}
'''

res = (
    pd.melt(df, id_vars=['Name'], value_vars=value_vars,
            var_name='Category', value_name='Val')
    .query('Val != 0')
    .assign(Category=lambda x: x['Category'].map(category_map))
    .reset_index(drop=True)
)

print(res)
'''
  Name Category  Val
0    x        1  1.1
1    z        1  0.5
2    y        2  0.1
3    z        2  0.1
4    x        3  0.2
5    z        3  0.3
'''
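
The melt result above drops the A, B and C columns and is ordered by category rather than by original row. If the layout from the question is needed, one possible sketch is to keep the original index while melting and join back on it. This assumes pandas 1.1 or later for ignore_index=False; the word labels in `labels` are taken from the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['x', 'y', 'z'],
    'A': [1.1, 0.0, 0.5],
    'B': [0.0, 0.1, 0.1],
    'C': [0.2, 0.0, 0.3],
})
value_vars = ['A', 'B', 'C']
labels = dict(zip(value_vars, ['first', 'second', 'third']))

long = (
    df.melt(id_vars=['Name'], value_vars=value_vars,
            var_name='Category', value_name='Val',
            ignore_index=False)        # keep the original row labels
    .query('Val != 0')
    .sort_index(kind='mergesort')      # stable sort: rows grouped, A/B/C order kept
    .assign(Category=lambda x: x['Category'].map(labels))
)

# Join on the index to bring back Name, A, B, C for every expanded row
res = df.join(long[['Val', 'Category']])
print(res)
```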

Method 2 (better):

import pandas as pd
import numpy as np

data = {
    'Name': np.array(['x', 'y', 'z']),
    'A': np.array([1.1, 0, 0.5]),
    'B': np.array([0, 0.1, 0.1]),
    'C': np.array([0.2, 0, 0.3]),
}
df = pd.DataFrame(data)
print(df)
'''
  Name    A    B    C
0    x  1.1  0.0  0.2
1    y  0.0  0.1  0.0
2    z  0.5  0.1  0.3
'''

valueVars = df.columns[df.columns != 'Name']
#Index(['A', 'B', 'C'], dtype='object')

categoryLabels = np.array([f"{i + 1}" for i in range(len(valueVars))])
#['1' '2' '3']

namesExpanded = np.repeat(df['Name'].values, len(valueVars))
#['x' 'x' 'x' 'y' 'y' 'y' 'z' 'z' 'z']

# Tile the labels once per row (len(df)), not once per value column
categoriesExpanded = np.tile(categoryLabels, len(df))
#['1' '2' '3' '1' '2' '3' '1' '2' '3']

valuesExpanded = df[valueVars].values.ravel()
#[1.1 0.  0.2 0.  0.1 0.  0.5 0.1 0.3]

mask = valuesExpanded != 0 

df1 = pd.DataFrame({
    'Name': namesExpanded[mask],
    'Category': categoriesExpanded[mask],
    'Val': valuesExpanded[mask],
})

print(df1)
'''
  Name Category  Val
0    x        1  1.1
1    x        3  0.2
2    y        2  0.1
3    z        1  0.5
4    z        2  0.1
5    z        3  0.3
'''
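
One detail worth checking in the repeat/tile approach is the repetition counts: names are repeated once per value column, while labels are tiled once per row. The sample data has 3 rows and 3 columns, so the two counts happen to coincide. A sketch with a hypothetical 2-row input makes the difference visible:

```python
import numpy as np

names = np.array(['x', 'y'])           # 2 rows
labels = np.array(['1', '2', '3'])     # 3 value columns
values = np.array([[1.1, 0.0, 0.2],
                   [0.0, 0.1, 0.0]])

names_expanded = np.repeat(names, len(labels))   # each name once per column
cats_expanded = np.tile(labels, len(names))      # label cycle once per row
values_expanded = values.ravel()                 # row-major flattening

# All three arrays must line up element by element before masking
mask = values_expanded != 0
print(names_expanded[mask], cats_expanded[mask], values_expanded[mask])
```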
