I have a dataframe with 6 columns: 'Name', 'A', 'B', 'C', 'Val', 'Category'
It looks like this:
Name A B C Val Category
x 1.1 0 0.2 NA NA
y 0 0.1 0 NA NA
z 0.5 0.1 0.3 NA NA
I want to expand the dataframe such that for each value that is not 0 in columns 'A', 'B', 'C' you get an extra row. The column 'Val' is assigned the non-zero value that led to the expansion and the 'Category' is arbitrarily based on where the value came from.
The result should look like this:
Name A B C Val Category
x 1.1 0 0.2 1.1 first
x 1.1 0 0.2 0.2 third
y 0 0.1 0 0.1 second
z 0.5 0.1 0.3 0.5 first
z 0.5 0.1 0.3 0.1 second
z 0.5 0.1 0.3 0.3 third
This is probably the wrong approach, but since I only have three columns, I thought I would repeat every row 3 times using the repeat function on the index, loop through the rows with a step of 3 and apply three functions to assign the target and category, and then drop the rows where the target is 0.
def targeta(row):
    target = row
    val = 'first'
    return target, val
def targetb(row):
    target = row
    val = 'second'
    return target, val
def targetc(row):
    target = row
    val = 'third'
    return target, val
df_repeat = df.loc[df.index.repeat(3)]
for i in range(1,len(df_repeat)-3,3):
    df_repeat.iloc[i][['Target','Category']]=targeta(df_repeat.iloc[i]['A'])
    df_repeat.iloc[i+1][['Target','Category']]=targetb(df_repeat.iloc[i+1]['B'])
    df_repeat.iloc[i+2][['Target','Category']]=targetc(df_repeat.iloc[i+2]['C'])
I only got to this point and realized I am getting an empty dataframe. Any suggestions on what to do?
asked Feb 12 at 20:57 by BellBell, edited Feb 12 at 21:36 by ouroboros1
5 Answers
You could replace the 0s with NaNs, rename the columns to your categories, reshape to long with stack, and join back to the original to duplicate the rows:
out = (df
   .drop(columns=['Val', 'Category'])
   .join(df[['A', 'B', 'C']]
         .set_axis(['first', 'second', 'third'], axis=1)
         .rename_axis(columns='Category')
         .replace(0, pd.NA)
         .stack()
         .rename('Val')
         .reset_index(-1)
        )
)
Output:
Name A B C Category Val
0 x 1.1 0.0 0.2 first 1.1
0 x 1.1 0.0 0.2 third 0.2
1 y 0.0 0.1 0.0 second 0.1
2 z 0.5 0.1 0.3 first 0.5
2 z 0.5 0.1 0.3 second 0.1
2 z 0.5 0.1 0.3 third 0.3
Here's one approach:
tmp = (df[[*'ABC']].where(lambda x: x != 0)
       .stack()
       .rename_axis([None, 'Category'])
       .reset_index(1, name='Val')
       .iloc[:, [1,0]]
       )
out = (df.iloc[:, :4]
       .merge(tmp,
              left_index=True,
              right_index=True,
              how='right')
       )
Output:
Name A B C Val Category
0 x 1.1 0.0 0.2 1.1 A
0 x 1.1 0.0 0.2 0.2 C
1 y 0.0 0.1 0.0 0.1 B
2 z 0.5 0.1 0.3 0.5 A
2 z 0.5 0.1 0.3 0.1 B
2 z 0.5 0.1 0.3 0.3 C
Explanation / intermediates
- Select the relevant columns (['A','B','C']) and apply df.where to replace 0 with np.nan (or indeed: use df.replace) and df.stack. Note that NaN values will be dropped.
df[[*'ABC']].where(lambda x: x != 0).stack()
0 A 1.1
C 0.2
1 B 0.1
2 A 0.5
B 0.1
C 0.3
dtype: float64
- Next, use Series.rename_axis to rename level 1 to 'Category' and add it as a column via Series.reset_index. Passing name='Val' names the values column 'Val'.
- Add df.iloc to switch the order of the two columns:
tmp
Val Category
0 1.1 A
0 0.2 C
1 0.1 B
2 0.5 A
2 0.1 B
2 0.3 C
- Ignoring ['Val', 'Category'] from the original df, apply df.merge on the indices with how='right'.
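The first step above relies on stack dropping NaN values. As far as I know, that holds for the legacy implementation, while the newer one (opt-in via future_stack=True since pandas 2.1) keeps them. A minimal sketch that adds an explicit dropna so the result does not depend on which implementation is active; the small frame is rebuilt here only to keep the example self-contained:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'A': [1.1, 0.0, 0.5],
                   'B': [0.0, 0.1, 0.1],
                   'C': [0.2, 0.0, 0.3]})

# Mask zeros as NaN, reshape to long form, then drop the NaN entries
# explicitly instead of relying on stack to do it.
stacked = df[['A', 'B', 'C']].where(lambda x: x != 0).stack().dropna()
print(stacked)
```

With the legacy behaviour the extra dropna is simply a no-op.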
To customize the categories, you can use Series.map:
out['Category'] = out['Category'].map({'A': 'first', ...})
Or, since your columns are in alphabetical order, you can use pd.Categorical with Categorical.rename_categories.
pd.Categorical(out['Category'])
['A', 'C', 'B', 'A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']
Hence:
out['Category'] = (pd.Categorical(out['Category'])
                   .rename_categories(['first', 'second', 'third']))
Data used
import pandas as pd
import numpy as np
data = {'Name': {0: 'x', 1: 'y', 2: 'z'},
        'A': {0: 1.1, 1: 0.0, 2: 0.5},
        'B': {0: 0.0, 1: 0.1, 2: 0.1},
        'C': {0: 0.2, 1: 0.0, 2: 0.3},
        'Val': {0: np.nan, 1: np.nan, 2: np.nan},
        'Category': {0: np.nan, 1: np.nan, 2: np.nan}
        }
df = pd.DataFrame(data)
Using numpy indexing:
nms = np.array(['first', 'second', 'third'])
d = df[['A', 'B', 'C']]
row, col = np.argwhere(d.ne(0).to_numpy()).T  # ne(0) keeps every non-zero value, including negatives
res = df.iloc[row].assign(Val = d.to_numpy()[row, col], Category = nms[col])
res
Name A B C Val Category
0 x 1.1 0.0 0.2 1.1 first
0 x 1.1 0.0 0.2 0.2 third
1 y 0.0 0.1 0.0 0.1 second
2 z 0.5 0.1 0.3 0.5 first
2 z 0.5 0.1 0.3 0.1 second
2 z 0.5 0.1 0.3 0.3 third
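To make the indexing step easier to follow, here is a small sketch of the intermediate arrays on the same values. It assumes the goal is to select non-zero cells, so it uses ne(0); the frame is rebuilt only to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'A': [1.1, 0.0, 0.5],
                  'B': [0.0, 0.1, 0.1],
                  'C': [0.2, 0.0, 0.3]})

# argwhere returns one (row, column) pair per non-zero cell, in row-major order
pairs = np.argwhere(d.ne(0).to_numpy())
row, col = pairs.T
print(row)  # [0 0 1 2 2 2]
print(col)  # [0 2 1 0 1 2]
```

row is then used to repeat the original rows with iloc, and col picks both the value and the category label.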
Categories without word-based representations of numbers
Another possible solution:
cols = ['Val', 'Category']
df[cols] = df.loc[:, 'A':'C'].replace(0, np.nan).apply(
    lambda x: pd.Series(
        [x.dropna(), (1 + np.where(x.notna())[0])]),
    axis=1)
df.explode(cols)
The steps are:
- The code selects columns A to C from the dataframe and replaces all 0 values with NaN values using the replace method.
- Then, it applies a lambda function to each row using the apply method. Inside the lambda function, the dropna method is used to remove any NaN values from the row. The notna method is then used to identify the positions of the non-NaN values, and these positions are incremented by 1.
- The results are assigned to columns Val and Category.
- Finally, the explode method is used to transform each list-like element in the Val and Category columns into individual rows.
Output:
Name A B C Val Category
0 x 1.1 0.0 0.2 1.1 1
0 x 1.1 0.0 0.2 0.2 3
1 y 0.0 0.1 0.0 0.1 2
2 z 0.5 0.1 0.3 0.5 1
2 z 0.5 0.1 0.3 0.1 2
2 z 0.5 0.1 0.3 0.3 3
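The explode step can also be looked at in isolation. The sketch below uses a hypothetical frame (demo) whose cells already hold plain lists. It assumes pandas 1.3 or later, where explode accepts several columns, and the lists in each row must have matching lengths.

```python
import pandas as pd

# Hypothetical frame: each cell of 'Val' and 'Category' holds a list
demo = pd.DataFrame({'Name': ['x', 'y'],
                     'Val': [[1.1, 0.2], [0.1]],
                     'Category': [[1, 3], [2]]})

# Each list element becomes its own row; the original index label is repeated
out = demo.explode(['Val', 'Category'])
print(out)
```

The exploded columns come back with object dtype, so a cast such as astype may be needed before doing arithmetic on them.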
Categories with word-based representations of numbers
If we really need the categories as words, we can use the inflect library as follows:
# pip install inflect
import inflect
p = inflect.engine()
df.explode(cols).assign(
    Category=lambda x: x['Category']
    .map(lambda n: p.number_to_words(p.ordinal(n))))
Output:
Name A B C Val Category
0 x 1.1 0.0 0.2 1.1 first
0 x 1.1 0.0 0.2 0.2 third
1 y 0.0 0.1 0.0 0.1 second
2 z 0.5 0.1 0.3 0.5 first
2 z 0.5 0.1 0.3 0.1 second
2 z 0.5 0.1 0.3 0.3 third
import pandas as pd
data = {
    'Name': ['x', 'y', 'z'],
    'A': [1.1, 0, 0.5],
    'B': [0, 0.1, 0.1],
    'C': [0.2, 0, 0.3],
}
df = pd.DataFrame(data)
print(df)
'''
Name A B C
0 x 1.1 0.0 0.2
1 y 0.0 0.1 0.0
2 z 0.5 0.1 0.3
'''
# All columns except 'Name'
value_vars = [col for col in df.columns if col != 'Name']
'''
['A', 'B', 'C']
'''
category_map = {col: f"{i+1}" for i, col in enumerate(value_vars)}
'''
{'A': '1', 'B': '2', 'C': '3'}
'''
res = (
    pd.melt(df, id_vars=['Name'], value_vars=value_vars,
            var_name='Category', value_name='Val')
    .query('Val != 0')
    .assign(Category=lambda x: x['Category'].map(category_map))
    .reset_index(drop=True)
)
print(res)
'''
Name Category Val
0 x 1 1.1
1 z 1 0.5
2 y 2 0.1
3 z 2 0.1
4 x 3 0.2
5 z 3 0.3
'''
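The melt result above no longer carries the original A, B, C columns that the question's expected output shows. One possible way to bring them back is to merge the long frame onto the original on 'Name'. This is only a sketch: it assumes 'Name' is unique per row, and the labels mapping uses the category words from the question.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'A': [1.1, 0, 0.5],
                   'B': [0, 0.1, 0.1],
                   'C': [0.2, 0, 0.3]})

# Category words as used in the question (assumed mapping)
labels = {'A': 'first', 'B': 'second', 'C': 'third'}

# Long form with one row per non-zero value
long_df = (df.melt(id_vars='Name', value_vars=list(labels),
                   var_name='Category', value_name='Val')
             .query('Val != 0'))

# Merge back to recover A, B, C next to each expanded row
out = (df.merge(long_df, on='Name')
         .assign(Category=lambda x: x['Category'].map(labels)))
print(out)
```

If 'Name' is not unique, merging on the index (as in the stack-based answer) is the safer choice.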
Method 2 (better):
import pandas as pd
import numpy as np
data = {
    'Name': np.array(['x', 'y', 'z']),
    'A': np.array([1.1, 0, 0.5]),
    'B': np.array([0, 0.1, 0.1]),
    'C': np.array([0.2, 0, 0.3]),
}
df = pd.DataFrame(data)
print(df)
'''
Name A B C
0 x 1.1 0.0 0.2
1 y 0.0 0.1 0.0
2 z 0.5 0.1 0.3
'''
valueVars = df.columns[df.columns != 'Name']
#Index(['A', 'B', 'C'], dtype='object')
categoryLabels = np.array(
    [f"{i +1}" for i in range(len(valueVars))]
)
#['1' '2' '3']
namesExpanded = np.repeat(df['Name'].values,len(valueVars))
#['x' 'x' 'x' 'y' 'y' 'y' 'z' 'z' 'z']
categoriesExpanded = np.tile(categoryLabels, len(df))  # one copy of the labels per row, not per column
#['1' '2' '3' '1' '2' '3' '1' '2' '3']
valuesExpanded = df[valueVars].values.ravel()
#[1.1 0. 0.2 0. 0.1 0. 0.5 0.1 0.3]
mask = valuesExpanded != 0
df1 = pd.DataFrame(
    {
        'Name' : namesExpanded[mask],
        'Category' : categoriesExpanded[mask],
        'Val' : valuesExpanded[mask]
    })
print(df1)
'''
Name Category Val
0 x 1 1.1
1 x 3 0.2
2 y 2 0.1
3 z 1 0.5
4 z 2 0.1
5 z 3 0.3
'''