I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), and column 22-72 are species with relative abundances as values. Due to processing of the data, each sample (i.e., col 1) has multiple rows for all species (with variable rel. ab. values). The abiotic variables of each of that row (of a sample) are identical.
I would like to sum the relative abundance values of each species per sample.
Below, you can find a example original dataframe and the desired outcome.
Original:
df <- data.frame(
sample = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3),
var1 = c(3,3,3,3,3,3,3,7,7,7,7,7,7,7,2,2,2,2,2,2),
var2 = c(4,4,4,4,4,4,4,42,42,42,42,42,42,42,2,2,2,2,2,2),
species1 = c(0,0,0.05,0,0,0.02,0,0,0,0,0,0,0,0,0,0.001,0.02,0.03,0.001,0),
species2 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.002,0.03,0,0,0)
)
desired outcome:
df_summed <- data.frame(
sample = c(1, 2, 3),
var1 = c(3, 7, 2),
var2 = c(4, 42, 2),
species1 = c(0.07, 0, 0.052),
species2 = c(0, 0, 0.033)
)
I've tried multiple things with dplyr
functions like group_by
and summarise
. For example:
df_summed <- df %>%
group_by(across(1:21)) %>%
summarise(across(22:ncol(df), sum), .groups = "drop")
but this gives me the error
Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Locations 52, 53, 54, …, 71, and 72 don't exist.
ℹ There are only 51 columns.
while the df does have 72 columns... ncol(df)
yields 72
Could anyone assist me how to perform this operation?
I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), and column 22-72 are species with relative abundances as values. Due to processing of the data, each sample (i.e., col 1) has multiple rows for all species (with variable rel. ab. values). The abiotic variables of each of that row (of a sample) are identical.
I would like to sum the relative abundance values of each species per sample.
Below, you can find a example original dataframe and the desired outcome.
Original:
df <- data.frame(
sample = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3),
var1 = c(3,3,3,3,3,3,3,7,7,7,7,7,7,7,2,2,2,2,2,2),
var2 = c(4,4,4,4,4,4,4,42,42,42,42,42,42,42,2,2,2,2,2,2),
species1 = c(0,0,0.05,0,0,0.02,0,0,0,0,0,0,0,0,0,0.001,0.02,0.03,0.001,0),
species2 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.002,0.03,0,0,0)
)
desired outcome:
df_summed <- data.frame(
sample = c(1, 2, 3),
var1 = c(3, 7, 2),
var2 = c(4, 42, 2),
species1 = c(0.07, 0, 0.052),
species2 = c(0, 0, 0.033)
)
I've tried multiple things with dplyr
functions like group_by
and summarise
. For example:
df_summed <- df %>%
group_by(across(1:21)) %>%
summarise(across(22:ncol(df), sum), .groups = "drop")
but this gives me the error
Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Locations 52, 53, 54, …, 71, and 72 don't exist.
ℹ There are only 51 columns.
while the df does have 72 columns... ncol(df)
yields 72
Could anyone assist me how to perform this operation?
Share Improve this question asked Mar 3 at 9:00 RobHRobH 1,2921 gold badge12 silver badges23 bronze badges 2 |2 Answers
Reset to default 4After your group_by
, you need to adjust the summarise
columns accordingly:
For your toy data, the number of columns to group on is 3 (1 + 2 vars). The columns to summarise on (excluding these grouping columns) is now 1:2, not 4:5.
df_summed <- df %>%
group_by(across(1:3)) %>%
summarise(across(1:2, sum), .groups = "drop")
# A tibble: 3 × 5
sample var1 var2 species1 species2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 4 0.07 0
2 2 7 42 0 0
3 3 2 2 0.052 0.033
Although Sam's answer advises against using numbers to select columns, this answer explains why yours failed.
As you've tagged dplyr, you can use summarise(across())
the columns that satisfy the condition starts_with("species")
and take the sum.
library(dplyr)
out <- df |>
summarise(
across(starts_with("species"), sum),
.by = c(sample, var1, var2)
)
# sample var1 var2 species1 species2
# 1 1 3 4 0.070 0.000
# 2 2 7 42 0.000 0.000
# 3 3 2 2 0.052 0.033
identical(out, df_summed)
# [1] TRUE
Regarding the difference between this and the approach in your question, here is an extract from the data.table
FAQ:
You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you fet to change all the places in your code which refer to column number 5. That is your fault not R’s or data.table’s. It’s really really bad. Please don’t do it.
Alternative approach
The above is relatively little code but it's not really consistent with the tidyverse
philosophy which outlines that data should be stored in the format where:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
This is not the case in your data. It can be rearranged into tidy format as follows:
df_long <- df |>
tidyr::pivot_longer(starts_with("species"), names_to = "species")
# # A tibble: 40 × 5
# sample var1 var2 species value
# <dbl> <dbl> <dbl> <chr> <dbl>
# 1 1 3 4 species1 0
# 2 1 3 4 species2 0
# 3 1 3 4 species1 0
# 4 1 3 4 species2 0
# 5 1 3 4 species1 0.05
# 6 1 3 4 species2 0
# 7 1 3 4 species1 0
# 8 1 3 4 species2 0
# 9 1 3 4 species1 0
# 10 1 3 4 species2 0
# # ℹ 30 more rows
# # ℹ Use `print(n = ...)` to see more rows
Or, if your real columns do not actually start with "species"
, selecting all but certain columns will lead to the same output, e.g. tidyr::pivot_longer(!c(sample, var1, var2), names_to = "species")
.
It is more natural to summarise()
in this format as you can simply do so by group and there's no need to iterate over columns:
out_long <- df_long |>
summarise(
value = sum(value),
.by = c(sample, var1, var2, species)
)
# # A tibble: 6 × 5
# sample var1 var2 species value
# <dbl> <dbl> <dbl> <chr> <dbl>
# 1 1 3 4 species1 0.07
# 2 1 3 4 species2 0
# 3 2 7 42 species1 0
# 4 2 7 42 species2 0
# 5 3 2 2 species1 0.052
# 6 3 2 2 species2 0.033
This is probably the format it makes sense to keep the data in for further analysis. However, if you need it in wide format to present it then you can pivot back:
out_long |>
tidyr::pivot_wider(
id_cols = c(sample, var1, var2),
names_from = species
)
# ^^ same as desired output
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745102620a4611368.html
aggregate(.~sample+var1+var2, df, sum)
. Do not name your datadf
, this masksstats::df()
. Note from the docs dplyr.tidyverse./reference/across.htmlacross()
is for columns not rows. You are looking fordf |> summarise(across(starts_with('species'), sum), .by = c(sample, var1, var2))
– Friede Commented Mar 3 at 9:03