Summing a set of R dataframe rows (column-wise), while retaining the first n columns - Stack Overflow

I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), an

I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), and column 22-72 are species with relative abundances as values. Due to processing of the data, each sample (i.e., col 1) has multiple rows for all species (with variable rel. ab. values). The abiotic variables of each of that row (of a sample) are identical.

I would like to sum the relative abundance values of each species per sample.

Below, you can find a example original dataframe and the desired outcome.

Original:

df <- data.frame(
  sample = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3),
  var1 = c(3,3,3,3,3,3,3,7,7,7,7,7,7,7,2,2,2,2,2,2),
  var2 = c(4,4,4,4,4,4,4,42,42,42,42,42,42,42,2,2,2,2,2,2),
  species1 = c(0,0,0.05,0,0,0.02,0,0,0,0,0,0,0,0,0,0.001,0.02,0.03,0.001,0),
  species2 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.002,0.03,0,0,0)
)

desired outcome:

df_summed <- data.frame(
  sample = c(1, 2, 3),
  var1 = c(3, 7, 2),
  var2 = c(4, 42, 2),
  species1 = c(0.07, 0, 0.052),
  species2 = c(0, 0, 0.033)
)

I've tried multiple things with dplyr functions like group_by and summarise. For example:

df_summed <- df %>%
  group_by(across(1:21)) %>%
  summarise(across(22:ncol(df), sum), .groups = "drop")

but this gives me the error

Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Locations 52, 53, 54, …, 71, and 72 don't exist.
ℹ There are only 51 columns.

while the df does have 72 columns... ncol(df) yields 72

Could anyone assist me how to perform this operation?

I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), and column 22-72 are species with relative abundances as values. Due to processing of the data, each sample (i.e., col 1) has multiple rows for all species (with variable rel. ab. values). The abiotic variables of each of that row (of a sample) are identical.

I would like to sum the relative abundance values of each species per sample.

Below, you can find a example original dataframe and the desired outcome.

Original:

df <- data.frame(
  sample = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3),
  var1 = c(3,3,3,3,3,3,3,7,7,7,7,7,7,7,2,2,2,2,2,2),
  var2 = c(4,4,4,4,4,4,4,42,42,42,42,42,42,42,2,2,2,2,2,2),
  species1 = c(0,0,0.05,0,0,0.02,0,0,0,0,0,0,0,0,0,0.001,0.02,0.03,0.001,0),
  species2 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.002,0.03,0,0,0)
)

desired outcome:

df_summed <- data.frame(
  sample = c(1, 2, 3),
  var1 = c(3, 7, 2),
  var2 = c(4, 42, 2),
  species1 = c(0.07, 0, 0.052),
  species2 = c(0, 0, 0.033)
)

I've tried multiple things with dplyr functions like group_by and summarise. For example:

df_summed <- df %>%
  group_by(across(1:21)) %>%
  summarise(across(22:ncol(df), sum), .groups = "drop")

but this gives me the error

Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Locations 52, 53, 54, …, 71, and 72 don't exist.
ℹ There are only 51 columns.

while the df does have 72 columns... ncol(df) yields 72

Could anyone assist me how to perform this operation?

Share Improve this question asked Mar 3 at 9:00 RobHRobH 1,2921 gold badge12 silver badges23 bronze badges 2
  • 4 This is just aggregate(.~sample+var1+var2, df, sum). Do not name your data df, this masks stats::df(). Note from the docs dplyr.tidyverse./reference/across.html across() is for columns not rows. You are looking for df |> summarise(across(starts_with('species'), sum), .by = c(sample, var1, var2)) – Friede Commented Mar 3 at 9:03
  • Potential duplicate: stackoverflow/questions/78160636/… – Edward Commented Mar 3 at 12:14
Add a comment  | 

2 Answers 2

Reset to default 4

After your group_by, you need to adjust the summarise columns accordingly:

For your toy data, the number of columns to group on is 3 (1 + 2 vars). The columns to summarise on (excluding these grouping columns) is now 1:2, not 4:5.

df_summed <- df %>%
  group_by(across(1:3)) %>%
  summarise(across(1:2, sum), .groups = "drop")
# A tibble: 3 × 5
  sample  var1  var2 species1 species2
   <dbl> <dbl> <dbl>    <dbl>    <dbl>
1      1     3     4    0.07     0    
2      2     7    42    0        0    
3      3     2     2    0.052    0.033

Although Sam's answer advises against using numbers to select columns, this answer explains why yours failed.

As you've tagged dplyr, you can use summarise(across()) the columns that satisfy the condition starts_with("species") and take the sum.

library(dplyr)
out <- df |>
    summarise(
        across(starts_with("species"), sum),
        .by = c(sample, var1, var2)
    )

#   sample var1 var2 species1 species2
# 1      1    3    4    0.070    0.000
# 2      2    7   42    0.000    0.000
# 3      3    2    2    0.052    0.033

identical(out, df_summed)
# [1] TRUE

Regarding the difference between this and the approach in your question, here is an extract from the data.table FAQ:

You may have heard that it is generally bad practice to refer to columns by number rather than name, though. If your colleague comes along and reads your code later they may have to hunt around to find out which column is number 5. If you or they change the column ordering higher up in your R program, you may produce wrong results with no warning or error if you fet to change all the places in your code which refer to column number 5. That is your fault not R’s or data.table’s. It’s really really bad. Please don’t do it.

Alternative approach

The above is relatively little code but it's not really consistent with the tidyverse philosophy which outlines that data should be stored in the format where:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

This is not the case in your data. It can be rearranged into tidy format as follows:

df_long <- df |>
    tidyr::pivot_longer(starts_with("species"), names_to = "species")
# # A tibble: 40 × 5
#    sample  var1  var2 species  value
#     <dbl> <dbl> <dbl> <chr>    <dbl>
#  1      1     3     4 species1  0
#  2      1     3     4 species2  0
#  3      1     3     4 species1  0
#  4      1     3     4 species2  0
#  5      1     3     4 species1  0.05
#  6      1     3     4 species2  0
#  7      1     3     4 species1  0
#  8      1     3     4 species2  0
#  9      1     3     4 species1  0
# 10      1     3     4 species2  0
# # ℹ 30 more rows
# # ℹ Use `print(n = ...)` to see more rows

Or, if your real columns do not actually start with "species", selecting all but certain columns will lead to the same output, e.g. tidyr::pivot_longer(!c(sample, var1, var2), names_to = "species").

It is more natural to summarise() in this format as you can simply do so by group and there's no need to iterate over columns:

out_long <- df_long |>
    summarise(
        value = sum(value),
        .by = c(sample, var1, var2, species)
    )
# # A tibble: 6 × 5
#   sample  var1  var2 species  value
#    <dbl> <dbl> <dbl> <chr>    <dbl>
# 1      1     3     4 species1 0.07
# 2      1     3     4 species2 0
# 3      2     7    42 species1 0
# 4      2     7    42 species2 0
# 5      3     2     2 species1 0.052
# 6      3     2     2 species2 0.033

This is probably the format it makes sense to keep the data in for further analysis. However, if you need it in wide format to present it then you can pivot back:

out_long |>
    tidyr::pivot_wider(
        id_cols = c(sample, var1, var2),
        names_from = species
    )
# ^^ same as desired output

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745102620a4611368.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信