r - Filter rows based on combined set of values in a string - Stack Overflow

In R, I have the following dataframe with the column "overlap" listing rows that have overlap

In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))

df
         overlap
  1        1,2,3
  2        1,2,3
  3      1,2,3,4
  4          3,4
  5          5,6
  6        5,6,7
  7          6,7
  8          8,9
  9       8,9,10
  10        9,10
  11    11,12,13
  12    11,12,13
  13 11,12,13,14
  14       13,14
  15       15,16
  16    15,16,17
  17       16,17
  18       18,19
  19    18,19,20
  20       19,20

I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:

  1        1,2,3
  5          5,6
  8          8,9
  11    11,12,13
  15       15,16
  18       18,19

I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.

In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))

df
         overlap
  1        1,2,3
  2        1,2,3
  3      1,2,3,4
  4          3,4
  5          5,6
  6        5,6,7
  7          6,7
  8          8,9
  9       8,9,10
  10        9,10
  11    11,12,13
  12    11,12,13
  13 11,12,13,14
  14       13,14
  15       15,16
  16    15,16,17
  17       16,17
  18       18,19
  19    18,19,20
  20       19,20

I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:

  1        1,2,3
  5          5,6
  8          8,9
  11    11,12,13
  15       15,16
  18       18,19

I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.

Share Improve this question edited Nov 18, 2024 at 21:25 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Nov 18, 2024 at 18:35 bcrewbcrew 1053 bronze badges 2
  • 2 It might be better if you can go through the comments under @margusl's solution and clarify what are the valid overlap values. – ThomasIsCoding Commented Nov 19, 2024 at 10:24
  • 1 These are all great solutions to what I asked, thank you! I am also looking for flexibility in non-sequential groups, which I didn't mention. I selected margusl's because it fit the question and has some good flexibility. Friede's IV solution is great for what I asked and works well if all groups are sequential. But what if we change the 4s to 40 instead? I'm having mixed results with non-sequential groups, which I apologize, was not part of the original question. – bcrew Commented Nov 19, 2024 at 19:25
Add a comment  | 

3 Answers 3

Reset to default 6

One option for this particular example data is to create an igraph graph from row overlaps, detect connected components in resulting graph and use component's cluster id as a grouping variable. From there we can pick the first row from every group.

library(dplyr)
library(igraph)

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))


df |> 
  mutate(id = row_number(), .before = 1) |> 
  group_by(
    g_clust = 
      strsplit(overlap, ",") |> 
      # either create a directed graph or set duplicate = FALSE for 
      # corner cases like `overlap = c("1", "1,2,3", ...)`
      graph_from_adj_list(mode = "all", duplicate = FALSE) |> 
      components() |> 
      getElement("membership")
    ) |> 
  slice_head(n = 1)
#> # A tibble: 6 × 3
#> # Groups:   g_clust [6]
#>      id overlap  g_clust
#>   <int> <chr>      <dbl>
#> 1     1 1,2,3          1
#> 2     5 5,6            2
#> 3     8 8,9            3
#> 4    11 11,12,13       4
#> 5    15 15,16          5
#> 6    18 18,19          6

Overlaps graph for reference:

strsplit(df$overlap, ",") |> 
  graph_from_adj_list(mode = "all", duplicate = FALSE) |>
  plot()

We can try {ivs}:

x = vapply(strsplit(unique(overlap), ","), 
           \(i) as.numeric(i[c(1, length(i))]), numeric(2))

library(ivs)
int = iv_groups(iv(x[1, ], x[2, ]))

giving

> as.data.frame(int)
         y
1   [1, 4)
2   [5, 7)
3  [8, 10)
4 [11, 14)
5 [15, 17)
6 [18, 20)

The vapply is a bit redundant as we call as.numeric several times. Do you really want comma-separated integers stored as character?

transform(as.data.frame(int), 
          s = Vectorize(\(x, y) toString(x:y))(iv_start(int), iv_end(int) - 1))

giving

       int          s
1   [1, 4)    1, 2, 3
2   [5, 7)       5, 6
3  [8, 10)       8, 9
4 [11, 14) 11, 12, 13
5 [15, 17)     15, 16
6 [18, 20)     18, 19

Edit

@Chris is right in the comment below. I should add some explanation.

(1) Re-structure the data. Split the strings, find first and last value, coerce character to numeric.

x = # we assign the output of the pipe x |> ... |> ... to x 
  overlap |> # access the data
  unique() |> # get rid of duplicates (not needed)
  strsplit(",") |> # split on ",", we might want to add fixed=TRUE
  # returns a list of character vectors, so we iterate over it with lapply
  lapply(\(x) x[c(1, length(x))]) |> # get first and last element
  # "1,2,3" ---> "1" "2" "3" has length 3 while "3" "4" has length 2
  do.call(what="rbind") |> # list to 2-column matrix 
  type.convert(as.is=TRUE) # we coerce from character to numeric 

gives

> x
      [,1] [,2]
 [1,]    1    3
 [2,]    1    4
 [3,]    3    4
 [4,]    5    6
 [5,]    5    7
 [6,]    6    7
 [7,]    8    9
 [8,]    8   10
 [9,]    9   10
[10,]   11   13
[11,]   11   14
[12,]   13   14
[13,]   15   16
[14,]   15   17
[15,]   16   17
[16,]   18   19
[17,]   18   20
[18,]   19   20
> 
> # of 
> class(x)
[1] "matrix" "array" 

This obviously assumes that the lowest integer is in first position and the highest in last--a reasonable assumption? Otherwise we should coerce to numeric first and apply range on each list element.

(2) To create interval vectors, we use iv(). From it's documentation (cp. help(iv)):

iv() creates an interval vector from start and end vectors. This is how you will typically create interval vectors, and is often used with columns in a data frame.

i.e.

> library(ivs)
> y = iv(x[, 1], x[, 2])
> y
<iv<integer>[18]>
 [1] [1, 3)   [1, 4)   [3, 4)   [5, 6)   [5, 7)   [6, 7)   [8, 9)   [8, 10) 
 [9] [9, 10)  [11, 13) [11, 14) [13, 14) [15, 16) [15, 17) [16, 17) [18, 19)
[17] [18, 20) [19, 20)

Finally, we use iv_groups. From help(iv_groups):

This family of functions revolves around grouping overlapping intervals within a single iv. When multiple overlapping intervals are grouped together they result in a wider interval containing the smallest iv_start() and the largest iv_end() of the overlaps.

> z = iv_groups(y)
> z
<iv<integer>[6]>
[1] [1, 4)   [5, 7)   [8, 10)  [11, 14) [15, 17) [18, 20)

(3) It seems like your desired output is a character vector, where all integer sequences of the remaining intervals are pasted together.

To achieve this, we use ivs::iv_start() and ivs::iv_end() to access the boundaries of each interval. Now we would like to generate regular sequences. Unfortunately, : is not vectorised, hence we introduce:

seq2str = Vectorize(\(from, to) toString(from:to))

toSpring() is a wrapper for format(), it's help page states

The default method first converts x to character and then concatenates the elements separated by ", ".

Applying our custom function to each start and end gives

> seq2str(from = iv_start(z), to = iv_end(z) - 1)
[1] "1, 2, 3"    "5, 6"       "8, 9"       "11, 12, 13" "15, 16"     "18, 19"      

Note

You can also use this shorter version.

# input data 
overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", "5,6", "5,6,7", "6,7", "8,9", 
            "8,9,10", "9,10", "11,12,13", "11,12,13", "11,12,13,14", "13,14",
            "15,16", "15,16,17", "16,17", "18,19", "18,19,20", "19,20")
# piped version
library(ivs)
overlap |> 
  strsplit(",") |> 
  lapply(\(x) x[c(1, length(x))]) |> 
  do.call(what="rbind") |>
  type.convert(as.is=TRUE) |>
  list(. = _) |>
  with(iv(.[, 1], .[, 2])) |>
  iv_groups() |>
  list(. = _) |>
  with(Vectorize(\(from, to) toString(from:to))(iv_start(.), to = iv_end(.) - 1))

where we use a trick to be able to make use of the forward pipe operator from base.

Here is an igraph option

df %>%
    mutate(id = str_c("r", row_number()), .before = 1) %>%
    separate_longer_delim(overlap, ",") %>%
    graph_from_data_frame() %>%
    components() %>%
    membership() %>%
    enframe() %>%
    filter(startsWith(name, "r")) %>%
    mutate(name = as.integer(sub("\\D", "", name))) %>%
    arrange(name) %>%
    slice_head(n = 1, by = value) %>%
    pull(name) %>%
    slice(.data = df)

If we run a dummy dataset like below (slightly different the example from OP's data, see the first overlap value)

df <- data.frame(overlap = c(
    "1", "1,2,3", "1,2,3,4", "3,4",
    "5,6", "5,6,7", "6,7",
    "8,9", "8,9,10", "9,10",
    "11,12,13", "11,12,13",
    "11,12,13,14", "13,14",
    "15,16", "15,16,17", "16,17",
    "18,19", "18,19,20", "19,20"
))

we will obtain

   overlap
1        1
2      5,6
3      8,9
4 11,12,13
5    15,16
6    18,19

and the original data from OP's question (starting from "1,2,3" as the first value) gives

   overlap
1    1,2,3
2      5,6
3      8,9
4 11,12,13
5    15,16
6    18,19

发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745605445a4635647.html

相关推荐

发表回复

评论列表(0条)

  • 暂无评论

联系我们

400-800-8888

在线咨询: QQ交谈

邮件:admin@example.com

工作时间:周一至周五,9:30-18:30,节假日休息

关注微信