r - Filter one dataframe column based on partial match of a string from another dataframe - Stack Overflow

I have a main dataframe (df1) which I want to filter for common values with two others (df2, df3).

df1 <- data.frame (gene = c('TMEM201;PIK3CD','BRCA1','MECP2','TMEM201', 'HDAC4','TMEM201'))
df2 <- data.frame (gene = c('PIK3CD','GRIN2B','BRCA2'))
df3 <- data.frame (gene = c('TMEM201','GRIN2B','BRCA2'))
df1_common_df2 <- subset (df1, df1$gene %in% df2$gene)
df1_common_df3 <- subset (df1, df1$gene %in% df3$gene)

Filtering df1 against df3, I get df1_common_df3, which has 2 observations (the two 'TMEM201' rows).

Filtering df1 against df2, I get df1_common_df2, which has 0 observations even though 'PIK3CD' exists in df1. Can you propose a solution for this?

asked Mar 7 at 11:02 by John Fistikis

  • Your problem is the ; in 'TMEM201;PIK3CD'; everything else is correctly delimited by a comma. Right now the code effectively asks whether 'PIK3CD' %in% 'TMEM201;PIK3CD', which correctly returns FALSE. If you want to filter for substrings, you can try something like sapply(df1$gene, \(x) stringr::str_detect(string = x, pattern = df2$gene)). – D.J, commented Mar 7 at 11:13
  • This is a perfect use case for a function that I once wrote: stackoverflow/a/79377658/28479453 – Tim G, commented Mar 7 at 12:15

2 Answers

You could try

  • subset + grepl in base R
> subset(df1, grepl(paste0(unlist(df2), collapse = "|"), gene))
            gene
1 TMEM201;PIK3CD
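  • separate_longer_delim (tidyr) + inner_join (dplyr)
library(dplyr)
library(tidyr)  # separate_longer_delim() comes from tidyr, not dplyr

df1 %>%
    mutate(common = gene) %>%           # keep the original combined string
    separate_longer_delim(
        cols = gene,
        delim = ";"
    ) %>%                               # one row per gene symbol
    inner_join(df2, by = "gene") %>%    # exact match on each symbol
    select(common)

which gives

          common
1 TMEM201;PIK3CD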
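
One caveat with the grepl option: it matches substrings, so a pattern such as TMEM201 would also hit a longer symbol that merely contains it. The sketch below is a base R alternative that splits each entry on ";" and compares whole gene symbols, assuming ";" is the only delimiter used in the column. The helper name has_any_gene is made up for illustration and is not part of any package.

```r
df1 <- data.frame(gene = c('TMEM201;PIK3CD', 'BRCA1', 'MECP2', 'TMEM201', 'HDAC4', 'TMEM201'))
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))
df3 <- data.frame(gene = c('TMEM201', 'GRIN2B', 'BRCA2'))

# Hypothetical helper: TRUE for each element of x where at least one
# delim-separated token is exactly equal to an entry in `genes`.
has_any_gene <- function(x, genes, delim = ";") {
  tokens <- strsplit(as.character(x), delim, fixed = TRUE)
  vapply(tokens, function(tok) any(tok %in% as.character(genes)), logical(1))
}

# drop = FALSE keeps the result a data frame even though df1 has one column
df1_common_df2 <- df1[has_any_gene(df1$gene, df2$gene), , drop = FALSE]
df1_common_df3 <- df1[has_any_gene(df1$gene, df3$gene), , drop = FALSE]
```

With the example data, df1_common_df2 keeps only the 'TMEM201;PIK3CD' row, and df1_common_df3 keeps that row plus the two 'TMEM201' rows.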

The issue is that the first row of df1 contains a semicolon-separated value, TMEM201;PIK3CD. %in% compares whole strings, so when filtering df1 against df2 it finds no match for TMEM201;PIK3CD. Against df3 it does find matches, because TMEM201 also appears on its own in df1.
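
A quick way to see this behaviour is to test the combined string directly and then test its parts. This is only an illustrative check; the variable names whole_match, parts, and part_match are not from the question.

```r
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))

# %in% compares whole strings, so the combined entry is not found
whole_match <- 'TMEM201;PIK3CD' %in% df2$gene
print(whole_match)  # FALSE

# Splitting on ";" first exposes the individual gene symbols
parts <- strsplit('TMEM201;PIK3CD', ';', fixed = TRUE)[[1]]
part_match <- parts %in% df2$gene
print(part_match)   # FALSE TRUE
```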

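Use grepl with an alternation pattern built from df2 (or df3) to test whether any of its genes occurs in each df1 entry:

# For df2 matches: the pattern is "PIK3CD|GRIN2B|BRCA2"
pattern_df2 <- paste0(df2$gene, collapse="|")
# drop = FALSE keeps the result a data frame (df1 has a single column)
df1_common_df2 <- df1[grepl(pattern_df2, df1$gene), , drop = FALSE]

# For df3 matches: the pattern is "TMEM201|GRIN2B|BRCA2"
pattern_df3 <- paste0(df3$gene, collapse="|")
df1_common_df3 <- df1[grepl(pattern_df3, df1$gene), , drop = FALSE]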
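
Since the same two steps are repeated for each reference data frame, they could be wrapped in a loop. The following is a minimal sketch using lapply over a named list; the names refs and common are chosen here for illustration, and it assumes the gene symbols contain no regex metacharacters.

```r
df1 <- data.frame(gene = c('TMEM201;PIK3CD', 'BRCA1', 'MECP2', 'TMEM201', 'HDAC4', 'TMEM201'))
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))
df3 <- data.frame(gene = c('TMEM201', 'GRIN2B', 'BRCA2'))

# Named list of reference data frames to filter against
refs <- list(df2 = df2, df3 = df3)

# For each reference, build the alternation pattern and keep matching df1 rows
common <- lapply(refs, function(ref) {
  pattern <- paste0(ref$gene, collapse = "|")
  df1[grepl(pattern, df1$gene), , drop = FALSE]
})

common$df2  # rows of df1 that contain a gene from df2
common$df3  # rows of df1 that contain a gene from df3
```

The result is a list with one filtered data frame per reference, so adding another reference only means adding another entry to refs.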
