I have a main dataframe (df1) which I want to filter for common values with two others (df2, df3).
df1 <- data.frame (gene = c('TMEM201;PIK3CD','BRCA1','MECP2','TMEM201', 'HDAC4','TMEM201'))
df2 <- data.frame (gene = c('PIK3CD','GRIN2B','BRCA2'))
df3 <- data.frame (gene = c('TMEM201','GRIN2B','BRCA2'))
df1_common_df2 <- subset (df1, df1$gene %in% df2$gene)
df1_common_df3 <- subset (df1, df1$gene %in% df3$gene)
Filtering df1 against df3, I get df1_common_df3, which has 2 observations (both 'TMEM201', rows 4 and 6 of df1).
Filtering df1 against df2, I get df1_common_df2, which has 0 observations even though 'PIK3CD' exists in df1. Can you propose a solution for this?
asked Mar 7 at 11:02 by John Fistikis

2 Answers
You could try subset + grepl in base R:
> subset(df1, grepl(paste0(unlist(df2), collapse = "|"), gene))
gene
1 TMEM201;PIK3CD
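or separate_longer_delim (from tidyr) + inner_join (from dplyr):
library(dplyr)
library(tidyr)  # provides separate_longer_delim()
df1 %>%
  mutate(common = gene) %>%                            # keep the original, unsplit value
  separate_longer_delim(cols = gene, delim = ";") %>%  # one row per gene
  inner_join(df2, by = "gene") %>%                     # keep rows whose gene is in df2
  select(common)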
which gives
common
1 TMEM201;PIK3CD
The issue is that the first row of df1 contains a semicolon-separated value, TMEM201;PIK3CD. %in% compares whole strings, so when filtering df1 against df2 it can't find a match for TMEM201;PIK3CD. When comparing with df3, it finds TMEM201 because that exact string occurs in both data frames.
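To make the whole-string behaviour of %in% concrete, here is a minimal sketch using the data from the question. The names in_df2 and in_df3 are only illustrative and do not appear in the original post.

```r
# Data from the question
df1 <- data.frame(gene = c('TMEM201;PIK3CD', 'BRCA1', 'MECP2', 'TMEM201', 'HDAC4', 'TMEM201'))
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))
df3 <- data.frame(gene = c('TMEM201', 'GRIN2B', 'BRCA2'))

# %in% tests each element of df1$gene as a complete string,
# so 'TMEM201;PIK3CD' is never equal to 'PIK3CD' or 'TMEM201'
in_df2 <- df1$gene %in% df2$gene
in_df3 <- df1$gene %in% df3$gene

which(in_df2)  # integer(0): no element of df1$gene equals a gene in df2
which(in_df3)  # 4 6: only the two plain 'TMEM201' rows
```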
You can use grepl with a pattern built from df2 (or df3) to test whether any of its genes occurs in each df1 value:
# For df2 matches
pattern_df2 <- paste0(df2$gene, collapse = "|")
# drop = FALSE keeps the result a data frame, since df1 has a single column
df1_common_df2 <- df1[grepl(pattern_df2, df1$gene), , drop = FALSE]
# For df3 matches
pattern_df3 <- paste0(df3$gene, collapse = "|")
df1_common_df3 <- df1[grepl(pattern_df3, df1$gene), , drop = FALSE]
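One caveat with the regex alternation above is that it matches substrings, so a hypothetical entry such as 'TMEM2010' would also match 'TMEM201'. If exact gene names matter, a possible alternative is to split each value on ";" and compare the pieces with %in%. The sketch below assumes ";" is the only delimiter; has_common_gene is a hypothetical helper, not part of the original answers.

```r
# Data from the question
df1 <- data.frame(gene = c('TMEM201;PIK3CD', 'BRCA1', 'MECP2', 'TMEM201', 'HDAC4', 'TMEM201'))
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))
df3 <- data.frame(gene = c('TMEM201', 'GRIN2B', 'BRCA2'))

# Hypothetical helper: TRUE for each element of `genes` that contains
# at least one `sep`-separated token exactly equal to an entry of `reference`
has_common_gene <- function(genes, reference, sep = ";") {
  tokens <- strsplit(as.character(genes), sep, fixed = TRUE)
  vapply(tokens, function(tok) any(tok %in% reference), logical(1))
}

# drop = FALSE keeps the single-column result a data frame
df1_common_df2 <- df1[has_common_gene(df1$gene, df2$gene), , drop = FALSE]
df1_common_df3 <- df1[has_common_gene(df1$gene, df3$gene), , drop = FALSE]
```

With the question's data, df1_common_df2 keeps row 1 and df1_common_df3 keeps rows 1, 4 and 6.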
[…] ; in 'TMEM201;PIK3CD', everything else is correctly delimited by ,. Right now the code asks if PIK3CD %in% 'TMEM201;PIK3CD', which it correctly returns as FALSE. If you want to filter for substrings you can try something like sapply(df1$gene, \(x) stringr::str_detect(string = x, pattern = df2$gene)). – D.J Commented Mar 7 at 11:13
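The sapply call suggested in the comment returns a logical matrix (one entry per pair of df1 value and df2 gene) rather than a ready-made row filter. A rough base R sketch of the same substring idea, reduced to one TRUE/FALSE per row of df1, might look like this. It swaps stringr::str_detect for grepl with fixed = TRUE, and the names hits and keep are illustrative only.

```r
# Data from the question
df1 <- data.frame(gene = c('TMEM201;PIK3CD', 'BRCA1', 'MECP2', 'TMEM201', 'HDAC4', 'TMEM201'))
df2 <- data.frame(gene = c('PIK3CD', 'GRIN2B', 'BRCA2'))

# One column per df2 gene, one row per df1 value:
# TRUE where the df2 gene occurs as a literal substring of the df1 value
hits <- vapply(
  df2$gene,
  function(p) grepl(p, df1$gene, fixed = TRUE),
  logical(nrow(df1))
)

# A df1 row is kept if it contains at least one df2 gene
keep <- rowSums(hits) > 0
df1_common_df2 <- df1[keep, , drop = FALSE]
```

As with any substring test, this also matches partial gene names, so it is only appropriate when that is acceptable.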