I am working with large administrative data and have found that there are correctly-ordered names and names that are ordered in the reverse (definitely expected, but I haven't had to parse such errors before). I need to figure out which names have been reversed and correct the order of those names based on other observations of the correctly-ordered names.
Example:
tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))
I have a dataset of first names and last names which may or may not cover the observed names.
Example:
tibble(forename = c("MIMO", "KEY", "DAVID"),
surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.
I am working with large administrative data and have found that there are correctly-ordered names and names that are ordered in the reverse (definitely expected, but I haven't had to parse such errors before). I need to figure out which names have been reversed and correct the order of those names based on other observations of the correctly-ordered names.
Example:
tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))
I have a dataset of first names and last names which may or may not cover the observed names.
Example:
tibble(forename = c("MIMO", "KEY", "DAVID"),
surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.
Share Improve this question edited Feb 3 at 9:53 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Feb 3 at 3:28 ShaqShaq 534 bronze badges3 Answers
Reset to default 3Here's a solution for this toy data example using sapply
and strsplit
.
names <- tibble(forename = c("MIMO", "KEY", "DAVID"), surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))
names.list <- lapply(names, strsplit, split=" ")
df['new_names'] <- sapply(strsplit(df$names, " "), \(x) {
if(any((x[1]==names.list$forename & x[2]==names.list$surname))) {
paste(x[1], x[2]) # do nothing
} else
if(any((x[1]==names.list$surname & x[2]==names.list$forename)))
paste(x[2], x[1]) # reverse the names
else paste(x[1], x[2]) # do nothing
} )
df
# A tibble: 6 × 2
names new_names
<chr> <chr>
1 MIMO DAGRYL MIMO DAGRYL
2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
3 KEY MORPHIN KEY MORPHIN
4 DAGRYL MIMO MIMO DAGRYL
5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
6 KEY MORPHIN KEY MORPHIN
names
# A tibble: 3 × 2
forename surname
<chr> <chr>
1 MIMO DAGRYL
2 KEY LUMPERSTEIN
3 DAVID MORPHIN
Create reversed and correct versions of names in your reference dataset, then use the reversed names as a key to merge in corrections. Then coalesce to overwrite with corrections where they exist.
library(dplyr)
correct_nms <- correct_nms %>%
transmute(
reversed = paste(surname, forename),
correct = paste(forename, surname)
)
dat %>%
left_join(correct_nms, join_by(names == reversed)) %>%
mutate(
names = coalesce(correct, names),
correct = NULL
)
Result:
# A tibble: 6 × 1
names
<chr>
1 MIMO DAGRYL
2 ANASTASIA LUMPERSTEIN
3 KEY MORPHIN
4 MIMO DAGRYL
5 ANASTASIA LUMPERSTEIN
6 KEY MORPHIN
Note you’ll run into trouble if both variants are valid names — eg, if there are two different individuals named "FRANKLIN ROSE"
and "ROSE FRANKLIN"
. Unless you have some other way of disambiguating these, you should at least remove them from your lookup table:
correct_nms <- correct_nms %>%
filter(!(correct %in% reversed))
Or alternatively from your main dataset:
dat <- dat %>%
filter(!(names %in% correct_nms$correct & names %in% correct_nms$reversed))
Here are two base R options
Note: Option 2 might be more efficient than Option 1 since the replacement happens to the values only where the reversed name orders are detected.
- Option 1
idx <- rowSums(
sapply(
list(lst, rev(lst)),
function(x) {
match(nms$names, do.call(paste, x))
}
),
na.rm = TRUE
)
out <-transform(
nms,
ordered = replace(names, idx>0, do.call(paste, lst)[idx])
)
- Option 2
transform(
nms,
ordered = {
fnm <- sub("\\s.*$", "", names)
snm <- sub(".*\\s", "", names)
idx <- with(lst, which(match(fnm, surname) == match(snm, forename)))
replace(names, idx, paste0(snm[idx], " ", fnm[idx]))
}
)
and you will obtain
> out
names ordered
1 MIMO DAGRYL MIMO DAGRYL
2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
3 KEY MORPHIN KEY MORPHIN
4 DAGRYL MIMO MIMO DAGRYL
5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
6 KEY MORPHIN KEY MORPHIN
data
nms <- tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))
lst <- tibble(
forename = c("MIMO", "KEY", "DAVID"),
surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN")
)
发布者:admin,转转请注明出处:http://www.yc00.com/questions/1745250535a4618652.html
评论列表(0条)