I am working with large administrative data and have found that there are correctly-ordered names and names that are ordered in the reverse (definitely expected, but I haven't had to parse such errors before). I need to figure out which names have been reversed and correct the order of those names based on other observations of the correctly-ordered names.

Example:

tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))

I have a dataset of first names and last names which may or may not cover the observed names.

Example:

tibble(forename = c("MIMO", "KEY", "DAVID"),
       surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))

I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.

Example:

tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))

I have a dataset of first names and last names which may or may not cover the observed names.

Example:

tibble(forename = c("MIMO", "KEY", "DAVID"),
       surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))

I have tried probabilistic matching to no avail because the data produce too many results for the packages to handle. I would like to steer clear of fuzzy matching.

Share Improve this question edited Feb 3 at 9:53 ThomasIsCoding 104k9 gold badges37 silver badges103 bronze badges asked Feb 3 at 3:28 Shaq 534 bronze badges

Add a comment |

3 Answers 3

Sorted by: Reset to default 3

Here's a solution for this toy data example using sapply and strsplit.

names <- tibble(forename = c("MIMO", "KEY", "DAVID"), surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN"))

names.list <- lapply(names, strsplit, split=" ")

df['new_names'] <- sapply(strsplit(df$names, " "), \(x) {
                     if(any((x[1]==names.list$forename & x[2]==names.list$surname))) {
                       paste(x[1], x[2]) # do nothing
                       } else 
                         if(any((x[1]==names.list$surname & x[2]==names.list$forename))) 
                           paste(x[2], x[1]) # reverse the names
                         else paste(x[1], x[2]) # do nothing
                     } )

df
# A tibble: 6 × 2
  names                 new_names            
  <chr>                 <chr>                
1 MIMO DAGRYL           MIMO DAGRYL          
2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
3 KEY MORPHIN           KEY MORPHIN          
4 DAGRYL MIMO           MIMO DAGRYL          
5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
6 KEY MORPHIN           KEY MORPHIN

names
# A tibble: 3 × 2
  forename surname    
  <chr>    <chr>      
1 MIMO     DAGRYL     
2 KEY      LUMPERSTEIN
3 DAVID    MORPHIN

Create reversed and correct versions of names in your reference dataset, then use the reversed names as a key to merge in corrections. Then coalesce to overwrite with corrections where they exist.

library(dplyr)

correct_nms <- correct_nms %>%
  transmute(
    reversed = paste(surname, forename),
    correct = paste(forename, surname)
  )

dat %>%
  left_join(correct_nms, join_by(names == reversed)) %>%
  mutate(
    names = coalesce(correct, names),
    correct = NULL
  )

Result:

# A tibble: 6 × 1
  names                
  <chr>                
1 MIMO DAGRYL          
2 ANASTASIA LUMPERSTEIN
3 KEY MORPHIN          
4 MIMO DAGRYL          
5 ANASTASIA LUMPERSTEIN
6 KEY MORPHIN

Note you’ll run into trouble if both variants are valid names — eg, if there are two different individuals named "FRANKLIN ROSE" and "ROSE FRANKLIN". Unless you have some other way of disambiguating these, you should at least remove them from your lookup table:

correct_nms <- correct_nms %>% 
  filter(!(correct %in% reversed))

Or alternatively from your main dataset:

dat <- dat %>%
  filter(!(names %in% correct_nms$correct & names %in% correct_nms$reversed))

Here are two base R options

Note: Option 2 might be more efficient than Option 1 since the replacement happens to the values only where the reversed name orders are detected.

Option 1

idx <- rowSums(
    sapply(
        list(lst, rev(lst)),
        function(x) {
            match(nms$names, do.call(paste, x))
        }
    ),
    na.rm = TRUE
)

out <-transform(
    nms,
    ordered = replace(names, idx>0, do.call(paste, lst)[idx])
)

Option 2

transform(
    nms,
    ordered = {
        fnm <- sub("\\s.*$", "", names)
        snm <- sub(".*\\s", "", names)
        idx <- with(lst, which(match(fnm, surname) == match(snm, forename)))
        replace(names, idx, paste0(snm[idx], " ", fnm[idx]))
    }
)

and you will obtain

> out
                  names               ordered
1           MIMO DAGRYL           MIMO DAGRYL
2 LUMPERSTEIN ANASTASIA LUMPERSTEIN ANASTASIA
3           KEY MORPHIN           KEY MORPHIN
4           DAGRYL MIMO           MIMO DAGRYL
5 ANASTASIA LUMPERSTEIN ANASTASIA LUMPERSTEIN
6           KEY MORPHIN           KEY MORPHIN

data

nms <- tibble(names = c("MIMO DAGRYL", "LUMPERSTEIN ANASTASIA", "KEY MORPHIN", "DAGRYL MIMO", "ANASTASIA LUMPERSTEIN", "KEY MORPHIN"))


lst <- tibble(
    forename = c("MIMO", "KEY", "DAVID"),
    surname = c("DAGRYL", "LUMPERSTEIN", "MORPHIN")
)

发布者：admin，转转请注明出处：http://www.yc00.com/questions/1745250535a4618652.html

r - Method for finding reversed names in variable - Stack Overflow

3 Answers 3

data

发表回复

评论列表（0条）

联系我们

400-800-8888

r - Method for finding reversed names in variable - Stack Overflow

3 Answers 3

data

相关推荐

r - Method for finding reversed names in variable - Stack Overflow

发表回复

评论列表（0条）

联系我们

400-800-8888