eteppo

Recoding IDs Using A Metadata File in R

Published: 2023-08-04
Updated: 2023-08-04

Raw data often comes with its own ID scheme. That scheme might not be enough, so you want to map the old IDs to new IDs that are unique across the whole dataset. You might feel an urge to change the raw data, but I think you should always keep raw data read-only. Editing it is too risky. Instead, you build the steps needed to get to the clean data you personally want or need, and apply them every time you do something with the raw data. You only need to write those functions once.

The old and new IDs should exist in a metadata file anyway, so it should be easy to recode the old ID values into the new ID scheme directly from that file.
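To make the idea concrete, here is a minimal sketch of recoding from a mapping table. The column names `old_id` and `new_id`, the inline CSV text, and the example values are all made up for illustration, and the sketch uses only base R so it runs without extra packages.

```r
# Hypothetical metadata: one row per old ID, with the new ID next to it.
metadata_csv <- "old_id,new_id
a1,P001
a2,P002
b1,P003"

# Read every column as character so IDs like "001" keep their leading zeros.
id_mapping <- read.csv(text = metadata_csv, colClasses = "character")

# Hypothetical raw data that still uses the old ID scheme.
raw <- data.frame(
  old_id = c("a2", "b1", "a1"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Look up each old ID in the mapping and build a new table.
# The raw data itself is never modified.
recoded <- data.frame(
  new_id = id_mapping$new_id[match(raw$old_id, id_mapping$old_id)],
  value = raw$value,
  stringsAsFactors = FALSE
)
```

A plain lookup like this silently produces `NA` for any old ID that is missing from the metadata, which is one reason to wrap the recoding in a function that checks for such problems.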

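This can get quite messy, but the following function did the job in my own situation. It warns when an old ID maps to several rows of new IDs, and it stops when the metadata has missing IDs, when the data contains IDs that are not in the metadata, or when the new IDs still do not uniquely identify the observations.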

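library(dplyr)      # select, rename, filter, joins, summarize
library(readr)      # read_csv, cols
library(stringr)    # str_c
library(assertthat) # assert_that

# data:          data frame that uses the old ID scheme
# metadata_file: path to a CSV file holding both the old and the new IDs
# from:          old ID column (same name in the data and the metadata)
# to:            character vector of new ID column names in the metadata
# ...:           optional filter conditions applied to the metadata
reidentify_with <- function(data, metadata_file, from, to, ...) {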
  
  # TRUE if the columns in `id_variables` uniquely identify every row.
  all_identified <- function(data, id_variables) {
    ids <- data %>%
      select(all_of(id_variables))
    anyDuplicated(ids) == 0
  }

  assert_that("data.frame" %in% class(data))
  assert_that(is.character(metadata_file))
  assert_that(length(metadata_file) == 1)

  data <- data %>%
    rename(from = {{from}})
  
  id_mapping <- metadata_file %>%
    read_csv(col_types = cols(.default = "c")) %>%
    filter(...) %>%
    select(from = {{ from }}, all_of(to)) %>%
    filter(!is.na(from)) %>%
    distinct()

  if (sum(is.na(id_mapping)) > 0) {
    stop("Metadata must not contain missing IDs.")
  }

  id_mapping_froms <- id_mapping %>%
    pull(from)

  if (nrow(id_mapping) > length(unique(id_mapping_froms))) {
    
    # Count rows per 'from' ID once so IDs and counts stay in the same order.
    duplicate_counts <- id_mapping %>%
      count(from, name = "n_duplicates") %>%
      filter(n_duplicates > 1)

    duplicated_ids <- duplicate_counts %>%
      pull(from)

    n_duplicates <- duplicate_counts %>%
      pull(n_duplicates)

    warning_message <- str_c(
      "Non-unique metadata 'from' IDs {", 
      str_c(duplicated_ids, collapse = ", "),
      "} were mapped to {",
      str_c(n_duplicates, collapse = ", "),
      "} rows of 'to' IDs, respectively."
    )

    warning(warning_message)
  
  }
  
  data_froms <- data %>%
    pull(from)

  if (!all(data_froms %in% id_mapping_froms)) {

    # Report each missing ID only once.
    missing_ids <- data_froms %>%
      magrittr::extract(!(data_froms %in% id_mapping_froms)) %>%
      unique()
    
    stop_message <- str_c(
      "Missing metadata IDs for {",
      str_c(missing_ids, collapse = ", "),
      "}. All ID values in the data must exist in the metadata."
    )

    stop(stop_message)

  }

  data <- id_mapping %>%
    right_join(data, by = "from") %>%
    select(-from) %>%
    distinct()

  if (!all_identified(data, to)) {

    # Attempt merging observations.
    duplicated_ids <- data %>%
      select(all_of(to)) %>%
      filter(base::duplicated(.))
    
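    # `derep` is not defined in this post; it is assumed to be a helper that
    # collapses the replicated values of a column into a single value.
    deduplicated_data <- duplicated_ids %>%
      distinct() %>%
      left_join(data, by = to) %>%
      group_by(across(all_of(to))) %>%
      summarize(across(everything(), derep)) %>%
      ungroup()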
    
    data <- data %>%
      anti_join(duplicated_ids, by = to) %>%
      bind_rows(deduplicated_data)

    if (!all_identified(data, to)) {
      stop("All observations must be uniquely identified by variables given in 'to'.")
    }

  }

  return(data)
  
}
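
The merging step relies on `derep`, which is not shown in this post. As a rough sketch of what such a helper might do, the hypothetical `collapse_values` below is my assumption, not the original: it returns the single non-missing value of a group and stops if the values conflict. The example data and column names are made up, and the code uses only base R.

```r
# Hypothetical stand-in for `derep`; the original helper is not shown.
# Collapses a vector to one value if all non-missing entries agree.
collapse_values <- function(x) {
  values <- unique(x[!is.na(x)])
  if (length(values) == 0) {
    return(x[1])  # everything is missing: keep an NA of the same type
  }
  if (length(values) > 1) {
    stop("Conflicting values cannot be merged: ",
         paste(values, collapse = ", "))
  }
  values
}

# Two rows share the new ID "P001" and complement each other.
observations <- data.frame(
  new_id = c("P001", "P001", "P002"),
  height = c(180, NA, 165),
  weight = c(NA, 75, 60),
  stringsAsFactors = FALSE
)

# Merge the rows of each new ID into a single observation.
merged <- aggregate(
  observations[c("height", "weight")],
  by = observations["new_id"],
  FUN = collapse_values
)
```

Stopping on conflicting values is a deliberate choice in this sketch: if two rows with the same new ID disagree, there is no safe way to pick one automatically.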