README

Record linkage and deduplication of individual-level data, such as repeated spells in hospital or recurrent cases of infection, is a common task in epidemiological analysis and other fields of research.

The diyar package aims to provide a simple and flexible implementation of deterministic record linkage and episode grouping for the application of case definitions in epidemiological analysis.

Installation

# Install the latest CRAN release 
install.packages("diyar")

# Or, install the development version from GitHub
install.packages("devtools")
devtools::install_github("OlisaNsonwu/diyar")

Cheat sheet

Usage

The diyar package has two main parts: multistage record grouping (record_group()) and episode grouping (fixed_episodes(), rolling_episodes() and episode_group()) for applying case definitions in epidemiological analysis. number_line objects are used for both.

library(diyar)

l <- as.Date("01/04/2019", "%d/%m/%Y")
r <- as.Date("30/04/2019", "%d/%m/%Y")
nl <- number_line(l, r)
nl
#> [1] "2019-04-01 -> 2019-04-30"
reverse_number_line(nl)
#> [1] "2019-04-30 <- 2019-04-01"
shift_number_line(nl, -2)
#> [1] "2019-03-30 -> 2019-04-28"
expand_number_line(nl, 2)
#> [1] "2019-03-30 -> 2019-05-02"
number_line_sequence(nl, by = 3)
#> [[1]]
#>  [1] "2019-04-01" "2019-04-04" "2019-04-07" "2019-04-10" "2019-04-13"
#>  [6] "2019-04-16" "2019-04-19" "2019-04-22" "2019-04-25" "2019-04-28"
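
The outputs above follow from plain date arithmetic. The sketch below reproduces them in base R without diyar. It is only an illustration of what the printed results imply, not the package's implementation, and it assumes that shifting moves both ends by the same amount while expanding moves each end outwards.

```r
# Base R only: the same end points as the number_line example above
l <- as.Date("2019-04-01")
r <- as.Date("2019-04-30")

shifted  <- c(l - 2, r - 2)   # both ends move back two days
expanded <- c(l - 2, r + 2)   # each end moves outwards by two days
steps    <- seq(l, r, by = 3) # every third day from l, never passing r

format(shifted)
#> [1] "2019-03-30" "2019-04-28"
format(expanded)
#> [1] "2019-03-30" "2019-05-02"
length(steps)
#> [1] 10
```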

data(infections)
db <- infections[c("date")]
# Dates
db$date
#>  [1] "2018-04-01" "2018-04-07" "2018-04-13" "2018-04-19" "2018-04-25"
#>  [6] "2018-05-01" "2018-05-07" "2018-05-13" "2018-05-19" "2018-05-25"
#> [11] "2018-05-31"

# Fixed episodes
db$f_epid <- fixed_episodes(date = db$date, case_length = 15, display = FALSE, group_stats = TRUE)
#> Episode grouping complete: 0 record(s) with a unique ID.

# Rolling episodes
db$r_epid <- rolling_episodes(date = db$date, case_length = 15, recurrence_length = 40,
                              display = FALSE, group_stats = TRUE)
#> Episode grouping complete: 0 record(s) with a unique ID.

db[c("f_epid","r_epid")]
#>                               f_epid                           r_epid
#> 1  E.01 2018-04-01 -> 2018-04-13 (C) E.1 2018-04-01 -> 2018-05-31 (C)
#> 2  E.01 2018-04-01 -> 2018-04-13 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 3  E.01 2018-04-01 -> 2018-04-13 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 4  E.04 2018-04-19 -> 2018-05-01 (C) E.1 2018-04-01 -> 2018-05-31 (R)
#> 5  E.04 2018-04-19 -> 2018-05-01 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 6  E.04 2018-04-19 -> 2018-05-01 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 7  E.07 2018-05-07 -> 2018-05-19 (C) E.1 2018-04-01 -> 2018-05-31 (D)
#> 8  E.07 2018-05-07 -> 2018-05-19 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 9  E.07 2018-05-07 -> 2018-05-19 (D) E.1 2018-04-01 -> 2018-05-31 (D)
#> 10 E.10 2018-05-25 -> 2018-05-31 (C) E.1 2018-04-01 -> 2018-05-31 (D)
#> 11 E.10 2018-05-25 -> 2018-05-31 (D) E.1 2018-04-01 -> 2018-05-31 (D)
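
To see why the fixed window yields four episodes here, the grouping can be sketched in base R. This is a simplified illustration, not diyar's implementation. It assumes the dates are sorted and that a window covers the index date plus the following case_length days, and it rebuilds the dates with seq() from the printout above rather than loading the infections dataset.

```r
# Dates as printed above: every sixth day from 2018-04-01 to 2018-05-31
dates <- seq(as.Date("2018-04-01"), as.Date("2018-05-31"), by = 6)
case_length <- 15

epid <- integer(length(dates))  # 0 means "not yet assigned to an episode"
while (any(epid == 0L)) {
  i <- which(epid == 0L)[1]     # earliest unassigned record becomes the index case
  in_window <- epid == 0L &
    dates >= dates[i] &
    dates <= dates[i] + case_length
  epid[in_window] <- i          # episode ID is the row number of its index case
}

epid
#> [1]  1  1  1  4  4  4  7  7  7 10 10
```

The IDs 1, 4, 7 and 10 correspond to E.01, E.04, E.07 and E.10 in the f_epid column. In the r_epid column, by contrast, all eleven records fall into a single episode.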

# Two stages of record grouping
data(staff_records)

staff_records$pids_a <- record_group(staff_records, sn = r_id, criteria = c(forename, surname),
                                     data_source = sex, display = FALSE)
#> Record grouping complete: 1 record(s) with a unique ID.
staff_records
#>   r_id forename  surname sex    dataset       pids_a
#> 1    1    James    Green   M Staff list P.1 (CRI 02)
#> 2    2     <NA> Anderson   M Staff list P.2 (CRI 02)
#> 3    3    Jamey    Green   M  Pay slips P.1 (CRI 02)
#> 4    4              <NA>   F  Pay slips P.4 (No Hit)
#> 5    5  Derrick Anderson   M Staff list P.2 (CRI 02)
#> 6    6  Darrack Anderson   M  Pay slips P.2 (CRI 02)
#> 7    7 Christie    Green   F Staff list P.1 (CRI 02)
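
The staged matching behind this result can be sketched in base R. The code below is a simplified illustration under stated assumptions, not how record_group() is implemented. It re-types the forename and surname columns from the printout above, treats empty strings and NA as missing values that never match, and lets a record keep the group from the first stage at which it matched. It does not merge groups across stages. The names pid, stage and criteria are chosen for this example only.

```r
# Data re-typed from the staff_records printout (forename and surname only)
staff <- data.frame(
  r_id = 1:7,
  forename = c("James", NA, "Jamey", "", "Derrick", "Darrack", "Christie"),
  surname = c("Green", "Anderson", "Green", NA, "Anderson", "Anderson", "Green"),
  stringsAsFactors = FALSE
)

pid <- rep(NA_integer_, nrow(staff))    # group ID, NA = not linked yet
stage <- rep(NA_integer_, nrow(staff))  # stage at which the record was linked

criteria <- c("forename", "surname")    # stage 1, then stage 2
for (s in seq_along(criteria)) {
  key <- staff[[criteria[s]]]
  key[!is.na(key) & key == ""] <- NA    # treat empty strings as missing
  first <- match(key, key)              # first record holding the same key
  # a key counts as a match only if it is present and shared by 2+ records
  shared <- !is.na(key) & (duplicated(key) | duplicated(key, fromLast = TRUE))
  new <- is.na(pid) & shared            # only records not linked at an earlier stage
  pid[new] <- staff$r_id[first[new]]
  stage[new] <- s
}
pid[is.na(pid)] <- staff$r_id[is.na(pid)]  # unmatched records keep their own ID

data.frame(r_id = staff$r_id, pid = pid, stage = stage)
```

No forename is shared by two records, so nothing links at stage 1. The surnames Green and Anderson then link records 1, 3, 7 and 2, 5, 6 at stage 2, and record 4 stays on its own, which agrees with the P.1, P.2 and P.4 groups shown above.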
