Risk Set Matching with rsmatch • rsmatch

Time-varying matching

Any time we are handling observational data (as opposed to that from a randomized study) and trying to measure the effect of some treatment, bias can potentially confound our estimates—undermining the analysis. To reduce this bias, matching attempts to find groups that are similar across measured variables, despite receiving alternate sides of treatment. Matching requires special care when treatment(s) (and related matching variables) happen at multiple multiple times, as in some longitudinal data.

The R package rsmatch implements various forms of matching on time-varying observational studies to help in causal inference.

library(rsmatch)

This short vignette aims to provide enough information on time-varying matching and the rsmatch package for someone to start data analysis. The package implements two different approaches, and we recommend a thorough read-through of the corresponding paper to understand the methodology, potential pitfalls, and other crucial considerations.

Balanced Risk Set Matching with brsmatch(), based on the work of Li, Propert, and Rosenbaum (2001)
Propensity Score Matching with Time-Dependent Covariates with coxph_match(), based on the work of Lu (2005)

OASIS data

We start by showing a data set that can illustrate the structure of a time-varying observational data set.

Consider the oasis data with 51 patients from the Open Access Series of Imaging Studies that are classified at each time point as having Alzheimer’s disease (AD) or not.

data(oasis)
head(oasis, n = 10)
#>    subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv   asf
#> 1   OAS2_0002     1         NA   M   12  -1  75        0  1678 0.736 1.046
#> 2   OAS2_0002     2         NA   M   12  -1  76      560  1738 0.713 1.010
#> 3   OAS2_0002     3         NA   M   12  -1  80     1895  1698 0.701 1.034
#> 4   OAS2_0007     1          3   M   16  -1  71        0  1357 0.748 1.293
#> 5   OAS2_0007     3          3   M   16  -1  73      518  1365 0.727 1.286
#> 6   OAS2_0007     4          3   M   16  -1  75     1281  1372 0.710 1.279
#> 7   OAS2_0009     1         NA   M   12   2  68        0  1457 0.806 1.205
#> 8   OAS2_0009     2         NA   M   12   2  69      576  1480 0.791 1.186
#> 9   OAS2_0010     1         NA   F   12   3  66        0  1447 0.769 1.213
#> 10  OAS2_0010     2         NA   F   12   3  68      854  1482 0.752 1.184

The data set has been modified to a “long” time-varying format with the following variables:

subject_id - A unique patient identifier
visit - An identifier of the time point, from 1 up to 4
time_of_ad - The visit in which a patient was identified to have AD, or NA if this patient has not yet been identified with AD
m_f, educ,ses, age - Baseline patient characteristics for gender, education level, socioeconomic status, age
mr_delay, e_tiv, n_wbv, asf - Time-varying patient characteristics for MR delay time, estimated total intracranial volume, and Atlas scaling factor.

Importantly, visits have a similar time difference between them. This sets up a panel data structure. The package does not help with data sets where subject visits happen irregularly or continuously.

This data set illustrates a few of the standard causal inference elements for time-varying matching, including

Treatment: The “treatment” used here is being diagnosed with AD, with the associated cognitive changes. NOTE: this is far from the standard concept of treatment in a randomized study, but we could be interested in the effect of a diagnosis of AD on other variables like mortality, health care cost, or time spent with loved ones.
Control groups across time: For visit \(t\), the control group includes all subjects without AD yet. For example, the control group at visit 1 includes all patients. The control group at visit 3 would include OAS2_0002, OAS2_0009, and OAS2_0010 above, but would NOT include OAS2_0007 (who has AD at that visit).
Treatment groups across time: For visit \(t\), the treatment group includes all subjects diagnosed with AD at that visit. Subject OAS2_0007 is in the treatment group at visit 3, but NOT in the treatment groups for visits 1, 2, or 4.
Matching (confounding) variables: Variables that relate to both the treatment (AD) and the chosen outcome. This includes both baseline characteristics and time-varying characters.

Unfortunately, the data is missing a key aspect for causal analysis, the outcome on which to measure our treatment effect. Open-source longitudinal data sets are somewhat scarce, and this was the most illustrative example we could find while including in the package. However, this data fully demonstrates the matching process. Applying a set of matches to determine the treatment effect is, thankfully, a typically straightforward process.

Our goal is to match subjects similar in their gender, education level, socioeconomic status, age, and other MRI measurements, but one of which moved from a cognitively normal state to AD in the next period and the other who remained cognitively normal. On these similar groups, we can hopefully isolate the effect of an AD diagnosis on an outcome of interest.

Balanced Risk Set Matching

Based on the work of Li et al. (2001), the brsmatch() function performs “Balanced Risk Set Matching”. Given our data structure, it finds matches based on minimizing the Mahalanobis distance between covariates. If balancing is desired, the model will try to minimize the imbalance in terms of specified balancing covariates in the final pair output. Each treated individual is matched to one other individual. Full details are provided in the available reference.

For example, we can find 5 matched pairs based on all covariates with balance across gender and age:

pairs <- brsmatch(
  n_pairs = 5,
  data = oasis,
  id = "subject_id", 
  time = "visit", 
  trt_time = "time_of_ad",
  covariates = c("m_f", "educ", "ses", "age", 
                 "mr_delay", "e_tiv", "n_wbv", "asf"),
  balance = TRUE, 
  balance_covariates = c("m_f", "age")
)

na.omit(pairs)
#>    subject_id pair_id type
#> 5   OAS2_0014       1  trt
#> 15  OAS2_0043       2  all
#> 22  OAS2_0079       2  trt
#> 32  OAS2_0112       3  all
#> 36  OAS2_0124       1  all
#> 40  OAS2_0140       5  all
#> 41  OAS2_0150       3  trt
#> 43  OAS2_0160       4  trt
#> 49  OAS2_0182       4  all
#> 50  OAS2_0184       5  trt

The output includes a long-format data frame with subject_id, pair_id, and type to indicate whether the subject was considered in the treatment (trt) group, or the control (all) group for the indicated time point. The type variable is helpful because subjects who got AD could match in either the treatment or control group depending on which visit they were matched in. brsmatch() considers all of the possibilities in finding an optimal match. The output will include all subjects, with an NA value for pair_id and type for unmatched subjects.

Taking, for example, the first pair, OAS2_0014 to OAS2_0124, we know that the matching happened at visit 2, and in that visit, the two subjects have very similar covariates.

first_pair <- pairs[which(pairs$pair_id == 1), "subject_id"]
oasis[which(oasis$subject_id %in% first_pair), ]
#>    subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv   asf
#> 11  OAS2_0014     1          2   M   16   3  76        0  1602 0.697 1.096
#> 12  OAS2_0014     2          2   M   16   3  77      504  1590 0.696 1.104
#> 80  OAS2_0124     1         NA   M   16   3  70        0  1463 0.749 1.200
#> 81  OAS2_0124     2         NA   M   16   3  71      472  1479 0.750 1.187

Full details of the brsmatch() function’s arguments can be found with ?brsmatch. One option not used in this example allows for exact matching constraints. These can be very useful to improve the optimization speed for large data sets.

Propensity Score Matching with Time-Dependent Covariates

Based on the work of Lu (2005) “Propensity Score Matching with Time-Dependent Covariates”, the coxpsmatch() function matches subjects based on time-dependent propensity scores from a Cox proportional hazards model. If balancing is desired, the model will try to minimize the imbalance in terms of specified balancing covariates in the final pair output. Each treated individual is matched to one other individual.

The coxpsmatch() interface is similar to brsmatch() in both inputs and outputs as the following example shows. However, coxpsmatch() can pair all individuals if desired and does not allow (or need) the option of separate balance variables.

pairs <- coxpsmatch(
  n_pairs = 5,
  data = oasis,
  id = "subject_id", 
  time = "visit", 
  trt_time = "time_of_ad",
  covariates = c("m_f", "educ", "ses", "age", 
                 "mr_delay", "e_tiv", "n_wbv", "asf")
)

na.omit(pairs)
#>    subject_id pair_id type
#> 3   OAS2_0009       1  all
#> 5   OAS2_0014       2  trt
#> 16  OAS2_0046       1  trt
#> 29  OAS2_0104       5  trt
#> 32  OAS2_0112       5  all
#> 34  OAS2_0114       4  trt
#> 36  OAS2_0124       2  all
#> 39  OAS2_0139       4  all
#> 41  OAS2_0150       3  trt
#> 49  OAS2_0182       3  all

In this example, OAS2_0014 still matches to OAS2_0124, but other matches differ.

second_pair <- pairs[which(pairs$pair_id == 2), "subject_id"]
oasis[which(oasis$subject_id %in% second_pair), ]
#> # A tibble: 4 × 11
#>   subject_id visit time_of_ad m_f    educ ses     age mr_delay e_tiv n_wbv   asf
#>   <chr>      <int>      <dbl> <chr> <int> <fct> <int>    <int> <int> <dbl> <dbl>
#> 1 OAS2_0014      1          2 M        16 3        76        0  1602 0.697  1.10
#> 2 OAS2_0014      2          2 M        16 3        77      504  1590 0.696  1.10
#> 3 OAS2_0124      1         NA M        16 3        70        0  1463 0.749  1.2 
#> 4 OAS2_0124      2         NA M        16 3        71      472  1479 0.75   1.19

Going further

This quick vignette provides a primer to get you started with time-varying matching methods. Some necessary concepts not covered include:

How to select which variables to include for matching, balancing, and exact matches
How to visualize and quantify whether the matches are good
How to use the matches to compare the outcome (e.g. binary, continuous, or survival endpoints)
Sensitivity analysis to measure the size of an unmeasured confounder that could change your treatment-effect conclusions
Whether treatment occurs at the specified visit, or in between two visits (see the time_lag option in both functions)

References

Li, Yunfei Paul, Kathleen J Propert, and Paul R Rosenbaum. 2001. “Balanced Risk Set Matching.” Journal of the American Statistical Association 96 (455): 870–82.

Lu, Bo. 2005. “Propensity Score Matching with Time-Dependent Covariates.” Biometrics 61 (3): 721–28.