tidyhte provides tidy semantics for estimation of heterogeneous treatment effects through the use of Kennedy’s (2023) doubly-robust learner.

The package includes comprehensive automated tests with over 80% code coverage and continuous integration across multiple platforms, ensuring reliability for production research use.

Why tidyhte?

While heterogeneous treatment effect estimation has become increasingly important in applied research, existing tools often require substantial statistical expertise to implement correctly. Researchers must navigate complex decisions about cross-validation, model selection, and valid inference. tidyhte addresses these challenges by:

  1. Implementing state-of-the-art doubly-robust methods with automatic cross-validation
  2. Using intuitive “recipe” semantics familiar to R users
  3. Scaling easily from single to multiple outcomes and moderators
  4. Providing built-in diagnostics for model quality
  5. Returning results in tidy formats for easy visualization

This makes modern HTE methods accessible to applied researchers who need to understand treatment effect variation but may not be causal inference experts.

Research Applications

tidyhte is designed to support research across multiple domains where understanding treatment effect variation is crucial:

  • Clinical Trials: Identify patient subgroups that benefit most from medical treatments
  • Policy Evaluation: Understand which populations are most affected by policy interventions
  • Technology & A/B Testing: Optimize product features for different user segments
  • Economics: Study heterogeneous effects of economic policies across demographics
  • Education: Evaluate differential impacts of educational interventions

Getting Started

The best place to start learning how to use tidyhte is the vignettes, which run through example analyses from start to finish: vignette("experimental_analysis") and vignette("observational_analysis"). There is also a writeup summarizing the method and implementation in vignette("methodological_details"), which includes a quasi-real-world example using the Palmer Penguins dataset.

Minimal Example

For a quick start with default settings, simply use basic_config():

library(tidyhte)

# Use all defaults - linear models for nuisance functions
results <- data %>%
  attach_config(basic_config()) %>%
  make_splits(user_id) %>%
  produce_plugin_estimates(outcome, treatment, x1, x2, x3) %>%
  construct_pseudo_outcomes(outcome, treatment) %>%
  estimate_QoI(x1, x2)

Model Selection Guide

When choosing machine learning algorithms for the ensemble, consider a progression like the following based on your subject-matter expertise:

  • Start simple: "SL.glm" (linear models) for initial exploration
  • Add interactions: "SL.glm.interaction" when you expect treatment effects to vary with covariates
  • Regularization: "SL.glmnet" for higher-dimensional data (e.g. many interactions, note SL.glmnet.interaction) or when overfitting is a concern
  • Flexible models: "SL.ranger" (random forests) or "SL.xgboost" (gradient boosting) when relationships are complex

The SuperLearner ensemble will automatically weight these models. Start with 2-3 algorithms and add complexity as needed. See SuperLearner::listWrappers() for all available options.
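As a sketch of this progression, a configuration for higher-dimensional data might look like the following (learner names are standard SuperLearner wrappers; note that basic_config() already includes "SL.glm" for both models, so only the additional learners are added here):

```r
library(tidyhte)

# Start from the defaults (linear models) and add regularized and
# flexible learners; SuperLearner weights each candidate according to
# its cross-validated performance.
cfg <- basic_config() %>%
    add_propensity_score_model("SL.glmnet") %>%
    add_outcome_model("SL.glmnet") %>%
    add_outcome_model("SL.ranger")
```

Each add_*_model call appends one candidate to the relevant ensemble, so growing from two to three algorithms is a single extra line.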

Installation

Install the released version of tidyhte from CRAN:

install.packages("tidyhte")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("ddimmery/tidyhte")

Setting up a configuration

To set up a simple configuration, it’s straightforward to use the Recipe API. For complete examples with data, see vignette("experimental_analysis") and vignette("observational_analysis").

library(tidyhte)
library(dplyr)

basic_config() %>%
    add_propensity_score_model("SL.glmnet") %>%
    add_outcome_model("SL.glmnet") %>%
    add_moderator("Stratified", x1, x2) %>%
    add_moderator("KernelSmooth", x3) %>%
    add_vimp(sample_splitting = FALSE) -> hte_cfg

basic_config() includes a number of sensible defaults: it initializes the SuperLearner ensembles for both the treatment and outcome models with linear models ("SL.glm").

Running an Analysis

data %>%
    attach_config(hte_cfg) %>%
    make_splits(userid, .num_splits = 12) %>%
    produce_plugin_estimates(
        outcome_variable,
        treatment_variable,
        covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
    ) %>%
    construct_pseudo_outcomes(outcome_variable, treatment_variable) -> data

data %>%
    estimate_QoI(covariate1, covariate2) -> results

Estimating CATEs for a moderator not included in the previous call just requires rerunning the final line:

data %>%
    estimate_QoI(covariate3) -> results
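Since each call to estimate_QoI returns a tibble, results from separate calls can be stacked directly with dplyr::bind_rows() (a sketch using the placeholder column names from above):

```r
library(dplyr)

# Combine quantities of interest for several moderators into one tibble.
all_results <- bind_rows(
    data %>% estimate_QoI(covariate1, covariate2),
    data %>% estimate_QoI(covariate3)
)
```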

Replicating this on a new outcome would be as simple as running the following, with no reconfiguration necessary.

data %>%
    attach_config(hte_cfg) %>%
    produce_plugin_estimates(
        second_outcome_variable,
        treatment_variable,
        covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
    ) %>%
    construct_pseudo_outcomes(second_outcome_variable, treatment_variable) %>%
    estimate_QoI(covariate1, covariate2) -> results

This makes it easy to chain together analyses across many outcomes:

library("foreach")

data %>%
    attach_config(hte_cfg) %>%
    make_splits(userid, .num_splits = 12) -> data

foreach(outcome_str = list_of_outcome_strs, .combine = "bind_rows") %do% {
    data %>%
        produce_plugin_estimates(
            .data[[outcome_str]],
            treatment_variable,
            covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
        ) %>%
        construct_pseudo_outcomes(.data[[outcome_str]], treatment_variable) %>%
        estimate_QoI(covariate1, covariate2) %>%
        mutate(outcome = outcome_str)
}

The function estimate_QoI returns results in a tibble, which makes them easy to manipulate or plot.
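For example, a plot of CATEs along one moderator might be sketched with ggplot2 as follows (the column names `estimand`, `term`, `value`, `estimate`, and `std_error` are assumptions about the tidy output; check names(results) against your version):

```r
library(dplyr)
library(ggplot2)

results %>%
    # Assumed columns: keep CATE rows for a single moderator.
    filter(estimand == "CATE", term == "covariate1") %>%
    ggplot(aes(x = value, y = estimate)) +
    geom_pointrange(aes(
        ymin = estimate - 1.96 * std_error,
        ymax = estimate + 1.96 * std_error
    )) +
    labs(x = "covariate1", y = "Estimated CATE")
```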

Performance Considerations

HTE estimation with cross-validation and ensemble machine learning can be computationally intensive. Plan accordingly for larger datasets or analyses with many outcomes.

Parallelism is not currently managed through the package directly, but can be easily supported using a parallel backend with foreach:

library(doParallel)
registerDoParallel(cores = 4)

foreach(outcome_str = outcome_list, .combine = "bind_rows") %dopar% {
  data %>%
    produce_plugin_estimates(.data[[outcome_str]], treatment, covariates) %>%
    construct_pseudo_outcomes(.data[[outcome_str]], treatment) %>%
    estimate_QoI(moderators)
}

Getting help

There are two main ways to get help:

GitHub Issues

If you have a problem, feel free to open an issue on GitHub. Please try to provide a minimal reproducible example. If that isn’t possible, explain clearly and simply why it isn’t, along with all of the relevant debugging steps you’ve already taken.

Discord

Support for the package will also be provided in the Experimentation Community Discord.

You are welcome to come in and get support for your usage in the tidyhte channel. Keep in mind that everyone is volunteering their time to help, so try to come prepared with the debugging steps you’ve already taken.

Code of Conduct

Please note that the tidyhte project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.