tidyhte provides tidy semantics for the estimation of heterogeneous treatment effects (HTE) through the use of Kennedy's (2023) doubly-robust learner.
The package includes comprehensive automated tests with over 80% code coverage and continuous integration across multiple platforms, ensuring reliability for production research use.
While heterogeneous treatment effect estimation has become increasingly important in applied research, existing tools often require substantial statistical expertise to implement correctly. Researchers must navigate complex decisions about cross-validation, model selection, and valid inference. tidyhte addresses these challenges by handling those decisions through a small set of composable functions with sensible defaults. This makes modern HTE methods accessible to applied researchers who need to understand treatment effect variation but may not be causal inference experts.
tidyhte is designed to support research across multiple domains where understanding treatment effect variation is crucial.
The best place to start when learning how to use tidyhte is the set of vignettes, which run through example analyses from start to finish: vignette("experimental_analysis") and vignette("observational_analysis"). There is also a writeup summarizing the method and implementation in vignette("methodological_details"), which includes a quasi-real-world example using the Palmer Penguins dataset.
For a quick start with default settings, simply use basic_config():
library(tidyhte)
# Use all defaults - linear models for nuisance functions
results <- data %>%
attach_config(basic_config()) %>%
make_splits(user_id) %>%
produce_plugin_estimates(outcome, treatment, x1, x2, x3) %>%
construct_pseudo_outcomes(outcome, treatment) %>%
estimate_QoI(x1, x2)
When choosing machine learning algorithms for the ensemble, consider a progression like the following based on your subject-matter expertise:

- "SL.glm" (linear models) for initial exploration
- "SL.glm.interaction" when you expect treatment effects to vary with covariates
- "SL.glmnet" for higher-dimensional data (e.g. many interactions; note also SL.glmnet.interaction) or when overfitting is a concern
- "SL.ranger" (random forests) or "SL.xgboost" (gradient boosting) when relationships are complex

The SuperLearner ensemble will automatically weight these models. Start with 2-3 algorithms and add complexity as needed. See SuperLearner::listWrappers() for all available options.
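For instance, a configuration partway along this progression might look like the following sketch, which uses the Recipe API described under Usage below. It assumes that repeated calls to add_propensity_score_model() and add_outcome_model() each add another candidate learner to the corresponding ensemble, and that basic_config() has already seeded both ensembles with "SL.glm".

library(tidyhte)

# A sketch: keep the default linear models from basic_config() and add
# more flexible candidates; SuperLearner weights the candidates automatically.
basic_config() %>%
    add_outcome_model("SL.glm.interaction") %>%
    add_outcome_model("SL.ranger") %>%
    add_propensity_score_model("SL.ranger") -> hte_cfg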
Install the released version of tidyhte from CRAN:
install.packages("tidyhte")
Or install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("ddimmery/tidyhte")
To set up a simple configuration, it's straightforward to use the Recipe API. For complete examples with data, see vignette("experimental_analysis") and vignette("observational_analysis").
library(tidyhte)
library(dplyr)
basic_config() %>%
add_propensity_score_model("SL.glmnet") %>%
add_outcome_model("SL.glmnet") %>%
add_moderator("Stratified", x1, x2) %>%
add_moderator("KernelSmooth", x3) %>%
add_vimp(sample_splitting = FALSE) -> hte_cfg
The basic_config() includes a number of defaults: it starts off the SuperLearner ensembles for both treatment and outcome with linear models ("SL.glm").
data %>%
attach_config(hte_cfg) %>%
make_splits(userid, .num_splits = 12) %>%
produce_plugin_estimates(
outcome_variable,
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(outcome_variable, treatment_variable) -> data
data %>%
estimate_QoI(covariate1, covariate2) -> results
Getting estimated CATEs for a moderator not included previously just requires rerunning the final line:
data %>%
estimate_QoI(covariate3) -> results
Replicating this on a new outcome would be as simple as running the following, with no reconfiguration necessary.
data %>%
attach_config(hte_cfg) %>%
produce_plugin_estimates(
second_outcome_variable,
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(second_outcome_variable, treatment_variable) %>%
estimate_QoI(covariate1, covariate2) -> results
This makes it easy to chain together analyses across many outcomes:
library("foreach")
data %>%
attach_config(hte_cfg) %>%
make_splits(userid, .num_splits = 12) -> data
foreach(outcome = list_of_outcome_strs, .combine = "bind_rows") %do% {
data %>%
produce_plugin_estimates(
.data[[outcome_str]],
treatment_variable,
covariate1, covariate2, covariate3, covariate4, covariate5, covariate6
) %>%
construct_pseudo_outcomes(outcome, treatment_variable) %>%
estimate_QoI(covariate1, covariate2) %>%
mutate(outcome = outcome_str)
}
The function estimate_QoI() returns results as a tibble, which makes them easy to manipulate or plot.
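For example, marginal CATEs for a single moderator can be plotted along the lines of the sketch below. The column names used here (estimand, term, value, estimate, std_error) are assumptions based on the package vignettes, so check names(results) against your own output first.

library(dplyr)
library(ggplot2)

# A sketch: MCATE estimates for the moderator x1 with normal-approximation
# 95% intervals. The filtering columns are assumed, not guaranteed.
results %>%
    filter(estimand == "MCATE", term == "x1") %>%
    ggplot(aes(x = value, y = estimate)) +
    geom_pointrange(aes(
        ymin = estimate - 1.96 * std_error,
        ymax = estimate + 1.96 * std_error
    )) +
    labs(x = "x1", y = "Estimated CATE")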
HTE estimation with cross-validation and ensemble machine learning can be computationally intensive. Plan accordingly for larger datasets or analyses with many outcomes.
Parallelism is not currently managed through the package directly, but it can easily be supported by registering a parallel backend for foreach:
library(doParallel)
registerDoParallel(cores = 4)

# Workers need the relevant packages loaded; pass them via `.packages`.
foreach(outcome_str = outcome_list, .combine = "bind_rows", .packages = c("tidyhte", "dplyr")) %dopar% {
    data %>%
        produce_plugin_estimates(.data[[outcome_str]], treatment, covariates) %>%
        construct_pseudo_outcomes(.data[[outcome_str]], treatment) %>%
        estimate_QoI(moderators)
}
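When the parallel work is finished, the workers registered above can be released:

# Shut down the implicit cluster created by registerDoParallel().
stopImplicitCluster()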
There are two main ways to get help:
If you have a problem, feel free to open an issue on GitHub. Please try to provide a minimal reproducible example. If that isn't possible, explain as clearly and simply as you can why not, along with all of the relevant debugging steps you've already taken.
Support for the package will also be provided in the Experimentation Community Discord. You are welcome to come in and get support for your usage in the tidyhte channel. Keep in mind that everyone is volunteering their time to help, so try to come prepared with the debugging steps you've already taken.
Please note that the tidyhte project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.