Define splits for cross-fitting

This function takes a dataset, the column holding a unique identifier, and any number of covariates on which to stratify the splits. It returns the original dataset with an additional column, .split_id, that identifies the split each row belongs to.
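To make the idea of a split identifier concrete, here is a minimal base-R sketch of unstratified split assignment. It is not the package's implementation: the helper assign_splits() and the toy data are hypothetical, and the sketch only assumes that each unit should land in exactly one of the splits, with split sizes as equal as possible.

```r
# Hypothetical helper (not part of the package): assign each unit to one of
# num_splits splits of nearly equal size, uniformly at random.
assign_splits <- function(data, id_col, num_splits) {
  ids <- data[[id_col]]
  # The identifier must be unique, and cross-fitting needs at least 2 splits.
  stopifnot(!anyDuplicated(ids), num_splits >= 2)
  # Recycle the labels 1..num_splits over all rows, then shuffle them.
  labels <- rep_len(seq_len(num_splits), length(ids))
  data$.split_id <- sample(labels)
  data
}

set.seed(1)
toy <- data.frame(unitid = 1:10, x = rnorm(10))
toy_split <- assign_splits(toy, "unitid", num_splits = 4)

# Split sizes differ by at most one unit.
table(toy_split$.split_id)
```

The rows and their order are unchanged; only the .split_id column is added, which mirrors the return value described here.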

make_splits(data, identifier, ..., .num_splits)

Arguments

data: dataframe
identifier: Unquoted name of unique identifier column
...: variables on which to stratify (requires the quickblock package to be installed).
.num_splits: number of splits to create. If VIMP is requested in QoI_cfg, this must be an even number.
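As a rough illustration of what stratifying on covariates buys you, the sketch below balances splits within each combination of the stratifying columns. It deliberately swaps in simple exact stratification in base R rather than the quickblock-based approach the package relies on, so treat it as an intuition aid only. The helper assign_stratified_splits() and the toy data are hypothetical.

```r
# Hypothetical helper (not part of the package): exact stratification on the
# listed columns, using base R instead of quickblock.
assign_stratified_splits <- function(data, strata_cols, num_splits) {
  # This sketch assumes no missing values in the stratifying columns.
  stopifnot(all(stats::complete.cases(data[strata_cols])))
  strata <- interaction(data[strata_cols], drop = TRUE)
  split_id <- integer(nrow(data))
  for (rows in split(seq_len(nrow(data)), strata)) {
    # Start from a random ordering of the split labels so that leftover
    # units do not always fall into the same split.
    labels <- rep_len(sample(num_splits), length(rows))
    # Shuffle the labels among the units of this stratum.
    split_id[rows] <- labels[sample.int(length(rows))]
  }
  data$.split_id <- split_id
  data
}

set.seed(2)
toy <- data.frame(
  unitid = 1:24,
  group = rep(c("a", "b", "c"), each = 8)
)
toy_split <- assign_stratified_splits(toy, "group", num_splits = 4)

# With 8 units per group and 4 splits, every group-by-split cell holds 2 units.
table(toy_split$group, toy_split$.split_id)
```

Balancing within strata keeps each split representative of the covariate mix in the full data, which is the motivation for passing stratification variables through `...`.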

Value

original dataframe with additional .split_id column

Details

To see an example analysis, read vignette("experimental_analysis") for an experiment, vignette("observational_analysis") for an observational study, or vignette("methodological_details") for a deeper dive under the hood.

Examples

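library("dplyr")
if (require("palmerpenguins")) {
  # Load the penguins dataset into the workspace
  data("penguins", package = "palmerpenguins")
  # Add a unique identifier and a known propensity score of 0.5
  penguins$unitid <- seq_len(nrow(penguins))
  penguins$propensity <- rep(0.5, nrow(penguins))
  # Simulate a randomized binary treatment
  penguins$treatment <- rbinom(nrow(penguins), 1, penguins$propensity)
  # Configure the analysis: known propensity score, one outcome learner, no VIMP
  cfg <- basic_config() %>%
    add_known_propensity_score("propensity") %>%
    add_outcome_model("SL.glm.interaction") %>%
    remove_vimp()
  # Attach the configuration, create the splits, then run the estimation pipeline
  attach_config(penguins, cfg) %>%
    make_splits(unitid, .num_splits = 4) %>%
    produce_plugin_estimates(outcome = body_mass_g, treatment = treatment, species, sex) %>%
    construct_pseudo_outcomes(body_mass_g, treatment) %>%
    estimate_QoI(species, sex)
}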
#> Dropped 11 of 344 rows (3.2%) through listwise deletion.
#> estimating nuisance models [===================================] splits: 4 / 4
#> Dropped 11 of 344 rows (3.2%) through listwise deletion.
#> Skipping diagnostic on .pseudo_outcome due to lack of model.
#> # A tibble: 11 × 5
#>    estimand       term                   level                estimate std_error
#>    <chr>          <chr>                  <chr>                   <dbl>     <dbl>
#>  1 MSE            body_mass_g            Control Response    90402.      9.07e+3
#>  2 MSE            body_mass_g            Treatment Response 108035.      1.07e+4
#>  3 SL risk        SL.glm.interaction_All Control Response    91458.      5.38e+3
#>  4 SL risk        SL.glm_All             Control Response    97130.      4.86e+3
#>  5 SL risk        SL.glm.interaction_All Treatment Response 112972.      5.74e+3
#>  6 SL risk        SL.glm_All             Treatment Response 106986.      3.32e+3
#>  7 SL coefficient SL.glm.interaction_All Control Response        0.768   7.84e-2
#>  8 SL coefficient SL.glm_All             Control Response        0.232   7.84e-2
#>  9 SL coefficient SL.glm.interaction_All Treatment Response      0.160   1.60e-1
#> 10 SL coefficient SL.glm_All             Treatment Response      0.840   1.60e-1
#> 11 SATE           NA                     NA                     27.2     3.40e+1
