--- title: "Defining an Analysis Pipeline Blueprint" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{create-your-blueprint} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Setup First things first. `multitool` leverages the `tidyverse` package so lets load both: ```{r setup, message=FALSE} library(tidyverse) library(multitool) ``` ## Setting up a data analysis pipeline Image we have some data with several predictor variables, moderators, covariates, and dependent measures. We want to know if our predictors (`ivs`) interact with our moderators (`mods`) to predict the outcome (`dvs`). But we have three versions of our predictor that (supposedly) measure the same thing, albeit in slightly different ways. In addition, because we collected messy data from the real world (not really but let's pretend), we have some idea of which observations to include and which we might exclude (e.g., `include1`, `include2`, `include3`). ```{r data} the_data <- data.frame( id = 1:500, iv1 = rnorm(500), iv2 = rnorm(500), iv3 = rnorm(500), mod = rnorm(500), dv1 = rnorm(500), dv2 = rnorm(500), include1 = rbinom(500, size = 1, prob = .1), include2 = sample(1:3, size = 500, replace = TRUE), include3 = rnorm(500) ) ``` ## Create a blueprint Say we don't know much about this new and exciting area of research. We want to maximize our knowledge but we also want to be systematic. One approach would be to specify a reasonable analysis pipeline. Something that looks like the following: ```{r eval=FALSE} # Filter out exclusions filtered_data <- the_data |> filter( include1 == 0, # -- include2 != 3, # Exclusion criteria include3 > -2.5 # -- ) # Model the data my_model <- lm(dv1 ~ iv1 * mod, data = filtered_data) # Check the results my_results <- parameters::parameters(my_model) ``` But what if there are valid alternative alternatives to this pipeline? For example, using `iv2` instead of `iv1` or only using two exclusion criteria instead of three? A sensible approach would be to copy the code above, paste it, and edit with different decisions. This quickly become tedious. It adds many lines of code, many new objects, and is difficult to keep track of in a systematic way. Enter `multitool`. With `multitool`, the above analysis pipeline can be transformed into **a specification blueprint** for exploring all combinations of sensible data decisions in a pipeline. It was designed to leverage already written code (e.g., the `filter` statement above) to create a all possible combinations of data analysis pipelines. ## Filtering specifications Our example above has three exclusion criteria. If we don't know which are important, for example, because they are based on arbitrary 'rules of thumb' (that may or may not have inherent wisdom) or we don't know if including/excluding these cases is valid, we can generate all combinations: ```{r filters} the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) ``` The output above is a simple `tibble` (i.e., `data.frame`) containing three columns. Each row is a possible filter: the `type` column refers to the type of blueprint specification (see below for types other than filters), the `group` refers to the variable in the base data frame (in our case `the_data`) for which the filter applies, and the `code` column contains the code needed to execute the filter. For filtering decisions (e.g., exclusion criteria), a *'do nothing'* alternative is always generated. For example, perhaps some observations belong to a subgroup, `include1 == 1`. We may or may not have good reason to exclude these cases (this depends on the specific situation). But imagine that we don't know if we should include them or not. When `include1 == 1` is added to `add_filters()`, the *'do nothing'* alternative `include1 %in% unique(include1)` is automatically generated so you can compare including versus excluding cases based on a criterion. ## Adding alternative analysis variables Most multiverse-style analyses explore a range of exclusion criteria and their alternatives. However, sometimes alternative versions of a variable are also included. In the social sciences, it is fairly common to have many measures of roughly the same construct (i.e., measured variable). For example, a happiness researcher might measure positive mood, life satisfaction, and/or a single item measuring happiness (e.g., 'how happy do your feel?'). If you want to explore the output of your pipeline with differing versions of a variable, you can use `add_variables()`. ```{r variables1} the_data |> add_variables(var_group = "ivs", iv1, iv2, iv3) ``` The output above generates the same `tibble` as `add_filters()`. Each row is a particular decision to use a particular variable in your pipeline. In contrast to filter, however, you need to tell `add_variables()` what to call each set of variables with the `var_group` argument. This is how `multitool` knows that each variable name in the `code` column is a different alternative of a larger set. Here, `var_group = "ivs"` indicates that `iv1, iv2, iv3` are all different versions of `ivs`. I used "ivs" as way of indicating to myself that these are alternative versions of my main independent variable. You can add as many variable sets as you want. For example, we might also want to analyze our two versions of the outcome, `dv1` and `dv2`. ```{r variables2} the_data |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) ``` ## Building up the blueprint You can harness the real power of `multitool` by piping specification statements. For example, perhaps we want to explore our exclusion criteria alternatives across different versions of our predictor and outcome variables. We can simply pipe new blueprint specifications into each other like so: ```{r building} the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) ``` Notice that we now have a specification blueprint with both exclusion alternatives and variable alternatives. ## Adding a model The whole point of building a specification blueprint is to eventually feed it to a model and examine the results. You can add a model to your blueprint by using `add_model()`. I designed `add_model()` so the user can simply paste a model function. For example, our call to `lm()` can be simply pasted into `add_model()`. Make sure to give your model a label with the `model_desc` argument. ```{r model} the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm(dv1 ~ iv1 * mod)) ``` Above, the model is completely unquoted. It also has no `data` argument. This is intentional; `multitool` is tracking the base dataset along the way (so you don't have to). Note that you can still quote the model formula, if that is more your style. ```{r} the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", "lm(dv1 ~ iv1 * mod)") ``` To make sure your `add_variables()` works properly, `add_model()` was designed to interpret `glue::glue()` syntax. For example: ```{r} the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm({dvs} ~ {ivs} * mod)) # see the {} here ``` This allows `multitool` to insert the correct version of each variable specified in a `add_variables()` step. Make sure to embrace the variable with the `var_group` argument from `add_variables()`, for example `add_model(lm({dvs} ~ {ivs} * mod))`. Here, `dvs` and `ivs` tells `multitool` to insert the current version of the `ivs` and `dvs` into the model. ## Finalizing the specification blueprint There are two steps in finalizing your blueprint. The first is to visualize your pipeline with a graph. This is optional, but I think it is helpful. You can automate making a chart with `create_blueprint_graph()`. Feed your pipeline to `create_blueprint_graph()` to see a chart of your multiverse pipeline plan: ```{r blueprint-diagram, fig.width=5.75, fig.height=6} full_pipeline <- the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm({dvs} ~ {ivs} * mod)) create_blueprint_graph(full_pipeline) ``` The final step in making your blueprint is expanding all your specifications into all possible combinations. You can do this by calling `expand_decisions()` at the end of your blueprint pipeline: ```{r expand} expanded_pipeline <- expand_decisions(full_pipeline) expanded_pipeline ``` The result is an expanded `tibble` with 1 row per unique decision and columns for each major blueprint category. In our example, we have alternative variables (predictors and outcomes), filters (three exclusion alternatives), and a model to run. Note that we have 3 exclusions (each with two combinations), 3 versions of our predictor, and 2 versions of our outcome. This means our blueprint should have `2*2*2*3*2` or `r 2*2*2*3*2` rows, which corresponds with our expanded pipeline: ```{r} 2*2*2*3*2 == nrow(expanded_pipeline) ``` Our blueprint uses list columns to organize information. You can view each list column by using `tidyr::unnest()`. For example, we can look at the filters: ```{r} expanded_pipeline |> unnest(filters) ``` Or we could look at the models: ```{r} expanded_pipeline |> unnest(models) ``` Notice that, with the `glue::glue()` syntax, different versions of our predictors and outcomes were inserted appropriately. You can check their correspondence by using `unnest()` on both the models and variable list columns: ```{r} expanded_pipeline |> unnest(c(variables, models)) ``` ## Going further This example uses relatively simple pipeline steps. You can also add more sophisticated steps to your pipeline, such as preprocessing data, post-processing model results, or calculating descriptive statistics alongside your model.