What is a reason that alternating treatments designs may have good internal validity?

This text introduces readers to the history, epistemology, and strategies of single-case research design. The authors offer concrete information on how to observe, measure, and interpret change in relevant outcome variables and how to design strategies that promote causal inferences.

Key Features

  • Includes case vignettes on specific single-case designs

  • Describes clinical and applied case studies

  • Draws on multiple examples of single-case designs from published journals across a wide range of disciplines

  • Covers recent developments in applied research, including meta-analysis and the distinction between statistical and clinical significance

  • Provides pedagogical tools to help readers master the material, including a glossary, interim summaries, end-of-chapter review questions, and activities that encourage active processing of material

Intended Audience

This text is intended for students and practitioners in a variety of disciplines—including psychology, nursing, physical therapy, and occupational therapy—who are increasingly called upon to document the effectiveness of interventions.

Chapter 8: Comparing Treatments: The Alternating-Treatments Designs

Paul, a 37-year-old white man, first appeared at his doctor's office after reading a magazine article about diabetes. Over the previous summer he had gained 30 pounds, was “thirsty all the time,” and was “running to the bathroom all night long.” Getting Paul to the doctor took his wife almost 6 months. Since Paul was a child, he had experienced an aversion to doctors, blood, and needles. His fear began as a child, after he fainted while getting stitches for a cut on his forehead. Over the next few years, he fainted several times during blood tests, vaccinations, and other situations involving needles. As a teenager, he avoided any situation in which he or his friends ...

Perspect Behav Sci. 2021 Sep; 44(2-3): 389–416.

Abstract

The Repeated Acquisition Design (RAD) is a type of single-case research design (SCRD) that involves repeated and rapid measurement of irreversible discrete skills or behaviors through pre- and postintervention probes across different sets of stimuli. Researchers interested in the study of learning in animals and humans have used the RAD because of its sensitivity to detect immediate changes in rate or accuracy. Despite its strengths, critics of the RAD have cautioned against its use due to reasonable threats to internal validity like pretest effects, history, and maturation. Furthermore, many methodologists and researchers have neglected the RAD in their SCRD standards (e.g., What Works Clearinghouse [WWC], 2020; Horner et al., 2005). Unless given guidance to address threats to internal validity, researchers may avoid the design altogether or continue to use a weak version of the RAD. Therefore, we propose a set of 15 quality RAD indicators, comprising foundational elements that should be present in all RAD studies and additional features that enhance causal inference and external validity. We review contemporary RAD use and describe how the additional features strengthen the rigor of RAD studies. We end the article with suggested guidelines for interpreting effects and the strength of the evidence generated by RAD studies. We invite researchers to use these initial guidelines as a jumping-off point for a more RAD future.

Keywords: single-case research design, repeated acquisition design, education, behavior analysis, measurement

This design is like bigfoot. Everyone swears they've seen one but can never find the evidence.—Micheal Sandbank (2020)

The Repeated Acquisition Design (RAD) is a single-case research design (SCRD) that organizes the rapid and repeated acquisition of nonreversible behaviors in relation to a recursively implemented intervention. The RAD requires three necessary ingredients: (1) multiple sets of equivalent discrete skills or behaviors, (2) targeted for instruction at preplanned and regular intervals, and (3) the repeated measurement of an outcome conducted through comparable pre- and postintervention probes (Kennedy, 2005; Ledford & Gast, 2018). For example, researchers can use a RAD to study the effects of an oral language intervention on vocabulary gains in dual-language learners. Researchers would plan to teach students a different set of unknown words each week and consistently assess students’ vocabulary knowledge before (e.g., Monday morning) and after (e.g., Friday afternoon) delivery of the intervention. They could display these results similar to the hypothetical data presented in Figure 1A. Each set of connected pre- to postintervention probes (labeled A through F) would represent students’ vocabulary acquisition across different sets of words each week.

Figure 1. Basic features of a Repeated Acquisition Design. Note. Panel A shows a standard repeated acquisition design featuring a single intervention, while Panel B shows a comparison of two interventions.

The strengths of a RAD can include its sensitivity to detect immediate changes in rate or accuracy (Greenwood et al., 2016; Porritt et al., 2009) and feasibility for use in applied settings like schools. By reducing confounds related to delayed retrieval, researchers can continuously monitor skill acquisition across new targets. Thus, RAD is a suitable option for studying academic outcomes such as vocabulary, alphabet knowledge, spelling, and number knowledge because mastery of these discrete skills occurs rapidly. The design is also suitable for investigating the effects of brief, low-dose interventions repeated over time with different targets. Unlike multiple probe (MP) and multiple baseline design (MBD) studies that extend study length to provide sufficient data in each condition, researchers can use a RAD to examine the effect of an intervention more efficiently. If researchers seek to compare the effects of two or more interventions, RAD may be more feasible than alternating treatments design (ATD) or adapted alternating treatments design (AATD). This is because unlike ATD, researchers can use RAD to study nonreversible behaviors. Furthermore, unlike AATD, RAD does not require the researcher to identify and assign more than one nonreversible and equivalent behavior to each compared intervention. In many ways, RAD may be the most efficient and practical design for measuring skill acquisition, mastery monitoring of learning targets, and facilitating data-informed instructional decisions (Van den Noortgate & Onghena, 2007).

Like its measurement methods, RAD history is brief. Researchers have used the RAD in basic laboratory studies for decades, albeit sparingly. It is thought to have originated in the 1960s with John Boren’s work on learning in animal populations (Boren, 1963, 1969). Other researchers in the field of experimental behavior analysis have used RADs to study effects of lead exposure on learning (e.g., Cohn et al., 1993) and in behavioral pharmacology research (e.g., Thompson et al., 1986). However, RAD use in applied research has been almost nonexistent.

Although the purpose of this article was not to undertake a systematic review of RAD use in applied research, we performed a brief literature review to understand its rarity. The first two authors conducted searches by hand and in databases (EBSCO, PsychINFO, Google Scholar, and PubMed) for RADs used in applied research (i.e., peer-reviewed literature and dissertations) and published in English within the past 20 years (2000–2020). The authors used the search terms “repeated acquisition design” or “repeated acquisition + intervention” or “repeated acquisition + single case.” The search yielded only 10 applied studies (see Table 1).

Table 1

Quality scores for ten Repeated Acquisition Design studies

Foundational Features
1. Design appropriately matches the behavior of interest and research questions
2. Stimulus sets are independent of each other and equivalent
3. Order of stimulus sets is intentional, random, or counterbalanced
4. Pre- and postintervention measurements of stimulus sets are identical
5. Measurement of at least five stimulus sets per outcome is conducted within the treatment condition
6. Duration of time between pre- and postintervention measurement is consistent across stimulus sets and scheduled a priori
7. Intervention is consistently implemented across stimulus sets, participants, and behaviors
8. Visual analyses emphasize variability and replication
9. Independent and dependent variables are socially valid

Additional Features
1. A baseline condition is included
2. Randomization is used
3. Control stimuli are included
4. Design allows for replication across at least three participants or behaviors
5. Control participants are included
6. A maintenance condition, retention probe(s), and/or generalization probe(s) is included

Total quality score: Bouck et al. (2011), 7; Butler et al. (2014), 11; Dennis & Whalon (2020), 11; Greenwood et al. (2016), 6; Kelley et al. (2015), 12; Lin & Kubina (2015), 7; Peters-Sanders et al. (2020), 9; Spencer et al. (2012), 9; Sullivan et al. (2013), 11; Whalon et al. (2016), 8

As behaviorally oriented researchers engage in educationally relevant research, however, the need for SCRDs that can investigate academic behaviors efficiently and flexibly will likely become more urgent. Even so, until RAD receives mainstream appreciation (like from the Institute of Education Sciences [IES]), it is unlikely that applied researchers will consider RAD studies adequate for contributing evidence in support of effective interventions or include them in meta-analyses. It is unfortunate that the scarcity of studies using a RAD is a significant obstacle to the methodological refinement needed for the RAD to be welcomed by organizations like IES or reviewed by WWC.

There are two possible reasons for the scarcity of the RADs in applied behavioral research. First, it could be that few SCRD researchers study the types of behaviors RAD is most suitable to investigate. At the same time, intervention researchers often rely on group designs, even when such designs are underpowered, or the novel intervention is still in the development phase. The second reason could be that a basic RAD study cannot sufficiently control for threats to internal validity like history, maturation, and testing effects, as many critiques have suggested (Kennedy, 2005; Ledford & Gast, 2018; Shepley et al., 2020). In Figure 1A, the basic RAD lacks a baseline and an ongoing control condition that would strengthen the study’s internal validity. Without knowing how to enhance the basic RAD with additional features, researchers may choose to use other more common designs, even when the choice of a RAD may be the most fitting. It is also possible that researchers choose not to investigate educationally relevant behaviors because the most suitable design (i.e., RAD) is not well-respected in applied sciences. Regardless of the reason, only 10 published studies in 20 years represent minimal use of RAD.

For researchers to value RAD as a viable research design, we must address its three primary shortcomings. First, use of a RAD requires the identification and inclusion of a large pool of substitutable, or equivalent, stimuli. It can be extremely challenging for researchers to establish and document equivalence of stimulus sets. Any asymmetry in the difficulty of stimulus sets introduces a plausible, alternative explanation (Kennedy, 2005). Second, in RAD studies, a single preintervention probe serves as a measure of baseline performance prior to the introduction of an intervention. Without repeated baseline measurement, the researcher cannot mitigate internal validity threats related to testing, history, and maturation. Third, only immediate effects of an intervention on proximal outcomes can be observed using a basic RAD (Ledford & Gast, 2018). The absence of a maintenance condition in the basic RAD can result in conclusions that are incomplete and less socially valid.

We acknowledge the validity of criticisms against the RAD. However, as with any research design, researchers can take many steps to enhance rigor (Kratochwill et al., 2013). To date, nobody has collated those steps. There are no established guidelines for designing or evaluating basic RAD studies. As a result, researchers must generalize broad recommendations from guidelines pertaining to other SCRDs, use what few research design texts recommend (Ledford & Gast, 2018), or extrapolate methods from limited published studies (e.g., Spencer et al., 2012). Guidelines or rules for deciding whether a SCRD meets evidence standards are available from several scientific fields, from education science (i.e., Gersten et al., 2005; Kratochwill et al., 2014; WWC, 2020) to rehabilitation research (Lobo et al., 2017). However, these guidelines or standards neglect the RAD. Whereas some may see these factors as limiting future RAD usage, we view the paucity of studies and lack of quality indicators as opportunities to improve upon the design. The basic features of a RAD suggest a need for innovation. Therefore, the purpose of this article is to introduce a set of methodological indicators of quality for RAD studies and offer a starting place from which scientists can further refine the application of RAD in research and practice.

Quality Indicators of a RAD

In 2005, numerous scholars proposed sets of quality indicators for various designs for conducting research in the field of special education, including experimental and quasi-experimental group designs (Gersten et al., 2005) and single-case research designs (Horner et al., 2005). The authors highlighted the need to increase the quality and rigor of research designs through the provision of quality indicators for the purpose of aiding in the determination of what constitutes evidence in a booming evidence-based practice climate. Likewise, we see an opportunity to provide researchers with a set of RAD standards that address its shortcomings and offer ideas for design variations that reduce associated threats to internal validity. In the following sections, we outline 15 quality indicators of RAD studies, organized into two groups: foundational features and additional design features (see Figure 2), following the framework set forth by Gersten et al. (2005). We derived these indicators through both available evidence (e.g., published RAD studies) and our own consideration of modern SCRD and group design methodological features. Although many are fundamental quality features of a RAD or of SCRDs in general, researchers may opt for inclusion of other features to supplement the basic design for enhanced causal inference and external validity. However, it is important to note that researchers should only include features to the degree they are necessary and sufficient for enhancing the internal and external validity of their RAD.

Figure 2. Quality indicators of a Repeated Acquisition Design

Finally, as we address each quality indicator, we offer recommendations for reporting RAD studies. There are a handful of design and analysis procedures that researchers must explicitly include in RAD manuscripts; otherwise, readers cannot judge the study’s quality. For that reason, we integrate our recommendations for reporting RAD studies throughout the descriptions of quality indicators and offer a summary in Figure 4.

Figure 4. Recommendations for reporting Repeated Acquisition Design studies

Design Appropriately Matches the Behavior of Interest and Research Questions

Research questions, theory of change, and knowledge of participants, context, behaviors, and ultimate outcomes should drive the researchers’ choice of RAD. The RAD is ideal for the study of (1) nonreversible, discrete behaviors or skills that can be (2) divided into several sets of equivalent stimuli, and (3) learned without influencing performance on untreated stimuli. It is appropriate for researchers to consider a RAD when investigating the effects of one or more brief interventions on such behaviors or skills. However, if the behavior(s) of interest does not align with these requirements (e.g., toileting skills), the researcher must select another design.

Many researchers view the RAD as strictly a comparative design, in which the RAD is used to compare the effects of two or more interventions (Kennedy, 2005; Ledford & Gast, 2018; Shepley et al., 2020). However, researchers can use a RAD for either inductive/dynamic comparisons or to address deductive/static research questions (Johnson & Cook, 2019). In a comparative RAD, researchers rapidly alternate different interventions or different doses of an intervention at regular intervals to judge their relative effectiveness (see hypothetical data in Figure 1B). The repeated rapid alternations in a RAD help to control for history, maturation, and other time-related confounds (Van den Noortgate & Onghena, 2007). For example, Dennis and Whalon (2020) repeatedly measured words learned in alternating interventions to examine the extent to which two different methods of intervention delivery (app- or teacher-delivered) resulted in the greatest vocabulary learning among preschoolers. As an alternative, researchers can also employ a RAD in an effectiveness study. As shown in Table 1, we found that most RAD research involved a study of the effects of a single intervention.

The most basic effectiveness RAD does not necessarily control for all threats to internal validity. Therefore, researchers may need to hybridize RADs with other SCRD or group design features to control for plausible threats to internal validity. As one example, Kelley et al. (2015) employed a randomized group design with an embedded RAD. Because there are multiple purposes of RADs and numerous possible hybrid designs, it is incumbent upon the researchers to specify the reason for the design selection in manuscripts and justify selecting RAD over other design options. In addition to communicating specific hypotheses or predictions in relation to a RAD, the explication of the study’s purpose can lead to considerations for additional design features, whether in the planning process or when evaluating the study’s rigor.

Stimulus Sets Are Independent of Each Other and Equivalent

In a RAD, the researcher delivers an intervention and measures the participants' pre- and postintervention performance across a series of stimulus sets. Researchers can compare outcomes if, and only if, each set has the same number of stimuli, and those stimuli are generally equivalent (e.g., similar levels of difficulty). In addition, the independence of stimulus sets is a key characteristic of RAD. In a RAD, the determination of an effect is established through repeated pre-to-post change resulting from the recurrent application of the intervention to specific stimuli (see Figure 1). If intervention on previously targeted stimuli improves performance on an untrained set, we would conclude that the stimulus sets were not independent of one another. When researchers cannot create independent stimulus sets, opportunities for controlled replication are reduced and a different SCRD would be preferable.

Researchers have several obligations regarding the careful selection and programming of stimuli to sets. First, the researcher specifies the number of targets per stimulus set and holds it constant across the study. Next, the researcher supplies a description of how they determined the stimuli to be "generally equivalent." They can do this through an analysis of the relevant literature and expert judgement. Consider the selection of mathematics vocabulary in the study of an intervention designed to improve students' descriptions of spatial relations. For a 6-week intervention, a researcher can randomly select five terms from a pool of 30 grade-level math words each week to create six equivalent stimulus sets (for examples of such lists, see Powell & Nelson, 2017, and Rubenstein & Thompson, 2002). By creating equivalent sets through random selection, the researcher can also reduce threats to internal validity related to systematic differences between within-set and between-set stimulus equivalencies. As an alternative, the researcher may wish to individualize stimulus sets for each participant using only unknown stimuli. In doing so, the specific stimuli taught may vary across participants. For example, Butler et al. (2014) conducted a pretest of 30 words with each participant to include only unknown words in the individual participants' stimulus sets.
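
As a rough illustration of the random-selection approach just described, the following Python sketch partitions a hypothetical pool of 30 grade-level math terms into six equivalent five-word sets, one per week of a 6-week intervention. The pool contents, function name, and seed are placeholders, not materials from any cited study.

```python
import random

def build_stimulus_sets(pool, n_sets=6, set_size=5, seed=42):
    """Randomly partition a pool of candidate stimuli into equivalent,
    non-overlapping sets (one set per week of intervention)."""
    if n_sets * set_size > len(pool):
        raise ValueError("Pool is too small for the requested number of sets.")
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    shuffled = list(pool)       # copy so the original pool is left untouched
    rng.shuffle(shuffled)
    # Slice the shuffled pool into consecutive, equally sized sets.
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]

# Hypothetical pool of 30 grade-level math vocabulary terms (placeholders).
word_pool = [f"term_{i:02d}" for i in range(1, 31)]
for week, stimulus_set in enumerate(build_stimulus_sets(word_pool), start=1):
    print(f"Week {week}: {stimulus_set}")
```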

Researchers will need to be strategic and thoughtful about how to arrange study parameters to maintain the study's internal validity. When planning a RAD study, researchers should carefully consider stimulus set equivalence and, when reporting the study, clearly describe how they created the sets. This increases the transparency of procedures and helps to document stimulus set equivalence. In some cases, generalization across stimulus sets may be plausible, which is counter to the required independence between stimulus sets in RAD studies. When complete independence is not possible or preferred (e.g., Lin & Kubina, 2015), researchers should balance the need to establish experimental control with considerations of practical importance and be certain another research design is not more suitable. Furthermore, when independence is not expected, the researcher should consider additional features that enhance causal inference.

Order of Stimulus Sets Is Intentional, Random, or Counterbalanced

After determining stimulus sets to be generally equivalent, attention must turn to how stimuli are ordered within the RAD. Researchers should plan the instructional sequence strategically using knowledge of content to decide whether order should be intentional, random, or counterbalanced. When the planning is completed, the experimental arrangement of stimulus sets should not have influence over outcomes measured across time.

Intentional ordering of stimulus sets is often the result of the researcher using a specific curriculum or intervention that has a prespecified scope and sequence. In this case, the researcher would use the predefined instructional schedule to guide the intentional ordering of each stimulus set. For example, the Story Friends program (Goldstein & Kelley, 2016) includes nine program-specific storybooks with embedded target words. As an automated intervention, the storybook order in the program drives the intentional ordering of vocabulary stimulus sets in the RAD. In studies investigating the effect of Story Friends (Greenwood et al., 2016; Kelley et al., 2015; Peters-Sanders et al., 2020; Spencer et al., 2012), researchers supplied a rationale for the sequence of stimulus sets as it related to the design of the intervention. In contrast, Whalon et al. (2016) delivered a storybook reading intervention using a variety of children’s books that did not belong to a manualized program with a specified scope and sequence. However, the authors did not provide a rationale for the order of their stimulus sets.

When not constrained by a program’s scope and sequence, researchers should consider randomizing the sequence of stimulus sets to control for time and order effects. Ledford and Gast (2018) provide a step-by-step guide for the use of randomization of targeted stimuli with and without equal difficulty levels. Researchers can randomly order the presentation of equivalent sets (e.g., sets 4, 2, 1, 3, 5; sets 1, 5, 3, 4, 2; sets 2, 4, 3, 5, 1) and then randomly assign these sequences to different participants. Sullivan et al. (2013) randomly assigned different sight words to stimulus sets and created a random sequence of stimulus sets that all participants received. Although each participant did not experience a unique sequence of stimulus sets, they did experience a unique sequence of alternating formats of the intervention (i.e., racetrack vs. list). This adequately controlled for sequence effects.
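
A minimal sketch of this randomization step, assuming generic labels for five equivalent sets and three participants: it generates a random ordering of the sets for each sequence and then randomly assigns the sequences to participants, in the spirit of the Ledford and Gast (2018) guidance rather than as a reproduction of it.

```python
import random

rng = random.Random(7)                 # fixed seed for a reproducible plan
set_labels = [1, 2, 3, 4, 5]           # five equivalent stimulus sets
participants = ["P1", "P2", "P3"]      # hypothetical participant labels

# Generate one random sequence of sets per participant ...
sequences = []
for _ in participants:
    order = list(set_labels)
    rng.shuffle(order)
    sequences.append(order)

# ... then randomly assign the sequences to participants.
rng.shuffle(sequences)
for participant, order in zip(participants, sequences):
    print(participant, "receives sets in order:", order)
```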

Last, counterbalancing the sequence of stimulus sets can help control the unintentional influence of order effects in SCRDs. Especially when using comparative RADs, researchers should pay close attention to the scheduling of stimuli when setting up near-identical conditions that allow for comparison of outcomes across interventions. For example, the researcher can expose stimulus sets to different interventions equally, program stimulus sets to occur in each position in each condition (e.g., first, second, third), or arrange for different participants to experience the stimulus sets in reciprocal orders (5, 2, 3, 1, 4 vs. 4, 1, 3, 2, 5). When there are few participants, the need to counterbalance the order of stimulus set presentation increases so that it is possible to disentangle the effect of intervention from order and stimuli. Dennis and Whalon (2020) first created a random sequence of stimulus sets for each of their interventions, app- and teacher-delivered vocabulary instruction. Because targeted stimuli were specific to conditions, participants received stimulus sets in the order that corresponded to the sequential random assignment of delivery format. However, the researchers planned for sessions to alternate between the different intervention delivery formats thus creating a type of counterbalanced order of stimulus sets.
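
One simple way to produce the reciprocal orders mentioned above is to pair a randomly generated sequence with its reverse and assign the two orders to different participants; the sketch below is illustrative only and assumes two participants per sequence pair.

```python
import random

rng = random.Random(11)                 # reproducible counterbalancing plan
set_labels = [1, 2, 3, 4, 5]

forward = list(set_labels)
rng.shuffle(forward)                    # e.g., [5, 2, 3, 1, 4]
reciprocal = list(reversed(forward))    # e.g., [4, 1, 3, 2, 5]

# Each stimulus set now occupies complementary positions across the pair,
# helping to disentangle intervention effects from order and stimuli.
print("Participant A order:", forward)
print("Participant B order:", reciprocal)
```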

When using a RAD, it is necessary for researchers to consider stimulus set order as a potential threat to internal validity. To increase confidence in the RAD and study outcomes, manuscripts should include adequate information about researchers’ decision-making processes for stimulus set ordering. The researcher’s procedural description and order justification can increase confidence in the RAD and is necessary for replication.

Pre- and Postintervention Measurements of Stimulus Sets Are Identical

Identical pre- and postintervention measurement is an essential feature of RAD. Differences in measurement of the same stimuli introduce an extraneous variable and disrupt the causal link between the independent and dependent variables. For example, researchers used a RAD in two studies to examine the effects of the Story Friends intervention on preschoolers' vocabulary and comprehension skills (Greenwood et al., 2016; Spencer et al., 2012). Measures for vocabulary were identical for pre- and postintervention assessment, but measures of participants' comprehension pre- and postintervention were different. Researchers asked inferential questions that required students to predict story events prior to receipt of intervention. During intervention, the researchers read the same story three times. Prediction questions were not appropriate after exposure to the story, and as a result they replaced prediction items with story recall questions. Graphed results revealed little to no gains on comprehension outcomes. The lack of growth could be related to the intervention, or due to differences in pre- and postintervention measures. In the end, the differences in their pre- and postintervention measures confounded the results and few conclusions were possible regarding the effect of the intervention on comprehension.

During manuscript preparation, RAD researchers should carefully describe their measures, including the variable type (e.g., continuous scale, binary) and scoring reliability. All studies in Table 1 provided detailed descriptions of the measures and scoring. Kelley et al. (2015) and Peters-Sanders et al. (2020) went beyond this standard and reported the internal consistency of their vocabulary mastery monitoring probes using Cronbach's alpha (.95). When measurement variations are reduced or removed, readers' confidence in the results is improved.

Measurement of at Least Five Stimulus Sets per Intervention and Dependent Variable Is Conducted in the Treatment Condition

Within-case replication helps to reduce threats to internal validity related to maturation and history, and the more the better. Consistency in gains over time is more evident in RADs having five or more opportunities to demonstrate such change. Thus, for basic intervention effectiveness studies using RAD, at least five sets of pre- to postintervention measures provides sufficient data for visual analysis (Zimmerman et al., 2018). Of the 10 RAD studies shown in Table 1, 8 included at least five sets of pre-to-post probes per participant in the treatment condition. This indicator mimics the need for five data points per condition in other SCRD studies and serves as a minimum standard (Kratochwill et al., 2013).

If researchers intend to use the RAD as a comparison design, five sets of pre- to postmeasurement per intervention are desirable. In other words, RADs comparing two different interventions should have at least 10 pre-post probe pairs in the treatment condition—five sets per intervention (see Figure 1B). Kelley et al. (2015) examined the effects of an intervention on preschoolers’ comprehension and vocabulary skills. The authors included nine sets of probe pairs across each dependent variable, allowing for nine possible replications of intervention effect for comprehension and nine possible replications of intervention effect for vocabulary. When a study includes fewer than five stimulus sets, there are fewer opportunities for replication. For example, Butler et al. (2014) compared students’ learning in relation to e-books on receptive and expressive vocabulary outcomes, but with only three e-books (and three stimulus sets), there were only three possible replications of effect for each dependent variable. More replications over time assist the researcher in making judgements about the presence of history and maturation-related threats to internal validity.

Duration of Time Between Pre- and Postintervention Measurement Is Consistent Across Stimulus Sets and Scheduled a Priori

One of the strengths of the RAD is its sensitivity to immediate changes in accuracy and rate, the latter being a time-related variable. Thus, a consistent time interval between pre- and postmeasurement of stimulus sets is a necessary component of RAD studies. It is impossible to separate the confounding influence of time from the effect of the intervention if the time between assessments is longer for some stimulus sets than for others. In Lin and Kubina (2015), the participant continued in training until they met a mastery criterion for each stimulus set; the researchers did not standardize the time between pre- and postintervention measurement for stimulus sets. Variability in measurement timing weakens the RAD and introduces potential extraneous variables that hinder the ability to make causal assumptions. At a minimum, researchers and readers should be able to interpret the stability of pre- to postintervention outcomes based on reports of consistency in delivery schedule and the magnitude of intervention received by participant(s).

Of course, not all sources of variation are within researchers’ control; applied research with human participants is often subject to schedule disruptions beyond the control of the experimenter (e.g., participant illness). However, to the extent possible, researchers should plan for a consistent schedule of pre-to-post measurement. Although only Butler et al. (2014) included a flow chart to depict the sequence of study procedures, most of the studies we reviewed reported sufficient information to allow for replication. Greenwood et al. (2016) described the schedule of measurement in detail, with postintervention probes (of the previous week’s stimuli) and the preintervention probes (of the coming week’s stimuli) occurring every Friday. They also reported specific cases where they adjusted the schedule to allow for makeup sessions (intervention or data collection) related to student absences or school holidays.

In addition, researchers should be specific in reporting how they arrange pre- to postintervention probes across time and in relation to stimulus sets. To promote reliable and valid comparisons of pre-to-post probe gains across the study, the time unit for measurement needs to be more specific and transparent than "session." In two RAD studies (Greenwood et al., 2016; Spencer et al., 2012), the authors measured the pre- to postintervention gains across nine different books (stimulus sets) introduced sequentially over a 9-week period (i.e., x-axis labeled "Books"). However, in the study by Bouck et al. (2011), the schedule of pre- and postintervention measures was unclear; data collection occurred "after school hours during . . . mandatory study period" (p. 4). The authors omitted necessary information about how and when (e.g., on what days) they conducted probes, and whether they held the duration between pre- and postmeasures constant across participants or stimulus sets. What if the mandatory study period occurred within a block schedule, meaning that it can occur on different days each week? What if probe measures occurred closer to or further from mealtimes? When researchers operationalize their measurement procedures to include the duration of time between pre- and postintervention measurement and report such information, they greatly improve the interpretation of their results and facilitate further exploration of the severity of the threats any inconsistencies pose.

Intervention Is Consistently Implemented Across Stimulus Sets, Participants, and Behaviors

When using a RAD, intervention dose must be brief enough to allow for regular and repeated delivery and measurement of its effect through pre- and postintervention probes. An intervention that has the potential to result in behavior change within a few sessions is suitable for a RAD, but interventions that take several weeks to yield behavior change are not. It is important to hold all aspects of the intervention constant across participants, behaviors, and stimulus sets. Likewise, researchers should deliver a consistent dose of the intervention between the pre- and postprobes for each participant and/or behavior. However, knowledge of the study population, intervention, and context may result in the need to vary some aspects of the intervention delivery method while holding other variables constant. For example, if a study occurs in a special education classroom, students may need different accommodations or adaptations to materials based on their individualized education program (IEP). Using a RAD to compare the effects of and student preferences for different calculation methods (e.g., traditional calculator, dictating calculations to others to input into a calculator, or voice-output calculator), Bouck et al. (2011) noted variance in how students engaged in pre- and postintervention probes (either read aloud or using enlarged print). Although the students’ use of calculation methods varied based on their IEP accommodations, the researchers consistently implemented the intervention across participants while also holding within-participant measurement procedures constant (i.e., researchers reported 100% fidelity). In addition to describing intervention procedures with precision, researchers should also delineate interventions by the number of stimuli targeted per intervention session, number of sessions, and duration of sessions. For example, Dennis and Whalon (2020) reported 30-min. intervention sessions, once a day for 4 days, to learn a set of eight words. Spencer et al. (2012) included even more explicit information with a schedule of activities and the average duration between the first and third intervention session (mean = 3 days, range = 2–7 days). For easy reporting, researchers can monitor and document these aspects alongside procedural fidelity. In the interpretation of results, both researcher and reader can weigh limitations reported, such as disruptions to planned intervention dosage (e.g., absences, interfering behavior), in relation to demonstration of experimental control.

Visual Analyses Emphasize Variability and Replications

Visual inspection of graphs is the standard method of analysis in all SCRD studies. It is usual for researchers to examine data for changes in level, trend, variability, and immediacy of effects to determine the presence (or absence) of a functional relation (Wolfe et al., 2019). For a RAD, however, the way researchers consider these analytical dimensions differs somewhat. First, immediacy of effect cannot be comparably determined because the ongoing measurement of the same stimuli does not occur in a RAD. The only type of change possible in a basic RAD study is an immediate one. Second, researchers usually identify a trend in the data through inspection of several related data points. However, a standard RAD has only two data points (pre- and post-) that are truly related to one another. For each stimulus set in a RAD, the researcher can examine the magnitude of change from pre-to-post measurement (or slope), but it is more closely related to level change than trend. There are also multiple opportunities to judge changes in level within the same condition, increasing with the number of probe pairs in the condition. Third, variability in RAD refers to the consistency of pre-to-post change between stimulus sets within a condition for a participant or behavior, not the range of data points distributed across a condition. When other conditions or multiple interventions are added to the basic RAD, additional options for analyzing level and variability exist. For example, in Figure 3A (upper-right panel), researchers can compare average change in levels and the degree of such change (i.e., variability) from baseline to treatment conditions. Likewise, researchers can reference average level changes and variability to compare effects in the case of alternating interventions within a RAD (e.g., Figure 1B).

Figure 3. Supplemental features that strengthen Repeated Acquisition Designs. Note. Panel A shows the addition of a baseline condition, with (right) and without (left) pre- to postintervention measurement; Panel B shows the addition of control stimulus sets; Panel C shows the addition of a control participant; Panel D shows a repeated acquisition within a multiple baseline design; Panel E shows the addition of retention probes.

Variability is an important analytical marker in RAD. Researchers can analyze variability of the data in a RAD to describe the consistency of results within and across participants and/or behaviors, and across conditions (e.g., baseline, treatment, maintenance). A researcher's interpretation of variability differs according to the specific intervention examined (e.g., an intervention that typically yields gains of 7–9). There may be an educationally or clinically meaningful difference between gains of 1–4 versus gains of 10 or more. Moreover, if two out of 20 sets have near-zero changes in level (gains of 1–2) whereas the majority have gains of 10–12, such variability may be of interest to the researcher or practitioner, resulting in further examination of the stimulus sets associated with the lower scores. As a result of the variability potential within a single condition for a single participant or behavior, researchers should define a priori what magnitude of change from pre-to-post measurement constitutes "meaningful." We suggest that researchers derive their definition of meaningful gain from a baseline condition (see "Additional Features" section, below). For instance, if pre-to-post gains never exceeded two in baseline, the researcher might consider a gain of three or more in the treatment condition as "meaningful." Then, during visual analysis, the researcher could count the number of probe pairs that do and do not meet that standard (i.e., replications vs. nonreplications). That is, the researcher could count or classify every pre-to-post measurement opportunity within the treatment condition that met or exceeded a gain of three as a replication; anything less would count as a nonreplication based on the researchers' definition. However, without a baseline condition, researchers will have to extrapolate reasonable definitions of meaningful gain from other sources such as the research literature, measurement scale, setting, and the nature of target behaviors.
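
To make the replication-counting step concrete, here is a hedged Python sketch using invented probe scores: it derives a "meaningful gain" threshold from hypothetical baseline probe pairs (one point above the largest baseline gain, mirroring the example above) and then classifies each treatment-condition probe pair as a replication or nonreplication.

```python
def gains(probe_pairs):
    """Pre-to-post gain for each (preintervention, postintervention) pair."""
    return [post - pre for pre, post in probe_pairs]

def meaningful_gain_threshold(baseline_pairs):
    """Smallest treatment gain that exceeds every gain observed at baseline."""
    return max(gains(baseline_pairs)) + 1

# Hypothetical probe pairs: (preintervention score, postintervention score).
baseline_pairs = [(1, 2), (0, 2), (1, 1)]
treatment_pairs = [(1, 8), (0, 7), (2, 3), (1, 9), (0, 6)]

threshold = meaningful_gain_threshold(baseline_pairs)        # 3 for these data
treatment_gains = gains(treatment_pairs)
replications = sum(1 for g in treatment_gains if g >= threshold)
nonreplications = len(treatment_gains) - replications

print(f"Treatment gains: {treatment_gains}")
print(f"Meaningful gain >= {threshold}: "
      f"{replications} replications, {nonreplications} nonreplications")
```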

Regardless of how researchers define meaningful intervention gains, replications within and across participants and behaviors are the key to interpreting the overall effect of the intervention and the strength of that effect—processes described below. Therefore, at a minimum, graphs should use raw data in the reporting of results. Bouck et al. (2011) and Greenwood et al. (2016) presented graphs of group averages and reported the mean gains per stimulus set. Although this might be useful in an overview of the intervention or in an analysis of the stimuli most participants learned, it prevents a thorough examination of variability. Without also reporting individual raw scores, the researcher can only make a weak argument for their use of a RAD. When confronted with page or content limits, researchers can include these data in supplemental materials rather than omitting the information altogether. The strength of the RAD is in its ability to show immediate, proximal gains for individuals, at a level that is not available in pre-to-posttest group or repeated measures designs and in the stability of those gains.

Independent and Dependent Variables Are Socially Valid

Limitations of SCRD studies can include a lack of control over contingencies governing participant attendance/attrition and access to the natural environment to implement an intervention and measure its effects. The brevity and straightforwardness of RAD make it useful for research in applied settings (Ledford & Gast, 2018), especially when researchers and/or practitioners are attempting to teach nonreversible discrete skills. In addition, researchers can use RADs to examine intervention dose and the efficiency of procedures on rates of acquisition at the participant level. For example, Kelley et al. (2015) used a RAD embedded in their group design to provide more information about for whom and to what degree their intervention worked within the treatment group. Some researchers have used RAD to compare the likability of interventions (e.g., Bouck et al., 2011). Although a researcher may select RAD to reduce the number of measurements, in some circumstances, repeated measurement can become excessive or interfere with ongoing routines (e.g., two probes per day). For that reason, RAD researchers should attend to the social validity (e.g., acceptability) of their measurement procedures as well as their intervention.

Researchers should also give attention to the feasibility of and costs (e.g., time, materials) associated with repeated pre- and postintervention measurement. Although Sullivan et al. (2013) mention the potential cost effectiveness and versatility of the racetrack intervention in a discussion of implications for practice, none of the 10 studies reviewed included cost analyses, nor did researchers assign monetary values to their study ingredients. RAD studies should supply details about interventionists, duration and magnitude of intervention, and similar specifics about the measurement procedures. RAD measurement procedures can be efficient and reliable, making the design well-suited for applied settings in which pre/posttest practices are commonplace (e.g., classrooms). Although most of the studies in Table 1 referenced a classroom or school setting (e.g., Story Friends studies), none involved the teachers or related service professionals (e.g., speech and language pathologist) in the measurement of pre- and postintervention outcomes. To contribute to the research on both feasible and practical measures in applied settings, researchers should consider programming for the inclusion of the intervention’s intended end users—parents, teachers, and other school professionals.

When the duration and number of pre- to postintervention probes in a RAD do not significantly interrupt ongoing activities and routines, researchers can reduce their disruption to the natural environment. RAD probe measures are often brief; Sullivan et al. (2013) reported the average duration of each pre- and postintervention probe as 6.4 to 6.6 min. With respect to frequency, Whalon et al. (2016) examined the effects of an in-home parent-implemented reading comprehension intervention with five pre- to postintervention probe pairs. The nature of data collection (e.g., quick, easy, discrete) in many RADs may make a large set of observations more acceptable to end users. Across all the sampled RAD studies in Table 1, there was an average of eight pre- to postintervention pairs (16 observations).

Additional Features that Strengthen Internal and/or External Validity

Intervention researchers often rely on MBD and ATD to make claims about functional relationships (Ledford et al., 2019; Pustejovsky et al., 2019; Shadish & Sullivan, 2011; Smith, 2012; Tanious & Onghena, 2020) even when RAD might be suitable. To increase the confidence with which inferences about the effects of an intervention can be made in a RAD study, it is essential that researchers add features such as the inclusion of control stimuli or control participants, and a baseline condition. For RAD studies to produce robust and interpretable findings, one or more of these strengthening features will be necessary. It is incumbent upon researchers to select additional features that adequately control for the plausible threats relevant to a specific study.

A Baseline Condition Is Included

In RAD, only a treatment condition is obligatory. Within the treatment condition, measurement of the dependent variable occurs recursively across different stimulus sets (e.g., vocabulary words) before and after receipt of an intervention. The RAD preintervention probes serve as the counterfactual for the postintervention probes. Many applied researchers use a RAD as an alternative to a MBD when an extended baseline condition is not appropriate for the study population or not practical in an applied setting (e.g., preschool). For example, Dennis and Whalon (2020) cited concern for preschoolers’ fatigue as a possible artifact of adding a baseline condition. Thus, they relied on consistent replications of pre- to postintervention gains within and across participants to make causal inferences about the independent and dependent variables. However, a major critique of the basic RAD is that gains at posttest could be due to incidental learning that occurs through exposure to the stimuli at pretest (i.e., pretest effects). When researchers compare pre- to postintervention probes using a basic RAD, they cannot confidently ascertain the size of the gain attributed to the intervention alone. Therefore, they should consider other design features to reduce the threat of testing effects.

A baseline condition is a strengthening addition to RAD, in particular when the study focuses on the effect of an intervention rather than a comparison between interventions. Where appropriate and feasible, a RAD baseline condition provides an additional counterfactual for investigations of intervention effect and allows for between-condition (but within-case) comparisons. The well-established standard of at least three stable baseline data points in a baseline condition applies to RAD, too. For example, Whalon et al. (2016) included a baseline condition to measure a preschooler’s ability to answer story questions prior to the introduction of a parent-implemented dialogic reading intervention. They collected three baseline data points that resembled the preintervention probes in the treatment condition. After the baseline condition, they introduced their intervention and conducted pre- to postintervention probes using different stimulus sets (like the hypothetical baseline in the graph on the left in Figure 3A). Although this allowed for a comparison of probes prior to intervention across baseline and treatment conditions, another arrangement for baseline data collection is also conceivable, and likely preferred. Collecting baseline data using similar timing and procedures as later pre- and postintervention probes would allow for a comparison of gains between conditions and thus better control for potential testing effects (see hypothetical data in the graph on the right in Figure 3A).

Although none of the RAD studies in Table 1 included a baseline condition with pre-to-post probe pairs, the comparison between conditions can be extremely convincing as it certainly strengthens the study’s internal validity. Nonetheless, researchers may often select the RAD over MBD because the former does not require a baseline condition. Therefore, researchers must balance the need for feasibility of design in applied settings with the need to establish strong internal validity. We recommend that researchers provide a rationale for the inclusion or exclusion of a baseline condition in the design section of the study manuscript.

Control Stimuli Are Included

Another RAD variation involves a researcher’s selection of untrained, or control, stimuli to probe alongside targeted stimuli for a comparison of pre- to postintervention change. By planning for the inclusion of control stimuli, researchers can reduce testing, history, and maturation threats to internal validity. The primary requirement for use of control stimuli is that the researchers must ensure equivalency (e.g., levels of difficulty) with trained stimuli. The researcher can randomly select stimuli from a larger pool to create equivalent sets and randomly select which targets to teach. Thus, each stimulus set would include both trained and untrained stimuli. Another option would be for the researcher to create entire sets of control stimuli from the larger pool of equivalent items. For example, imagine that a researcher is interested in studying the effect of explicit teaching of irregular words by sight (e.g., said, the, come). Before the study, the researcher would create nine sets of four words and randomly select three sets to serve as control sets. The resulting RAD graph would look like the hypothetical data shown in Figure 3B, with targeted sets represented by circle markers and the control sets represented by triangles.
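
Continuing the hypothetical sight-word example, the short sketch below (with placeholder words) shows one way to form nine four-word sets from a larger pool and randomly designate three of them as untrained control sets, as in Figure 3B.

```python
import random

rng = random.Random(5)                         # reproducible selection

# Hypothetical pool of 36 irregular sight words (placeholders).
word_pool = [f"word_{i:02d}" for i in range(1, 37)]
rng.shuffle(word_pool)

# Partition the pool into nine sets of four words each.
sets = [word_pool[i * 4:(i + 1) * 4] for i in range(9)]

# Randomly choose three of the nine sets to remain untrained (control sets).
control_indices = set(rng.sample(range(9), k=3))
for idx, stimulus_set in enumerate(sets):
    role = "control" if idx in control_indices else "trained"
    print(f"Set {idx + 1} ({role}): {stimulus_set}")
```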

As an alternative, several of the studies we reviewed included simultaneous measurement of taught and untaught skills, such as novel vocabulary words to serve as control stimuli. For example, Butler et al. (2014) included eight vocabulary words explicitly targeted during the intervention and two untrained vocabulary words to serve as control stimuli. Spencer et al. (2012) and Kelley et al. (2015) also included control stimuli to examine the effects of a preschool vocabulary and comprehension intervention, Story Friends, on taught and untaught words. When interventions are unlikely to have generalized impacts, untrained stimuli can function as a type of counterfactual condition.

In our review, we found one study that used untrained stimuli to test generalization. Lin and Kubina (2015) targeted a set of four motor skills per session and reported pre- to postintervention gains for those taught skills. Every third session, the authors probed pre- to postintervention change for a set of untrained or unreinforced imitative behaviors as a test of generalization. Fluency improvements in untrained imitative behaviors corresponded with improvements in trained behaviors, despite the lack of reinforcement history. The authors concluded generalization occurred, but the study lacked rigor sufficient to establish causal inference (Horner et al., 2005). However, with the application of additional design features (e.g., staggered baselines, multiple participants), the authors may have been able to improve confidence in their conclusion.

Control Participants Are Included

Inclusion of control stimuli and a baseline condition are features that reduce threats to internal validity. If a variation of RAD includes such features, it may not be necessary to also include control participants. However, there may be situations in which it is neither feasible nor practical for a researcher to include a baseline condition or use control stimuli (e.g., school setting and delivery schedule put limits on researchers’ time and access to additional stimuli). Thus, when available and appropriate, use of control participants as a counterfactual is preferable to the basic RAD.

To address threats to internal validity and allow for between-case analyses, researchers who have access to six or more participants should consider random assignment of an equal number of participants to either a treatment or control group. As in a control group design, researchers can assess pre-to-post change on skill sets for cases assigned to receive intervention and to one or more business-as-usual (BAU) conditions. Kelley et al. (2015) used a RAD embedded in a group design, allowing for comparison of the treatment group’s outcomes to those of the BAU group. Researchers can also include control participants in a RAD without requiring the SCRD to be a part of a larger group design. In this scenario, both treatment and control participants would receive the same schedule of pre-to-post measures, but only participants assigned to the treatment group would receive the intervention. The design would allow the researcher to investigate the extent to which a majority of the control group participants had smaller pre-/postprobe gains than those in the treatment group. Researchers can efficiently compare patterns in pre-/postprobe gains across groups through visual inspection of participant graphs. As shown in Figure 3C, graphs can display the pre-to-post changes for participants who do and do not receive the intervention. When using control participants, researchers should plan for at least three participants per group (only one from each group is shown in Figure 3C).
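
A brief sketch of the between-group comparison described above, using invented gain scores: it simply summarizes pre-to-post gains per participant so that patterns across treatment and business-as-usual control participants can be inspected alongside the individual graphs. Participant labels and values are hypothetical.

```python
from statistics import mean

# Hypothetical pre-to-post gains per stimulus set for each participant.
gains = {
    "treatment": {"T1": [6, 7, 5, 8, 6], "T2": [5, 6, 7, 6, 5], "T3": [7, 8, 6, 7, 9]},
    "control":   {"C1": [1, 0, 2, 1, 1], "C2": [0, 1, 1, 2, 0], "C3": [1, 2, 0, 1, 1]},
}

for group, participants in gains.items():
    for pid, participant_gains in participants.items():
        print(f"{group:9s} {pid}: mean gain = {mean(participant_gains):.1f}, "
              f"range = {min(participant_gains)}-{max(participant_gains)}")
```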

In considering the use of control participants in RAD, researchers must guard against the unethical withholding of an effective intervention from participants. Researchers might consider a waitlist control design (Brown et al., 2006) in which control participants receive the intervention but later than the participants in the treatment condition. Even when it is not practical to have as many control cases as in the treatment group or when control participants cannot remain in their condition for an extensive period, the researcher can make a convincing causal argument if the inclusion of control participants substantially improves the ability to determine that change in the dependent variable(s) corresponds with the intervention.

Randomization Is Used

In applied research, it is challenging to have truly comparable conditions, whether related to participants or stimuli. Randomization is another mechanism to further minimize threats to the internal validity when using a RAD. Many of the additional features already presented to improve experimental control are best done through randomization. When it is possible and appropriate, researchers can randomly assign participants to treatments; randomly assign participants to receive instruction on different stimulus sets; and randomly assign stimulus sets to intervention or to remain untrained. Although there are many randomization techniques available to researchers (e.g., Kratochwill & Levin, 2010; Weaver & Lloyd, 2019), these three randomization procedures are most relevant to RAD.

In two of the sampled studies in Table 1, researchers used random assignment. Allowing for the examination of differential effectiveness between two different instructional delivery methods on vocabulary outcomes, Dennis and Whalon (2020) randomly assigned preschoolers to start each week in app- or teacher-delivered instruction. Likewise, Sullivan et al. (2013) had students select their initial fluency drill condition assignment from a deck of cards; students drew one card that had either racetrack or list printed on it. When a participant completed postintervention assessment for one stimulus set, they were immediately randomized to one of two conditions for the next stimulus set. Sequential random assignment continued until the participant was assigned to the same condition three consecutive times, at which time the researchers used purposeful assignment of the remaining condition for the rest of the stimulus sets.
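
The sequential assignment rule attributed to Sullivan et al. (2013) can be sketched roughly as follows; this is an assumed reading of the procedure, with the condition names taken from the study but all other details (function name, seed, number of sets) invented. Conditions are randomized set by set until one condition occurs three times in a row, after which the remaining sets are purposefully assigned to the other condition.

```python
import random

def sequential_assignment(n_sets, conditions=("racetrack", "list"), max_run=3, seed=3):
    """Randomize the condition for each stimulus set until one condition repeats
    max_run times in a row; then assign the other condition to all remaining sets."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_sets):
        assignments.append(rng.choice(conditions))
        # Once the last max_run assignments are identical, switch to purposeful
        # assignment of the remaining condition for the rest of the sets.
        if len(assignments) >= max_run and len(set(assignments[-max_run:])) == 1:
            other = conditions[1] if assignments[-1] == conditions[0] else conditions[0]
            assignments.extend([other] * (n_sets - len(assignments)))
            break
    return assignments

print(sequential_assignment(n_sets=10))
```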

A primary reason to use random assignment in RAD is to add confidence in the equivalence (and comparability) of participants, conditions, and stimuli. At the very least, the use of randomization helps to ensure that any differences between participants, conditions, and stimuli are random and not influenced by the researcher or extraneous variables. Researchers should arrange for the inclusion of randomization technique(s) in a RAD a priori, selecting only those that adequately address threats to internal validity given the behaviors of interest, nature of intervention, and context. Within the study manuscript, researchers should specify randomization strategies used and link them directly to both the study’s purpose and plausible threats to internal validity.

Design Allows for Replication Across at Least Three Cases

The purposeful inclusion of multiple participants in a RAD can help to establish internal validity and improve external validity. In the sampled RAD literature, most studies planned for the inclusion of multiple participants. Only two RAD studies (Lin & Kubina, 2015; Whalon et al., 2016) examined intervention effects for a single participant—in both cases, a 4-year-old child diagnosed with autism. The remaining studies included three or more participants, ranging from 3 (Bouck et al., 2011) to 17 (Peters-Sanders et al., 2020).

In studies with fewer than three participants, other forms of replication (e.g., behaviors and settings) are critical. Depending on the causal mechanisms of an intervention, it may be appropriate for the researcher to investigate replications of effect across behaviors. Several studies in Table 1 measured the effects of a storybook reading intervention on two different language or literacy outcomes (Butler et al., 2014; Greenwood et al., 2016; Kelley et al., 2015; Spencer et al., 2012). However, no study included more than two behaviors to allow for a complete hybrid RAD-MBD study. Figure 3D shows one possibility for how this might work. Causal inference benefits from additional replications; the more demonstrations of effect the better.

A Maintenance Condition, Retention Probe(s), and/or Generalization Probe(s) Is Included

If researchers are interested in measuring proximal intervention effects on learner behaviors, a RAD can be ideal. However, without distal measures, researchers cannot make inferences about intervention durability (Ledford & Gast, 2018), which limits their ability to make external validity claims about the intervention and its potency to improve socially meaningful, generalized outcomes. Researchers can include a follow-up measure (i.e., a one-time test of retention or generalization) or a maintenance condition (i.e., multiple observations after the intervention has been withdrawn) in a RAD to assess retention and provide a comprehensive evaluation of behaviors or skills gained across the treatment condition. Comparing retained skills with initial acquisition data, in conjunction with the time elapsed since intervention (see hypothetical data in Figure 3E), can reveal helpful information about how to tailor the intervention and titrate its dose, frequency, and duration. In the scaling of behavioral interventions, sustainability of effects is an important consideration, and the inclusion of follow-up measures in RAD studies can help document the extent to which the intervention leads to robust outcomes.
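A minimal sketch of one way to summarize retention against initial acquisition is shown below. The scores, stimulus-set names, and the idea of expressing retention as a percentage of the initial gain are illustrative assumptions, not a prescribed metric.

```python
# Hypothetical scores for one participant, per taught stimulus set:
# (preintervention probe, postintervention probe, follow-up retention probe)
sets = {"Set1": (1, 9, 8), "Set2": (2, 8, 5), "Set3": (0, 7, 7)}

for name, (pre, post, follow_up) in sets.items():
    acquired = post - pre            # gain during the treatment condition
    retained = follow_up - pre       # gain still present at follow-up
    pct = 100 * retained / acquired if acquired else float("nan")
    print(f"{name}: acquired {acquired}, retained {retained} ({pct:.0f}% of the initial gain)")
```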

Generalization probes are another way to increase the external validity of the study and the social validity of the intervention’s effect. Just like in other SCRD studies, the extent to which the intervention produces effects on untrained behaviors or in untrained settings bolsters the value of said intervention. However, for RAD, generalization may not always be a suitable expectation because controlled replications necessitate independence between stimulus sets targeted for intervention. A delicate balance between reasonable independence and plausible generalization may be needed.

Many RAD studies in Table 1 reported the use of some type of follow-up measure (e.g., a cumulative posttest of all stimuli targeted during the intervention). However, only Whalon et al. (2016) programmed a maintenance condition. Following the last intervention session, the authors included a 3-week maintenance condition and conducted probes corresponding to stimulus sets targeted during the treatment condition. By including the maintenance condition, they were able to measure retention of a sample of stimuli that received intervention. In other RAD studies, researchers explored the degree to which participants retained improvements over time by conducting a single probe of stimulus sets between 1 and 7 weeks after intervention (e.g., Dennis & Whalon, 2020; Kelley et al., 2015). Rather than assess retention of all taught words, Dennis and Whalon (2020) randomly selected vocabulary from a list of mastered words to assess at follow-up.

Effects and Strength of Evidence

Although it may seem arbitrary and feel uncomfortable to prescribe rules for judging the effect of intervention(s) in SCRD research, we believe it is important to set minimum standards for determining effects and the strength of the evidence generated by a RAD. Given the design’s reputation, the more clarity and standardization we can bring to RAD studies, the better. We acknowledge that the guidelines offered here are draft standards; apart from the 10 studies we reviewed, they have not been empirically investigated or even applied. Nonetheless, we submit them for analysis and expect that their broader application and evaluation will assist in their refinement over time.

At the foundation of our guidelines is the necessary separation of study rigor and intervention effects. When positive results are needed to establish methodological rigor, we inadvertently cultivate confirmation bias in our publications. Studies in which the intervention lacks potency tend not to be published because experimental control has not been demonstrated, resulting in a literature saturated with only effective interventions. In the still-emerging group-design and preregistration traditions, effects do not dictate whether researchers’ well-designed studies are published, and in many applied research arenas, published studies of null effects offer tremendous insight. Therefore, it is critically important to separate RAD quality from the effects of the intervention investigated. In a comparative design, the research question may be which intervention, if any, produces stronger effects. If neither intervention produces a stronger effect, or neither produces any effect at all, that is still an answer to the research question and should be available to readers. Researchers can also use a RAD to examine parameters of implementation (e.g., dose, intervention agents, acceptability), generalization, and retention, which do not necessarily depend on positive effects. Because there are few examples of RAD studies in the literature, it is difficult to predict all of its potential uses. As intervention tailoring (Gagliardi, 2011) and the science of implementation (Odom et al., 2020) are emphasized, behavioral researchers will need a SCRD that is flexible and feasible enough to do the job. RAD may fill that need.

We propose that examination of the effect of the intervention occur at two levels: (1) within cases (e.g., participants, behaviors, settings) and (2) across cases. The evaluation of an effect begins with the researcher defining “effect” and then conducting visual inspection of graphs to analyze outcomes within cases. For each case, visual inspectors can count the number of pre-to-post pairs that qualify as “meaningful” and compare the result to the number of pre-to-post pairs possible within the treatment condition (i.e., the number that qualify divided by the number possible). In visual terms, graphs should make it easy to decide whether an intervention effect is present for an individual case, which we define as observing at least five within-case replications (i.e., meaningful gains) AND those replications constituting at least 75% of pre-to-post pairs (i.e., stimulus sets). For example, in Dennis and Whalon (2020), the app condition resulted in pre- to postintervention gains across at least 86% of all probe pairs for five of six participants.
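A minimal sketch of that within-case decision rule, assuming the researcher has already coded each pre-to-post pair as meaningful or not, might look like this. The function name and data structure are ours; the 5-replication and 75% thresholds come from the guideline above.

```python
def within_case_effect(meaningful_gains, min_replications=5, min_proportion=0.75):
    """Return True if a case shows an effect under the proposed rule:
    at least `min_replications` meaningful pre-to-post gains AND those gains
    making up at least `min_proportion` of all pre-to-post pairs."""
    n_meaningful = sum(meaningful_gains)
    proportion = n_meaningful / len(meaningful_gains)
    return n_meaningful >= min_replications and proportion >= min_proportion

# One participant with 7 stimulus sets, 6 of which showed a meaningful gain (~86%).
print(within_case_effect([True, True, True, False, True, True, True]))  # True
```

The judgment of what counts as a “meaningful” gain remains a visual-analysis decision; the code only formalizes the counting step that follows.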

The second part of the evaluation is for researchers to determine the strength of the evidence produced by the study. This requires an across-case analysis (e.g., Figure 3D). After inspecting individual graphs, visual inspectors can count the number of cases for which they observe a within-case effect (defined above). We suggest there is moderate evidence if researchers observe within-case effects (five or more pre-to-post gains) for at least three cases AND within-case noneffects (fewer than five pre-to-post gains) for 26%–33% of cases. If an intervention designed to improve three different behaviors (e.g., fine motor movements, gross motor movements, and object movements) was investigated with a single participant, but only two behaviors showed at least five within-behavior gains, the intervention would not meet the suggested evidence standards. However, if there had been six participants and researchers reported effects for four of them, the study would meet the moderate evidence standard: effects for at least three participants (here, four) and noneffects for only 33% of participants (here, two of six). A conclusion of strong evidence is warranted when effects are observed for at least three cases AND noneffects are observed for 25% or fewer of the cases.
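The across-case rule can be sketched in the same spirit. The labels and rounding behavior are our assumptions (rounding follows the worked example above, where two of six participants is treated as 33%); the numeric thresholds come from the proposed guideline.

```python
def strength_of_evidence(case_effects):
    """Classify across-case evidence under the proposed rule.
    `case_effects` holds one boolean per case (True = within-case effect)."""
    n_effects = sum(case_effects)
    # Round the noneffect percentage, matching the worked example above (2 of 6 ~ 33%).
    pct_noneffects = round(100 * (len(case_effects) - n_effects) / len(case_effects))
    if n_effects >= 3 and pct_noneffects <= 25:
        return "strong"
    if n_effects >= 3 and pct_noneffects <= 33:
        return "moderate"
    return "does not meet the proposed standards"

print(strength_of_evidence([True, True, True, True, False, False]))  # moderate
print(strength_of_evidence([True] * 8))                              # strong
```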

Applying this strength-of-evidence logic to all RAD studies in Table 1, we were able to supply a summative statement regarding evidence of effect(s) per dependent variable. We were not able to judge four studies for the presence or strength of effect(s). First, Bouck et al. (2011) used a RAD in a preliminary study of intervention usability and likability; in their exploratory study, they did not use the RAD to compare the effectiveness of different calculation methods. For Butler et al. (2014), we could not determine the presence of an effect for any of the three participants because the number of within-participant observations was fewer than five pre- to postintervention probe pairs (n = 3). Next, although examining intervention effectiveness was the purpose of the study, Greenwood et al. (2016) averaged participant outcomes and reported only mean gains across participants. Finally, Lin and Kubina (2015) used a modified RAD with only three sets of taught and one set of untaught stimuli, and data graphed on a Standard Celeration Chart did not allow for consistent calculation of pre- and postintervention changes. Considering these results, we encourage researchers using RADs to include raw pre- and postintervention data for each case and to present data in a graphical format that allows for accurate examination of effects.

Overall, 6 of 10 RAD studies allowed us to examine the strength of evidence for the intervention. We present the results of these analyses in Table 2. Through visual inspection of study graphs, we calculated the number of gains across stimulus sets for each outcome per participant. All six studies reported an intervention effect for at least one participant or behavior. In addition to reporting replication counts, study results often included average gain scores for each participant and/or each outcome to quantify the size of the intervention effect. Two studies (Spencer et al., 2012; Whalon et al., 2016) estimated individual effect sizes using the supplementary statistic nonoverlap of all pairs (NAP; Parker & Vannest, 2014), a process that involves comparing each participant’s preintervention probes to all of their postintervention probes. Peters-Sanders et al. (2020) also reported intercorrelations between preintervention oral language skills and vocabulary words learned in the treatment condition to address their research question about differential effects for diverse students.
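For readers unfamiliar with NAP, a minimal sketch of the pairwise comparison, as the statistic is commonly computed, appears below. The probe values are hypothetical, and the tie-handling convention (counting ties as 0.5) reflects common practice rather than any particular study’s analysis.

```python
def nap(pre_scores, post_scores):
    """Nonoverlap of all pairs: compare every preintervention probe with every
    postintervention probe, scoring improvements as 1, ties as 0.5, and
    overlaps as 0, then divide by the total number of pairs."""
    pairs = [(pre, post) for pre in pre_scores for post in post_scores]
    score = sum(1.0 if post > pre else 0.5 if post == pre else 0.0 for pre, post in pairs)
    return score / len(pairs)

# Hypothetical probes for one participant across five stimulus sets.
print(nap(pre_scores=[1, 2, 0, 3, 1], post_scores=[7, 8, 6, 9, 8]))  # 1.0 (complete nonoverlap)
```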

Table 2

Evidence of intervention effects

| Study | No. of quality indicators | No. of participants | Study purpose | Dependent variable | Behavior or intervention | No. of participants with 5+ pre-to-post gains | % of participants with noneffects | Strength of evidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dennis & Whalon (2020) | 11 | 6 | Comparison | 1 | Teacher-delivery | 4 | 33 | Moderate |
| Dennis & Whalon (2020) | 11 | 6 | Comparison | 2 | App-delivery | 5 | 17 | Strong |
| Kelley et al. (2015) | 12 | 9 | Effectiveness | 1 | Vocabulary | 6 | 33 | Moderate |
| Kelley et al. (2015) | 12 | 9 | Effectiveness | 2 | Story comprehension | 3 | 67 | None |
| Peters-Sanders et al. (2020) | 9 | 17 | Effectiveness | 1 | Vocabulary | 12 | 29 | Moderate |
| Spencer et al. (2012) | 9 | 9 | Effectiveness | 1 | Vocabulary | 5 | 44 | None |
| Spencer et al. (2012) | 9 | 9 | Effectiveness | 2 | Story comprehension | 2 | 78 | None |
| Sullivan et al. (2013) | 11 | 8 | Comparison | 1 | Racetrack | 8 | 0 | Strong |
| Sullivan et al. (2013) | 11 | 8 | Comparison | 2 | List | 8 | 0 | Strong |
| Whalon et al. (2016) | 8 | 1 | Effectiveness | 1 | Story comprehension | 1 | 0 | None |

Note. One row per dependent variable. Moderate evidence = 5+ pre-to-post gains are observed for 3+ participants/behaviors AND fewer than 5 pre-to-post gains are observed for 26%–33% of participants/behaviors. Strong evidence = 5+ pre-to-post gains are observed for 3+ participants/behaviors AND fewer than 5 pre-to-post gains are observed for ≤ 25% of participants/behaviors. "None" indicates that neither the moderate nor the strong evidence standard was reached for that outcome.

Next, to judge the strength of the evidence, we analyzed the extent to which effects were replicated across participants or behaviors in each study. We counted the number of participants who experienced five or more pre-to-post gains and calculated the percent of participants who experienced fewer than five pre-to-post gains. As shown in Table 2, two RAD studies reported moderate evidence for the intervention’s effect on one outcome, one had moderate evidence for one outcome and strong evidence for the other outcome, and one study demonstrated strong evidence for two outcomes. The strength of the evidence could be determined if and only if the RAD included three or more cases, regardless of the number of quality indicators. Therefore, if using a RAD for the purpose of investigating intervention effectiveness, the planned inclusion of multiple participants or behaviors is important, if not necessary, to allow for reporting on the strength of evidence for an intervention produced by a study.

Call to Action

Now that we have drafted a set of quality indicators for RAD studies and proposed guidelines for judging the effect and strength of evidence offered by an individual RAD, the scientific community is in a better position to begin the refinement process. Perhaps in time we will see an increase in RAD studies in the behavioral literature and in adjacent fields (e.g., education, psychology). We extend an invitation to applied researchers with an interest in developing, tailoring, comparing, and documenting the effects of brief, easy-to-implement intervention procedures on nonreversible, discrete skills to apply these draft quality indicators to their work. We have yet to realize all the creative ways in which researchers can use the quality indicators as a guide in planning a RAD appropriate to their research questions and study context. Therefore, as researchers continue to expand use of the RAD, more discussion is necessary to identify if and how the quality indicators and effect-related judgments can advance SCRD methodology and applied research. There is room to further quantify the precision and strength of effects produced by one or more interventions. Although presently beyond the scope of this article, there is also a need to examine the utility of statistical analyses for supplementing visual analysis within RAD studies. The extent to which statistical analyses such as randomization tests (Hua et al., 2020; Levin et al., 2020; Onghena & Edgington, 1994) and nonparametric effect-size estimates (e.g., NAP) contribute to the precision and strength of evidence is unknown and ripe for further study. In the spirit of creativity and innovation, we look forward to a RAD future.

Declarations

Conflict of Interest

The authors report no conflicts of interest.



References

  • Boren, J. J. (1963). The repeated acquisition of new behavioral chains. American Psychologist, 18(7), 421.
  • Boren, J. J. (1969). Some variables affecting the superstitious chaining of responses. Journal of the Experimental Analysis of Behavior, 12(6), 959–969. https://doi.org/10.1901/jeab.1969.12-959
  • Bouck, E. C., Flanagan, S., Joshi, G. S., Sheikh, W., & Schleppenback, D. (2011). Speaking math: A voice input, speech output calculator for students with visual impairments. Journal of Special Education Technology, 26(4), 1–14. https://doi.org/10.1177/016264341102600401
  • Brown, C. H., Wyman, P. A., Guo, J., & Peña, J. (2006). Dynamic wait-listed designs for randomized trials: New designs for prevention of youth suicide. Clinical Trials, 3(3), 259–271. https://doi.org/10.1191/1740774506cn152oa
  • Butler, C., Brown, J. A., & Woods, J. J. (2014). Teaching at-risk toddlers new vocabulary using interactive digital storybooks. Contemporary Issues in Communication Science & Disorders, 41, 155–168. https://doi.org/1092-5171/14/4102-0155
  • Cohn, J., Cox, C., & Cory-Slechta, D. A. (1993). The effects of lead exposure on learning in a multiple repeated acquisition and performance schedule. Neurotoxicology, 14(2–3), 329–346.
  • Dennis, L. R., & Whalon, K. J. (2020). Effects of teacher- versus application-delivered instruction on the expressive vocabulary of at-risk preschool children. Remedial & Special Education, 42(4), 195–206. https://doi.org/10.1177/0741932519900991
  • Gagliardi, A. R. (2011). Tailoring interventions: Examining the evidence and identifying gaps. Journal of Continuing Education in the Health Professions, 31(4), 276–282. https://doi.org/10.1002/chp.20141
  • Gersten, R., Fuchs, L. S., Compton, D., Coyne, M., Greenwood, C., & Innocenti, M. S. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71(2), 149–164. https://doi.org/10.1177/001440290507100202
  • Goldstein, H., & Kelley, E. S. (2016). Story friends: An early literacy intervention for improving oral language. Paul H. Brookes.
  • Greenwood, C. R., Carta, J. J., Kelley, E. S., Guerrero, G., Kong, N. Y., Atwater, J., & Goldstein, H. (2016). Systematic replication of the effects of a supplementary, technology-assisted, storybook intervention for preschool children with weak vocabulary and comprehension skills. Elementary School Journal, 116(4), 574–599. https://doi.org/10.1086/686223
  • Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71(2), 165–179. https://doi.org/10.1177/001440290507100203
  • Hua, Y., Hinzman, M., & Yuan, C. (2020). Comparing the effects of two reading interventions using a randomized alternating treatment design. Exceptional Children, 86(4), 355–373. https://doi.org/10.1177/0014402919881357
  • Johnson, A. H., & Cook, B. G. (2019). Preregistration in single-case design research. Exceptional Children, 86(1), 95–112. https://doi.org/10.1177/0014402919868529
  • Kelley, E. S., Goldstein, H., Spencer, T. D., & Sherman, A. (2015). Effects of automated Tier 2 storybook intervention on vocabulary and comprehension learning in preschool children with limited oral language skills. Early Childhood Research Quarterly, 31, 47–61. https://doi.org/10.1016/j.ecresq.2014.12.004
  • Kennedy, C. H. (2005). Single-case designs for educational research. Pearson.
  • Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2013). Single-case intervention research design standards. Remedial & Special Education, 34(1), 26–38. https://doi.org/10.1177/0741932512452794
  • Kratochwill, T. R., Levin, J. R., Horner, R. H., & Swoboda, C. M. (2014). Visual analysis of single-case intervention research: Conceptual and methodological issues. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 91–125). American Psychological Association. https://doi.org/10.1037/14376-004
  • Kratochwill, T. R., & Levin, J. R. (2010). Enhancing the scientific credibility of single-case intervention research: Randomization to the rescue. Psychological Methods, 15(2), 124–144. https://doi.org/10.1037/a0017736
  • Ledford, J. R., Barton, E. E., Severini, K. E., Zimmerman, K. N., & Pokorski, E. A. (2019). Visual display of graphic data in single case design studies: Systematic review and expert preference analysis. Education & Training in Autism & Developmental Disabilities, 54(4), 315–327.
  • Ledford, J. R., & Gast, D. L. (2018). Combination and other designs. In J. R. Ledford & D. L. Gast (Eds.), Single case research methodology: Applications in special education and behavioral sciences (3rd ed., pp. 335–364). Routledge.
  • Levin, J. R., Ferron, J. M., & Gafurov, B. S. (2020). Investigation of single-case multiple-baseline randomization tests of trend and variability. Educational Psychology Review, 33, 713–737. https://doi.org/10.1007/s10648-020-09549-7
  • Lin, F.-Y., & Kubina, R. M. (2015). Imitation fluency in a student with autism spectrum disorder: An experimental case study. European Journal of Behavior Analysis, 16(1), 2–20. https://doi.org/10.1080/15021149.2015.1065637
  • Lobo, M. A., Moeyaert, M., Baraldi Cunha, A., & Babik, I. (2017). Single-case design, analysis, and quality assessment for intervention research. Journal of Neurologic Physical Therapy, 41(3), 187–197. https://doi.org/10.1097/NPT.0000000000000187
  • Odom, S. L., Hall, L. J., & Suhrheinrich, J. (2020). Implementation science, behavior analysis, and supporting evidence-based practices for individuals with autism. European Journal of Behavior Analysis, 21(1), 55–73. https://doi.org/10.1080/15021149.2019.1641952
  • Onghena, P., & Edgington, E. S. (1994). Randomization tests for restricted alternating treatments designs. Behavior Research & Therapy, 32, 783–786. https://doi.org/10.1016/0005-7967(94)90036-1
  • Parker, R. I., & Vannest, K. J. (2014). Non-overlap analysis for single-case research. In T. R. Kratochwill & J. R. Levin (Eds.), Single-case intervention research: Methodological and statistical advances (pp. 127–151). American Psychological Association. https://doi.org/10.1037/14376-005
  • Peters-Sanders, L. A., Kelley, E. S., Biel, C. H., Madsen, K., Soto, X., Seven, Y., Hull, K., & Goldstein, H. (2020). Moving forward four words at a time: Effects of a supplemental preschool vocabulary intervention. Language, Speech, & Hearing Services in Schools, 51, 165–175. https://doi.org/10.1044/2019_LSHSS-19-00029
  • Porritt, M., Wagner, K. V., & Poling, A. (2009). Effects of response spacing on acquisition and retention of conditional discriminations. Journal of Applied Behavior Analysis, 42(2), 295–307. https://doi.org/10.1901/jaba.2009.42-295
  • Powell, S. R., & Nelson, G. (2017). An investigation of the mathematics-vocabulary knowledge of first-grade students. Elementary School Journal, 117(4), 664–686. https://doi.org/10.1086/691604
  • Pustejovsky, J. E., Swan, D. M., & English, K. W. (2019). An examination of measurement procedures and characteristics of baseline outcome data in single-case research. Behavior Modification. Advance online publication. https://doi.org/10.1177/0145445519864264
  • Rubenstein, R. N., & Thompson, D. R. (2002). Understanding and supporting children’s mathematical vocabulary development. Teaching Children Mathematics, 9(2), 107–112. https://doi.org/10.5951/TCM.9.2.0107
  • Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods, 43(4), 971–980. https://doi.org/10.3758/s13428-011-0111-y
  • Shepley, C., Zimmerman, K. N., & Ayres, K. M. (2020). Estimating the impact of design standards on the rigor of a subset of single-case research. Journal of Disability Policy Studies. https://doi.org/10.1177/1044207320934048
  • Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods, 17(4), 510–550. https://doi.org/10.1037/a0029312
  • Spencer, E. J., Goldstein, H., Sherman, A., Noe, S., Tabbah, R., Ziolkowski, R., & Schneider, N. (2012). Effects of an automated vocabulary and comprehension intervention: An early efficacy study. Journal of Early Intervention, 45, 195–221. https://doi.org/10.1177/1053815112471990
  • Sullivan, M., Konrad, M., Joseph, L. M., & Luu, K. C. T. (2013). A comparison of two sight word reading fluency drill formats. Preventing School Failure, 57(2), 102–110. https://doi.org/10.1080/1045988X.2012.674575
  • Tanious, R., & Onghena, P. (2020). A systematic review of applied single-case research published between 2016 and 2018: Study designs, randomization, data aspects, and data analysis. Behavior Research Methods. https://doi.org/10.3758/s13428-020-01502-4
  • Thompson, D. M., Mastropaulo, J., Winsauer, P. J., & Moerschbaecher, J. M. (1986). Repeated acquisition and delayed performance as a baseline to assess drug effects on retention in monkeys. Pharmacology, Biochemistry, & Behavior, 25, 201–207.
  • Van den Noortgate, W., & Onghena, P. (2007). The aggregation of single-case results using hierarchical linear models. Behavior Analyst Today, 8(2), 196–209. https://doi.org/10.1037/h0100613
  • Weaver, E. S., & Lloyd, B. P. (2019). Randomization tests for single case designs with rapidly alternating conditions: An analysis of p-values from published experiments. Perspectives on Behavior Science, 42(3), 617–645. https://doi.org/10.1007/s40614-018-0165-6
  • Whalon, K., Hanline, M. F., & Davis, J. (2016). Parent implementation of RECALL: A systematic case study. Education & Training in Autism & Developmental Disabilities, 51(2), 211–220. https://www.jstor.org/stable/24827548
  • What Works Clearinghouse (WWC). (2020). What Works Clearinghouse standards handbook, Version 4.1. U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. https://ies.ed.gov/ncee/wwc/handbooks
  • Wolfe, K., Barton, E. E., & Meadan, H. (2019). Systematic protocols for the visual analysis of single-case research data. Behavior Analysis in Practice, 12, 491–502. https://doi.org/10.1007/s40617-019-00336-7
  • Zimmerman, K. N., Ledford, J. R., Severini, K. E., Pustejovsky, J. E., Barton, E. E., & Lloyd, B. P. (2018). Single-case synthesis tools I: Comparing tools to evaluate SCD quality and rigor. Research in Developmental Disabilities, 79, 19–32. https://doi.org/10.1016/j.ridd.2018.02.003


