New Approaches to Evaluating Community Initiatives
Volume 2: Theory, Measurement, and Analysis


Establishing Causality in Evaluations of Comprehensive Community Initiatives
Robert C. Granger

Introduction

Causal attribution is difficult in all sciences, and by its nature the comprehensive community initiative (CCI) is an especially complex case. Like many domestic social programs, CCIs are meant to create positive changes in the well-being of low-income children and families. They try to do this by combining some or all of the following elements in a manner that encourages synergy across the strategies: expansion and improvement of social services, such as child care and family support; health care; economic development; housing rehabilitation; community planning and organizing; adult education; job training; and school reform. Moreover, most of today’s CCIs operate on the premise that power must devolve to the community as part of the effective change process (Connell, Kubisch, Schorr, and Weiss, 1995).

CCIs work across sectors while trying to change individuals, families, institutions, and communities. They are situated in particular places and historical moments; as interventions they tend to evolve slowly and flexibly with attention to a large number of interacting processes; and they try to affect a broad range of outcomes (Connell, Aber, and Walker, 1995; Rossi, 1996). They also tend to involve a large number of individuals and groups—funders, community leaders, community residents, the "downtown" political structure. All these factors make it extremely challenging to determine whether or not CCIs make a difference and, if so, how.

Given the obvious evaluation challenges presented by CCIs, it takes courage, and perhaps some folly, to address the issue of causality in CCI evaluations. Yet that is the focus of this paper, necessitated by the fact that the interventions themselves seem too promising to be ignored or given short shrift by the evaluation community. The paper especially tries to assess the promise of theory-based evaluations in advancing the assignment of causality to changes within CCI communities.

The paper begins with a brief discussion of the theory of change approach being articulated by James Connell, Anne Kubisch, and other colleagues from the Aspen Institute Roundtable on Comprehensive Community Initiatives for Children and Families. The theory of change approach is then submitted to a "test" for evaluations developed by Chen (1990): that is, any evaluation, regardless of its specific purpose, should be responsive to the needs of stakeholders and produce credible and generalizable results. The paper concludes that theory-based approaches can help on these counts if evaluators attend to the need for sufficiently credible counterfactuals at all stages of their work. Doing so will require that they develop strong theories, use multiple methods of inquiry to search for and confirm patterns in data, creatively blend research designs, and refrain from rushing to judgment based on findings from individual studies. The paper places this discussion in context by considering what is meant by "cause," the role of counterfactuals in estimating effects and their causes, and the consequences of mistakes in causal inference.

Causality and the Theory of Change Approach

A theory of change approach to evaluation assumes that underlying any social intervention is an explicit or latent "theory" about how the intervention is meant to change outcomes (Weiss, 1995; Schorr, 1995). This notion has been around for some time (Weiss, 1972; Cronbach et al., 1980). In the earlier literature not directed toward CCIs, theory is most typically suggested as a guide for getting within the "black box" of social programs, in order to understand the relative contribution of specific programmatic mechanisms or components to any estimated effects. Further, having an explicit theory about how various processes and outcomes might be linked can direct data collection and analysis. With this map in hand (so the argument goes), evaluators and their clients can measure near-term outcomes with some confidence that observable change in those outcomes will be followed by changes in longer-term outcomes (Chen, 1990). They can also measure the processes that link (and perhaps cause change in) those outcomes. In short, theory can help evaluators pull apart and understand social interventions.

While the general notion of a theory of change approach is not new, the literature does not contain many examples that describe in detail how to develop such a theory for an intervention. There appears to be a consensus that theories about CCIs will come from a combination of existing "social science knowledge" and "practitioner wisdom," with local practitioners in a CCI playing an important role (Weiss, 1995). There are good reasons why local wisdom is required. First, current social science and practitioner knowledge alone are in no way up to the task. As yet, there is neither a scientific literature nor a consensus among practitioners about how to put CCIs in place or how to assure that certain activities will lead to desired results.1 Second, getting local stakeholders involved is consistent with the "community empowerment" ethos of CCIs. Third, knowledge of how an initiative should (or could) be implemented demands local knowledge about such things as community capacity and culture. Thus, a potential role of the evaluator is to "surface" the latent theory. This process tends to take the form of a dialogue that either begins with a description of the first steps of the intervention and moves across outcomes or starts with long-term outcomes and creates a "map" back to the intervention (Brown, 1995). The intent of this guided process is to create a written, explicit description of how stakeholders expect to move from activities to their goals.

Developing a theory of change requires both art and science. In CCIs, stakeholders and groups commonly hold different (and not necessarily compatible) theories. Regardless of its theory, each group seems to feel more sure about its ultimate goals and the near-term strategies, activities, and benchmarks than about activities and outcomes that will presumably occur between current events and long-term results. In addition, stakeholders tend to view their theories as dynamic. They want to revisit their hypotheses about how events will unfold as time passes and experience suggests that revisions are necessary. This means that our current theories of change are not fixed guides that evaluators and others can use in a rigid way. Rather, as with most things in the natural sciences, they are at best well-informed propositions about how highly complex events are related at a particular time and place.

What Does "Cause" Mean?

The state of social science knowledge allows us to adopt only a rather modest standard about causal inferences. As Holland (1988) notes, since Aristotle, philosophers of science have been trying to define what it means for A to cause B. In the social sciences, the statement "A causes B" is often misleading. At best, even in situations where we can use true experimentation, we are able to make quite general, undifferentiated statements about the discrete causes of any effects. In part, this is because most social interventions are multifaceted, and their elements interact in ways we cannot predict. Holland extends this idea, referring to what he calls "encouragement" studies (where individuals are encouraged to participate in an intervention). His point is that humans exhibit varied behavior in response to such things as the "opportunity" to enter a program. Some attend and some do not, and the extent and pattern of attendance for those who come vary in unpredictable ways. Thus, it may be credible to say that X, Y, and Z are the effects of a particular CCI, but it will be virtually impossible to know with any precision what aspects of the CCI caused those effects.

Accepting that it will not be possible (or desirable) to try to pull a CCI apart for causal attribution, we are still left with questions like, "If we do all of X, will we get Y?" Think of this as seeking 100 percent predictability. While complete predictability may be a goal, as Cook and Campbell (1986) note, it will not come soon. Cook and Campbell eloquently write that "this is partly because of the quality of current social science theories and methods, partly because of the belief that society and people are ordered more like multiple pretzels of great complexity than like any structure implied by parsimonious mathematical formulas . . . [and] scientists assume that the world of complex, multivariate, particularistic, causal dependencies . . . is ordered in probabilistic rather than deterministic fashion."

Thus, evaluations of CCIs and most other social interventions, even those aided by well-articulated theories of change, will at best help us make some fairly imprecise inferences about the causal ingredients within the intervention.

How Important Is a Counterfactual to Understanding Cause?

Causal inference requires estimating effects, and one cannot estimate effects without a counterfactual. Even in disciplines where experimentation is unavailable, such as astrophysics, history, political science, and geology, causal attribution requires counterfactual inference (Tetlock and Belkin, 1996). While this is widely understood in the scientific community, it is surprising how quickly discussions about evaluating social programs lose the distinction between outcomes (a measure of the variables that follow all or some of an intervention) and effects (the outcome minus an estimate of what would have occurred without the intervention). To make this point clear, consider the following summaries of two social projects:

Summary 1

A number of children participate for up to two years in a high-quality early childhood intervention. Long-term follow-up shows that 33 percent do not finish high school or earn a general educational development (GED) certificate, 31 percent are detained or arrested, 16 percent of their school years are spent in special education, and the teen pregnancy rate for females in the program is 64 per 100.

Summary 2

A number of teen mothers and their children participate for up to 18 months in a high-quality comprehensive program meant to improve their educational achievement and credentials, increase their employment and earnings, and decrease their reliance on public assistance. After long-term follow-up, the proportion holding a high school diploma or a GED certificate has increased from 6 percent to 52 percent; the employment rate during the year preceding measurement has grown from 37 percent at baseline to 53 percent; the share earning more than $500 during the preceding year has risen from 20 percent at baseline to 48 percent at follow-up; and Aid to Families with Dependent Children (AFDC) receipt has fallen from 95 percent to 75 percent.

Query

Which intervention made a difference?

Answer

The early childhood intervention described in Summary 1.

Readers may realize that Summary 1 describes the program group outcomes for participants in the Perry Preschool Project at age 19 (Berrueta-Clement et al., 1984). This small social experiment involving 123 families is arguably one of the most influential demonstration studies in history. The intervention made a positive difference across a range of important outcomes. Summary 2 represents the 42-month outcomes from the New Chance Demonstration, a program for high school dropouts who had their first children as teenagers and were on AFDC.2 The evaluation shows that many of the young women moved forward in many ways, yet, consistent with findings from other interventions for this subset of teenage parents, the program group did not advance farther than their control group counterparts in most respects. The accompanying table contains selected measures from these two studies and makes the point that a strong counterfactual is fundamental to having a good estimate of effects.3
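To make the outcome-versus-effect distinction concrete, here is a minimal sketch in Python using clearly hypothetical numbers (not the actual Perry Preschool or New Chance figures): the same follow-up outcome can imply a large effect or almost none, depending entirely on the counterfactual.

# Minimal sketch with hypothetical numbers, not actual study figures.
# An "outcome" is what the program group looks like at follow-up; an
# "effect" is that outcome minus the counterfactual (here, the control
# group's outcome at the same point).

def effect(program_outcome: float, control_outcome: float) -> float:
    """Estimated effect = program-group outcome minus counterfactual outcome."""
    return program_outcome - control_outcome

# Hypothetical Study A: a modest-looking outcome, but the control group
# fared much worse, so the program made a substantial difference.
study_a = {"high_school_completion": (0.60, 0.40)}   # (program, control)

# Hypothetical Study B: an improved-looking outcome, but controls improved
# nearly as much on their own, so the estimated effect is close to zero.
study_b = {"employment_rate": (0.53, 0.51)}

for name, outcomes in (("Study A", study_a), ("Study B", study_b)):
    for measure, (prog, ctrl) in outcomes.items():
        print(f"{name} {measure}: outcome = {prog:.2f}, "
              f"counterfactual = {ctrl:.2f}, effect = {effect(prog, ctrl):+.2f}")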

Before moving on, it may be useful to explore a limit of the previous example. The table may suggest that a counterfactual is important only when we are interested in judging an intervention’s effects on long-term outcomes. This is not so. Rather, a counterfactual is needed for other evaluation purposes, such as refining a program. For example, suppose the developers of the Perry model wanted to know if certain staff development activities were "paying off" in changes in teacher behavior, which in turn were creating differences in student performance. To complete this analysis, they would need to assess the effects on each of these variables in the presumed causal chain. To do so, they would have to address questions such as the following: How much staff development are teachers getting? Are they getting more of it than they would have without us? Do doses of staff development predate change in teacher behavior? Do teachers who are not getting staff development also change their behavior? Does performance by students differ between those whose teachers are and those whose teachers are not receiving staff development? At each stage of the analysis, the strength of any causal attribution would rest on the strength of the counterfactual and the validity of the theory undergirding the analysis. Without the counterfactual, it would not be possible to estimate the effects on the outcomes of interest. And without the theory, it would not be possible to link those effects in a causal chain.

Of course, for practical reasons, evaluations have to pay more attention to certain effects and causes than to others. For instance, in the above example it probably makes sense to worry more about the link between teacher behavior and student effects than about the link between staff development and teacher behavior. This raises the question, "When are a counterfactual and a causal inference good enough?"

What Is the Appropriate Standard for Credibility?

Establishing a simple, uniform threshold for credibility may not be possible. Instead, since evaluations are done to help people make decisions, the credibility of any causal inference should be commensurate with the importance of the judgment it will influence. A causal judgment is really a probabilistic statement about the likelihood that one thing leads to another. As with all probabilities, there is always the chance that a particular attribution is wrong. Sometimes we will assert that a CCI caused some effect and we will be mistaken; that is, the appearance of cause might exist simply due to chance or some unobserved (or uncontrolled for) phenomenon. Similarly, we may say that a CCI is not getting us what we hoped, when in fact it is making a positive difference. Accepting that mistakes are always possible, the question becomes something like, "In the scheme of things, what sorts of mistakes are more tolerable than others?"

It seems that the decisions with the greatest consequence have to do with whether or not the CCI is causing effects on the longer-term outcomes of interest. If we make a mistake on that question, two scenarios are possible, depending on the nature of the error. If we mistakenly attribute positive effects to a CCI, some people will erroneously assume that the CCI should be continued and (perhaps) replicated elsewhere. On the other hand, if we mistakenly say that the CCI is not making a difference (or is making a negative difference), then the effort may be inappropriately stopped. Are these mistakes serious? Quite possibly. Their seriousness depends on such considerations as the importance of the CCI’s effects (if any) on the participants and society at large, the cost of the CCI, and the need elsewhere for the resources consumed by the CCI (in economic terms, the "opportunity costs" of the CCI). In contrast to this example, causal misattribution regarding the exact nature and effect of some program implementation strategy, such as staff development, is of less concern. At worst, staff members might participate in some activities that are not crucial to the program’s success, or some worthwhile program development activities might be inappropriately stopped.4

The lesson here is that the credibility of inferences becomes more important as the consequences of making a mistake become graver. Furthermore, the ramifications of making a mistake must be considered from the multiple vantage points of the different stakeholders. Important consequences demand lots of credibility, and minor consequences demand some.

Testing the Value of the Theory of Change Approach

As Chen noted in Theory-Driven Evaluations (1990), debates regarding evaluation tend to be method oriented. That is, most discussions involve the relative merits of various experimental and quasi-experimental designs and their interaction with various data collection methods (nomothetic/quantitative versus idiographic/qualitative)5 and purposes (problem documentation, program refinement, and summative program assessment). Chen provides a framework for these discussions that is useful in considering the theory of change approach. He observes that evaluation results should provide evidence of four characteristics: responsiveness, objectivity, trustworthiness, and generalizability.

How does the theory of change approach measure up against each of these dimensions when the task is to make causal inferences?

Responsiveness

Given the current state of practitioner and social science knowledge about how CCIs work, stakeholders are going to be closely involved in the development of any theory of change. As Stake (1975) has pointed out, evaluations are more responsive to various stakeholders if those stakeholders are involved in selecting the evaluation’s questions, measures, and methods. This is an important consideration given the political nature of most evaluation work. Common agreements, in advance, about such things as early benchmarks can help stakeholders avoid controversy and contention.

The theory of change approach goes beyond simple involvement to using credibility among stakeholders as the touchstone for assessing a theory. Even if an evaluator suggests that a CCI should import ideas documented elsewhere, the ground rule that seems to be emerging is that these ideas need to be "owned" by the local groups. Local stakeholders must believe that the theory of change makes sense. Therefore, a theory of change approach, and the causal links it depicts, ought to be highly responsive, as long as all views are considered and thoughtfully weighed.6

Objectivity

Laying out a theory a priori makes potential causal relationships explicit. Thus, it seems that a theory of change approach should increase the objectivity of causal judgments. Yet achieving such a benefit may take some work. Experience shows that different stakeholder perspectives lead to different theories, while several sources in the literature suggest that cognitive and emotional biases may systematically influence the way individuals attribute cause in indeterminate situations (Tetlock and Belkin, 1996; Granger and Armento, 1980).

One factor that appears to influence our judgments of causality is the degree to which we see outcomes as normative (Kelley, 1973). If an outcome is seen as typical, we are likely to decide that it was caused by "environmental" factors outside a program. However, if an outcome is seen as atypical, we are more likely to believe that the outcome was shaped by the program under review. Cognitive psychologists have shown that a number of factors, such as prior expectations of the attributer, perspective (having a role either inside or outside an intervention), and the "vividness" of the results all shape the judgment of normalcy (Tversky and Kahneman, 1971, 1973, 1974; Borgida and Nisbett, 1977). For example, one very vivid and recent episode in an event-outcome sequence (such as "my Toyota just broke down") tends to crowd out "pallid" baseline data (the maintenance record in Consumer Reports, for instance). Similarly, being an actor in a situation (as opposed to being an observer) seems to influence judgments about cause. Although the empirical literature on this topic contains some nuances, participants tend to assign the cause of events to forces outside themselves, while observers tend to emphasize the causal role of participants. Not surprisingly, however, some researchers have observed an emotional side to these biases. We tend to attribute perceived success to our own actions and failure to external factors, unless it is likely that we will be proven wrong (Bradley, 1978).

Given these well-documented biases in the psychological literature, it is likely that theories of change will vary by stakeholder in rather predictable ways. The solution probably lies in doing just what evaluators and CCI stakeholders are doing: laying out the various theories, critiquing each other’s conceptions, developing a consensus (or consciously leaving competing theories "on the table" for consideration), and revising theories prospectively to avoid ex post justifications.

Trustworthiness

Armed with a consensually developed theory of change that arguably makes causal inferences more responsive and credible, the evaluation inevitably has to confront the test of trustworthiness. Are the results convincing and free from confounding factors? Determining how to answer that question often engenders a fairly acrimonious debate about the fallibility of various evaluation designs. Some line up for social experimentation with random assignment and decide that other approaches are a distant second best (Hollister and Hill, 1995). Another camp asserts that random assignment is not practically possible with CCIs and that it leads to misleading and rigid analyses (Schorr, 1995). Both positions have some merit, but the debates do not move us very far forward.

The call for random assignment is driven by a desire to estimate a counterfactual in a way that controls for selection bias, along with other confounding factors that might compete as causal explanations for any estimated effects.7 Selection issues have been a major problem in the evaluation of social programs (Lalonde, 1986; Fraker and Maynard, 1987; Friedlander and Robins, 1995), and in most interventions targeted on individuals, they must be seriously addressed through randomization or very strong quasi-experimental methods. At this moment, however, the questions dominating CCIs do not demand counterfactuals that are free from selection bias in order to produce credible results. Furthermore, if CCIs reach a point where such counterfactuals are needed, randomization alone may not be the best solution.

Although CCIs have existed in various forms at other times in our domestic policy history (O’Connor, 1995; Halpern, 1994), the current resurgence is quite recent. As most stakeholders tell us, CCIs are now facing contexts that may well be more depleted than before. This means that the threshold questions facing CCIs have to do with program implementation and refinement. At this time, we do not have agreed-upon methods for creating CCIs that are sufficiently durable and strong to drive even mid-term benchmarks. In spite of pressure from funders, summative assessment of CCIs seems premature.

The process of program refinement demands causal inferences in order to allocate scarce resources. Questions include "Are the planned activities happening?" and "Do they seem to be leading to (or causing) short-term benchmarks in ways that are responsive and credible to those who need to make decisions (about staying the course, revising the approach, or revising the theory)?" Answering such questions demands a counterfactual. At issue is how strong that counterfactual must be. For such estimates, it seems excessive to seek counterfactuals beyond a clear theory, careful documentation of the activities and outcomes (intended and unintended), frequent exchanges between evaluators and stakeholders about the emerging picture, and some clear-headed "counterfactual reasoning." Three considerations support this judgment: measuring counterfactuals is not cost-free; most near-term events are within the control of an intervention (for example, it is hard to imagine "village councils" spontaneously springing up in four neighborhoods in Cleveland without CCI activity); and mistakes about causal inference at this stage are unlikely to carry high stakes.

That said, there will soon come a time when stakeholders reasonably ask about the mid-term accomplishments of CCIs. Are we on the right track? Answering such a question demands a stronger counterfactual than we are likely to get solely from a theory of change and the good work of evaluators. When this time comes, some will suggest a design that randomly assigns communities as the way to proceed. Their intent will be to create a counterfactual where there is no selection bias. (The communities in the two groups will be equivalent if a sufficient number of communities are included in the lottery.) But such a design alone will not be sufficient because it will not fully answer the first-order questions regarding what it takes to get a CCI implemented, and the relationship between implementation and subsequent effects.

Because random assignment has been characterized as "the gold standard" (Hollister and Hill, 1995), it may be useful to step back and assess what random assignment might mean at the community level. First, we would be faced with a decision about the composition of our research sample. A concern for generalizability would suggest that we should recruit a broad sweep of communities. On the other hand, a broad sweep would undoubtedly take in many communities without the will or resources to implement a CCI. Failures of implementation would be costly to the evaluation; specifically, resources would be spent on studying communities that never get a CCI going. Therefore, we would probably proceed fairly far along, using prescreening criteria, before we chose the final sample for the research. For example, we might screen out communities that did not express a strong willingness to start a CCI. Then, assuming that there were not enough CCI start-up resources to go around, we would use the lottery-like process of random assignment to allocate the finite resources, creating a "program" group of communities and a "control" group of communities.

Given the nature of CCIs, even with our prescreening, some experimental communities would only partially implement the initiative. Conversely, some communities in the control condition would begin their own CCIs. The only unbiased estimates would compare all the communities in the program group with all those in the control group—an unsatisfactory comparison, given the mixed levels of implementation in each group. When policymakers, practitioners, and funders ask about the intermediate effects of CCIs, they do not want answers that include lots of sites where implementation has failed. Nor do they want to muddy the estimation of CCI effects by comparing CCIs in some communities with different CCIs in others. Rather, the two likeliest questions are, "When a CCI is in place, does it make a difference?" and "What does it take to put a successful CCI in place?"

Assuming those two questions, three strategies can help generate trustworthy causal inferences, especially if they are used in tandem: creatively blend designs to create reasonably strong counterfactuals; explicate and test for patterns within and across sites and time; and investigate possible causes and effects using mixed data collection methods and modes of analysis. All are assisted by a clear theory of change. An overarching recommendation about causal inference in CCIs is to come to such inferences slowly, especially if the stakes regarding a misattribution are large.

Blend designs. Quasi-experimental methods were born of the inability to use experimental methods sensibly in all situations (Campbell and Stanley, 1963; Cook and Campbell, 1986; Cook, 1991). Since Campbell and Stanley’s seminal Experimental and Quasi-Experimental Designs for Research, the language of "internal" and "external" validity has dominated most discussions about causality and design. Hollister and Hill (1995), drawing in particular on a study that compared experimental and quasi-experimental estimates from the same data sets (Friedlander and Robins, 1995), raised important questions about relying on any one quasi-experimental approach. That is not what I have in mind.

The recommendation instead is for the sort of planned and creative blending of designs that has long been advocated by researchers including Cook, Campbell, and Stanley. For example, as noted above, assume that a first-order "effects" question for CCIs is, "When a CCI is in place, does it make a difference?" One quasi-experimental design that could help us answer that question is an interrupted time series (Campbell and Stanley, 1963; Cook and Campbell, 1979). The design estimates effects by taking a series of pre-intervention observations to establish a pre-intervention trend. The series is then "interrupted" by the intervention (that is, the implementation of the CCI), and further longitudinal data are collected. If the trend in the data changes with the advent of the intervention, the change (or effect) is attributable to the intervention.

In an interrupted time series, selection bias is controlled for by using each site as its own control. This feature of the design also helps with the problem of sites achieving different levels of implementation: the evaluator can compare deviations from the pre-intervention trend across sites with different levels of implementation, with each deviation being free from selection effects.
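To illustrate the mechanics, the following minimal Python sketch, with hypothetical annual data, fits a linear pre-intervention trend for a single site, projects it forward as the counterfactual, and treats the post-intervention deviation from that projection as the design-based estimate of the effect.

import numpy as np

# Minimal interrupted time series sketch with hypothetical annual data.
# The site serves as its own control: the pre-intervention trend is
# projected forward, and the effect estimate is the deviation of the
# observed post-intervention values from that projection.

years_pre = np.arange(1990, 1996)                               # before the CCI
outcome_pre = np.array([40, 42, 41, 44, 43, 45], dtype=float)   # hypothetical

years_post = np.arange(1996, 2000)                              # after the CCI
outcome_post = np.array([50, 53, 55, 58], dtype=float)          # hypothetical

# Fit a linear pre-intervention trend (slope and intercept).
slope, intercept = np.polyfit(years_pre, outcome_pre, deg=1)

# Project the trend into the post-intervention period as the counterfactual.
counterfactual = slope * years_post + intercept

# Deviation from trend = design-based estimate of the intervention's effect.
deviation = outcome_post - counterfactual
for year, obs, proj, dev in zip(years_post, outcome_post, counterfactual, deviation):
    print(f"{year}: observed = {obs:.1f}, projected = {proj:.1f}, deviation = {dev:+.1f}")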

The main threat to the validity of this causal inference is that some other event outside the intervention might cause us to miss—or mistakenly find—some effects. For example, a sudden general economic downturn that coincided with the beginning of a CCI might mask its effects on a variety of economic outcomes. Conversely, a general improvement in the economy that was unrelated to a CCI but coincided with its implementation might create some positive effects that the evaluation would mistakenly attribute to the CCI.

Some recent studies are trying to guard against such misattribution by adding data from nonintervention sites to the design. Here, some form of matching of communities, or matching coupled with randomization (referred to as "stratified" random assignment), can be helpful. For example, as reported in Rossi (1996), the evaluation of the Rapid Early Action for Coronary Treatment (REACT) public health intervention involves the random assignment of ten communities, five to a program group that receives the REACT intervention and five to a control condition.8 Similarly, in the Jobs-Plus demonstration, the Manpower Demonstration Research Corporation (MDRC) has created program and comparison sites by having six communities each nominate two or three public housing developments for the intervention.9 MDRC then randomly assigned one development in each city to the program group and the remainder to the comparison group. The REACT and Jobs-Plus evaluation teams do not believe that their randomization of a few communities has created a fully unbiased counterfactual in either evaluation. But it has addressed the potential concern that the interventions somehow "stacked the deck" in their favor.
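A minimal sketch of the stratified lottery just described, with hypothetical city and development names: each city nominates its candidate developments, one per city is drawn at random for the program group, and the rest serve as within-city comparisons.

import random

# Minimal sketch of "stratified" random assignment: each city (stratum)
# nominates two or three candidate public housing developments; one per
# city is picked at random for the program group and the remainder serve
# as within-city comparisons. All names are hypothetical placeholders.

nominations = {
    "City A": ["Development A1", "Development A2"],
    "City B": ["Development B1", "Development B2", "Development B3"],
    "City C": ["Development C1", "Development C2"],
}

rng = random.Random(42)   # fixed seed so the allocation is reproducible

program_group, comparison_group = [], []
for city, developments in nominations.items():
    chosen = rng.choice(developments)            # the lottery within each stratum
    program_group.append((city, chosen))
    comparison_group.extend((city, d) for d in developments if d != chosen)

print("Program group:   ", program_group)
print("Comparison group:", comparison_group)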

Furthermore, the Jobs-Plus team is gathering time series data on both intervention and comparison sites. This strengthens the overall design in two important ways. First, it helps minimize the threat of history. If the time series data show an improvement (measured as the deviation from the historical trend) in the treatment communities that exceeds the deviation in the comparison communities, it is more likely that the estimated effects are the result of the intervention. Conversely, if a deviation in the treatment communities is matched by an equivalent deviation in the comparison communities, the results imply that some other co-occurring event is driving the change.

The time series data especially help with the potential problem that some communities chosen for the "noninterruption" time series cohort will implement CCIs on their own. (The closer the equivalence of the two groups, the more likely this scenario becomes.) Because each community serves as its own control, we can include such communities in the time series analysis as "interrupted" cases. Second, we can examine these cases to understand how they managed to create a CCI without the (presumably necessary) help available to the communities in the program group. This will get us closer to understanding what it takes to create community change.
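Extending the earlier single-site sketch, the hypothetical example below compares the average deviation from trend in program communities with that in comparison communities; a program-group deviation not matched in the comparison group is harder to explain away as a shared external shock.

import numpy as np

def deviation_from_trend(pre, post):
    """Mean post-period deviation from a site's own pre-period linear trend."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    t_pre = np.arange(len(pre))
    t_post = np.arange(len(pre), len(pre) + len(post))
    slope, intercept = np.polyfit(t_pre, pre, deg=1)
    return float(np.mean(post - (slope * t_post + intercept)))

# Hypothetical employment-rate series (percent), six pre- and four
# post-intervention years, for program and comparison communities.
program_sites = {
    "P1": ([50, 51, 50, 52, 51, 52], [55, 56, 58, 59]),
    "P2": ([45, 46, 46, 47, 47, 48], [50, 51, 52, 53]),
}
comparison_sites = {
    "C1": ([48, 49, 49, 50, 50, 51], [51, 52, 52, 53]),
    "C2": ([52, 52, 53, 53, 54, 54], [55, 55, 56, 56]),
}

program_dev = np.mean([deviation_from_trend(pre, post)
                       for pre, post in program_sites.values()])
comparison_dev = np.mean([deviation_from_trend(pre, post)
                          for pre, post in comparison_sites.values()])

# If the program-group deviation clearly exceeds the comparison-group
# deviation, history (a shared external shock) is a less plausible rival
# explanation for the estimated effect.
print(f"Mean program deviation:            {program_dev:+.2f}")
print(f"Mean comparison deviation:         {comparison_dev:+.2f}")
print(f"Difference (program - comparison): {program_dev - comparison_dev:+.2f}")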

The use of an interrupted time series with an attendant, uninterrupted comparison series is not a panacea. First, the design demands the ability to gather longitudinal data, and few current administrative record systems are up to the task. Second, such designs tend to work best when the intervention "interrupts" the time trend abruptly and the effects are large. Neither may be the case with a particular CCI. However, the time series approach seems more appropriate in this situation than randomization alone.

Explicate and test for patterns. While a blended design built around time series data appears promising for evaluations of CCIs, we have to go further to get trustworthy causal inferences. In simple terms, one person’s cause is another’s effect. That is, we look for causality by understanding how patterns of effects (estimated by comparing the intervention cases with their counterfactuals) are linked together in the data. For example, CCIs tend to assume that participation in various cross-sector planning bodies will lead to more resources coming to the community. It is tempting to estimate only the effect on resources, because that is the long-term outcome of interest. But to link any estimated effect on resources back to the presumed cause of increased participation, one needs first to estimate the effect on participation itself. That is, has participation changed significantly (using some reliable counterfactual like the historical trend)? Then, does that change seem to be related in a predictable way to any changes in resources? Several authors have suggested that a finely drawn theory of change can guide the testing for such patterns in results (Weiss, 1972; Chen, 1990; Trochim and Cook, 1992; Rossi, 1996). The suggestion, following Cook and Campbell (1979), Freedman (1991), and others, is to rely on design (as opposed to statistical modeling) as the primary vehicle for estimating effects. Then, these design-driven estimates of effects can be included in more sophisticated models.
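As one way to make such pattern testing concrete, the hypothetical sketch below assumes design-based effect estimates (for example, deviations from historical trend) are already in hand for two linked outcomes across several sites, and checks whether the pattern the theory predicts, larger participation effects going with larger resource effects, actually appears.

import numpy as np

# Hypothetical design-based effect estimates for six sites: the change in
# cross-sector planning participation (percentage points above trend) and
# the change in resources flowing to the community ($ thousands above trend).
participation_effects = np.array([2.0, 5.5, 1.0, 7.0, 3.5, 6.0])
resource_effects = np.array([30.0, 80.0, 10.0, 95.0, 45.0, 70.0])

# A theory of change that says participation drives resources predicts that
# sites with larger participation effects should also show larger resource
# effects. A simple check is the cross-site correlation of the two patterns.
r = np.corrcoef(participation_effects, resource_effects)[0, 1]
print(f"Cross-site correlation of effect estimates: r = {r:.2f}")

# A strong positive correlation is consistent with (but does not prove) the
# theorized link; a near-zero or negative correlation would call it into question.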

Integrate methods. A third recommendation is to enhance causal analysis in CCIs by integrating methods in a predetermined way. Most evaluators recognize that the integration of methods makes good sense. Different methods are better suited to learning about different phenomena. For example, quantitative techniques are typically better for such tasks as assessing the prevalence of discrete phenomena in a community (such as the rate of housing starts), while qualitative techniques more usefully expose context, certain processes, and the meanings people attach to events.10

At first blush, the a priori statement of a theory of change may seem antithetical to the qualitative paradigm. After all, some people would argue that the "theory" should emerge developmentally over time, influenced by the unfolding events. That sounds right, and fortunately it is consistent with the behavior of evaluators pursuing a theory of change approach to CCI evaluation. As described earlier, current theories of change are incomplete, particularly in the space between close-in (short-term) program activities and long-term outcomes of interest. This implies that field researchers should be close to stakeholders over time, presenting the stakeholders with information (from both quantitative and qualitative methods) and extracting their revised theories about the future. Without these refinements, it is going to be impossible to make intelligent guesses about effects or their potential causes.

A second reason to mix methods involves the need to measure things well. Measurement error can obscure potential effects, while measurement bias (systematic measurement error) can distort results and lead to mistaken notions about cause. As described in the literature, theories of change are arrayed as sets of outcomes that are connected by processes or mechanisms. Many current "theories" present significant measurement challenges: just what does it mean for the community to "feel empowered" or for residents to "participate in decision making," and how can these be measured? This situation demands a mix of qualitative and quantitative work to generate a consensus (via convergence of results from different methods) that events, processes, and outcomes have occurred and are in fact linked. Thus, we need to take some qualitative and some quantitative measures of the same phenomena.

Methods may be mixed iteratively (for example, field work leads to a survey, which leads to more field work), or in a more integrated fashion (Greene, 1995; Ragin, 1989). While a discussion of the relative merits of these approaches is beyond the scope of this paper, a concern for causality probably steers one toward integration.

Generalizability

Chen (1990) discusses the idea that the results of an evaluation should be pertinent to other people, places, times, and related problems. This concept of generalizability has been in common conversation among applied social scientists since Campbell and Stanley (1963), and it has been refined in subsequent publications (Bracht and Glass, 1968; Campbell, 1986).

Multi-site demonstrations are common in applied social science research, as is the finding that an intervention can seem to "work" in some sites but not in others (Riccio and Orenstein, 1996). This is the sort of variation that ought to make anyone nervous about firmly latching onto a causal attribution based on results from one CCI evaluation. When interventions occur in multiple contexts, many things vary. Chief among these are the people involved (participants, staff, and others), the community context, the "quality" of the intervention, the moment in time, and any interaction involving some subset of these factors. As developed by Campbell (1957), Cook (1990), and others, the key to generalizability is replication across people, places, and time. The goal here is not that effects always be the same but that they occur in some predictable fashion.

Armed with a strong theory, evaluators are better prepared to anticipate and then examine how between-site variation may shape effects. For example, a theory may suggest that a CCI emphasizing employment will make more of a difference for poorly skilled unemployed persons in a weak labor market (where they must compete with others in a pool of unemployed persons who are relatively skilled) than in a strong one (where most can get jobs without much special assistance). Such hypotheses can be explored if we pay attention to developing some common baseline and outcome measures for cross-site work.

It is going to take time and considerable coordination of evaluation activities before we can use variation across different studies as a way to pursue theory-driven hypotheses about how results should differ. Until we can do so, decision makers should be cautious about generalizing.

In Closing

This paper has echoed much of what others have said about program evaluation research over the past thirty years. The advice is to use theory as a guide, mix methods, seek patterns that corroborate each other (both within and across studies), and creatively combine various designs. None of this will surprise applied social scientists, nor will it be particularly reassuring to those who call for redefining the standards of proof or discarding questions about effects. In short, the recommendation is to do the conventional work better, recognizing that CCI evaluation is helped in many ways by a theory-based approach.

This analysis suggests that a theory of change approach can assist in making causal inferences, regardless of an evaluation’s immediate purpose. It is easier to document problems when a clear theory is available that will direct the baseline analysis and help a community design a CCI that can cause change. Program refinement demands causal analyses that can help decision makers allocate start-up resources, and these decision makers will be assisted by thinking through the links between strategies and early outcomes. Summative program assessment demands strong counterfactuals (the stakes regarding misjudgments are high at this stage), multiple measures of effects, and strong theory to lead the search for confirming patterns in those effects. Finally, generalizability to other persons, places, and times requires a theory to help us make and investigate such generalizations. All this seems especially true with CCIs, given their extreme complexity.

The main caution for the CCI community (including funders) is that a premature push for "effects" studies is likely to be very unsatisfying. Too much time will be spent gathering too much data that will not get synthesized across efforts. In contrast, funding of CCIs should rest on the prima facie merit of their activities at the present time. Funders should encourage mixed-inquiry techniques, theory building, and cross-site communication so the field can aggregate useful information over time.


Notes

  1. This does not mean that we know nothing or that all theories have equal merit. For example, Connell, Aber, and Walker (1995) have described elements of what we know with some surety about the design of programs for children, youth, and families.
  2. The author directs this study under the auspices of the Manpower Demonstration Research Corporation.
  3. In the Perry Preschool Study and the New Chance Demonstration, the counterfactual is represented by the experiences and outcomes of members of the research sample who were assigned, in these studies at random, to the "control" group. At the point of random assignment, this group is equivalent to their counterparts randomly assigned to the "program" group. Thus, any subsequent differences between the groups that are large enough not to be due to chance are reasonably described as effects of the intervention.
  4. Weighing the consequences of mistaken judgments about short-, mid-, and long-term effects is not straightforward. In asserting the importance of long-term outcomes, I recognize that CCIs must show "effects" all along the way, in order to steer their implementation efforts and justify their resources. But given how little we know about the causal pathways from activities to effects, I am assuming that funders will not "pull the plug" prematurely, unless there is a pervasive failure to meet early benchmarks.
  5. Nomothetic research attempts to establish general, universal, abstract principles or laws, while idiographic research deals with individual, singular, unique, or concrete cases.
  6. It also may be the case that the very act of having local stakeholders participate in the "surfacing" process will improve the intervention. This follows from Schön (1983) and others who have argued that "reflective" practitioners who have consciously considered their own assumptions and strategies will do a better job.
  7. To understand selection bias, imagine we are estimating the effects of a CCI by comparing the results from one set of communities with those from a set of matched pairs. Extraneous influences on our outcomes might be controlled for by this design, but they also might exist due to factors related to how we selected communities into our CCI and comparison samples. Therefore, the argument goes, counterfactuals should be created through a lottery process so that the procedures for creating the counterfactual do not unwittingly bias the findings.
  8. The REACT intervention seeks to improve the survival rate of heart attack victims by shortening the time between a heart attack and administering appropriate medication. The intervention consists of an educational campaign aimed at persons at risk, emergency personnel, and primary care physicians.
  9. The Jobs-Plus demonstration seeks to increase employment and earnings in public housing through a combination of employment and training, financial incentives, and community organizing. See Bloom (1996) for a discussion of the demonstration and the research design.
  10. This recommendation takes a pragmatic position regarding the "paradigm debates" involved in the mixing of various methods (Greene and Caracelli, 1997; Rossman and Wilson, 1985). The position here is that it is possible to integrate methods in a manner that preserves the integrity of their root assumptions.

References

Berrueta-Clement, John, Lawrence J. Schweinhart, W. Steven Barnett, Ann S. Epstein, and David P. Weikart. 1984. Changed Lives: The Effects of the Perry Preschool Program on Youths through Age 19. Ypsilanti, MI: High/Scope Educational Research Foundation.

Bloom, Howard S. 1996. "Building a Convincing Test of a Public Housing Employment Program Using Non-Experimental Methods Planning for the Jobs-Plus Demonstration." Paper commissioned by the Manpower Demonstration Research Corporation, in partnership with the U.S. Department of Housing and Urban Development and the Rockefeller Foundation. Excerpted from "Research Design Issues and Options for Jobs-Plus," Howard S. Bloom and Susan Bloom, March 1996.

Borgida, E., and R. E. Nisbett. 1977. "The Differential Impact of Abstract vs. Concrete Information on Decisions." Journal of Applied Social Psychology 7: 258-71.

Bracht, Glenn H., and Gene V. Glass. 1968. "The External Validity of Experiments." American Educational Research Journal 5(4):437-74.

Bradley, G. W. 1978. "Self-Serving Biases in the Attribution Process: A Reexamination of the Fact or Fiction Question." Journal of Personality and Social Psychology 36:56-71.

Brown, Prudence. 1995. "The Role of the Evaluator in Comprehensive Community Initiatives." In New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts, ed. James P. Connell et al. Washington, DC: The Aspen Institute.

Campbell, D. T. 1986. "Relabeling Internal and External Validity for Applied Social Scientists." In Advances in Quasi-Experimental Design and Analysis, ed. W. M. K. Trochim. San Francisco: Jossey-Bass.

Campbell, D. T. 1957. "Factors Relevant to the Validity of Experiments in Social Settings." Psychological Bulletin 54:297-312.

Campbell, D. T., and J. C. Stanley. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally.

Chen, Huey-tsyh. 1990. Theory-Driven Evaluations. Newbury Park, CA: Sage Publications.

Chen, Huey-tsyh, and Peter H. Rossi, eds. 1992. Using Theory to Improve Program and Policy Evaluations. New York: Greenwood Press.

Connell, James P., J. Lawrence Aber, and Gary Walker. 1995. "How Do Urban Communities Affect Youth? Using Social Science Research to Inform the Design and Evaluation of Comprehensive Community Initiatives." In New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts, ed. James P. Connell et al. Washington, DC: Aspen Institute.

Connell, James P., Anne C. Kubisch, Lisbeth B. Schorr, and Carol H. Weiss, eds. 1995. New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts. Washington, DC: Aspen Institute.

Cook, Thomas D. 1991. "Clarifying the Warrant for Generalized Causal Inferences in Quasi-Experimentation." In Evaluation and Education at Quarter Century, ed. M. W. McLaughlin and D. Phillips. Chicago: National Society for Studies in Education.

Cook, Thomas D. 1990. "The Generalization of Causal Connections: Multiple Theories in Search of Clear Practice." In Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, ed. Lee Sechrest, Edward Perrin, and John Bunker. Washington, DC: US Department of Health and Human Services, Public Health Service Agency for Health Care Policy and Research.

Cook, T. D., and D. T. Campbell. 1986. "The Causal Assumptions of Quasi-Experimental Practice." Synthese 68:141-80.

Cook, T. D., and D. T. Campbell. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand-McNally.

Cronbach, Lee J., S. R. Ambron, S. M. Dornbusch, R. D. Hess, R. C. Hornik, D. C. Phillips, D. F. Walker, and S. S. Weiner. 1980. Toward Reform of Program Evaluation. San Francisco: Jossey-Bass.

Fraker, T., and R. Maynard. 1987. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." Journal of Human Resources 22:194-227.

Freedman, D. A. 1991. "Statistical Models and Shoe Leather." Sociological Methodology 21:291-358.

Friedlander, Daniel, and Philip K. Robins. 1995. "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods." American Economic Review 85(4):923-37.

Granger, R. C., and B. Armento. 1980. "Debate Concerning Program Evaluation Results: A Natural Event." Paper presented at the Annual Meeting of the American Educational Research Association, Boston.

Greene, Jennifer C. 1995. "The Paradigm Issue in Mixed-Method Evaluation: Towards an Inquiry Framework of Bounded Pluralism." Draft. Cornell University.

Greene, Jennifer C., and Valerie J. Caracelli. 1997. "Defining and Describing the Paradigm Issue in Mixed-Method Evaluation." In Advances in Mixed-Method Evaluation: The Challenges and Benefits of Integrating Diverse Paradigms, ed. Jennifer C. Greene and Valerie J. Caracelli. San Francisco: Jossey-Bass.

Halpern, Robert. 1994. Rebuilding the Inner City: A History of Neighborhood Initiatives to Address Poverty in the United States. New York: Columbia University Press.

Holland, Paul W. 1988. "Causal Inference, Path Analysis, and Recursive Structural Equations Models." Sociological Methodology 18:449-84.

Hollister, Robinson G., and Jennifer Hill. 1995. "Problems in the Evaluation of Community-Wide Initiatives." In New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts, ed. James P. Connell et al. Washington, DC: Aspen Institute.

Kelley, H. 1973. "The Process of Causal Attribution." American Psychologist 28:107-28.

Lalonde, R. J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review 76:604-20.

O’Connor, Alice. 1995. "Evaluating Comprehensive Community Initiatives: A View from History." In New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts, ed. James P. Connell et al. Washington, DC: Aspen Institute.

Ragin, C. C. 1989. The Comparative Method: Moving Beyond Qualitative and Quantitative Strategies. Berkeley: University of California Press.

Riccio, James A., and Alan Orenstein. 1996. "Understanding Best Practices for Operating Welfare-to-Work Programs." Evaluation Review 20(1):3-28.

Rossi, Peter H. 1996. "Evaluating Community Development Programs: Problems and Prospects." Discussion draft. Amherst, MA: Social and Demographic Research Institute, University of Massachusetts.

Rossman, G. B., and B. L. Wilson. 1985. "Numbers and Words: Combining Quantitative and Qualitative Methods in a Single Large Scale Evaluation Study." Evaluation Review 9:627-43.

Schön, Donald A. 1983. The Reflective Practitioner. New York: Basic Books.

Schorr, Lisbeth B. 1995. "New Approaches to Evaluation: Helping Sister Mary Paul, Geoff Canada and Otis Johnson while Convincing Pat Moynihan, Newt Gingrich and the American Public." In Getting Smart, Getting Real: Using Research and Evaluation Information to Improve Programs and Policies. Baltimore: Annie E. Casey Foundation.

Stake, R. E., ed. 1975. Evaluating the Arts in Education: A Responsible Approach. Columbus, OH: Merrill.

Tetlock, Philip E., and Aaron Belkin. 1996. "Counterfactual Thought Experiments in World Politics." Social Science Research Council 50(4):77-85.

Trochim, William M. K., and Judith A. Cook. 1992. "Pattern Matching in Theory-Driven Evaluation: A Field Example from Psychiatric Rehabilitation." In Using Theory to Improve Program and Policy Evaluations, ed. Huey-tsyh Chen and Peter H. Rossi. New York: Greenwood Press.

Tversky, A., and D. Kahneman. 1974. "Judgment under Uncertainty: Heuristics and Biases." Science 185:1124-31.

Tversky, A., and D. Kahneman. 1973. "Availability: A Heuristic for Judging Frequency and Probability." Cognitive Psychology 5:207-32.

Tversky, A., and D. Kahneman. 1971. "Belief in Small Numbers." Psychological Bulletin 76:105-10.

Weiss, Carol Hirschon. 1995. "Nothing as Practical as Good Theory: Exploring Theory-Based Evaluation for Comprehensive Community Initiatives for Children and Families." In New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts, ed. James P. Connell et al. Washington, DC: Aspen Institute.

Weiss, Carol Hirschon. 1972. Evaluation Research: Methods for Assessing Program Effectiveness. Englewood Cliffs, NJ: Prentice Hall.


Acknowledgements

The author would like to thank the following people for their helpful comments: Howard Bloom, James Connell, Thomas Cook, Janet Quint, James Riccio, and Peter Rossi.

