New Approaches to Evaluating Community Initiatives

Volume 1
Concepts, Methods, and Contexts

Problems in the Evaluation of Community-Wide Initiatives
Robinson G. Hollister and Jennifer Hill

In this paper we outline the types of problems that can arise when an attempt is made to evaluate the effects of community-wide programs, or comprehensive community initiatives (CCIs). Our particular focus is on interventions that target all the individuals in a given geographic area or among a given class of people. We emphasize this feature at the outset because we make sharp distinctions between evaluations of those types of interventions and those in which it is possible to use random assignment methods to create control and treatment groups of individuals. We begin with a brief introduction of some key problems in the evaluation of community-wide initiatives: establishing a counterfactual (for determining what would have happened in the absence of the intervention), defining the unit of analysis, assigning community boundaries, and defining and measuring outcomes. The next section of the paper goes into some detail on the creation of a counterfactual and, specifically, the problems of establishing comparison groups against which to judge the effects of an intervention. We introduce random assignment as the preferred method for creating comparison groups but, given that random assignment is not possible in the evaluation of community-wide initiatives, we go on to review experience using alternative methods for establishing comparison groups of individuals, institutions, and communities. The third part of the paper discusses the types of research questions that could be addressed in community-wide initiatives if key methodological problems could be resolved.

The general conclusion from our review is that we find all of the alternative strategies for establishing counterfactuals problematic with respect to evaluations of community-wide initiatives. As a result, in the final section, we provide some suggestions for developing improved methods to use in these situations.

Key Problems in the Evaluation of Community-Wide Initiatives

The Counterfactual

The basic question an evaluation seeks to address is whether the activities consciously undertaken in the community-wide initiative generated a change in the outcomes of interest. The problem in this case, as in virtually all evaluation cases, is to establish what would have happened in the absence of the program initiative. This is often referred to as the counterfactual. Indeed, most of our discussion turns around a review of alternative methods used to establish a counterfactual for a given type of program intervention.

To those who have not steeped themselves in this type of evaluation, it often appears that this is a trivial problem, and simple solutions are usually proposed. For example, we might look at the situation before and after the initiative is implemented in the given community. The counterfactual, in this case, would be the situation before the initiative. Or, we might find another community that initially looks very much like our target community, and then see how the two compare on desired outcome measures after the initiative is in place. In this case, the comparison community would provide the counterfactual--what would have happened in the absence of the program.

As we shall see, however, and as most of us know, these simple solutions are not adequate to the problem--primarily because individuals and communities are changing all the time with respect to the measured outcome even in the absence of any intentional intervention. Therefore, measures of the situation before the initiative or with comparison communities are not secure counterfactuals--they may not represent well what the community would have looked like in the absence of the program. Let's turn to some concrete examples. In the late 1970s and early 1980s, the federal government funded the Youth Incentive Entitlement Pilot Project (YIEPP) to encourage school continuation and employment among all low-income 16–19-year-olds in school catchment areas in several states. YIEPP pursued a strategy of pairing communities in order to develop the counterfactual. For example, the Baltimore school district was paired with Cleveland, the Cincinnati school district was paired with a school district in Louisville, and so forth. In making the pairs the researchers sought communities that had labor market conditions similar to those of the treatment community. Even though the initial match seemed to be quite good, circumstances evolved in ways that made the comparison areas doubtful counterfactuals. For example, Cleveland had unexpectedly favorable improvement in its labor market compared with Baltimore. Louisville had a disruption of its school system because of court-ordered school desegregation and busing. Those developments led the investigators to discount some of the results that came from using these comparison cities.

A similar procedure, with much more detailed analysis, was adopted as part of an ongoing study of school dropout programs being conducted by Mathematica Policy Research, Inc. The school districts with the dropout program were matched in statistical detail with school districts in a neighboring area within the same city or standard metropolitan statistical area. Although these districts initially matched well in terms of detailed school and population demographics, surveys of the students, teachers, and school processes revealed that the match was often very bad indeed. The schools simply were operating quite differently in the pre-program period, and those differences affected students and teachers in different ways.

The Unit of Analysis

For most of the programs that have been rigorously analyzed by quantitative methods to date, the principal subject of program intervention has been the individual. When we turn to community-wide initiatives, however, the target of the program and the unit of analysis usually shift away from just individuals to one of several possible alternatives. In the first, with which we already have some experience, the target of the program is still the individual, but the individuals reside within geographically bounded areas, and that geographic bounding remains a defining feature of the intervention. It is expected that interactions among individuals, or changes in the general context, will generate different responses to the program intervention than would the treatment of isolated individuals.

Another possible unit of analysis is the family. We have had some experience with programs in which families are the targets for intervention (for example, family support programs), where the proper unit of analysis is the family rather than sets of individuals analyzed independently of their family units. When the sets of families considered eligible for the program and therefore for the evaluation are defined as residing within geographically bounded areas, these family programs become community-wide initiatives. Many of the recent community-wide interventions seem to have this type of focus.

Another possibility for community initiatives is where the target and unit of analysis are institutions rather than individuals. Thus, within a geographically bounded area a program might target particular sets of institutions--the schools, the police, the voluntary agencies, or the health providers--to generate changes in the behavior of those institutions per se. In this case, the institutions become the relevant unit of analysis.

The unit of analysis becomes critical because, in statistical inference, the ability to make confident statements about the effects of an intervention depends on the size of the sample of units analyzed. So if the community is the unit of analysis, then the number of communities, not the number of residents, is our sample size. If we are asking about changes in incarceration rates generated by alternative court systems, the size of the sample would be the number of such court systems that are observed. With units of analysis this large, it may be difficult to reach a sample size adequate for effective statistical inference.
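The point can be made concrete with a small simulation. The sketch below (a hypothetical illustration; the numbers and variable names are invented, not drawn from any study discussed here) generates outcomes for individuals nested in communities and compares a naive standard error, which treats every individual as independent, with one that treats the community as the unit of analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: outcomes for individuals nested in communities.
# Each community has its own baseline (a community-level effect), so
# individuals within a community are not independent observations.
n_communities = 10      # the real sample size when communities are the unit
n_per_community = 500   # many individuals, but they share community effects

community_effects = rng.normal(0.0, 1.0, size=n_communities)
outcomes = (community_effects[:, None]
            + rng.normal(0.0, 1.0, size=(n_communities, n_per_community)))

# Naive SE treats all 5,000 individuals as independent observations.
naive_se = outcomes.std(ddof=1) / np.sqrt(outcomes.size)

# Cluster-aware SE analyzes the 10 community means -- the proper unit here.
community_means = outcomes.mean(axis=1)
cluster_se = community_means.std(ddof=1) / np.sqrt(n_communities)

print(f"naive SE (5,000 individuals): {naive_se:.3f}")
print(f"cluster SE (10 communities):  {cluster_se:.3f}")
```

The cluster-aware standard error comes out many times larger than the naive one: adding more individuals per community does little once community-level variation dominates, which is why evaluations with a handful of communities have so little statistical power.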

The Problem of Boundaries

In community-wide initiatives, we generally focus on cases where geographical boundaries define the unit or units of analysis. Of course, the term "community" need not imply specific geographic boundaries. Rather it might have to do with, for example, social networks. What constitutes the community may vary depending upon the type of program process or the outcome that we are addressing. The community for commercial transactions may be quite different from the community for social transactions. The boundaries of impact for one set of institutions--let us say the police--may be quite different from the boundaries for impacts of another set of institutions--let us say schools or health care networks. That might suggest particular problems for community-wide initiatives that have as one of their principal concerns the "integration of services": the catchment areas for various types of service units may intersect or fail to intersect in complicated ways in any given area. (For a thorough discussion of the problems of defining neighborhood or community, see Chaskin 1994.)

For the purposes of evaluation, these boundary problems introduce a number of complex issues. First, where the evaluation uses a before-and-after design--that is, a counterfactual based on measures of the outcome variables in a community in a period before the intervention is compared with such measures in the same area after the intervention--the problem of changes in boundaries may arise. Such changes could occur either because some major change in the physical landscape takes place--for example, a new highway bisects the area or a major block of residences is torn down--or because the data collection method is based on boundaries that are shifted due to, say, redistricting of schools or changing of police districts. Similar problems would arise where a comparison community design is used for the evaluation, and boundary changes occur either in the treatment community or the comparison community.

Second, an evaluation must account for inflow and outflow of people across the boundaries of the community. Some of the people who have been exposed to the treatment are likely to migrate out of the community and, unless follow-up data are collected on these migrants, some of the treatment effects may be misestimated. Similarly, in-migrants may enter the area during the treatment period and receive less exposure to the treatment, thereby "diluting" the treatment effects measured (either negatively or positively).

Finally, the limited availability of regularly collected small-area data causes serious problems for evaluations of community-wide initiatives. The decennial census is the only really complete data source that allows us to measure population characteristics at the level of geographically defined small areas. In the intercensal years, the best we can do in most cases is to extrapolate or interpolate. For the nation as a whole, regions, states, and standard metropolitan statistical areas, we can get some regularly reported data series on population and industry characteristics. For smaller areas, we cannot obtain reliable, regularly reported measures of this sort. We suggest below some steps that might be taken to try to improve our measurements in small geographic areas, but at present this remains one of the most serious handicaps faced in quantitative monitoring of the status of communities. (See the paper by Claudia Coulton in this volume for further discussion of these measurement dilemmas.)

Problems with Outcome Measures

In many past evaluations in the social policy arena, the major outcome variables have been relatively straightforward and agreed-upon--for example, the level of employment, the rate of earnings, the test scores of children, the incidence of marriage and divorce, birth outcomes, arrests and incarcerations, and school continuation rates or dropout rates. For community-wide initiatives, these traditional types of outcomes may not be the primary outcomes sought, or, even if they are, they may not show detectable effects in the short term. For example, in the famous Perry Preschool study, the long-term outcomes are now often talked about--employment, earnings, and delinquency, among others--but during the early phases of the program's evaluation these outcomes could not, of course, be directly measured. This may be true for some of the community initiatives as well: during the period of the short-term evaluation, it may be unlikely that traditional outcome measures will show much change even though it is hypothesized that in the long run they will show change. For community initiatives, then, we need to distinguish intermediate outcomes and final outcomes.

In addition, in community initiatives there may be types of outcome measures that have not been used traditionally but are regarded as outcomes of sufficient interest in and of themselves, regardless of whether they eventually link to more traditional outcome measures. That might be particularly relevant where the object of the community initiative is a change in institutional behavior. For example, if an institution is open longer hours or disburses more funds or reduces its personnel turnover, these might be outcomes of interest in their own right rather than being viewed simply as intermediate outcomes.

Finally, we would want to make a careful distinction among input measures, process measures, and outcome measures. For instance, an input measure might be the number of people enrolled in a GED (general educational development) program, whereas the outcome measure might be the number of people who passed their GED exam or, even further down the road, the employment and earnings of those who passed. Process measures might be changes in the organizational structure, such as providing more authority to classroom teachers in determining curriculum content rather than having it determined by superintendents or school boards. The ultimate outcome measure of interest for such a process measure would be the effect of the teachers' increased authority on student achievement.

For community-wide initiatives, several measurement questions of this kind are likely to emerge.

As one seeks to address these questions it becomes clear that it is important to determine, as best one can, the likely audience for the evaluation results. The criteria for determining the important outcomes to be measured and evaluated are likely to vary with that audience. Will the audience in mind, for example, be satisfied if it can be shown that a community-wide initiative did indeed involve the residents in a process of identifying and prioritizing problems through a series of planning meetings, even if that process did not lead to changes in school outcomes or employment outcomes or changes in crime rates in the neighborhood? Academics, foundation staff, policymakers, and administrators are likely to differ greatly in their judgment of what outcomes provide the best indicators of success or failure.

Another dimension of this problem is the degree to which the audience is concerned with the outcomes for individuals versus the outcomes for place. This, of course, is an old dilemma in neighborhood change going back to the time of urban renewal programs. In those programs the geographical place may have been transformed by removing the poor people and replacing them through a gentrification process with a different population: place was perhaps improved but people were not. At the other extreme, experiments that move low-income people from the center city to the suburban fringe may improve the lives of the participants in the program, but the places that they leave may be in worse shape after their departure.

Establishing the Counterfactual Using Comparison Groups: Selection Bias and Other Problems

Many of the above problems associated with evaluations of CCIs are generic to the evaluation of any complex program. Most particular to CCIs is the degree of difficulty associated with creating a credible counterfactual for assessing impact. We now turn our attention to this issue.

Random Assignment as the Standard for Judgment

For quantitative evaluators random assignment designs are a bit like the nectar of the gods: once you've had a taste of the pure stuff it is hard to settle for the flawed alternatives. In random assignment design, individuals or units that are potential candidates for the intervention are randomly assigned to be in the treatment group, which is subject to the intervention, or to the control group, which is not subject to any special intervention. (Of course, random assignment does not have to be to a null treatment for the controls; there can be random assignment to different levels of treatment or to alternative modes of treatment.)

The key benefit of a random assignment design is that, as soon as the number of subjects gets reasonably large, there is a very low probability that any given characteristic of the subjects will be more concentrated in the treatment group than in the control group. Most important, this holds for unmeasured characteristics as well as measured characteristics.

Random assignment of individuals to treatment and control groups, therefore, allows us to be reasonably sure that no selection bias has occurred when we evaluate the intervention. This means that when we compare average outcomes for treatments and controls we can have a high degree of confidence that the difference is not due to some characteristics, which we may not even be aware of, that made the treatment group more or less likely to respond to the intervention. We can conclude instead that the difference is due to the treatment itself. The control group provides a secure counterfactual because, aside from the intervention itself, the control group members are subject to the same forces that might affect the outcome as are the treatment group members: they grow older just as treatment group members do, they face the same changes in the risks of unemployment or increases in returns to their skills, and they are subject to the same broad social forces that influence marriage and family practices.
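The claim that randomization balances even unmeasured characteristics once samples are "reasonably large" can be illustrated with a short simulation (a hypothetical sketch; the sample sizes and trial counts are arbitrary choices, not drawn from any study discussed here):

```python
import numpy as np

rng = np.random.default_rng(1)

def max_imbalance(n, trials=200):
    """Largest treatment-control gap on an *unmeasured* trait across many
    random assignments of n subjects (half treated, half control)."""
    gaps = []
    for _ in range(trials):
        trait = rng.normal(size=n)              # trait the evaluator never sees
        assign = rng.permutation(n) < n // 2    # pure random assignment
        gaps.append(abs(trait[assign].mean() - trait[~assign].mean()))
    return max(gaps)

for n in (20, 200, 2000):
    print(f"n = {n:5d}: worst-case gap on unmeasured trait = {max_imbalance(n):.3f}")
```

Even the worst gap across hundreds of randomizations shrinks steadily as n grows--the unmeasured trait is balanced automatically, with no matching or modeling required. The same logic explains why randomizing only a handful of units, as in the matched-pair designs discussed later, offers much weaker protection.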

We realize that this standard is very difficult, often impossible, for evaluations of community-wide initiatives to meet. Unfortunately, there appear to be no clear guidelines for selecting second-best approaches, but a recognition of the character of the problems may help set us on a path to developing such guidelines.

Experiences With Creating Comparison Groups

We now turn to assessing the utility of more feasible alternatives for establishing comparison groups. We compare impact results from studies in which random assignment of individuals was used to create comparison groups with impact results when alternative methods were used to create the comparison groups. In this case, we use the results from the randomly assigned treatment versus control groups as a standard against which to evaluate the types and magnitude of errors that can occur when this best design is not feasible.1 Our hope is that if one or more of the alternatives looks promising in the evaluation of programs with individuals as the unit of analysis, then we would have a starting point for considering alternatives to random assignment in the evaluation of CCIs. Toward the end of this section we discuss experience with comparison institutions and comparison communities.

Constructed Groups of Individuals. Constructed comparison groups of individuals were the most common method of evaluation before random assignment came into wide use in large-scale social policy studies in the 1970s and 1980s. The earliest type of constructed group was a before-and-after, or "pre–post," design. Measurements were made on the individuals before they entered the treatment, during the treatment, and following the conclusion of the treatment. Impacts were measured as the change from before program to after program.

This strategy for establishing counterfactuals is recognized as highly vulnerable to naturally occurring changes in individuals. For example, criminal behavior is known to decline with age regardless of treatment efforts, a phenomenon referred to as "aging out." With respect to employment and training programs, eligibility is often based on a period of unemployment prior to program entry. But, for any group of people currently unemployed, the process of job search goes on and often results in employment or re-employment. In those cases, it is difficult to untangle the program effects from those of normal job-finding processes.

Another strategy for constructing comparison groups is to compare non-participants with participants in a program. This strategy was used in early evaluations of the Job Corps and in evaluations of the Special Supplemental Food Program for Women, Infants, and Children (WIC) (Devaney, Bilheimer, and Schore 1991) and the National School Lunch and School Breakfast Programs (Burghardt, et al. 1993). This type of design is recognized as producing bias due to selection on unobserved variables. Usually there is a reason why an individual participates or does not participate in the program--for example, an individual's motivation, or subtle selection procedures followed by the program administrators. If characteristics affecting the selection could also affect the final outcome, and if these characteristics are not measured, then the difference between the participant and the non-participant groups is a potentially biased estimate of program impact. This bias could either over- or under-estimate program effects.
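Selection on an unobserved variable is easy to demonstrate in a simulation. In the hypothetical sketch below (the variable names and effect sizes are invented for illustration), "motivation" raises both the chance of enrolling and the outcome itself, while the program's true effect is set to exactly zero--yet the participant-versus-nonparticipant comparison reports a large positive "impact":

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Unobserved motivation drives both enrollment and the outcome.
motivation = rng.normal(size=n)
enrolls = rng.random(n) < 1 / (1 + np.exp(-motivation))  # self-selection

true_effect = 0.0                                        # program does nothing
outcome = true_effect * enrolls + motivation + rng.normal(size=n)

# Naive participant vs. non-participant comparison.
naive_estimate = outcome[enrolls].mean() - outcome[~enrolls].mean()
print(f"true effect: {true_effect}, naive estimate: {naive_estimate:.2f}")
```

Because motivation is never measured, no amount of adjustment on observed characteristics can remove this bias--which is exactly why the comparison-group results reviewed below so often diverge from the experimental benchmarks.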

A third strategy for creating comparison groups is to use existing survey data to sample individuals for the comparison group. The most commonly used source of information is the U.S. Census Bureau's Current Population Survey (CPS), which has large national samples of individuals. Comparison groups are usually constructed by matching the characteristics of the individuals in the treatment group to individuals in the CPS. This procedure was used in evaluations of employment training programs (Bloom 1987; Ashenfelter and Card 1985; Bassi 1983, 1984; Bryant and Rupp 1987; Dickinson, Johnson, and West 1987), where program enrollment data were often used in combination with the CPS data or data from Social Security records. These data do provide a long series of observations on individuals prior to the time of program eligibility as well as during program eligibility.

One important set of studies directly demonstrates the pitfalls of constructing comparison groups of individuals from data sources that differ from the source used for the treatment-group data. These studies were based on the National Supported Work Demonstration, which ran between 1975 and 1979 in eleven cities across the United States. This was a subsidized employment program for four target groups: ex-addicts, ex-offenders, high school dropouts, and women on Aid to Families with Dependent Children (AFDC). Two sets of investigators working independently used these experimental data and combined them with nonexperimental data to construct comparison groups (see Fraker and Maynard 1987, LaLonde 1986, LaLonde and Maynard 1987). Both studies used data generated from the random assignment experiment--differences between randomly assigned treatment and control groups--as the "true" estimates of program effects. Then, alternative comparison groups were constructed in a variety of other ways and estimates of the effects of the program on the outcome variable were made using the constructed comparison group in place of the randomly assigned control group. These two estimates of effects--one based on the randomly assigned and the other based on constructed control groups--were then compared. Both sets of investigators looked at various ways of matching the treatment subjects from the experiment with counterparts taken from other data sources--the CPS and Social Security data were used in combination in one study and the data from the Panel Study of Income Dynamics were used in the other. In constructing the comparison group from other data sources, the investigators followed the method that other investigators had used previously to study employment and training programs such as those under the Comprehensive Employment and Training Act (CETA). The "true impact" effects of the program on earnings were available from the experimental treatment–control differences.
Thus, it was possible to demonstrate the extent to which constructed comparison groups were able to provide impact estimates that approximated the "true impact."

The major conclusion from this set of important studies was that the constructed comparison groups provided unreliable estimates of the true program impacts on employment and earnings, and that none of the matching techniques used looked particularly superior one to the other--there was no clear second-best.

As part of their studies these investigators also tried to see if the bias resulting from constructed comparison groups could be corrected statistically. To address potential bias of this type, due to unobserved variables, analysts since the late 1970s have often relied on methods that attempt a statistical correction for the bias. The methods used most often were developed by James Heckman (1979). Basically, these corrections try to "model" the selection process--that is, to develop a statistical equation that predicts the likelihood of being in the treatment group or in the comparison group. The results of this equation are then used to "adjust" the estimates of treatment versus comparison group differences. While the proposed approach can work in certain situations, experience has shown that it is not generally reliable for dealing with the problem of unobserved variables. Understanding the problem of unobserved variables and the weakness of any methodologies other than random assignment in dealing with this problem is central to appreciating the difficulties that are faced in the evaluation of community-wide initiatives. We will touch on this repeatedly in the following sections.
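The flavor of the Heckman-style two-step correction can be conveyed in a simplified sketch (a hypothetical simulation, not the specification used in any study cited here; the instrument, coefficients, and sample size are all invented). A probit models selection into treatment, and the resulting inverse Mills ratio enters the outcome equation as a control for the correlated unobservable:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 5000

# Hypothetical data: selection into treatment depends on an observed
# variable z and on an unobserved error u that is correlated (0.6) with
# the outcome error e -- the classic source of selection bias.
z = rng.normal(size=n)
u, e = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n).T
treated = (0.5 * z + u) > 0            # selection equation
true_effect = 2.0
y = true_effect * treated + e          # outcome equation

# Naive comparison is biased upward because u and e are correlated.
naive = y[treated].mean() - y[~treated].mean()

# Step 1: probit of treatment status on z, fit by maximum likelihood.
def nll(b):
    p = norm.cdf(b[0] + b[1] * z).clip(1e-10, 1 - 1e-10)
    return -(treated * np.log(p) + (~treated) * np.log(1 - p)).sum()

b = minimize(nll, x0=[0.0, 0.0], method="BFGS").x
xb = b[0] + b[1] * z

# Step 2: include the inverse Mills ratio as a bias-correction regressor.
mills = np.where(treated, norm.pdf(xb) / norm.cdf(xb),
                 -norm.pdf(xb) / (1 - norm.cdf(xb)))
X = np.column_stack([np.ones(n), treated, mills])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"true {true_effect}, naive {naive:.2f}, corrected {coef[1]:.2f}")
```

In this simulation the correction works because the selection model is exactly right by construction. The practical weakness the text describes is that in real data the selection equation is never known, and the estimates turn out to be quite sensitive to that specification.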

Constructed Comparisons: Institutions. In a few cases, where the primary unit of intervention and analysis has been an institution, attempts have been made to construct comparison groups of institutions. Those procedures come closer to the problems encountered in community-wide initiative evaluations.

For example, in parts of the school dropout studies that were introduced earlier in this paper (Dynarski, et al. 1992), individuals were randomly assigned to a school dropout prevention program and to a control group. At other sites, however, the random assignment of individuals was not feasible, so an attempt was made to find other schools that could be used as a comparison group to judge the effectiveness of the dropout program. After the schools had been initially matched, survey data were collected from students, parents, and school administrators. As noted previously, comparison of these data showed that, in fact, the schools being "matched"--in spite of being demographically similar--were quite different in their operational aspects. Note that in this case even though the student outcomes are the ultimate subject of the study--that is, whether the students drop out or not--the institution was the unit of comparison selected in order to create a comparison group of "environments" similar to those in the treatment schools.

In one study, there was a large enough number of schools to attempt a quasi-random assignment of schools to treatment and control groups. Twenty-two schools were first matched on socioeconomic characteristics and then randomly assigned within matched pairs to treatment and control groups (Flay, et al. 1985). It is doubtful that a sample size of twenty-two is adequate to assure that the random assignment has achieved balance on unmeasured characteristics.

Comparison Communities. There are several examples of attempts to use communities as the units for building the comparison group. At first blush, the idea is quite appealing: find a community that is much like the one in which the new treatment is being tested and then use this community to trace how the particular processes of interest or outcomes of interest evolve compared with those in the "treatment community." In most cases, the treatment site has been selected before the constructed comparison site is selected.

The most common method for selecting comparison communities is to attempt to match areas on the basis of selected characteristics that are believed, or have been shown, to affect the outcome variables of interest. Usually, a mixture of statistical weighting and judgmental elements enters into the selection.

Often a first criterion is geographic proximity--same city, same metropolitan area, same state, same region--on the grounds that this will minimize differences in economic or social structures and changes in area-wide exogenous forces. Sometimes an attempt is made to match communities based on service structure components in the pre-treatment period--for example, similarities in health service provision.

Most important, usually, is the statistical matching on demographic characteristics. In carrying out such matching the major data source is the decennial Census, since this provides characteristic information even down to the block group level (a subdivision of Census tracts). Of course, the further the time period of the intervention from the year in which the Census was taken, the weaker this matching information will be. One study used 1970 Census data to match sites when the program implementation occurred at the very end of the decade, and found later that the match was quite flawed.

Since there are many characteristics on which to match, some method must be found for weighting the various characteristics. If one had a strong statistical model of the process that generates the outcomes of interest, then this estimated model would provide the best way to weight together the various characteristics. We are not aware of any case in which this has been done. Different schemes for weighting various characteristics have been advocated and used.2
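One mechanical scheme that has been used for weighting multiple characteristics together is distance-based matching. The sketch below (a hypothetical illustration; the characteristics and values are invented) ranks candidate comparison communities by Mahalanobis distance from a treatment site, which weights each characteristic by its variance and correlations so that no single high-variance measure dominates the match:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: rows are candidate comparison communities, columns are
# matching characteristics (say, poverty rate, dropout rate, unemployment),
# here standardized and randomly generated for illustration.
candidates = rng.normal(size=(50, 3))
treatment_site = np.array([0.5, -0.2, 0.1])

# Mahalanobis distance: weight characteristics by their inverse covariance.
cov_inv = np.linalg.inv(np.cov(candidates, rowvar=False))
diffs = candidates - treatment_site
dists = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

best = np.argsort(dists)[:3]   # the three closest candidate communities
print("closest matches:", best, dists[best].round(2))
```

Note that this only formalizes the weighting problem; it cannot match on characteristics that are unmeasured, which is precisely where the comparison-community designs reviewed here have broken down.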

In a few cases, time-trend data are available on the outcome variable at the small-area level that cover the pre-intervention period. For example, in recent years, birth-record data have become more consistently recorded and made publicly available, at least to the zip-code level. In some areas, AFDC and Food Stamp receipt data aggregated to the Census tract level are available. The evaluation of the Healthy Start program, a national demonstration to improve birth outcomes and promote the healthy development of young children, proposes to attempt to match sites on the basis of trends in birth data.

In the Youth Incentive Entitlement Pilot Project, which was described at the outset of this paper, four treatment sites were matched with sites in other communities in other cities, based on weighted characteristics such as the labor market, population characteristics, the high school dropout rate, socioeconomic conditions, and geographic proximity to the treatment site. Unforeseen changes in the comparison sites, however, reduced their validity as counterfactuals. The Employment Opportunity Pilot Project (EOPP) was a very large-scale employment opportunity program, which began in the late 1970s and continued into the early '80s, focused on chronically unemployed adults and families with children. It also used constructed comparison sites as part of its evaluation strategy. Once again there were problems with unexpected changes in comparison sites. For example, Toledo, which had major automobile supplies manufacturers, was subject to a downturn in that industry. Further, out of ten sites, one had a major hurricane, a second had a substantial flood, and a third had a huge unanticipated volcanic eruption.

Two projects under way may give us additional information about selecting comparison communities. For the Healthy Start evaluation, two comparison sites are being selected for each treatment site (Devaney and Morano 1994). In developing comparison sites, investigators have tried to add to the more formal statistical matching by asking local experts whether the proposed comparison sites make sense in terms of population and service environment. The evaluation of community development corporations (CDCs), being carried out by the New School for Social Research, has selected comparison neighborhoods within the same cities as the three CDC sites under evaluation.

Treatment and comparison sites randomly assigned. There are a couple of examples where the treatment sites were not predetermined but rather were selected simultaneously with the comparison sites. The largest such evaluation is that of the State of Washington's Family Independence Program (FIP), an evaluation of a major change in the welfare system of the state (Long and Wissoker 1993). The evaluators, having decided upon a comparison group strategy, created east/west and urban/rural stratifications within the state in order to obtain a geographically representative sample. Within five of these subgroups, pairs of welfare offices, matched on local labor market and welfare caseload characteristics, were chosen and randomly allocated to either treatment (FIP) or control (AFDC) status. This project's initial results surprised the researchers: utilization of welfare increased and employment decreased, whereas the intent of the reform was to reduce welfare use and increase employment. The researchers do not attribute these counterintuitive findings to flaws in the comparison site method, but that possibility exists. Again, it is doubtful that random assignment of just five matched pairs is sufficient to assure balance between the treatment and comparison offices on unmeasured variables affecting outcomes, even though the pairs were matched on several characteristics.
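The matched-pair randomization used in the FIP design can be sketched as follows, using only the standard library. The office names, the number of pairs, and the seed are all hypothetical.

```python
import random

# Sketch of the FIP-style design: within matched pairs of welfare offices,
# one member of each pair is randomly assigned to treatment and the other
# to comparison status. Office names are hypothetical.
pairs = [
    ("Office A1", "Office A2"),
    ("Office B1", "Office B2"),
    ("Office C1", "Office C2"),
    ("Office D1", "Office D2"),
    ("Office E1", "Office E2"),
]

rng = random.Random(0)  # fixed seed so the allocation is reproducible
assignment = {}
for left, right in pairs:
    treated = rng.choice([left, right])  # coin flip within the pair
    for office in (left, right):
        assignment[office] = "treatment" if office == treated else "comparison"
```

Note that the pairing guarantees balance only on the characteristics used to form the pairs; with just five coin flips, chance imbalance on unmeasured variables remains quite possible, which is the concern raised in the text.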

The Alabama Avenues to Self-Sufficiency through Employment and Training Services (ASSETS) Demonstration uses a similar strategy for the selection of demonstration and comparison sites, except that only three pairs were chosen. The primary sampling unit was the county, and counties were matched on caseload characteristics and population size (Davis 1993). Results from that study did not match those of a similar study in San Diego in which random assignment of individuals was used to establish the counterfactual comparison group. In the San Diego study, the estimated reduction in food consumption following Food Stamp cash-out was much less.

Pre–post design, using communities. As was noted with respect to individuals, contrasting measurements taken before and after exposure to the treatment is a method that has often been advocated. This procedure can also be applied with communities as the unit of analysis. The attraction of the approach is that the structural and historical conditions unique to the location that might affect the outcome variables are controlled for directly.

Often a pre–post design simply compares a single pre-period measurement with the post-treatment measure of the same variables. However, as in any longitudinal study, multiple measures of the outcome variable (especially in the pre-treatment period) allow for more reliable estimates of change in the variable. This procedure is often referred to as an "interrupted time-series," with the treatment taken to be the cause of the interruption (see, for example, McCleary and Riggs 1982).
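A minimal sketch of the simplest interrupted time-series estimate follows: fit a linear trend to the pre-treatment observations only, extrapolate it as the counterfactual, and take the average gap between observed and projected post-period values as the impact. All figures are invented for illustration.

```python
import numpy as np

# Illustrative interrupted time-series: quarterly outcome measurements,
# 12 pre-intervention and 4 post-intervention observations (made-up data).
pre = np.array([60.0, 61.2, 59.5, 62.1, 61.0, 63.4,
                62.2, 64.0, 63.1, 65.2, 64.4, 66.0])
post = np.array([61.0, 60.2, 59.8, 58.9])

t_pre = np.arange(len(pre))
t_post = np.arange(len(pre), len(pre) + len(post))

# Fit a linear trend to the pre-period only, then extrapolate it into the
# post-period as the counterfactual.
slope, intercept = np.polyfit(t_pre, pre, deg=1)
counterfactual = intercept + slope * t_post

# Estimated impact: average gap between observed and projected values.
impact = float((post - counterfactual).mean())
```

The estimate is only as good as the assumption of a stable linear trend; as the text notes below, community variables rarely evolve in so simple a fashion, and an exogenous shock in the post-period would be misattributed to the treatment.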

The better the researcher's ability to model the process of change in a given community over time, the stronger is this approach. We discuss the evidence on ability to model community change below. Note also that this approach depends on having time-series measures of variables of interest at the community level and therefore runs into the problem, introduced above, of the limited availability of small-area data that measure variables consistently over many time periods. We are often limited to the decennial censuses for small-area measurements.

As with pre–post designs where individuals are the unit of analysis, events other than the treatment--for example, a plant closing, collapse of a transportation network, or reorganization of health care providers--can impinge on the community during the post-treatment period and affect the outcome variable. Those effects would be attributed to the treatment unless a strong theoretical and statistical model is available that can take such exogenous events into account.

We have been unable thus far to locate examples of community pre–post designs using time-series. The EOPP (Brown et al. 1983) used as one of its analysis models a mixture of time-series and comparison communities to estimate program impacts. The model had pre- and post-measures for both sets of communities. The impact of the intervention was estimated as the difference in percentage change (pre- to post-) between the treatment site and comparison site(s). Finally, the Youth Fair Chance demonstration (Dynarski and Corson 1994) has proposed an evaluation design that uses both pre- and post-measures and comparison sites.
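The EOPP-style estimator, the difference in percentage change between treatment and comparison sites, can be sketched in a few lines. The outcome and all figures below are invented for illustration.

```python
# Sketch of the EOPP-style estimator: the impact is the difference in
# percentage change (pre- to post-) between the treatment site and its
# comparison site. All figures are invented for illustration.

def pct_change(pre: float, post: float) -> float:
    """Percentage change from the pre-period to the post-period."""
    return 100.0 * (post - pre) / pre

# e.g. employment rates in the treatment and comparison communities
treatment_pre, treatment_post = 54.0, 58.5
comparison_pre, comparison_post = 55.0, 56.1

impact = (pct_change(treatment_pre, treatment_post)
          - pct_change(comparison_pre, comparison_post))
# A positive value suggests the treatment site improved more than the
# comparison site's trajectory would imply.
```

Working in percentage changes nets out level differences between the two sites, but the estimator still rests on the assumption that, absent treatment, both sites would have changed by the same proportion.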

Problems of spillovers, crossovers, and in- and out-migration. Where comparison communities are used, potential problems arise because of the community's geographic location relative to the treatment site and/or the movement of individuals in and out of the treatment and comparison sites.

Often investigators have chosen communities in close physical proximity to the treatment community on the grounds that it helps to equalize regional influences. However, proximity can cause problems. First, economic, political, and social forces often create specialized functions within a region. For example, one area might provide most of the manufacturing activities while the other provides the services; one area has mostly single-family dwellings while the other features multi-unit structures; one is dominated by Republicans, the other by Democrats; one captures the state's employment services office and the other gets the state's police barracks. These can be subtle differences that can generate different patterns of evolution of the two communities. Second, spillover of services and people can occur from the treatment community to the comparison community, so the comparison community is "contaminated"--either positively, by obtaining some of the services or governance structure changes generated in the treatment community, or negatively, by the draining away of human and physical resources into the now more attractive treatment community.

Two features of the New School's CDC study introduced above make it less susceptible to these types of problems. First, the services being examined relate to housing benefits, which are not easily transferable to nonresidents. Second, the CDCs in the study were not newly established, so to a large extent it can be assumed that people had already made their housing choices based on the available information (though even these prior choices could create a selection bias of unknown and unmeasured degree).

An example where this spillover effect was more troublesome was the evaluation of The School/Community Program for Sexual Risk Reduction Among Teens (Vincent, Clearie, and Schluchter 1987). This was an education-oriented initiative targeted at reducing unwanted teen pregnancies. The demonstration area was designated as the western portion of a county in South Carolina, using school districts as its boundaries. Four comparison sites were selected, one of which was simply the eastern portion of the same county. Because the county is quite homogeneous, the two halves were matched extremely well on factors that might influence the outcome measures (Vincent, Clearie, and Schluchter 1987: 3382). However, a good deal of the information in this initiative was to be disseminated through a media campaign, and the county shared one radio station and one newspaper. Moreover, some of the educational sites, such as certain churches and workplaces, served or employed individuals from both the western and eastern parts of the county (3386). Obviously, a comparison of the change in pregnancy rates between these two areas will not provide a pure estimate of program impact.

In-migration and out-migration of individuals occur constantly in communities. At the treatment site, these migrations might be considered "dilutions of the treatment." In-migration could be due to the increased attraction of services provided or it could just be a natural process that will diversify community values and experiences. Out-migration means loss of some of the persons subject to the treatment. Focusing data collection only on those who stay in the community creates a selection bias arising from both migration processes. Also, it is not clear whether the program treatment itself influenced the extent and character of in- and out-migration.

Dose-response models of treatment versus comparison communities. Sites can vary in the types and/or intensity of treatment, and this variation in dosage can be examined as part of the evaluation. For example, the South Carolina teen pregnancy prevention program, discussed above, could be viewed as having three different "treatment" groups: the western part of the county received full treatment, the eastern part of the county received moderate treatment, and the three noncontiguous comparison counties received little to no treatment.

The changes in estimated pregnancy rates across these three groups seem consistent with a "dosage" effect. The noncontiguous comparison communities' estimated pregnancy rates stayed the same or increased, the rates in the eastern portion of the county were reduced slightly, and those in the western portion were more than halved (Vincent, Clearie, and Schluchter 1987). Of course, these estimates should be viewed with caution given the small sample size and the failure to control statistically for even observed differences between communities other than dosage of treatment.
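The three-tier comparison just described amounts to checking whether mean outcome changes improve monotonically with treatment intensity. A sketch, with invented figures loosely patterned on the pattern reported for the South Carolina study:

```python
# Sketch of a "dosage" comparison: mean change in the outcome (e.g. the
# estimated pregnancy rate) by treatment intensity. All figures invented.
changes_by_dose = {
    "none": [0.5, 1.2, 0.0],     # noncontiguous comparison counties
    "moderate": [-5.0],          # eastern portion of the county
    "full": [-30.0],             # western portion of the county
}
mean_change = {dose: sum(v) / len(v) for dose, v in changes_by_dose.items()}

# A dose-response pattern: the outcome declines monotonically as the
# dosage of treatment increases.
ordered = [mean_change[d] for d in ("none", "moderate", "full")]
is_monotone = all(a > b for a, b in zip(ordered, ordered[1:]))
```

A monotone gradient is suggestive but, as the text goes on to argue, it does not by itself rule out the possibility that the dosage levels are confounded with underlying differences among the communities.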

Another example of dose-response methodology is an evaluation of a demonstration targeted at the prevention of alcohol problems (Casswell and Gilmore 1989). Six cities were chosen and then split into two groups of three cities each, based on sociodemographic similarity. Within these groups, each city received a treatment of different intensity. One was exposed to both a media campaign and the services of a community organizer; the second had only the media campaign; and the third had no treatment. In this way researchers could examine the effect of varying levels of intervention intensity to determine, for instance, if there was an added benefit to having a community organizer available (in addition to the media campaign). It should be noted, however, that random assignment of cities within groups had to be sacrificed in order to avoid possible spillover effects from the media campaign. Results showed positive, though generally tiny, effects for many of the variables studied. As we would expect, the magnitude of the effects tended to grow with the intensity of the "dosage level." That is, the communities with the media campaign and a community organizer generally experienced stronger impacts than the communities with only a media campaign.

Most important, this procedure does not get around the underlying problem of comparison communities--the questionable validity of the assumption that once matched on a set of characteristics, the communities would have evolved over time in essentially the same fashion with respect to the outcome variables of interest. If this assumption does not hold, then the "dose of treatment" will be confounded in unknown ways with underlying differences among the communities, once again a type of selection bias.

The Magnitude of Problems with Comparison Communities Methods: A Case Study. A recent study allows us to get a fix on the magnitude of bias that can arise when comparison community designs of the several types just reviewed are used. This study used data from the Manpower Demonstration Research Corporation's (MDRC) Work/Welfare studies in several states (Friedlander and Robins 1994). Once again, as in the studies using National Supported Work Demonstration data, cited above (Fraker and Maynard 1987, LaLonde 1986, LaLonde and Maynard 1987), the basic data were generated by experiments using random assignment of individuals. In this case, the investigators took the treatment group from the Work/Welfare experiments and constructed alternative comparison groups from control groups drawn from other program locations or other time periods. For example, they used the treatment group from one state and the control group from another state. They also used a treatment group from one geographic location within a state or city and the control group from another geographic location within the same state or city. Finally, they used the treatment group from one time period at a given site and the control group from another time period at the same site to create "across-cohort" comparisons, similar to a pre–post study of a single community. In addition to trying these different strategies for constructing groups, the investigators also tried matching groups on different measured characteristics. And they tried some sophisticated specification tests, suggested by Heckman and others, intended to improve the match of the constructed comparison groups to the treatment groups (see Heckman and Hotz 1989).

This study is, in our view, so important that we have provided an appendix in which the results are discussed in detail and some of the estimates of magnitude of bias are summarized (Table A1). The study showed substantial differences between the estimated impacts from the true experimental results and the constructed comparison groups. In many cases, not only was the magnitude of the effect estimated from the constructed comparison group different from the "true effect" estimates provided by the random assignment control group, but the direction of the effect was different. Overall, the results from this study show substantial bias with all methods, but also that, at least for these data, comparison groups constructed from different cohorts at the same site perform somewhat better than the other types of comparison groups.3

The importance of this study is that it clarifies the problem of bias arising when comparison groups are constructed by methods other than random assignment, and it points to the severity of the problem. It shows that statistical controls using measured characteristics are in most cases inadequate to overcome this problem.

It has long been recognized that counterfactuals obtained by using constructed comparison groups (as opposed to control groups obtained by random assignment) may, in theory, yield biased estimates of the true impact of a program. What is important about this study is that it demonstrates, through the use of actual program data, that various types of constructed comparison groups yield substantially biased estimates. These real-life experiments demonstrate that investigators could have been seriously misled in their conclusions about the effectiveness of these programs had they used methods other than random assignment to construct their comparison groups. Moreover, we must keep in mind that these studies created comparison groups after the fact, with the luxury of making adjustments to potential comparison groups using all the data from the study. The problems described above are likely to be exacerbated when one is developing a design for an evaluation and must make a priori judgments about the extent of bias that might occur in the results.

Statistical Modeling of Community-Level Outcomes. Another approach to creating counterfactuals for the evaluation of community-level interventions is statistical modeling. This approach develops a statistical model of what would have happened to a particular outcome or set of outcomes at the community level had an intervention not been instituted. The predictions from the model are then used as the counterfactual and are compared with what happens in the community following the intervention. The difference is the estimated impact of the intervention.

Time-series modeling. Time-series models of community-level outcomes have long been advocated as a means of assessing the effects of program innovations or reforms (Campbell and Stanley 1966). In the simplest form, the time-series on the past values of the outcome variable for the community is linearly extrapolated to provide a predicted value for the outcome during and after the period of the program intervention. In a sense, the pre–post designs discussed above are a simple form of this type of procedure. It has been recognized for a long time that the simple extrapolation design is quite vulnerable to error because, even in the absence of any intervention, community variables rarely evolve in a simple linear fashion. An example of this procedure is a study assessing the impact of seat-belt legislation in Australia. The researchers used twenty years of fatality data to predict the number of deaths there would have been in the absence of the new legislation (Bhattacharyya and Layton 1979).

Some attempts have been made to improve on the simple linear form by introducing some of the more formal methods of time-series modeling.4 Introducing non-linearities in the form can allow for more complex reactions to the program intervention (McCleary and Riggs 1982). One study had a series of cohorts enrolled in a program over time and used the pre-enrollment data for a later cohort (enrolled at time t) as the comparison with the in-program data of an earlier cohort (enrolled at time t-1) (McConnell 1982).

The problem with these methods is that they do not always explicitly control for variables, other than the program intervention, that may have influenced the outcome variable.

Multivariate statistical modeling. Some attempts have been made to estimate multivariate models of the community-level outcome variables in order to generate counterfactuals for program evaluation.5 These multivariate models would attempt to specify, measure, and estimate the effects of the variables that determine the community-level outcome that are not themselves affected by the treatment. Then, with these variables "controlled," the effect of treatment would be estimated.

We have not been able to find examples of this approach at the community level, but there are several examples of attempts to estimate caseload models at the state or national level for programs such as AFDC and Food Stamps (Grossman 1985, Beebout and Grossman 1985, Garasky 1990, Garasky and Barnow 1992, Mathematica Policy Research 1985). Most analysts consider the results of these models to be unreliable for program evaluation purposes. For example, an attempt was made to model the AFDC caseload in New Jersey in order to assess the effect of a welfare reform. However, subsequent to the reform, the effects of changes in the low-wage labor market appeared to have swamped any changes in AFDC caseload predicted by the model, leading to implausible estimates of the impact of the welfare reform on AFDC levels. The model was unable to capture the way in which the low-wage labor markets operated to affect AFDC caseloads. Also recall, in the examples discussed above, how comparison communities in EOPP were affected by floods, hurricanes, and volcanic eruptions, or in YIEPP, where court-ordered school desegregation occurred in the comparison community. Adequate statistical modeling would have to attempt to incorporate such factors.
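In outline, the modeling approach being described can be sketched as follows, assuming a single predictor (local unemployment) and simulated pre-period data. Every number here, including the coefficients used to generate the data, is invented for illustration.

```python
import numpy as np

# Sketch of a multivariate counterfactual model with a single predictor
# and simulated pre-period data. All figures are invented.
rng = np.random.default_rng(0)
unemployment = rng.uniform(4, 10, size=20)            # pre-period predictor
caseload = 1000 + 150 * unemployment + rng.normal(0, 50, size=20)

# Estimate the model on pre-intervention periods only.
X = np.column_stack([np.ones_like(unemployment), unemployment])
beta, *_ = np.linalg.lstsq(X, caseload, rcond=None)

# Project the counterfactual caseload for a post-period unemployment
# rate of 7.5, and compare with a hypothetical observed post-period value.
predicted = beta[0] + beta[1] * 7.5
observed = 1900.0
estimated_impact = observed - predicted               # negative: caseload fell
```

The New Jersey experience suggests the weak link: if an omitted force (there, the low-wage labor market) moves the outcome after the reform, the projection is wrong and the impact estimate with it, no matter how well the model fit the pre-period.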

Statistical modeling at the community level also runs up against the persistent lack of small-area data, particularly data available on a consistent basis, over several periods of time or across different communities. Such data are necessary both to estimate the statistical model of the community-level outcome and to project the counterfactual value of the outcome for the program period. For example, if the model includes local employment levels as affecting the outcome, then data on local employment during the program period must be available to use in the model.

Research Questions to Address in the Context of Community-Wide Initiatives

In this section we outline the types of research questions that are of particular relevance to community-wide initiatives and that, with the development of new evaluation strategies, might be investigated. This set of questions goes well beyond the simple models of a single treatment affecting a single outcome or even multiple treatments affecting multiple outcomes. Rather, we focus on several types of multivariate effects. These are effects that help explain how the participants' characteristics might influence treatment outcomes, how various dimensions of one treatment or multiple types of treatments may interactively affect treatment outcomes, and how different configurations of participant or institutional characteristics may produce different outcomes.

It seems evident that arguments for carrying out community-wide interventions are based on assumptions about the importance of several of these types of more complicated theories of how and for whom an intervention will work. Brown and Richman illustrate a key aspect of this multidimensional framework: "Too often in the past, narrowly defined interventions have not produced long-term change because they have failed to recognize the interaction among physical, economic and social factors that create the context in which the intervention may thrive or flounder" (Brown and Richman 1993, 8). Commentators have classified such interactions in a variety of ways: contagion or epidemic effects, social capital, neighborhood effects, externalities, and social comparison effects. We have not taken the time to carefully catalog and reorder these classifications (though such an analysis might help with an orderly development of evaluation research). We simply give examples of some broad categories that might be of concern to evaluators of community-wide initiatives.

It is important to be clear at the outset that credible estimation of these more complicated models of treatment-outcome linkages at the individual level depends on the presence of one or more "control" groups created by random assignment of individuals.6 Given the apparent lack of feasibility of random assignment at the community level and the terribly flawed alternatives to it, answers to these research questions with respect to communities await further methodological work.

Networks and Group Learning

The importance of associational networks has been increasingly emphasized in the literature on communities and families. Some interventions may seek to operate directly on networks, having social network change as either an intermediate or final outcome of interest. These networks can also affect the way in which information about the form of the intervention and its treatment of individuals in various circumstances is likely to be passed from individual to individual. As a result, the group learning about the intervention is likely to be faster and greater than the learning of the isolated individual. This faster, deeper, and perhaps different communication of information could, in turn, change the ways individuals in different associational networks respond to the intervention.

Stronger forms of interaction within networks are what some have called "norm formation" (see Garfinkel, Manski, and Michalopoulos 1992). Network norms are potentially important in this context in two ways: pre-existing norms could either impede or facilitate response to the intervention, and new norm formation in response to the intervention could reshape pre-existing norms. For example, the existence of "gang cultures" may impede interventions, or some interventions may seek to use norm formation processes within gang cultures to reshape the norms of the gang and enlist it in promoting the goals of the intervention.

The evaluation problems will differ depending on how these associational networks are considered. For example, suppose the objective is to test how different associational networks affect response to a given intervention. If networks are measured and classified prior to the intervention, then individuals could be broken into different subgroups according to network type, and subgroup effects could be analyzed in the usual manner.

To the extent that the network characteristics are outcome variables (intermediate or final), they can be measured and the impact of the intervention upon them analyzed in the same fashion as for other outcome variables. However, the reliability and consistency of measures of associational networks may be problematic, as may be the determination of other relevant properties such as their normal variance or their likely sensitivity to intervention impacts.

Notice that the previous paragraphs take the network to be something that can be treated as a characteristic of the individual and the individual as the unit of analysis. These analyses could be carried out even without a community-wide intervention. Most would argue, however, that the group learning effects are really most important when groups of people, all subject to the intervention, interact. In such cases, we are immediately faced with the problems covered earlier in the discussion of using communities as constructed comparison groups: since random assignment of individuals to the treatment or control group is precluded when one wishes to treat groups of individuals who are potentially in the same network, testing for this form of interaction effect will be subject to the same problems of selection bias outlined above.

Effects of Formal and Informal Institutions

Most interventions take the form of an attempt to alter some type of formal institution that affects individuals: a day care center, a welfare payment, an education course. The interactions of formal institutions with treatments have been evaluated, for example, in studies of food-stamp cash-outs. In this case, we would ask, "Is the impact of the treatment affected by the way in which it is delivered to participants--as a food stamp versus as a cash payment?"

However, most of those concerned with community-wide initiatives appear to be more interested either in the way the formal institutional structure in a given community conditions individuals' responses or in the behavior of the formal institutions themselves as outcomes of the intervention.

With respect to the former concern, some studies seek to have the formal institutional structure as one of the criterion variables by which communities are matched and thus seek to neutralize the impact of interactions of formal institutions and the treatment. Both the Healthy Start and the school dropout studies have already been mentioned as examples in which matching formal institutional structures is a concern in selection of comparison sites, and we have already mentioned the problems of measurement and the limits of statistical gains from such attempted matches.

Access to a formal institution is sometimes regarded as an outcome, and it may be easy to measure--for example, the number of doctor visits by pregnant women or participation in bilingual education programs. The behavior of the institution may also be the outcome variable of interest. For example, do schools change their tracking behavior? Are police procedures for intervention in domestic violence altered? In these cases, the institution itself, rather than individuals, may be the primary unit of analysis. Then we must face all the aspects of sample design with institutions as the unit of analysis if we wish to use formal statistical inference to estimate intervention effects on institutional behavior.

Informal institutions are also subjects of interest. The associational networks discussed above are surely examples, as are gangs. But there are informal economic structures that also fall into this category. The labor market is an informal institution whose operations interact with the intervention and condition its impact. This can be most concretely illustrated by reference to a problem sometimes discussed in the literature on employment and training programs: "displacement." The basic idea is that workers trained by a program may enter the labor market and become employed, but if involuntary unemployment has already occurred in the relevant labor market, total employment may not be increased because that worker simply "displaces" a worker who would have been employed in that job had the newly trained worker not shown up.7 An evaluation with a number of randomly assigned treatment and control group members that is small relative to the size of the relevant labor market would be unable to detect such "displacement" effects; the trained treatment group member is not likely to show up at exactly the same employer as the control group member would have. It has been argued by some that use of community-wide interventions in employment and training would provide an opportunity to measure the extent of such "displacement effects" because the size of the intervention would be large relative to the size of the local labor market. Indeed one of the hopes for the YIEPP was that it would provide such an opportunity. But, as the experience with YIEPP, described above, illustrates, the use of comparison communities called for in this approach is subject to a number of serious pitfalls.8
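The arithmetic of displacement can be made concrete with a toy calculation; the figures are invented.

```python
# Toy arithmetic of displacement: if a fraction of newly trained workers
# simply take jobs that others would have filled, the net employment gain
# is smaller than the gross count of trainees employed. Figures invented.
trainees_employed = 200
displacement_rate = 0.4          # assumed fraction of placements displaced
net_jobs_created = trainees_employed * (1 - displacement_rate)
# An individual-level experiment would credit the program with all 200
# placements; only an intervention large relative to the local labor
# market could reveal that the net gain is smaller.
```

The displacement rate is precisely the quantity a small random-assignment study cannot observe, which is why saturation-scale, community-wide interventions were hoped to shed light on it.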

Interactions with External Conditions

Some attempts have been made to see how changes in conditions external to an intervention--experienced by both the treatment and control group members--have conditioned the response to the treatment. For example, in the National Supported Work Demonstration, attempts were made to see if the response to the treatment (supported work) varied systematically with the level of local unemployment. In that case there were no statistically significant differences in response, but researchers felt this may well have been due to the weakness of the available city-by-city unemployment statistics.

Dynamics of Treatment and Response over Time

An intriguing and largely unaddressed question for the evaluation of community-wide initiatives is how to represent the dynamics of interventions that change over time in response to lessons learned from implementation, where the alterations are largely idiosyncratic. Although some evaluators might prefer to delay their initial measures of outcomes until the program has stabilized and matured, many community-level initiatives are not expected to achieve a "steady state" but rather to evolve constantly in response to incoming results.9

Similarly, few attempts have been made to measure changes in the response of communities and their residents to treatments over time. Evaluations of employment and training programs have carried out post-program measurements at several points in time in an attempt to measure the time path of treatment effects. These time paths are important for the overall cost-benefit analyses of these programs because the length of time over which benefits are in fact realized can greatly influence the balance of benefits and costs. For example, studies have shown cases in which impacts appear in the early post-program period and then fade out quickly thereafter (as is often claimed about the effects of Head Start) and cases in which no impacts are found immediately post-program but emerge many months later (for example, in the evaluation of the Job Corps). Similar issues will arise in the evaluation of community-based initiatives, and the tracking of outcomes over longer periods of time would appear to be a step toward addressing this issue.

Steps in the Development of Better Methods

We can make no strong recommendations on how best to approach the problem of evaluating community-wide initiatives. When the random assignment of individuals to treatment and control groups is precluded, no surefire method exists for assuring that the evaluation will avoid problems of selection bias. Constructed comparison groups--whether of individuals or communities--are problematic, and pre–post designs remain vulnerable to exogenous shifts in the context that may affect outcome variables in unpredictable (and often undetectable) directions. As of now, we do not see clear indications of what second-best methods might be recommended, nor have we identified what situations make a given method particularly vulnerable.

It is important to stress, once again, that the vulnerability to bias in estimating the impacts of interventions should not be taken lightly. First, the few existing studies of the problem show that the magnitude of errors in inference can be quite substantial even when the most sophisticated methods are used. Second, the bias can be in either direction: we may not only be led to conclude that an intervention has had what we consider to be positive impacts when in fact it had none; we may also find ourselves confronted with biased impact estimates indicating that an innocuous--or perhaps even valuable--intervention was actually harmful. As a result, we may end up promoting policies that use up resources and provide few benefits or we may recommend discarding interventions that actually have merit. Once these biased findings are in the public domain, it is very hard to get them dismissed or revisited and to prevent them from influencing policy decisions.

Beyond these rather dismal conclusions and admonitions, the best we can suggest at this time are some steps that might improve our understanding of how communities evolve over time and thereby help us create methods of evaluation that are less vulnerable to the types of bias we have pointed out.

1. Improve small-area data. We have stressed at several points that detailed small-area demographic data are very hard to come by except at the time of the decennial census. The paper by Claudia Coulton in this volume provides further confirmation of this problem and some suggestions for remedying it. Increasingly, however, records data are being developed by a wide variety of entities that can be tied to specific geographic areas (geo-coded data). One type of work that might be fruitfully pursued would combine various types of records data with data taken from two or more censuses.10 At the base period, correlations of the records data with census variables would be established. Then the time-series of the records data would be used, along with the baseline correlations, to predict the end-period values for the census variables. If the predictions were reasonably close, then the records data would provide a basis for tracking small-geographic-area variables in the intercensal period.
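The two-step logic just described--calibrate against the census at the base period, then predict from the records time-series--can be sketched as follows. All data here are simulated and the variables (a tract-level poverty rate, two geo-coded records indicators) are hypothetical, chosen only to illustrate the procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tract-level data: two geo-coded records indicators
# (e.g., caseload counts) and a census poverty rate, base-period year.
n_tracts = 200
records_base = rng.normal(size=(n_tracts, 2))
census_base = (0.20 + 0.05 * records_base[:, 0] + 0.03 * records_base[:, 1]
               + rng.normal(scale=0.02, size=n_tracts))

# Step 1: establish base-period correlations via least squares.
X = np.column_stack([np.ones(n_tracts), records_base])
beta, *_ = np.linalg.lstsq(X, census_base, rcond=None)

# Step 2: apply those coefficients to the records time-series in a
# later (intercensal) year to predict the unobserved census variable.
records_later = records_base + rng.normal(scale=0.1, size=records_base.shape)
census_pred = np.column_stack([np.ones(n_tracts), records_later]) @ beta

# Step 3: when the next census arrives, check how close the
# predictions came (here the "actual" values are also simulated).
census_actual = (0.20 + 0.05 * records_later[:, 0] + 0.03 * records_later[:, 1]
                 + rng.normal(scale=0.02, size=n_tracts))
mae = np.mean(np.abs(census_pred - census_actual))
print(f"mean absolute prediction error: {mae:.4f}")
```

If the prediction error at the end-period census is acceptably small, the same fitted relationship could be used to track the variable between censuses.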

Our experience with the availability of records data at the state level (when working on the design of the evaluation of the Pew Charitable Trusts' Children's Initiative) convinced us that far more systems-wide records are being collected--in many cases with individual- and geographic-area-level information--than we would have thought. Much of the impetus for the development of these data systems comes from the federal government in the form of program requirements (both for delivery of services and for accountability) and, more importantly, from federal financial support for systems development.

Evaluations of employment and training programs have already made wide use of Unemployment Insurance records and these records have broad coverage of the working population. More limited use has been made of Social Security records. In a few cases, it has been possible to merge Social Security and Internal Revenue Service records. Birth records collection has been increasingly standardized and some investigators have been able to use time-series of these records tied to geographic location. The systems records, beyond these four, cover much more restricted populations--for example, Welfare and Food Stamps, Medicaid and Medicare, and WIC.

More localized record systems include education and criminal justice records but they present greater problems of developing comparability. In some states, however, statewide systems have been, or are being, developed to draw together the local records.

We are currently investigating other types of geo-coded data that might be relevant to community-wide measures. Data from the banking system have become increasingly available as a result of the Community Reinvestment Act and the Home Mortgage Disclosure Act (HMDA). Local real estate transaction data can sometimes be obtained, but information from tax assessments seems harder to come by.

In all of these cases, whenever individualized data are needed, problems of confidentiality present substantial barriers to general data acquisition by anyone other than public authorities. Even with the census data, there are many variables for which one cannot get data at a level of aggregation below block group level.

2. Enhance community capability to do systematic data collection. We believe that it is possible to pull together records data of the types just outlined to create community data bases that could be continuously maintained and updated. These data would provide communities with some means to keep monitoring, in a relatively comprehensive way, what is happening in their areas. This would make it possible to get better time-series data with which to look at the evolution of communities. To the degree that communities could be convinced to maintain their records within relatively common formats, an effort could be made to pull together many different communities to create a larger data base that would have a time-series, cross-section structure and would provide a basis for understanding community processes.

Going a step beyond this aggregation of records, attempts could be made to enhance the capability of communities to gather new data of their own. These could be anything from simple surveys of physical structures based on externally observed characteristics (type of structure, occupied, business or organization, public facility, and the like), carried out by volunteers within a framework provided by the community organization, to full-scale household surveys on a sample or on a census basis.

3. Create a panel study of communities. As already noted above, if many communities used common formats to put together local records data, one would have the potential for a time-series, cross-section data base. In the absence of that, admittedly unlikely, development, it might be possible to imitate the several nationally representative panel studies of individuals (the Panel Study of Income Dynamics, the National Longitudinal Survey of Youth, or High School and Beyond, to name the most prominent), which have been created and maintained, in some cases, since the late 1960s. Here the unit of analysis would be communities--somehow defined. The objective would be to provide the means to study the dynamics of communities. Such a study would provide us with important information on what the cross-section and time-series frequency distributions of community-level variables look like--important ingredients, we have argued above, for an evaluation sample design effort with communities as the units of observation. That would provide the best basis for our next suggestion, work on modeling community-level variables.

Short of creating such a panel study, some steps might be taken to pull together, across federally funded research projects, the information developed on various community-level measures. In an increasing number of studies, community-level data are gathered for evaluating or monitoring programs or for comparison communities. We noted above several national studies that were using a comparison-site methodology (Healthy Start, Youth Fair Chance, the School Dropout Study), and some gains might be made if coordination efforts resulted in pooling some of these data.

4. Model community-level variables. As we mentioned above, statistical modeling might provide the basis for generating more reliable counterfactuals for community initiatives. A good model would generate predicted values for endogenous outcome variables for a given community in the absence of the intervention by using an historical time-series for that community and such contemporaneous variables as are judged to be exogenous to the intervention. At least such models would provide a better basis for attempting to match communities if a comparison-community strategy is attempted.
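A minimal sketch of what such a model might look like, under wholly assumed data: a trend fitted to a single community's pre-intervention history is projected forward as the counterfactual, and the post-period deviation from that projection is read as the impact estimate. Real applications would use richer time-series models and exogenous covariates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical community outcome (say, an employment rate) observed
# annually for 15 years; the intervention begins in year 10 and is
# assumed here to raise the outcome by 3 percentage points.
years = np.arange(15)
outcome = 0.50 + 0.004 * years + rng.normal(scale=0.005, size=years.size)
outcome[10:] += 0.03

# Fit a linear trend on the pre-intervention history only...
pre = years < 10
coef = np.polyfit(years[pre], outcome[pre], deg=1)

# ...and project it forward as the counterfactual for the post period.
counterfactual = np.polyval(coef, years[~pre])
impact = outcome[~pre] - counterfactual
print(f"estimated impact per post-intervention year: {impact.mean():.3f}")
```

The same fitted model, applied to candidate comparison communities, could also serve as a screening device when matching communities for a comparison-community design.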

5. Develop better measures of social networks and formal and informal community institutions. We have not studied the literature on associational networks in any depth, so our characterization of the state of knowledge in this area may be incorrect. However, it seems to us that considerably more information on and experience with various measures of associational networks are needed, given their central role in most theories relating to community-wide processes.

Measures of the density and character of formal institutions appear to us to have been little developed--though, again, we have not searched the literature in any depth. There are industrial censuses for some subsectors. We know of private-sector sources that purport to provide reasonably comprehensive listings of employers. Some Child Care Resource and Referral Networks have tried to create and maintain comprehensive listings of child-care facilities. There must be comprehensive listings of licensed health-care providers. Public schools should be comprehensively listed. However, when (for recent projects) we have discussed how one would comprehensively survey formal institutions, the choice of a potential sampling frame was not at all clear.

Informal institutions present even greater problems. Clubs, leagues, volunteer groups, and so forth are what we have in mind. Strategies for measuring such phenomena on a basis that would provide consistent measures over time and across sites need to be developed.

6. Tighten relationships between short-term (intermediate) outcome measures and long-term outcome measures. The inability or unwillingness to wait for the measurement of long-term outcomes is a problem that many studies of children and youth, in particular, face. Increasingly we talk about "youth trajectories." Again, perhaps good comprehensive information--of which we are unaware--exists, linking many short-term, often softer measures of outcomes to the long-term outcomes further along the trajectory. We find ourselves time and again asking, for example, What do we know about how that short-term measure--participation in some activity (say, Boy Scouts)--correlates with a long-term outcome (say, employment and earnings)? Even more rare is information on how program-induced changes in the short-term outcome are related to changes in long-term outcomes. We may know that the level of a short-term variable is highly correlated with a long-term variable, but we do not know to what extent a change in that short-term variable correlates with a change in the long-term variable. Thus, we believe systematic compilations of information about short-term and long-term correlations for outcome variables would be very helpful and could set an agenda for more data-gathering on these relationships where necessary.
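The distinction drawn above--a short-term measure whose level correlates strongly with a long-term outcome, even though changes in the two are unrelated--can be illustrated with simulated data. Everything here is hypothetical; the point is only that a high level correlation tells us nothing about the change correlation an evaluator actually needs:

```python
import numpy as np

rng = np.random.default_rng(3)

# A stable person-level factor drives both measures, so their LEVELS
# correlate; but the over-time CHANGES are generated independently.
n = 1000
person = rng.normal(size=n)
short_t0 = person + rng.normal(scale=0.5, size=n)   # short-term measure, time 0
long_t0 = person + rng.normal(scale=0.5, size=n)    # long-term outcome, time 0
short_t1 = short_t0 + rng.normal(size=n)            # change unrelated to long outcome
long_t1 = long_t0 + rng.normal(size=n)

level_corr = np.corrcoef(short_t0, long_t0)[0, 1]
change_corr = np.corrcoef(short_t1 - short_t0, long_t1 - long_t0)[0, 1]
print(f"correlation of levels: {level_corr:.2f}, of changes: {change_corr:.2f}")
```

In this construction the levels correlate at roughly 0.8 while the changes are essentially uncorrelated, which is exactly the trap of inferring program effects on long-term outcomes from program-induced movement in a short-term proxy.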

7. Conduct more studies to determine the reliability of constructed comparison group designs. We have stressed the importance of information provided by the two sets of studies (reported in Fraker and Maynard 1987; LaLonde 1986; and Friedlander and Robins 1994) that used random assignment data as a base and then constructed comparison groups to test the degree of error in the comparison group estimates. It should be possible to find more situations in which this type of study could be carried out. First, the replication of such studies should look at variables other than employment or earnings as outcomes to determine whether any difference in degrees of vulnerability exist according to the type of outcome variable and/or a different type of intervention. Second, more studies of this type would give us a far better sense of whether, indeed, the degree of vulnerability of the nonexperimental methods is persistent and widely found in a variety of data sets and settings.

Community-wide programs present special problems for evaluators because the "nectar of the gods"--random assignment of individuals to program treatment and to a control group--is beyond their reach. The central problem of impact evaluations, creating a reasonable and convincing counterfactual (what would have happened in the absence of the program intervention), remains a major challenge. Our review of the experience to date with alternative methods is generally discouraging; no clear second-best method emerges from the review.

Nonetheless, we feel that it is very important for evaluators to understand this message and to convey it clearly to those who look to them for evidence of program effectiveness. In addition, we feel it is important to push forward in the effort to build a stronger foundation for understanding how communities evolve over time. That understanding should enhance the ability of evaluators to determine how community-wide program interventions alter the course of a community's evolution.

Appendix: Some Details on the Friedlander-Robins Study

In this appendix we discuss in detail some of the major findings of the study by Friedlander and Robins (1994). These details indicate the possible relative magnitude of problems with several of the alternative methods for constructing comparison groups.

Recall that this study used data from a group of work-welfare studies. In the base studies themselves, random assignment of individuals was used to create control groups; Friedlander and Robins then drew the treatment group from one segment of the data and the comparison group from a different segment (thereby "undoing" the random assignment). It was then possible to compare the effects estimated from the treatment-comparison group combination with the "true effects" estimated from the random assignment treatment-control differences (the same treatment group outcome is used in each difference estimate).

Even more salient for the purposes of this paper, Friedlander and Robins were able to generate types of comparison groups that have often been suggested for community-wide program evaluation. In one type of constructed comparison, they were able to use the treatment group from communities in one state and a comparison group made up from the control group in a community in another state. In a second type, they used the treatment group entering the program from one office within a city, with the control group drawn from a different office within the city--a procedure that would be quite similar to a comparison neighborhood strategy. In a third type of comparison they used the treatment group from one period of time and a control group in the same site from another period of time, which would be like a pre- and post-treatment comparison in a single community.

Recall that this study is able to establish the degree of bias because the estimated impact results of the constructed comparison group are compared with the "true" impact results obtained from the randomly assigned treatment and control groups. With all three of the types of comparisons just described--comparison across state, comparison within state but across offices, and comparison of before-and-after cohorts in the same site--the average amount of bias was substantial.
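The mechanics of this bias check can be sketched with simulated data for the across-state case. All rates and effect sizes below are assumptions, not numbers from the study; the point is that when sites differ in baseline conditions, pairing one site's treatment group with another site's control group builds the cross-site difference directly into the "impact" estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Within each hypothetical site, individuals are randomly assigned.
# The sites differ in baseline employment rates (0.30 vs. 0.40).
def site_outcomes(base_rate, effect, n=5000):
    treat = (rng.random(n) < base_rate + effect).mean()
    control = (rng.random(n) < base_rate).mean()
    return treat, control

t_a, c_a = site_outcomes(base_rate=0.30, effect=0.05)  # site A
t_b, c_b = site_outcomes(base_rate=0.40, effect=0.05)  # site B

true_impact = t_a - c_a    # experimental estimate at site A
constructed = t_a - c_b    # site A treatments vs. site B controls
bias = constructed - true_impact
print(f"experimental {true_impact:.3f}, constructed {constructed:.3f}, bias {bias:.3f}")
```

Here the constructed estimate is biased by roughly the full cross-site difference in baseline rates, which is large relative to the true effect, illustrating why the averaged bias in the study could be substantial.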

The bias in using the constructed comparison groups occurred not just in the magnitude of the estimated impact but also in the nature of the inference. A different statistical inference occurs in two situations: when only one of the two impact estimates is statistically significant (for example, the random assignment estimates showed a statistically significant positive impact while the constructed comparison group estimates indicated no statistically significant impact), or when both estimates are statistically significant but have opposite signs (for example, random assignment showed a statistically significant positive impact while the constructed comparison indicated a statistically significant negative impact). For most of the methods of comparison, over 30 percent of the cases had such a conflict in statistical inference, and even in the best example, 13 percent of the cases had such conflicts. The point here is that when constructed comparison groups are used there is a substantial risk, not only of getting the order of magnitude of the impact wrong, but also of drawing the wrong conclusion about whether any impact exists or whether the direction of the impact is positive or negative.

It has long been recognized that counterfactuals obtained by using constructed comparison groups (as opposed to control groups obtained by random assignment) may, in theory, yield biased estimates of the true impact of a program. What is important about the Friedlander-Robins study is that it demonstrates, through the use of actual program data, that various types of constructed comparison groups yield very substantially biased estimates; this is not just a theoretical possibility, since the comparison group methods would actually have given very biased results had they been used rather than random assignment in evaluating these work/welfare programs. We reproduce here part of one table from their study (Table A1).

The data are drawn from four experiments carried out in the 1980s (in Arkansas, Baltimore, San Diego, and Virginia). The outcome variable is whether employed (the employment rate ranged from a low of .265 in Arkansas to a high of .517 in Baltimore). Across the top of the table there is a brief description of how the comparison group was constructed, using four different schemes for construction.

In the first two columns, the two across-site methods use the treatment group from one site (for example, Baltimore) with the control group from another site (say, San Diego) serving as the comparison group. In the second column the term "matched" indicates that each member of the treatment group was matched with a member of the comparison group using the Mahalanobis "nearest neighbor" method, and the estimates of the impact were then measured as the difference between the treatment group and the matched comparison group. In the first column no such member-by-member match was done; however, the regression equation in which the impact is estimated includes variables for characteristics, and this controls for measured differences in characteristics between the two groups.

The "Within-Site/Across-Cohort" category in column three builds on the fact that the samples at each site were enrolled over a fairly long time period, and it was, therefore, possible to split the sample in two parts--those enrolled before a given date, called the "early cohort," and those enrolled after that date, called the "late cohort." The treatment group from the "late cohort" is used with the control group from the "early cohort" as their comparison group. This approximates a pre-post design for a study.

Finally, in column 4, for two of the sites the work-welfare program was implemented through several local offices. It was possible, therefore, to use the treatment group from one office with the control group from the other office as a comparison group. This procedure approximates a matching of communities in near proximity to each other.

The first row of the table gives the number of pairs tested. This is determined by the number of sites, the number of outcomes (employment outcomes at two different post-enrollment dates were used), and the number of subgroups (AFDC applicants and AFDC current recipients). The number of pairs gets large because each site can be paired with each of the three other sites. The smaller number of pairs in the "Within-Site/Across-Office" category (column 4) occurs because there were only two sites with multiple offices.

The next row gives the means of the experimental estimates--that is, the "true impact estimates" from the original study of randomly assigned treatment-control differentials. Thus, for example, the mean experimental estimate of the treatment-control difference in employment rates across all four sites was 5.6 percentage points.

The next row compares the results of the estimates using the constructed comparison groups with the "true impact" experimental estimates, averaged across all pairs. For example, the mean absolute difference between the "true impact" estimate and those obtained by the constructed comparison groups across-site/unmatched was .09; that is, the difference between the two sets of estimates was, on average, more than 1.5 times the size of the "true impact"!

The next row tells the percentage of the pairs in which the constructed comparison group estimates yielded a different statistical inference from the "true impact" estimates. A different statistical inference occurs when only one of the two impact estimates is statistically significant or both are statistically significant but have opposite signs. A 10 percent level of statistical significance was used.
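The classification rule just described can be restated as a small function. This is our paraphrase of the rule, not code from the study; the 10 percent two-sided significance level corresponds to a critical value of about 1.645:

```python
def different_inference(exp_est, exp_se, nonexp_est, nonexp_se, z=1.645):
    """Return True if the experimental and nonexperimental impact
    estimates lead to conflicting statistical inferences at the 10
    percent (two-sided) significance level."""
    exp_sig = abs(exp_est) > z * exp_se
    nonexp_sig = abs(nonexp_est) > z * nonexp_se
    # Conflict case 1: exactly one of the two estimates is significant.
    if exp_sig != nonexp_sig:
        return True
    # Conflict case 2: both significant but with opposite signs.
    return exp_sig and nonexp_sig and (exp_est * nonexp_est < 0)

# Illustrative (made-up) estimate/standard-error pairs:
print(different_inference(0.056, 0.02, 0.010, 0.02))   # significant vs. not
print(different_inference(0.056, 0.02, -0.060, 0.02))  # opposite signs
print(different_inference(0.056, 0.02, 0.050, 0.02))   # same inference
```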

The fifth row indicates the percent of the pairs in which the estimated impacts are statistically significantly different from each other.

Table A1: Comparison of Experimental and Nonexperimental Estimates of the Effects of Employment and Training Programs on Employment Status (All Pairs of Experimental and Nonexperimental Estimates)

Comparison group specifications (columns): Across-Site/Unmatched; Across-Site/Matched; Within-Site/Across-Cohort; Within-Site/Across-Office
Row 1: Number of pairs
Row 2: Mean experimental estimate
Row 3: Mean absolute experimental-nonexperimental difference
Row 4: Percent with different inference
Row 5: Percent with statistically significant difference
[Table cell values not reproduced.]

Source: Daniel Friedlander and Philip K. Robins, "Estimating the Effect of Employment and Training Programs: An Assessment of Some Nonexperimental Techniques," Manpower Demonstration Research Corporation Working Paper (February 1994): Table 13.

For our purposes, we focus on rows 3 and 4. Row 3 tells us that under every method of constructing comparison groups, the constructed comparison group estimates (called "nonexperimental" in the table) differ from the "true impact" estimates by more than 50 percent of the magnitude of the "true impact."

Row 4 tells us that in a substantial number of cases the constructed comparison group results led to a different inference; that is, the "true impact" estimates indicated that the program had a statistically significant effect on the employment rate, and the constructed comparison group estimates indicated that it had no impact or vice versa, or that one said the impact was to increase the employment rates at a statistically significant level and the other said that it decreased the employment rate at a statistically significant level.

Now we focus more closely on columns 3 and 4 because these are the types of comparisons that are likely to be more relevant for community-wide initiatives: as already noted, the within-site/across-cohort category approximates a pre–post design in a single community, and the within-site/across-office designation approximates a close-neighborhood-as-a-comparison-group design.

It appears that these designs are better than the across-site designs in that, as indicated in row 3, the absolute difference between the "true impact" and the constructed comparison group estimates is much smaller and is smaller than the size of the true impact. However, the difference is still over 50 percent of the size of the "true impact." The magnitude of the difference is important if one is carrying out a benefit-cost analysis of the program. A 4.5 percentage-point difference in employment rates might not be sufficiently large to justify the costs of the program, but a 7.9 percentage-point difference might make the benefit-cost ratio look very favorable; a benefit-cost analysis with the average "true impact" would have led to the conclusion that the social benefits of the program do not justify the costs, whereas the average constructed comparison group impact (assuming that it was a positive .034 or greater) would have led to the erroneous conclusion that the program did provide social benefits that justify its costs.
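The benefit-cost arithmetic can be made concrete with assumed numbers. The dollar figures below are invented purely for illustration; only the 4.5 and 7.9 percentage-point impacts come from the discussion above:

```python
# Hypothetical benefit-cost arithmetic: benefits scale with the
# employment-rate impact, so a biased impact estimate flips the verdict.
benefit_per_point = 1000.0   # assumed social benefit (dollars) per
                             # percentage point of employment impact
program_cost = 6000.0        # assumed program cost per participant

true_impact = 4.5            # "true" experimental estimate (pct. points)
biased_impact = 7.9          # constructed-comparison estimate

print(true_impact * benefit_per_point - program_cost)    # -1500.0: fails the test
print(biased_impact * benefit_per_point - program_cost)  # 1900.0: passes the test
```

Under these assumed figures the true impact produces a net social loss while the biased estimate produces a net gain, which is exactly the reversal of conclusions the text warns about.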

When we move to row 4 we have to be a bit more careful in interpreting the results because the sample sizes for the column 3 and 4 estimates are considerably smaller than those for the column 1 and 2 cases. For example, the entire treatment group is used in each pair in columns 1 and 2, but only half the treatment group is used in columns 3 and 4. Small sample size makes it more likely that both the random assignment estimates and the constructed comparison group estimates will be found to be statistically insignificant; thus, inherently, the percent with a different statistical inference should be smaller in columns 3 and 4. Even so, for the "within-site/across-cohort" category, nearly 30 percent of the pairs in the constructed comparison group estimates would lead to a different--and therefore erroneous--inference about the impact of the program. For the "within-site/across-office" estimates, 13 percent led to a different statistical inference. Is this a tolerable risk of erroneous inference? We would not think so, but others may feel otherwise.

A couple of additional points about the data from this study should be borne in mind. First, this is just one set of data analyzed for a single, relatively well-understood outcome measure: whether employed or not. There is no guarantee that the conclusions about the relative strength of alternative methods of constructing comparison groups found with these data would hold up for other outcome measures. Second, in the underlying work/welfare studies, the populations from which both treatment group members and control group members were drawn were very much the same--that is, applicants for or recipients of AFDC. Therefore, even when constructing comparison groups across sites, one is assured of having already selected persons whose employment situation is so poor that they need to apply for welfare. In community-wide initiatives, the population involved would be far more heterogeneous. There would be a far wider range of unmeasured characteristics that could affect the outcomes; therefore, the adequacy of statistical controls (matching or modeling) in assuring comparability of the treatment and constructed comparison groups could be much less.


A fuller version of this paper, including an annotated bibliography, is available as a working paper on request from the Russell Sage Foundation, 112 East 64th Street, New York, NY 10021.

  1. We recognize that, even with random assignment, problems remain that we can only address with nonexperimental methods--in particular, attrition from the research measurement in follow-up periods.
  2. The Friedlander and Robins (1994) study found little difference between controlling for measured differences in characteristics through a common linear regression model and using pairs matched on the Mahalanobis measure. Fraker and Maynard (1987) also compare Mahalanobis matches with other matching methods and find no clear indication of superiority.
  3. It should be recognized that this is a study using data on work/welfare programs, looking at effects on employment, and we cannot be sure that the conclusions drawn about the risks of bias with constructed comparison groups that appear in these data would hold for other types of outcomes or for other types of program interventions. However, it seems to us to provide a very strong signal that the potential risks in the use of some of these comparison group strategies is very high.
  4. For a classic reference on these methods, see Box and Jenkins (1976). Several applications of time-series modeling to program evaluation are presented in Forehand (1982).
  5. There is a rich literature on the closely related development of simulation models used to estimate the likely effects of proposed program reforms in taxes and expenditures. See, for example, Citro and Hanushek (1991).
  6. Many of these remarks would apply equally to situations in which constructed comparison groups are used, in the sense that the interaction effects themselves do not add further problems of bias beyond those associated with constructed comparison groups.
  7. See Hollister and Haveman (1991) for a full discussion of the problems of displacement and attempts to measure it.
  8. The best attempt of which we are aware to measure displacement is Crane and Ellwood (1984), but even it has serious problems. It used not comparison sites but data on national enrollments in the Summer Youth Employment Program and data on Standard Metropolitan Statistical Area (SMSA) labor markets from the Current Population Survey (CPS). The national program was large enough to have impacts on local youth labor markets, and the time-series data from the CPS made it possible to attempt to create a counterfactual with an elaborate statistical model.
  9. In the medical experimentation literature, there is also some discussion about optimal stopping rules that introduce time considerations into decisions about when to terminate clinical trials as information accumulates.
  10. Some work has been done on dynamic sample allocation. Here, learning effects are introduced sequentially as information flows back about the size of variances of outcome variables, and to some degree initial estimates of response. In light of this information, sequentially enrolled sample can be reallocated among treatments--that is, shift more sample into groups with the largest variance--so as to reduce the uncertainty of estimates. The National Supported Work Demonstration used such a sequential design to a limited degree.
  11. Michael Wiseman of the University of Wisconsin made some partial steps in this direction in the work he did for Urban Strategies in Oakland.


Ashenfelter, Orley, and David Card. 1985. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics 67: 648–60.

Bassi, Laurie. 1983. "The Effect of CETA on the Post-Program Earnings of Participants." The Journal of Human Resources 18: 539–56.

-----. 1984. "Estimating the Effect of Training Programs with Nonrandom Selection." Review of Economics and Statistics 66: 36–43.

Beebout, Harold, and Jean Baldwin Grossman. 1985. "A Forecasting System for AFDC Caseloads and Costs: Executive Summary." Mimeographed. Princeton: Mathematica Policy Research.

Bhattacharyya, M. N., and Allan P. Layton. 1979. "Effectiveness of Seat Belt Legislation on the Queensland Road Toll--An Australian Case Study in Intervention Analysis." Journal of the American Statistical Association 74: 596–603.

Bloom, Howard S. 1987. "What Works for Whom? CETA Impacts for Adult Participants." Evaluation Review 11: 510–27.

Box, G. E. P. and G. M. Jenkins. 1976. Time-Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

Brown, Prudence and Harold A. Richman. 1993. "Communities and Neighborhoods: How Can Existing Research Inform and Shape Current Urban Change Initiatives?" Background memorandum prepared for the Social Science Research Council Policy Conference on Persistent Poverty, Washington, DC, November 9–10. Chapin Hall Center for Children at the University of Chicago.

Brown, Randall, John Burghardt, Edward Cavin, David Long, Charles Mallar, Rebecca Maynard, Charles Metcalf, Craig Thornton, and Christine Whitebread. 1983. "The Employment Opportunity Pilot Projects: Analysis of Program Impacts." Mimeographed. Princeton: Mathematica Policy Research.

Bryant, Edward C., and Kalman Rupp. 1987. "Evaluating the Impact of CETA on Participant Earnings." Evaluation Review 11: 473–92.

Burghardt, John, Anne Gordon, Nancy Chapman, Philip Gleason, and Thomas Fraker. 1993. "The School Nutrition Dietary Assessment Study: Dietary Intakes of Program Participants and Nonparticipants." Mimeographed (October). Princeton: Mathematica Policy Research.

Campbell, D. T. and J. C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.

Casswell, Sally, and Lynnette Gilmore. 1989. "An Evaluated Community Action Project on Alcohol." Journal of Studies on Alcohol 50: 339–46.

Chaskin, Robert J. 1994. "Defining Neighborhood." Background paper prepared for the Neighborhood Mapping Project of the Annie E. Casey Foundation. The Chapin Hall Center for Children at the University of Chicago.

Citro, Constance and E. A. Hanushek, eds. 1991. Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, vol. 1. Washington, DC: National Academy Press.

Crane, J. and D. Ellwood. 1984. "The Summer Youth Employment Program: Private Job Supplement or Substitute." Harvard Working Paper.

Davis, Elizabeth. 1993. "The Impact of Food Stamp Cashout on Household Expenditures: The Alabama ASSETS Demonstration." In New Directions in Food Stamp Policy Research, ed. Nancy Fasciano, Daryl Hall, and Harold Beebout. Draft Copy. Princeton: Mathematica Policy Research.

Devaney, Barbara, Linda Bilheimer, and Jennifer Schore. 1991. The Savings in Medicaid Costs for Newborns and their Mothers from Prenatal Participation in the WIC Program. Vols. 1 and 2. Princeton: Mathematica Policy Research.

Devaney, Barbara and Lorenzo Morano. 1994. "Comparison Site Selection Criteria." Princeton: Mathematica Policy Research.

Dickinson, Katherine P., Terry R. Johnson, and Richard W. West. 1987. "An Analysis of the Sensitivity of Quasi-Experimental Net Impact Estimates of CETA Programs." Evaluation Review 11: 452–72.

Dynarski, Mark, Alan Hershey, Rebecca Maynard, and Nancy Adelman. 1992. "The Evaluation of the School Dropout Demonstration Assistance Program--Design Report: Volume I." Mimeographed (October 12). Princeton: Mathematica Policy Research.

Flay, Brian, Katherine B. Ryan, J. Allen Best, K. Stephen Brown, Mary W. Kersell, Josie R. d'Avernas, and Mark P. Zanna. 1985. "Are Social-Psychological Smoking Prevention Programs Effective? The Waterloo Study." Journal of Behavioral Medicine 8: 37–59.

Forehand, Garlie A., ed. 1982. Applications of Time-Series Analysis to Evaluation. New Directions for Program Evaluation, No. 16. A publication of the Evaluation Research Society, Scarvia B. Anderson, Editor-in-Chief. San Francisco: Jossey-Bass, Inc.

Fraker, Thomas and Rebecca Maynard. 1987. "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs." The Journal of Human Resources 22: 194–227.

Friedlander, D. and P. Robins. 1994. "Estimating the Effect of Employment and Training Programs: An Assessment of Some Nonexperimental Techniques." Manpower Demonstration Research Corporation Working Paper, February. New York: Manpower Demonstration Research Corporation.

Garfinkel, Irwin, C. Manski, and C. Michalopoulos. 1992. "Micro Experiments and Macro Effects." In Evaluating Welfare and Training Programs, ed. Charles Manski and Irwin Garfinkel. Cambridge, Mass.: Harvard University Press.

Garasky, Steven. 1990. "Analyzing the Effect of Massachusetts' ET Choices Program on the State's AFDC-Basic Caseload." Evaluation Review 14: 701–10.

Garasky, Steven, and Burt S. Barnow. 1992. "Demonstration Evaluations and Cost Neutrality: Using Caseload Models to Determine the Federal Cost Neutrality of New Jersey's REACH Demonstration." Journal of Policy Analysis and Management 11: 624–36.

Grossman, Jean Baldwin. 1985. "The Technical Report for the AFDC Forecasting Project for the Social Security Administration/ Office of Family Assistance." Mimeographed (February). Princeton: Mathematica Policy Research.

Heckman, J. 1979. "Sample Selection Bias as a Specification Error." Econometrica 47: 153–62.

Heckman, J. and J. Hotz. 1989. "Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training." Journal of the American Statistical Association 84, no. 408 (December): 862–80.

Hollister, Robinson and Robert Haveman. 1991. "Direct Job Creation." In Labour Market Policy and Unemployment Insurance, ed. A. Bjorklund, R. Haveman, R. Hollister, and B. Holmlund. Oxford: Clarendon Press.

LaLonde, R. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review 76 (September): 604–20.

LaLonde, R. and R. Maynard. 1987. "How Precise are Evaluations of Employment and Training Programs: Evidence from a Field Experiment." Evaluation Review 11, no. 4 (August): 428–51.

Long, Sharon K., and Douglas A. Wissoker. 1993. "Final Impact Analysis Report: The Washington State Family Independence Program." Draft (April). Washington, DC: Urban Institute.

Mathematica Policy Research. 1985. "Evaluation of the Nutrition Assistance Program in Puerto Rico: Volume II, Effects on Food Expenditures and Diet Quality." Mimeographed. Princeton: Mathematica Policy Research.

McCleary, Richard, and James E. Riggs. 1982. "The 1975 Australian Family Law Act: A Model for Assessing Legal Impacts." In Applications of Time-Series Analysis to Evaluation, ed. Garlie A. Forehand. New Directions for Program Evaluation, No. 16 (December). San Francisco: Jossey-Bass, Inc.

McConnell, Beverly B. 1982. "Evaluating Bilingual Education Using a Time-Series Design." In Applications of Time-Series Analysis to Evaluation, ed. Garlie A. Forehand. New Directions for Program Evaluation, No. 16 (December). San Francisco: Jossey-Bass, Inc.

Vincent, Murray L., Andrew F. Clearie, and Mark D. Schluchter. 1987. "Reducing Adolescent Pregnancy Through School and Community-Based Education." Journal of the American Medical Association 257: 3382–386.

Select Bibliography
Examples of Studies Using Various Evaluation Strategies

Counterfactual from Statistical Modeling

Bhattacharyya, M. N., and Allan P. Layton. 1979. "Effectiveness of Seat Belt Legislation on the Queensland Road Toll--An Australian Case Study in Intervention Analysis." Journal of the American Statistical Association 74: 596–603.

Fraker, Thomas, Barbara Devaney, and Edward Cavin. 1986. "An Evaluation of the Effect of Cashing Out Food Stamps on Food Expenditures." American Economic Review 76: 230–39.

Garasky, Steven. 1990. "Analyzing the Effect of Massachusetts' ET Choices Program on the State's AFDC-Basic Caseload." Evaluation Review 14: 701–10.

Garasky, Steven, and Burt S. Barnow. 1992. "Demonstration Evaluations and Cost Neutrality: Using Caseload Models to Determine the Federal Cost Neutrality of New Jersey's REACH Demonstration." Journal of Policy Analysis and Management 11: 624–36.

Grossman, Jean Baldwin. 1985. "The Technical Report for the AFDC Forecasting Project for the Social Security Administration/ Office of Family Assistance." Mimeographed (February). Princeton: Mathematica Policy Research.

Kaitz, Hyman B. 1979. "Potential Use of Markov Process Models to Determine Program Impact." In Research in Labor Economics, ed. Farrell E. Bloch, 259–83. Greenwich: JAI Press.

Mathematica Policy Research. 1985. "Evaluation of the Nutrition Assistance Program in Puerto Rico: Volume II, Effects on Food Expenditures and Diet Quality." Mimeographed. Princeton: Mathematica Policy Research.

McCleary, Richard, and James E. Riggs. 1982. "The 1975 Australian Family Law Act: A Model for Assessing Legal Impacts." In Applications of Time-Series Analysis to Evaluation, ed. Garlie A. Forehand. New Directions for Program Evaluation, No. 16 (December). San Francisco: Jossey-Bass, Inc.

McConnell, Beverly B. 1982. "Evaluating Bilingual Education Using a Time-Series Design." In Applications of Time-Series Analysis to Evaluation, ed. Garlie A. Forehand. New Directions for Program Evaluation, No. 16 (December). San Francisco: Jossey-Bass, Inc.

Comparison Group Derived from Survey Data

Ashenfelter, Orley. 1978. "Estimating the Effect of Training Programs on Earnings." Review of Economics and Statistics 60: 47–57.

Ashenfelter, Orley, and David Card. 1985. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics 67: 648–60.

Barnow, Burt S. 1987. "The Impact of CETA Programs on Earnings." The Journal of Human Resources 22: 157–93.

Bassi, Laurie. 1983. "The Effect of CETA on the Post-Program Earnings of Participants." The Journal of Human Resources 18: 539–56.

-----. 1984. "Estimating the Effect of Training Programs with Nonrandom Selection." Review of Economics and Statistics 66: 36–43.

Bloom, Howard S. 1987. "What Works for Whom? CETA Impacts for Adult Participants." Evaluation Review 11: 510–27.

Bryant, Edward C., and Kalman Rupp. 1987. "Evaluating the Impact of CETA on Participant Earnings." Evaluation Review 11: 473–92.

Dickinson, Katherine P., Terry R. Johnson, and Richard W. West. 1987. "An Analysis of the Sensitivity of Quasi-Experimental Net Impact Estimates of CETA Programs." Evaluation Review 11: 452–72.

Finifter, David H. 1987. "An Approach to Estimating Net Earnings Impact of Federally Subsidized Employment and Training Programs." Evaluation Review 11: 528–47.

Within-Site Comparison Groups

Burghardt, John, Anne Gordon, Nancy Chapman, Philip Gleason, and Thomas Fraker. 1993. "The School Nutrition Dietary Assessment Study: Dietary Intakes of Program Participants and Nonparticipants." Mimeographed (October). Princeton: Mathematica Policy Research.

(See also the other reports from this study, including "Data Collection and Sampling"; "School Food Service, Meals Offered, and Dietary Intakes"; and "Summary of Findings.")

Cooley, Thomas M., Timothy W. McGuire, and Edward C. Prescott. 1979. "Earnings and Employment Dynamics of Manpower Trainees: An Exploratory Econometric Analysis." In Research in Labor Economics, ed. Farrell E. Bloch, 119–48. Greenwich: JAI Press.

Devaney, Barbara, Linda Bilheimer, and Jennifer Schore. 1991. The Savings in Medicaid Costs for Newborns and their Mothers from Prenatal Participation in the WIC Program, vols. 1 and 2. Mimeographed (April). Princeton: Mathematica Policy Research.

Jimenez, Emmanuel and Bernardo Kugler. 1987. "The Earnings Impact of Training Duration in a Developing Country: An Ordered Probit Selection Model of Colombia's Servicio Nacional de Aprendizaje (SENA)." The Journal of Human Resources 22: 228–47.

Kiefer, Nicholas M. 1978. "Federally Subsidized Occupational Training and the Employment and Earnings of Male Trainees." Journal of Econometrics 8: 111–25.

-----. 1979. "Population Heterogeneity and Inference from Panel Data on the Effects of Vocational Training." Journal of Political Economy 87: 213–26.

Matched-Site Comparison Groups--No Modeling

Buckner, John C., and Meda Chesney-Lind. 1983. "Dramatic Cures for Juvenile Crime: An Evaluation of a Prisoner-Run Delinquency Prevention Program." Criminal Justice and Behavior 10: 227–47.

Duncan, Burris, W. Thomas Boyce, Robert Itami, and Nancy Puffenbarger. 1983. "A Controlled Trial of a Physical Fitness Program for Fifth Grade Students." Journal of School Health 53: 467–71.

Evans, Richard, Richard Rozelle, Maurice Mittelmark, William Hansen, Alice Bane, and Janet Havis. 1978. "Deterring the Onset of Smoking in Children: Knowledge of Immediate Physiological Effects and Coping with Peer Pressure, Media Pressure, and Parent Modeling." Journal of Applied Social Psychology 8, no. 2: 126–35.

Flay, Brian, Katherine B. Ryan, J. Allen Best, K. Stephen Brown, Mary W. Kersell, Josie R. d'Avernas, and Mark P. Zanna. 1985. "Are Social-Psychological Smoking Prevention Programs Effective? The Waterloo Study." Journal of Behavioral Medicine 8: 37–59.

Freda, Margaret Comerford, Karla Damus, and Irwin R. Merkatz. 1988. "The Urban Community as the Client in Preterm Birth Prevention: Evaluation of a Program Component." Social Science & Medicine 27: 1439–446.

Hurd, Peter D., C. Anderson Johnson, Terry Pechacek, L. Peter Bast, David R. Jacobs, and Russel V. Luepker. 1980. "Prevention of Cigarette Smoking in Seventh Grade Students." Journal of Behavioral Medicine 3: 15–28.

McAlister, Alfred, Cheryl Perry, Joel Killen, Lee Ann Slinkard, and Nathan Maccoby. 1980. "Pilot Study of Smoking, Alcohol and Drug Abuse Prevention." American Journal of Public Health 70: 719–21.

Perry, Cheryl L., Joel Killen, and Lee Ann Slinkard. 1980. "Peer Teaching and Smoking Prevention Among Junior High Students." Adolescence 15: 277–81.

Perry, Cheryl L., Rebecca M. Mullis, and Marla C. Maile. 1985. "Modifying the Eating Behavior of Young Children." Journal of School Health 55: 399–402.

Perry, Cheryl L., Michael J. Telch, Joel Killen, Adam Burke, and Nathan Maccoby. 1983. "High School Smoking Prevention: The Relative Efficacy of Varied Treatments and Instructors." Adolescence 18: 561–66.

Vincent, Murray L., Andrew F. Clearie, and Mark D. Schluchter. 1987. "Reducing Adolescent Pregnancy Through School and Community-Based Education." Journal of the American Medical Association 257: 3382–386.

Zabin, Laurie S., Marilyn Hirsch, Edward A. Smith, Rosalie Streett, and Janet B. Hardy. 1986. "Evaluation of a Pregnancy Prevention Program for Urban Teenagers." Family Planning Perspectives 18: 119–23.

Matched-Site Comparison Groups--With Modeling

Brown, Randall, John Burghardt, Edward Cavin, David Long, Charles Mallar, Rebecca Maynard, Charles Metcalf, Craig Thornton, and Christine Whitebread. 1983. "The Employment Opportunity Pilot Projects: Analysis of Program Impacts." Mimeographed (February). Princeton: Mathematica Policy Research.

Casswell, Sally, and Lynnette Gilmore. 1989. "An Evaluated Community Action Project on Alcohol." Journal of Studies on Alcohol 50: 339–46.

Devaney, Barbara, Marie McCormick, and Embry Howell. 1993. "Design Reports for Healthy Start Evaluation: Evaluation Design, Comparison Site Selection Criteria, Site Visit Protocol, Interview Guides." Mimeographed. Princeton: Mathematica Policy Research.

Dynarski, Mark, and Walter Corson. 1994. "Technical Approach for the Evaluation of Youth Fair Chance." Proposal accepted by the Department of Labor (June 1994). Princeton: Mathematica Policy Research.

Farkas, George, Randall Olsen, Ernst W. Stromsdorfer, Linda C. Sharpe, Felicity Skidmore, D. Alton Smith, and Sally Merrilly (Abt Associates). 1984. Post-Program Impacts of the Youth Incentive Entitlement Pilot Projects. New York: Manpower Demonstration Research Corporation.

Farkas, George, D. Alton Smith, and Ernst W. Stromsdorfer. 1983. "The Youth Entitlement Demonstration: Subsidized Employment with a Schooling Requirement." The Journal of Human Resources 18: 557–73.

Gueron, Judith. 1984. Lessons from a Job Guarantee: The Youth Incentive Entitlement Pilot Projects. New York: Manpower Demonstration Research Corporation.

Guyer, Bernard, Susan S. Gallagher, Bei-Hung Chang, Carey V. Azzara, L. Adrienne Cupples, and Theodore Colton. 1989. "Prevention of Childhood Injuries: Evaluation of the Statewide Childhood Injury Prevention Program (SCIPP)." American Journal of Public Health 79: 1521–527.

Ketron. 1987. "Final Report of the Second Set of Food Stamp Workfare Demonstration Projects." Mimeographed (September). Wayne, Pa.: Ketron.

Long, David A., Charles D. Mallar, and Craig V. D. Thornton. 1981. "Evaluating the Benefits and Costs of the Job Corps." Journal of Policy Analysis and Management 1: 55–76.

Mallar, Charles, Stuart Kerachsky, Craig Thornton, and David Long. 1982. "Evaluation of the Economic Impact of the Job Corps Program: Third Follow-Up Report." Mimeographed (September). Princeton: Mathematica Policy Research.

Polit, Denise, Janet Kahn, and David Stevens. 1985. "Final Impacts from Project Redirection." Mimeographed (April). New York: Manpower Demonstration Research Corporation.

Steinberg, Dan. 1989. "Induced Work Participation and the Returns to Experience for Welfare Women: Evidence from a Social Experiment." Journal of Econometrics 41: 321–40.

Matched-Pair Comparison Sites

Davis, Elizabeth. 1993. "The Impact of Food Stamp Cashout on Household Expenditures: The Alabama ASSETS Demonstration." In New Directions in Food Stamp Policy Research, ed. Nancy Fasciano, Daryl Hall, and Harold Beebout. Draft Copy. Princeton: Mathematica Policy Research.

Long, Sharon K., and Douglas A. Wissoker. 1993. "Final Impact Analysis Report: The Washington State Family Independence Program." Draft (April). Washington, DC: Urban Institute.

Institutional Comparison

Dynarski, Mark, Alan Hershey, Rebecca Maynard, and Nancy Adelman. 1992. "The Evaluation of the School Dropout Demonstration Assistance Program--Design Report: Volume I." Mimeographed (October 12). Princeton: Mathematica Policy Research.


Copyright © 1999 by The Aspen Institute