A wave of enthusiasm for evaluation has been sweeping governments around the world in recent years. In country after country, governments are acting to increase the amount of evaluation undertaken and ensure that it is used more. Government-wide evaluation policies have been developed, and specialized units and task forces established to promote, guide, and regulate evaluation. Some governments have also passed laws enshrining the obligation to evaluate. All of this is very positive.
There is, however, a risk that much of this will end in disappointment unless we learn from the repeated past failures of government-wide evaluation reforms. I say this as a strong supporter of evaluation who has advised several countries on building government-wide evaluation systems.
One key lesson from past experience is that a prominent place in the evaluation toolkit must be assigned to practical evaluation methods. This means methods that are low-cost, quick to implement, able to cope with limited data, action-oriented, and systematic without being so complex that a PhD is required to undertake them. Evaluation policies and units must actively promote – and provide technical guidance on – such methods.
While evaluation might sound like a new frontier in public sector reform, it is anything but. There have been successive waves of government-wide evaluation efforts in advanced countries since the 1960s. Almost all of them petered out in disappointment. This is because, throughout its long history, evaluation has been dogged by a well-documented set of problems. Evaluations have frequently been “too costly and time-consuming compared to their real use and effect”. They have all too often been short on actionable findings and recommendations, difficult for decision-makers to understand, and too slow to produce to feed into decision-making at the time when it is needed [1]. These problems remain even today.
The record is clear: one of the biggest sources of the difficulties that have afflicted evaluation over the decades has been the insistence by an influential part of the evaluation profession that all evaluation must be scientifically “rigorous” — i.e. based on sophisticated social science statistical analysis techniques.
This purist attitude is riding high today. For today’s purists, “rigorous” evaluation essentially means impact evaluation [2] and – for the real hard-liners – only impact evaluation using randomized controlled trials. Those who take this view are dismissive of other methods, which they regard as insufficiently rigorous to even qualify as evaluation.
Impact evaluation is a great tool for the evaluation of some government programs. It has made considerable progress over past decades, assisted by both methodological advances and digitalization. It is unquestionably one of the evaluation methods that should be part of the analytic toolkit of a well-developed evaluation system.
However, impact evaluation also has major limitations. There are many government interventions for which its use is either not possible [3], not practical, or not cost-effective. It is highly data-intensive, time-consuming and costly – typically taking several years at a cost of hundreds of thousands of dollars. Impact evaluation findings are also much less generalizable (“externally valid”) than is often claimed — even when they are obtained using randomized controlled trials [4]. Moreover, while impact evaluation can provide information on whether an intervention is effective or not, it provides no guidance on why it is or is not effective or what might be done to improve its effectiveness.
This is why practical evaluation is so important. For the analysis of effectiveness, so-called theory-based evaluation — variants of which include contribution analysis — is a particularly important type of practical evaluation. The core of theory-based evaluation is the assessment of the conceptual credibility of the reasoning about how a program is supposed to achieve its intended outcomes – what is often called “program logic analysis”. This is complemented by a review of whatever data is available and, optionally, some additional data collection. Such evaluation can be carried out in months rather than years, at modest cost, and can be applied to any program. The method is systematic — as analysis must be to meet the definition of evaluation — but very practical.
The importance of this type of practical evaluation is reinforced by the fact that, whatever the country, the financial and specialist human resources available for evaluation are limited. This is particularly true in developing countries setting out to build evaluation systems.
All of this is about the evaluation of effectiveness. But what about efficiency? After all, evaluation usually claims to cover both. The evaluation of efficiency will be the topic of the next piece in this series.
[1] I reviewed this historical experience in a 2014 paper prepared for the World Bank Independent Evaluation Group.
[2] An impact evaluation is an analysis of the effectiveness of a government intervention (i.e. its outcomes) using counterfactual analysis based on experimental or quasi-experimental techniques.
[3] For example, if the outcome that the program seeks to achieve is not measurable.
[4] There is a considerable literature outlining the reasons why, as Deaton and Cartwright put it, “any special status for RCTs is unwarranted” (see also, e.g., Pritchett, 2021).