Complexity and Evaluation Failure

Aug 7, 2019 | Blog

By Dione Hills, Principal Researcher Consultant (The Tavistock Institute of Human Relations)

24th July 2019

Why complex evaluations fail:

People attending CECAN training and events often want to hear about evaluations that ‘failed’ because the wrong (i.e. not complexity appropriate) methods or approach were used. Providing such examples is not easy: accounts of evaluations that ‘fail’ are rarely published, and when we have examples from our own practice, confidentiality – or embarrassment – can make it hard to talk about these in public.

It is therefore both refreshing and helpful to read a book like ‘Evaluation Failures’ [i]. Written by experienced evaluators, this gives twenty-two examples of evaluations that ‘failed’, with reflections by the authors on what they got wrong, and what they learned. In very few cases was the failure attributed to using the wrong evaluation methods or design. More often, difficulties arose from dynamics in the system that the evaluators had either not foreseen or overlooked in the early stages. The authors report their main mistakes as ignoring ‘red flags’, making assumptions that turned out to be incorrect, and failing to ‘call time’ when it was clear that the evaluation was no longer viable.

Using a ‘complexity’ framework:

What is striking is that all these examples are of evaluations of complex projects or programmes, often with multiple and varied interventions taking place across several sites and sometimes several countries. Although features of complex adaptive systems are not specifically referred to, many of the difficulties experienced can be seen through a complexity ‘lens’. Made up of many interacting parts, complex systems display ‘emergent’ behaviour and properties that may be hard to predict. As ‘open systems’, they are strongly influenced by things taking place in the wider context. Complex adaptive systems will also actively resist change, with individual parts of the system acting in ways that minimise the impact of external disruption in an attempt to bring the system back to equilibrium. Individual ‘gatekeepers’ (hubs and spokes in systems terms) can often have the power to provide or limit access to the system as a whole: in evaluation, this often means access to data. There may also be a lack of consensus between key parts or key stakeholders in the system. Some of these features are illustrated in the examples given below:

Gatekeepers and their ability to obstruct and subvert evaluation:

Several case studies give examples of ways in which key individuals, or one part of the overall system or programme, were able to obstruct or undermine the evaluation activities, either by making it difficult for evaluators to collect data – or by rejecting the veracity of the data when it was presented.

Chapter 8 describes difficulties an evaluator had when presenting baseline findings from the evaluation of a child labour project run by an international NGO. Government officials challenged the evaluation methodology being used and the failure of the evaluators to consult adequately on how they were approaching the work. However, a key problem for the government department was that the data indicated levels of child labour that were potentially damaging to the reputation of the companies involved and the country itself.
Chapter 20 describes a multisite programme evaluation in which programme managers vigorously rejected extensive evidence (drawn from interviews, a survey and case studies) that local staff wanted to have greater contact and communication with the central team. This masked the concern from the central team that they didn’t have the capacity to provide this level of communication with sites. When presented with more contextual information and suggestions of how this challenge could be addressed without a major investment of resources, the evidence was (reluctantly) accepted.
Chapter 16 describes the evaluation of a programme supporting small enterprises in South Africa in order to build up the local, black, economy. Information from the programme was to be verified using procurement and financial data from the larger corporations involved (buying services from the local enterprises). In spite of early assurances, few corporations were willing to supply this information, giving commercial sensitivity and practical difficulties as their reasons. Combined with severe difficulties in collecting other programme information (including a major computer crash and loss of key data), this meant that the evaluation findings were severely limited.

This resistance to the collection of data or presentation of findings may result from the evaluator being seen as threatening to the current ‘status quo’ or equilibrium of the system, often pointing to deeper problems in the policy or programme being evaluated. In the second example, addressing the deeper issue helped programme managers to accept the feedback from local staff that the evaluation had identified.

Context as a critical factor:

‘Turbulence’ or rapid change in the system or its wider context was a feature of several examples. This was sometimes within the system itself: either in the intervention, or among those commissioning the intervention and its evaluation, and sometimes in the wider environment. The level of change was often hard to anticipate at the outset.

Chapter 3 is an account of the evaluation of a four-year public health programme promoting community responses to cardiovascular disease delivered across a number of local districts. The design included both process and impact evaluation and a strong participative element. After the evaluation had worked well for the first year, a number of disruptive policy and organizational changes took place in the wider context, forcing sites to change their plans. There was also a high rate of staff turnover within the programme itself. Although the evaluators regularly adapted the evaluation to respond to these changes, in year three the whole evaluation approach, and some of the key findings, were challenged in a letter of complaint by one of the new site managers, who was unfamiliar with participative research. The evaluation contract was abruptly terminated.

In other examples, it was not change as such, but factors in the wider context that the evaluator was simply unaware of, that caused difficulties. In the following example, it was factors that had been deliberately hidden from view.

Chapter 21 describes the evaluation of a programme providing training to local farm co-ops across multiple countries. Key staff – including country level evaluators – appeared reluctant to be involved in, or failed to turn up at, planning meetings. In spite of this and the frequent turnover of key staff, the evaluation continued and a final report was presented to the programme management. This was returned covered in mark-ups and requests for change: apparently innocuous findings were challenged angrily, and proposals that had been discussed and agreed with local participants were strongly rejected. It later turned out that the programme had been running for a number of years, and that similar problems had been identified in at least two previous evaluations that the evaluator had not been told about.

The lessons evaluators took from these experiences include the importance of having an initial risk analysis or evaluability assessment and the need for having a flexible management plan, with contingency plans and a budget in place to cover potential change. Undertaking some system mapping may also have alerted evaluators to these potential challenges. However, there were clearly cases in which the level of turbulence was such as to make an evaluation untenable.

Lack of consensus:

Conflict and lack of consensus between key stakeholders is a feature of many of the examples in the book. These tensions can become focused on the evaluation methodology and the data being generated.

Chapter 13 describes the evaluation of a programme funded by a partnership. An intervention developed – and evaluated – by one of the partners was now being implemented across several sites. The partner that developed the intervention was keen to have a Developmental Evaluation (to contribute to programme learning) while other partners wanted a more rigorous ‘impact’ evaluation to assess whether the programme was suitable for their own organisations. Amid considerable discomfort and conflict, an evaluation did take place (with both process and impact elements, but without significant Developmental elements) but with a sense that no one was entirely comfortable with this or benefiting from the findings.
Chapter 7 describes three evaluations taking place in a high stakes public policy environment and illustrates the tension for evaluators between meeting stakeholder requirements while maintaining their professional integrity. In one case ‘difficult’ findings were challenged because they were based on qualitative rather than quantitative data, the situation becoming more high profile when ‘challenging’ findings received media coverage. ‘To them (the committee receiving the report) numbers constituted ‘real’ data and qualitative information was nothing more than anecdotes and opinions. At times our conversations felt like we were speaking two different languages. We had different views of data evidence and evaluation methodologies’.

In most cases, the evaluators eventually did see, either at the time or in retrospect, how the difficulties were evidence of a deeper tension within the system, and were sometimes able to point this out with lesser or greater success. However, in many cases this didn’t stop the evaluators from feeling that it was their evaluation plans – or their lack of competence – that was ultimately to blame for the failure.

Can better evaluation design help avoid failure?

There is one striking example in the book where it was the actual evaluation design that failed because of a lack of appreciation of the wider ‘system’ in which the intervention took place.

Chapter 19 describes an evaluation that adopted a semi-experimental design comparing one group of participants who received new (financial) resources and another that did not. The evaluation approach aligned well with an initial logic model but failed to show any significant difference in outcome between the two groups. Exploring why, the evaluator realized that they had failed to identify that the intervention was just one part of a larger system (they had placed the boundary in the wrong place), that there were important system level interconnections between all recipients whether in receipt of the new resources or not, and that those not receiving new resources were receiving other kinds of support (other than the specific benefits being assessed).

However, there are also several examples in which evaluators adopted strategies and approaches recommended for use in complex settings, but ran into difficulties because they had not fully appreciated the system as a whole and failed to involve a key stakeholder.

Chapter 10 describes an evaluation of an initiative designed to support medical graduates in preparing for their licensing exams. A logic model was developed in consultation with a faculty member and programme staff and presented to a high level advisory group. After an awkward silence, the government funder observed that ‘these outcomes are not what we were funding you to do!’ Because the funder had not been involved in the logic mapping exercise, differences of view between the programme developers and the funders had been overlooked. Following discussions the logic map was amended and, with regular updates, provided the basis for a successful ongoing evaluation.
Chapter 11 describes how an evaluator proudly produced a system map of a health care intervention to demonstrate to the steering group the complexity of the issue – and its wider environment. The map was met with stunned silence and someone later commented ‘Maybe systems maps are a better investigative tool than a communications tool’. As well as learning how important it is to involve key stakeholders in drawing up a system map, the evaluator also felt that drawing up a rich picture might have been more helpful for identifying issues such as conflicts, emotions and politics.

Another strategy recommended for use in complex evaluations is the use of adaptive or agile management approaches so that the design itself can be changed in response to changing circumstances. Many of the evaluators did report having to make significant changes to their evaluation design, in response to changes in the programme being evaluated, but also noted the challenges that this could present.

Chapter 2 describes the evaluation of a professional development programme for government employees with a large number of stakeholders involved. Soon after the start there were a number of changes in key personnel followed by a change in government with accompanying changes in policy and restructuring. Significant evaluation resources had to be used in briefing incoming staff and adapting the design to accommodate shifts in focus and interest, leaving little in the budget for the final analysis and report writing.
Chapter 8 reports on an evaluation of a communication and outreach plan designed to increase understanding and take up of services of a government department. An initial diagnosis and theory of change map identified that trust – between the public and public servants – was a key issue and the evaluation design took this as a central focus. Unfortunately, the staff involved in the delivery of the programme were not kept sufficiently informed of this change in focus. The methodology was subject to closer and closer scrutiny before the whole evaluation was called to an abrupt halt.

The key lessons taken forward by the evaluators from examples like these relate to the importance of having contingency plans and budget agreed with commissioners, and ensuring that everyone is kept informed of changes taking place so that there are no surprises.

Conclusions:

Many of the dynamics at work in the examples of evaluation ‘failure’ in this book can be traced back to the characteristics of complex systems. The ‘failures’ experienced were often felt very personally by the evaluators as a reflection of their lack of experience or competence. Michael Quinn Patton draws out quotes related to this in the foreword to the book:

All hell broke loose, not catastrophically, but little by little, and it all added up.

The cumulative snowball effect – how a flurry of relatively minor challenges could pile up and leave me feeling overwhelmed and unable to solve it.

I felt incredibly embarrassed and disappointed.

I may have actually shed tears over this one. I was devastated.

Seeing these issues from a systems perspective might have given the evaluators – and others involved – a different handle on the situation. Individual behaviour seen as annoying and subversive can sometimes be interpreted as providing evidence of deeper problems in the system that need to be addressed. Recognising, from the outset, the inherent unpredictability of complex adaptive systems could also have been helpful in ‘expecting the unexpected’ and having mechanisms with which to identify – and discuss with commissioners – potentially disruptive features at an earlier stage.

[i] Evaluation Failures: 22 Tales of Mistakes Made and Lessons Learned, SAGE Publications Inc., 2019
