2. Complete the evaluation work plan
This section of the toolkit is structured to mirror the Evaluation work plan template to give additional guidance, section by section.
The evaluation work plan outlines future evaluation activity for the program, usually over a five-year period. It forms the basis of the request for tender (see section 3.2: Prepare the request for tender) to commission an external evaluation or to clarify requirements for an internal evaluation.
The work plan details roles and responsibilities, the evaluation methodology (including the program logic, the evaluation questions, evaluation types and data matrix), whether the evaluation will be external or internal, the budget and resourcing available, stakeholder communications, ethical considerations, evaluation risks and review. Where appropriate, key stakeholders should be included in discussions on these elements. Specific methodologies for each of the evaluations in the evaluation work plan may need to be refined prior to the commencement of each evaluation.
The work plan also takes into account the strategic importance of the program and the expected level of resourcing for evaluation.
2.1.1. Who should develop the evaluation work plan?
The evaluation work plan is usually completed by the program manager with input from the agency’s evaluation unit and/or DTF. For large and complex evaluations, it may be worth engaging an evaluation team at the development stage to help complete the evaluation work plan (see section 2.5.3: External or internal evaluation and section 3.1: When to engage the evaluation team).
While a program team may be able to develop designs for smaller scale evaluations, evaluation expertise may be required for more complex evaluations. Specialist expertise might be needed to:
- gather data about hard-to-measure outcomes or from hard-to-reach populations
- develop an evaluation design that adequately addresses causal attribution in outcome and/or impact evaluations
- advise on the feasibility of applying particular designs within the context of a program
- identify specific ethical and cultural issues.
If the program manager decides to use an external provider to develop the evaluation work plan, they should commission the design of the evaluation as a separate project as soon as the program is approved to proceed. This would be based on the evaluation overview in the Cabinet submission and would be used to complete the evaluation work plan. Subsequent requests for tender for each evaluation would draw on this work.
The program overview needs to explain why the program is needed, what it is aiming to achieve and how it is expected to impact demand on future government services. It also briefly describes how the program operates (its funding and governance) and any sensitivities. It should be about half a page in length with any additional information in an appendix.
2.2.1. Needs assessment
Ideally, a needs assessment will be undertaken as part of the program design. If a needs assessment has been carried out, please outline the findings in the program overview or a relevant appendix.
A needs assessment is a tool that is used for both designing programs and for conducting program evaluation. It is a systematic method to determine who needs the program, how great the need is, characteristics of the target group, patterns of unmet needs, and what might work to meet the needs identified.
Key questions may include:
- What problems exist and how large or serious are they?
- What are the characteristics and needs of the target population?
- How are people affected as individuals?
- How is the community affected? What are the financial and social costs of the issue?
- Are other groups or agencies (including Commonwealth agencies) working to address the need?
- What are the opportunities for collaboration and shared funding?
Undertaking a systematic needs assessment is a transparent way of ensuring resources are used in the most effective way possible. The needs assessment should be reviewed during evaluation to assess whether the program is still needed.
Conducting a literature review is an important step in determining the existing evidence base and possible program or intervention options. A good starting point may include one of the many online evidence banks or clearinghouses which aim to synthesise evidence on what works in various policy areas.
|Campbell Collaboration||Systematic reviews and evidence and gap maps (EGMs) by a range of subject areas e.g. Crime & Justice, Disability, Education, Social Welfare|
|Washington State Institute for Public Policy||Systematic assessment of high quality studies across various public policy areas and the cost of each policy option|
|Cochrane library||The Cochrane Database of Systematic Reviews is the leading journal for systematic review in healthcare|
|Australian Public Sector Evaluation Network||APSEN SharePoint Communication site provides a library of public sector evaluation resources and reports|
|The Evidence Based Policing Matrix||This Matrix categorises and visualises evaluated police tactics against ‘realms of effectiveness’|
|CrimeSolutions||A clearing house of programs and practices that have undergone rigorous evaluations and meta-analyses aiming to address criminal justice, juvenile justice and crime victim services outcomes|
|Evidence for Learning Toolkit||Summaries of global evidence for education approaches with a dashboard of i) average months’ worth of learning progress, ii) cost to implement and iii) security of evidence|
|Child Family Community Australia||The CFCA information exchange’s mission is to be the primary source of quality, evidence-based information, resources and support for professionals in the child, family and community welfare sector|
|Australian Centre for the Study of Sexual Assault (historical)||An information centre to provide policy-relevant data on sexual assault and promote research and best practice interventions|
|The Centre of Best Practice in Aboriginal and Torres Strait Islander Suicide Prevention Clearing House||This clearinghouse shares promising and best practice programs, services, guidelines, resources and research on Indigenous suicide prevention initiatives|
|Closing the Gap Clearinghouse (historical)||This clearinghouse was established to collect, analyse and synthesise evaluation evidence on 'what works' to close the gap in Indigenous disadvantage.|
|Indigenous Justice Clearinghouse||This clearinghouse disseminates relevant Indigenous Justice information|
|Australia’s National Research Organisation for Women’s Safety Limited||ANROWS produce evidence to support the reduction of violence against women and their children|
|What works wellbeing evidence bank||Evidence statements and gaps from systematic reviews on wellbeing, culture, sport, work and learning|
|Clearinghouse for Sport||An information and knowledge sharing platform for Australian sport|
|HealthInfoNet||This website provides an evidence base to inform practice and policy in Aboriginal and Torres Strait Islander health|
|NSW Department of Communities and Justice evidence and gap maps||An interactive out-of-home care evidence and gap map against 11 outcome areas based on 128 primary studies and 31 systematic reviews|
|UNICEF Evidence Gap Maps||Evidence gap maps for reducing violence against children, child well-being interventions, adolescent well-being and pandemics, epidemics and child protection|
|3ie - International Initiative for Impact Evaluation||Interactive evidence gap maps for international development policies and programs, links to user-friendly summaries and full-text articles where available|
 NSW Evaluation Toolkit (step 1).
Include all the information from the evaluation overview section of the Cabinet submission to ensure the evaluation work plan has all the relevant information as part of a single document. You could attach the Cabinet submission evaluation overview as an appendix and refer to the attachment in this section.
There are many decisions to be made in an evaluation including:
- the focus of the evaluation (including the key evaluation questions)
- choosing the evaluator/evaluation team
- approving the evaluation design
- approving the evaluation report(s) and who can access them.
Contributors to involve in the decision-making process may include:
- the program manager within the agency
- an evaluation steering committee
- a technical advisory group or a number of individual technical advisors (including service providers)
- a community consultation committee or relevant people from the community.
The role of each individual or group in relation to specific decisions can be categorised as follows:
- to consult: those whose opinions are sought (bilateral)
- to recommend: those who are responsible for putting forward a suitable answer to the decision
- to approve: those who are authorised to approve a recommendation
- to inform: those who are informed after the decision has been made (unilateral).
One or more of the following processes may be employed in the decision-making process:
- Decisions made based on support from the majority. Where decisions may be contentious it is important to be clear about who is eligible to vote and whether proxy votes are allowed.
- Decisions made based on reaching a consensus. In practical terms, that can mean giving all decision makers the right to veto.
- Decisions made based on hierarchy (formal positions of authority).
Evaluation managers are often, but not always, the program manager. For large evaluations, the evaluation manager may be assisted by one or more staff members with specific responsibilities in the management process.
Table 7: Potential evaluation roles and responsibilities
|[Program name] Evaluation Steering Committee|
 M. K. Gugerty, D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.
The evaluation methodology should include a detailed program logic, key evaluation questions and a data matrix. It needs to consider what types of evaluations should be used and how key evaluation questions will be addressed.
2.5.1. Program logic
A program logic should illustrate how the program will work by linking program activities with intended outcomes. It visually represents the theory of change underpinning the program and describes how the program contributes to a chain of results flowing from the inputs and activities to short-term, intermediate and long-term outcomes.
Different terms are used for a program logic, including program theory, logic model, theory of change, causal model, outcomes hierarchy, results chain and intervention logic. It is usually represented as a one-page diagram. The diagrams and terms used with program logic may also vary – sometimes the diagrams are shown as a series of boxes, as a table, or as a series of results with activities occurring alongside them rather than just at the start. Some diagrams show the causal links from left to right, some from bottom to top. In all cases, a program logic needs to be more than just a list of activities with arrows to the intended outcomes. For some examples, see Table 10: Program logic library.
What is a program logic used for?
A program logic should show what needs to be measured in order to distinguish between implementation failure (not done right) and theory failure (done right but still did not work). A program logic:
- clarifies and communicates program intentions and outcomes
- demonstrates alignment between activities and objectives
- explains causal assumptions and tests if they are supported by evidence
- identifies relevant external factors that could influence outcomes (either positively or negatively)
- identifies key indicators to be monitored
- identifies gaps in available data and outlines mitigation measures
- clarifies the outcomes measurement horizon and identifies early indicators of progress or lack of progress in achieving results
- focuses evaluation questions.
A program logic underpins data collection by identifying a program’s operating steps and defining what program managers should monitor and measure. A program logic also helps identify the components of the program to be tracked as part of monitoring (outputs) versus those that should be assessed as part of an outcome or impact evaluation (outcomes).
Monitoring activities and outputs shows which program components are being well implemented and which could be improved. A focus on measuring outcomes and impacts without a good monitoring system can result in wasted resources. For example, if a program aims to improve literacy in schools using particular books, it is important to monitor delivery and use of the books so that the program can be adjusted early if the books are not being delivered or used. If the program does not have a good monitoring system in place and waits three years before doing an outcome evaluation, this could be an expensive way of finding out that the books had not even been used. Further information on the importance of a good monitoring system as a way of keeping evaluation costs down is in the Goldilocks toolkit.
A program logic illuminates the critical assumptions and predictions that must hold for key outcomes to occur, and suggests important areas for data collection. It also helps prioritise data collection; for example, if there is no way to isolate external factors that influence the outcomes of the program, is it worth collecting the outcome data? Important considerations include the cost of collecting the outcome data and the conclusions that can reasonably be drawn from it. In some cases, a process evaluation may be sufficient but it is essential that the results of the process evaluation are not overstated.
Developing a program logic
Developing a program logic is a part analytical, part consultative process. Analytically, it should review the program settings to identify statements of activities, objectives, aims and intended outcomes. It should then refine and assemble these statements into a causal chain that shows how the activities are assumed to contribute to immediate outcomes, intermediate outcomes and ultimately to the longer term outcome. Consultatively, the process should involve working with a range of stakeholders to draw on their understanding of the outcomes and logic, and to encourage greater ownership of the program logic.
It is useful to think realistically about when a successful program will be able to achieve particular outputs and outcomes. For example, within a domestic violence context, a successful program may see an increase in reporting (due to increased awareness and/or availability of support) before reporting decreases. Where possible, estimated timing of indicators should be built into the program logic to help clarify what success looks like in different timeframes.
The Evaluation work plan template includes a suggested template for the program logic at Appendix A. This is an optional starting point rather than a mandatory structure; however, all program logics should clearly identify assumptions and relevant external factors. One approach to developing a program logic is ‘backcasting’, which starts by identifying the long-term outcomes of a program and envisaging alternative futures, and then works backwards to determine the steps needed to achieve these outcomes. Unlike forecasting, which considers what is currently occurring and predicts future outcomes, backcasting allows stakeholders to brainstorm and consider alternative courses of action. BetterEvaluation’s guidance on developing programme theory/theory of change may assist in determining the best approach.
Table 8: Potential steps for developing a program logic
|Undertake situational analysis||The context of the problem, its causes and consequences. A good situation analysis will go beyond problems and deficits to identify strengths and potential opportunities.|
|Identify outputs||The direct deliverables of a program. The products, goods or services that need to be provided to program participants to achieve the short-term outcomes.|
|Identify activities||The required actions to produce program outputs.|
|Identify inputs||The resources required to run the program.|
In every link between activity, output and outcome, many different assumptions are made that must hold for the program to work as expected. Making these assumptions explicit and identifying the most critical among them helps to figure out what testing and monitoring is needed to ensure the program works as planned. This includes assumptions about:
|Consider external factors that also cause changes||What besides the program could influence the intended outcome? Listing the most important external influences helps organisations better understand the counterfactual and clarify whether it will be possible to attribute a change in the outcome solely to the program.|
|Identify risks and unintended consequences||The world around the program is unlikely to remain static; changes in external conditions pose unavoidable risks to any program. It is important to identify the most likely and potentially damaging risks and develop a risk reduction or mitigation plan (see section 2.9: Evaluation risks).|
Once a program logic is developed, it is useful to map existing data from an established program or previous evaluation onto the program logic to identify priority areas for additional data collection.
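The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration only: the `LogicElement` structure, the `data_gaps` helper and all entries (loosely based on the school literacy example earlier in this section) are hypothetical and are not part of the Evaluation work plan template.

```python
from dataclasses import dataclass, field


@dataclass
class LogicElement:
    """One box in a program logic (hypothetical structure for illustration)."""
    stage: str                # e.g. "input", "activity", "output", "outcome"
    description: str
    data_sources: list = field(default_factory=list)  # existing data mapped to this box


def data_gaps(elements):
    """Return the elements that have no existing data source mapped to them."""
    return [e for e in elements if not e.data_sources]


# Invented entries: each program logic element with the data already available for it.
program_logic = [
    LogicElement("input", "Funding for books", ["finance system"]),
    LogicElement("activity", "Deliver books to schools", ["delivery records"]),
    LogicElement("output", "Books used in class", []),
    LogicElement("outcome", "Improved literacy", ["standardised test results"]),
]

gaps = data_gaps(program_logic)
for element in gaps:
    print(f"No data yet for {element.stage}: {element.description}")
```

Elements returned by `data_gaps` are candidates for additional data collection; in practice the priority given to each gap would also depend on the cost of collection and how critical the element is to the causal chain.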
What does a good program logic look like?
There is no one way to represent a program logic – the test is whether it is a representation of the program's causal links, and whether it communicates effectively with the intended audience by making sense and helping them understand the program. See examples in Program logic library.
Table 9 provides guidance on the aspects required for a good program logic, and explains the different criteria for ‘requires improvement’, ‘satisfactory’ and ‘good’. This may be refined over time in response to user feedback to ensure it is appropriate to a Territory Government context.
See section 2.10: Reviewing the evaluation work plan for suggestions on what to look for when reviewing an established program logic.
Table 9: Program logic rubric
|Section of program logic||Requires improvement||Satisfactory||Good (includes all satisfactory criteria plus those listed below)|
|Inputs and participation|
|Activities and/or outputs|
External factors and |
1. SMART: Specific, Measurable, Attainable, Relevant and Time-bound.
Source: Department of Industry, Innovation and Science (2017)
Table 10: Program logic library
|Program logic source||Comments|
|Australian Institute of Family Studies Blairtown example program logic||A hypothetical program aiming to ensure children reach appropriate developmental milestones. Includes assumptions and external factors.|
|Evidence-Based Programs and Practice in Children and Parenting Support Programs||A project supporting nine Children and Parenting Support services in regional and rural NSW to enhance their use of evidence-based programs and practice. Includes assumptions and external factors.|
|National Forum on Youth Violence Prevention||A program that aims to maximise the use of city partnerships and increase the effectiveness of federal agencies to reduce youth violence. Includes assumptions and external factors.|
|Australian Policy Service Policy Hub Evaluation Ready example program logic: Save Our Town||A hypothetical program aimed at stimulating private sector investment, population growth, and economic expansion and diversification to increase a region’s viability. Includes assumptions and external factors.|
|University of Michigan Evaluation Resource Assistant example program logic||A hypothetical program aimed at reducing rates of child abuse and neglect. Does not include assumptions and external factors, however it is a good example of how to use a program logic to prioritise key evaluation questions and indicators.|
|The Goldilocks Toolkit Case Studies||Program logics and lessons learned from a range of international social programs including Acumen, GiveDirectly, Digital Green, Root Capital, Splash, TulaSalud, Women for Women International and One Acre Fund. The program logics do not include assumptions and external factors but the case studies provide examples of how to reduce evaluation costs through a good monitoring system.|
|Geo-Mapping for Energy & Minerals||Appendix A of this evaluation report by Natural Resources Canada has a program logic for a program improving regional geological mapping for responsible resource exploration and development. Does not include assumptions and external factors but does have an evaluation matrix to show how the evaluation questions will be addressed.|
|The Logic Model Guidebook: Better Strategies for Great Results||Program logics from a Community Leadership Academy (see page 10, and a marked up version on page 56) and a Health Improvement program (see page 39 and a marked up version on page 57).|
|National Framework for Universal Child and Family Health Services||Figure 5 of this report is a program logic for universal child and family health services. Does not include assumptions and external factors but is an example of a different way of adapting the layout to suit the program.|
|National Ice Action Strategy evaluation||A program logic (Appendix 1) that uses different colours to clearly show the link between activities, outputs and outcomes against objectives. Includes timeframes against outcomes.|
2.5.2. Evaluation questions
Across the program cycle, evaluations need to include a range of questions that promote accountability for public funding and learning from program experiences. These questions need to align with the program logic and will form the basis of the terms of reference. Evaluation questions may be added to or amended closer to evaluation commencement to account for changes in policy context, key stakeholders, or performance indicators.
Different types of questions need different methods and designs to answer them. In evaluations there are four main types of questions: descriptive, action, causal and evaluative.
Descriptive questions ask about what has happened or how things are. For example:
- What were the resources used by the program directly and indirectly?
- What activities occurred?
- What changes were observed in conditions or in the participants?
Descriptive questions might relate to:
- Inputs – materials, staff.
- Processes – implementation, research projects.
- Outputs – for example, research publications.
- Outcomes – for example, changes in policy on the basis of research.
- Impacts – for example, improvements in agricultural production.
Action questions ask about what should be done to respond to evaluation findings. For example:
- What changes should be made to address problems that have been identified?
- What should be retained or added to reinforce existing strengths?
- Should the program continue to be funded?
Causal questions ask about what has contributed to changes that have been observed. For example:
- What produced the outcomes and impacts?
- What was the contribution of the program to producing the changes that were observed?
- What other factors or programs contributed to the observed changes?
What is the difference between correlation and causation?
Two variables are classified as correlated if both increase and decrease together (positively correlated) or if one increases and the other decreases (negatively correlated). Correlation analysis measures how close the relationship is between the two variables.
Causal questions need to investigate whether programs are causing the outcomes that are observed. Although there may be a strong correlation between two variables, for example the introduction of a new program and a particular outcome, this correlation does not necessarily mean the program is directly causing the outcome. See Box 1 for an example.
Box 1: Evaluating to improve resource allocations for family planning and fertility in Indonesia
Indonesia’s innovative family planning efforts gained international recognition in the 1970s for their success in decreasing the country’s fertility rates. The acclaim arose from two parallel phenomena: (1) fertility rates declined by 22% between 1970 and 1980, by 25% between 1981 and 1990, and somewhat more moderately between 1991 and 1994; and (2) during the same period, the Indonesian government substantially increased resources allocated to family planning (particularly contraceptive subsidies).
Given that the two things happened concurrently, many concluded that increased investment in family planning had led to lower fertility rates. Unconvinced by the available evidence, a team of researchers evaluated the impact of family planning programs on fertility rates and found, contrary to what was generally believed, that family planning programs only had a moderate impact on fertility, with changes in women’s status deemed to have a larger impact on fertility rates.
The researchers noted that before the start of the family planning program very few women of reproductive age had finished primary education. During the same period as the family planning program, however, the Indonesian government undertook a large-scale education program for girls. By the end of the program, women entering reproductive age had benefited from the additional education. When the oil boom brought economic expansion and increased demand for labour in Indonesia, the participation of educated women in the labour force increased significantly. As the value of women’s time at work rose, so did the use of contraceptives. In the end, higher wages and empowerment explained 70% of the observed decline in fertility—more than the investment in family planning programs.
These evaluation results informed policy makers’ subsequent resource allocation decisions: funding was reprogrammed away from contraception subsidies and towards programs that increased the enrolment of women in school. Although the ultimate goals of the two programs were similar, evaluation studies had shown that in the Indonesian context, lower fertility rates could be obtained more effectively by investing in education than by investing in family planning.
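The pattern in Box 1, where a third factor drives both variables, can be illustrated with a small calculation. The sketch below uses invented numbers (not the Indonesian data): a hypothetical confounder `z` raises `x` and lowers `y`, so the two are strongly correlated overall, yet within each level of `z` there is no relationship between them. The `pearson` helper is written out here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: z is the confounder (three levels), x rises with z, y falls with z.
# Within each level of z, y varies independently of x.
within_pattern = [0.1, -0.1, -0.1, 0.1]
strata = {}
for z in range(3):
    xs = [10 * z + i for i in range(4)]
    ys = [6 - 2 * z + d for d in within_pattern]
    strata[z] = (xs, ys)

all_x = [x for xs, _ in strata.values() for x in xs]
all_y = [y for _, ys in strata.values() for y in ys]

overall = pearson(all_x, all_y)
within = {z: pearson(xs, ys) for z, (xs, ys) in strata.items()}

print(f"Overall correlation: {overall:.2f}")
for z, r in within.items():
    print(f"Correlation within z={z}: {r:.2f}")
```

The strong overall correlation arises entirely from the confounder. Comparing like with like (here, within each level of `z`) is the basic idea behind the counterfactual strategies described below.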
There are many designs and methods to answer causal questions but they usually involve one or more of these strategies:
(a) Compare results to an estimate of what would have happened if the program had not occurred (this is known as a counterfactual)
This might involve creating a control group (where people or sites are randomly assigned to either participate or not) or a comparison group (where those who participate are compared to others who are matched in various ways). Techniques include:
- Randomised controlled trial (RCT): a control group is compared to one or more treatment groups.
- Matched comparison: participants are each matched with a non participant on variables that are thought to be relevant. It can be difficult to adequately match on all relevant criteria.
- Propensity score matching: create a comparison group based on an analysis of the factors that influenced people’s propensity to participate in the program.
- Regression discontinuity: compares the outcomes of individuals just below the cut-off point with those just above the cut-off point.
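As a rough illustration of the first technique, the sketch below randomly assigns units to a treatment or control group and estimates the program effect as the difference in mean outcomes. It is a simplified sketch under stated assumptions: the function names, the seed, the baseline values and the 5-point program effect are all invented, and a real trial would also need to address sample size, ethics and statistical uncertainty.

```python
import random
from statistics import mean


def assign_groups(unit_ids, seed=0):
    """Randomly split units into two equal-sized groups (treatment, control)."""
    rng = random.Random(seed)   # fixed seed so the assignment is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def estimated_effect(outcomes, treatment, control):
    """Difference in mean outcome between the treatment and control groups."""
    treated_mean = mean(outcomes[u] for u in treatment)
    control_mean = mean(outcomes[u] for u in control)
    return treated_mean - control_mean


# Invented data: 100 units with varying baseline outcomes.
units = range(100)
baseline = {u: 50 + (u % 7) for u in units}

treatment, control = assign_groups(units)

# Assumed true program effect of 5 points, applied only to the treatment group.
TRUE_EFFECT = 5.0
outcomes = {u: baseline[u] + (TRUE_EFFECT if u in treatment else 0.0) for u in units}

effect = estimated_effect(outcomes, treatment, control)
print(f"Estimated effect: {effect:.2f} (true effect assumed to be {TRUE_EFFECT})")
```

Because assignment is random, the control group's mean outcome serves as the estimate of what would have happened to the treatment group without the program. The estimate will differ slightly from the assumed true effect because the two groups' baselines are not perfectly balanced.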
(b) Check for consistency of the evidence with the theory of how the intervention would contribute to the observed results
This can involve checking that intermediate outcomes have been achieved, using process tracing to check each causal link in the theory of change, identifying and following up anomalies that don’t fit the pattern, and asking participants to describe how the changes came about. Techniques include:
- Contribution analysis: sets out the theory of change that is understood to produce the observed outcomes and impacts and then searches iteratively for evidence that will either support or challenge it.
- Key informant attribution: asks participants and other informed people about what they believe caused the impacts and gathers information about the details of the causal processes.
- Qualitative comparative analysis: compares different cases to identify the different combinations of factors that produce certain outcomes.
- Process tracing: a case-based approach to causal inference which focuses on the use of clues within a case (causal-process observations) to adjudicate between alternative possible explanations. It involves checking each step in the causal chain to see if the evidence supports, fails to support or rules out the theory that the program or project produced the observed impacts.
- Qualitative impact assessment protocol: combines information from relevant stakeholders, process tracing and contribution analysis, using interviews undertaken in a way to reduce biased narratives.
(c) Identify and rule out alternative explanations
This can involve a process to identify possible alternative explanations (perhaps involving interviews with program sceptics and critics, and drawing on previous research and evaluation, as well as interviews with participants) and then searching for evidence that can rule them out.
While technical expertise is needed to choose the appropriate option for answering causal questions, the program manager should be able to check that an explicit approach is being used, and seek technical review of its appropriateness from third parties where necessary.
Further guidance and options for measuring causal attribution can be found in the UNICEF Impact evaluation series.
Evaluative questions ask whether an intervention can be considered a success, an improvement or the best option and require a combination of explicit values as well as evidence – for example:
- In what ways and for whom was the program successful?
- Did the program provide value for money, taking into account all the costs incurred (not only the direct funding) and any negative outcomes?
Many evaluations do not make explicit how evaluative questions will be answered – what the criteria will be (the domains of performance), what the standard will be (the level of performance that will be considered adequate or good), or how different criteria will be weighted. A review of the design could check each of these in turn:
- Are there clear criteria for this evaluative question?
- Are there clear standards for judging the quality of performance on each criterion?
- Is there clarity about how to synthesise evidence across criteria? Is there a performance framework that explains what “how good” or “how well” or “how much” mean in practice? For example, is it better to have some improvement for everyone or big improvements for a few?
- Are the criteria, standards and approach to synthesis appropriate? What has been their source? Is further review of these needed? Who should be involved?
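One way to make criteria, standards and weighting explicit is to write them down as a simple scoring rule. The sketch below is illustrative only: the 1 to 4 scale, the weights, the cut-offs and the `synthesise` function are assumptions made for this example, and numeric weighting is just one of several possible approaches to synthesis. The rating labels borrow the wording of the program logic rubric in Table 9.

```python
def synthesise(scores, weights, standards):
    """Combine criterion scores into a weighted overall score and a rating.

    scores    -- criterion name -> score awarded (assumed 1 to 4 scale)
    weights   -- criterion name -> relative importance
    standards -- list of (minimum overall score, label), highest minimum first
    """
    total_weight = sum(weights.values())
    overall = sum(scores[c] * weights[c] for c in weights) / total_weight
    for minimum, label in standards:
        if overall >= minimum:
            return overall, label
    return overall, standards[-1][1]   # below every cut-off: lowest rating


# Invented example: effectiveness is weighted twice as heavily as the other criteria.
scores = {"appropriateness": 3, "effectiveness": 2, "efficiency": 4}
weights = {"appropriateness": 1, "effectiveness": 2, "efficiency": 1}
standards = [(3.0, "good"), (2.0, "satisfactory"), (0.0, "requires improvement")]

overall, rating = synthesise(scores, weights, standards)
print(f"Overall score {overall:.2f}: {rating}")
```

Writing the rule down in this way makes it easy for a reviewer to see, and challenge, which criteria dominate the overall judgement and where the thresholds sit.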
Ideally, an evaluation design will be explicit about each of these, including the source of the criteria and standards. The BetterEvaluation website has further information on Evaluation methods for assessing value for money, and Oxford Policy Management’s approach to assessing value for money is particularly suited to complex interventions.
Key evaluation questions
To clarify the purpose and objectives of an evaluation, there should be a limited number of higher order key evaluation questions (roughly 5 to 7 questions) addressing:
- appropriateness – to what extent does the program address an identified need?
- effectiveness – to what extent is the program achieving the intended outcomes, in the short, medium and long term?
- efficiency – do the outcomes of the program represent value for money?
These key evaluation questions are high-level research topics that can be broken down into detailed sub‑questions, each addressing a particular aspect. The key evaluation questions are not yes or no questions.
Key evaluation questions often contain more than one type of evaluation question – for example to answer “How effective has the program been?” requires answering:
- descriptive questions – What changes have occurred?
- causal questions – What contribution did the intervention make to these changes?
- evaluative questions – How valuable were the changes in terms of the stated goals, taking into account types of changes, level of change and distribution of changes?
A way to test the validity and scope of evaluation questions is to ask: when the evaluation has answered these questions, have we met the full purpose of the evaluation?
See also section 4.2.1 Evidence synthesis.
2.5.3. Types of evaluation
- Process evaluation: considers program design and initial implementation (≤18 months).
- Outcome evaluation: considers program implementation (>2 years) and short to medium term outcomes.
- Impact evaluation: considers medium to long term outcomes (>3 years), and whether the program contributed to the outcomes and represented value for money.
These three evaluation types address different questions at various stages of the program lifecycle, with each evaluation building on the evidence from the previous evaluation (Figure 3). Not all programs will require all three evaluation types. The evaluation overview, completed as part of the Cabinet submission process, will specify which evaluation types are necessary for each program. The different types of evaluation are used to build a clearer picture of program effectiveness as the program matures (Figure 4).
Figure 3: Different types of evaluations consider different aspects of the program
Figure 4: Evaluations over the program lifecycle of a major program
A process evaluation investigates whether the program is being implemented according to plan. Where a program appears ineffective, this type of evaluation can help to distinguish implementation failure (where the program has not been adequately implemented) from theory failure (where the program was adequately implemented but did not produce the intended impacts). As an ongoing evaluative strategy, it can be used to continually improve programs by informing adjustments to delivery.
Process evaluations may be undertaken by the relevant program team, if they have appropriate capability.
A process evaluation will typically try to answer questions such as:
- Was the program implemented in accordance with the initial program design?
- Was the program rollout completed on time and within the approved budget?
- Are there any adjustments to the implementation approach that need to be made?
- Are more or different key performance indicators required?
- Is the right data being collected in an efficient way?
An outcome evaluation assesses progress in early to medium-term results that the program is aiming to achieve. It is suited to programs at a business as usual stage in the program lifecycle and is usually externally commissioned.
An outcome evaluation will typically try to answer questions such as:
- What early outcomes or indications of future outcomes are suggested by the data?
- Did the program have any unintended consequences, positive or negative? If so, what were those consequences? How and why did they occur?
- How ready is the program for an impact evaluation?
There is an important distinction between measuring outcomes, which is a description of the factual, and using a counterfactual to attribute observed outcomes to the intervention. A good outcome evaluation should consider whether the program has contributed to the outcome, noting this becomes easier over time and is therefore more of a focus in impact evaluations (see Figure 5).
Figure 5 illustrates the impact measures and change in outcomes for those affected by a program (blue line) compared to the alternative outcomes had the program not existed (orange line). As impact generally increases over time, it tends to be measured later in the program lifecycle.
Figure 5: Impact of program outcomes over time
An impact evaluation builds on an outcome evaluation to assess longer-term results. It must test whether the program has made a difference by comparing observed outcomes with what would have happened in the absence of the program (further guidance in Causal questions). In situations where it is not possible or appropriate to undertake a rigorous impact evaluation, it may be better to monitor, learn and improve through process and/or outcome evaluations.
As impact is the change in outcomes compared to the alternative outcomes had the program not existed, it is usually easier to measure impact later in the lifecycle of the program (see Figure 5). These evaluations commonly occur at least three years after program implementation. However, the appropriate timing for measuring impact will depend on the program and needs to be decided on a case-by-case basis.
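The definition above can be written as simple arithmetic: impact at each point in time is the observed outcome minus the estimated counterfactual outcome. The sketch below uses made-up values chosen only to mimic the widening gap described for Figure 5. In practice the counterfactual cannot be observed directly and has to be estimated through an appropriate evaluation design.

```python
# Illustrative (made-up) outcome levels at years 1-4 after implementation.
observed = {1: 52.0, 2: 56.0, 3: 61.0, 4: 67.0}        # with the program
counterfactual = {1: 51.0, 2: 52.0, 3: 53.0, 4: 54.0}  # estimated without it


def impact_by_year(observed: dict[int, float], counterfactual: dict[int, float]) -> dict[int, float]:
    """Impact = observed outcome minus counterfactual outcome, year by year."""
    return {year: observed[year] - counterfactual[year] for year in sorted(observed)}


impacts = impact_by_year(observed, counterfactual)
for year, impact in impacts.items():
    print(f"Year {year}: impact = {impact:.1f}")
```

With these illustrative numbers the estimated impact grows each year, which is why a small early difference may be hard to distinguish from noise and impact tends to be measured later.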
An impact evaluation will typically try to answer questions such as:
- Were the intended outcomes achieved as set out in the program’s aims and objectives?
- Have other investments influenced the attainment of the program’s aims and objectives? If so, in what way?
- Did the program contribute to achieving the outcomes as anticipated? If so, to what extent?
- Were there any unintended consequences?
- What would have been the situation had the program not been implemented?
- To what extent did the benefits of the program outweigh the costs?
- Did the program represent good value for money?
- Was the program delivered cost-effectively?
Impact evaluations are usually externally commissioned due to their complexity and are generally reserved for high-risk and complex programs due to their cost. The design options for an impact evaluation need significant investment in preparation and early data collection. It is important that impact evaluation is addressed as part of the integrated monitoring and evaluation approach outlined in the evaluation work plan. This will ensure that data from other monitoring, and the process and outcome evaluations can be used, as needed. Equity concerns may require an impact evaluation to go beyond simple average impacts to identify for whom and in what ways the program has impacted outcomes (further guidance in section 2.8: Ethical considerations).
Impact evaluations usually include a value-for-money assessment to determine whether the benefits of the program outweighed the costs and whether the outcomes could have been achieved more efficiently through program efficiencies or a different approach. Value for money in this context is broader than a cost-benefit analysis: it is a question of how well resources have been used and whether the use is justified (further guidance in Evaluative questions).
External or internal evaluation
Evaluations can either be commissioned externally to an appropriate consultant or academic evaluator, conducted internally by agency staff or conducted using a hybrid model of an internal evaluator supported by an external evaluator:
- External evaluator(s): one external evaluator serves as team leader and is supported by program staff
- Internal evaluator(s): one internal evaluator serves as team leader and is supported by program staff
- Hybrid model: an internal evaluator serves as team leader and is supported by other internal evaluators and program staff, as well as external evaluator(s).
If an external evaluator is hired to conduct the evaluation, the program manager and other agency staff still need to be involved in the evaluation process. Program staff are not only primary users of the evaluation findings but are also involved in other evaluation-related tasks (such as providing access to records or educating the evaluator about the program). Be realistic about the amount of time needed for this involvement so staff schedules do not get over-burdened.
The decision to conduct an evaluation internally or commission an external evaluation is usually a decision for the agency’s accountable officer. However, as a general best-practice guide, outcome or impact evaluations of high tier programs should be externally evaluated. It is advisable to engage an external evaluator/evaluation team when:
- the scope and/or complexity of the evaluation requires expertise that is not internally available
- a program or project is politically sensitive and impartiality is a key concern
- internal staff resources are scarce and timeframes are particularly pressing (that is, there is little flexibility in terms of evaluation timing).
Table 11: Internal versus external evaluators outlines the trade-offs between internal and external evaluators.
Table 11: Internal versus external evaluators
|Component||Internal evaluator(s)||External evaluator(s)|
|Perspective||May be more familiar with the community, issues and constraints, data sources, and resources associated with the project/program (they have an insider's perspective).||May bring a fresh perspective, insight, broader experience, and recent state-of-the-art knowledge (they have an outsider's perspective).|
|Knowledge and skills||Are familiar with the substance and context of research for development programming.||May possess knowledge and skills that internal evaluators are lacking. However, it may be difficult to find evaluators who understand the specifics of research for development programming.|
|Buy-in||May be more familiar with the project/program staff and may be perceived as less threatening. In some contexts, may be seen as too close and participants may be unwilling to provide honest feedback.||May be perceived as intrusive or a threat to the project/program (perceived as an adversary). Alternatively, they may be considered impartial and participants may be more comfortable providing honest feedback.|
|Stake in the evaluation||May be perceived as having an agenda/stake in the evaluation.||Can serve more easily as an arbitrator or facilitator between stakeholders because they are perceived as neutral.|
|Credibility||May be perceived as biased because they are ‘too close’ to the subject matter, which may reduce the credibility of the evaluation and hinder its use.||May provide a view of the project/program that is considered more objective and give the findings more credibility and potential for use.|
|Resources||May use considerable staff time, which is always in limited supply, especially when their time is not solely dedicated to the evaluation.||May be more costly and still involve substantial management/staff time from the commissioning organisation.|
|Follow-up/use of evaluation findings||More opportunity and authority to follow up on recommendations of the evaluation.||Contracts often end with the delivery of the final product, typically the final evaluation report, which limits or prohibits follow-up. As outsiders, do not have authority to require appropriate follow-up or action.|
2.5.4. Data matrix
A data matrix outlines the sources and types of data that will need to be collected by the program team as part of monitoring, as well as by the evaluator at the time of the evaluation, to ensure that the evaluation questions can be answered. The data matrix should indicate which evaluations will address which questions. Not every evaluation needs to address all the evaluation questions; however, all questions should be addressed over the entire evaluation plan (process, outcome and impact evaluations). See section 2.5.2. Evaluation questions for further information on which questions are addressed in the different types of evaluations.
A data matrix can also help focus data collection to ensure that only relevant data is collected. Data collection has real costs in terms of staff time and resources as well as time it asks of respondents. It is important to weigh the costs and benefits of data collection activities to find the right balance.
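One way to picture a data matrix is as a small table linking each evaluation question to the evaluations that will address it and the data sources it needs. The sketch below is illustrative only: the field names, questions and data sources are hypothetical, and the Evaluation work plan template remains the reference for the actual layout. It shows two simple checks, whether every question is covered somewhere in the plan and which data sources are actually needed.

```python
# Hypothetical data matrix: each key evaluation question lists the evaluations
# that will address it and the data sources needed. All entries are illustrative.
data_matrix = [
    {"question": "Was the program implemented as designed?",
     "evaluations": ["process"],
     "sources": ["program records", "staff interviews"]},
    {"question": "What early outcomes are suggested by the data?",
     "evaluations": ["outcome"],
     "sources": ["monitoring data", "participant survey"]},
    {"question": "Did the program represent good value for money?",
     "evaluations": [],  # not yet assigned to any evaluation
     "sources": ["cost data"]},
]


def unaddressed_questions(matrix: list[dict]) -> list[str]:
    """Return questions that no planned evaluation currently addresses."""
    return [row["question"] for row in matrix if not row["evaluations"]]


def sources_to_collect(matrix: list[dict]) -> list[str]:
    """Return the distinct data sources the matrix calls for, in sorted order."""
    return sorted({source for row in matrix for source in row["sources"]})


print(unaddressed_questions(data_matrix))
print(sources_to_collect(data_matrix))
```

Any data collection activity that does not appear in the second list is a candidate for dropping, which is one way of keeping collection focused on relevant data.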
 M. K. Gugerty, D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.
 S. C. Funnell, P. J. Rogers, Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, 2011 [VitalSource Bookshelf version].
 For example see Figure 11, page 66, Ending violence against women and girls: Evaluating a decade of Australia’s development assistance.
 Adapted from S. C. Funnell, P. J. Rogers, Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, 2011 [VitalSource Bookshelf version] and M. K. Gugerty, D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.
 DIIS Evaluation Strategy 2017-2021
 2011, ‘Impact Evaluation in Practice’, World Bank.
 For a good example of a low-cost impact evaluation using regression discontinuity, see the Root Capital Case Study in the Goldilocks Toolkit .
 Further guidance on developing performance frameworks including examples is on pg 6-11 of Evaluation Building Blocks – A Guide by Kinnect Group
 For information about other evaluation types, please see BetterEvaluation.
 Further information on these three evaluation types is in sections 3.2.1 to 3.2.3.
 Adapted from the Department of Industry, Innovation and Science Evaluation Strategy, 2017-2021.
 Adapted from NSW Government Evaluation Framework 2013.
 BetterEvaluation Manager’s Guide to Evaluation, accessed May 2020, Develop agreed key evaluation questions.
 Rogers, P. et al. (2015), Choosing appropriate designs and methods for impact evaluation, Office of the Chief Economist, Australian Government, Department of Industry, Science, Energy and Resources.
 BetterEvaluation: Themes.
 UNICEF Impact Evaluation Series, webinar 5 – RCTs.
 Adapted from BetterEvaluation: Themes – Impact evaluation.
 Value for money can also be considered in earlier evaluations, if there is sufficient evidence. As the evidence for a program accumulates, so will the expectation for an assessment of value for money.
When designing a program, it is important to develop an estimate of the resources that are available for evaluation and what will be required to do the evaluation well.
The resources needed for an evaluation include:
- existing data
- funding to engage an external evaluator, evaluation team or for specific tasks to be undertaken and for materials and travel
- staff, program partners, technical experts and members of the wider community with the time, expertise and willingness to be involved, whether as part of the evaluation team, in evaluation governance and/or as data sources.
When considering data availability, look carefully at the quality of existing data and what format it is in. Also clarify the skills and availability of any people who will need to be involved in the evaluation.
There are a few ways to estimate the budget for an external evaluation:
- Calculating a percentage of the program or project budget – sometimes 1–5%: This is a crude rule of thumb approach. Large government programs with simple evaluation requirements may be around 1%; smaller government programs with more complex evaluations – for example, detailed testing and documentation of an innovation – may be around 5%.
- Developing an estimate of days needed and then multiplying by the average daily rate of an external evaluator: This can be useful for simple evaluations, especially those using a small team and a standardised methodology such as a few days of document review, a brief field visit for interviews and then a short period for report write up.
- Using the average budget for evaluations of a similar type and scope: This can be a useful starting point for budget allocation providing that the amounts have been shown to be adequate (see Table 12 in section 1.2: How will the program achieve this?).
- Developing a draft design and then costing it, including collection and analysis of primary data: This can be done as a separate project before the actual evaluation is contracted but will usually require staff with prior evaluation experience.
Estimate the costs of collecting and analysing the data, as well as the project management and reporting time needed. Allow time to secure resources (for example, including them in an annual or project budget, or seeking someone with particular expertise). If ongoing evaluation input is needed, consider a staged approach to funding.
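The first two estimation approaches listed above are simple arithmetic and can be compared side by side. The sketch below is a rough illustration only: the program budget, the number of days and the daily rate are hypothetical inputs, not figures from this toolkit.

```python
def percentage_estimate(program_budget: float, share: float) -> float:
    """Rule-of-thumb estimate: a share (for example 0.01 to 0.05) of the program budget."""
    if not 0 < share <= 1:
        raise ValueError("share must be a fraction between 0 and 1")
    return program_budget * share


def day_rate_estimate(days: float, daily_rate: float) -> float:
    """Estimate from days of evaluator effort multiplied by an average daily rate."""
    return days * daily_rate


# Hypothetical inputs: a $2,000,000 program, 30 evaluator days at $1,500 per day.
low = percentage_estimate(2_000_000, 0.01)
high = percentage_estimate(2_000_000, 0.05)
by_days = day_rate_estimate(30, 1_500)
print(f"Percentage method: ${low:,.0f} to ${high:,.0f}")
print(f"Day-rate method:   ${by_days:,.0f}")
```

If the two methods give very different answers, that is a prompt to look more closely at the scope of the evaluation rather than a reason to prefer either number.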
Table 12: Estimated costs for evaluation services
|Evaluation services||Scale of the program||Estimated cost|
|Design and planning for evaluation|
|Capability training for internal evaluation teams (one day workshop)||any scale||$5,000–10,000|
|Facilitate internal development of a program logic and the outcomes to be targeted by the recovery program||any scale||$5,000–10,000|
|Evaluation of needs/needs analysis for evaluation design||large scale||$20,000–30,000|
|Evaluation of needs/needs analysis for evaluation design||small or mid scale||$10,000–20,000|
|Developing outcome indicators and a plan for measuring and monitoring progress toward outcomes (including planning workshop)||any scale||$5,000–15,000|
|Developing a plan for measuring and monitoring progress toward outcomes, including development of indicators||large scale||$15,000–20,000|
|Developing a plan for measuring and monitoring progress toward outcomes, including development of indicators||small or mid scale||$10,000–15,000|
|Preparing an evaluation plan for a full outcome evaluation||any scale||$25,000–35,000|
|Supporting an internal team to develop an evaluation plan for a full outcome evaluation (providing advice, reviewing documents, providing material and resources, small workshops)||any scale||$5,000–15,000|
|Providing ongoing evaluation support and advice to an internal evaluation team||any scale||$10,000–20,000|
|Conducting process and/or outcome evaluation|
|Process evaluation||large scale||$65,000–90,000|
|Process evaluation||small or mid scale||$50,000–70,000|
|Support and advice for an internally-led process review||any scale||$20,000–30,000|
|Interim evaluation of program process and progress toward outcomes||large or mid scale||$70,000–100,000|
|Conduct a full outcome evaluation||small scale||$50,000–80,000|
|Conduct a full outcome evaluation with multiple components||large scale||Over $175,000|
|Outcome evaluation of a component of a larger scale program (for example, social wellbeing, business recovery)||large or mid scale||$50,000–125,000|
2.6.1. Evaluation on a shoestring
If the resources required for the evaluation are more than the resources available, additional resources will need to be found and/or strategies used to reduce the resources required. A hybrid approach to evaluation (where an evaluation is delivered using internal resources with support from specialist providers) can help keep evaluation costs down and build internal capability. Careful targeting of the evaluation within the context of existing evidence can also help keep the costs of evaluation down.
It is not feasible or appropriate to try to evaluate every aspect of a program. As such, evaluations need scope boundaries and a focus on key issues. For example:
- a program evaluation might look at implementation in the past three years, rather than since commencement
- a program evaluation could look at performance in particular regions or sites rather than across the whole Territory
- an outcome evaluation may focus on outcomes at particular levels of the program logic or for particular components of the program
- a process evaluation may focus on the activities of particular stakeholders, such as frontline staff, or interagency coordination.
Table 13: Possible options for reducing evaluation costs
|Cost reduction options||Risks||How to manage the risks|
|Reduce the number of key evaluation questions||Evaluation may no longer meet the needs of the primary intended users||Carefully prioritise the key evaluation questions. Review whether the evaluation is still worth doing.|
|Reduce sample sizes||Reduced accuracy of estimates||Check these will still be sufficiently credible and useful through data rehearsal (mock-ups of tables and graphs showing the type of data the evaluation could produce)|
|Make more use of existing data||May mean that insufficiently accurate or relevant data are used; cost savings may be minimal if data are not readily accessible||This is only appropriate when the relevance, quality and accessibility of the existing data are adequate; check this is the case before committing to use it|
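To illustrate the trade-off between sample size and accuracy noted in Table 13, the sketch below uses the standard approximation for the 95% margin of error of a proportion. It assumes a simple random sample from a large population, and the sample sizes are hypothetical. A real design may need a different calculation, for example for clustered samples or small populations.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical sample sizes being considered to reduce survey costs.
for n in (400, 200, 100):
    print(f"n = {n}: about +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions, cutting the sample from 400 to 100 roughly doubles the margin of error. A data rehearsal can then test whether estimates that imprecise would still be credible and useful to the primary intended users.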
 Personal communication from Dr George Argyrous (Manager, Education and Research, Institute for Public Policy and Governance, University of Technology Sydney) based on evaluation costs in New South Wales.
 In the TulaSalud Case Study (part of the Goldilocks Toolkit), Innovations for Poverty Action noted that the efficacy of the program’s practices was documented in medical research. Therefore, they recommended that the evaluation focus on the training of community health workers and their ability to use the system, because this was more relevant and less burdensome operationally than an assessment of the platform’s effect on health outcomes.
Key stakeholders in an evaluation are likely to include senior management in the agency, program managers, program partners, service providers, program participants and peak interest groups (for example, representing industries, program beneficiaries).
Involving stakeholders during evaluation planning and implementation can add value by:
- providing perspectives on what will be considered a credible, high quality and useful evaluation
- contributing to the program logic and framing of key evaluation questions
- facilitating quality data collection
- helping to make sense of the data that has been collected
- increasing the utilisation of the evaluation’s findings by building knowledge about and support for the evaluation.
Once the evaluation is completed, stakeholders need to be informed of any lessons learned and recommendations (see section 6.1. Communicating evaluation results for further information).
It can be useful to map significant stakeholders and their actual or likely questions. See the Remote Engagement and Coordination Strategy for specific guidance within a Territory Government context. There is also useful guidance and templates in the Remote Engagement and Coordination Online Toolkit.
All evaluations should take into account appropriate ethical considerations. Program managers should undertake an assessment of ethical risk against guidelines such as those produced by the National Health and Medical Research Council (NHMRC) and the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), to determine if formal ethics review processes are required. The Australasian Evaluation Society has produced Guidelines for the Ethical Conduct of Evaluation, which members are obliged to abide by.
Specific information on ethical conduct for evaluation in Aboriginal and Torres Strait Islander settings can be found in the Productivity Commission’s A Guide to Evaluation Under the Indigenous Evaluation Strategy, BetterEvaluation’s Ethical Protocol for evaluation in Aboriginal and Torres Strait Islander settings and NHMRC’s Ethical conduct in research with Aboriginal and Torres Strait Islander Peoples and communities: Guidelines for researchers and stakeholders. Other relevant Territory Government resources include Northern Territory Health’s Aboriginal Cultural Security Framework and the Remote Engagement and Coordination Strategy.
It is important to consider the appropriate timeframes and budget required, acknowledging that there are often additional resource implications associated with evaluations that have specific ethical and cultural requirements.
Key ethical questions to consider
Conduct of the evaluators:
- Do evaluators possess the appropriate knowledge, abilities and skills to undertake the proposed tasks?
- Have evaluators disclosed any potential conflicts of interest?
- Are all evaluators fully informed of what is expected in terms of their ethical and cultural safety responsibilities and adherence to protocols?
Integrity of the evaluation process:
- Does the evaluation design ensure data is valid, reliable and appropriate?
- Have the limitations and strengths of the evaluation been identified?
- How will the source of evaluative judgements be identified and presented?
- How will results be reported and communicated in a way that all stakeholders can easily understand?
- How will the evaluation findings be utilised and how will they impact implementation?
Respect and protection of participants:
- How will participants be engaged throughout the evaluation process?
- How will participants’ contributions of information, knowledge and time be respectfully recognised?
- Are there potential effects of inequalities related to race, age, gender, sexual orientation, physical or intellectual ability, religion, socioeconomic or ethnic background that need to be taken into account when analysing the data?
- How will participants’ informed consent be obtained?
- What confidentiality arrangements have been put in place?
- Will the evaluation involve interviews or focus groups that may raise potential trauma?
When is external ethics review required?
NHMRC identifies triggers for ethical review, including:
- activities that potentially infringe the privacy or professional reputation of participants, providers or organisations
- secondary use of data – using data or analysis from quality assurance (QA) or evaluation activities for another purpose
- gathering information about the participant/s beyond that collected routinely. Information may include biospecimens or additional investigations
- testing of non-standard (innovative) protocols or equipment
- comparison of cohorts
- randomisation or the use of control groups or placebos
- targeted analysis of data involving minority/vulnerable groups whose data is to be separated out of data collected or analysed as part of the main QA/evaluation activity.
“Where one or more of the triggers above apply, the guidance provided in the National Statement on Ethical Conduct in Human Research, 2007 (National Statement) should be followed.”
This section should articulate the risks or limitations that the evaluation faces, not the risks of the program in general. If significant mitigatable risks are identified, the risk assessment plan will help program managers to implement appropriate controls.
In terms of risks associated with the accuracy of the program logic, one way to combat potential overconfidence and realistically assess risk is to imagine program failure and then think through how that failure would happen. It may also be useful to review previous evaluations from a similar program to identify lessons learned and how they may apply to this evaluation.
Risk categories may include: stakeholder engagement and support, technology, data, funding, timeframes, regulatory or ethical issues, physical or environmental issues.
Table 14: Example risk assessment plan
|Poor stakeholder participation in research||The evaluation would lack descriptive information about perceptions||Possible||Moderate||Moderate||A variety of methods (intercept surveys, focus groups, telephone and internet-based surveys, and informal conversations) will be used to encourage maximum stakeholder participation|
*Use the likelihood and consequence rating matrix in the Evaluation work plan template. Choose one of the following to define the likelihood of the risk occurring.
Table 15: Risk likelihood categories
|Rare||may only occur in exceptional circumstances|
|Unlikely||is not expected to occur|
|Possible||could occur at some time|
|Likely||would probably occur in most circumstances|
|Almost certain||is expected to occur in most circumstances|
Choose one of the following to define the consequence if the risk occurs.
Table 16: Risk consequence categories
|Negligible||the consequences are dealt with by routine operations|
|Low||impacts on a limited aspect of the activity|
|Moderate||moderate impact on the achievement of goals/objectives|
|High||high impact on the achievement of goals/objectives|
|Extreme||significant impact on the achievement of goals/objectives|
Use the likelihood and consequence ratings to determine the overall risk rating. Risks rated high or extreme are likely to require closer monitoring than those rated moderate or low.
Table 17: Overall risk rating matrix
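The matrix itself is not reproduced in this text, so the sketch below is a hypothetical stand-in rather than the toolkit's actual matrix. It uses the likelihood and consequence categories from Tables 15 and 16, multiplies their positions (1 to 5) and maps the product to a rating using assumed cut-offs. The cut-offs were chosen only so that the example in Table 14 (Possible and Moderate) gives Moderate. The matrix in the Evaluation work plan template takes precedence.

```python
LIKELIHOOD = ["Rare", "Unlikely", "Possible", "Likely", "Almost certain"]
CONSEQUENCE = ["Negligible", "Low", "Moderate", "High", "Extreme"]


def overall_risk_rating(likelihood: str, consequence: str) -> str:
    """Hypothetical scoring rule; the template's matrix takes precedence."""
    # Position in each list (1-5) stands in for the matrix row and column.
    score = (LIKELIHOOD.index(likelihood) + 1) * (CONSEQUENCE.index(consequence) + 1)
    # Assumed cut-offs, not taken from the toolkit.
    if score <= 4:
        return "Low"
    if score <= 9:
        return "Moderate"
    if score <= 16:
        return "High"
    return "Extreme"


print(overall_risk_rating("Possible", "Moderate"))
```

An unrecognised category name raises a `ValueError`, which helps catch spelling differences between a risk register and the category tables.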
The evaluation plan should be developed well in advance of the start of the first evaluation (ideally, before program implementation) to allow for review by relevant stakeholders, making necessary changes, obtaining ethical approval (where required) and pilot testing data collection instruments (as needed).
The evaluation work plan needs to be submitted to DTF within six months of program approval. As programs may change over time, the evaluation work plan should be considered a ‘living document’. It should be reviewed by the program manager periodically or in response to significant program events. DTF should be provided with updated versions in a timely manner.
Prior to and throughout the implementation of the evaluation, it is important to review the evaluation work plan to determine whether it:
- is consistent with the available evaluation resources and agreed evaluation objectives
- focuses on the most important types of information to know (‘need to know’ rather than ‘nice to know’)
- does not place undue burden on project/program staff or participants
- is ethical and culturally appropriate.
Reviewers could include DTF, project/program staff, internal or external evaluation experts, project/program participants and relevant community members.
2.10.1 Technical review of the evaluation design
Before finalising the design, it can be helpful to have a technical review by one or more independent evaluators. It may be necessary to involve more than one reviewer in order to provide expert advice on the specific methods proposed, including specific indicators and measures to be used. Ensure that the reviewer is experienced in using a range of methods and designs, and well briefed on the program context, to ensure they can provide situation-specific advice.
2.10.2 Review of the design by the evaluation management structure
In addition to being considered technically sound by experts, the evaluation design should be seen as credible by those who are expected to use it. Formal organisational review and endorsement of the design by an evaluation steering committee can assist in building credibility with users.
Undertake data rehearsal of possible findings with the primary intended users where possible. This is a powerful strategy for checking the appropriateness of the design by presenting mock-ups of tables, graphs and quotes that the design might produce. It is best to produce at least two different versions – one that would show the program working well and one that would show it not working.
Ideally, the primary intended users of the evaluation will review both versions and either confirm suitability or request amendments to make the potential findings more relevant and credible.
2.10.3 Review the program logic
When reviewing the program logic, the following questions should be addressed:
- What evidence was the basis for its development? What additional evidence should be used in the review?
- Whose perspective formed its basis? To what extent and in what ways were the perspectives of intended beneficiaries and partner organisations included?
- Were there different views about what the intended outcomes and impacts were and/or how these might be brought about?
- Has there been more recent research and evaluation on similar projects and programs which could inform the program logic?
Last updated: 02 February 2021