2.5. Evaluation methodology

The evaluation methodology should include a detailed program logic, key evaluation questions and a data matrix. It needs to consider what types of evaluations should be used and how key evaluation questions will be addressed.

2.5.1. Program logic

A program logic should illustrate how the program will work by linking program activities with intended outcomes. It visually represents the theory of change underpinning the program and describes how the program contributes to a chain of results flowing from the inputs and activities to short-term, intermediate and long-term outcomes.

Different terms are used for a program logic including program theory, logic model, theory of change, causal model, outcomes hierarchy, results chain, and intervention logic.[1] Usually it is represented as a one page diagram. The diagrams and terms used with program logic may also vary – sometimes the diagrams are shown as a series of boxes, as a table, or as a series of results with activities occurring alongside them rather than just at the start. Some diagrams show the causal links from left to right, some from bottom to top. In all cases, a program logic needs to be more than just a list of activities with arrows to the intended outcomes. For some examples, see Program logic library.

What is a program logic used for?

A program logic should show what needs to be measured in order to distinguish between implementation failure (not done right) and theory failure (done right but still did not work).[2] A program logic:

  • clarifies and communicates program intentions and outcomes
  • demonstrates alignment between activities and objectives
  • explains causal assumptions and tests if they are supported by evidence
  • identifies relevant external factors that could influence outcomes (either positively or negatively)
  • identifies key indicators to be monitored
  • identifies gaps in available data and outlines mitigation measures
  • clarifies the outcomes measurement horizon and identifies early indicators of progress or lack of progress in achieving results
  • focuses evaluation questions.

A program logic underpins data collection by identifying a program’s operating steps and defining what program managers should monitor and measure. A program logic also helps identify the components of the program to be tracked as part of monitoring (outputs) versus those that should be assessed as part of an outcome or impact evaluation (outcomes).[1]

Monitoring activities and outputs shows which program components are being well implemented and which could be improved.[1] A focus on measuring outcomes and impacts without a good monitoring system can result in wasted resources. For example, if a program aims to improve literacy in schools using particular books, it is important to monitor delivery and use of the books so that the program can be adjusted early if the books are not being delivered or used. If the program does not have a good monitoring system in place and waits three years before doing an outcome evaluation, this could be an expensive way of finding out that the books had not even been used. Further information on the importance of a good monitoring system as a way of keeping evaluation costs down is in the Goldilocks toolkit.

A program logic illuminates the critical assumptions and predictions that must hold for key outcomes to occur, and suggests important areas for data collection.[1] It also helps prioritise data collection; for example, if there is no way to isolate external factors that influence the outcomes of the program, is it worth collecting the outcome data? Important considerations include the cost of collecting the outcome data and the conclusions that can reasonably be drawn from it. In some cases, a process evaluation may be sufficient but it is essential that the results of the process evaluation are not overstated.

Developing a program logic

Developing a program logic is partly an analytical and partly a consultative process. Analytically, it should review the program settings to identify statements of activities, objectives, aims and intended outcomes, then refine and assemble these statements into a causal chain that shows how the activities are assumed to contribute to immediate outcomes, intermediate outcomes and, ultimately, the longer term outcome. Consultatively, the process should involve working with a range of stakeholders to draw on their understanding of the outcomes and logic, and to encourage greater ownership of the program logic.[3]

It is useful to think realistically about when a successful program will be able to achieve particular outputs and outcomes. For example, within a domestic violence context, a successful program may see an increase in reporting (due to increased awareness and/or availability of support) before reporting decreases. Where possible, estimated timing of indicators should be built into the program logic to help clarify what success looks like in different timeframes.[4]

The Evaluation work plan template includes a suggested template for the program logic at Appendix A. This is an optional starting point rather than a mandatory structure; however, all program logics should clearly identify assumptions and relevant external factors. One useful technique is ‘backcasting’, which starts by identifying the long-term outcomes of a program and envisaging alternative futures, then works backwards to determine the steps necessary to achieve those outcomes. Unlike forecasting, which starts from what is currently occurring and predicts future outcomes, backcasting allows stakeholders to brainstorm and consider alternative courses of action.[5] BetterEvaluation’s guidance on developing programme theory/theory of change may assist in determining the best approach.

Table 8: Potential steps for developing a program logic[6]
Undertake situational analysis: The context of the problem, its causes and consequences. A good situation analysis will go beyond problems and deficits to identify strengths and potential opportunities.
Identify outcomes:
  1. Prepare a list of possible outcomes.
  2. Cluster the outcomes that are related.
  3. Arrange the outcomes in a chain of ‘if-then’ statements to link the short-term and long-term outcomes.
  4. Identify where a higher level outcome affects a lower level one (feedback loops).
  5. Validate the outcomes chain with key stakeholders.
Identify outputs: The direct deliverables of a program. The products, goods or services that need to be provided to program participants to achieve the short-term outcomes.
Identify activities: The required actions to produce program outputs.
Identify inputs: The resources required to run the program.
Identify assumptions: In every link between activity, output and outcome, many different assumptions are made that must hold for the program to work as expected. Making these assumptions explicit and identifying the most critical among them helps to determine what testing and monitoring is needed to ensure the program works as planned. This includes assumptions about:
  • the need for the program
  • how the program will work
  • whether program activities are likely to produce the intended results.
Consider external factors that also cause changes: What besides the program could influence the intended outcome? Listing the most important external influences helps organisations better understand the counterfactual and clarify whether it will be possible to attribute a change in the outcome solely to the program.
Identify risks and unintended consequences: The world around the program is unlikely to remain static; changes in external conditions pose unavoidable risks to any program. It is important to identify the most likely and potentially damaging risks and develop a risk reduction or mitigation plan (see section 2.9: Evaluation risks).

Once a program logic is developed, it is useful to map existing data from an established program or previous evaluation onto the program logic to identify priority areas for additional data collection.
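
Where a program logic will also drive monitoring and data mapping, it can be useful to capture it in a structured, machine-readable form. The sketch below is a minimal, hypothetical illustration (the program, outcomes, assumptions and data sources are invented for this example, not drawn from any Territory program) of how the elements described in Table 8 could be recorded so that existing data can be mapped against each element and gaps identified.

```python
# Minimal, hypothetical sketch of a program logic captured as data, so that
# existing data sources can be mapped onto each element and gaps identified.
# All program details below are invented for illustration only.

program_logic = {
    "inputs": ["funding", "literacy coaches", "reading books"],
    "activities": ["distribute books to schools", "train teachers in their use"],
    "outputs": ["books delivered and in use", "teachers trained"],
    "short_term_outcomes": ["increased reading time in class"],
    "long_term_outcomes": ["improved student literacy"],
    "assumptions": ["schools have capacity to schedule reading time"],
    "external_factors": ["other literacy initiatives running in the region"],
}

# Hypothetical mapping of existing data sources onto program logic elements.
existing_data = {
    "books delivered and in use": "delivery records, term 1 school survey",
    "teachers trained": "training attendance records",
}

# Flag elements with no identified data source as priorities for data collection.
for element_type in ("outputs", "short_term_outcomes", "long_term_outcomes"):
    for element in program_logic[element_type]:
        source = existing_data.get(element, "no data source identified")
        print(f"{element_type}: {element} -> {source}")
```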

What does a good program logic look like?

There is no one way to represent a program logic – the test is whether it is a representation of the program's causal links, and whether it communicates effectively with the intended audience by making sense and helping them understand the program.[3] See examples in Program logic library.

Table 9 provides guidance on the aspects required for a good program logic, and explains the different criteria for ‘requires improvement’, ‘satisfactory’ and ‘good’. This may be refined over time in response to user feedback to ensure it is appropriate to a Territory Government context.

See section 2.10. Reviewing the evaluation work plan for suggestions on what to look for when reviewing an established program logic.

Table 9. Program logic rubric[7]
Principle | Beginning | Developing | Embedded | Leading
Integrated
  • Awareness of the benefits of evaluation is low.
  • Evaluation is seen as a compliance activity and threat.
  • Fear of negative findings and recommendations leads to a perception of ‘mandatory optimism’ regarding program performance.
  • Insufficient resources allocated to evaluation activities.
  • Evaluation and performance measurement skills and understanding limited, despite pockets of expertise.
  • Appreciation of the benefits of evaluation improving.
  • Evaluation is being viewed as core business for the department, not simply a compliance activity.
  • A culture of evaluative thinking and continual improvement is introduced and communicated across the department.
  • Skills in performance measurement and evaluation developed through targeted training and guidance materials.
  • Evaluation website and guidance materials developed.
  • The role of the Evaluation Unit is widely communicated. Unit seen as the authoritative source for advice.
  • Developing further expertise in the Evaluation Unit.
  • A culture of evaluative thinking and continual improvement is embedded across the department, with lessons learnt being acted upon.
  • Evaluation is seen as an integral component of sound performance management.
  • General evaluation skills widespread.
  • Improved skills and knowledge in developing quality performance measures.
  • Evaluation Unit team members have high order skills and experience which are leveraged by the department.
  • Evaluation Unit team members hold and are encouraged to undertake formal qualifications in evaluation and related subjects.
  • Evaluations motivate improvements in program design and policy implementation.
  • Demonstrated commitment to continuous learning and improvement throughout the agency.
  • Department is recognised for its evaluation and performance monitoring expertise, and innovative systems and procedures.
Fit for purpose
  • Frequency and quality of evaluation is lacking.
  • Guidelines for prioritising and scaling evaluation activity are used.
  • Priority programs are evaluated.
  • Evaluations use fit-for-purpose methodologies.
  • Evaluation effort is scaled accordingly.
  • Specialist and technical skills well developed to apply appropriate methodologies.
Evidence-based
  • Data holdings and collection methods are insufficient or of poor quality.
  • Planning at program outset improves data holdings and collection methods.
  • Developing skills and knowledge in applying robust research and analytical methods to assess impact and outcomes.
  • Quality of evaluations is improving.
  • A range of administrative and other data is used in the assessment of performance.
  • Robust research and analytical methods are used to assess impact and outcomes.
  • Evaluations conform to departmental standards.
  • The department continually develops and applies robust research and analytical methods to assess impact and outcomes.
  • Evaluation and performance measurement conform to recognised standards of quality.
Timely
  • Effort and resources are allocated in an ad hoc and reactive manner with little foresight.
  • Developing performance information at the inception of a program is ad hoc and of variable quality.
  • Evaluation activity is coordinated. An evaluation plan is in place and regularly monitored.
  • Strategically significant and risky programs are prioritised.
  • Planning for evaluation and performance monitoring is being integrated at the program design stage.
  • All programs are assessed for being Evaluation Ready.
  • The department employs strategic risk-based, whole of department criteria to prioritise evaluation effort. Evaluation plans are updated annually and progress is monitored on a regular basis.
  • Planning for evaluation and performance measurement is considered a fundamental part of policy and program design.
  • All programs have program logic, performance and evaluation plans in place.
  • The department’s approach to evaluation and performance planning is seen as the exemplar.
  • All programs have been signed off and are Evaluation Ready.
Transparent
  • Findings and recommendations held in program and policy areas.
  • No follow up on the implementation of recommendations.
  • Findings and recommendations viewed as an opportunity to identify lessons learnt.
  • Evaluations are available in the completed evaluations library to improve the dissemination of lessons learnt and inform policy development.
  • Findings widely disseminated and drive better performance.
  • Website and guidance materials are a valuable resource for staff.
  • Evaluation findings and reports are published where appropriate.
  • Findings are consistently used to optimise delivery and have influence outside the department.
Independent
  • Independent conduct and governance of evaluations is lacking.
  • Evaluations are conducted and overseen by the policy or program areas responsible for delivery of the program.
  • There is an improved level of independence in the conduct and governance of evaluations.
  • All evaluations include a level of independence.
  • Evaluations conducted by the Evaluation Unit are viewed externally as independent.
Table 10: Program logic library
Program logic source | Comments
Australian Institute of Family Studies Blairtown example program logic | A hypothetical program aiming to ensure children reach appropriate developmental milestones. Includes assumptions and external factors.
Evidence-Based Programs and Practice in Children and Parenting Support Programs | A project supporting nine Children and Parenting Support services in regional and rural NSW to enhance their use of evidence-based programs and practice. Includes assumptions and external factors.
National Forum on Youth Violence Prevention | A program that aims to maximise the use of city partnerships and increase the effectiveness of federal agencies to reduce youth violence. Includes assumptions and external factors.
Australian Policy Service Policy Hub Evaluation Ready example program logic: Save Our Town | A hypothetical program aimed at stimulating private sector investment, population growth, and economic expansion and diversification to increase a region’s viability. Includes assumptions and external factors.
University of Michigan Evaluation Resource Assistant example program logic | A hypothetical program aimed at reducing rates of child abuse and neglect. Does not include assumptions and external factors, however it is a good example of how to use a program logic to prioritise key evaluation questions and indicators.
The Goldilocks Toolkit Case Studies | Program logics and lessons learned from a range of international social programs including Acumen, GiveDirectly, Digital Green, Root Capital, Splash, TulaSalud, Women for Women International and One Acre Fund. The program logics do not include assumptions and external factors but the case studies provide examples of how to reduce evaluation costs through a good monitoring system.
Geo-Mapping for Energy & Minerals | Appendix A of this evaluation report by Natural Resources Canada has a program logic for a program improving regional geological mapping for responsible resource exploration and development. Does not include assumptions and external factors but does have an evaluation matrix to show how the evaluation questions will be addressed.
The Logic Model Guidebook: Better Strategies for Great Results | Program logics from a Community Leadership Academy (see page 10, and a marked up version on page 56) and a Health Improvement program (see page 39 and a marked up version on page 57).

2.5.2. Evaluation questions

Across the program cycle, evaluations need to include a range of questions that promote accountability for public funding and learning from program experiences. These questions need to align with the program logic and will form the basis of the terms of reference. Evaluation questions may be added to or amended closer to evaluation commencement to account for changes in policy context, key stakeholders, or performance indicators.

Different types of questions need different methods and designs to answer them. In evaluations there are four main types of questions: descriptive, action, causal and evaluative.

Descriptive questions

Descriptive questions ask about what has happened or how things are. For example:

  • What were the resources used by the program directly and indirectly?
  • What activities occurred?
  • What changes were observed in conditions or in the participants?

Descriptive questions might relate to:

  • Inputs – materials, staff.
  • Processes – implementation, research projects.
  • Outputs – for example, research publications.
  • Outcomes – for example, changes in policy on the basis of research.
  • Impacts – for example, improvements in agricultural production.

Action questions

Action questions ask about what should be done to respond to evaluation findings. For example:

  • What changes should be made to address problems that have been identified?
  • What should be retained or added to reinforce existing strengths?
  • Should the program continue to be funded?

Causal questions

Causal questions ask about what has contributed to changes that have been observed. For example:

  • What produced the outcomes and impacts?
  • What was the contribution of the program to producing the changes that were observed?
  • What other factors or programs contributed to the observed changes?

What is the difference between correlation and causation?

Two variables are classified as correlated if both increase and decrease together (positively correlated) or if one increases and the other decreases (negatively correlated). Correlation analysis measures how close the relationship is between the two variables.

Causal questions need to investigate whether programs are causing the outcomes that are observed. Although there may be a strong correlation between two variables, for example the introduction of a new program and a particular outcome, this correlation does not necessarily mean the program is directly causing the outcome. See Box 1 for an example.
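
The distinction can be demonstrated with a small simulation. In the hypothetical sketch below (all variables and effect sizes are invented), a third factor drives both program uptake and the outcome, so uptake and the outcome are strongly correlated even though the program has no effect at all.

```python
# Hypothetical sketch: correlation without causation.
# A confounder (e.g. community engagement) drives both program uptake and the
# outcome; the program itself has zero effect, yet uptake and outcome correlate.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

confounder = rng.normal(size=n)                            # unobserved driver
uptake = confounder + rng.normal(scale=0.5, size=n)        # program uptake
outcome = 2 * confounder + rng.normal(scale=0.5, size=n)   # outcome (no effect of uptake)

correlation = np.corrcoef(uptake, outcome)[0, 1]
print(f"Correlation between uptake and outcome: {correlation:.2f}")  # strong
# Despite the strong correlation, changing uptake alone would not change the
# outcome, because the relationship is driven entirely by the confounder.
```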

Box 1: Evaluating to improve resource allocations for family planning and fertility in Indonesia

Indonesia’s innovative family planning efforts gained international recognition in the 1970s for their success in decreasing the country’s fertility rates. The acclaim arose from two parallel phenomena: (1) fertility rates declined by 22% between 1970 and 1980, by 25% between 1981 and 1990, and more moderately between 1991 and 1994; and (2) during the same period, the Indonesian government substantially increased resources allocated to family planning (particularly contraceptive subsidies).

Given that the two things happened concurrently, many concluded that increased investment in family planning had led to lower fertility rates. Unconvinced by the available evidence, a team of researchers evaluated the impact of family planning programs on fertility rates and found, contrary to what was generally believed, that family planning programs only had a moderate impact on fertility, with changes in women’s status deemed to have a larger impact on fertility rates.

The researchers noted that before the start of the family planning program very few women of reproductive age had finished primary education. During the same period as the family planning program, however, the Indonesian government undertook a large-scale education program for girls. By the end of the program, women entering reproductive age had benefited from the additional education. When the oil boom brought economic expansion and increased demand for labour in Indonesia, the participation of educated women in the labour force increased significantly. As the value of women’s time at work rose, so did the use of contraceptives. In the end, higher wages and empowerment explained 70% of the observed decline in fertility—more than the investment in family planning programs.

These evaluation results informed policy makers’ subsequent resource allocation decisions: funding was reprogrammed away from contraception subsidies and towards programs that increased the enrolment of women in school. Although the ultimate goals of the two programs were similar, evaluation studies had shown that in the Indonesian context, lower fertility rates could be obtained more effectively by investing in education than by investing in family planning.[8]

There are many designs and methods to answer causal questions but they usually involve one or more of these strategies:

(a) Compare results to an estimate of what would have happened if the program had not occurred (this is known as a counterfactual)

This might involve creating a control group (where people or sites are randomly assigned to either participate or not) or a comparison group (where those who participate are compared to others who are matched in various ways). Techniques include:

  • Randomised controlled trial (RCT): a control group is compared to one or more treatment groups (see the sketch after this list).
  • Matched comparison: each participant is matched with a non-participant on variables that are thought to be relevant. It can be difficult to adequately match on all relevant criteria.
  • Propensity score matching: a comparison group is created based on an analysis of the factors that influenced people’s propensity to participate in the program.
  • Regression discontinuity: compares the outcomes of individuals just below the cut-off point with those just above it.[9]
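
As a minimal illustration of the counterfactual logic behind the first technique, the sketch below (with entirely invented data) estimates a program’s effect as the difference in mean outcomes between randomly assigned treatment and control groups; because assignment is random, the control group’s mean stands in for what would have happened to participants without the program.

```python
# Hypothetical sketch: estimating an average treatment effect from an RCT
# as the difference in means between treatment and control groups.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Invented outcome data: the program shifts outcomes up by about 2 units.
control = rng.normal(loc=50, scale=10, size=n)    # randomly assigned, no program
treatment = rng.normal(loc=52, scale=10, size=n)  # randomly assigned, program

effect = treatment.mean() - control.mean()
# Standard error of the difference in means, for a rough confidence interval.
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)

print(f"Estimated effect: {effect:.2f} "
      f"(95% CI roughly {effect - 1.96 * se:.2f} to {effect + 1.96 * se:.2f})")
```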
(b) Check for consistency of the evidence with the theory of how the intervention would contribute to the observed results

This can involve checking that intermediate outcomes have been achieved, using process tracing to check each causal link in the theory of change, identifying and following up anomalies that don’t fit the pattern, and asking participants to describe how the changes came about. Techniques include:

  • Contribution analysis: sets out the theory of change that is understood to produce the observed outcomes and impacts and then searches iteratively for evidence that will either support or challenge it.
  • Key informant attribution: asks participants and other informed people about what they believe caused the impacts and gathers information about the details of the causal processes.
  • Qualitative comparative analysis: compares different cases to identify the different combinations of factors that produce certain outcomes.
  • Process tracing: a case-based approach to causal inference which focuses on the use of clues within a case (causal-process observations) to adjudicate between alternative possible explanations. It involves checking each step in the causal chain to see if the evidence supports, fails to support or rules out the theory that the program or project produced the observed impacts.
  • Qualitative impact assessment protocol: combines information from relevant stakeholders, process tracing and contribution analysis, using interviews undertaken in a way to reduce biased narratives.

(c) Identify and rule out alternative explanations

This can involve a process to identify possible alternative explanations (perhaps involving interviews with program sceptics and critics, and drawing on previous research and evaluation, as well as interviews with participants) and then searching for evidence that can rule them out.

While technical expertise is needed to choose the appropriate option for answering causal questions, the program manager should be able to check that an explicit approach is being used, and seek technical review of its appropriateness from third parties where necessary.

Further guidance and options for measuring causal attribution can be found in the UNICEF Impact evaluation series.

Evaluative questions

Evaluative questions ask whether an intervention can be considered a success, an improvement or the best option and require a combination of explicit values as well as evidence – for example:

  • In what ways and for whom was the program successful?
  • Did the program provide value for money, taking into account all the costs incurred (not only the direct funding) and any negative outcomes?

Many evaluations do not make explicit how evaluative questions will be answered – what the criteria will be (the domains of performance), what the standard will be (the level of performance that will be considered adequate or good), or how different criteria will be weighted. A review of the design could check each of these in turn:

  • Are there clear criteria for this evaluative question?
  • Are there clear standards for judging the quality of performance on each criterion?
  • Is there clarity about how to synthesise evidence across criteria? Is there a performance framework that explains what “how good”, “how well” or “how much” mean in practice?[10] For example, is it better to have some improvement for everyone or big improvements for a few?
  • Are the criteria, standards and approach to synthesis appropriate? What has been their source? Is further review of these needed? Who should be involved?

Ideally an evaluation design will be explicit about these, including the source of the criteria and standards. The BetterEvaluation website has further information on Evaluation methods for assessing value for money, and Oxford Policy Management’s approach is useful for assessing value for money in complex interventions.
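
As a minimal, hypothetical illustration of making criteria, standards and weighting explicit (the criteria, weights and ratings below are invented), a simple weighted synthesis might look like this:

```python
# Hypothetical sketch: explicit criteria, standards and weights for an
# evaluative judgement, synthesised into a single weighted score.
# All criteria, weights and ratings are invented for illustration.

# Ratings on a shared 0-4 standard: 0 = poor ... 4 = excellent.
criteria = {
    "reach of the program":             {"weight": 0.3, "rating": 3},
    "depth of change for participants": {"weight": 0.4, "rating": 2},
    "equity of outcomes":               {"weight": 0.3, "rating": 4},
}

weighted_score = sum(c["weight"] * c["rating"] for c in criteria.values())
max_score = 4 * sum(c["weight"] for c in criteria.values())

print(f"Weighted score: {weighted_score:.1f} out of {max_score:.1f}")
# A pre-agreed standard (e.g. 'adequate' is 2.5 or above) would then be applied
# to the synthesised score; the point is that criteria, standards and weights
# are explicit and agreed before the evidence is synthesised.
```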

Key evaluation questions

To clarify the purpose and objectives of an evaluation, there should be a limited number of higher order key evaluation questions (roughly 5 to 7 questions) addressing:

  • appropriateness – to what extent does the program address an identified need?
  • effectiveness – to what extent is the program achieving the intended outcomes, in the short, medium and long term?
  • efficiency – do the outcomes of the program represent value for money?

These key evaluation questions are high-level research topics that can be broken down into detailed sub‑questions, each addressing a particular aspect. The key evaluation questions are not yes or no questions.

Key evaluation questions often contain more than one type of evaluation question – for example to answer “How effective has the program been?” requires answering:

  • descriptive questions – What changes have occurred?
  • causal questions – What contribution did the intervention make to these changes?
  • evaluative questions – How valuable were the changes in terms of the stated goals, taking into account the types, level and distribution of changes?

A way to test the validity and scope of evaluation questions is to ask: when the evaluation has answered these questions, have we met the full purpose of the evaluation?[11]

See also section 4.2.1 Evidence synthesis.

2.5.3. Types of evaluation

While there are a number of different approaches to evaluation,[12] the Program evaluation framework is based on three types,[13] linked to the program lifecycle:

  1. Process evaluation: considers program design and initial implementation (≤18 months).
  2. Outcome evaluation: considers program implementation (>2 years) and short to medium term outcomes.
  3. Impact evaluation: considers medium to long term outcomes (>3 years), and whether the program contributed to the outcomes and represented value for money.

These three evaluation types address different questions at various stages of the program lifecycle, with each evaluation building on the evidence from the previous evaluation (Figure 3). Not all programs will require all three evaluation types. The evaluation overview, completed as part of the Cabinet submission process, will specify which evaluation types are necessary for each program. The different types of evaluation are used to build a clearer picture of program effectiveness as the program matures (Figure 4).

Figure 3: Different types of evaluations consider different aspects of the program[14]

Diagram of types of evaluations

Figure 4: Evaluations over the program lifecycle of a major program[15]

Chart depicting evaluations of program

Process evaluations

A process evaluation investigates whether the program is being implemented according to plan.[16] This type of evaluation can help to distinguish between implementation failure (where the program has not been adequately implemented) and theory failure (where the program was adequately implemented but did not produce the intended impacts).[17] As an ongoing evaluative strategy, it can be used to continually improve programs by informing adjustments to delivery.[11]

Process evaluations may be undertaken by the relevant program team, if they have appropriate capability.

A process evaluation will typically try to answer questions such as:

  • Was the program implemented in accordance with the initial program design?
  • Was the program rollout completed on time and within the approved budget?
  • Are there any adjustments to the implementation approach that need to be made?
  • Are more or different key performance indicators required?
  • Is the right data being collected in an efficient way?

Outcome evaluations

An outcome evaluation assesses progress towards the early to medium-term results that the program is aiming to achieve.[16] It is suited to programs at a business-as-usual stage in the program lifecycle and is usually externally commissioned.

An outcome evaluation will typically try to answer questions such as:

  • What early outcomes or indications of future outcomes are suggested by the data?
  • Did the program have any unintended consequences, positive or negative? If so, what were those consequences? How and why did they occur?
  • How ready is the program for an impact evaluation?

There is an important distinction between measuring outcomes, which is a description of the factual, and using a counterfactual to attribute observed outcomes to the intervention.[18] A good outcome evaluation should consider whether the program has contributed to the outcome, noting this becomes easier over time and is therefore more of a focus in impact evaluations (see Figure 5).

Figure 5 illustrates the impact measures and change in outcomes for those affected by a program (blue line) compared to the alternative outcomes had the program not existed (orange line). As impact generally increases over time, it tends to be measured later in the program lifecycle.

Figure 5: Impact of program outcomes over time

Chart illustrating impact over time

Impact evaluations

An impact evaluation builds on an outcome evaluation to assess longer-term results.[19] It must test whether the program has made a difference by comparing observed results with what would have happened in the absence of the program[1] (further guidance in Causal questions). In situations where it is not possible or appropriate to undertake a rigorous impact evaluation, it may be better to monitor, learn and improve[1] through process and/or outcome evaluations.

As impact is the change in outcomes compared to the alternative outcomes had the program not existed,[1] it is usually easier to measure impact later in the lifecycle of the program (see Figure 5). These evaluations commonly occur at least three years after program implementation. However, the appropriate timing for measuring impact will depend on the program and needs to be decided on a case-by-case basis.[20]

An impact evaluation will typically try to answer questions such as:

  • Were the intended outcomes achieved as set out in the program’s aims and objectives?
  • Have other investments influenced the attainment of the program’s aims and objectives? If so, in what way?
  • Did the program contribute to achieving the outcomes as anticipated? If so, to what extent?
  • Were there any unintended consequences?
  • What would have been the situation had the program not been implemented?
  • To what extent did the benefits of the program outweigh the costs?
  • Did the program represent good value for money?
  • Was the program delivered cost-effectively?

Impact evaluations are usually externally commissioned due to their complexity and are generally reserved for high-risk and complex programs due to their cost. The design options for an impact evaluation need significant investment in preparation and early data collection. It is important that impact evaluation is addressed as part of the integrated monitoring and evaluation approach outlined in the evaluation work plan, so that data from monitoring and from the process and outcome evaluations can be used as needed.[21] Equity concerns may require an impact evaluation to go beyond simple average impacts to identify for whom and in what ways the program has impacted outcomes (further guidance in section 2.8: Ethical considerations).[21]

Impact evaluations usually include a value-for-money assessment to determine whether the benefits of the program outweighed the costs and whether the outcomes could have been achieved more efficiently through program efficiencies or a different approach.[22] Value for money in this context is broader than a cost-benefit analysis: it is a question of how well resources have been used and whether that use is justified (further guidance in Evaluative questions).
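
As a minimal illustration of the narrower cost-benefit component only (the cash flows and discount rate below are invented, and a full value-for-money assessment would also weigh non-monetised outcomes), discounted benefits and costs might be compared as follows:

```python
# Hypothetical sketch: comparing discounted benefits and costs of a program.
# All figures and the discount rate are invented for illustration; a full
# value-for-money assessment is broader than this calculation.

discount_rate = 0.07
costs = [1_000_000, 200_000, 200_000, 200_000]   # year 0..3 costs ($)
benefits = [0, 300_000, 600_000, 900_000]        # year 0..3 monetised benefits ($)

def present_value(cash_flows, rate):
    """Discount a list of annual cash flows (year 0 first) to present value."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

pv_costs = present_value(costs, discount_rate)
pv_benefits = present_value(benefits, discount_rate)

print(f"Net present value: ${pv_benefits - pv_costs:,.0f}")
print(f"Benefit-cost ratio: {pv_benefits / pv_costs:.2f}")
```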

Further information on impact evaluations is available from BetterEvaluation and the UNICEF impact evaluation series.

External or internal evaluation

Evaluations can be commissioned externally to an appropriate consultant or academic evaluator, conducted internally by agency staff, or conducted using a hybrid model of an internal evaluator supported by an external evaluator:

  • External evaluator(s): an external evaluator serves as team leader and is supported by program staff
  • Internal evaluator(s): an internal evaluator serves as team leader and is supported by program staff
  • Hybrid model: an internal evaluator serves as team leader and is supported by other internal evaluators and program staff, as well as external evaluator(s).

If an external evaluator is hired to conduct the evaluation, the program manager and other agency staff still need to be involved in the evaluation process. Program staff are not only primary users of the evaluation findings but are also involved in other evaluation-related tasks (such as providing access to records or educating the evaluator about the program). Be realistic about the amount of time needed for this involvement so staff schedules do not get over-burdened.

The decision to conduct an evaluation internally or commission an external evaluation is usually a decision for the agency’s accountable officer. However, as a general best-practice guide, outcome or impact evaluations of high-tier programs should be evaluated externally. It is advisable to engage an external evaluator or evaluation team when:

  • the scope and/or complexity of the evaluation requires expertise that is not internally available
  • a program or project is politically sensitive and impartiality is a key concern
  • internal staff resources are scarce and timeframes are particularly pressing (that is, there is little flexibility in terms of evaluation timing).

Table 11: Internal versus external evaluators outlines the trade-offs between internal and external evaluators.

Table 11: Internal versus external evaluators[23]
Component | Internal evaluator(s) | External evaluator(s)
Perspective | May be more familiar with the community, issues and constraints, data sources, and resources associated with the project/program (they have an insider's perspective). | May bring a fresh perspective, insight, broader experience, and recent state-of-the-art knowledge (they have an outsider's perspective).
Knowledge and skills | Are familiar with the substance and context of research for development programming. | May possess knowledge and skills that internal evaluators are lacking. However, it may be difficult to find evaluators who understand the specifics of research for development programming.
Buy-in | May be more familiar with the project/program staff and may be perceived as less threatening. In some contexts, may be seen as too close, and participants may be unwilling to provide honest feedback. | May be perceived as intrusive or a threat to the project/program (perceived as an adversary). Alternatively, may be considered impartial, and participants may be more comfortable providing honest feedback.
Stake in the evaluation | May be perceived as having an agenda or stake in the evaluation. | Can serve more easily as an arbitrator or facilitator between stakeholders as they are perceived as neutral.
Credibility | May be perceived as biased or ‘too close’ to the subject matter, which may reduce the credibility of the evaluation and hinder its use. | May provide a view of the project/program that is considered more objective, giving the findings more credibility and potential for use.
Resources | May use considerable staff time, which is always in limited supply, especially when their time is not solely dedicated to the evaluation. | May be more costly and still involve substantial management/staff time from the commissioning organisation.
Follow-up/use of evaluation findings | More opportunity and authority to follow up on the recommendations of the evaluation. | Contracts often end with the delivery of the final product, typically the final evaluation report, which limits or prohibits follow-up. As outsiders, they do not have the authority to require appropriate follow-up or action.

2.5.4. Data matrix

A data matrix outlines the sources and types of data that will need to be collected by the program team as part of monitoring, as well as by the evaluator at the time of the evaluation, to ensure that the evaluation questions can be answered. The data matrix should indicate which evaluations will address which questions. Each evaluation does not need to address all the evaluation questions; however, all questions should be addressed over the entire evaluation plan (process, outcome and impact evaluations). See section 2.5.2. Evaluation questions for further information on which questions are addressed in the different types of evaluations.

A data matrix can also help focus data collection to ensure that only relevant data is collected. Data collection has real costs in terms of staff time and resources as well as time it asks of respondents. It is important to weigh the costs and benefits of data collection activities to find the right balance.[1]
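
A data matrix can also be kept in a simple structured form. The hypothetical sketch below (the questions, indicators and sources are invented) records, for each key evaluation question, the indicators, data sources, who collects the data and which evaluation will address it, making it easy to check that every question is covered somewhere in the evaluation plan.

```python
# Hypothetical sketch of a data matrix: each row links a key evaluation
# question to indicators, data sources, responsibility and the evaluation
# type that will address it. All entries are invented for illustration.

data_matrix = [
    {
        "question": "Was the program implemented as designed?",
        "indicators": ["sites established", "staff trained"],
        "data_sources": ["program administrative data"],
        "collected_by": "program team (monitoring)",
        "addressed_in": "process evaluation",
    },
    {
        "question": "To what extent were intended outcomes achieved?",
        "indicators": ["change in participant outcomes"],
        "data_sources": ["participant survey", "administrative data"],
        "collected_by": "program team and evaluator",
        "addressed_in": "outcome and impact evaluations",
    },
]

# Check how many key evaluation questions each evaluation type will address.
for evaluation in ("process", "outcome", "impact"):
    questions = [row["question"] for row in data_matrix if evaluation in row["addressed_in"]]
    print(f"{evaluation} evaluation: {len(questions)} question(s) assigned")
```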


[1] M. K. Gugerty, D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.

[2] S. C. Funnell and P. J. Rogers, Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, 2011 [VitalSource Bookshelf version].

[3] NSW evaluation toolkit, Develop program logic and review needs.

[4] For example see Figure 11, page 66, Ending violence against women and girls: Evaluating a decade of Australia’s development assistance.

[5] BetterEvaluation Backcasting.

[6] Adapted from S. C. Funnell and P. J. Rogers, Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, 2011 [VitalSource Bookshelf version], and M. K. Gugerty and D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.

[7] DIIS Evaluation Strategy 2017-2021

[8] Impact Evaluation in Practice, World Bank, 2011.

[9] For a good example of a low-cost impact evaluation using regression discontinuity, see the Root Capital Case Study in the Goldilocks Toolkit .

[10] Further guidance on developing performance frameworks, including examples, is on pages 6 to 11 of Evaluation Building Blocks – A Guide by Kinnect Group.

[11] NSW Evaluation toolkit, Develop the evaluation brief.

[12] For information about other evaluation types, please see BetterEvaluation.

[13] Further information on these three evaluation types is in sections 3.2.1 to 3.2.3.

[14] Adapted from the Department of Industry, Innovation and Science Evaluation Strategy, 2017-2021.

[15] Adapted from NSW Government Evaluation Framework 2013.

[16] BetterEvaluation Manager’s Guide to Evaluation, accessed May 2020, Develop agreed key evaluation questions.

[17] Rogers, P. et al. (2015), Choosing appropriate designs and methods for impact evaluation, Office of the Chief Economist, Australian Government, Department of Industry, Science, Energy and Resources.

[18] OECD: Outline of principles of impact evaluation.

[19] BetterEvaluation: Themes.

[20] UNICEF Impact Evaluation Series, webinar 5 – RCTs.

[21] Adapted from BetterEvaluation: Themes – Impact evaluation.

[22] Value for money can also be considered in earlier evaluations, if there is sufficient evidence. As the evidence for a program accumulates, so will the expectation for an assessment of value for money.

[23] BetterEvaluation Commissioner’s Guide.

Last updated: 14 December 2020
