2.5. Evaluation methodology

The evaluation methodology should include a detailed program logic, key evaluation questions and a data matrix. It needs to consider what types of evaluations should be used and how key evaluation questions will be addressed.

2.5.1. Program logic

A program logic should illustrate how the program will work by linking program activities with intended outcomes. It visually represents the theory of change underpinning the program and describes how the program contributes to a chain of results flowing from the inputs and activities to short-term, intermediate and long-term outcomes.

Different terms are used for a program logic, including program theory, logic model, theory of change, causal model, outcomes hierarchy, results chain, and intervention logic.[1] Usually it is represented as a one-page diagram. The diagrams and terms used with program logic may also vary – sometimes the diagrams are shown as a series of boxes, as a table, or as a series of results with activities occurring alongside them rather than just at the start. Some diagrams show the causal links from left to right, some from bottom to top. In all cases, a program logic needs to be more than just a list of activities with arrows to the intended outcomes. For some examples, see the Program logic library.

What is a program logic used for?

A program logic should show what needs to be measured in order to distinguish between implementation failure (not done right) and theory failure (done right but still did not work).[2] A program logic:

  • clarifies and communicates program intentions and outcomes
  • demonstrates alignment between activities and objectives
  • explains causal assumptions and tests if they are supported by evidence
  • identifies relevant external factors that could influence outcomes (either positively or negatively)
  • identifies key indicators to be monitored
  • identifies gaps in available data and outlines mitigation measures
  • clarifies the outcomes measurement horizon and identifies early indicators of progress or lack of progress in achieving results
  • focuses evaluation questions.

A program logic underpins data collection by identifying a program’s operating steps and defining what program managers should monitor and measure. A program logic also helps identify the components of the program to be tracked as part of monitoring (outputs) versus those that should be assessed as part of an outcome or impact evaluation (outcomes).[1]

Monitoring activities and outputs shows which program components are being well implemented and which could be improved.[1] A focus on measuring outcomes and impacts without a good monitoring system can result in wasted resources. For example, if a program aims to improve literacy in schools using particular books, it is important to monitor delivery and use of the books so that the program can be adjusted early if the books are not being delivered or used. If the program does not have a good monitoring system in place and waits three years before doing an outcome evaluation, this could be an expensive way of finding out that the books had not even been used. Further information on the importance of a good monitoring system as a way of keeping evaluation costs down is in the Goldilocks toolkit.

A program logic illuminates the critical assumptions and predictions that must hold for key outcomes to occur, and suggests important areas for data collection.[1] It also helps prioritise data collection; for example, if there is no way to isolate external factors that influence the outcomes of the program, is it worth collecting the outcome data? Important considerations include the cost of collecting the outcome data and the conclusions that can reasonably be drawn from it. In some cases, a process evaluation may be sufficient but it is essential that the results of the process evaluation are not overstated.

Developing a program logic

Developing a program logic is part analytical exercise and part consultative process. Analytically, it should review the program settings to identify statements of activities, objectives, aims and intended outcomes. It should then refine and assemble these statements into a causal chain that shows how the activities are assumed to contribute to immediate outcomes, intermediate outcomes and ultimately to the longer term outcome. Consultatively, the process should involve working with a range of stakeholders to draw on their understanding of the outcomes and logic, and also encourage greater ownership of the program logic.[3]

It is useful to think realistically about when a successful program will be able to achieve particular outputs and outcomes. For example, within a domestic violence context, a successful program may see an increase in reporting (due to increased awareness and/or availability of support) before reporting decreases. Where possible, estimated timing of indicators should be built into the program logic to help clarify what success looks like in different timeframes.[4]

The Evaluation work plan template includes a suggested template for the program logic at Appendix A. This is an optional starting point rather than a mandatory structure; however, all program logics should clearly identify assumptions and relevant external factors. One useful approach is ‘backcasting’, which starts by identifying the long-term outcomes of a program and envisaging alternative futures, and then works backwards to determine the steps necessary to achieve those outcomes. Unlike forecasting, which starts from what is currently occurring and predicts future outcomes, backcasting allows stakeholders to brainstorm and consider alternative courses of action.[5] BetterEvaluation’s guidance on developing programme theory/theory of change may assist in determining the best approach.

Table 8: Potential steps for developing a program logic[6]
Undertake situational analysis: Analyse the context of the problem, its causes and consequences. A good situation analysis will go beyond problems and deficits to identify strengths and potential opportunities.

Identify outcomes:
  1. Prepare a list of possible outcomes.
  2. Cluster the outcomes that are related.
  3. Arrange the outcomes in a chain of ‘if-then’ statements to link the short-term and long-term outcomes.
  4. Identify where a higher level outcome affects a lower level one (feedback loops).
  5. Validate the outcomes chain with key stakeholders.

Identify outputs: The direct deliverables of a program; the products, goods or services that need to be provided to program participants to achieve the short-term outcomes.

Identify activities: The actions required to produce the program outputs.

Identify inputs: The resources required to run the program.

Identify assumptions: Every link between activity, output and outcome rests on assumptions that must hold for the program to work as expected. Making these assumptions explicit, and identifying the most critical among them, helps to determine what testing and monitoring is needed to ensure the program works as planned. This includes assumptions about:
  • the need for the program
  • how the program will work
  • whether program activities are likely to produce the intended results.

Consider external factors that also cause changes: What besides the program could influence the intended outcome? Listing the most important external influences helps organisations better understand the counterfactual and clarify whether it will be possible to attribute a change in the outcome solely to the program.

Identify risks and unintended consequences: The world around the program is unlikely to remain static; changes in external conditions pose unavoidable risks to any program. It is important to identify the most likely and potentially damaging risks and develop a risk reduction or mitigation plan (see section 2.9: Evaluation risks).

Once a program logic is developed, it is useful to map existing data from an established program or previous evaluation onto the program logic to identify priority areas for additional data collection.
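
Beyond a diagram, it can be useful to capture the program logic as structured data so that the chain of results, the ‘if-then’ links and the assumptions attached to each link are explicit and easy to review against monitoring data. The following is a minimal sketch in Python; the literacy example, outcome names, assumptions and external factors are invented for illustration and are not prescribed content.

```python
# A minimal, hypothetical sketch of a program logic captured as structured data.
# The literacy program, outcomes, assumptions and external factors below are
# invented purely for illustration.

program_logic = {
    "inputs": ["departmental funding", "program staff", "reading books"],
    "activities": ["procure books", "deliver books to schools", "train teachers"],
    "outputs": ["books delivered to participating schools", "teachers trained"],
    "short_term_outcomes": ["books used in classrooms"],
    "intermediate_outcomes": ["increased student reading practice"],
    "long_term_outcomes": ["improved literacy results"],
}

# 'If-then' links between results, each recording the assumptions that must hold
# and the external factors that could influence the link.
links = [
    {
        "if": "books delivered to participating schools",
        "then": "books used in classrooms",
        "assumptions": ["teachers incorporate the books into lessons"],
        "external_factors": ["school staffing levels"],
    },
    {
        "if": "books used in classrooms",
        "then": "improved literacy results",
        "assumptions": ["books are pitched at the right reading level"],
        "external_factors": ["other literacy initiatives running concurrently"],
    },
]

# Print the causal chain so the assumptions behind each link can be reviewed.
for link in links:
    print(f"IF {link['if']} THEN {link['then']}")
    print(f"  assumptions: {', '.join(link['assumptions'])}")
    print(f"  external factors: {', '.join(link['external_factors'])}")
```

Recording the logic in this way mirrors the steps in Table 8 and makes it straightforward to check, link by link, which assumptions are being monitored and which are not.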

What does a good program logic look like?

There is no one way to represent a program logic – the test is whether it is a representation of the program's causal links, and whether it communicates effectively with the intended audience by making sense and helping them understand the program.[3] See examples in Program logic library.

Table 9 provides guidance on the aspects required for a good program logic, and explains the different criteria for ‘requires improvement’, ‘satisfactory’ and ‘good’. This may be refined over time in response to user feedback to ensure it is appropriate to a Territory Government context.

See section 2.10. Reviewing the evaluation work plan for suggestions on what to look for when reviewing an established program logic.

Table 9: Program logic rubric[7]
Section of program logic | Requires improvement | Satisfactory | Good (includes all satisfactory criteria plus those listed below)
Overall
  • The logic linking activities/outputs to outcomes is not convincing.
  • Arrows not well matched to timescale.
  • Theory of change ill-defined or not evidence-based.
  • Not comprehensive across the columns.
  • Some components incorrectly placed in columns.
  • Doesn’t fit on one page.
  • Adequately represents the views of the main stakeholders: policy, program and Evaluation Unit.
  • The theory of change is clear and indicated by arrows.
  • The outcomes are realistic relative to the inputs and activities (not changing the world).
  • Uses active, not passive voice.
  • The focus is evaluative rather than promotional.
  • All components are in correct columns.
  • Outputs and/or outcomes are linked to activities.
  • The logic linking activities/outputs to outcomes is plausible.
  • Fits on one page.
  • Has been cleared/approved at GM level or other where appropriate.
  • Has been presented to PAC for noting.
  • The template has been adapted to a sensible extent to capture differences between programs.
  • A key is provided where useful/applicable.
  • Acronyms are explained.
  • Isn’t cluttered, with a suitable level of detail.
  • The logic linking activities/outputs to outcomes is based on evidence.
Inputs and participation
  • Is either not comprehensive or inaccurate in relation to inputs and stakeholders.
  • Omits staffing and/or administered funding.
  • Lists government under participation (unless the program targets government as the beneficiary).
  • Inputs section includes staffing.
  • Inputs section includes formal external inputs where the department is not the sole funder.
  • Funding for inputs is broken down by administered and departmental, where known.
  • Inputs section includes a clear timeframe for funding, either across the lifetime of the program or other clear timeframes.
  • Participation section identifies target recipients for the program. The focus is on beneficiaries, not deliverers of it, such as government.
  • If many participants, these are grouped into logical subgroups.
  • Includes in-kind inputs where relevant.
  • Clarifies target market — distinguishes between primary and secondary beneficiaries.
  • Participation is represented so as to align with activities and outcomes.
  • Includes all stakeholders impacted, not just program participants.
  • Concise.
Activities and/or outputs
  • Too much detail on generic administration processes, such as for granting programs.
  • Outputs are confused with or substitute for outcomes.
  • Activities don’t link to outputs and outcomes.
  • Identifies who does what to whom.
  • Separates Commonwealth and participant activities as necessary.
  • Shows ordering of key activities and links to outcomes.
  • Activities/outputs are directly related to objectives and can be monitored and assessed.
  • Avoids too much detail on generic administration processes such as for granting programs.
  • Uses action verbs to identify activities.
  • Outcomes are informed by evidence and experience/lessons learnt.
Outcomes
  • Outcomes are not comprehensively identified.
  • Outputs are confused with outcomes.
  • No theory of change (no connecting links between boxes or every box connects to every other box).
  • Outcomes are aspirational and/or not able to be assessed.
  • Simply restates policy objectives.
  • Doesn’t consider short/medium/long-term outcomes.
  • Links between shorter and longer-term outcomes aren’t convincing.
  • Outcomes are out of proportion to inputs.
  • Identification of outcomes is suitably comprehensive.
  • Articulates who the outcomes relate to (who is benefiting/being affected).
  • Uses evaluative, not promotional language.
  • Outcome language expresses proportional change, not just raw numbers.
  • Provides realistic timeframes for outcomes.
  • Uses SMART indicators.¹ Outcomes that can’t be measured are clearly indicated.
  • Outcomes align with objectives.
  • Outcomes are well connected, with a logical flow from short-term to long-term.
  • Demonstrates logic links and clearly articulates anticipated changes.
  • Doesn’t restate activities/outputs.
  • Links between shorter and longer-term outcomes are plausible.
  • Uses feedback loops if appropriate.
  • Marks external factors and assumptions in links.
  • Outcomes link backwards to outputs and activities.
  • Links such as between shorter and longer-term outcomes are based on evidence.
External factors and assumptions
  • Not included or not clearly identified.
  • Not supported by evidence.
  • Key external factors and assumptions identified.
  • Assumptions supported by evidence/theory of change and risks.
  • Informed by lessons learnt.
  • Assumptions comprehensively state the conditions required for the program to function effectively.

1. SMART: Specific, Measurable, Attainable, Relevant and Time-bound.
Source: Department of Industry, Innovation and Science (2017)

Table 10: Program logic library
  • Australian Institute of Family Studies Blairtown example program logic – a hypothetical program aiming to ensure children reach appropriate developmental milestones. Includes assumptions and external factors.
  • Evidence-Based Programs and Practice in Children and Parenting Support Programs – a project supporting nine Children and Parenting Support services in regional and rural NSW to enhance their use of evidence-based programs and practice. Includes assumptions and external factors.
  • National Forum on Youth Violence Prevention – a program that aims to maximise the use of city partnerships and increase the effectiveness of federal agencies to reduce youth violence. Includes assumptions and external factors.
  • Australian Policy Service Policy Hub Evaluation Ready example program logic (Save Our Town) – a hypothetical program aimed at stimulating private sector investment, population growth, and economic expansion and diversification to increase a region’s viability. Includes assumptions and external factors.
  • University of Michigan Evaluation Resource Assistant example program logic – a hypothetical program aimed at reducing rates of child abuse and neglect. Does not include assumptions and external factors; however, it is a good example of how to use a program logic to prioritise key evaluation questions and indicators.
  • The Goldilocks Toolkit Case Studies – program logics and lessons learned from a range of international social programs including Acumen, GiveDirectly, Digital Green, Root Capital, Splash, TulaSalud, Women for Women International and One Acre Fund. The program logics do not include assumptions and external factors, but the case studies provide examples of how to reduce evaluation costs through a good monitoring system.
  • Geo-Mapping for Energy & Minerals – Appendix A of this evaluation report by Natural Resources Canada has a program logic for a program improving regional geological mapping for responsible resource exploration and development. Does not include assumptions and external factors, but does have an evaluation matrix to show how the evaluation questions will be addressed.
  • The Logic Model Guidebook: Better Strategies for Great Results – program logics from a Community Leadership Academy (see page 10, and a marked-up version on page 56) and a Health Improvement program (see page 39, and a marked-up version on page 57).

2.5.2. Evaluation questions

Across the program cycle, evaluations need to include a range of questions that promote accountability for public funding and learning from program experiences. These questions need to align with the program logic and will form the basis of the terms of reference. Evaluation questions may be added to or amended closer to evaluation commencement to account for changes in policy context, key stakeholders, or performance indicators.

Different types of questions need different methods and designs to answer them. In evaluations there are four main types of questions: descriptive, action, causal and evaluative.

Descriptive questions

Descriptive questions ask about what has happened or how things are. For example:

  • What were the resources used by the program directly and indirectly?
  • What activities occurred?
  • What changes were observed in conditions or in the participants?

Descriptive questions might relate to:

  • Inputs – materials, staff.
  • Processes – implementation, research projects.
  • Outputs – for example, research publications.
  • Outcomes – for example, changes in policy on the basis of research.
  • Impacts – for example, improvements in agricultural production.

Action questions

Action questions ask about what should be done to respond to evaluation findings. For example:

  • What changes should be made to address problems that have been identified?
  • What should be retained or added to reinforce existing strengths?
  • Should the program continue to be funded?

Causal questions

Causal questions ask about what has contributed to changes that have been observed. For example:

  • What produced the outcomes and impacts?
  • What was the contribution of the program to producing the changes that were observed?
  • What other factors or programs contributed to the observed changes?

What is the difference between correlation and causation?

Two variables are classified as correlated if both increase and decrease together (positively correlated) or if one increases and the other decreases (negatively correlated). Correlation analysis measures how close the relationship is between the two variables.
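
As a concrete illustration, the short Python sketch below computes a correlation coefficient between annual program spending and an outcome rate. The figures are made up for illustration only; a coefficient near +1 or −1 indicates a strong relationship, but says nothing by itself about whether the program caused the change.

```python
# Hypothetical illustration only: made-up yearly figures, not real program data.
import numpy as np

program_spend = np.array([1.0, 1.5, 2.0, 2.5, 3.0])      # e.g. $m spent per year
outcome_rate = np.array([40.0, 36.0, 33.0, 29.0, 26.0])   # e.g. rate per 1,000 people

# Pearson correlation coefficient: +1 = perfect positive, -1 = perfect negative.
r = np.corrcoef(program_spend, outcome_rate)[0, 1]
print(f"correlation: {r:.2f}")

# A strong correlation like this is consistent with the program causing the
# change, but it is equally consistent with a third factor (such as the
# education effect described in Box 1) driving both series.
```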

Causal questions need to investigate whether programs are causing the outcomes that are observed. Although there may be a strong correlation between two variables, for example the introduction of a new program and a particular outcome, this correlation does not necessarily mean the program is directly causing the outcome. See Box 1 for an example.

Box 1: Evaluating to improve resource allocations for family planning and fertility in Indonesia

Indonesia’s innovative family planning efforts gained international recognition in the 1970s for their success in decreasing the country’s fertility rates. The acclaim arose from two parallel phenomena: (1) fertility rates declined by 22% between 1970–1980, by 25% between 1981–1990, and a bit more moderately between 1991–1994; and (2) during the same period, the Indonesian government substantially increased resources allocated to family planning (particularly contraceptive subsidies).

Given that the two things happened concurrently, many concluded that increased investment in family planning had led to lower fertility rates. Unconvinced by the available evidence, a team of researchers evaluated the impact of family planning programs on fertility rates and found, contrary to what was generally believed, that family planning programs only had a moderate impact on fertility, with changes in women’s status deemed to have a larger impact on fertility rates.

The researchers noted that before the start of the family planning program very few women of reproductive age had finished primary education. During the same period as the family planning program, however, the Indonesian government undertook a large-scale education program for girls. By the end of the program, women entering reproductive age had benefited from the additional education. When the oil boom brought economic expansion and increased demand for labour in Indonesia, the participation of educated women in the labour force increased significantly. As the value of women’s time at work rose, so did the use of contraceptives. In the end, higher wages and empowerment explained 70% of the observed decline in fertility—more than the investment in family planning programs.

These evaluation results informed policy makers’ subsequent resource allocation decisions: funding was reprogrammed away from contraception subsidies and towards programs that increased the enrolment of women in school. Although the ultimate goals of the two programs were similar, evaluation studies had shown that in the Indonesian context, lower fertility rates could be obtained more effectively by investing in education than by investing in family planning.[8]

There are many designs and methods to answer causal questions but they usually involve one or more of these strategies:

(a) Compare results to an estimate of what would have happened if the program had not occurred (this is known as a counterfactual)

This might involve creating a control group (where people or sites are randomly assigned to either participate or not) or a comparison group (where those who participate are compared to others who are matched in various ways). Techniques include the following (a minimal illustrative sketch of the underlying comparison appears after the list):

  • Randomised controlled trial (RCT): a control group is compared to one or more treatment groups.
  • Matched comparison: participants are each matched with a non-participant on variables that are thought to be relevant. It can be difficult to adequately match on all relevant criteria.
  • Propensity score matching: creates a comparison group based on an analysis of the factors that influenced people’s propensity to participate in the program.
  • Regression discontinuity: compares the outcomes of individuals just below the cut-off point with those just above the cut-off point.[9]
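
The sketch below is a minimal, hypothetical illustration of the counterfactual logic these techniques share: the program’s effect is estimated as the difference in outcomes between those who received the program and a control or comparison group standing in for the counterfactual. The data are simulated and the simple difference-in-means analysis is for illustration only, not a recommended design.

```python
# Minimal, hypothetical sketch of a counterfactual comparison using simulated
# data. A real evaluation would use one of the designs listed above and more
# careful inference than this illustration.
import numpy as np

rng = np.random.default_rng(0)

# Simulated outcome scores: the treatment group's true mean is 5 points higher.
control = rng.normal(loc=50, scale=10, size=200)    # did not receive the program
treatment = rng.normal(loc=55, scale=10, size=200)  # received the program

# Estimated effect: difference in means (observed outcome minus counterfactual estimate).
effect = treatment.mean() - control.mean()

# A rough standard error for the difference in means.
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))

print(f"estimated effect: {effect:.1f} "
      f"(approx. 95% CI: {effect - 1.96 * se:.1f} to {effect + 1.96 * se:.1f})")
```

In practice, the credibility of this comparison rests entirely on how well the control or comparison group represents what would have happened without the program.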
(b) Check for consistency of the evidence with the theory of how the intervention would contribute to the observed results

This can involve checking that intermediate outcomes have been achieved, using process tracing to check each causal link in the theory of change, identifying and following up anomalies that don’t fit the pattern, and asking participants to describe how the changes came about. Techniques include:

  • Contribution analysis: sets out the theory of change that is understood to produce the observed outcomes and impacts and then searches iteratively for evidence that will either support or challenge it.
  • Key informant attribution: asks participants and other informed people about what they believe caused the impacts and gathers information about the details of the causal processes.
  • Qualitative comparative analysis: compares different cases to identify the different combinations of factors that produce certain outcomes.
  • Process tracing: a case-based approach to causal inference which focuses on the use of clues within a case (causal-process observations) to adjudicate between alternative possible explanations. It involves checking each step in the causal chain to see if the evidence supports, fails to support or rules out the theory that the program or project produced the observed impacts.
  • Qualitative impact assessment protocol: combines information from relevant stakeholders, process tracing and contribution analysis, using interviews undertaken in a way to reduce biased narratives.
(c) Identify and rule out alternative explanations

This can involve a process to identify possible alternative explanations (perhaps involving interviews with program sceptics and critics, and drawing on previous research and evaluation, as well as interviews with participants) and then searching for evidence that can rule them out.

While technical expertise is needed to choose the appropriate option for answering causal questions, the program manager should be able to check that an explicit approach is being used, and seek technical review of its appropriateness from third parties where necessary.

Further guidance and options for measuring causal attribution can be found in the UNICEF Impact evaluation series.

Evaluative questions

Evaluative questions ask whether an intervention can be considered a success, an improvement or the best option and require a combination of explicit values as well as evidence – for example:

  • In what ways and for whom was the program successful?
  • Did the program provide value for money, taking into account all the costs incurred (not only the direct funding) and any negative outcomes?

Many evaluations do not make explicit how evaluative questions will be answered – what the criteria will be (the domains of performance), what the standard will be (the level of performance that will be considered adequate or good), or how different criteria will be weighted. A review of the design could check each of these in turn:

  • Are there clear criteria for this evaluative question?
  • Are there clear standards for judging the quality of performance on each criterion?
  • Is there clarity about how to synthesise evidence across criteria? Is there a performance framework that explains what “how good”, “how well” or “how much” mean in practice?[10] For example, is it better to have some improvement for everyone or big improvements for a few?
  • Are the criteria, standards and approach to synthesis appropriate? What has been their source? Is further review of these needed? Who should be involved?

Ideally an evaluation design will be explicit about these, including the source of these criteria and standards. The BetterEvaluation website has further information on Evaluation methods for assessing value for money and Oxford Policy Management’s approach to assessing value for money is useful for assessing value for money in complex interventions.
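
As a simple illustration of making criteria, standards and weights explicit, the hypothetical Python sketch below converts performance ratings on each criterion into a single weighted judgement. The criteria, weights and ratings are invented examples; the point is not the arithmetic but that every element of the judgement is stated up front.

```python
# Hypothetical illustration of explicit criteria, standards and weights for an
# evaluative question. The criteria, weights and ratings are invented examples.

# Standards: what each rating level means (0 = poor ... 3 = good).
standards = {0: "poor", 1: "adequate", 2: "satisfactory", 3: "good"}

# Criteria with explicit weights (summing to 1) and the rating assigned to each
# criterion from the evidence.
criteria = [
    {"criterion": "reach of intended beneficiaries", "weight": 0.4, "rating": 2},
    {"criterion": "size of improvement for participants", "weight": 0.4, "rating": 3},
    {"criterion": "cost relative to comparable programs", "weight": 0.2, "rating": 1},
]

# Weighted synthesis across criteria.
overall = sum(c["weight"] * c["rating"] for c in criteria)

for c in criteria:
    print(f"{c['criterion']}: {standards[c['rating']]} (weight {c['weight']})")
print(f"weighted overall rating: {overall:.1f} out of 3")
```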

Key evaluation questions

To clarify the purpose and objectives of an evaluation, there should be a limited number of higher order key evaluation questions (roughly 5 to 7 questions) addressing:

  • appropriateness – to what extent does the program address an identified need?
  • effectiveness – to what extent is the program achieving the intended outcomes, in the short, medium and long term?
  • efficiency – do the outcomes of the program represent value for money?

These key evaluation questions are high-level research topics that can be broken down into detailed sub‑questions, each addressing a particular aspect. The key evaluation questions are not yes or no questions.

Key evaluation questions often contain more than one type of evaluation question – for example to answer “How effective has the program been?” requires answering:

  • descriptive questions – What changes have occurred?
  • causal questions – What contribution did the intervention make to these changes?
  • evaluative questions – How valuable were the changes in terms of the stated goals, taking into account the types, level and distribution of the changes?

A way to test the validity and scope of evaluation questions is to ask: when the evaluation has answered these questions, have we met the full purpose of the evaluation?[11]

See also section 4.2.1 Evidence synthesis.

2.5.3. Types of evaluation

While there are a number of different approaches to evaluation,[12] the Program evaluation framework is based on three types,[13] linked to the program lifecycle:

  1. Process evaluation: considers program design and initial implementation (≤18 months).
  2. Outcome evaluation: considers program implementation (>2 years) and short to medium term outcomes.
  3. Impact evaluation: considers medium to long term outcomes (>3 years), and whether the program contributed to the outcomes and represented value for money.

These three evaluation types address different questions at various stages of the program lifecycle, with each evaluation building on the evidence from the previous evaluation (Figure 3). Not all programs will require all three evaluation types. The evaluation overview, completed as part of the Cabinet submission process, will specify which evaluation types are necessary for each program. The different types of evaluation are used to build a clearer picture of program effectiveness as the program matures (Figure 4).

Figure 3: Different types of evaluations consider different aspects of the program[14]


Figure 4: Evaluations over the program lifecycle of a major program[15]


Process evaluations

A process evaluation investigates whether the program is being implemented according to plan.[16] This type of evaluation can help to determine whether an ineffective program reflects implementation failure (where the program has not been adequately implemented) or theory failure (where the program was adequately implemented but did not produce the intended impacts).[17] As an ongoing evaluative strategy, it can be used to continually improve programs by informing adjustments to delivery.[11]

Process evaluations may be undertaken by the relevant program team, if they have appropriate capability.

A process evaluation will typically try to answer questions such as:

  • Was the program implemented in accordance with the initial program design?
  • Was the program rollout completed on time and within the approved budget?
  • Are there any adjustments to the implementation approach that need to be made?
  • Are more or different key performance indicators required?
  • Is the right data being collected in an efficient way?

Outcome evaluations

An outcome evaluation assesses progress towards the early to medium-term results that the program is aiming to achieve.[16] It is suited to programs at a business-as-usual stage in the program lifecycle and is usually externally commissioned.

An outcome evaluation will typically try to answer questions such as:

  • What early outcomes or indications of future outcomes are suggested by the data?
  • Did the program have any unintended consequences, positive or negative? If so, what were those consequences? How and why did they occur?
  • How ready is the program for an impact evaluation?

There is an important distinction between measuring outcomes, which is a description of the factual, and using a counterfactual to attribute observed outcomes to the intervention.[18] A good outcome evaluation should consider whether the program has contributed to the outcome, noting this becomes easier over time and is therefore more of a focus in impact evaluations (see Figure 5).

Figure 5 illustrates impact as the change in outcomes for those affected by a program (blue line) compared with the alternative outcomes had the program not existed (orange line). As impact generally increases over time, it tends to be measured later in the program lifecycle.
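
The relationship in Figure 5 can be expressed as a simple calculation: impact at any point in time is the observed outcome for the program group minus the estimated counterfactual outcome. The sketch below uses invented yearly figures purely to illustrate why the measurable gap tends to widen later in the program lifecycle.

```python
# Hypothetical illustration of Figure 5: impact is the gap between observed
# outcomes and the estimated counterfactual, and here it widens over time.
# All numbers are invented for illustration.

years = [1, 2, 3, 4, 5]
observed = [50, 53, 57, 62, 68]        # outcomes with the program (blue line)
counterfactual = [50, 51, 52, 53, 54]  # estimated outcomes without it (orange line)

for year, obs, cf in zip(years, observed, counterfactual):
    impact = obs - cf
    print(f"year {year}: observed {obs}, counterfactual {cf}, impact {impact}")
```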

Figure 5:  Impact of program outcomes over time


Impact evaluations

An impact evaluation builds on an outcome evaluation to assess longer-term results.[19] It must test whether the program has made a difference by comparing observed outcomes with what would have happened in the absence of the program[1] (further guidance in Causal questions). In situations where it is not possible or appropriate to undertake a rigorous impact evaluation, it may be better to monitor, learn and improve[1] through process and/or outcome evaluations.

As impact is the change in outcomes compared to the alternative outcomes had the program not existed,[1] it is usually easier to measure impact later in the lifecycle of the program (see Figure 5). These evaluations commonly occur at least three years after program implementation. However, the appropriate timing for measuring impact will depend on the program and needs to be decided on a case-by-case basis.[20]

An impact evaluation will typically try to answer questions such as:

  • Were the intended outcomes achieved as set out in the program’s aims and objectives?
  • Have other investments influenced the attainment of the program’s aims and objectives? If so, in what way?
  • Did the program contribute to achieving the outcomes as anticipated? If so, to what extent?
  • Were there any unintended consequences?
  • What would have been the situation had the program not been implemented?
  • To what extent did the benefits of the program outweigh the costs?
  • Did the program represent good value for money?
  • Was the program delivered cost-effectively?

Impact evaluations are usually externally commissioned due to their complexity and are generally reserved for high-risk and complex programs due to their cost. The design options for an impact evaluation need significant investment in preparation and early data collection. It is important that impact evaluation is addressed as part of the integrated monitoring and evaluation approach outlined in the evaluation work plan, so that data from monitoring and from the process and outcome evaluations can be used as needed.[21] Equity concerns may require an impact evaluation to go beyond simple average impacts to identify for whom and in what ways the program has impacted outcomes (further guidance in section 2.8: Ethical considerations).[21]

Impact evaluations usually include a value-for-money assessment to determine whether the benefits of the program outweighed the costs and whether the outcomes could have been achieved more efficiently through program efficiencies or a different approach.[22] Value for money in this context is broader than a cost-benefit analysis; it is a question of how well resources have been used and whether that use is justified (further guidance in Evaluative questions).

Further information on impact evaluations is available from BetterEvaluation and the UNICEF impact evaluation series.

External or internal evaluation

Evaluations can either be commissioned externally to an appropriate consultant or academic evaluator, conducted internally by agency staff or conducted using a hybrid model of an internal evaluator supported by an external evaluator:

  • External evaluator(s): an external evaluator serves as team leader and is supported by program staff.
  • Internal evaluator(s): an internal evaluator serves as team leader and is supported by program staff.
  • Hybrid model: an internal evaluator serves as team leader and is supported by other internal evaluators and program staff, as well as external evaluator(s).

If an external evaluator is hired to conduct the evaluation, the program manager and other agency staff still need to be involved in the evaluation process. Program staff are not only primary users of the evaluation findings but are also involved in other evaluation-related tasks (such as providing access to records or educating the evaluator about the program). Be realistic about the amount of time needed for this involvement so staff schedules do not get over-burdened.

The decision to conduct an evaluation internally or commission an external evaluation is usually a decision for the agency’s accountable officer. However, as a general best-practice guide, outcome or impact evaluations of high-tier programs should be conducted externally. It is advisable to engage an external evaluator/evaluation team when:

  • the scope and/or complexity of the evaluation requires expertise that is not internally available
  • a program or project is politically sensitive and impartiality is a key concern
  • internal staff resources are scarce and timeframes are particularly pressing (that is, there is little flexibility in terms of evaluation timing).

Table 11 outlines the trade-offs between internal and external evaluators.

Table 11: Internal versus external evaluators[23]

Perspective
  • Internal evaluator(s): May be more familiar with the community, issues and constraints, data sources, and resources associated with the project/program (an insider’s perspective).
  • External evaluator(s): May bring a fresh perspective, insight, broader experience, and recent state-of-the-art knowledge (an outsider’s perspective).
Knowledge and skills
  • Internal evaluator(s): Are familiar with the substance and context of research for development programming.
  • External evaluator(s): May possess knowledge and skills that internal evaluators are lacking. However, it may be difficult to find evaluators who understand the specifics of research for development programming.
Buy-in
  • Internal evaluator(s): May be more familiar with the project/program staff and may be perceived as less threatening. In some contexts, may be seen as too close, and participants may be unwilling to provide honest feedback.
  • External evaluator(s): May be perceived as intrusive or a threat to the project/program (an adversary). Alternatively, may be considered impartial, and participants may be more comfortable providing honest feedback.
Stake in the evaluation
  • Internal evaluator(s): May be perceived as having an agenda or stake in the evaluation.
  • External evaluator(s): Can serve more easily as an arbitrator or facilitator between stakeholders, as they are perceived as neutral.
Credibility
  • Internal evaluator(s): May be perceived as biased or ‘too close’ to the subject matter, which may reduce the credibility of the evaluation and hinder its use.
  • External evaluator(s): May provide a view of the project/program that is considered more objective, giving the findings more credibility and potential for use.
Resources
  • Internal evaluator(s): May use considerable staff time, which is always in limited supply, especially when their time is not solely dedicated to the evaluation.
  • External evaluator(s): May be more costly and still involve substantial management/staff time from the commissioning organisation.
Follow-up/use of evaluation findings
  • Internal evaluator(s): Have more opportunity and authority to follow up on recommendations of the evaluation.
  • External evaluator(s): Contracts often end with delivery of the final product, typically the final evaluation report, which limits or prohibits follow-up. As outsiders, they do not have authority to require appropriate follow-up or action.

2.5.4. Data matrix

A data matrix outlines the sources and types of data that will need to be collected by the program team as part of the monitoring, as well as by the evaluator at the time of the evaluation, to ensure that the evaluation questions can be answered. The data matrix should indicate which evaluations will address which questions. Each evaluation does not need to address all the evaluation questions, however, all questions should be addressed over the entire evaluation plan (process, outcome and impact evaluations). See section 2.5.2. Evaluation questions for further information on which questions are addressed in the different types of evaluations.

A data matrix can also help focus data collection to ensure that only relevant data is collected. Data collection has real costs in terms of staff time and resources, as well as the time it asks of respondents. It is important to weigh the costs and benefits of data collection activities to find the right balance.[1]
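
In practice a data matrix is usually maintained as a simple table or spreadsheet. The sketch below is a minimal, hypothetical illustration in Python; the questions, indicators, sources and timings are invented examples, but the fields show the kind of information a data matrix typically links: each key evaluation question mapped to indicators, data sources, who collects the data, when, and which evaluation type will address it.

```python
# Minimal, hypothetical data matrix. The questions, indicators, sources and
# timings are invented examples, not prescribed content.
import csv
import sys

data_matrix = [
    {
        "key_evaluation_question": "To what extent is the program achieving the intended outcomes?",
        "indicator": "change in participant outcomes against baseline",
        "data_source": "program administrative data; participant survey",
        "collected_by": "program team (monitoring)",
        "collection_timing": "baseline, then annually",
        "addressed_in": "outcome and impact evaluations",
    },
    {
        "key_evaluation_question": "Was the program implemented in accordance with the initial design?",
        "indicator": "proportion of planned activities delivered on schedule",
        "data_source": "delivery records",
        "collected_by": "program team (monitoring)",
        "collection_timing": "quarterly",
        "addressed_in": "process evaluation",
    },
]

# Write the matrix out as CSV so it can be maintained in a spreadsheet.
writer = csv.DictWriter(sys.stdout, fieldnames=data_matrix[0].keys())
writer.writeheader()
writer.writerows(data_matrix)
```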


[1] M. K. Gugerty, D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.

[2] S. C. Funnell and P. J. Rogers (2011), Purposeful Program Theory: Effective Use of Theories of Change and Logic Models [VitalSource Bookshelf version].

[3] NSW evaluation toolkit, Develop program logic and review needs.

[4] For example see Figure 11, page 66, Ending violence against women and girls: Evaluating a decade of Australia’s development assistance.

[5] BetterEvaluation Backcasting.

[6] Adapted from S. C. Funnell and P. J. Rogers (2011), Purposeful Program Theory: Effective Use of Theories of Change and Logic Models [VitalSource Bookshelf version], and M. K. Gugerty and D. Karlan, The Goldilocks Challenge: Right Fit Evidence for the Social Sector, New York, Oxford University Press, 2018.

[7] Department of Industry, Innovation and Science (DIIS), Evaluation Strategy 2017–2021.

[8] World Bank (2011), Impact Evaluation in Practice.

[9] For a good example of a low-cost impact evaluation using regression discontinuity, see the Root Capital Case Study in the Goldilocks Toolkit .

[10] Further guidance on developing performance frameworks, including examples, is on pages 6–11 of Evaluation Building Blocks – A Guide by Kinnect Group.

[11] NSW Evaluation toolkit, Develop the evaluation brief.

[12] For information about other evaluation types, please see BetterEvaluation.

[13] Further information on these three evaluation types are in sections 3.2.1 to 3.2.3.

[14] Adapted from the Department of Industry, Innovation and Science Evaluation Strategy, 2017-2021.

[15] Adapted from NSW Government Evaluation Framework 2013.

[16] BetterEvaluation Manager’s Guide to Evaluation, accessed May 2020, Develop agreed key evaluation questions.

[17] Rogers, P. et al. (2015), Choosing appropriate designs and methods for impact evaluation, Office of the Chief Economist, Australian Government, Department of Industry, Science, Energy and Resources.

[18] OECD: Outline of principles of impact evaluation.

[19] BetterEvaluation: Themes.

[20] UNICEF Impact Evaluation Series, webinar 5 – RCTs.

[21] Adapted from BetterEvaluation: Themes – Impact evaluation.

[22] Value for money can also be considered in earlier evaluations, if there is sufficient evidence. As the evidence for a program accumulates, so will the expectation for an assessment of value for money.

[23] BetterEvaluation Commissioner’s Guide.

Last updated: 02 February 2021
