Beyond the Hype: Uncovering Pitfalls in the Use of AI in Evaluation

Authors

  • Steve Jacob, Université Laval

DOI:

https://doi.org/10.18357/cjpe.2026.40.1.1210

Keywords:

Artificial Intelligence, Evaluator Competencies, Ethical Risks, Professional Responsibility, AI Governance

Abstract

The increasing use of artificial intelligence (AI) in program evaluation has opened new opportunities in data gathering and analysis, document synthesis, and predictive modeling. Evaluators are exploring a growing array of applications, ranging from traditional AI systems suited to classification and prediction to generative AI tools suited to producing text or images. While these technologies offer efficiency gains, their integration into evaluation practice also raises significant concerns and creates new risks.

This article explores four important pitfalls associated with the use of AI in evaluation: the illusion of infallibility, the lack of explainability, the distortion of reality, and the reinforcement of inequality. The analysis draws on examples from both traditional and generative AI systems to demonstrate how these technologies can undermine core principles of evaluation practice. The article argues for a more critical and reflective engagement with AI. In this new age of AI, evaluators will need to adapt their professional judgment, enhance their ethical awareness, and reconsider the methodological foundations of their practice.

Published

2026-05-21

How to Cite

Jacob, S. (2026). Beyond the Hype: Uncovering Pitfalls in the Use of AI in Evaluation. Canadian Journal of Program Evaluation, 40(1), 1–17. https://doi.org/10.18357/cjpe.2026.40.1.1210

Section

Thematic Segment
