Beyond the Hype
Uncovering Pitfalls in the Use of AI in Evaluation
DOI:
https://doi.org/10.18357/cjpe.2026.40.1.1210
Keywords:
Artificial Intelligence, Evaluator Competencies, Ethical Risks, Professional Responsibility, AI Governance
Abstract
The growing use of artificial intelligence (AI) in program evaluation has opened new opportunities in data gathering and analysis, document synthesis, and predictive modeling. Evaluators are exploring a widening array of applications, ranging from traditional AI systems used for classification and prediction to generative AI tools better suited to producing text or images. While these technologies offer efficiency gains, their integration into evaluation practice also raises significant concerns and creates new risks.
This article explores four important pitfalls associated with the use of AI in evaluation: the illusion of infallibility, the lack of explainability, the distortion of reality, and the reinforcement of inequality. The analysis draws on examples from both traditional and generative AI systems to demonstrate how these technologies can undermine core principles of evaluation practice. The article argues for a more critical and reflective engagement with AI. In this new age of AI, evaluators will need to adapt their professional judgment, enhance their ethical awareness, and reconsider the methodological foundations of their practice.
License
Copyright (c) 2026 Steve Jacob

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors contributing to The Canadian Journal of Program Evaluation agree to release their articles under the Creative Commons Attribution-NonCommercial 4.0 (CC-BY-NC) license. This license allows the work to be copied, distributed, remixed, transformed, and built upon for any non-commercial purpose, provided that appropriate attribution is given, a link to the license is provided, and any changes made are indicated.
Authors retain copyright of their work and grant the journal right of first publication.
Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., posting it to an institutional repository or publishing it in a book), with an acknowledgement of its initial publication in this journal.


