On AI agents and their limitations in experimental work

Andreas Ortmann
May 18, 2024


I came across an intriguing paper by a colleague of mine in the School of Information Systems and Technology Management here at UNSW. A current version of the paper is online.

Let me first summarize why I think Sam’s paper is intriguing but then also ask some questions about what happens when AI agents represent humans.

Why I think Sam’s paper is intriguing

My colleague tries to replicate a replication study of ten operations management experiments that was recently published in Management Science. Here is how the abstract of that study reads:

Abstract. Over the last two decades, researchers in operations management have increasingly leveraged laboratory experiments to identify key behavioral insights. These experiments inform behavioral theories of operations management, impacting domains including inventory, supply chain management, queuing, forecasting, and sourcing. Yet, until now, the replicability of most behavioral insights from these laboratory experiments has been untested. We remedy this with the first large-scale replication study in operations management. With the input of the wider operations management community, we identify 10 prominent experimental operations management papers published in Management Science, which span a variety of domains, to be the focus of our replication effort. For each paper, we conduct a high-powered replication study of the main results across multiple locations using original materials (when available and suitable). In addition, our study tests replicability in multiple modalities (in-person and online) due to laboratory closures during the COVID-19 pandemic. Our replication study contributes new knowledge about the robustness of several key behavioral theories in operations management and contributes more broadly to efforts in the operations management field to improve research transparency and reliability.

The abstract curiously does not mention that this replication study fared quite well: the replicators find that, “of the 10 papers whose main hypotheses are tested in the project, 6 achieve full replication, 2 achieve partial replication, and 2 do not replicate.” (p. 4978) Not bad, and on par with recent replication attempts in economics (e.g., Brodeur et al., 2024, and references therein).

My colleague replicates 9 of these 10 papers, applying similar standards to assess the performance of his state-of-the-art custom GPT assistants against the key hypotheses of those papers. (The 10th paper used real-effort tasks, which have yet to be implemented on AI platforms.) In 7 of his 9 replications, he finds the AI assistants’ responses to be consistent with human decision making. These are rather promising results, on par with the results of the original Management Science replication study.

Some questions about what happens when AI agents represent humans

In his Discussion and Conclusions section, Sam claims that GPT uses “human-like heuristics”: not always optimally, but successfully enough to produce the remarkable replication rate that he reports.

It’s this claim that concerns me most. Let me discuss my concerns by way of another example: the well-known Linda problem, recently also analyzed with AI tools in this paper. There are many versions of the problem but the canonical one says:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.

The original much-cited results seemed to support the claim by Kahneman & Tversky (KT) that humans are susceptible to the conjunction fallacy — the inference that a conjoint set of two or more specific statements is likelier than any single member of that same set, in violation of the laws of probability. KT argued that this was due to humans applying the so-called representativeness heuristic of judging the likelihood of an event based on how closely it resembles a typical example, rather than the laws of probability.
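The conjunction rule that Linda-problem responses violate can be stated in one line of arithmetic. Here is a minimal sketch in Python; the probability values are hypothetical, chosen purely for illustration:

```python
# The conjunction rule: for any events A and B, P(A and B) <= P(A).
# The numbers below are assumptions for illustration only.
p_teller = 0.05                  # assumed P(Linda is a bank teller)
p_feminist_given_teller = 0.30   # assumed P(active feminist | bank teller)

# P(teller AND feminist) = P(teller) * P(feminist | teller)
p_both = p_teller * p_feminist_given_teller

# Whatever values we pick, the conjunction can never exceed
# the probability of either of its components.
assert p_both <= p_teller
print(p_both)
```

Judging option 2 more probable than option 1 is therefore a logical impossibility, regardless of how representative of a feminist Linda's description is.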

The Linda problem is one of the many questionable demonstrations by KT that humans are pretty bad at formal logic. It turns out that is a problematic claim.

Exhibit 1:

Both of these figures are from chapter 109 in the Plott & Smith Handbook of Experimental Economics Results (2008). The chapter is titled Cognitive Illusions Reconsidered and covers a number of so-called cognitive illusions that Gigerenzer, Hertwig, and their collaborators managed to make appear, disappear, and even reverse systematically. In the case of the Linda problem they identified two such strategies: to disambiguate the meaning of the word “probability” (Figure 6) and to ask for the frequency of statements (Figure 7).

Exhibit 2:

In an article a few years back, Charness et al. proposed a couple of additional strategies: they incentivized their subjects, and they let them work through the problem in pairs and trios. The key table from that paper (and what a doozy it is) tells the story:

As you can see, the standard KT implementation has an extraordinary “error rate” (i.e., rate of violations of the laws of probability) of 85.2 percent, but incentivization and pairings work their magic. For incentivized trios the error rate drops to 10.4 percent; miraculously, to some psychologists at least, subjects have become pretty darn good at following the laws of probability. Charness and his colleagues note that our decisions are typically made together with others and do have consequences, so their results seem more instructive than KT’s unincentivized scenario studies.

Back, then, to the question of the usefulness of AI agents, or Large Language Models, in replicating experimental studies. Wang et al (2024), in their “Will the Real Linda Please Stand up … to Large Language Models? Examining the Representativeness Heuristic in LLMs”, find that LLMs tend to be susceptible to the representativeness heuristic, i.e., that they commit high error rates even though they (presumably) know the laws of probability. This is true for all four LLMs the authors put to the test. While this result is only moderately robust (the authors also subject those four LLMs to other cognitive illusions, and results vary), the question is why these AI agents get the Linda problems so wrong.

As far as I can see, there are three important issues here:

First, are the representativeness results in Wang et al (2024) inevitable because of the very strength of AI agents/LLMs, namely their processing of huge corpora of data? These representativeness results may simply reflect the fact that the KT (“heuristics & biases”) program is still the dominant paradigm in the literature, and certainly was for many decades.

Second, there is no such thing as a “human-like heuristic” that AI agents/LLMs adopt, because the successful heuristic, or mode of thinking, is very much a function of the experimental set-up, as demonstrated above. An LLM might show us the KT results, or the kind of results reported by Gigerenzer, Hertwig, and their colleagues, or by Charness and his colleagues. The latter results seem less likely simply because the KT heuristics & biases program remains the dominant paradigm in the literature.

Third, the results reported in Wang et al (2024) pose important questions about the optimization function that AI agents/LLMs have. Is there a way to incentivize them? Is there a way to let them reason through such problems together? And relatedly, would disambiguation of the meaning of the word “probability” and asking for the frequency of statements lead to other results?

In the end it comes to this: AI agents are not incentivized. There are no consequences attached to whatever it is that they are doing. But that surely matters in some contexts, such as the Linda problem or, for example, the elicitation of risk aversion. (On that count, see also the recent PNAS paper by Mei et al., which I will discuss in a separate post. Stay tuned.)

Back to my colleague’s AI-assisted replication study. The original and replicated studies were all incentivized (with possibly one exception), so it is interesting to see that, and to understand why, Sam’s AI agents do a reasonable job in his replication exercise.

Consider making your opinion known by applauding, or commenting, or following me here, or on Facebook.
