Are AI chatbots behaviorally similar to humans?

Andreas Ortmann
May 21, 2024 · 5 min read


Mei et al. (PNAS 2024) report a “Turing test” of whether AI chatbots are behaviorally similar to humans by having them play six experimental workhorses: a dictator game, an ultimatum game, a trust game, a bomb risk elicitation task, a public goods game, and a finitely repeated Prisoner’s Dilemma game. ChatGPT-4, we are told, “exhibits behavioral and personality traits that are statistically indistinguishable from a random human.” (abstract) The “personality traits” are assessed by having variations of ChatGPT answer psychological survey questions.

The authors state that “an AI [chatbot] passes the Turing test if its responses cannot be statistically distinguished from randomly selected human responses.” (p.1) How that comes to pass is explained on p. 3 in section B (“The Games and the Turing Test”):

Consider a game and role, for instance, the giver in the Dictator Game. We randomly pick one action from the chatbot’s distribution and one action from the human distribution. We then ask, which action “looks more typically human?” Specifically, we ask which of the two actions is more likely under the human distribution. If AI picks an action that is very rare under the human distribution then it is likely to lose in the sense that the human’s play will often be estimated to be more likely under the human distribution. If AI picks the modal human action then it will either be estimated as being more likely under the human distribution or else tie.
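To make the mechanics concrete, here is a minimal sketch in Python of that pairwise comparison, with made-up numbers: the offer grid, the “human” frequencies, and the two stylized chatbots below are illustrative assumptions of mine, not the paper’s data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical distribution of human Dictator-game offers
# (shares of the pie in 10% steps) -- illustrative numbers only.
actions = np.arange(0, 11)                # offer 0%, 10%, ..., 100%
p_human = np.array([.30, .05, .08, .10, .12, .25, .04, .03, .02, .01, .00])
p_human = p_human / p_human.sum()         # guard against floating-point drift

# Two stylized "chatbots": one always plays the modal human action,
# the other spreads its play over a few mid-range offers.
p_bot_modal  = np.eye(11)[0]              # always offers 0% (the mode above)
p_bot_spread = np.array([0, 0, 0, .2, .3, .5, 0, 0, 0, 0, 0])

def turing_score(p_bot, p_human, n=100_000):
    """Share of pairwise draws in which the bot's action is at least as
    likely under the human distribution as a random human's action
    (ties counted as one half -- one possible convention)."""
    a_bot   = rng.choice(actions, size=n, p=p_bot)
    a_human = rng.choice(actions, size=n, p=p_human)
    wins = p_human[a_bot] > p_human[a_human]
    ties = p_human[a_bot] == p_human[a_human]
    return wins.mean() + 0.5 * ties.mean()

print(turing_score(p_bot_modal,  p_human))   # high: modal play "wins or ties"
print(turing_score(p_bot_spread, p_human))   # lower, despite looking more "human"
```

With these made-up numbers the first, deterministic chatbot scores higher than the second, more human-looking one, exactly as the quoted passage implies for modal play; that feature of the metric is worth keeping in mind.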

That statistical evaluation is a … curious … one for the reason explained by the authors in section C (“Comparison of ChatGPT’s Behaviors to Humans’ on a Variety of Dimensions”):

We also look at distributions of behaviors in more detail across games by comparing the distribution of an AI’s responses to the distribution of human responses. Note that a human distribution is mostly obtained from one observation per human, so its variation is between subjects. Variation in an AI distribution is obtained from the same chatbot, so it is within subject. Thus, the fact that the distributions differ is not informative (my emphasis, AO), but the following information about the distributions is useful to note.

Human players’ actions generally exhibit multiple peaks and nontrivial variance, indicating the presence of varied behavioral patterns across the population. In most games, the responses of ChatGPT-4 and ChatGPT-3 are not deterministic when the same games are repeated (except for ChatGPT-4 in the Dictator game and in the Ultimatum Game as the proposer) and adhere to certain distributions. Typically, the distributions produced by the chatbots encompass a subset of the modes observed in the corresponding human distributions. As illustrated in Fig. 3, ChatGPT-3 makes decisions that result in usually single-mode, and moderately skewed distributions with nontrivial variance. Conversely, ChatGPT-4’s decisions form more concentrated distributions.

Figure 3 is indeed quite interesting.

It shows, for the six games investigated (with Ultimatum proposers under two different social preferences [B, C], and Trust game players in two roles, as investors and as bankers [D, E]), the choice distributions of humans and of the two ChatGPT variants, distributions that confound a between-subject design (humans) with a within-subject design (chatbots). An eyeball test suggests:

First, the support of the human choice distributions is consistently wider, sometimes stunningly so (in particular relative to ChatGPT-4).

Second, the choice distributions for ChatGPT-3 and ChatGPT-4 often differ (e.g., A, C, D, E, H).

All of which suggests that the Turing test used here is only so informative in answering the question posed in the title of this piece. I’d argue that AI chatbots are not all that behaviorally similar to humans. Or, more precisely, that they are similar only under very specific conditions that often have little to do with real life, because AI chatbots are trained on experimental outcomes from low-stakes or hypothetical settings.

Importantly, it is not clear to me what the assessment strategy here (drawing on “a random human from tens of thousands of human subjects from more than 50 countries”) really tells us. Presumably it is some measure of representativeness. But by what standard, please?! And … there is no control for the various design and implementation characteristics that systematically affect the behavior of subjects in all these games (e.g., for Dictator games: Cherry et al. AER 2002, Bekkers SRM 2007, Zhang & Ortmann EE 2014; for Ultimatum games: Oosterbeek et al. EE 2004; for Trust games: Johnson & Mislin JoEP 2011; for Public Goods games: Zelmer EE 2003). That is, just as we have representative sampling of participants (whatever that means in the present context), should we not also have representative sampling of the stimulus materials?

I have provided a related discussion in an earlier piece, by way of the Linda problem. Let me draw here on the risk aversion discussed by the present authors on p. 5. They write:

The chatbots also differ in their exhibited risk preferences. In the Bomb Risk Game (Fig. 5), both ChatGPT-3 and ChatGPT-4 predominantly opt for the expected payoff-maximizing decision of opening 50 boxes (my emphasis, AO). This contrasts with the more varied human decisions, which include a distinct group of extreme subjects who only open one box.

In other words, the AI chatbots predominantly present themselves as risk-neutral participants, while it is well known that even under relatively minor incentivization humans are risk-averse to varying degrees and, in the extreme, open only one box. It is widely accepted, at least in the experimental economics community, that the higher the (real) stakes, the higher the degree of risk aversion (e.g., Holt & Laury AER 2002, Harrison et al. AER 2005).
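For what it is worth, the risk-neutral benchmark of opening 50 boxes, and the intuition that risk-averse players stop earlier, can be checked in a few lines. This is a minimal sketch that assumes the standard 100-box version of the task, a hypothetical fixed payment per bomb-free box, and CRRA utility; none of these parameter choices are taken from the paper.

```python
import numpy as np

N_BOXES = 100        # assumed number of boxes
PAY_PER_BOX = 1.0    # assumed (hypothetical) payment per bomb-free box

k = np.arange(1, N_BOXES)             # number of boxes opened
p_safe = (N_BOXES - k) / N_BOXES      # probability the bomb is not among them

def crra(x, r):
    """CRRA utility; r = 0 is risk neutrality, larger r is more risk-averse."""
    return x if r == 0 else x ** (1 - r) / (1 - r)

for r in (0.0, 0.5, 0.9):
    # Expected utility: the bomb wipes out earnings, so only the safe branch pays.
    eu = p_safe * crra(k * PAY_PER_BOX, r)
    print(r, k[np.argmax(eu)])        # r=0 -> 50 boxes; higher r -> fewer boxes
```

Under these assumptions the risk-neutral optimum is indeed 50 boxes, and the optimum falls quickly as risk aversion rises, which is the direction one would expect from incentivized human subjects.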

So, are AI chatbots behaviorally similar to humans? I’d argue that this is a misspecified (not to say: silly) question, as the answer depends on the circumstances in which humans find themselves. Unfortunately for much of the behavioral literature, our actions often do have consequences, and unincentivized scenario studies typically are just not good enough to produce results that have external validity.

Consider making your opinion known by applauding, or commenting, or following me here, or on Facebook.
