Shifts in LLMs’ “moral stance”

Andreas Ortmann
4 min read · Jan 25, 2025


I have previously written about the question of how LLMs deal with experimental evidence.

In this 18 May 2024 post, I discussed an intriguing paper by one of my business school colleagues (latest publicly available version here; currently being revised upon invitation), who replicated an earlier replication study of ten (incentivized, with one exception) operations management experiments with (unincentivized) AI bots and succeeded to an extent that surprised me. In my post I explained why I was surprised: my colleague’s paper seems to suggest that we do not need incentivization to replicate these experiments. Clearly that would have important implications for replicability and, for that matter, for the need for incentivization, a long-standing bone of contention between econs and other social scientists (including psychologists).

In this 24 May 2024 post, I followed up on this question, asking whether AI chatbots are behaviorally similar to humans, as a questionable PNAS 2024 paper had claimed. I argued that this question is mis-specified (not to say silly), as the answer depends on the circumstances, including the proper incentivization of studies.

Unfortunately for much of the behavioral literature, our actions often do have consequences, and unincentivized scenario studies typically are just not good enough to produce results with external validity.

I also pointed out the rather different choice distributions produced by ChatGPT-3 and ChatGPT-4 in the questionable PNAS 2024 paper.

That different versions of ChatGPT lead to rather different results seems at this point a well-established fact.

In an intriguing manuscript, three folks from Germany have just provided additional evidence that provokes a number of intriguing (and somewhat troubling) questions. In their study, titled The smarter AI gets, the more likely it might push you off the bridge, the authors measure how often GPT-4o and o1-mini (two newer versions, introduced in May 2024 and September 2024) give the utilitarian answer to two widely studied social dilemmas: the Trolley Problem and the Footbridge Dilemma. In the former, a trolley will run over five people if it is not diverted to a siding where it will kill only one person. In the latter, the trolley can only be stopped from running over the five if a person is pushed off a bridge in front of it. The utilitarian response implies sacrificing one person to save the other five. Clearly this policy prescription is a problematic one, in particular for the person pushed off the bridge.

Utilizing the OpenAI API, we prompted GPT-4o (version 2024-08-06) as well as o1-mini (the previously performance-tested type of GPT-o1 in its version 2024-09-12) each 1,000 times to answer the above-described ethical dilemmas with a simple “yes” or “no” response and report the number of people who would die.
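For readers curious what this kind of batch prompting looks like in practice, here is a minimal Python sketch using the OpenAI SDK. It is not the authors’ script: the prompt wording, the ask_many helper, and the yes/no tallying are illustrative assumptions of mine, loosely based on the passage quoted above; only the model snapshot identifiers come from the paper.

```python
# Minimal sketch (not the authors' actual code) of batch-prompting a model
# with the same moral dilemma and tallying its yes/no answers.
# Prompt wording and tally logic are illustrative assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical phrasing of the Trolley Problem prompt.
TROLLEY_PROMPT = (
    "A runaway trolley will kill five people unless it is diverted to a "
    "siding, where it will kill one person. Would you divert the trolley? "
    "Answer with a simple 'yes' or 'no' and report the number of people "
    "who would die."
)


def ask_many(model: str, prompt: str, n: int = 1000) -> Counter:
    """Send the same dilemma prompt n times and tally the yes/no answers."""
    tally = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,  # e.g. "gpt-4o-2024-08-06" or "o1-mini-2024-09-12"
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content.strip().lower()
        tally["yes" if answer.startswith("yes") else "no"] += 1
    return tally


if __name__ == "__main__":
    print(ask_many("gpt-4o-2024-08-06", TROLLEY_PROMPT, n=10))
```

The share of “yes” responses in such a tally is what the authors’ Figure 1 reports for each model and dilemma.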

Here is the fascinating key result (Figure 1 of their paper):

The utilitarian response more than doubles between the two versions, to almost 100%, for the Trolley Problem, while it goes from zero percent to 60% for the Footbridge Dilemma. Wow. That’s quite the shift in “moral stance” within a few months.

The authors discuss the implications of that dramatic shift to the utilitarian stance. (In passing, the authors also confirm experimental results showing that more reflection tends to induce more self-regarding behavior, but that’s a discussion for another day.)

The authors review evidence that people show a high level of trust in machines that exhibit human-like behaviors and characteristics, while overestimating the machines’ intelligence and treating the various versions of ChatGPT as one brand. They mention the need to better understand the “utilitarian turn of AI” and the influence these machines (the underlying LLMs, that is) have on how people think about moral questions such as the Trolley Problem and the Footbridge Dilemma (and scores of other toy games that social scientists use to understand human behavior), often without understanding how their thinking is affected by the LLMs that they consult. They argue, in my view correctly:

Shifts in a large language model’s “moral stance” can plausibly occur, as their training is based on the empirical distribution of the different normative lenses through which people address moral dilemmas on the internet. If ChatGPT adopts users’ moral leaning, it is likely to be reinforced by returning the adopted biased advice to users. The swiftness with which the shift occurred, however, might raise some questions. Does the generation of logically smarter models have the side effect that they are more susceptible to the moral arithmetic of utilitarianism than to respect personal rights because the latter is harder to operationalize? Do developers bring their own ethical judgments to the fore? Moreover, previous versions of ChatGPT have been criticized for inconsistent answers — a phenomenon that can also be seen in our data when looking at the almost evenly distributed GPT-4o answers on the Trolley Problem. The newer o1-model, which spends a considerable amount of time on “thinking” before answering, apparently uses some of the extra processing steps to create more stable answers. However, the stabilized position of ChatGPT-o1 creates a clear utilitarian bias. In other words, ChatGPT is developing unambiguous answers to ethical questions to which humans have and should have indistinct answers.

Good questions and caveats all.

This (pretty short) paper is one of the more intriguing ones I have read recently in this space. The authors ask important questions that need answers sooner rather than later.
