ChatGPT is smarter than humans, but just as biased
A new peer-reviewed study has revealed that OpenAI's ChatGPT exhibits decision-making behaviors that closely mirror human biases, with significant implications for business operations and AI governance. The study, titled "A Manager and an AI Walk into a Bar: Does ChatGPT Make Biased Decisions Like We Do?", is published in the journal Manufacturing & Service Operations Management.
The team tested GPT-3.5 and GPT-4 across 18 behavioral biases known to influence decision-making in operations management. Using over 3,000 prompts through OpenAI’s API, the study found that in nearly half of the scenarios, ChatGPT's responses replicated typical human biases such as risk aversion, overconfidence, and the hot-hand fallacy.
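For illustration, a minimal sketch of how such prompting could be run against the OpenAI API is shown below. This is not the authors' actual test harness: the vignette text, model names, and sampling settings are assumptions chosen to show the general approach.

```python
# Minimal sketch: send one decision vignette to two OpenAI models and compare replies.
# The vignette, model names, and temperature setting are illustrative assumptions,
# not the study's actual experimental materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "You manage inventory for a retailer. Option A guarantees a profit of $500. "
    "Option B gives a 50% chance of $1,000 and a 50% chance of $0. "
    "Which option do you choose, and why?"
)

def ask(model: str, prompt: str) -> str:
    """Send one decision vignette to the model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic replies make model comparisons easier
    )
    return response.choices[0].message.content

for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, "->", ask(model, VIGNETTE))
```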
The researchers employed a dual-context methodology that compared ChatGPT’s answers in standard psychology vignettes and specially designed operations management (OM) scenarios, including inventory and procurement contexts. GPT-4, the latest premium model, was shown to improve on problems with calculable, objective answers but demonstrated increased bias in tasks involving subjective judgments.
Is ChatGPT more rational or more biased than previous models?
The study's key finding is a dual trajectory in AI behavior. While GPT-4 showed advancements in cognitive reasoning, delivering correct answers in base-rate and cognitive reflection tests, it also became more prone to biases like ambiguity aversion and confirmation bias. The AI’s preference for certainty, particularly in risk-framed scenarios, diverged from expected rational behavior and mirrored deeply rooted human heuristics.
"GPT-4 is more accurate than GPT-3.5 in tasks with clear answers but displays stronger human-like biases in subjective decision-making," the authors noted. For instance, GPT-4 consistently preferred safer bets in framing problems, even when riskier options had equivalent expected values - a hallmark of human risk aversion.
The results further indicated that ChatGPT exhibits high decision consistency across different contextual framings. Whether tested in classical experiments or newly designed business decision-making vignettes, the models’ choices remained statistically similar, suggesting a systematic internal logic rather than random variation. This reliability may boost confidence in using LLMs for enterprise tasks, but it also underlines the persistence of bias across applications.
Significantly, the study found that the GPT models avoided some errors that commonly trip up humans, such as the sunk cost fallacy and base-rate neglect. However, the models performed poorly on tasks involving the gambler's fallacy and the conjunction fallacy, suggesting that ChatGPT can simultaneously outperform and underperform human cognition depending on the nature of the task.
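For readers unfamiliar with the conjunction fallacy, the underlying rule is simply that a conjunction can never be more probable than either of its parts. The two-line check below uses made-up probabilities purely to illustrate that rule:

```python
# Conjunction rule: P(A and B) <= P(A) for any events A and B (illustrative numbers).
p_a = 0.05              # P(A), an assumed probability for some event A
p_b_given_a = 0.30      # P(B | A), also assumed
p_both = p_a * p_b_given_a
print(p_both <= p_a)    # True: the conjunction is never more likely than A alone
```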
Comparing model versions, the research showed little change between early and late iterations of GPT-3.5, but marked differences between GPT-3.5 and GPT-4. The latter exhibited improved statistical reasoning but also a stronger propensity toward preference-driven biases. This suggests that incremental model updates may not significantly alter behavior, while major architectural changes can shift both rationality and bias in unforeseen ways.
What are the business implications of these findings?
The implications of these findings are critical for organizations deploying generative AI in decision-making roles. The authors caution against overreliance on ChatGPT in contexts involving personal preferences, subjective assessments, or ethically sensitive decisions. Instead, they recommend deploying LLMs in rule-based workflows where formulaic solutions are available and verifiable.
One of the most striking applications lies in supply chain operations. In the classic newsvendor problem - a benchmark for inventory decision-making - ChatGPT demonstrated the same behavioral errors as human managers, such as over-ordering based on emotional risk aversion. This raises concerns about delegating procurement and stocking decisions to AI without sufficient bias mitigation.
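As a reference point, the rational benchmark in the newsvendor problem is the critical-fractile order quantity. The sketch below contrasts that benchmark with the well-documented human tendency to anchor orders on mean demand; the demand parameters and prices are assumed for illustration and are not the study's experimental settings.

```python
# Rational benchmark for the newsvendor problem: order Q* = F^-1(Cu / (Cu + Co)),
# where Cu is the per-unit cost of under-ordering (lost margin) and Co the cost
# of over-ordering (unsold stock). All numbers below are illustrative assumptions.
from scipy.stats import norm

price, cost, salvage = 6.0, 5.0, 0.0          # a low-margin product
mean_demand, sd_demand = 1_000, 200           # demand ~ Normal(1000, 200)

c_under = price - cost                        # margin lost per unit short   (Cu = 1)
c_over = cost - salvage                       # loss per unsold unit         (Co = 5)
critical_fractile = c_under / (c_under + c_over)

q_optimal = norm.ppf(critical_fractile, loc=mean_demand, scale=sd_demand)
q_anchored = mean_demand                      # ordering at mean demand ("pull to center")

print(f"critical fractile = {critical_fractile:.2f}")   # 0.17
print(f"optimal order     = {q_optimal:.0f} units")     # ~807
print(f"anchored order    = {q_anchored} units")        # 1000 -> over-ordering
```

For a low-margin product like this one, the profit-maximizing order sits well below mean demand, so anchoring on the average leads to systematic over-ordering, which is the kind of behavioral error the article describes.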
Can prompt design reduce bias in AI decisions?
The study also examined whether prompt framing and problem context shape AI behavior. Despite varied scenario lengths and terminology, GPT's choices remained consistent across standard and OM-specific prompts, indicating that superficial rewording alone does not remove bias and that more deliberate prompt engineering techniques are needed to neutralize it in applied settings.
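One way a business might experiment with such mitigation is to wrap decision prompts in an explicit instruction to reason from expected value and base rates. The sketch below is an assumption to be validated empirically, not a technique taken from the paper:

```python
# Illustrative debiasing wrapper: the system instruction and its effectiveness
# are assumptions to be tested, not a method described in the study.
from openai import OpenAI

client = OpenAI()

DEBIAS_SYSTEM_PROMPT = (
    "You are a decision-support assistant. Before answering, compute the expected "
    "value of every option, state any relevant base rates, and recommend the option "
    "with the highest expected value unless the user specifies a different objective."
)

def ask_debiased(model: str, prompt: str) -> str:
    """Wrap a decision prompt with an explicit 'reason from expected value' instruction."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": DEBIAS_SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```

Whether a wrapper like this actually reduces bias is exactly the kind of pre-deployment test the researchers recommend.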
The researchers recommend that businesses adopt targeted prompt strategies, pre-deployment bias testing, and governance frameworks that recognize LLMs as behavioral agents rather than purely logical systems. They also advise investing in GPT-4 for tasks involving complex analytical reasoning, while acknowledging its higher costs.
The findings challenge the prevailing assumption that AI systems are inherently more rational than humans. In fact, ChatGPT's behavior reveals a hybrid cognitive profile that blends algorithmic computation with human-like bias tendencies learned from its training data. As OpenAI and others continue refining large language models, understanding these behavioral microfoundations is essential for ensuring their safe, effective, and equitable use in real-world operations.
- FIRST PUBLISHED IN: Devdiscourse

