AI models pass Turing test, deemed more human than humans
One of the most compelling aspects of the study was the critical role of prompting. Without tailored instructions to simulate a believable human persona, neither GPT-4.5 nor LLaMa could consistently fool interrogators. The simple addition of a persona, describing tone, interests, and linguistic style, transformed performance dramatically. For example, GPT-4.5’s win rate dropped to just 28% without the persona prompt, aligning with previous Turing test failures.
What happens when machines sound more human than we do? In a newly published study, researchers from UC San Diego conducted a large-scale, randomized, and pre-registered evaluation of four AI systems in a rigorous three-party Turing test format. In head-to-head matchups against real humans, these AI models were challenged to convince human interrogators that they, too, were human. The results mark a critical turning point: one of these systems not only passed the test, but was judged to be more convincingly human than actual people.
Published on arXiv, the findings of this study “Large Language Models Pass the Turing Test” provide the strongest empirical evidence to date that artificial systems can imitate human conversation at a level indistinguishable, and even superior, to that of actual humans.
Can artificial intelligence really pass as human in live interaction?
The study tested four AI systems: ELIZA (a 1960s rule-based chatbot), GPT-4o, LLaMa-3.1-405B, and GPT-4.5. Each AI was evaluated in the original, rigorous Turing test format - a five-minute, text-only conversation where human interrogators simultaneously converse with both a real human and an AI, then decide which of the two is human. Participants were drawn from two distinct populations: UCSD undergraduates and paid online workers via Prolific.
The results were astonishing. When GPT-4.5 was prompted to adopt a humanlike persona, specifically a young, internet-savvy introvert, it was judged to be human 73% of the time. In other words, it outperformed real human participants in persuading interrogators of its humanity. This win rate was not only statistically above chance but higher than any previously recorded AI performance in a classical Turing test.
LLaMa-3.1, another advanced language model using the same persona prompt, was also successful in one of the two studies, scoring a 65% win rate with Prolific participants but failing to outperform humans in the undergraduate sample. In contrast, control models like ELIZA and GPT-4o (without a persona) fared poorly, being identified as non-human in over 75% of cases.
How did AI deceive humans and why did the prompt matter so much?
One of the most compelling aspects of the study was the critical role of prompting. Without tailored instructions to simulate a believable human persona, neither GPT-4.5 nor LLaMa could consistently fool interrogators. The simple addition of a persona, describing tone, interests, and linguistic style, transformed performance dramatically. For example, GPT-4.5’s win rate dropped to just 28% without the persona prompt, aligning with previous Turing test failures.
This distinction is key. The Turing test, at its core, is a measure of indistinguishability. It does not require intelligence in a philosophical sense - only that a machine can produce behavior sufficiently humanlike to fool people. With persona prompts, GPT-4.5 didn't just appear plausible; it appeared more human than its human counterparts in the majority of trials.
Participants used a variety of interrogation strategies to discern humanity. Most relied on small talk, personal questions, and social-emotional cues rather than logic or factual knowledge. Interestingly, interrogators who employed unusual or unpredictable language, or used “jailbreak” strategies (e.g., asking the AI to ignore previous instructions), had higher accuracy rates. However, these tactics were rare.
Ultimately, the most cited reasons for believing an AI was human involved conversational flow, emotional expression, and linguistic nuance - ironically, the very domains often considered exclusive to human intelligence.
What are the broader implications of machines passing as people?
The study’s authors argue that while the results are technically impressive, they also raise profound ethical and societal questions. If machines can now consistently pass as people, what does that mean for trust, labor, and social interaction?
This development introduces what philosopher Daniel Dennett has called “counterfeit people” - systems that look, sound, and behave like humans without actually being conscious. These counterfeit people could be deployed in customer service, online dating, education, political influence, or even misinformation campaigns. Their very indistinguishability poses a new kind of threat: not one of brute force or malicious intent, but of silent substitution.
The researchers warn that people may increasingly form relationships, hold conversations, and make decisions based on interactions with AI they believe are human. Just as counterfeit money erodes the value of real currency, these interactions could devalue authentic human connection, especially if users are unaware they are speaking to a machine.
Moreover, the study found that demographic variables like age, education, and chatbot experience had minimal impact on people’s ability to detect AI. Even those who interact with chatbots daily or who conduct AI research were no better than chance at identifying real humans over GPT-4.5. This suggests that deception by AI is not just plausible - it is already a functional reality across a wide population.
A new era of AI realism - where do we go from here?
The results mark a historic milestone: the first time any AI system has demonstrably passed a classical three-party Turing test in peer-reviewed conditions. GPT-4.5's success challenges long-held assumptions about the limitations of machines and shifts the boundary of what we consider to be “intelligent” behavior.
Yet the researchers are clear-eyed about what the Turing test actually measures. It is not, and was never meant to be, a direct test of intelligence or consciousness. Instead, it is a test of humanlike appearance - an imitation game in which success depends on generating plausible belief in humanity, not replicating the human mind.
In fact, the very reason GPT-4.5 succeeds, its flexibility in adopting personas, its fluency in language, and its skill in social mimicry, makes it especially suited for roles that require deception or emotional manipulation. As such, the researchers advocate for stronger transparency measures, public education, and potentially, watermarking systems to identify machine-generated communication.
Future Turing test research could explore lengthening conversation times, increasing participant diversity, and testing whether trained experts can still be fooled. Another proposal is designing more adversarial environments where the machine must respond to high-pressure or morally complex questions - scenarios that may better reveal its limitations.
- FIRST PUBLISHED IN:
- Devdiscourse

