Prompt-hacking with ChatGPT? AI tools pose new integrity challenges in academic research

CO-EDP, VisionRI | Updated: 23-04-2025 18:04 IST | Created: 23-04-2025 18:04 IST

Large language models (LLMs) like ChatGPT are becoming deeply embedded in academic research workflows, and with their rise comes a new threat to scientific integrity - prompt-hacking. In a new opinion paper titled “Prompt-Hacking: The New p-Hacking?”, published on arXiv, authors Thomas Kosch and Sebastian Feger argue that prompt-hacking, the strategic manipulation of LLM inputs to elicit desired outputs, may mirror the controversial and discredited practice of p-hacking in statistical research.

While the capabilities of LLMs have attracted interest across disciplines, this study underscores a looming crisis: the quiet erosion of research validity and reproducibility through unregulated AI use. Drawing parallels with p-hacking, where researchers tweak statistical parameters to obtain significant results, the paper contends that prompt-hacking similarly enables outcome manipulation, albeit through different technical means. With no standard practices yet in place to regulate how prompts are constructed, tested, or reported, the authors caution that science may be entering a new era of unverifiable, model-generated evidence.

How does prompt-hacking mirror traditional p-hacking?

The concept of p-hacking is well established in scientific discourse. It refers to manipulating data collection or statistical analysis until results appear significant - often by collecting additional data until significance emerges, changing hypotheses post hoc, or selectively reporting outcomes. This practice has contributed to the replication crisis across multiple fields. Prompt-hacking, according to Kosch and Feger, follows a similar logic: researchers modify LLM prompts repeatedly until the model returns outputs that confirm a preferred hypothesis.

Unlike statistical tools, which are grounded in rigorous mathematical frameworks and can be audited, LLMs are inherently non-deterministic. A slight change in prompt phrasing can yield dramatically different results - even when using the same model. This variability, coupled with the models’ opaqueness and dependency on vast, uncurated training data, makes them fundamentally unsuitable for rigorous data analysis without extensive oversight.
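To make that sensitivity concrete, the minimal Python sketch below sends the same analytical question, phrased three slightly different ways, to the same model and compares the answers. The model name, the prompt wordings, and the use of the OpenAI client are illustrative assumptions; the paper itself contains no code.

```python
# A minimal sketch of the variability problem: one analytical question,
# phrased three slightly different ways, sent to the same model.
# The model name and the paraphrases are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paraphrases = [
    "Does this dataset show a positive effect of the intervention? Answer yes or no.",
    "Would you say the intervention had a positive effect here? Answer yes or no.",
    "Is there evidence of a positive intervention effect in this data? Answer yes or no.",
]

answers = []
for prompt in paraphrases:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute whichever model is in use
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # even at temperature 0, outputs are not guaranteed identical
    )
    answers.append(response.choices[0].message.content.strip())

# If the answers disagree, the "finding" depends on wording, not on the data.
print(answers)
```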

The authors introduce the term PARKing (Prompt Adjustments to Reach Known outcomes), inspired by HARKing (Hypothesizing After Results Are Known), to describe the emerging trend of systematically altering prompts to obtain expected outcomes. This process, they warn, may not always be intentional but can still distort findings. By adjusting prompts until the AI output aligns with preconceived hypotheses, researchers risk presenting artificially coherent results, bypassing the rigor expected in scientific workflows.
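A deliberately simplified sketch of what PARKing can look like in practice is shown below; the loop, the prompt variants, and the crude "confirmation" check are hypothetical and serve only to illustrate the anti-pattern the authors describe, not to recommend it.

```python
# Illustrative anti-pattern (not a recommendation): keep rewriting the prompt
# until the model happens to agree with the preferred hypothesis, then report
# only the "successful" prompt. All prompts and checks here are hypothetical.
from openai import OpenAI

client = OpenAI()

candidate_prompts = [
    "Summarise whether the data supports hypothesis H1.",
    "Explain why the data supports hypothesis H1.",
    "Given that H1 is plausible, assess the data's support for it.",
    # ... further tweaks, each nudging the model toward confirmation
]

for attempt, prompt in enumerate(candidate_prompts, start=1):
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    if "supports" in reply.lower():  # crude check for the desired outcome
        print(f"'Confirmed' after {attempt} prompt(s); earlier attempts go unreported.")
        break
```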

Why are LLMs problematic tools for data analysis?

The authors acknowledge that LLMs can accelerate certain aspects of research, particularly in drafting, brainstorming, or generating synthetic data for simulations. However, they argue strongly against their use in empirical data analysis. The reasoning is rooted in several foundational flaws: LLMs are trained on large but biased datasets, lack an understanding of context, and produce outputs that are not consistently replicable.

In empirical research, reproducibility and impartiality are non-negotiable. Traditional quantitative and qualitative methods require transparent data pipelines and clearly defined analytical steps. In contrast, LLMs produce results based on probabilistic token predictions, with outputs sensitive to minor linguistic changes and internal model fluctuations. This behavior, the authors argue, makes them incompatible with any role requiring rigorous hypothesis testing or data interpretation.

Moreover, most researchers are not LLM experts. They may not fully understand how prompt phrasing, model versioning, or context windows influence output. This knowledge gap creates a fertile environment for unintentional misuse. Even when researchers attempt to document their prompts, the effectively unlimited space of possible variations and the absence of standardized protocols mean that reproducing exact results is nearly impossible.

The paper highlights past studies where researchers have simulated human subject responses using LLMs or used them to analyze large qualitative datasets. While innovative, these applications remain highly controversial due to their reliance on opaque models and uncontrolled variation. The concern is not that LLMs will replace human analysts entirely, but that their misuse will increasingly pollute the body of peer-reviewed literature with unverifiable, AI-generated claims.

What steps are needed to prevent an AI-driven reproducibility crisis?

To avoid a full-blown reproducibility crisis fueled by LLMs, the authors propose a clear directive: don’t use LLMs for data analysis unless absolutely necessary and justifiable. Where they must be used, their application should be subjected to rigorous ethical and methodological scrutiny.

The study recommends multiple layers of safeguards. First, researchers should evaluate whether traditional methods can perform the task more reliably. If LLMs offer no clear advantage or introduce greater risk, they should be avoided. Second, prompt use should be standardized. This includes preregistration of prompt sequences, transparent documentation of modifications, and explicit reporting of failed prompt attempts.
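As a rough illustration of what such documentation could involve, the sketch below logs every prompt attempt, including unsuccessful ones, together with the model identifier and sampling parameters. The field names, file format, and helper function are assumptions rather than any established standard.

```python
# A minimal sketch of a prompt record along the lines the safeguards imply:
# every prompt attempt (including failed ones) is appended to a local registry
# with the model and parameters used. Field names and format are assumptions.
import hashlib
import json
from datetime import datetime, timezone

def log_prompt_attempt(prompt, model, temperature, output,
                       registry_path="prompt_registry.jsonl"):
    """Append one prompt attempt to a local JSON Lines registry (hypothetical helper)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,            # exact model/version string reported by the API
        "temperature": temperature,
        "prompt": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output": output,
    }
    with open(registry_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage with placeholder values:
log_prompt_attempt(
    prompt="Classify each interview excerpt as positive, negative, or neutral.",
    model="gpt-4o-2024-08-06",  # assumed version string
    temperature=0.0,
    output="(model output would go here)",
)
```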

Ethical implications must also be front and center. LLMs carry the biases of their training data, which may reflect systemic inequalities or cultural stereotypes. Using such models in sensitive analyses risks reproducing and amplifying those biases in published findings. Researchers are urged to interrogate not only whether their models are accurate, but whether they are ethically aligned with the goals of scientific inquiry.

The infrastructure supporting scientific publication must also adapt. Journals, funding agencies, and data repositories should consider mandating prompt transparency and model version disclosure. Tools like Zenodo and the Center for Open Science could be expanded to accommodate prompt registries and metadata on model configurations. These interventions won’t eliminate the risks of LLM misuse, but they will introduce accountability and help restore trust.

FIRST PUBLISHED IN: Devdiscourse