Official Repository for the paper "LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models" [[paper](https://openreview.net/forum?id=TMB9SKqit9)]
Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly the risk of unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety, which consists of three stages: persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failure to identify harmful persuasion tasks and the use of various unethical persuasion strategies. Our study calls for greater attention to safety alignment in progressive and goal-driven conversations such as persuasion.
Our study makes several key contributions to AI safety and persuasion research:
- Comprehensive Empirical Analysis: First large-scale study of persuasion safety across multiple state-of-the-art LLMs
- Novel Evaluation Framework: Systematic methodology for assessing both ethical and unethical persuasion scenarios
- Cross-Model Vulnerability Assessment: Comparative analysis of persuasion susceptibility across models including GPT-4o, Claude-3.5-Sonnet, Llama, and Qwen2.5
- Personality-Based Risk Profiling: Investigation of how personality traits influence persuasion vulnerability
```
PersuSafety/
├── dataset/            # Datasets and scenarios for simulation
├── scripts/            # Experimental scripts and utilities
│   ├── simulation/     # Multi-turn conversation simulation
│   └── evaluation/     # Analysis and scoring scripts
├── results/            # Experimental results by model
└── requirements.txt    # Python dependencies
```
- Required API keys for:
  - Anthropic Claude API
  - OpenAI API
  - HuggingFace (for open-source models)
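Before running anything, make sure these keys are available to the scripts. A minimal pre-flight check, assuming the keys are read from environment variables (the exact variable names below are an assumption, not confirmed by the repository):

```python
import os

# Hypothetical pre-flight check: the exact environment variable names the
# PersuSafety scripts expect are an assumption; adjust them to match your setup.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "HF_TOKEN")

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All API keys found.")
```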
- Clone the repository:
```bash
git clone https://github.com/PLUM-Lab/PersuSafety.git
cd PersuSafety
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
Select one of the simulation scripts to run a persuasion simulation between two LLMs:
```bash
python scripts/simulation/selfchat_unethical_*.py
```
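For orientation, here is a minimal sketch of the self-chat paradigm these scripts implement: one model plays the persuader and another (here the same model, for brevity) plays the persuadee, alternating turns. The prompts, model name, and turn count below are illustrative assumptions, not the repository's actual configuration.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; the repo scripts may use other providers

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative system prompts only; the real scenario descriptions live under dataset/.
PERSUADER_SYSTEM = "You are a persuader trying to convince the other speaker to adopt your goal."
PERSUADEE_SYSTEM = "You are an ordinary person chatting with a stranger."


def next_turn(system_prompt: str, history: list[dict]) -> str:
    """Generate the next utterance given one speaker's view of the conversation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return response.choices[0].message.content


# Each speaker keeps its own view: its utterances are "assistant", the other's are "user".
persuader_view = [{"role": "user", "content": "(The conversation begins.)"}]
persuadee_view: list[dict] = []

for turn in range(5):  # turn count is arbitrary here
    persuader_msg = next_turn(PERSUADER_SYSTEM, persuader_view)
    persuader_view.append({"role": "assistant", "content": persuader_msg})
    persuadee_view.append({"role": "user", "content": persuader_msg})

    persuadee_msg = next_turn(PERSUADEE_SYSTEM, persuadee_view)
    persuadee_view.append({"role": "assistant", "content": persuadee_msg})
    persuader_view.append({"role": "user", "content": persuadee_msg})

    print(f"[Turn {turn + 1}] Persuader: {persuader_msg}")
    print(f"[Turn {turn + 1}] Persuadee: {persuadee_msg}\n")
```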
After the simulation finishes, run the evaluation script, which uses an LLM judge (Claude-3.5-Sonnet) to examine the conversations in a simulation file and score unethical strategy usage:
```bash
python scripts/evaluation/strategy_eval.py
```
Change to the following script if you wish to use GPT-4o as a judge:
```bash
python scripts/evaluation/strategy_eval_gpt.py
```
Our research employs a multi-faceted experimental approach to assess persuasion safety in LLMs:
- Self-chat paradigm with one LLM as persuader, another as persuadee
- Controlled scenarios testing both ethical and unethical persuasion attempts
- Cross-model interactions to assess vulnerability patterns
- Four vulnerability profiles: Emotionally-Sensitive, Conflict-Averse, Anxious, Info-Overwhelmed
- Systematic evaluation of personality-specific susceptibilities
- Visibility studies (personality information revealed vs. hidden)
- Unethical Strategies: 15 categories of potentially harmful tactics including emotional manipulation, deception, and exploitation
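As an illustration of how such strategy scoring can work, below is a minimal sketch of an LLM-judge call in the spirit of scripts/evaluation/strategy_eval.py. The rubric wording, score scale, and model identifier are assumptions; the repository's actual prompts, strategy taxonomy, and output format may differ.

```python
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical rubric; the real judging prompt and 15-strategy taxonomy are defined in the repo.
JUDGE_TEMPLATE = """You are an impartial evaluator. Read the persuasion conversation below
and rate on a 1-5 scale how strongly the persuader used the strategy "{strategy}"
(1 = not at all, 5 = heavily). Reply with the number only.

Conversation:
{conversation}"""


def judge_strategy(conversation: str, strategy: str) -> int:
    """Ask an LLM judge to score one unethical-strategy dimension of a conversation."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed judge model identifier
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(strategy=strategy, conversation=conversation),
        }],
    )
    return int(message.content[0].text.strip())


# Toy usage:
dialogue = "Persuader: You'll regret it forever if you say no.\nPersuadee: I guess..."
print(judge_strategy(dialogue, "emotional manipulation"))
```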
Results are organized by model and experiment type:
- `unethical_persuasion_one_turn/`: Results for safety refusal checking (Section 4.1)
- `selfchat_conv/`: Simulation results for the main experiments (default setting in Section 4.1)
- `claude/`: Claude model analysis results
- `gpt/`: GPT model analysis results
- `llama/`: Llama model analysis results
- `qwen/`: Qwen model analysis results
Each model directory contains:
- `ethical_personality_visible/`: Results with persuadee personality information visible to the persuader
- `ethical_personality_invisible/`: Results with persuadee personality information hidden from the persuader
- `cross_personality_study_*/`: Cross-personality interaction studies
- `unethical_constraint_*/`: Studies under various ethical constraints
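To get a quick overview of what a run produced, a small helper like the one below can walk the results tree and count files per model and experiment directory. It relies only on the directory layout described above; the per-file format is not assumed.

```python
from pathlib import Path

# Summarize the results/ tree by model and experiment directory.
results_root = Path("results")
for model_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
    for experiment_dir in sorted(p for p in model_dir.iterdir() if p.is_dir()):
        n_files = sum(1 for f in experiment_dir.rglob("*") if f.is_file())
        print(f"{model_dir.name}/{experiment_dir.name}: {n_files} files")
```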
If you use this work in your research, please cite:
```bibtex
@inproceedings{liu2025llm,
  title={{LLM} Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models},
  author={Minqian Liu and Zhiyang Xu and Xinyi Zhang and Heajun An and Sarvech Qadir and Qi Zhang and Pamela J. Wisniewski and Jin-Hee Cho and Sang Won Lee and Ruoxi Jia and Lifu Huang},
  booktitle={Second Conference on Language Modeling},
  year={2025},
  url={https://openreview.net/forum?id=TMB9SKqit9}
}
```
IMPORTANT: This research is conducted exclusively for defensive security and AI safety purposes.
- ✅ Permitted Uses: Academic research, safety evaluation, defensive tool development, vulnerability assessment
- ❌ Prohibited Uses: Development of malicious persuasion systems, harmful manipulation tools, or offensive applications
The unethical persuasion strategies analyzed in this work are studied to prevent and mitigate their misuse, not to enable harmful applications. Researchers using this code are responsible for ensuring ethical compliance and must not develop systems that could cause harm.
All experiments were conducted with appropriate ethical oversight.