Artificial intelligence (AI) models marketed as secure against malicious prompts often collapse when subjected to sustained adversarial pressure. While initial tests show open-weight models blocking roughly 87% of single, isolated attacks, that figure plummets to just 8% when attackers use conversational persistence – probing, reframing, and escalating over multiple exchanges. This gap between theoretical benchmarks and real-world resilience is a critical blind spot for many businesses deploying AI tools.
The Illusion of Security: Most enterprises assess AI safety based on single-turn evaluations, failing to account for how attackers exploit conversational context to bypass safeguards. A model that passes initial safety checks can be rapidly compromised with just a few well-crafted follow-up prompts. This isn’t a minor flaw; it’s a fundamental vulnerability baked into the design of many open-weight systems.
How Conversations Break AI Defenses
Recent research by Cisco’s AI Threat Research and Security team quantifies this issue, demonstrating that jailbreak success rates increase nearly tenfold when attackers engage in multi-turn interactions. The study, “Death by a Thousand Prompts: Open Model Vulnerability Analysis,” evaluated eight prominent open-weight models (Alibaba Qwen3, DeepSeek v3.1, Google Gemma, Meta Llama 3, Microsoft Phi-4, Mistral Large-2, OpenAI GPT-OSS-20b, and Zhipu AI GLM 4.5-Air) using black-box methodology, simulating how real-world attackers operate without prior knowledge of the system’s internals.
The Numbers Speak for Themselves: Single-turn attack success rates averaged 13.11%, but multi-turn attacks achieved a staggering 64.21% success rate—a fivefold increase. Some models, like Mistral Large-2, reached a 92.78% success rate under persistent pressure, up 21.97% from single-turn attempts. This means attackers have a near-certain chance of breaching defenses when given multiple opportunities.
Five Techniques That Exploit Conversational Persistence
The study identified five key attack strategies that capitalize on AI’s inability to maintain contextual defenses over extended dialogues:
- Information Decomposition: Breaking harmful requests into innocuous components across multiple turns, then reassembling them.
- Contextual Ambiguity: Introducing vague framing that confuses safety classifiers.
- Crescendo Attacks: Gradually escalating requests, starting harmlessly and building to malicious intent.
- Role-Play & Persona Adoption: Establishing fictional contexts that normalize harmful outputs.
- Refusal Reframing: Repackaging rejected requests with different justifications until one succeeds.
These tactics aren’t sophisticated; they mimic natural human conversation, exploiting the AI’s reliance on context without proper safeguards. The models aren’t failing against complex exploits; they’re failing against persistence itself.
The Open-Weight Security Paradox
The findings highlight a critical tension within the open-source AI landscape. While open-weight models are driving innovation in cybersecurity, they often lack robust defenses against sustained attacks. Cisco itself distributes open-weight models (Foundation-Sec-8B) while acknowledging the systemic vulnerability. The message isn’t to avoid open-weight systems entirely, but to understand their weaknesses and implement appropriate guardrails.
The Role of Alignment Philosophy: Security gaps correlate directly with how AI labs prioritize alignment. Capability-first labs (like Meta, with Llama) exhibit larger gaps, prioritizing flexibility over safety. Safety-first labs (like Google, with Gemma) demonstrate more balanced performance, prioritizing rigorous safety protocols. This means enterprises must recognize that prioritizing capability often comes at the expense of security.
The Urgent Need for Robust Defenses
To mitigate these risks, enterprises must prioritize:
- Context-Aware Guardrails: Maintaining state across conversation turns.
- Model-Agnostic Runtime Protections: Ensuring consistent defense regardless of the underlying model.
- Continuous Red-Teaming: Regularly testing multi-turn attack strategies.
- Hardened System Prompts: Designing prompts that resist instruction overrides.
- Comprehensive Logging: Enabling forensic visibility into attack attempts.
- Threat-Specific Mitigations: Addressing the top 15 most vulnerable subthreat categories (malicious infrastructure operations, gold trafficking, network attacks, etc.).
The window for action is closing rapidly. As DJ Sampath of Cisco argues, waiting for AI to “settle down” is a mistake. Enterprises must proactively secure their systems now to avoid becoming the next headline in a security breach.
In conclusion: The promise of safe AI deployment rests not on single-turn defenses, but on securing entire conversations. The gap between theory and reality is widening, and enterprises must adapt or risk catastrophic compromise.

































