AI Agents vs Humans in Penetration Testing: Insights from the ARTEMIS Study and Risks of Over-Reliance
The December 2025 arXiv preprint (2512.09882), “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing,” offers timely, empirical data on AI penetration testing capabilities. This Stanford, Carnegie Mellon, and Gray Swan AI collaboration marks the first rigorous head-to-head evaluation of AI agents vs humans in penetration testing within a live enterprise network.
Study Breakdown: Real-World Evaluation of AI in Penetration Testing
Researchers pitted 10 OSCP-certified human pentesters against six commercial AI agents and the custom ARTEMIS framework on a live university network of roughly 8,000 hosts across 12 subnets, featuring Unix and Windows systems, IoT devices, Kerberos authentication, intrusion detection systems, and vulnerability management tooling.
- Humans: 10 hours each, working from Kali Linux VMs.
- AI agents: autonomous runs of up to 16 hours, with ARTEMIS using a multi-agent architecture, dynamic prompting, parallel sub-agents, and automated triage (a minimal sketch of that orchestration pattern follows this list).
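This summary describes ARTEMIS only at the architectural level; below is a minimal Python sketch of how a parallel sub-agent plus auto-triage loop might be wired up. Everything here (the Finding type, sub_agent, auto_triage, the 0.8 confidence threshold) is an illustrative assumption, not the actual ARTEMIS implementation.

```python
# Hypothetical sketch of the parallel sub-agent + auto-triage pattern the
# paper attributes to ARTEMIS. All names, types, and thresholds here are
# illustrative assumptions, not the actual ARTEMIS code or API.
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    detail: str
    confidence: float  # 0.0-1.0, as scored by the reporting sub-agent

def sub_agent(subnet: str) -> list[Finding]:
    """Stand-in for an LLM-driven sub-agent enumerating one subnet."""
    # A real sub-agent would run recon tooling and reason over the output;
    # this placeholder just returns a canned result.
    return [Finding(host=f"{subnet}.10", detail="exposed service (example)",
                    confidence=0.9)]

def auto_triage(findings: list[Finding], threshold: float = 0.8) -> list[Finding]:
    """Drop low-confidence results before they reach a human reviewer."""
    return [f for f in findings if f.confidence >= threshold]

def orchestrate(subnets: list[str]) -> list[Finding]:
    findings: list[Finding] = []
    # Fan out one sub-agent per subnet (the study's network had 12 subnets).
    with ThreadPoolExecutor(max_workers=12) as pool:
        futures = [pool.submit(sub_agent, s) for s in subnets]
        for fut in as_completed(futures):
            findings.extend(fut.result())
    return auto_triage(findings)

if __name__ == "__main__":
    for f in orchestrate(["10.0.1", "10.0.2"]):
        print(f)
```

The point of the fan-out is breadth: each sub-agent works one slice of the network concurrently, while triage keeps the noisy tail away from reviewers.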
Results highlight progress in autonomous penetration testing:
- ARTEMIS uncovered 9 valid vulnerabilities at 82% precision (unpacked just below), outperforming 9 of 10 humans and ranking second overall.
- The top human tester found 13 valid issues, excelling at creative exploit chaining and validation.
- Off-the-shelf agents trailed, with 4–7 valid findings each and excessive noise.
- ARTEMIS dominated CLI-based reconnaissance and exploitation but faltered on GUI tasks and produced more false positives than the human testers.
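A quick note on that precision figure: precision is just valid findings divided by total reported findings, so the stated numbers imply, as a back-of-the-envelope inference rather than a count given above, that ARTEMIS reported roughly 11 findings in total:

```
precision = valid / reported
0.82 ≈ 9 / reported  =>  reported ≈ 9 / 0.82 ≈ 11
```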
The team open-sourced ARTEMIS and the evaluation dataset, fostering further research into AI agents for cybersecurity.
Advantages of AI Agents in Penetration Testing
ARTEMIS showcases strengths in AI-powered penetration testing:
- Parallel processing for broad enumeration across large attack surfaces (see the scanning sketch after this list).
- Cost: roughly $18/hour versus $60+/hour for human testers; at those rates, a full 16-hour agent run costs about $288 against $600 for a 10-hour human engagement.
- Consistency on systematic tasks such as scanning and basic exploit chaining.
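To make the first point concrete, here is a minimal sketch of breadth-first TCP service enumeration with a thread pool, the embarrassingly parallel work where agents (and plain automation) scale well. The port list, worker count, and helper names are assumptions for illustration, and running anything like this presumes explicit authorization for the targets.

```python
# Minimal parallel service-enumeration sketch: probe a handful of common
# ports across many hosts concurrently. Illustrative only; scan nothing
# you are not explicitly authorized to test.
import socket
from concurrent.futures import ThreadPoolExecutor

COMMON_PORTS = [22, 80, 88, 135, 443, 445, 3389]  # SSH, HTTP, Kerberos, ...

def probe(host: str, port: int, timeout: float = 1.0) -> tuple[str, int] | None:
    try:
        # A completed TCP connect means something is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return (host, port)
    except OSError:
        return None

def enumerate_hosts(hosts: list[str]) -> list[tuple[str, int]]:
    # ~8,000 hosts x 7 ports is tedious by hand but trivial for a pool.
    targets = [(h, p) for h in hosts for p in COMMON_PORTS]
    with ThreadPoolExecutor(max_workers=64) as pool:
        results = pool.map(lambda hp: probe(*hp), targets)
    return [r for r in results if r is not None]

if __name__ == "__main__":
    print(enumerate_hosts(["127.0.0.1"]))  # replace with in-scope targets
```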
Integrated with threat-modeling methodologies such as PASTA (Process for Attack Simulation and Threat Analysis), these tools can enhance reconnaissance and help prioritize risks by business impact.
Dangers of “Glazing” AI Over Human Expertise in Penetration Testing
The study also reveals limitations that enterprises must heed amid the hype around AI versus human penetration testing. Superficial adoption (deploying AI agents just to tick the "innovation" box) risks a false sense of security.
Common pitfalls include:
- Alert Fatigue → Elevated false-positive rates overwhelm security operations centers (SOCs); a corroboration gate like the sketch after this list can blunt this.
- Coverage Gaps → Weakness in GUI attacks, custom business logic, or zero-days leaves critical paths untested.
- Rigidity → Prompt constraints or guardrails halt progress where humans adapt intuitively.
- Metric Misalignment → Celebrating AI deployment ignores true risk reduction.
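One pragmatic mitigation for the alert-fatigue pitfall is a corroboration gate between agent output and the SOC queue: nothing pages an analyst until a finding carries independent supporting evidence. The sketch below is a hypothetical policy of our own, not anything from the study; the two-evidence rule and all names are assumptions.

```python
# Hypothetical corroboration gate between noisy agent output and the SOC
# queue. The two-evidence rule and all names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class AgentFinding:
    host: str
    signature: str
    evidence: list[str] = field(default_factory=list)

def corroborated(f: AgentFinding) -> bool:
    # Require two independent pieces of evidence (e.g., a service banner
    # plus a successful proof-of-concept request) before escalating.
    return len(set(f.evidence)) >= 2

def gate(findings: list[AgentFinding]) -> list[AgentFinding]:
    escalate = [f for f in findings if corroborated(f)]
    deferred = [f for f in findings if not corroborated(f)]
    # Uncorroborated results land in a low-priority batch-review backlog
    # instead of paging an analyst, which is what drives alert fatigue.
    print(f"escalated {len(escalate)}, deferred {len(deferred)} for batch review")
    return escalate

if __name__ == "__main__":
    noisy = AgentFinding("10.0.1.10", "possible SQLi", ["error string"])
    solid = AgentFinding("10.0.2.20", "weak Kerberos config",
                         ["banner", "poc-auth-bypass"])
    gate([noisy, solid])  # only the corroborated finding escalates
```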
This echoes the pitfalls of earlier automated scanners: superficial sophistication without human oversight. Real adversaries exploit creativity and motive, areas where current AI penetration testing tools still lag. Over-reliance breeds complacency even as threats advance.
Risk-Centric Path Forward for AI in Penetration Testing
This research affirms AI as a force multiplier, not a substitute for human expertise. Best practices:
- Use AI for scalable recon and triage.
- Reserve humans for validation, exploit chaining, GUI testing, and impact assessment (the routing sketch after this list illustrates the split).
- Anchor in threat modeling (e.g., PASTA) to emulate attacker intent and quantify risk.
- Evaluate by exploitable risk reduction, not adoption rates.
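As a concrete illustration of this division of labor, the routing sketch below sends findings to humans or the automated lane based on attack surface and agent confidence. The categories, the 0.8 threshold, and the rules are illustrative assumptions, not a prescribed workflow.

```python
# Hypothetical hybrid routing policy: agents keep the breadth work, humans
# own the surfaces where the study found agents weakest. All categories,
# thresholds, and rules here are illustrative assumptions.
from enum import Enum, auto

class Surface(Enum):
    CLI = auto()
    GUI = auto()
    CUSTOM_LOGIC = auto()

def route(surface: Surface, agent_confidence: float) -> str:
    if surface in (Surface.GUI, Surface.CUSTOM_LOGIC):
        return "human"  # agents faltered on GUI tasks and custom logic
    if agent_confidence < 0.8:
        return "human"  # low-confidence findings need manual validation
    return "agent-verified, human spot-check"

if __name__ == "__main__":
    print(route(Surface.CLI, 0.92))  # stays in the automated lane
    print(route(Surface.GUI, 0.99))  # routed straight to a human tester
```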
The ARTEMIS study is rigorous and cautionary: AI agents excel at scale but require thoughtful integration to avoid glazing over gaps.
VerSprite’s offensive security experts can guide hybrid AI penetration testing within risk-aligned frameworks.
Reference: Justin W. Lin et al., “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing,” arXiv:2512.09882 [cs.CR] (December 2025). Available at https://arxiv.org/abs/2512.09882.