What I learnt hacking with AI over a weekend
Tom Kinnaird
Head of Security Product Engineering
Since first hearing about GPT-3 from a colleague in a pre-ChatGPT world, I have been intrigued by the promise of LLMs and the impact they might have on my work in cybersecurity.
We have come a long way since the clunky pioneering models, but there is still a lot to learn and a lot more technological advancement needed before we reach something that resembles AGI (artificial general intelligence).
Should we use agentic AI for cybersecurity?
The concept of “agents” pre-dates the launch of LLMs, and the idea of an autonomous system handling “day to day” tasks has always been something of a utopia: the technology either never supported it or was rudimentary at best, with use cases covering very specific tasks and very narrow goals. AI has been promoted as the silver bullet to this problem, but how close are we to having agents handle these tasks? It is a question I get asked a lot and, as a cybersecurity practitioner, one I have always been reluctant to answer.
In my professional life I use AI moderately: transcribing meetings, summarising emails and other Copilot tools have naturally become part of my day to day. In my personal life, on the other hand, I am a heavy user of AI. I – despite numerous attempts to learn – have the frontend development skills of a toddler. All my little helper tools and scripts exist as CLIs (command line interfaces), and they are only really useful to me. AI coding agents like Claude, Codex and Gemini have allowed me to turn these hostile, unfriendly scripts into something that is useful to anyone, and therefore more productive. There is, of course, an argument to be had around “AI slop” and the security issues around “vibe coding”, but they work, and in some cases that is good enough.
For a few months, I have toyed with the idea of bringing agents into my regular workflow to see if I can realise any of their purported productivity benefits.
Will it make us more efficient?
In the Claranet SOC, we have built our AI capabilities to help our teams review and summarise the large amount of data presented as part of a security investigation. Much like the automations that preceded them, they are targeted at a specific area and use case, without much in the way of “joining the dots” like a human would. Arguably you are more effective using them, but I have always found that I am more effective on my own, powered by brutal death metal music and sheer brute force.
That being said, I would be naive to think that I shouldn’t adapt with technology and embrace new tools in order to stay ahead of what others are doing on both sides of the legal divide.
Skunk Works comes to life
A few months ago, I decided to test what was possible when using an “agent” for offensive cybersecurity purposes. Would it make me more effective? How good could an agent be at pen testing?
Intrigued by the achievements of XBOW, I started to build an agent capable of breaking down the complex tasks needed to seek out a security issue and automating some of the investigation and activity needed to iterate and correctly identify a vulnerability. I am lucky that, being surrounded by my colleagues within the Claranet group, I am able to bounce ideas off people who are hacking all day, every day. After a few iterations of the core framework, the agent was tested on some CTF challenges.
It performed… poorly. But it performed. A few iterations later, with some advancements in the models and tooling it used, I tested the agent again, and it improved! There was still work to be done, but the foundation was there.
Battle testing 2.0
Fast forward through many more iterations, along with valuable input from my colleagues, and the latest version was born. This time, it was entered into the Hack The Box AI CTF, a Capture the Flag only accessible via MCP (Model Context Protocol) and staged to test the capability of autonomous agents.
My new agent proved to be capable of autonomous work and staying on track to achieve its goal – with some human intervention needed at times. There is still room for improvement, but did this make me more productive?
Yes…But, no…But, yes.
Was it able to solve challenges that I would not have had the capability to solve on my own?
Yes. Challenges in the CTF that involved reviewing a problem in its entirety (like cryptography, secure coding, reverse engineering, blockchain, forensic data investigation) proved to be trivial for the agent. In most cases, it was capable of solving the problem in one shot, with no human intervention.
Was it able to solve challenges that required joining the dots, or completing tasks over several stages in the correct order?
No. Other challenges that involved working on flaws in logic (web, pwn, AI) proved more difficult for the agent, in some cases needing human intervention or getting lost in a mess of context and mistakes before crashing out.
Some interesting solutions came out of this trial. When the agent exhausted the tooling it had available, it “decided” to make its own using the provided Python REPL. In some cases this led to solving the challenge, while in others it resulted in a mess of incorrect code and bad choices. Sometimes, when it got stuck with incorrect code, it was possible to point the agent back in the right direction, but that required the operator to know exactly what the problem was and how to solve it.
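To give a flavour of what that looks like, here is a minimal sketch of the pattern – not my actual framework, just an illustration of exposing a Python REPL as a tool the agent can call when nothing else fits:

```python
# Illustrative sketch only (not the real agent): a Python REPL exposed as a
# tool, so the model can write its own helpers when no existing tool fits.
import contextlib
import io


def run_python(code: str, state: dict) -> str:
    """Execute model-written Python and return whatever it printed.

    `state` persists between calls, so the agent can build helpers up over
    several turns. NOTE: exec() with no sandboxing – a real agent needs
    proper isolation before running model-generated code.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, state)
    except Exception as exc:  # surface the error so the model can self-correct
        return f"ERROR: {exc!r}"
    return buffer.getvalue() or "(no output)"


# Example: the agent writes a quick decoder mid-challenge instead of asking
# for a dedicated tool.
state: dict = {}
print(run_python(
    "import base64; print(base64.b64decode('SFRCe2V4YW1wbGV9').decode())",
    state,
))  # -> HTB{example}
```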
Much of the work was carried out using OpenAI GPT-5 and Codex models, but for some challenges the model was swapped for Claude Sonnet 4.5. In one secure coding challenge, Gemini 3 Pro was able to find the correct solution. Again, this shows that not all models are right for the job, in the same way that not all personnel are – some excel at certain tasks but struggle with others.
Another option for completing tasks was to provide the agent with external tooling via MCP, such as access to Ghidra, Burp Suite and Binary Ninja. This gave the agent access to tools, data and techniques beyond what would be possible with its own internal tooling, but at great expense to the context window – the working memory of an LLM that holds the system prompt, tool definitions and conversation history. That consumption wasn’t a problem on smaller challenges, but as complexity grows the available context shrinks and the agent loses its way.
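To make that trade-off concrete, here is a rough back-of-the-envelope sketch – all the numbers are illustrative assumptions, not measurements from my agent – of how tool schemas and tool output eat into the budget:

```python
# Back-of-the-envelope sketch: every MCP tool schema and every tool result
# pulled back into the conversation counts against the same context budget.
# All figures below are illustrative assumptions, not real measurements.

def rough_tokens(chars: int) -> int:
    """Crude heuristic: roughly four characters per token."""
    return max(1, chars // 4)


CONTEXT_WINDOW = 200_000                  # e.g. a 200k-token model
budget = CONTEXT_WINDOW
budget -= 3_000                           # system prompt: goal, rules, reflection step
budget -= rough_tokens(60_000)            # tool schemas advertised by the MCP servers

# Each step pulls a tool result (decompiler listing, HTTP history, etc.)
# back into the conversation, and big results pile up fast.
for step, result_chars in enumerate([40_000, 200_000, 400_000, 600_000], start=1):
    cost = rough_tokens(result_chars)
    if cost > budget:
        print(f"step {step}: result would not fit -> agent loses its way")
        break
    budget -= cost
    print(f"after step {step}: ~{budget:,} tokens left")
```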
How did the AI agent perform? (An honest assessment.)
Initial prompts make all the difference. Clearly setting out the goal and the path helped to keep the agent on track, and having it reflect on its choices before committing to them was key – this will be the area I focus on most as I continue to develop my AI agent.
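For anyone curious what that reflection step looks like in practice, here is a hypothetical sketch of the pattern – the function names and prompt wording are placeholders, not lifted from my framework:

```python
# Hypothetical sketch of a reflect-before-commit step; `ask_model` is a
# stand-in for whichever LLM client the agent actually uses.
from dataclasses import dataclass


@dataclass
class ProposedAction:
    tool: str
    arguments: str
    rationale: str


def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call – returns a canned critique for the demo."""
    return "APPROVE: the step targets the stated goal and repeats no earlier mistake."


def reflect_then_act(goal: str, action: ProposedAction) -> bool:
    """Ask the model to critique its own proposed step before it runs."""
    critique = ask_model(
        f"Goal: {goal}\n"
        f"Proposed step: run {action.tool} with {action.arguments}\n"
        f"Stated rationale: {action.rationale}\n"
        "Does this step move us toward the goal without repeating earlier "
        "mistakes? Answer APPROVE or REJECT with one sentence of reasoning."
    )
    return critique.strip().upper().startswith("APPROVE")


# Only actions that survive their own critique get executed.
action = ProposedAction("http_request", "GET /admin", "Look for an exposed admin panel")
if reflect_then_act("Capture the flag on the target web app", action):
    print("committing:", action.tool, action.arguments)
```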
Yes, this was a CTF, and CTF challenges are not always “real world”, but they are fun, challenging and a good test of skill.
Did the agent complete everything? No, it didn’t. Did it require a human to help at times? Yes.
Will this type of technology replace a human pentester? No, not in its current iteration, anyway.
Will it be able to assist a human pentester? Absolutely, but HITL (human in the loop) is critical, and will remain so for the foreseeable future.
Did it cost a lot? Not as much as I thought (and budgeted for). Yes, there is always a cost, but this would be no different if you factored in the cost of me – the human – doing these tasks.
But was I more efficient? Reluctantly, I will admit, yes.
It works. What now?
The proof is there: AI agents are capable of performing cybersecurity tasks and helping humans achieve more in a shorter amount of time. Nothing is perfect, so we must iterate and improve over time.
I look forward to expanding on this research and testing more cases over the next year (providing the “bubble” doesn’t burst).
To find out more about how Tom’s team uses AI in the Claranet SOC, contact us today.
