Will AI automate penetration testing?

There’s a lot of hype about AI and its effects on all areas of the IT and technology industries. What does this mean for penetration testing? In this article, we will discuss the impact of AI and automation on cybersecurity testing services, how AI is likely to change them in future, and what to look out for when evaluating vendors for penetration testing.

Some security vendors now describe their tools and services as “powered by AI”, but it’s worth remembering that many areas of cybersecurity, especially penetration testing, have been progressively incorporating AI into their work for several years now. On the whole, AI and automation have been built into tools that reduce the time that penetration testers spend on repetitive tasks so they can focus on tasks where their expertise is more valuable.

Let’s consult the dictionary

One problem is that industry standard terms don’t always have a universally accepted definition. One cybersecurity provider might refer to their service as “automated penetration testing”, while another provider might use the same term to refer to something slightly different. So, before we go any further, let’s define our terms:

Penetration testing (AKA pentesting) is an offensive security exercise that uses manual and automated hacking techniques to deploy controlled attacks on an asset (or assets), such as a web application, cloud environment, or piece of hardware. This tests an IT asset’s effectiveness in withstanding malicious techniques that would cause disruption or harm, or exfiltrate data or financial assets. Penetration testing helps identify where vulnerabilities and misconfigurations may increase cyber risk, and recommends suitable remediation to help lower that risk.

Vulnerability scanners inspect networks, devices, and applications for known vulnerabilities and then score them (usually as per CVSS).
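As a simple illustration of that scoring step, here is a minimal Python sketch that ranks hypothetical scanner findings by CVSS base score using the standard CVSS v3 qualitative severity bands. The Finding structure and the example data are invented for illustration, not any vendor’s output format.

```python
# Minimal sketch: ranking hypothetical scanner findings by CVSS base score.
# The Finding fields and example data are illustrative, not a real format.
from dataclasses import dataclass

@dataclass
class Finding:
    host: str
    title: str
    cvss: float  # CVSS base score, 0.0-10.0

def severity(score: float) -> str:
    # Standard CVSS v3.x qualitative severity bands
    if score >= 9.0: return "Critical"
    if score >= 7.0: return "High"
    if score >= 4.0: return "Medium"
    if score > 0.0:  return "Low"
    return "None"

findings = [
    Finding("10.0.0.5", "Outdated TLS configuration", 5.3),
    Finding("10.0.0.7", "Unauthenticated RCE in admin panel", 9.8),
]

# Highest-risk findings first
for f in sorted(findings, key=lambda x: x.cvss, reverse=True):
    print(f"{severity(f.cvss):8} {f.cvss:4.1f}  {f.host}  {f.title}")
```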

Some security vendors now use the term automated penetration testing to describe tools that use vulnerability scanners to find vulnerabilities and then automatically test them. Unfortunately, such tools struggle to perform more complex attack techniques with many variables, or to chain multiple vulnerabilities together to achieve an impact greater than the sum of its parts.

However, “automated penetration testing” is potentially misleading. It implies that other kinds of testing are “manual”, and “manual penetration testing” implies that everything is done by hand. In fact, penetration testers have been developing and using tools to automate parts of the penetration testing process for years.

The secret lives of vulnerability scanners

Vulnerability scanners are used on most pentests to speed up the process of identifying vulnerabilities, but they are also used on their own to uncover more basic issues. Scanners have grown more powerful over the years, steadily increasing the number of security vulnerabilities they can detect. With the addition of AI, this growth will only continue, as will their ability to automatically investigate such vulnerabilities by performing the same attacks a human penetration tester would perform manually.

Up to now, vulnerability scanners have struggled to test systems with dynamic variables, that is, systems whose behaviour and data change according to different inputs or sequences of inputs. For this reason they have traditionally performed much better against infrastructure than against more complex web applications. On an e-commerce website, for example, a human penetration tester can enter different data into the customer checkout fields on each form to see what effect it has. A scanner, however, doesn’t recognise that interacting with those fields could produce different outcomes. Without that context, it can only observe and perform a limited set of canned actions, rather than systematically testing the effect of each variable. For this reason, human input and expertise remain valuable for testing web applications, as the sketch below illustrates.
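Here is a minimal sketch of that kind of systematic input testing, assuming a hypothetical checkout endpoint and invented payloads. It varies one form field and flags responses that differ from a baseline, which is exactly the contextual judgement a scanner lacks. Needless to say, only run this kind of test against systems you are authorized to test.

```python
# Minimal sketch: systematically varying one checkout form field and observing
# the response, the way a human tester probes dynamic behaviour. The URL,
# field name, and payloads are hypothetical.
import requests

TARGET = "https://shop.example.com/checkout"  # hypothetical endpoint
payloads = ["1", "-1", "0", "9999999", "1'--", "<script>alert(1)</script>"]

# Establish a baseline response with a benign input
baseline = requests.post(TARGET, data={"quantity": "1"}, timeout=10)

for p in payloads:
    r = requests.post(TARGET, data={"quantity": p}, timeout=10)
    # A change in status code or response length hints that the input altered
    # server-side behaviour and deserves manual follow-up.
    if r.status_code != baseline.status_code or len(r.text) != len(baseline.text):
        print(f"payload {p!r}: status {r.status_code}, length {len(r.text)} (differs)")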

Using scanners on authenticated systems also carries business risks (our engineers avoid it altogether where possible). Scanners can’t intelligently differentiate between “safe” and “unsafe” actions; they simply follow a pre-scripted set of if-then instructions. Deployed on a live system with authenticated access, where the user is authorized to perform critical actions (such as deleting data), this can result in system downtime, operational disruption and irreversible consequences.

How does AI improve vulnerability scanners?

Let’s start with scanners’ ability to find and test more complex vulnerabilities. Previously, human input was needed to chain actions together. To do this effectively, humans have to understand which actions are linked, and which chains of actions result in consequences that are relevant to the asset they are testing. Essentially, “If I test this first, then I should test that next. If this doesn’t work, then I should try that.” AI can instruct vulnerability scanners to attempt multiple different techniques, or to perform one specific action when presented with one outcome, but a different action in response to a different outcome.

(See section: Can AI perform penetration tests? for more detail)
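To make the chaining idea concrete, here is a minimal sketch of that “if this works, try that next” logic in Python. The three check functions are hypothetical placeholders for real test steps a scanner or tester would run.

```python
# Minimal sketch: chaining dependent test steps so one outcome decides the
# next action. Each step is a hypothetical placeholder returning True/False.
def found_login_form() -> bool:
    return True   # placeholder: e.g. crawl the site for a login page

def default_creds_work() -> bool:
    return False  # placeholder: e.g. try admin/admin

def sqli_bypass_works() -> bool:
    return True   # placeholder: e.g. try ' OR '1'='1 in the username field

if found_login_form():
    if default_creds_work():
        print("Chain complete: access via default credentials")
    elif sqli_bypass_works():
        print("Chain complete: access via SQL injection auth bypass")
    else:
        print("Login form found, no bypass; try other techniques")
else:
    print("No login form; pivot to a different attack surface")
```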

As well as testing for more complex vulnerabilities, we predict that AI will guide scanners on what not to test. AI can help make scanners more context-aware, so that they perform fewer risky actions when testing. Using pre-programmed sets of rules, scanners could be given indicators about functionality that, if tested, might cause irreparable damage, such as deleting accounts, data or code. Such rules would instruct the scanner, “if you encounter X, do not do Y.”
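A minimal sketch of such a rule set might look like the following, assuming an invented deny-list format: each rule pairs a URL pattern with an HTTP method the scanner must not send.

```python
# Minimal sketch: a deny-list the scanner consults before acting, encoding
# "if you encounter X, do not do Y". Rule patterns and example URLs are
# illustrative, not taken from any real scanner.
import re

DENY_RULES = [
    (re.compile(r"/delete|/remove|/destroy", re.I), "POST"),
    (re.compile(r"/admin/users", re.I), "DELETE"),
]

def is_safe(url: str, method: str) -> bool:
    # Refuse any request matching a pattern/method pair in the deny-list
    return not any(p.search(url) and method.upper() == m for p, m in DENY_RULES)

for url, method in [("https://app.example.com/account/delete", "POST"),
                    ("https://app.example.com/products", "GET")]:
    verdict = "allowed" if is_safe(url, method) else "BLOCKED as destructive"
    print(f"{method} {url}: {verdict}")
```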

Once testing is complete, vulnerability scanners must list the vulnerabilities they uncovered and rank them by the severity of the risk they present. Unfortunately, many scanners list duplicate findings; a single misconfiguration, for example, can cause the same vulnerability to appear in dozens of different places on an application or website. AI can help scanners deduplicate these results. Similarly, AI can help rank the severity of vulnerabilities more accurately.

(See section: Not man versus machine, but machine plus machine for more detail.)
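To illustrate the deduplication step described above, here is a minimal sketch that collapses duplicate findings sharing the same vulnerability ID and root cause. The grouping key and the example findings are invented; real deduplication logic would be more nuanced.

```python
# Minimal sketch: collapsing duplicate findings that share one root cause.
# Grouping by (vulnerability ID, root cause) is one plausible key.
from collections import defaultdict

findings = [
    {"id": "CVE-2023-0001", "cause": "outdated jQuery", "url": "/home"},
    {"id": "CVE-2023-0001", "cause": "outdated jQuery", "url": "/about"},
    {"id": "CVE-2023-0001", "cause": "outdated jQuery", "url": "/contact"},
    {"id": "CWE-79", "cause": "unescaped search input", "url": "/search"},
]

grouped = defaultdict(list)
for f in findings:
    grouped[(f["id"], f["cause"])].append(f["url"])

# One consolidated finding per root cause, with all affected locations
for (vid, cause), urls in grouped.items():
    print(f"{vid} ({cause}): {len(urls)} affected locations: {', '.join(urls)}")
```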

How AI improves vulnerability scanners:

  • Increases the range of vulnerabilities they can identify
  • Extends how they can automatically test those vulnerabilities
  • Helps them understand which vulnerabilities they should not test because doing so is too risky
  • Removes duplicates from scan results, making them more useful for human penetration testers
  • Ranks the severity of vulnerabilities more accurately

Can AI perform penetration tests?

In February 2024, researchers were able to show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Most importantly, the LLM agent does not need to know the vulnerabilities present on the website beforehand.

They did this by first equipping the LLM agents to read documents, call functions to manipulate a web browser and retrieve results, and access contextual information from previous actions. They tested 15 vulnerabilities, ranging from simple SQL injection to more complex hacks requiring both cross-site scripting (XSS) and cross-site request forgery (CSRF). They defined a specific goal for each vulnerability (e.g., stealing private user information) and encouraged the model to pursue each technique to its conclusion, probe the difficulties within a technique, and try alternatives so the agent did not get stuck when a technique was failing.
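In outline, that setup resembles the agent loop sketched below. Everything here is a hypothetical stand-in for the researchers’ actual harness: the model picks a browser action, the harness executes it, and the observation is fed back as context for the next decision.

```python
# Minimal sketch of the agent-loop pattern: choose an action, execute it,
# feed the observation back. llm_choose_action() and run_tool() are
# hypothetical placeholders, not the researchers' actual code.
GOAL = "extract private user data via the search form"

def llm_choose_action(goal, history):
    # Placeholder for a function-calling LLM; returns (tool, argument)
    return ("navigate", "https://target.example.com/search")

def run_tool(tool, arg):
    # Placeholder browser tools: navigate, fill_field, submit, read_page ...
    return f"executed {tool}({arg!r})"

history = []
for step in range(10):  # cap steps so a stuck agent terminates
    tool, arg = llm_choose_action(GOAL, history)
    observation = run_tool(tool, arg)
    history.append((tool, arg, observation))  # context for the next decision
    if "private user data" in observation:   # goal reached
        break
```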

GPT-4 had a success rate of 42.7%, while the open-source LLMs tested all failed because they could not use tools correctly or plan appropriately. (Indeed, “planning” how to approach complex tasks is one area where LLMs still struggle; give an LLM a very complex sum, for example, and you will see different results from what a calculator gives you.) The researchers also showed that removing GPT-4’s ability to retrieve documents related to the task, such as system instructions or blogs about web hacking, substantially reduced its performance.

Despite the low success rate, this is a strong indication of where AI is heading for penetration testing and exploiting vulnerabilities. Sadly, threat actors have access to the same tools and will use them for nefarious purposes, such as accelerating the development of exploits for new vulnerabilities that have just been published in threat advisories. Researchers at the University of Illinois Urbana-Champaign (UIUC) subsequently found that GPT-4 could successfully devise exploits for vulnerabilities that had been public for only one day.

Clearly, while AI still has limitations and needs careful deployment, it also has the potential to enhance penetration testing activities and make human testing far more efficient.

Not man versus machine, but machine plus machine

After losing to IBM’s Deep Blue, the Russian chess grandmaster Garry Kasparov began exploring the concept of advanced chess: humans pitted against one another while using computers to plan and predict their next move. He explained his hypothesis at DEF CON 25 with a simple formula: “A weak player with an ordinary machine and a superior process will be dominant in a game against a strong player with a strong machine and an inferior process”. Or, simply put, “your job won’t be taken by AI, but it will be taken by someone using AI.”

We used the same inspiration when we were building Continuous Security Testing. By automating repetitive, low-value tasks, such as searching for and classifying vulnerabilities, we enabled expert penetration testers to focus on areas where they could best apply their expertise and ingenuity.

Continuous Security Testing takes the best parts of automated scanning and manual penetration testing. Automated scanning tools run 24/7 to reveal vulnerabilities within your web-facing assets, APIs and external infrastructure. Expert penetration testers then analyse the scan results, remove any false positives, manually verify the vulnerabilities that remain, and conduct further testing on more complex vulnerabilities, or those that require chaining multiple attacks together to reach a final objective.

Continuous Security Testing uses Qualys to automate the identification of vulnerabilities. It also uses proprietary AI developed by Simon Kubicsek – Offensive Security Senior Manager at Claranet – to speed up the prioritization of vulnerabilities based on their risk and the likelihood of being exploited. This helps classify vulnerabilities to give them a risk score, which is always checked and approved by a human penetration tester.
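Since that prioritization model is proprietary, the sketch below is purely illustrative of the general idea: combining a severity score with an exploit-likelihood signal into a single triage score. The weights and inputs are invented.

```python
# Illustrative only: the prioritization model described above is proprietary.
# This sketch shows the general idea of weighting severity by how likely a
# vulnerability is to be exploited; the formula and weights are invented.
def triage_score(cvss: float, exploit_likelihood: float) -> float:
    # exploit_likelihood in [0, 1], e.g. from exploit-availability signals
    return round(cvss * (0.5 + 0.5 * exploit_likelihood), 1)

print(triage_score(cvss=9.8, exploit_likelihood=0.9))  # widely exploited RCE -> 9.3
print(triage_score(cvss=9.8, exploit_likelihood=0.1))  # no known exploit    -> 5.4
```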

Will AI automate the penetration testing process?

The short answer is: not completely, and not for a long time yet. If you are in the market for penetration testing (or any other kind of offensive security testing), beware of security vendors that highlight automated, autonomous or AI-powered penetration testing in their marketing. If such claims make you doubt their credibility, your skepticism is well placed.

The weaknesses of AI are well known, and current research shows that attackers are already targeting them. Typosquatting, poisoning training data sets, capitalizing on AI hallucinations, and creating code repositories containing malware are just a few examples of why we must invest more resources in securing large language models.

Generative AI may make pentesting tools faster and more thorough, but such tools will still need to be created, managed and run by cybersecurity experts. For now, the results of any AI must be checked by a human expert. There are other difficulties with overreliance on AI, which relate to some of the fundamental principles of pentesting:

  • Inconsistent results – The results produced by generative AI and Large Language Models are not always consistent. The processes used in pentesting must be consistent and repeatable to have value for the customer.
  • Unclear methodology – It is often unclear how an AI has arrived at its conclusions, and it cannot “show its working”. AI models often struggle to provide clear evidence of what they have done, or could have done. Penetration testing should be based on clear methodologies, providing a proof of concept so that the exploit can be recreated after the test.
  • Lack of rigorous quality control – All manual penetration testing undergoes thorough quality assurance from more senior team members, to ensure that the client doesn’t receive a poor test or report. When it’s unknown how an AI arrived at its conclusions, how can a pentest provider reasonably ensure the quality of its results?

Be cautious when using any service or tool that appears too heavily reliant on AI. To ensure you are getting the same thorough testing and expertise, and the level of service you would expect from “traditional” or “manual” penetration testing, look into the finer technical details. Ask potential vendors exactly what is automated, what is done by AI, what is done by humans and what level of expertise those humans have. Ask them what publicly available tools their testers use and what proprietary software or tools they use (if any). Compare the fine details in the service description or Scope of Work with those of any other security vendors you are evaluating.

To find out more about how penetration testing and Continuous Security Testing can improve the security of your web applications, get in touch.

Continuous Security Testing

Automated and human
Agile and always-on

Find out more