Copy Fail (CVE-2026-31431): This vulnerability affects millions of Linux systems. Why Claranet customers still sleep well. A workshop report.
Fedor Kaplan
Team Lead Cloud Engineering
In this blog post, you can find out why the "Copy Fail" vulnerability in the Linux kernel is particularly dangerous, how the Claranet team was able to secure the systems of its managed service customers in just a few hours, and why speed is more important than ever in the AI era.
When the phone rings at midnight
On the night of 30 April 2026, our on-call team received a message from the Claranet Security Operations Center (SOC):
"Sorry to wake you. I believe your IT team needs to quickly apply the workaround. Your K8s are exposed. Basically any shared Linux platform needs to be addressed immediately. This is really bad - you can basically root it in seconds."
This was followed by a coordinated incident response process that secured all of our customers' managed systems within a few hours without a single customer having to take action themselves.
What is the technical background? (Linux Kernel, AF_ALG, algif_aead)
CVE-2026-31431, known as "Copy Fail", is a privilege escalation vulnerability in the Linux kernel. It affects the algif_aead driver, which is part of the AF_ALG socket family: a kernel interface through which userspace programs can use the kernel's cryptographic algorithms.
The error lies in the handling of memory areas when copying data (copy_from_iter) in the AEAD interface. Through a targeted sequence of system calls, an unprivileged local user can trigger an out-of-bounds write, which leads to code execution in kernel context and thus to full root rights on the affected system.
What makes the Copy Fail vulnerability technically critical?
Several characteristics make CVE-2026-31431 technically exceptional:
Minimum effort, maximum effect: The exploit requires no special prior knowledge, no privileged accounts and no external network communication. Ordinary local access to the system is sufficient. A proof-of-concept was publicly available shortly after the vulnerability became known.
Universal impact: The vulnerability affects Linux kernel versions from 4.9 (released in 2016) to current versions. This means that practically all Linux distributions from the past nine years are affected: Ubuntu, Debian, RHEL, Amazon Linux and all container and Kubernetes environments based on them.
Long undiscovered: The vulnerable code has existed in the kernel since 2017. The vulnerability remained undiscovered for almost nine years - despite code reviews, security audits and widespread use of the affected component. This emphasises how difficult static code analysis and manual reviews are, even in one of the most audited open source projects in the world.
Difficult to detect: The vulnerability is also hard for traditional security tooling to detect. File integrity checks do not help, because the attack neither manipulates files nor leaves persistent changes in the file system. Conventional intrusion detection systems and antivirus programs also struggle to recognise the attack, because the code runs in kernel context via legitimate system calls. As a result, exploitation often goes unnoticed, which delays the response.
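The version range named under "Universal impact" can serve as a first, coarse filter across a fleet. The following is a minimal sketch, assuming the "4.9 and newer" range stated above; the function name kernel_in_reported_range is hypothetical. Because distributions backport fixes, a match only means that the distribution advisory needs to be checked, not that the host is vulnerable.

```shell
#!/bin/sh
# Sketch: coarse filter for the kernel range reported for Copy Fail
# (4.9 and newer). The function name is hypothetical.
# Returns 0 if the release is 4.9 or newer, 1 if older, 2 if unparsable.
kernel_in_reported_range() {
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    case $major in
        ''|*[!0-9]*) return 2 ;;  # major part is not a plain number
    esac
    [ -n "$minor" ] || return 2
    if [ "$major" -gt 4 ]; then
        return 0
    fi
    [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]
}

current=$(uname -r)
if kernel_in_reported_range "$current"; then
    echo "kernel $current is in the reported range: check the distribution advisory"
else
    echo "kernel $current is outside the reported range or could not be parsed"
fi
```

Comparing only the major and minor number keeps the check portable across distribution-specific suffixes such as -generic or -aws.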
No vendor patch at the time of mitigation
A critical aspect: At the time of our response, there was no official kernel patch from the vendors providing Linux distributions. Ubuntu, Red Hat, AWS, GCP and other vendors had not yet released updated packages. A custom kernel build was also ruled out for production environments.
AWS only released AMIs with a patched kernel on 5 May. Until then, only workarounds were available - which made our rapid response with the available resources all the more important.
Inventory & first workaround (mitigation)
As soon as the report was received, our security team took action. An internal ticket was opened within minutes and the relevant teams were informed and involved.
The first questions we asked ourselves were:
- Which of our managed systems are affected (kernel versions, distributions, cloud/Kubernetes)?
- Is the algif_aead kernel module already loaded or can it be loaded automatically?
- Which workaround can be used immediately without risking system failures?
Our team quickly created Ansible playbooks to check the entire managed inventory:
- hosts: all
  tasks:
    # grep exits non-zero when the module is not loaded, so errors are ignored
    - name: Check if module is loaded
      ansible.builtin.shell: "lsmod | grep -i algif_aead"
      register: moduleloaded
      ignore_errors: true
The algif_aead module was not loaded on the systems, as it is only loaded automatically when required. This allowed a low-risk workaround: loading of the module was blocked system-wide without affecting running services:
    # An install rule makes modprobe run /bin/false instead of loading the module
    - name: Block module from being loaded
      ansible.builtin.lineinfile:
        path: "/etc/modprobe.d/disable-algif.conf"
        line: "install algif_aead /bin/false"
        create: true
All reachable Linux systems were processed and secured automatically.
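Whether the rule has taken effect on a host can be confirmed with a few read-only checks. The following is a minimal sketch, not the check used during the incident: the function names module_loaded, load_blocked and report are hypothetical, and the file locations are passed as parameters so that the logic can also be run against test fixtures.

```shell
#!/bin/sh
# Sketch: verify that the modprobe rule is in place on a host.
# All function names are hypothetical.

# Succeeds if module $2 appears in the first column of the
# /proc/modules-style file $1.
module_loaded() {
    awk -v m="$2" '$1 == m { found = 1 } END { exit !found }' "$1" 2>/dev/null
}

# Succeeds if a *.conf file in directory $1 contains a rule of the form
# "install <module> /bin/false" for module $2.
load_blocked() {
    grep -Eqs "^[[:space:]]*install[[:space:]]+$2[[:space:]]+/bin/false" "$1"/*.conf
}

# Prints one OK/WARN line per check.
report() {
    proc_modules=$1 modprobe_dir=$2 module=$3
    if module_loaded "$proc_modules" "$module"; then
        echo "WARN: $module is currently loaded"
    else
        echo "OK: $module not loaded"
    fi
    if load_blocked "$modprobe_dir" "$module"; then
        echo "OK: install rule present in $modprobe_dir"
    else
        echo "WARN: no install rule found in $modprobe_dir"
    fi
}

report /proc/modules /etc/modprobe.d algif_aead
```

The sketch only reads files and never calls modprobe itself, so running it cannot load the module by accident.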
The Kubernetes challenge
The path was clear for classic Linux VMs. Kubernetes clusters presented a challenge of their own:
- For self-managed nodes, the workaround could be applied directly.
- Managed nodes (e.g. EKS on AWS, GKE on Google Cloud) lack direct shell access - redeploying the nodes would have carried risks.
The analysis showed that the risk was manageable as no multi-tenant cluster was operated where several customers could influence each other. The team deliberately opted for a controlled approach rather than a quick fix.
Phase 1 - EKS (Amazon Web Services): A privileged DaemonSet was rolled out to all nodes and wrote a modprobe blacklist file to each node, which permanently prevents the module from loading. The verification:
OK: algif_aead not loaded
OK: blacklist file present
OK: modprobe refused (exit 1)
OK: algif_aead absent from /proc/modules
Phase 2 - GKE (Google Cloud): On GKE clusters, the module is compiled into the kernel (built-in) on some kernel versions, so a modprobe blacklist has no effect there. The solution: a Seccomp profile that blocks the socket(AF_ALG, ...) syscall at kernel level, combined with Kyverno policies that automatically enforce it for new workloads. A controlled rotation was necessary for existing workloads - customers were informed in advance and time windows were coordinated.
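A Seccomp profile of the kind described for GKE could look roughly as follows. This is a minimal sketch and not the profile used in production: the function name write_deny_af_alg_profile and the file name deny-af-alg.json are hypothetical, and the allow-by-default action is a simplifying assumption (in practice the rule would more likely be merged into the container runtime's default profile). The value 38 is the number of the AF_ALG address family on Linux.

```shell
#!/bin/sh
# Sketch: write a seccomp profile that denies socket(AF_ALG, ...).
# Function and file names are hypothetical.
write_deny_af_alg_profile() {
    target_dir=$1
    mkdir -p "$target_dir" || return 1
    # Assumption: everything else stays allowed (SCMP_ACT_ALLOW).
    # Only socket() calls whose first argument equals 38 (AF_ALG)
    # are rejected with an error.
    cat > "$target_dir/deny-af-alg.json" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "args": [
        { "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
EOF
}

# Example (not executed here):
#   write_deny_af_alg_profile /var/lib/kubelet/seccomp/profiles
```

On Kubernetes nodes, the kubelet reads such profiles from its seccomp directory (by default /var/lib/kubelet/seccomp), and workloads reference them through securityContext.seccompProfile with type Localhost. Because a seccomp profile is applied when a container starts, already running workloads have to be restarted to pick it up.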
Customer communication: transparent, professional, proactive
All affected customers were informed in writing about the vulnerability and the measures taken. For environments where workload rotation was necessary, specific timeframes were communicated with a clear picture of what to expect.
Result: Fully secured - without customer action
By the end of the evening, all managed Linux systems and Kubernetes clusters were either:
- Directly secured (modprobe blacklist via Ansible or DaemonSet), or
- Protected via Seccomp profile with Kyverno enforcement at cluster level
The monitoring showed no deviations during the entire process. No unplanned outages, no data loss, no security incidents for our customers.
On 5 May, AWS released corrected AMIs with a patched kernel - a further step towards full protection, which was automatically integrated into the rollout planning.
What this security incident says about the added value of a managed service provider
Responsiveness around the clock
The alarm was raised after midnight. Within minutes, the team was active, a ticket was opened and the first playbooks were running. Without a dedicated managed service team, the responsibility at times like this lies with the customer.
Centralised inventory and automation
An MSP knows the managed inventory. Instead of manually checking system by system, a playbook runs across all hosts - quickly, consistently and documented. That's the difference between hours and days.
Deep technical expertise
From Linux kernel internals to Kubernetes architecture to cloud-specific peculiarities (EKS vs. GKE, modprobe vs. Seccomp): the team understood the full depth of the stack and was able to make precise risk assessments instead of falling back on blanket measures.
Customers didn't have to do anything
This is the key point: the affected systems were secured without customers having to intervene. This is not a stroke of luck - it is the result of a structured, tool-supported operational organisation.
Outlook: Why incidents like Copy Fail will increase
CVE-2026-31431 is not an isolated incident - and the frequency of similar incidents will increase in the coming weeks and months. Just today, on 8 May 2026, we were confronted with the "Dirty Frag" vulnerability - also a privilege escalation vulnerability.
A key driver is the use of AI-supported code analysis: modern tools can systematically analyse large amounts of code for patterns that indicate potential vulnerabilities - memory accesses, synchronisation errors, API abuse. What used to take weeks of manual analysis can now be automated in hours. This means that vulnerabilities such as "Copy Fail" or "Dirty Frag", which lay undetected in the code for years, will be found more quickly in future - both by security researchers and by attackers.
This development has concrete consequences for operational practice:
- Shorter windows between disclosure and exploitation: When AI tools speed up exploit development, the timeframe in which action is required shrinks dramatically.
- More simultaneous vulnerabilities: Instead of one critical CVE every few months, parallel incidents are realistic - with higher coordination efforts.
- Higher demands on automation: Manual responses to CVEs will no longer be sufficient in terms of time. Those who have not prepared inventories, playbooks and rollout pipelines will inevitably be reactive in such situations.
For companies, this means that the decision for or against proactively managed operations will increasingly become a decision about the accepted residual risk - not just about operating costs.
Conclusion
"Copy Fail" was a serious threat. Millions of Linux systems worldwide were affected and the vulnerability was actively exploited. For our customers, the situation was manageable because we recognised, assessed and closed the gap before it could become a problem.
The question is not whether such incidents will occur again. The question is whether the structures are in place to deal with them.
Security is not a product you buy. It is a process. And this process only works if the right people with the right tools are in place around the clock.
That's our promise as an MSP.
Managed Kubernetes Operations & Security
Are you running Kubernetes productively and want to mitigate CVEs such as "Copy Fail" more quickly - without ad-hoc firefighting? Then it's worth taking a look at our Managed Kubernetes Services: As a managed service provider, we support you with secure operation (security hardening, policy enforcement, monitoring), rollouts/upgrades and structured patch & lifecycle management - tailored to your platform (e.g. EKS/GKE) and your operating models.
Tip for getting started: Start with our Kubernetes assessment: We analyse cluster setup, node images, security baseline (e.g. Seccomp/policies) and operating processes and derive a prioritised mitigation and patch roadmap.
CVE-2026-31431 ("Copy Fail"): The most important facts at a glance
- CVE-2026-31431 ("Copy Fail") is a Local Privilege Escalation (LPE) in the Linux kernel (AF_ALG / algif_aead).
- Many kernel versions are affected (including 4.9+); it is particularly critical on shared Linux platforms, CI/CD runners and Kubernetes nodes.
- If the module is not loaded, a quick workaround is possible: block module loading (e.g. install algif_aead /bin/false).
- In Kubernetes, different mitigations are required depending on the platform: DaemonSet/Blacklist (e.g. EKS) vs. Seccomp + Policy Enforcement (e.g. GKE, if built-in).
- Patching remains mandatory: Workarounds reduce the risk immediately, but do not replace a kernel fix/update.
- Claranet provides support with assessment, rollout (automation), Kubernetes hardening and continuous monitoring and reacts quickly and proactively in an emergency.
Who is this relevant for? In short: for anyone who runs Linux productively - especially in environments with many users/workloads per host (multi-tenant, containers/Kubernetes, build agents). In such setups, "local" can very quickly become "critical" because attackers often need only an initial foothold (e.g. pod, shell, job) to start the escalation to root. If you operate multi-tenant or "untrusted workload" scenarios, fast mitigation is particularly important.
Am I affected? (Quick check for Copy Fail / CVE-2026-31431)
This quick check helps you categorise the risk of Copy Fail (CVE-2026-31431) - especially if you operate Kubernetes nodes, CI/CD runners or other shared Linux platforms. The checks do not replace a complete assessment, but provide a reliable initial direction in just a few minutes.
- Check kernel version: uname -r and compare with the advisories of your distribution/cloud images.
- AF_ALG/algif_aead attack surface: Is algif_aead available/loaded as a module? lsmod | grep -i algif_aead
- Module loading blockable? (if not built-in): modprobe rule like install algif_aead /bin/false is a typical immediate measure.
- Kubernetes scope: Are you running multi-tenant or untrusted workloads per node? Then priority is particularly high (LPE → root at node level).
- GKE/EKS special features: With built-in modules (typical in some GKE constellations), a seccomp mitigation (e.g. blocking socket(AF_ALG)) is often the faster lever than modprobe.
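The module-related checks in this list can be combined into a single classification step. The following is a minimal sketch under stated assumptions: the function name module_state is hypothetical, and built-in modules are detected via the modules.builtin list that kernel packages usually install under /lib/modules/<release>/ (if that file is missing, the sketch cannot recognise a built-in driver).

```shell
#!/bin/sh
# Sketch: classify how a kernel module is present on a host.
# Prints "builtin", "loaded" or "not-loaded". The function name is
# hypothetical.
module_state() {
    module=$1 builtin_list=$2 proc_modules=$3
    if grep -qs "/$module\.ko\$" "$builtin_list"; then
        # Compiled into the kernel: modprobe rules have no effect.
        echo builtin
    elif grep -qs "^$module " "$proc_modules"; then
        echo loaded
    else
        echo not-loaded
    fi
}

release=$(uname -r)
state=$(module_state algif_aead "/lib/modules/$release/modules.builtin" /proc/modules)
echo "kernel $release: algif_aead is $state"
case $state in
    builtin)    echo "hint: a modprobe rule will not help, consider a seccomp-based block" ;;
    loaded)     echo "hint: the module is already loaded, blocking future loads alone is not enough" ;;
    not-loaded) echo "hint: a modprobe install rule can block loading" ;;
esac
```

Passing the file paths as arguments keeps the classification independent of the host it runs on, which makes it easy to try out against sample files first.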
FAQ on Copy Fail (CVE-2026-31431) in Kubernetes
Is Copy Fail (CVE-2026-31431) remotely exploitable?
First and foremost, it is a Local Privilege Escalation (LPE). It becomes "remote" indirectly as soon as attackers can execute code somewhere (e.g. via a compromised service, a CI job or a container/pod) - then the step to root can follow very quickly.
Does Copy Fail (CVE-2026-31431) affect Kubernetes/containers in particular?
Yes - especially where many workloads run on the same nodes. In multi-tenant or "untrusted workload" scenarios, an LPE in the node context can be much more serious.
What is the fastest workaround against Copy Fail (CVE-2026-31431)?
If algif_aead can be loaded as a module (not built-in), blocking the load is typically the fastest way (e.g. install algif_aead /bin/false). If it is built-in, a seccomp approach (blocking socket(AF_ALG)) is often more practical.
Does mitigation for Copy Fail (CVE-2026-31431) replace patching?
No. Mitigation reduces risk immediately, but is no substitute for a kernel fix. As soon as fixes/images are available, a planned patch/rollout window should follow.
Who can quickly mitigate Copy Fail (CVE-2026-31431) in heterogeneous Linux and Kubernetes fleets?
This is exactly where managed services can help: Claranet can support you with assessment, automation (Ansible/cluster rollouts), Kubernetes hardening (Seccomp/Policy) and downstream patch management as an end-to-end process - including monitoring and coordinated customer communication.
