
Taming OpenShift Audit with RHACS

As a kid I loved a book called 'Two Minute Mysteries', by Donald J. Sobol. It was a collection of short stories, usually just over a page long, where the reader pretended to be 'Watson' to the detective 'Dr Haledjian'. He was able to rapidly discern whether someone was lying from a minute detail, and the reader was left to wonder how. I can vividly remember the cover.

mysteries

I've unfortunately lost the book, but I've tried to recreate the style here (and yes, he seemed to spend a lot of time in bars):

Dr Haledjian visits a bar on his way back to the office. Inside, a bronzed adventurer, recently shaved, is seated at a table.

"Where brings you to town?", asked Dr Haledjian.

"I just got back yesterday from San Jose.", replied the adventurer. "I was out there three months digging out a gold seam I found last year. My beard was a foot long, and I just shaved yesterday. I ran out of funding to get the gold back. If you can front me the $2500, I'll give you half the gold". He scratched his bronzed chin and sank into his beer.

"I wouldn't give you $2500 if my life depended on it," responded Dr Haledjian.

How did Dr Haledjian know the man was lying?

I've probably made the clues way-y-y too obvious here. The man scratched his "bronzed chin" - yet he claimed to have spent three months at the gold mine and only shaved yesterday. His chin wouldn't be bronzed after a single day; the skin under a foot-long beard would be pasty.

Weirdly, I'm fairly certain that this childhood interest in detective short stories has translated into an adulthood interest in audit logs. Reading audit and event logs is just like a great detective novel - why did they access that file? Is that normal? Is this a legitimate system administrator, or a threat actor trying to evade detection?

Audit and event logging is essential to threat hunting. Many threat actors now attempt to "live off the land", abusing administrative tools to perform their objectives. Strategies like malware signature detection and application allow-listing have no impact on risk mitigation for these techniques, as the threat actors are using allow-listed administrative tools.

Really, the only way to identify these threat actors is through comprehensive auditing - distinguishing a "normal" administrative action from a malicious threat actor "living off the land". One of my favourite security controls available in Red Hat Enterprise Linux is tlog, a system for terminal session recording. It allows administrators to 'replay' a user's session - seeing the commands they typed, the typos they made, and the sequence of actions as they took place. If you want to try it out, there's a hands-on lab here, and you can see a video here.

Context is hugely important in identifying whether an action on a system is a threat actor trying to evade detection, or a legitimate system administrator. That's exactly what tlog delivers - a 'replayable' view of the user's session, giving me the context to understand their actions, whether they're applying a patch (hopefully automated with Ansible) or troubleshooting a failed process. The reason for the user making changes becomes readily apparent when I view the entire session.

Linux is pretty straightforward, because I can capture user sessions with tlog. But what about Kubernetes? Kubernetes doesn't natively support the same 'replayable' user sessions. Users instead interact with the API via command-line tools, and systems and services also interact with the API to support things like monitoring and workload scheduling. All of these API interactions create a tremendous volume of audit and event data, which needs to be indexed and searched for anomalous user interactions.

Ideally I want to:

  • Create alerts from Kubernetes audit and event records; and
  • If possible, verify user actions using a 'replayable' session

One of the difficult things about audit and other event logging is that it's expensive to keep logs "indexed". If I want to search for an event or a user action, the data needs to be kept tokenized and indexed for search. Platforms like Splunk and Elasticsearch excel at this - creating alerts from indexed data, and making it accessible to security teams.

You can check out how much data is generated by inspecting the OpenShift nodes. Audit logs are stored in /var/log/kube-apiserver/, and are rotated when the files reach 200M:

#  ls -lh /var/log/kube-apiserver/
total 2.3G
-rw-------. 1 root root 200M Oct 13 22:15 audit-2024-10-13T22-15-57.009.log
-rw-------. 1 root root 200M Oct 13 22:47 audit-2024-10-13T22-47-58.094.log
-rw-------. 1 root root 200M Oct 13 23:20 audit-2024-10-13T23-20-20.061.log
-rw-------. 1 root root 200M Oct 13 23:52 audit-2024-10-13T23-52-47.573.log
-rw-------. 1 root root 200M Oct 14 00:25 audit-2024-10-14T00-25-31.483.log
-rw-------. 1 root root 200M Oct 14 00:56 audit-2024-10-14T00-56-29.039.log
-rw-------. 1 root root 200M Oct 14 01:28 audit-2024-10-14T01-28-19.076.log
-rw-------. 1 root root 200M Oct 14 02:02 audit-2024-10-14T02-02-04.734.log
-rw-------. 1 root root 200M Oct 14 02:32 audit-2024-10-14T02-32-43.245.log
-rw-------. 1 root root 200M Oct 14 03:04 audit-2024-10-14T03-04-39.250.log
-rw-------. 1 root root 192M Oct 14 03:36 audit.log
-rw-------. 1 root root 1.8M Oct 13 11:21 termination.log

You can see here that my Single Node OpenShift (SNO) cluster generates 200MB of audit logs from the Kubernetes API server approximately every 30 minutes. This cluster has only one control-plane node, one worker (the same node), one user / platform engineer (me), and practically no workloads - yet it still produces ~10GB a day, or roughly 300GB a month, in audit logs from normal platform interactions alone.

Data ingest costs are difficult to predict, but they can range anywhere from $200 to $1800 per GB. This means I could be spending anywhere from $60,000 to $540,000 per month just to keep the audit logs from my Single Node OpenShift cluster indexed and searchable - and it only scales up from there.

Clearly this is a very expensive option. So how do I better support audit investigations with OpenShift?

Disclaimer

The information contained in this article is intended to encourage innovation in security architectures, and ways of managing security event-related data. It may not meet the needs of your organisation. You should carefully understand your own risk management needs before making any significant changes to platforms and systems.

Red Hat Advanced Cluster Security for Kubernetes (RHACS) and Audit

I'd argue there's a better way. We don't need to keep all of this Kubernetes event and audit data immediately searchable if we can create immediate alerts, and then search the data later for more detail. This is exactly the approach that Red Hat Advanced Cluster Security for Kubernetes takes with OpenShift. I've covered RHACS in a few articles before, looking at the new RHACS scanner, "living off the land", Sigstore support and CVE data, and I think it's worth revisiting how RHACS integrates with OpenShift, and particularly audit logs.

RHACS uses a hub-and-spoke model to secure Kubernetes platforms, including OpenShift. It comprises 'Central', supporting the vulnerability scanner, API, UI, etc., and 'Secured Cluster Services', comprising an eBPF probe, admission controller, and audit integrations.

acs-image1

RHACS integrates with OpenShift audit through a simple Go-based file reader, included with the RHACS Secured Cluster Services. You can see the code here:

// Reader provides functionality to read, parse and send audit log events to Sensor.
type Reader interface {
	// StartReader will start the audit log reader process which will continuously read and send events until stopped.
	// Returns true if the reader can be started (log exists and can be read). Log file missing is not considered an error.
	StartReader(ctx context.Context) (bool, error)
	// StopReader will stop the reader if it's started.
	StopReader()
}

Each audit event is captured and sent to the Sensor, to be relayed to RHACS Central. You can see data collected for each audit event in the StackRox source:

type auditEvent struct {
	Annotations              map[string]string `json:"annotations"`
	APIVersion               string            `json:"apiVersion"`
	AuditID                  string            `json:"auditID"`
	Kind                     string            `json:"kind"`
	Level                    string            `json:"level"`
	ObjectRef                objectRef         `json:"objectRef"`
	RequestReceivedTimestamp string            `json:"requestReceivedTimestamp"`
	RequestURI               string            `json:"requestURI"`
	ResponseStatus           responseStatusRef `json:"responseStatus"`
	SourceIPs                []string          `json:"sourceIPs"`
	Stage                    string            `json:"stage"`
	StageTimestamp           string            `json:"stageTimestamp"`
	User                     userRef           `json:"user"`
	ImpersonatedUser         *userRef          `json:"impersonatedUser,omitempty"`
	UserAgent                string            `json:"userAgent"`
	Verb                     string            `json:"verb"`
}
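
Each line in a node's audit log is one JSON event carrying exactly these fields, so you can get a feel for the data before it ever reaches RHACS. Here's a rough, illustrative one-liner to summarise who is doing what, and from where - assuming jq is available, run as root on a control-plane node:

# jq -r '[.user.username, .verb, (.objectRef.resource // "-"), .sourceIPs[0]] | @tsv' \
    /var/log/kube-apiserver/audit.log | sort | uniq -c | sort -rn | head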

You can see this data-flow represented here:

acs-image2

There's interesting information captured here which we can use to generate Red Hat Advanced Cluster Security for Kubernetes (RHACS) alerts. I can capture the fact that a user action is impersonated - this could potentially represent a user evading threat detection, and trying to disguise their actions as another platform user. I can also capture the source IP, which can be really useful if users are supposed to administer OpenShift via a 'bastion host' - I'll cover this example a bit later.

Currently Red Hat Advanced Cluster Security for Kubernetes (RHACS) supports a limited number of high-interest audit events, which are also shown in the StackRox source:

var (
	// The audit logs report the resource all as one word, but the k8s event object (and elsewhere) uses underscore
	auditResourceToKubeResource = map[string]storage.KubernetesEvent_Object_Resource{
		"pods_exec":                  storage.KubernetesEvent_Object_PODS_EXEC,
		"pods_portforward":           storage.KubernetesEvent_Object_PODS_PORTFORWARD,
		"secrets":                    storage.KubernetesEvent_Object_SECRETS,
		"configmaps":                 storage.KubernetesEvent_Object_CONFIGMAPS,
		"clusterroles":               storage.KubernetesEvent_Object_CLUSTER_ROLES,
		"clusterrolebindings":        storage.KubernetesEvent_Object_CLUSTER_ROLE_BINDINGS,
		"networkpolicies":            storage.KubernetesEvent_Object_NETWORK_POLICIES,
		"securitycontextconstraints": storage.KubernetesEvent_Object_SECURITY_CONTEXT_CONSTRAINTS,
		"egressfirewalls":            storage.KubernetesEvent_Object_EGRESS_FIREWALLS,
	}
)

Though this is only a subset of audit events, they represent a lot of the high-interest events I want to investigate on a cluster.

  • pods_exec: Records user attempts to exec into a container, which could represent a user trying to access sensitive data inside the pod or make changes.

  • pods_portforward: Records attempts to forward ports inside a pod. Potentially represents efforts to exfiltrate data, or redirect pod traffic to create alternate C2 channels.

  • secrets: Records user attempts to access secrets. This can be a really useful event to alert on, particularly when secrets are managed externally (and users shouldn't be creating or accessing secrets within the cluster).

  • configmaps: Records user attempts to access config maps. Similar to secrets, this records user attempts to access or update config maps, potentially representing attempts to weaken an application configuration.

  • clusterroles: Records attempts to create Kubernetes cluster roles.

  • clusterrolebindings: Records attempts to create Kubernetes cluster role bindings. This could represent user attempts to increase their privileges on a cluster, or grant additional privileges to service accounts.

  • networkpolicies: Records attempts by users to make changes to network policies, which could represent attempts to exfiltrate data, or expose ports for alternate C2.

  • securitycontextconstraints: (OpenShift-specific) Records attempts to modify OpenShift security context constraints (SCCs), which control the privileges for a workload.

  • egressfirewalls: (OpenShift-specific) Records attempts to modify OpenShift egress firewalls, which may represent attempts to exfiltrate data or communicate with C2.
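
A quick way to spot one of these events in the raw logs, before RHACS is even involved: pods_exec, for example, appears in the audit log as resource pods with subresource exec, which RHACS joins with an underscore (per the comment in the source above). An illustrative jq filter, under the same assumptions as before (run as root on a control-plane node, jq installed):

# jq -r 'select(.objectRef.resource == "pods" and .objectRef.subresource == "exec")
    | [.stageTimestamp, .user.username, .objectRef.namespace, .objectRef.name] | @tsv' \
    /var/log/kube-apiserver/audit.log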

These events are exposed to Red Hat Advanced Cluster Security for Kubernetes (RHACS) users via Runtime policies.

RHACS Central also maintains the state of audit collection from a node. You can see this during the sensor start-up:

common/config: 2024/10/13 07:24:20.800971 handler.go:78: Info: Received audit log sync state from Central: {
    "nodeAuditLogFileStates": {
        "sno.blueradish.net": {
            "collectLogsSince": "2024-10-13T07:23:37.249933Z",
            "lastAuditId": "12fa0ab3-11f6-48e5-976e-072aac05f72c"
        }
    }
}
common/compliance: 2024/10/13 07:24:20.801099 auditlog_manager_impl.go:232: Info: Sending start audit log collection message to node sno.blueradish.net
common/compliance: 2024/10/13 07:24:20.801260 auditlog_manager_impl.go:245: Info: Start message to node sno.blueradish.net contains state {
    "collectLogsSince": "2024-10-13T07:23:37.249933Z",
    "lastAuditId": "12fa0ab3-11f6-48e5-976e-072aac05f72c"
}

This is persisted to the RHACS Central database, and collection can be restarted in the event of a node outage.
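
If you want to watch this handshake yourself, the Sensor logs are the place to look. A sketch, assuming RHACS Secured Cluster Services are deployed to the default stackrox namespace:

$ oc -n stackrox logs deployment/sensor | grep -i "audit log"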

Creating audit event alerts in RHACS

Let's look at an example for a secret located inside the cluster. I've created a secret inside the test-audit namespace on OpenShift, called sensitive-data.

test-secret

Now I can create a policy to monitor access to this secret.

test-policy

For the policy behaviour you can select 'Runtime' and 'Audit logs' as the event sources. Note that you can only select 'Inform' for the response method - there are no automated enforcement actions available for audit log events.

test-inform

The policy criteria are where this gets interesting. Here I want to alert on the Kubernetes secret named sensitive-data, and on all API verbs representing user actions (GET / UPDATE / DELETE / PATCH / CREATE). Optionally, I could also check whether the user is impersonating another user, or check the source IP address, but that isn't required for this quick example.

test-pol1
test-pol2

We want to generate alerts specifically for the test-audit namespace, so add an inclusion scope in the next screen.

test-polscope

Finally, review the policy and Save.

test-polrev1
test-polrev2

Now that this policy is created and enabled, we can try it out. Head to OpenShift and view the sensitive-data secret in the console. You should see that a violation is created inside RHACS with the details of the user:

test-alert
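
Viewing the secret in the console issues a GET against the Kubernetes API, but you don't need the console at all - the equivalent CLI request triggers the same violation, which is handy when testing policy changes:

$ oc -n test-audit get secret sensitive-data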

Let's break down this alert in more detail:

  • The user who accessed this secret was kube:admin, who is a member of the system:cluster-admins group, and is authenticated to OpenShift.
  • The user accessed the secret, but did not perform any modifications.
  • The user accessed the secret via an IPv6 address (this is a dual-stack IPv4 / IPv6 cluster) from a Windows-based Chrome browser.

Quickly we can start to build a picture of user activity. This user is clearly one of the platform administrators, who accessed the secret from within the organisation (based on the IPv6 address) and did not make any modifications. I would likely follow this up with the user directly, and see why they needed access to this secret.

Alerting on admin privilege assignment

What about another example? Let's look at a user granting admin privileges via a ClusterRoleBinding for cluster-admin.

OpenShift provides a large number of ClusterRoles out of the box. A ClusterRole is a special role in Kubernetes (and OpenShift) in that it's not namespaced. Usually when you create a Kubernetes RBAC Role, you provide a namespace; a ClusterRole instead has permissions across all namespaces - not just one. Kubernetes ClusterRoles and Roles are additive: I can't deny access to resources, but I can allow access by adding resources into the role.

So, the roles we are most interested in are:

  • ClusterRoles that have permissions across all namespaces; and
  • Roles with additive permissions to lots of resources.

It turns out there is one such role provided with every OpenShift cluster by default - cluster-admin. If you take a look at the cluster-admin ClusterRole, you can see the huge number of permissions this role grants:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: cluster-admin
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
- nonResourceURLs:
  - '*'
  verbs:
  - '*'

Because these policies are additive we can't explicitly 'deny' access to resources. Instead, the most privileged roles use wildcards to denote access to many resources, or 'verbs' (actions users can perform on Kubernetes API objects). So, the cluster-admin role can:

  • Access any namespace in the cluster
  • Access any and all resources in any namespace
  • Perform any actions on those resources (list, get, delete, update, patch, etc)
  • Access any URLs not associated with resources, such as /healthz
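
You can sanity-check just how sweeping this is with impersonation - and note that impersonation is itself captured in the audit log, via the impersonatedUser field we saw earlier. An illustrative check (user1 is a hypothetical user; this only returns 'yes' once the user is actually bound to cluster-admin):

$ oc auth can-i '*' '*' --as=user1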

This is definitely an OpenShift ClusterRole that we want to monitor, and immediately investigate when it is associated with an account. We can do that by monitoring for ClusterRoleBindings, which associate a ClusterRole with a user account (or a service account).
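
Before alerting on new bindings, it's worth baselining what already grants cluster-admin on your cluster. One illustrative way, using standard oc output formatting:

$ oc get clusterrolebindings -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name,SUBJECTS:.subjects[*].name' | grep -w cluster-admin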

The lead-up is the same: create a policy and select 'Audit logs' as the runtime event source. I'm going to call this policy 'cluster-admin access created'.

cadm-pol1
cadm-pol2

Usually we'd use oc adm policy add-cluster-role-to-user to add cluster-admin privileges to a user. When this command is run it creates a new ClusterRoleBinding named for the role, e.g. cluster-admin-0 or cluster-admin-1. We can use this fact to create a policy based on the resource name (note the use of r/cluster-admin.* to invoke regex pattern matching with 'Kubernetes resource name'):

cadm-pol3
cadm-pol4
cadm-pol5

Let's test this out and make sure it triggers the policy. Create a new ClusterRoleBinding for the cluster-admin role and a user.

$ oc create user user1 --full-name "User One"
$ oc adm policy add-cluster-role-to-user cluster-admin user1
clusterrole.rbac.authorization.k8s.io/cluster-admin added: "user1"
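
Behind the scenes this created a ClusterRoleBinding whose generated name should match our r/cluster-admin.* pattern - you can confirm with:

$ oc get clusterrolebindings | grep '^cluster-admin-'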

Great! This has triggered the policy on my cluster and created alerts.

cadm-alert1
cadm-alert2

This isn't going to capture all instances of associating cluster-admin to a user. For example, if we ran oc adm policy add-cluster-role-to-user --rolebinding-name=my-new-role, it wouldn't trigger this policy. We could also create a ClusterRoleBinding directly via the API, expressed as JSON or YAML, and change the name. Again, this won't trigger the alert. But, when this is triggered, we can have high confidence it is a true-positive cluster-admin role-binding association created interactively via the oc CLI.
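
For illustration, either of the following would create an equivalent binding whose name evades the r/cluster-admin.* criterion (my-new-role is an arbitrary, hypothetical name):

$ oc adm policy add-cluster-role-to-user cluster-admin user1 --rolebinding-name=my-new-role
$ oc create clusterrolebinding my-new-role --clusterrole=cluster-admin --user=user1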

Using terminal session recording to record interactive OpenShift sessions

There's one more example I want to cover in this article: how I can provide context for user actions with terminal session recording. At the moment I have alerts on anomalous events - but I don't have any context without searching through OpenShift audit data. This is where terminal session recording can really help. I'm going to create a bastion host that admin users will use to interactively make changes to the platform. This bastion host will be hardened with fapolicyd for application allow-listing, and user sessions will be recorded using tlog. If users make changes to sensitive objects without using the bastion host, I'm going to flag this immediately and investigate.

We'll need to create a couple of policies - one which flags high-interest activity (like cluster-admin role assignments) and verifies these actions are performed from the bastion host, and another which flags any activity on high-interest API objects that is not performed from the bastion host.

Building the bastion host

Let's first build a Red Hat Enterprise Linux (RHEL) 9 bastion host. I'm going to install fapolicyd, tlog and cockpit-session-recording, and enable the RHEL web console so that I can record user sessions.

sudo dnf install fapolicyd tlog cockpit-session-recording
sudo systemctl enable fapolicyd cockpit.socket --now

You can download and install the OpenShift command-line client from mirror.openshift.com - I'm going to use the 4.17.2 client for my cluster.

curl https://mirror.openshift.com/pub/openshift-v4/amd64/clients/ocp/4.17.2/openshift-client-linux.tar.gz | sudo tar -xz -C /usr/local/bin/

If you try to run oc as a non-root user on the system at this point you'll see an error:

$ oc
-bash: /usr/local/bin/oc: Operation not permitted

This is because fapolicyd is blocking the oc client, as it hasn't been installed through RPM (the RPM database is the default source of trust). Let's add oc to the ancillary trust database supported by fapolicyd:

# fapolicyd-cli --file add /usr/local/bin/oc
# fapolicyd-cli --update
Fapolicyd was notified

Now we should be able to run oc as a user:

user@bastion-host $ oc  
OpenShift Client

This client helps you develop, build, deploy, and run your applications on any
OpenShift or Kubernetes cluster. It also includes the administrative
commands for managing a cluster under the 'adm' subcommand.

Basic Commands:
  login             Log in to a server
  new-project       Request a new project
  new-app           Create a new application
...

Configure terminal session recording

Now we have fapolicyd only allowing trusted binaries (like oc) on the bastion host, but we also need to capture user sessions via tlog. First, log in to the RHEL web console. You can access this at <hostname or ip>:9090 by default:

web-console1

Select the Session Recording menu and ensure that tlog is configured to record all user sessions, and to also record user input and output:

web-console2
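
Behind the scenes, the web console manages session recording through SSSD. On my RHEL 9 host the resulting configuration looks like this (the exact path and contents may vary with your setup):

$ cat /etc/sssd/conf.d/sssd-session-recording.conf
[session_recording]
scope = all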

Save the tlog settings and sssd will restart. Let's now log in to the host and generate some traffic:

$ ssh user@host
$ ls -l /var/log
...
$ exit

You should now see that a user session has been recorded and is visible in the web console:

web-console3
web-console4

Great! Our bastion host is set up and ready to capture OpenShift admin user sessions. Let's modify the cluster-admin access created policy above to specify the IP address of the bastion host. This will allow us to distinguish admin actions performed from the bastion host, which I'm slightly less worried about, from admin actions performed on other hosts.

pol-ip1

Let's check that the complete workflow works. Log in to the bastion host and assign cluster-admin privileges to a user, as above:

$ oc create user user1 --full-name "User One"
$ oc adm policy add-cluster-role-to-user cluster-admin user1
clusterrole.rbac.authorization.k8s.io/cluster-admin added: "user1"

I should see that an alert is generated in the RHACS Violations view showing that this change has been made from the bastion host.

pol-ip2

Now the big test - can I capture the complete user session from tlog? You can check this by accessing the web console for the bastion host.

pol-ip3

Success! Now I have the event alert, and importantly I have context - I can see exactly which oc commands the user ran on the bastion host.
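
The web console isn't the only way to replay a session - tlog-play can read recordings straight from the systemd journal if you have the recording ID (shown in the web console, or discoverable with journalctl). A sketch, where <recording-id> stands in for the TLOG_REC value of the session:

$ tlog-play -r journal -M TLOG_REC=<recording-id>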

The last step here is creating a policy that alerts on any high-interest audit events that are not performed from the bastion host. I've called this policy 'Admin action from non-bastion host' and flagged it as a 'critical' severity. You can see the policy criteria here:

nonip-pol1
nonip-pol2
nonip-pol3

Let's try it out. From a different host let's make the same change:

$ oc create user user1 --full-name "User One"
$ oc adm policy add-cluster-role-to-user cluster-admin user1
clusterrole.rbac.authorization.k8s.io/cluster-admin added: "user1"

You should now see an alert that this action has been performed from a non-bastion host (in this case from an IPv6-capable host on the network, to an IPv6-capable OpenShift cluster).

nonbastion1

Not the end of the story

OpenShift event logs generate a lot of data - 200M in a 30-minute period, on a single-node cluster with practically no workloads. Creating alerts from these event logs with Red Hat Advanced Cluster Security for Kubernetes (RHACS) helps to manage this volume of data, and raise actionable alerts.

The last example here demonstrated that this process can't capture all events, and it certainly doesn't meet the requirement to store OpenShift event logs for archival. In another article in this series I'll look at how to complement RHACS alerts with limited audit log indexing and comprehensive archival. This workflow also doesn't take into account GitOps-managed changes - it's mainly auditing and alerting on interactive user sessions. In a future article I'll also look at how we can better audit and alert on GitOps-managed clusters, and understand user changes more effectively.

If you want to try this out in your own cluster, I've added these policies to the StackRox contributions repository.