Memory Cgroup: the initial kernel memory cgroup controller (v3.8)


In this post, I try to describe the high-level idea without going into too much implementation detail.

Note: I’ll briefly discuss the initial kernel memory cgroup controller. Since v5.9 the Linux kernel has used a completely redesigned slab memory controller (built on the obj_cgroup API), and I’ll cover that later.

Motivation

The main job of the memory cgroup controller is to (1) track and (2) limit the amount of memory a cgroup allocates. If a process can trigger the allocation of some type of memory that the memcg controller does not track, that is a flaw. And yes, the "type of memory" I'm talking about is kernel memory!

A user process can cause certain types of kernel memory to be allocated by invoking system calls: dentry and inode caches when accessing files, for example, and task_struct and kernel stacks when creating processes. There are many more examples, but I won't list them all. ;)

Without tracking kernel memory allocations, a malicious user could increase memory usage well beyond the memcg limit. This led Glauber Costa at Parallels to add support for the kernel memory controller in 2012.

Design

The initial kmem controller tracks kernel memory in two categories: (1) slab allocations and (2) non-slab kmem allocations. Among non-slab allocations, the initial patch series only tracks kernel stacks. To track kernel allocations, each allocation must be mapped back to the memory cgroup that was charged for it, because the controller needs to uncharge that memcg when the memory is freed. Storing this mapping is straightforward: struct page_cgroup (which extends struct page with memcg-specific metadata) already has a field that records the memcg pointer.

Tracking non-slab allocation

Tracking kernel allocations that are not from slab is relatively simple: pass the __GFP_KMEMCG flag (later renamed to __GFP_ACCOUNT) to the page allocator when allocating memory. The page allocator then does the following:

  1. Check if we’ll exceed the memcg limit after serving this allocation. If yes, try to reclaim memory first.

  2. Charge memory and allocate a page. If the allocation fails, uncharge the memory and return NULL.

  3. If both charging and allocation succeeded, record the pointer to the memcg in the struct page_cgroup.

Pretty simple, right? Uncharging is also straightforward:

  1. Look up the struct page_cgroup for the struct page,

  2. Read the pointer to struct mem_cgroup, and

  3. Uncharge memory.
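
The charge and uncharge steps above can be sketched as a small userspace model. This is not kernel code: the class and function names (MemCgroup, Page, alloc_kmem_page, free_kmem_page), the 4 KiB page size, and the simplistic reclaim() are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

PAGE_SIZE = 4096  # bytes; assumed page size for this model


@dataclass
class MemCgroup:
    """Stand-in for struct mem_cgroup: a usage counter with a hard limit."""
    limit: int
    usage: int = 0
    reclaimable: int = 0  # part of `usage` that the toy reclaim() can free

    def reclaim(self, needed: int) -> None:
        # Hypothetical reclaim: free up to `needed` bytes of reclaimable memory.
        freed = min(needed, self.reclaimable)
        self.reclaimable -= freed
        self.usage -= freed

    def try_charge(self, size: int) -> bool:
        # Charging step 1: if the charge would exceed the limit, reclaim first.
        over = self.usage + size - self.limit
        if over > 0:
            self.reclaim(over)
        if self.usage + size > self.limit:
            return False
        self.usage += size
        return True

    def uncharge(self, size: int) -> None:
        self.usage -= size


@dataclass
class Page:
    """Stand-in for struct page together with its struct page_cgroup."""
    memcg: Optional[MemCgroup] = None  # the memcg pointer kept in page_cgroup


def alloc_kmem_page(memcg: MemCgroup,
                    page_allocator: Callable[[], Optional[Page]] = Page
                    ) -> Optional[Page]:
    # Charging step 2: charge, then allocate; undo the charge on failure.
    if not memcg.try_charge(PAGE_SIZE):
        return None
    page = page_allocator()
    if page is None:
        memcg.uncharge(PAGE_SIZE)
        return None
    # Charging step 3: remember which memcg paid for this page.
    page.memcg = memcg
    return page


def free_kmem_page(page: Page) -> None:
    # Uncharging steps 1-3: find the owner recorded for the page, uncharge it.
    memcg = page.memcg
    if memcg is not None:
        memcg.uncharge(PAGE_SIZE)
        page.memcg = None
```

In this model, a cgroup with a limit of two pages can allocate two pages, a third allocation returns None, and freeing a page drops the usage by PAGE_SIZE again. Passing a page_allocator that returns None exercises the "allocation failed, uncharge" path.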

Tracking slab memory allocation

Tracking slab allocations is a bit trickier. As we discussed earlier, kmem tracking is done on a per-page basis because we use the struct page_cgroup to find the memory cgroup that allocated the memory. struct page_cgroup is mapped 1:1 with the struct page, which holds the metadata for a page. However, in most slab caches, the object size is smaller than the page size (PAGE_SIZE), so a single slab page holds many objects. Since one struct page_cgroup can record only one memcg, the slab allocator must ensure that objects belonging to different memory cgroups are never allocated from the same slab.

The slab allocator solves this problem simply by creating a separate slab cache for each memory cgroup. For a slab cache X, there is a copy of the cache for each memory cgroup. When a task allocates an object from X, the allocator looks up the copy that belongs to the task's memory cgroup and allocates the object from it.

With this approach, the kmem controller only tracks the allocation of slab pages (using the method described in the “Tracking non-slab allocation” section).
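
A minimal sketch of the per-cgroup cache lookup, again as a toy userspace model rather than kernel code. The names KmemCache and PerMemcgCaches are hypothetical, cgroups are identified by plain strings, objects are never freed, and the model creates a copy the first time a cgroup allocates, which is a simplification chosen for this sketch.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes; assumed page size, one page per slab


class KmemCache:
    """Toy slab cache: counts the objects handed out and the pages backing them."""

    def __init__(self, name, object_size):
        self.name = name
        self.objs_per_page = PAGE_SIZE // object_size
        self.pages = 0
        self.allocated = 0

    def alloc(self):
        """Allocate one object; return True if a new slab page was needed."""
        need_page = self.allocated % self.objs_per_page == 0
        if need_page:
            self.pages += 1
        self.allocated += 1
        return need_page


class PerMemcgCaches:
    """Cache X plus one private copy of it per memory cgroup."""

    def __init__(self, name, object_size):
        self.name = name
        self.object_size = object_size
        self.copies = {}                 # memcg name -> KmemCache copy
        self.charged = defaultdict(int)  # memcg name -> bytes charged

    def alloc(self, memcg):
        # Look up the copy for the allocating cgroup (created on first use here).
        cache = self.copies.get(memcg)
        if cache is None:
            cache = KmemCache(f"{self.name}({memcg})", self.object_size)
            self.copies[memcg] = cache
        # Only whole slab pages are charged, never individual objects.
        if cache.alloc():
            self.charged[memcg] += PAGE_SIZE
        return cache
```

Two allocations from cgroup "A" land in the same copy and are covered by a single charged page, while one allocation from cgroup "B" gets its own copy and its own page. Every page therefore has exactly one owner, which is what per-page tracking requires.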

Recent changes

This is a list of changes that I find interesting since the initial adoption of the kmem controller. The list will likely grow as I learn more about it. ;)

New slab memory controller in v5.9

The initial kmem controller worked well for almost a decade after it was merged into the mainline. However, as it saw more production use, it became clear that creating a slab cache for each memory cgroup wastes memory because the resulting slabs are poorly utilized.
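
To get a feel for the utilization problem, here is a back-of-the-envelope calculation. The workload numbers are invented for illustration, and the model assumes one 4 KiB page per slab and ignores slab metadata. The helper names slab_pages and utilization are hypothetical.

```python
import math

PAGE_SIZE = 4096  # bytes; assumed page size, one page per slab


def slab_pages(objects_per_cgroup, object_size, shared):
    """Slab pages needed to hold every cgroup's objects.

    shared=False: each cgroup has its own cache copy, so pages are never shared.
    shared=True:  objects from all cgroups may be packed into the same pages.
    """
    objs_per_page = PAGE_SIZE // object_size
    if shared:
        return math.ceil(sum(objects_per_cgroup) / objs_per_page)
    return sum(math.ceil(n / objs_per_page) for n in objects_per_cgroup)


def utilization(objects_per_cgroup, object_size, shared):
    """Fraction of the allocated slab memory that actually holds objects."""
    used = sum(objects_per_cgroup) * object_size
    return used / (slab_pages(objects_per_cgroup, object_size, shared) * PAGE_SIZE)
```

With 100 cgroups that each hold three 256-byte objects, slab_pages([3] * 100, 256, shared=False) comes to 100 pages, while the shared case needs only 19. In this model the per-cgroup layout keeps under a fifth of the slab memory in use.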

Tracking memory cgroups per page is the root of the problem: slab pages can't be shared between memory cgroups, so many of them stay mostly empty. To address this, Roman Gushchin, a Facebook engineer at the time, introduced a way to track the memory cgroup of each slab object individually, using the new obj_cgroup API.
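
The idea of per-object tracking can be sketched as follows. This is a conceptual model only and does not reproduce the obj_cgroup API: the SharedSlabPage class, the per-slot owner list, the string cgroup names, and charging by object size are all assumptions of this sketch.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


class SharedSlabPage:
    """One slab page whose object slots may belong to different cgroups."""

    def __init__(self, object_size):
        self.object_size = object_size
        # One owner entry per object slot instead of one owner per page.
        self.owners = [None] * (PAGE_SIZE // object_size)

    def alloc(self, memcg, usage):
        """Take the first free slot for `memcg`; return its index or None."""
        for slot, owner in enumerate(self.owners):
            if owner is None:
                self.owners[slot] = memcg
                # Charge only the bytes of this object to the owning cgroup.
                usage[memcg] = usage.get(memcg, 0) + self.object_size
                return slot
        return None  # page is full

    def free(self, slot, usage):
        """Release a slot and uncharge the cgroup recorded for it."""
        memcg = self.owners[slot]
        usage[memcg] -= self.object_size
        self.owners[slot] = None
```

Here cgroups "A" and "B" can both allocate from the same page, and each is charged only for its own objects. That sharing is what lets a single set of slab caches serve every cgroup.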

The kmem controller cannot be disabled separately anymore

One thing to note is that for a long time, it was possible to turn the kmem controller on or off while using the memory cgroup controller. However, since there were no valid use cases for tracking only user memory without also tracking user-triggered kernel allocations, it can no longer be disabled separately in recent kernel versions.