A Brief Look At Memory Cgroup Controller Introduced In The Linux Kernel v2.6.25
Introduction
Caution: This article only covers the memory cgroup in Linux kernel v2.6.25. Things may have changed a lot since then.
Control group, often abbreviated as cgroup, is a Linux kernel feature that allows managing resources (CPU, memory, network/storage bandwidth, etc.) among groups of processes. cgroup provides the illusion that each group is running on a separate machine with its own resources.
In this article, I'll focus on the memory cgroup controller (aka. memcg), which introduces a 'barrier' that limits the amount of memory a cgroup can use. Let's take a brief look at what the memory cgroup looked like at the very beginning.
Documentation/controllers/memory.txt [1] states that the important features of memory cgroup are:
- Enable control of both RSS (mapped) and Page Cache (unmapped) pages.
- The infrastructure allows easy addition of other types of memory to control.
- Provides zero overhead for non memory controller users.
- Provides a double LRU: global memory pressure causes reclaim from the global LRU; a cgroup on hitting a limit, reclaims from the per cgroup LRU.
Memory Charging APIs
The APIs are very simple! Call mem_cgroup_charge() when charging, and mem_cgroup_uncharge() to uncharge memory. When charging page cache pages, call mem_cgroup_cache_charge().
/*
* mem_cgroup_charge
* Charge pages to the mem_cgroup associated with the mm_struct.
* Charging may fail if the limit is exceeded and reclaim cannot free enough memory.
*/
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
/* Almost the same as mem_cgroup_charge, but for unmapped page cache pages. */
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
/* Uncharge never fails. */
void mem_cgroup_uncharge_page(struct page *page);
RSS and Page Cache Accounting
The Resident Set Size (RSS) of a process is the part of its memory that currently resides in RAM, i.e., how much of it actually sits in physical memory. Memory that has not been faulted in yet, has been swapped out to disk, or is used by the kernel is not part of the RSS. The Memory Cgroup Controller keeps track of the RSS for each group.
While memory mapped into a process's address space is counted as RSS, some file-backed memory is not mapped into the address space. For example, this happens when a process reads from or writes to a file using read() and write() instead of mmap(). Therefore, the Memory Cgroup Controller separately tracks the Page Cache (unmapped).
When a process allocates memory through page faults or read()/write() system calls, the memory cgroup controller tracks the memory usage of the memory cgroup the process belongs to. The Swap Cache (Unmapped) is not yet accounted for in version 2.6.25.
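These two classes show up internally as separate per-cpu statistics counters in the controller. Here is a rough sketch of the statistics structures, simplified from my reading of mm/memcontrol.c in v2.6.25 (names and layout are approximate, not a verbatim copy):
/* Rough sketch of the per-memcg statistics counters; simplified. */
enum mem_cgroup_stat_index {
	MEM_CGROUP_STAT_CACHE,	/* number of pages charged as page cache */
	MEM_CGROUP_STAT_RSS,	/* number of pages charged as RSS (mapped) */
	MEM_CGROUP_STAT_NSTATS,
};
struct mem_cgroup_stat_cpu {
	s64 count[MEM_CGROUP_STAT_NSTATS];
} ____cacheline_aligned_in_smp;
struct mem_cgroup_stat {
	struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
};
The charging code bumps either the cache or the RSS counter, depending on how the page was charged.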
Unmapped Page Cache Pages Accounting
Unmapped page cache pages are charged when the add_to_page_cache() function is called. If a page is allocated to the page cache and later mapped, it is not counted twice, because it will already have a page_cgroup by the time it is mapped. More details will be provided later in the article.
int add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & ~__GFP_HIGHMEM);
if (error)
goto out;
error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
if (error == 0) {
/* ... some filesystem magic ... */
} else
mem_cgroup_uncharge_page(page);
out:
return error;
}
The add_to_page_cache()
function calls mem_cgroup_cache_charge()
to check if more memory can be allocated. mem_cgroup_cache_charge()
returns zero if it successfully charges memory. If it fails, memory cannot be allocated, so it returns an error. This pattern is used by any code that charges pages.
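To make this pattern concrete, here is a minimal sketch of how a caller combines the charge and uncharge APIs. This is not kernel code; do_the_actual_work() is a hypothetical placeholder for whatever the caller wants to do after the charge succeeds.
/* Minimal sketch of the caller pattern; do_the_actual_work() is hypothetical. */
static int charge_then_do_work(struct page *page, struct mm_struct *mm,
			       gfp_t gfp_mask)
{
	int error;
	/* Ask memcg whether this page may be accounted before committing to it. */
	error = mem_cgroup_cache_charge(page, mm, gfp_mask);
	if (error)
		return error;	/* over the limit and reclaim did not help */
	error = do_the_actual_work(page);	/* hypothetical placeholder */
	if (error)
		mem_cgroup_uncharge_page(page);	/* roll the charge back on failure */
	return error;
}
This is exactly the shape of add_to_page_cache() shown above.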
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
* is safe. The caller must hold a write_lock on the mapping's tree_lock.
*/
void __remove_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
mem_cgroup_uncharge_page(page);
/* ... snip ... */
}
The unmapped page cache pages are then uncharged when they are evicted from the page cache by the __remove_from_page_cache()
function.
Originally, shmfs/tmpfs pages were charged through the page cache interface, but this caused a hang. A later commit fixed the issue by adding separate charging logic for shmfs/tmpfs to mm/shmem.c.
Unmapped swap cache pages used to be charged, but this was reverted because swapin readahead and swapoff often led to pages being charged to the wrong cgroup. In version 2.6.25, swap caches were not charged.
Mapped Pages Accounting
Mapped pages are charged when new pages are allocated to a process. The memory cgroup accounting commit adds a few mem_cgroup_charge() calls in several places to check the limit before actually allocating pages.
- do_anonymous_page() → Allocates anonymous pages to processes during a page fault.
- do_wp_page() → Breaks Copy-on-Write (CoW) and allocates a copy during a write page fault.
- do_swap_page() → Manages swap-in, bringing swapped pages from disk to memory.
- __do_fault() → Manages page faults for file mappings.
- remove_migration_pte() → Restores a migration page table entry to a working page table entry.
- insert_page() → Inserts a page into a Virtual Memory Area (VMA).
- unuse_pte() → Releases swap entries.
The pages are uncharged in several situations, which can be categorized as follows:
When a page is already charged, but some operation fails after charging.
- For example, if do_anonymous_page() successfully charges but then fails to allocate a page, the page is uncharged immediately.
When a page is unmapped (by page_remove_rmap()).
- Note that both anonymous and file pages should be unmapped before they are reclaimed.
When a page is charged multiple times.
- Shared pages are unconditionally charged on fault. However, they are uncharged if the page is already mapped to an address space.
Shared Page Accounting
As described above, pages shared between memory cgroups are accounted on a first-touch basis: only the first memory cgroup that touches the page is charged for it. You can see how the kernel charges only the first memory cgroup by looking at page_add_{anon,file}_rmap(): a page is unconditionally charged on a page fault, but if it turns out that the page is already mapped into an address space, the charge is immediately dropped again for every subsequent memory cgroup (yes, this is inefficient).
/**
* page_add_file_rmap - add pte mapping to a file page
* @page: the page to add the mapping to
*
* The caller needs to hold the pte lock.
*/
void page_add_file_rmap(struct page *page)
{
if (atomic_inc_and_test(&page->_mapcount))
__inc_zone_page_state(page, NR_FILE_MAPPED);
else
/*
* We unconditionally charged during prepare, we uncharge here
* This takes care of balancing the reference counts
*/
mem_cgroup_uncharge_page(page);
}
/**
* page_add_anon_rmap - add pte mapping to an anonymous page
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
* The caller needs to hold the pte lock and the page must be locked.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
if (atomic_inc_and_test(&page->_mapcount))
__page_set_anon_rmap(page, vma, address);
else {
__page_check_anon_rmap(page, vma, address);
/*
* We unconditionally charged during prepare, we uncharge here
* This takes care of balancing the reference counts
*/
mem_cgroup_uncharge_page(page);
}
}
For curious readers, this first-touch approach is the main reason for the dying cgroup problem [3]. This issue occurs because it's not efficient to recharge or reparent all previously charged pages when a memory cgroup is killed. Instead, it waits for all charged pages to be reclaimed before destroying the memory cgroup, resulting in thousands of zombie memory cgroups that are not reclaimed.
Reclaim
The basic idea behind memory cgroup reclamation is straightforward: the memory manager now maintains two different types of LRU lists. A charged page is on both the Global LRU and the Per-MEMCG LRU.
Global LRU
- A system-wide set of active and inactive lists, where all mapped and cached pages are inserted, regardless of whether they are owned by a memory cgroup.
Per-MEMCG LRU
- Active and inactive lists maintained separately for each memory cgroup, with a distinct set of lists for each zone.
A page is inserted into both the Global and the Per-MEMCG LRU (see the trimmed struct sketch after this list):
- A page is inserted into the Global LRU when it is mapped into an address space or added to the page cache, and removed when it is reclaimed. The lru field of struct page links the page to the list.
- A page is inserted into the Per-MEMCG LRU when it is charged to a memory cgroup, and removed when it is uncharged. The lru field of struct page_cgroup links the page to the list.
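Here is a heavily trimmed sketch of struct page with memcg enabled in v2.6.25, showing only the two fields relevant to this double linkage (most fields are omitted and the ordering is approximate); struct page_cgroup, which carries the second lru field, is shown in full in the next section.
/* Heavily trimmed sketch of struct page; most fields are omitted. */
struct page {
	/* ... */
	struct list_head lru;		/* links the page into the global per-zone LRU */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	unsigned long page_cgroup;	/* points to struct page_cgroup; low bit is a lock */
#endif
	/* ... */
};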
Closer Look At Memory Cgroup Data Structures
struct mem_cgroup
represents a memory cgroup. The memory cgroup controller tracks memory usage for each memory cgroup. struct mm_struct
includes a pointer to the memory cgroup it belongs to, if the process is part of a memory cgroup. struct page
, which you may know, represents a page frame. struct page_cgroup
links struct page
and struct mem_cgroup
and is allocated for each page that is tracked.
struct page_cgroup
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
* byte aligned (based on comments from Nick Piggin). But since
* bit_spin_lock doesn't actually set that lock bit in a non-debug
* uniprocessor kernel, we should avoid setting it here too.
*/
#define PAGE_CGROUP_LOCK_BIT 0x0
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
#else
#define PAGE_CGROUP_LOCK 0x0
#endif
/*
* A page_cgroup page is associated with every page descriptor. The
* page_cgroup helps us identify information about the cgroup
*/
struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
int ref_cnt; /* cached, mapped, migrating */
int flags;
};
#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
struct page_cgroup
is straightforward: it connects a page to a mem_cgroup and an LRU list (active or inactive). The memcg developers were careful not to increase the size of struct page
, as that's generally not a good practice. So, they created struct page_cgroup
as an extension to struct page
for memcg.
It has a reference count and two flags: PAGE_CGROUP_FLAG_CACHE
(indicating the page is charged as page cache, or RSS if not set) and PAGE_CGROUP_FLAG_ACTIVE
(indicating the page is on the active LRU list). The lower bit of the page→page_cgroup
field is used to synchronize access to page_cgroup
.
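The helpers that implement this bit-spinlock scheme are tiny. The following is a simplified sketch of the v2.6.25 helpers in mm/memcontrol.c, not a verbatim copy:
/* Simplified sketch of the page->page_cgroup lock helpers. */
static void lock_page_cgroup(struct page *page)
{
	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
static void unlock_page_cgroup(struct page *page)
{
	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
}
struct page_cgroup *page_get_page_cgroup(struct page *page)
{
	/* mask off the lock bit to recover the pointer */
	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
}
static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
{
	/* caller holds the bit spinlock; preserve the lock bit while storing */
	page->page_cgroup = (unsigned long)pc |
			    (page->page_cgroup & PAGE_CGROUP_LOCK);
}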
struct mem_cgroup
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
* to help the administrator determine what knobs to tune.
*
* TODO: Add a water mark for the memory controller. Reclaim will begin when
* we hit the water mark. May be even add a low water mark, such that
* no reclaim occurs from a cgroup at it's low water mark, this is
* a feature that will be implemented much later in the future.
*/
struct mem_cgroup {
struct cgroup_subsys_state css;
/*
* the counter to account for memory usage
*/
struct res_counter res;
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
struct mem_cgroup_lru_info info;
int prev_priority; /* for recording reclaim priority */
/*
* statistics.
*/
struct mem_cgroup_stat stat;
};
struct cgroup_subsys_state
represents the memory controller’s state within the cgroup hierarchy, linking the memory cgroup to the core cgroup infrastructure for managing hierarchy, lifetime, and resource accounting. It acts as a common interface for the cgroup core to handle the memory cgroup just like other controllers in a unified way.
struct res_counter
is a resource counter that contains three internal counters (usage, limit, failcnt) along with a spinlock. It tracks the amount of memory used by a memcg (usage), the maximum memory a memcg can use (limit), and the number of times it failed to allocate memory (failcnt) due to reaching the limit.
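For reference, the resource counter and its charge path look roughly like this. This is a simplified sketch based on my reading of include/linux/res_counter.h and kernel/res_counter.c in v2.6.25, not a verbatim copy:
/* Rough sketch of the resource counter used by memcg. */
struct res_counter {
	unsigned long long usage;	/* current memory usage of the memcg */
	unsigned long long limit;	/* maximum usage allowed */
	unsigned long long failcnt;	/* number of charges that hit the limit */
	spinlock_t lock;		/* protects the fields above */
};
static int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
	if (counter->usage + val > counter->limit) {
		counter->failcnt++;
		return -ENOMEM;
	}
	counter->usage += val;
	return 0;
}
int res_counter_charge(struct res_counter *counter, unsigned long val)
{
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&counter->lock, flags);
	ret = res_counter_charge_locked(counter, val);
	spin_unlock_irqrestore(&counter->lock, flags);
	return ret;
}
The failcnt counter is what lets an administrator see how often a cgroup bumped into its limit.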
The struct mem_cgroup_lru_info
contains a set of linked lists. Although I called it the Per-MEMCG LRU, the lists actually exist per zone. These Per-MEMCG LRU lists help reclaim memory when the limit is reached, making room so that new allocations can proceed without exceeding the limit.
struct mem_cgroup_per_zone {
/*
* spin_lock to protect the per cgroup LRU
*/
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long count[NR_MEM_CGROUP_ZSTAT];
};
struct mem_cgroup_per_node {
struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
};
struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
struct mem_cgroup_lru_info
is straightforward: it maintains a struct mem_cgroup_per_zone
for each memory zone of every NUMA node. Each mem_cgroup_per_zone
contains an active list, an inactive list, a spinlock for synchronized access to these lists, and counters for various statistics.
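The lookup from a page_cgroup to its per-zone LRU lists is a simple double indirection. Roughly (a simplified sketch; the real code goes through a couple of small helper functions):
/* Rough sketch of the lookup from a page_cgroup to its per-zone LRU lists. */
static struct mem_cgroup_per_zone *page_cgroup_zoneinfo(struct page_cgroup *pc)
{
	struct mem_cgroup *mem = pc->mem_cgroup;
	int nid = page_to_nid(pc->page);	/* NUMA node of the page */
	int zid = page_zonenum(pc->page);	/* zone index within that node */

	return &mem->info.nodeinfo[nid]->zoneinfo[zid];
}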
Memory Charging API implementation
Now that we've looked at its features and data structures, you should have a basic understanding of how memcg works. Let's dive into the implementation of the charging API (as of v2.6.25).
mem_cgroup_{,cache_}charge()
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_MAPPED);
}
int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask)
{
if (!mm)
mm = &init_mm;
return mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Both mem_cgroup_charge()
and mem_cgroup_cache_charge()
are wrappers for mem_cgroup_charge_common()
, but they pass different flags (MEM_CGROUP_CHARGE_TYPE_{MAPPED,CACHE}
). Additionally, mem_cgroup_cache_charge()
uses init_mm
(the struct mm_struct
of the swapper process, not the init process) when the mm
parameter is NULL
.
/*
* Charge the memory controller for page usage.
* Return
* 0 if the charge was successful
* < 0 if the cgroup is over its limit
*/
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, enum charge_type ctype)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
unsigned long flags;
unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup_per_zone *mz;
if (mem_cgroup_subsys.disabled)
return 0;
The memcg charging API does nothing when the feature is disabled.
/*
* Should page_cgroup's go to their own slab?
* One could optimize the performance of the charging routine
* by saving a bit in the page_flags and using it as a lock
* to see if the cgroup page already has a page_cgroup associated
* with it
*/
retry:
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
/*
* The page_cgroup exists and
* the page has already been accounted.
*/
if (pc) {
VM_BUG_ON(pc->page != page);
VM_BUG_ON(pc->ref_cnt <= 0);
pc->ref_cnt++;
unlock_page_cgroup(page);
goto done;
}
unlock_page_cgroup(page);
As mentioned earlier, the LSB of the page→page_cgroup
field is used as a lock bit, and lock_page_cgroup()
acquires a bit spinlock to synchronize access to page→page_cgroup
.
When a page has a page_cgroup
(either because it was in the page cache and then mapped, or it’s a shared anonymous or file page), it is already charged, so we only increase the reference count.
As mentioned earlier, a page is never charged more than once:
- If a page is in the page cache and then mapped, it is not charged again.
- If a page is shared, only the first memory control group (memcg) that allocates the page charges it.
pc = kzalloc(sizeof(struct page_cgroup), gfp_mask);
if (pc == NULL)
goto err;
If the page does not have struct page_cgroup
, allocate it. If the allocation fails, there’s not much the kernel can do. It simply fails to allocate memory because it cannot charge the page.
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
* thread group leader migrates. It's possible that mm is not
* set, if so charge the init_mm (happens for pagecache usage).
*/
if (!mm)
mm = &init_mm;
As the comment states, in most cases the memory control group (memcg) that the struct mm_struct points to is charged. However, mm can be NULL (this happens for page cache usage), and if that is the case, init_mm's memcg is charged.
rcu_read_lock();
mem = rcu_dereference(mm->mem_cgroup);
/*
* For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
rcu_read_unlock();
Grab a reference to struct mem_cgroup
via css_get(&mem->css)
. Each page charged to the memcg increments the reference count. This makes sure that the memcg does not disappear as long as there are pages charged to the memcg.
while (res_counter_charge(&mem->res, PAGE_SIZE)) {
if (!(gfp_mask & __GFP_WAIT))
goto out;
if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
Now, memcg charges the memory before actually allocating it. If there is enough space for the allocation, res_counter_charge()
returns zero, and everything is fine. If there isn't enough space, the kernel reclaims some memory to make room for the allocation.
If it is an atomic context (where gfp_mask
does not include __GFP_WAIT
), it cannot perform direct memory reclamation, so the charge fails.
try_to_free_mem_cgroup_pages()
returns 1 if it successfully reclaims some memory, allowing it to try allocating memory again. If it cannot reclaim enough memory, it returns zero.
/*
* try_to_free_mem_cgroup_pages() might not give us a full
* picture of reclaim. Some pages are reclaimed and might be
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
*/
if (res_counter_check_under_limit(&mem->res))
continue;
However, in some cases, even if try_to_free_mem_cgroup_pages()
returns zero, the memory cgroup might still have some space available for allocation. So check the limit again.
I think this is a situation where unmapping or moving pages to the swap cache has uncharged some pages, but the pages haven't been reclaimed yet (I haven't verified this yet).
if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
goto out;
}
congestion_wait(WRITE, HZ/10);
}
If trying to reclaim memory fails after nr_retries
attempts (set to MEM_CGROUP_RECLAIM_RETRIES
), the OOM killer is invoked to free up memory, and the current charge attempt fails. Note: mem_cgroup_out_of_memory()
chooses the OOM victim only from tasks within the memory cgroup.
pc->ref_cnt = 1;
pc->mem_cgroup = mem;
pc->page = page;
pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
pc->flags |= PAGE_CGROUP_FLAG_CACHE;
Initialize the struct page_cgroup
structure. A new page_cgroup
has a reference count of one, points to the memory cgroup of the current task, and starts out as active on the LRU. If ctype is MEM_CGROUP_CHARGE_TYPE_CACHE, it also sets the PAGE_CGROUP_FLAG_CACHE
flag in the page_cgroup
.
lock_page_cgroup(page);
if (page_get_page_cgroup(page)) {
unlock_page_cgroup(page);
/*
* Another charge has been added to this page already.
* We take lock_page_cgroup(page) again and read
* page->cgroup, increment refcnt.... just retry is OK.
*/
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
kfree(pc);
goto retry;
}
page_assign_page_cgroup(page, pc);
If another task has already assigned a page_cgroup to the page, uncharge the memory, drop the reference to the memcg, free the newly allocated page_cgroup, and retry. On retry, the existing page_cgroup is likely still valid, so the code ends up just incrementing its reference count. If it has already been uncharged by then, the charging process starts over from the beginning.
Now that the kernel has initialized struct page_cgroup
, link the page with the memcg through page_cgroup.
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_add_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
page_cgroup_zoneinfo()
returns the struct mem_cgroup_per_zone
for the zone that the struct page_cgroup.page
is part of. Then, it calls __mem_cgroup_add_list()
while holding the lru spinlock. __mem_cgroup_add_list()
simply inserts the page into the active or inactive LRU list, depending on whether the PAGE_CGROUP_FLAG_ACTIVE flag is set (a sketch of this helper appears at the end of this walkthrough).
unlock_page_cgroup(page);
done:
return 0;
Now that the page is successfully charged, unlock and return zero.
out:
css_put(&mem->css);
kfree(pc);
err:
return -ENOMEM;
}
In case of an error, release the reference to memcg, free the page_cgroup, and return an error.
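For reference, the __mem_cgroup_add_list() helper used in the walkthrough above does roughly the following. This is a simplified sketch; the per-zone statistics updates are omitted:
/* Simplified sketch of __mem_cgroup_add_list(); statistics updates omitted. */
static void __mem_cgroup_add_list(struct page_cgroup *pc)
{
	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);

	/* the caller holds mz->lru_lock */
	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
		list_add(&pc->lru, &mz->active_list);
	else
		list_add(&pc->lru, &mz->inactive_list);
}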
mem_cgroup_uncharge()
/*
* Uncharging is always a welcome operation, we never complain, simply
* uncharge.
*/
void mem_cgroup_uncharge_page(struct page *page)
{
struct page_cgroup *pc;
struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
if (mem_cgroup_subsys.disabled)
return;
Similar to mem_cgroup_charge_common()
, uncharging does nothing when memcg is disabled.
/*
* Check if our page_cgroup is valid
*/
lock_page_cgroup(page);
pc = page_get_page_cgroup(page);
if (!pc)
goto unlock;
Check if the page has a valid page_cgroup. If it doesn't (sounds like an error?), do nothing.
VM_BUG_ON(pc->page != page);
VM_BUG_ON(pc->ref_cnt <= 0);
Check some error conditions using VM_BUG_ON()
.
if (--(pc->ref_cnt) == 0) {
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
__mem_cgroup_remove_list(pc);
spin_unlock_irqrestore(&mz->lru_lock, flags);
Drop the refcount while holding the lock. If the refcount drops to zero, remove the page_cgroup from the Per-MEMCG LRU list.
page_assign_page_cgroup(page, NULL);
unlock_page_cgroup(page);
mem = pc->mem_cgroup;
res_counter_uncharge(&mem->res, PAGE_SIZE);
css_put(&mem->css);
kfree(pc);
return;
}
Set page→page_cgroup
to NULL and unlock. Uncharge and remove the reference to the memcg. Finally, free the page_cgroup
.
unlock:
unlock_page_cgroup(page);
}
Release the bit spinlock and return.
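For completeness, __mem_cgroup_remove_list(), used in the uncharge path above, is roughly the mirror image of the add helper (again a simplified sketch with the per-zone statistics updates omitted):
/* Simplified sketch of __mem_cgroup_remove_list(); statistics updates omitted. */
static void __mem_cgroup_remove_list(struct page_cgroup *pc)
{
	/* the caller holds the lru_lock of the zone this page belongs to */
	list_del_init(&pc->lru);
}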
Things that are outdated today
Currently, I'm not fully up-to-date with the recent history of memcg, so this article doesn't give a complete view of memcg yet. However, let me discuss some topics covered in this article that are clearly outdated today.
Naturalization of LRU handling
Johannes Weiner unified [4] [6] the double LRU scheme (global and per-memcg) into a per-memcg-only LRU scheme. The global LRU no longer exists: pages are only on the LRU lists of the memcg that charged them, and pages not charged to any memcg sit on the LRU lists of the root cgroup. This removed the memory overhead of maintaining two separate LRUs and simplified the logic.
Removal of struct page_cgroup
Johannes Weiner removed [5] struct page_cgroup
because it had been reduced to a single word in size, making it inefficient to allocate struct page_cgroup
separately and link it with struct page
. Now, each struct page
has a pointer to the struct mem_cgroup
it is charged to.
Closing
This article covered the important features of memcg, the implementation of the memcg (un)charging API, reclamation, and some of the updates after v2.6.25 that made parts of this implementation obsolete. While it is not an up-to-date view of the full picture of memcg, hopefully it gave readers a brief understanding of what the memory cgroup does.
Please note that this article did not cover how memory reclaim, the OOM killer, and task migration work with memcg. This decision was made to avoid overwhelming you with too many details in one article.
References
[1] Linux kernel developers, v2.6.25, Memory Resource Controller, Documentation/controllers/memory.txt
[2] Jonathan Corbet, Controlling memory use in containers, July 31, 2007, LWN.net
[3] Jonathan Corbet, Cleaning up dying control groups, 2022 edition, May 19, 2022, LWN.net
[4] Jonathan Corbet, Integrating memory control groups, May 2011, LWN.net
[5] Johannes Weiner, mm: embed the memcg pointer directly into struct page, Linux kernel git history
[6] Johannes Weiner, memcg naturalization -rc5, linux-mm mailing list.