Memory Cgroup Naturalization in the Linux Kernel v3.3
This article explains how the LRU scheme used by the memory cgroup controller changed in Linux kernel v3.3 (released in 2012). This update integrated the memory cgroup into the core MM rather than leaving it "bolted onto" the MM code, and it also reduced the memory overhead of the memory cgroup controller.
Memory Reclaim w/o and w/ Memory Cgroup
The Linux kernel memory manager uses a "Split LRU" [3] mechanism, which divides the Least-Recently-Used list into separate LRU lists for anonymous memory and file-backed memory. It also keeps "active" and "inactive" lists for both anonymous and file-backed memory, resulting in four lists per memory zone in this era of the kernel. (I'm not discussing the "unevictable LRU list" and MGLRU here.)
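For reference, those lists were enumerated roughly like this in include/linux/mmzone.h around that era (slightly abridged; exact definitions vary between versions):

#define LRU_BASE   0
#define LRU_ACTIVE 1
#define LRU_FILE   2

enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};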
The kernel reclaims memory when free memory is low, either through indirect reclaim (by waking up kswapd) or direct reclaim (the process requesting memory synchronously reclaims it). No matter what triggers the memory reclamation, the kernel reclaims memory from the "inactive" list and balances the active and inactive lists to select suitable reclaim candidates.
Without memory cgroups, reclamation is "global": when memory is reclaimed, pages from any process can be reclaimed. There is no separation between processes.
With memory cgroups, things work differently. Each memory cgroup has a maximum limit on how much memory it can allocate. When a memory cgroup reaches its limit, it reclaims some of its own memory to stay within that limit (a.k.a. limit reclaim); otherwise, it would exceed its allocation limit. This introduces a requirement that is a bit different from traditional memory management: separate LRU lists need to be maintained for each memory cgroup, and limit reclaim must be done within a memory cgroup.
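To make "limit reclaim" concrete, here is a purely conceptual sketch of the charge path. The helper names (charge_would_exceed_limit(), reclaim_from_memcg(), commit_charge()) are hypothetical and only illustrate the idea; they are not kernel APIs:

/* Conceptual sketch only; the helper names below are hypothetical. */
int charge_memcg_sketch(struct mem_cgroup *memcg, unsigned long nr_pages)
{
        int retries = 5;        /* arbitrary retry budget for the sketch */

        while (charge_would_exceed_limit(memcg, nr_pages)) {
                if (!retries--)
                        return -ENOMEM; /* give up; the memcg may hit OOM */
                /*
                 * Limit reclaim: scan only this memcg's own LRU lists
                 * (and its descendants'), not the whole system.
                 */
                reclaim_from_memcg(memcg, nr_pages);
        }
        commit_charge(memcg, nr_pages); /* usage is now accounted to the memcg */
        return 0;
}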
In the early days of memory cgroups, the Linux kernel used a double LRU scheme: a page belongs to two LRU lists, one for the global LRU and the other for the per-memcg LRU. The per-memcg LRU was used when memory reclamation was triggered by a specific memory cgroup, and the global LRU was used when the process that triggered memory reclamation did not belong to a memory cgroup.
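The two list memberships lived in two different structures: the page's own lru field linked it into the global LRU, while the per-page struct page_cgroup linked it into the per-memcg LRU. Paraphrased from the v3.2 headers (simplified, most fields omitted):

/* include/linux/mm_types.h (simplified) */
struct page {
        /* ... */
        struct list_head lru;           /* node on the global LRU list */
        /* ... */
};

/* include/linux/page_cgroup.h, v3.2 (simplified) */
struct page_cgroup {
        unsigned long flags;
        struct mem_cgroup *mem_cgroup;
        struct list_head lru;           /* node on the per-memcg LRU list */
};

These two list_head fields are the "two list pointers per page frame" (16 bytes each on 64-bit) discussed next.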
The disadvantage of the double LRU scheme
The main disadvantage of the double LRU scheme was memory overhead. A page on one LRU list needs two list pointers (16 bytes) per page (usually 4 KiB), i.e. 16/4096 ≈ 0.39% of the system's total memory spent just on managing the lists. With the double-LRU scheme, that doubles to roughly 0.78%. In the memory cgroup naturalization series, the memory saving is highlighted [1]:
This patchset disables the global per-zone LRU lists on memory cgroup configurations and converts all its users to operate on the per-memory cgroup lists instead. As LRU pages are then exclusively on one list, this saves two list pointers for each page frame in the system:
page_cgroup array size with 4G physical memory
vanilla: [ 0.000000] allocated 31457280 bytes of page_cgroup
patched: [ 0.000000] allocated 15728640 bytes of page_cgroup
The memory overhead for managing memory cgroup-related data (page_cgroup) was reduced by half! It might seem small, but on the author's laptop with 32 GiB of memory, a 0.39% saving equals roughly 128 MiB.
Memory Cgroup Naturalization
Reclaim behavior depends on whether it's limit reclaim or global reclaim
In the Linux kernel, the function that reclaims memory from a specific zone is shrink_zone().
It takes three arguments:
reclaim priority (how hard should we try to reclaim?)
target zone (the zone we want to reclaim memory from)
scan_control (various parameters for reclaim)
shrink_zone() can be called by kswapd (indirect reclaim) or by direct reclaim. When memory reclaim is triggered by a specific memory cgroup, sc->mem_cgroup points to the target memory cgroup. scanning_global_lru(sc) returns true when it's a global reclaim and false otherwise. Because global reclaim and limit reclaim use different methods to free up memory, the reclamation behavior varies depending on the reason it is triggered.
Since Linux v3.3, it's slightly different. Scanning the per-memcg LRU list doesn't necessarily mean it's a limit reclaim. A new helper, global_reclaim(), checks sc->target_mem_cgroup to determine whether it's a global reclaim.
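For reference, the two helpers look roughly like this (paraphrased from mm/vmscan.c of v3.2 and v3.3 respectively; with CONFIG_CGROUP_MEM_RES_CTLR disabled, both simply report global reclaim):

/* v3.2 (approximate) */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
#else
#define scanning_global_lru(sc) (1)
#endif

/* v3.3 (approximate) */
static bool global_reclaim(struct scan_control *sc)
{
        return !sc->target_mem_cgroup;
}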
Limit Reclaim
In Linux kernel v3.2, limit reclaim is triggered in the following situations:
When a memory allocation (charge) would push the memcg's usage over its limit.
When the limit is lowered below the current usage, so memory must be reclaimed down to the new limit.
When global reclaim is performed and a memcg's usage exceeds its soft limit (soft limit reclaim).
In Linux v3.2, mem_cgroup_hierarchical_reclaim() performs limit reclaim. We won't go into the implementation details, but it's straightforward: it goes through memory cgroups in the subtree of the target memory cgroup, tries to reclaim memory from each one, and stops reclaiming if there's some space to allocate more memory for the target memory cgroup. When reclaiming from a specific memory cgroup, it simply takes pages from the per-memcg LRU list (inactive list, of course) and reclaims them if possible.
If you're wondering why we reclaim memory from descendant memcgs to make space for the target memcg, that's a great question! Before version 2.6.39, struct res_counter was just an independent counter. However, starting from version 2.6.39, struct res_counter became aware of the hierarchy. When a memory cgroup controller charges memory, it now moves up the hierarchy (see res_counter_charge()). So, if you reclaim some memory from descendant memcgs, it also makes space for the target memcg.
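A simplified sketch of that hierarchical charge (locking, statistics, and the rollback of partially applied charges are omitted; this shows the idea, not the exact kernel code):

int res_counter_charge(struct res_counter *counter, unsigned long val,
                       struct res_counter **limit_fail_at)
{
        struct res_counter *c;

        /* Charge every counter from the target memcg up to the root. */
        for (c = counter; c != NULL; c = c->parent) {
                /* the real code takes c->lock around this */
                if (res_counter_charge_locked(c, val) < 0) {
                        *limit_fail_at = c;     /* this ancestor is over its limit */
                        /* undo the charges taken so far (omitted) */
                        return -ENOMEM;
                }
        }
        return 0;
}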
A slight change to limit reclaim in v3.3: now mem_cgroup_hierarchical_reclaim() is divided into two functions, mem_cgroup_soft_reclaim() and mem_cgroup_reclaim(). mem_cgroup_soft_reclaim() continues to iterate over memory cgroups for soft reclaim, while mem_cgroup_reclaim() relies on the reclaim code to walk the hierarchy. More details will be covered later in the article.
Global Reclaim
Global reclaim can be triggered by kswapd (indirect reclaim) or by direct reclaim when the process is not part of a memory cgroup. Before version 3.3, the kernel would simply take pages from the global LRU list (specifically, the inactive list) and reclaim them.
With the new approach, we avoid maintaining two separate LRU lists and have removed the global LRU list. But on kernels with memcg enabled, how do we perform global reclaim if the global LRU list is gone?
Patch 7 of Johannes’ series [1] states:
Since the LRU pages of a zone are distributed over all existing memory cgroups, a scan target for a zone is complete when all memory cgroups are scanned for their proportional share of a zone's memory.
So, the answer to the original question is: we now maintain only per-memcg LRU lists, global reclaim scans each memory cgroup's exclusive LRU list, and the number of pages scanned per memory cgroup is proportional to the number of pages the memory cgroup allocated from the zone. With the new scheme, global reclaim is somewhat similar to a limit reclaim where the target memcg is the root memcg.
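A rough way to think about the proportional share: a memcg's scan target in a zone scales with how many LRU pages that memcg has in the zone, shrunk by the reclaim priority. The helper below is hypothetical and only illustrates the shape of the calculation; the real logic lives in get_scan_count() and also factors in swappiness and the anon/file balance:

/* Hypothetical illustration; not the actual get_scan_count() logic. */
static unsigned long memcg_zone_scan_target(unsigned long memcg_lru_pages,
                                            int priority)
{
        /* A higher priority value means scanning a smaller fraction. */
        return memcg_lru_pages >> priority;
}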
The old scheme (v3.2)
Now that we understand how it worked in version 3.2 and how it changed in version 3.3, let's read some code.
Adding and removing pages from LRU lists
static inline void
__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
                       struct list_head *head)
{
        list_add(&page->lru, head);
        __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
        mem_cgroup_add_lru_list(page, l);
}

static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
        __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
}

static inline void
del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
{
        list_del(&page->lru);
        __mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
        mem_cgroup_del_lru_list(page, l);
}
With the double-LRU scheme, as you can see, the helpers for adding and removing pages from LRU lists actually add and remove pages from two LRU lists: one for the global LRU list (zone->lru[l].list) and the other for the per-memcg LRU list (memcg->info.nodeinfo[nid]->zoneinfo[zid]->lists[lru]).
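On the per-memcg side, mem_cgroup_add_lru_list() looks up the page's page_cgroup and links it into the memcg's per-zone list. Here is an abridged sketch of the v3.2 behavior (flag checks, the root-memcg special case, and statistics updates are omitted):

void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
{
        struct page_cgroup *pc;
        struct mem_cgroup_per_zone *mz;

        if (mem_cgroup_disabled())
                return;
        pc = lookup_page_cgroup(page);          /* per-page memcg metadata */
        mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
        /* the page_cgroup, not the page itself, sits on the per-memcg list */
        list_add(&pc->lru, &mz->lists[lru]);
}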
shrink_zone()
static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        /* ... snip ... */
        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(l) {
                        if (nr[l]) {
                                nr_to_scan = min_t(unsigned long,
                                                   nr[l], SWAP_CLUSTER_MAX);
                                nr[l] -= nr_to_scan;
                                nr_reclaimed += shrink_list(l, nr_to_scan,
                                                            zone, sc, priority);
                        }
                }
                /* ... snip ... */
}
With some other parts omitted, shrink_zone() calls shrink_list() for each evictable LRU list (LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE).
Shrinking the active list moves some pages from the active list to the inactive list, while shrinking the inactive list takes pages from the inactive list and reclaims them. With the double LRU scheme, shrink_list() takes pages from different lists based on the return value of scanning_global_lru(sc).
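For context, shrink_list() itself is just a small dispatcher; it looks approximately like this in v3.2 (paraphrased from mm/vmscan.c):

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                                 struct zone *zone, struct scan_control *sc,
                                 int priority)
{
        int file = is_file_lru(lru);

        if (is_active_lru(lru)) {
                /* age the active list: pages move to the inactive list */
                if (inactive_list_is_low(zone, sc, file))
                        shrink_active_list(nr_to_scan, zone, sc, priority, file);
                return 0;
        }

        /* actually reclaim from the inactive list */
        return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}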
Let's look at shrink_inactive_list() as an example: at a high level, the function takes pages from the inactive list and reclaims them.
/*
 * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
 * of reclaimed pages
 */
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
                     struct scan_control *sc, int priority, int file)
{
        /* ... snip ... */
        spin_lock_irq(&zone->lru_lock);
        if (scanning_global_lru(sc)) {
                nr_taken = isolate_pages_global(nr_to_scan, &page_list,
                        &nr_scanned, sc->order, reclaim_mode, zone, 0, file);
                zone->pages_scanned += nr_scanned;
                if (current_is_kswapd())
                        __count_zone_vm_events(PGSCAN_KSWAPD, zone,
                                               nr_scanned);
                else
                        __count_zone_vm_events(PGSCAN_DIRECT, zone,
                                               nr_scanned);
        } else {
                nr_taken = mem_cgroup_isolate_pages(nr_to_scan, &page_list,
                        &nr_scanned, sc->order, reclaim_mode, zone,
                        sc->mem_cgroup, 0, file);
                /*
                 * mem_cgroup_isolate_pages() keeps track of
                 * scanned pages on its own.
                 */
        }
When the kernel performs global reclaim, isolate_pages_global() is called to remove pages from the global LRU list. In contrast, when reclaiming on behalf of a memory cgroup, mem_cgroup_isolate_pages() is called to remove pages from the per-memcg LRU list.
Of course, since the pages are going to be reclaimed, they need to be removed from both lists. However, the choice of which pages to reclaim differs!
        /* ... snip ... */
        spin_unlock_irq(&zone->lru_lock);

        nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority,
                                        &nr_dirty, &nr_writeback);

        /* ... snip ... */
        local_irq_disable();
        if (current_is_kswapd())
                __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
        __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

        putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
        /* ... snip ... */
}
shrink_page_list() attempts to reclaim the isolated pages, and those that cannot be reclaimed are returned to the LRU list.
The new scheme (v3.3)
Adding and removing pages from LRU lists
static inline void
add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
{
        struct lruvec *lruvec;

        lruvec = mem_cgroup_lru_add_list(zone, page, lru);
        list_add(&page->lru, &lruvec->lists[lru]);
        __mod_zone_page_state(zone, NR_LRU_BASE + lru, hpage_nr_pages(page));
}

static inline void
del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list lru)
{
        mem_cgroup_lru_del_list(page, lru);
        list_del(&page->lru);
        __mod_zone_page_state(zone, NR_LRU_BASE + lru, -hpage_nr_pages(page));
}
Now that the global LRU list is gone, add_page_to_lru_list() uses the lruvec, which is a set of LRU lists for file/anon and inactive/active pages. This lruvec is returned by mem_cgroup_lru_add_list(). It is either a per-memcg LRU list (if memcg is enabled) or the zone's LRU list (if memcg is disabled).
del_page_from_lru_list() removes the page from the list. mem_cgroup_lru_del_list() only accounts for the removal of the page and does not perform any list operations.
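An abridged sketch of mem_cgroup_lru_add_list() in v3.3 (statistics accounting and the handling of not-yet-charged pages are omitted); note that it only selects which lruvec to use, while the caller performs the actual list_add():

struct lruvec *mem_cgroup_lru_add_list(struct zone *zone, struct page *page,
                                       enum lru_list lru)
{
        struct mem_cgroup_per_zone *mz;
        struct page_cgroup *pc;

        if (mem_cgroup_disabled())
                return &zone->lruvec;   /* memcg disabled: the zone's own lists */

        pc = lookup_page_cgroup(page);
        mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
        return &mz->lruvec;             /* this memcg's per-zone lists */
}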
shrink_zone()
static void shrink_zone(int priority, struct zone *zone,
                        struct scan_control *sc)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup_reclaim_cookie reclaim = {
                .zone = zone,
                .priority = priority,
        };
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
                struct mem_cgroup_zone mz = {
                        .mem_cgroup = memcg,
                        .zone = zone,
                };

                shrink_mem_cgroup_zone(priority, &mz, sc);
                /*
                 * Limit reclaim has historically picked one memcg and
                 * scanned it with decreasing priority levels until
                 * nr_to_reclaim had been reclaimed.  This priority
                 * cycle is thus over after a single memcg.
                 *
                 * Direct reclaim and kswapd, on the other hand, have
                 * to scan all memory cgroups to fulfill the overall
                 * scan target for the zone.
                 */
                if (!global_reclaim(sc)) {
                        mem_cgroup_iter_break(root, memcg);
                        break;
                }
                memcg = mem_cgroup_iter(root, memcg, &reclaim);
        } while (memcg);
}
shrink_zone() is almost the same as before, but the original shrink_zone() code has been moved to shrink_mem_cgroup_zone(). Now, shrink_zone() calls this function for each memory cgroup that mem_cgroup_iter() returns during global reclaim. However, if it's limit reclaim (global_reclaim(sc) == false), it only calls shrink_mem_cgroup_zone() for the first memory cgroup that mem_cgroup_iter() returns.
During global reclaim, the kernel iterates over all memory cgroups under the target memcg and scans each one for its proportional share of the zone's memory at the given priority. mem_cgroup_iter() returns NULL after a round trip of the hierarchy.
Limit reclaim works differently: as mentioned in the code comment, it (historically) picks one memory cgroup, scans it with decreasing priority levels, and only then moves on to the next memcg.
Since shrink_zone() only calls shrink_mem_cgroup_zone() for the first memory cgroup returned during limit reclaim, the limit reclaim logic (mem_cgroup_reclaim()) calls try_to_free_mem_cgroup_pages(), which calls shrink_zone() for each priority level and each zone for the selected memcg. This is the "Picking a memcg, and then scanning with decreasing priority" part.
Moving on to the next memcg is done by calling try_to_free_mem_cgroup_pages() multiple times; this works because mem_cgroup_iter() tracks the last visited memory cgroup for each zone and each priority level. mem_cgroup_reclaim() calls try_to_free_mem_cgroup_pages() at most MEM_CGROUP_MAX_RECLAIM_LOOPS (hardcoded to 100) times.
In summary, limit reclaim is performed as follows:
function mem_cgroup_reclaim()
    Loop, maximum 100 times:
        call try_to_free_mem_cgroup_pages():
            For each priority, call shrink_zones():
                For each zone, call shrink_zone():
                    Select a memcg and scan & reclaim memory from it.
                    The selected memcg is the next memcg of the last visited
                    memcg from the previous iteration.
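Rendered as a simplified C sketch (the function name limit_reclaim_sketch() is hypothetical; the real mem_cgroup_reclaim() also handles noswap decisions, stock draining, and a margin check before stopping):

/* Simplified sketch of the loop described above; not the exact kernel code. */
static unsigned long limit_reclaim_sketch(struct mem_cgroup *memcg,
                                          gfp_t gfp_mask, bool noswap)
{
        unsigned long total = 0;
        int loop;

        for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
                /*
                 * Each call walks every priority and zone for this memcg;
                 * inside shrink_zone(), mem_cgroup_iter() resumes the
                 * hierarchy walk where the previous call left off.
                 */
                total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
                if (total)      /* some room was freed below the limit */
                        break;
        }
        return total;
}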
Closing
This article covered how the early days of memcg used a double-LRU scheme and how the memcg naturalization series removed the global LRU list, so that both global reclaim and limit reclaim now operate on per-memcg LRU lists only.
Stay tuned for more articles on Linux kernel memory management!
References
[1] Johannes Weiner, November 2011, memcg naturalization -rc5, linux-mm mailing list
[2] Jonathan Corbet, May 2011, Integrating memory control groups, LWN.net
[3] Rick van Riel, October 2010, PageReplacementDesign, kernelnewbies.org