Memory Cgroup: Charge Coalescing using Per-CPU Memcg Stock (v2.6.33)
Intro
By nature, memory cgroup charging can lead to significant lock contention on the spinlock of struct res_counter
when processes in a memory cgroup run on multiple CPUs. This happens because the kernel maintains a single struct res_counter
per memory cgroup, so every CPU charging memory for the same cgroup serializes on the same lock.
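To see where the contention comes from, here is a simplified sketch of what charging a res_counter looked like in that era: every charge takes the counter's spinlock (and, with hierarchy enabled, the locks of all ancestors). This is a reconstruction for illustration only, not the verbatim kernel code:
/* Simplified sketch of res_counter charging (illustrative, not verbatim). */
int res_counter_charge(struct res_counter *counter, unsigned long val,
                        struct res_counter **limit_fail_at)
{
        struct res_counter *c;

        for (c = counter; c != NULL; c = c->parent) {  /* walk up the hierarchy */
                spin_lock(&c->lock);    /* every CPU charging this memcg serializes here */
                if (c->usage + val > c->limit) {
                        *limit_fail_at = c;
                        spin_unlock(&c->lock);
                        /* undo the charges already taken from descendants (omitted) */
                        return -ENOMEM;
                }
                c->usage += val;
                spin_unlock(&c->lock);
        }
        return 0;
}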
In the Linux kernel v2.6.33, KAMEZAWA Hiroyuki improved this by introducing per-CPU caching of memory charges. The author of the commit put it slightly differently: "coalesce charging via percpu storage".
The idea is very straightforward: when charging memory for a memcg, charge a batch of 32 pages at once and store the unused portion in a per-CPU data structure, struct memcg_stock_pcp. When the kernel charges the same memcg again on that CPU, it deducts the amount from the stock instead of touching the res_counter. If the per-CPU stock runs out, it is refilled with another 32 * PAGE_SIZE bytes.
However, the per-CPU memcg stock can only hold charges for one memory cgroup at a time. If the kernel needs to charge a different memory cgroup, it must flush the current stock, charge 32 pages from the new memory cgroup, and record that cgroup in the struct memcg_stock_pcp.
One important thing to note: when charges are cached on a specific CPU, other CPUs might be unable to charge memory once the cgroup hits its limit. This could force them to reclaim memory, which is unfair. So at some point, before trying too hard to reclaim, the kernel drains the stocks of the other CPUs, because that is cheaper than reclaiming memory, which might involve writing pages back to disk.
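Putting that together, the per-page charge path has roughly the shape sketched below. consume_stock(), refill_stock(), drain_all_stock_async(), and CHARGE_SIZE are the real names introduced later in this article; charge_batch_from_res_counter() and reclaim_some_memory() are placeholders for the res_counter and reclaim logic. This is an illustrative sketch, not the actual kernel function:
/* Illustrative sketch of the overall charge flow (not the actual kernel code). */
static int charge_one_page(struct mem_cgroup *mem, gfp_t gfp_mask)
{
        if (consume_stock(mem))                         /* fast path: no res_counter lock */
                return 0;

        while (1) {
                /* slow path: charge a whole batch (32 pages) at once */
                if (charge_batch_from_res_counter(mem, CHARGE_SIZE) == 0) {
                        /* keep PAGE_SIZE for this page, stock the rest on this CPU */
                        refill_stock(mem, CHARGE_SIZE - PAGE_SIZE);
                        return 0;
                }
                /* over the limit: other CPUs' stocks are cheaper than reclaim
                 * (the real kernel triggers this from inside the reclaim path) */
                drain_all_stock_async();
                if (!reclaim_some_memory(mem, gfp_mask))
                        return -ENOMEM;                 /* eventually give up / OOM */
        }
}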
That’s all high-level information you need to know. It’s quite simple, isn’t it?
Actually, there's one more thing you should know: charge coalescing does not coalesce uncharges. There is a separate mechanism for coalescing uncharges, which will be covered in a separate article later.
New data structure and functions
/*
 * size of first charge trial. "32" comes from vmscan.c's magic value.
 * TODO: maybe necessary to use big numbers in big irons.
 */
#define CHARGE_SIZE (32 * PAGE_SIZE)
struct memcg_stock_pcp {
        struct mem_cgroup *cached; /* this never be root cgroup */
        int charge;
        struct work_struct work;
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
static atomic_t memcg_drain_count;
The struct memcg_stock_pcp
is a new data structure used to store cached charges for each CPU. It is defined as a static per-CPU variable. The cached
field points to the memory cgroup that is currently cached. The charge
field represents the number of bytes stocked for the CPU. The work
field is used to drain the stock asynchronously.
The new functions introduced are:
consume_stock(): Consume charges cached on this CPU.
refill_stock(): Refill the stock from the res_counter.
drain_stock(): Return cached stock to the res_counter.
drain_local_stock(): Drain this CPU's stock.
drain_all_stock_sync(): Drain all CPUs' stocks synchronously.
drain_all_stock_async(): Drain all CPUs' stocks asynchronously.
Now we can think of consuming the stock as the "fast path" and refilling the stock from the res_counter as the "slow path."
Let's analyze the new functions. There's not much to comment on, since most of them are straightforward.
consume_stock()
/*
 * Try to consume stocked charge on this cpu. If success, PAGE_SIZE is consumed
 * from local stock and true is returned. If the stock is 0 or charges from a
 * cgroup which is not current target, returns false. This stock will be
 * refilled.
 */
static bool consume_stock(struct mem_cgroup *mem)
{
        struct memcg_stock_pcp *stock;
        bool ret = true;

        stock = &get_cpu_var(memcg_stock);
        if (mem == stock->cached && stock->charge)
                stock->charge -= PAGE_SIZE;
        else /* need to call res_counter_charge */
                ret = false;
        put_cpu_var(memcg_stock);
        return ret;
}
It's quite simple! Use get_cpu_var() and put_cpu_var() to access the per-CPU variable memcg_stock. If the cached memcg matches the one being charged and the stock is not empty, subtract PAGE_SIZE bytes from the stock and return true; otherwise, return false so the caller falls back to the res_counter.
If you're wondering, "Are {get,put}_cpu_var() really enough to protect the stock? What about interrupt contexts?", the answer is that they are enough here, because nothing is charged from interrupt context, at least in v2.6.33: activities like page faults and reading files don't happen in interrupt context. However, once socket memory started being charged (which can happen in interrupt context), later kernel versions began disabling interrupts before accessing the stock.
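For illustration, a later-kernel-style consume path would look roughly like the sketch below, with interrupts disabled around the per-CPU access. This is a sketch based on the struct shown above, not the verbatim code of any particular release (later kernels also renamed the fields):
/* Sketch only: how later kernels make the stock access IRQ-safe.
 * Field names follow the v2.6.33 struct above; real kernels differ. */
static bool consume_stock_irqsafe(struct mem_cgroup *mem)
{
        struct memcg_stock_pcp *stock;
        unsigned long flags;
        bool ret = false;

        local_irq_save(flags);          /* socket memory may be charged in IRQ context */
        stock = this_cpu_ptr(&memcg_stock);
        if (mem == stock->cached && stock->charge) {
                stock->charge -= PAGE_SIZE;
                ret = true;
        }
        local_irq_restore(flags);
        return ret;
}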
refill_stock()
/*
 * Cache charges(val) which is from res_counter, to local per_cpu area.
 * This will be consumed by consume_stock() function, later.
 */
static void refill_stock(struct mem_cgroup *mem, int val)
{
        struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);

        if (stock->cached != mem) { /* reset if necessary */
                drain_stock(stock);
                stock->cached = mem;
        }
        stock->charge += val;
        put_cpu_var(memcg_stock);
}
refill_stock() adds a new charge to the stock. When refilling, the kernel first charges CHARGE_SIZE (32 * PAGE_SIZE) from the res_counter, consumes PAGE_SIZE of it for the current page, and adds the remaining 31 * PAGE_SIZE to the stock by calling refill_stock(). The next 31 charges on that CPU can then be served from the stock without touching the res_counter. If the cached memory cgroup differs from the one being charged, the old stock is drained and replaced with the new memcg.
drain_stock()
/*
 * Returns stocks cached in percpu to res_counter and reset cached information.
 */
static void drain_stock(struct memcg_stock_pcp *stock)
{
        struct mem_cgroup *old = stock->cached;

        if (stock->charge) {
                res_counter_uncharge(&old->res, stock->charge);
                if (do_swap_account)
                        res_counter_uncharge(&old->memsw, stock->charge);
        }
        stock->cached = NULL;
        stock->charge = 0;
}
drain_stock() returns the cached charges in the stock to the struct res_counter of the memcg (i.e., uncharges them) and resets the cached information. For those curious about the memsw counter: it tracks the combined memory + swap usage of the memory cgroup and is only updated when swap accounting is enabled (do_swap_account). It is supported in cgroup v1, but not in cgroup v2.
drain_local_stock()
/*
 * This must be called under preempt disabled or must be called by
 * a thread which is pinned to local cpu.
 */
static void drain_local_stock(struct work_struct *dummy)
{
        struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
        drain_stock(stock);
}
drain_local_stock() is a helper that calls drain_stock() for the current CPU's stock. It is also the handler attached to each CPU's stock->work item, which is what the asynchronous drain below schedules.
drain_all_stock_sync()
/* This is a synchronous drain interface. */
static void drain_all_stock_sync(void)
{
        /* called when force_empty is called */
        atomic_inc(&memcg_drain_count);
        schedule_on_each_cpu(drain_local_stock);
        atomic_dec(&memcg_drain_count);
}
drain_all_stock_sync()
is a helper function that calls drain_local_stock()
for each CPU. By using schedule_on_each_cpu()
, it ensures all CPUs complete draining.
drain_all_stock_async()
/*
 * Tries to drain stocked charges in other cpus. This function is asynchronous
 * and just put a work per cpu for draining localy on each cpu. Caller can
 * expects some charges will be back to res_counter later but cannot wait for
 * it.
 */
static void drain_all_stock_async(void)
{
        int cpu;
        /* This function is for scheduling "drain" in asynchronous way.
         * The result of "drain" is not directly handled by callers. Then,
         * if someone is calling drain, we don't have to call drain more.
         * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
         * there is a race. We just do loose check here.
         */
        if (atomic_read(&memcg_drain_count))
                return;
        /* Notify other cpus that system-wide "drain" is running */
        atomic_inc(&memcg_drain_count);
        get_online_cpus();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                schedule_work_on(cpu, &stock->work);
        }
        put_online_cpus();
        atomic_dec(&memcg_drain_count);
        /* We don't wait for flush_work */
}
drain_all_stock_async()
simply schedules the work but does not wait for the CPUs to finish draining. While schedule_on_each_cpu()
waits for the workqueue to complete the task, schedule_work_on()
does not.
Checking memcg_drain_count before queuing the drain work ensures that we don't start a new system-wide asynchronous drain while one is already in progress.
The implementation
mem_cgroup_charge_common()
Now, let’s look at mem_cgroup_charge_common()
in v2.6.33, which is the heart of charging.
/*
 * Charge the memory controller for page usage.
 * Return
 *      0 if the charge was successful
 *      < 0 if the cgroup is over its limit
 */
static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
                                gfp_t gfp_mask, enum charge_type ctype,
                                struct mem_cgroup *memcg)
{
        struct mem_cgroup *mem;
        struct page_cgroup *pc;
        int ret;

        pc = lookup_page_cgroup(page);
        /* can happen at boot */
        if (unlikely(!pc))
                return 0;
        prefetchw(pc);
Look up the page_cgroup, which is allocated at boot, and skip charging if it's too early in the boot process.
Prefetch the page_cgroup into the CPU cache for better performance, as we will definitely access it soon.
        mem = memcg;
        ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page);
        if (ret || !mem)
                return ret;

        __mem_cgroup_commit_charge(mem, pc, ctype);
        return 0;
}
I covered the "charge-commit-cancel" scheme introduced in version 2.6.29 in a separate article. Please read it if you're curious about why this scheme was introduced.
__mem_cgroup_try_charge()
The function we'll look at closely is __mem_cgroup_try_charge(), because it's where charge coalescing is implemented.
/*
 * Unlike exported interface, "oom" parameter is added. if oom==true,
 * oom-killer can be invoked.
 */
static int __mem_cgroup_try_charge(struct mm_struct *mm,
                        gfp_t gfp_mask, struct mem_cgroup **memcg,
                        bool oom, struct page *page)
{
        struct mem_cgroup *mem, *mem_over_limit;
        int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
        struct res_counter *fail_res;
        int csize = CHARGE_SIZE;

        if (unlikely(test_thread_flag(TIF_MEMDIE))) {
                /* Don't account this! */
                *memcg = NULL;
                return 0;
        }
If the task has been selected by the OOM killer (TIF_MEMDIE is set), don't make it struggle with charging and reclaiming memory; skip accounting so it can exit quickly.
        /*
         * We always charge the cgroup the mm_struct belongs to.
         * The mm_struct's mem_cgroup changes on task migration if the
         * thread group leader migrates. It's possible that mm is not
         * set, if so charge the init_mm (happens for pagecache usage).
         */
        mem = *memcg;
        if (likely(!mem)) {
                mem = try_get_mem_cgroup_from_mm(mm);
                *memcg = mem;
        } else {
                css_get(&mem->css);
        }
        if (unlikely(!mem))
                return 0;
The charge takes a reference on the memory cgroup's css, either via try_get_mem_cgroup_from_mm() or css_get(), so that the struct mem_cgroup cannot be freed while the page is charged to it.
        VM_BUG_ON(css_is_removed(&mem->css));

        if (mem_cgroup_is_root(mem))
                goto done;
Starting from v2.6.32, the kernel doesn’t charge the root memory cgroup, as an optimization.
        while (1) {
                int ret = 0;
                unsigned long flags = 0;

                if (consume_stock(mem))
                        goto charged;
If there are cached charges for this memcg in the stock, consume one page's worth and skip the res_counter entirely.
                ret = res_counter_charge(&mem->res, csize, &fail_res);
                if (likely(!ret)) {
                        if (!do_swap_account)
                                break;
                        ret = res_counter_charge(&mem->memsw, csize, &fail_res);
                        if (likely(!ret))
                                break;
                        /* mem+swap counter fails */
                        res_counter_uncharge(&mem->res, csize);
                        flags |= MEM_CGROUP_RECLAIM_NOSWAP;
                        mem_over_limit = mem_cgroup_from_res_counter(fail_res,
                                                                        memsw);
                } else
                        /* mem counter fails */
                        mem_over_limit = mem_cgroup_from_res_counter(fail_res,
                                                                        res);
If the stock couldn't be used, charge csize bytes (initially 32 * PAGE_SIZE) to the struct res_counter. If either the memory counter (mem->res) or the memory+swap counter (mem->memsw) fails because a limit was hit, the kernel needs to reclaim memory. In that case, the memory cgroup whose counter failed is recorded in the mem_over_limit variable. (It can be any memory cgroup from the current one up to the root of the hierarchy.)
As you might have noticed, I haven't covered swap accounting yet :( This topic is beyond the scope of this article. I'll address it in a separate article later.
                /* reduce request size and retry */
                if (csize > PAGE_SIZE) {
                        csize = PAGE_SIZE;
                        continue;
                }

                if (!(gfp_mask & __GFP_WAIT))
                        goto nomem;

                ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
                                                        gfp_mask, flags);
                if (ret)
                        continue;
It might happen that we can't charge 32 * PAGE_SIZE but can still charge PAGE_SIZE, so the request size is reduced to a single page and the loop retries. If even that fails, the kernel reclaims memory, but only if gfp_mask has the __GFP_WAIT flag set; __GFP_WAIT means the caller may sleep and be rescheduled while memory is reclaimed. Without it, the charge fails immediately (goto nomem). mem_cgroup_hierarchical_reclaim() reclaims memory from the memory cgroup hierarchy; for more information on how memcg reclaims memory, read Memory Cgroup Naturalization in the Linux Kernel v3.3. The function returns the number of pages reclaimed. If some memory was reclaimed, the loop continues and tries charging again.
                /*
                 * try_to_free_mem_cgroup_pages() might not give us a full
                 * picture of reclaim. Some pages are reclaimed and might be
                 * moved to swap cache or just unmapped from the cgroup.
                 * Check the limit again to see if the reclaim reduced the
                 * current usage of the cgroup before giving up
                 *
                 */
                if (mem_cgroup_check_under_limit(mem_over_limit))
                        continue;
Even if no pages were reclaimed, some pages might have been uncharged (for example, moved to swap cache or unmapped from the cgroup), so the memory cgroup that failed to charge might now have some room. In that case, mem_cgroup_check_under_limit() returns true and the charge is retried.
                if (!nr_retries--) {
                        if (oom) {
                                mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
                                record_last_oom(mem_over_limit);
                        }
                        goto nomem;
                }
        }
If charging keeps failing and reclaim doesn't help after MEM_CGROUP_RECLAIM_RETRIES attempts, the OOM killer is invoked (when oom is true) to terminate a problematic task in the mem_over_limit memory cgroup, and the charge gives up (goto nomem). This marks the end of the while loop; falling out of it via break means csize bytes were successfully charged to the res_counter.
        if (csize > PAGE_SIZE)
                refill_stock(mem, csize - PAGE_SIZE);
If the kernel charged from struct res_counter instead of the stock (csize is still 32 * PAGE_SIZE), the current page only needs PAGE_SIZE of that, so the remaining 31 * PAGE_SIZE bytes are put into this CPU's stock. If the request had been reduced to a single page, there is nothing left to stock.
charged:
        /*
         * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
         * if they exceeds softlimit.
         */
        if (mem_cgroup_soft_limit_check(mem))
                mem_cgroup_update_tree(mem, page);
done:
        return 0;
This is beyond the scope of this article, but let me explain briefly: the kernel uses a red-black tree to track memory cgroups that exceed their 'soft limit.' This tree helps identify the memory cgroup that exceeds its soft limit by the largest amount, allowing the system to reclaim memory from it when under pressure.
nomem:
        css_put(&mem->css);
        return -ENOMEM;
}
If the kernel failed to charge, drop the reference count of the memory cgroup.
Stock Draining
To keep the article brief, a detailed analysis of stock draining is left as an exercise for the reader. (It is quite simple!) At a high level, stock draining happens in the following situations:
When the user writes to the force_empty file, or when the memory cgroup is about to be destroyed, forcing the memory cgroup's memory to be emptied.
In mem_cgroup_hierarchical_reclaim(), when the kernel cannot make space (via reclaim) for charging from the descendant memory cgroups of the target memory cgroup.
In refill_stock(), when the cached memory cgroup is not the current memory cgroup.
When the CPU hotplug system informs the memory cgroup subsystem that a CPU is dead (see the sketch below).
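For the last case, the memcg code registers a CPU-hotplug notifier whose handler drains the dead CPU's stock. Below is a rough sketch of what such a handler looks like in this era, reconstructed for illustration; the real v2.6.33 callback differs in details:
/* Sketch of a CPU-hotplug handler that drains a dead CPU's stock
 * (illustrative; not the verbatim v2.6.33 source). */
static int memcg_stock_cpu_callback(struct notifier_block *nb,
                                    unsigned long action, void *hcpu)
{
        int cpu = (unsigned long)hcpu;
        struct memcg_stock_pcp *stock;

        if (action != CPU_DEAD)
                return NOTIFY_OK;
        stock = &per_cpu(memcg_stock, cpu);
        drain_stock(stock);     /* return the dead CPU's cached charges */
        return NOTIFY_OK;
}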
Coming in v6.16: Multi-Memcg Percpu Charge Cache
The percpu memcg stock has worked well for 17 years. The assumption was that the kernel would keep a process on a CPU for a long time, allowing it to frequently use the fast path of the charging code. However, in a network-heavy workload on a multi-tenant machine, this assumption proved incorrect. It was observed that the CPU spent almost 100% of its time just charging memory in net_rx_action.
That's why Meta engineer Shakeel Butt proposed caching multiple memcg stocks per CPU. This patch was included in the v6.16-rc1 MM Pull Request from Andrew Morton. If no critical issues arise, it will be part of the v6.16 release.
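Conceptually, instead of remembering a single (memcg, charge) pair per CPU, each CPU keeps a small array of cached memcgs, so a CPU bouncing between a handful of cgroups can still hit the fast path. The structure below is purely illustrative of that idea; the actual patch series uses different names, sizes, and details:
/* Purely conceptual sketch of a multi-memcg per-CPU charge cache.
 * Names and the cache size are illustrative, not from the actual patches. */
#define NR_CACHED_MEMCG 4

struct memcg_stock_multi {
        struct mem_cgroup *cached[NR_CACHED_MEMCG];
        unsigned int nr_pages[NR_CACHED_MEMCG];
        struct work_struct work;
};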
Closing
In this article, we explored how the memory cgroup subsystem reduces contention on struct res_counter by coalescing charges in a per-CPU cache, the percpu memcg stock. As mentioned earlier, uncharge coalescing is handled by a separate mechanism, which we will discuss later.