Memory Cgroup: Charge-Commit-Cancel scheme (v2.6.29)

I realized that I didn't discuss the new charge-commit-cancel scheme introduced in v2.6.29. Let's briefly cover this scheme in this article.

A few changes since v2.6.25

There were a few important changes to the charging code since v2.6.25.

With these changes, a page is charged only if the page was not mapped to userspace:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)  
{                                                                               
        if (mem_cgroup_subsys.disabled)                                         
                return 0;                                                       
        if (PageCompound(page))                                                 
                return 0;                                                       
        /*                                                                      
         * If already mapped, we don't have to account.                         
         * If page cache, page->mapping has address_space.                      
         * But page->mapping may have out-of-use anon_vma pointer,              
         * detect it by PageAnon() check. newly-mapped-anon's page->mapping    
         * is NULL.                                                             
         */                                                                     
        if (page_mapped(page) || (page->mapping && !PageAnon(page)))            
                return 0;                                                       
        if (unlikely(!mm))                                                      
                mm = &init_mm;                                                  
        return mem_cgroup_charge_common(page, mm, gfp_mask,                     
                                MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
}

One problem is that there is no serialization to stop other processes from unmapping a page after a process checks page_mapped(page) and decides not to charge. If this happens, the page can avoid being charged, allowing processes to use more memory than their limit.
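For context, the uncharge side of this race lives in the rmap code: when the last mapping of a page goes away, the page is uncharged. Here is a simplified sketch based on my reading of the kernel around these versions; the real function also updates statistics and handles more cases:

void page_remove_rmap(struct page *page)
{
        /* last mapping just went away: mapcount went 1 -> 0 */
        if (atomic_add_negative(-1, &page->_mapcount)) {
                /* ... statistics and anon_vma handling elided ... */
                mem_cgroup_uncharge_page(page);
        }
}

Nothing here is serialized against the page_mapped() check on the charge side, which is exactly the window the race below exploits.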

The race condition during swap-in

The race condition is described in the commit message that introduces the charge-commit-cancel scheme (I slightly rewrote the change log):

There is a small race in do_swap_page(). When the swapped-in page is
charged, its mapcount can be greater than 0. But, at the same time, some
process sharing the page can unmap it, making the mapcount go from 1 to 0,
and the page is uncharged.

CPU A                                CPU B
mapcount == 1.
(1) don't charge as mapcount !=0     unmap the page via zap_pte_range()
                                     (2) mapcount 1 => 0
                                     (3) uncharge the page
(4) set the page's reverse mapping
increment mapcount (0 =>1)

Then, this swap page's account is leaked.

Again, nothing prevented other processes from unmapping the page after a process had decided to skip charging it.

The solution is also described in the change log:

For fixing this, I added a new interface.
      - charge
       account to res_counter by PAGE_SIZE and try to free pages if necessary.
      - commit
       register page_cgroup and add to LRU if necessary.
      - cancel
       uncharge PAGE_SIZE because of do_swap_page failure.

         CPUA
      (1) charge (always)
      (2) set page's rmap (mapcount > 0)
      (3) commit charge was necessary or not after set_pte().

This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does charge -> commit at a time.

The Charge-Commit-Cancel scheme

The new scheme always "charges" the page and then "commits" the charge. If the page turns out to be already charged, we detect that by checking PageCgroupUsed() under lock_page_cgroup() and give the duplicate charge back. "Cancel" is needed only when some operation fails after the charge but before the commit, since in that case the page will never actually be mapped.
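To make the protocol concrete before diving into do_swap_page(), here is its shape using the swap-in entry points that appear below. This is a minimal sketch, not actual kernel code; the failing step in the middle is a hypothetical placeholder:

        struct mem_cgroup *ptr = NULL;

        /* "charge": may reclaim and may fail; nothing to undo yet */
        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr))
                return VM_FAULT_OOM;

        if (some_later_step_fails) {            /* hypothetical placeholder */
                /* "cancel": give the charge back */
                mem_cgroup_cancel_charge_swapin(ptr);
                return ret;
        }

        /* "commit": mark the page_cgroup as used (or back off if it already is) */
        mem_cgroup_commit_charge_swapin(page, ptr);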

Let’s look at do_swap_page(), which handles swap-in (eliding out-of-scope parts and focusing only on the charging aspect):


static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,       
                unsigned long address, pte_t *page_table, pmd_t *pmd,           
                int write_access, pte_t orig_pte)                               
{
        /* 
         * ... A large part of swap-in logic omitted ...
         * Either 1) fault-in a page from the swap cache, or
         * 2) allocate a new page, load content from swap space,
         *    and insert the page to swap cache.
         */
        lock_page(page);                                                        
        delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

        if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {         
                ret = VM_FAULT_OOM;                                             
                unlock_page(page);                                              
                goto out;                                                       
        }
        /* ... snip ... */

This is the unconditional “charge” part; if charging fails, the swap-in fails with VM_FAULT_OOM.

        /* ... snip ... */

        /*                                                                      
         * Back out if somebody else already faulted in this pte.               
         */                                                                     
        page_table = pte_offset_map_lock(mm, pmd, address, &ptl);               
        if (unlikely(!pte_same(*page_table, orig_pte)))                         
                goto out_nomap;                                                 

        if (unlikely(!PageUptodate(page))) {                                    
                ret = VM_FAULT_SIGBUS;                                          
                goto out_nomap;                                                 
        }
        /* ... snip ... */

If swap-in fails for some reason after charging, we jump to the out_nomap label and cancel the charge.

        /* ... snip ... */
        set_pte_at(mm, address, page_table, pte);                               
        page_add_anon_rmap(page, vma, address);                                 
        /* It's better to call commit-charge after rmap is established */       
        mem_cgroup_commit_charge_swapin(page, ptr);
        /* ... snip ... */
unlock:                                                                         
        pte_unmap_unlock(page_table, ptl);                                      
out:                                                                            
        return ret;

The charge is committed at a point where no further failure can occur.

out_nomap:                                                                      
        mem_cgroup_cancel_charge_swapin(ptr);                                   
        pte_unmap_unlock(page_table, ptl);                                      
        unlock_page(page);                                                      
        page_cache_release(page);                                               
        return ret;                                                             
}

In case of failure, mem_cgroup_cancel_charge_swapin() cancels the charge.
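Cancelling is cheap. If I read the v2.6.29 source correctly, mem_cgroup_cancel_charge_swapin() roughly just gives back what the try-charge took; this is a simplified sketch, so treat the exact checks as approximate:

void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
{
        if (mem_cgroup_subsys.disabled)
                return;
        if (!mem)
                return;
        /* undo the try-charge: give back PAGE_SIZE and the css reference */
        res_counter_uncharge(&mem->res, PAGE_SIZE);
        if (do_swap_account)
                res_counter_uncharge(&mem->memsw, PAGE_SIZE);
        css_put(&mem->css);
}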

In the usual cases other than swap-in, the charge is committed right after it succeeds, via mem_cgroup_charge(). The swap-in path uses its own charging functions because swap-in might still fail after the charge has succeeded.
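For comparison, here is roughly how the ordinary path bundles the two steps ("charge -> commit at a time", as the changelog puts it). This is a simplified sketch of mem_cgroup_charge_common() as I remember the v2.6.29 code; boot-time corner cases are elided:

static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
                                gfp_t gfp_mask, enum charge_type ctype,
                                struct mem_cgroup *memcg)
{
        struct mem_cgroup *mem = memcg;
        struct page_cgroup *pc = lookup_page_cgroup(page);
        int ret;

        /* "charge": may reclaim, may sleep, may fail */
        ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
        if (ret)
                return ret;
        /* "commit" immediately: nothing can fail between the two calls */
        __mem_cgroup_commit_charge(mem, pc, ctype);
        return 0;
}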

So, there is no charging leak after the change?

Yes. There is no charging leak because we charge unconditionally, and we still don’t account the same page twice because the commit is done under lock_page_cgroup(): a second committer sees PageCgroupUsed() and gives its charge back. There is no race condition left.

/*                                                                              
 * commit a charge got by __mem_cgroup_try_charge() and makes page_cgroup to be 
 * USED state. If already USED, uncharge and return.                            
 */                                                                             
static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,                  
                                     struct page_cgroup *pc,                    
                                     enum charge_type ctype)                    
{                                                                               
        /* try_charge() can return NULL to *memcg, taking care of it. */        
        if (!mem)                                                               
                return;                                                         

        lock_page_cgroup(pc);                                                   
        if (unlikely(PageCgroupUsed(pc))) {                                     
                unlock_page_cgroup(pc);                                         
                res_counter_uncharge(&mem->res, PAGE_SIZE);                     
                if (do_swap_account)                                                
                        res_counter_uncharge(&mem->memsw, PAGE_SIZE);           
                css_put(&mem->css);                                             
                return;                                                         
        }

Lock the page_cgroup and check PageCgroupUsed(). If the page is already charged, give the duplicate charge back and return.

        pc->mem_cgroup = mem;                                                   
        smp_wmb();                                                              
        pc->flags = pcg_default_flags[ctype];                                   

        mem_cgroup_charge_statistics(mem, pc, true);                            

        unlock_page_cgroup(pc);                                                 
}

If the page wasn't already charged, the commit proceeds: the page_cgroup is pointed at the memory cgroup, and the PCG_USED flag is set by the pc->flags = pcg_default_flags[ctype]; statement.
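For completeness, the PCG_USED bit comes from the per-charge-type default flag masks. If I'm reading the v2.6.29 memcontrol.c correctly, they look roughly like this (a simplified excerpt; exact names and comments may differ):

#define PCGF_CACHE      (1UL << PCG_CACHE)
#define PCGF_USED       (1UL << PCG_USED)
#define PCGF_LOCK       (1UL << PCG_LOCK)

static const unsigned long
pcg_default_flags[NR_CHARGE_TYPE] = {
        PCGF_CACHE | PCGF_LOCK | PCGF_USED, /* File Cache */
        PCGF_USED | PCGF_LOCK,              /* Anon */
        PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
        0,                                  /* FORCE */
};

Every charge type that actually accounts a page includes the USED bit, so the next would-be committer sees PageCgroupUsed() and backs off.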