Friday, August 9, 2013

Thoughts on why that '(un)cacheable' bit matters...

If you have ever had the chance to configure a kernel's page tables by hand, you are probably aware of a seemingly unimportant 'cacheable' bit associated with each page-table entry (PTE). Ninety-nine times out of a hundred, you would just ignore this fiddly detail of the PTE and use the default, i.e., cacheable. As the name suggests, this gives the memory subsystem the liberty to keep the contents of that particular virtual page (which maps to an actual physical page) in the L1/L2 caches. This improves performance by avoiding a trip to main memory. However, here are at least two scenarios where it's worth spending some time thinking about the 'cacheable' bit:


  1. Sharing Page/Memory in a multi-kernel system:
    Imagine a multiprocessor system in which each core is managed by a different kernel image. Note that this is in contrast to how most commercial systems like Linux work these days, with a single kernel image running across all the cores. Barrelfish is one such OS that boots each core with a separate kernel image loaded in RAM. Now, suppose the kernel running on the bootstrap processor (BSP) wants to share some information with an application processor (AP). The simplest way to do this is to map a particular physical page into the virtual address space of both cores/kernels. When either kernel accesses the corresponding virtual address in its own address space, the access lands on the shared physical page, which can therefore act as shared memory between the cores.
    However, the big catch here is that the PTE for this particular virtual address (the one that maps to the shared physical page) must be marked "non-cacheable". The point is that any writes by either core must go all the way through to physical memory. The application processor may remain oblivious to the BSP's writes if those writes only land in the BSP's cache. Remember that each kernel is running in its own virtual address space, so you cannot count on a cache-coherency protocol to update the application processor's cache. By marking the PTE for the virtual page as non-cacheable, it is ensured* that writes actually reach physical memory, where the other core can then read them.

  2. Memory-mapped IO: This is similar to the previous example in terms of the basic idea: writes must pass through to the actual physical address. A lot of IO devices have their registers mapped somewhere in the physical memory map of the chipset (e.g., the 'hole' at 640K - 1 MB in the PC, which maps to the BIOS ROM or to IO devices). Consider a UART device whose transmit FIFO is mapped at virtual address 0xFFFFE000. If the PTE for this particular page is not marked uncacheable, the writes may never reach the IO device, and you may see no output on your terminal connected to the RS-232 port.


    * (Long Footnote): It is generally a good idea to mark the pointers holding the address of the page shared between two cores, or the address of an IO device register, with the volatile keyword. This is because accesses to these addresses occur outside the current C program (by another kernel running on a different core, or by the IO controller behind the device registers). The compiler has no way of knowing about such extraneous accesses, and may act all wise and entirely remove the write in the name of optimization. Nobody wants that. So, either just use the volatile keyword to avoid such compiler optimizations, or take the drastic step of turning off compiler optimization entirely (with -O0).