Thursday, 12 September, 2013 07:16

Translation Lookaside Buffer

Teaching Solaris Performance Management this week, we got into a lengthy discussion about T-Series CPUs, multi-threaded versus multi-process applications, Multiple Page Size Support (MPSS), and the Translation Lookaside Buffer (TLB).

Solaris processes run in a virtual memory address space. When a process touches an address in that space, something must map the virtual address to an actual physical address. On the SPARC platform the Hardware Address Translation (HAT) layer, named SFMMU after the Spitfire Memory Management Unit, performs this function. The MMU divides the virtual address space into pages. Solaris supports MPSS, so we can change the size of these pages on both SPARC and x86.

The pagesize -a command will display the available page sizes on the system.

$ uname -a
SunOS chicago 5.10 Generic_127112-11 i86pc i386 i86pc Solaris
$ pagesize -a
4096
2097152
$ uname -a
SunOS niagara 5.10 Generic_138888-03 sun4v sparc SUNW Solaris
$ pagesize -a
8192
65536
4194304
268435456

Virtual memory would not be very effective if every memory access required looking up the associated physical page in the page tables. The solution is to cache recent translations in a Translation Lookaside Buffer (TLB). A TLB has a fixed number of slots that contain Translation Table Entries (TTEs), which map virtual addresses to physical addresses.
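
To see why page size matters here, consider the "reach" of the TLB: the number of entries multiplied by the page size. A quick back-of-the-envelope sketch, assuming a hypothetical 512-entry dTLB (real entry counts vary by processor):

$ echo $((512 * 8192))        # 512 entries x 8K pages = 4 MB of reach
4194304
$ echo $((512 * 4194304))     # 512 entries x 4M pages = 2 GB of reach
2147483648

With 4M pages, the same number of TLB slots covers 512 times as much memory.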

Modern servers have multiple cores with multiple hardware strands, allowing the system to dispatch a large number of threads to the CPUs. Each of the processes behind these threads needs its virtual addresses translated, which places a burden on the TLB and the HAT. Simply put, the TLB may not have enough slots to hold the Translation Table Entries (TTEs) for all of the translations required by that many running processes.

To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after retrieval from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss combined with the contents of a TSB base address register (which is pre-programmed on context switch) to calculate the pointer into the TSB of the entry corresponding to the virtual address. If the TSB entry tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler, and the trapped instruction is retried. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler. Quite a bit of complex work to handle these “misses”.
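
As a rough sketch of that pointer calculation (my simplification, not the actual sun4v trap handler code): a direct-mapped TSB is indexed by the virtual page number modulo the number of TSB entries. Assuming an 8K base page size (a 13-bit page offset) and a hypothetical 512-entry TSB, in bash/ksh93 arithmetic:

$ va=0x00022000 tsb_entries=512
$ echo $(( (va >> 13) & (tsb_entries - 1) ))
17

If the tag stored in that TSB entry matches the faulting virtual address, the TTE it holds can be loaded straight into the TLB.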

Starting with Solaris 10 Update 1, Out-Of-the-Box (OOB) large page support turns on MPSS automatically for an application's heap and text (libraries). The advantage is improved performance for userland applications, because fewer CPU cycles are spent servicing dTLB and iTLB misses. In effect, each TLB entry maps a larger amount of memory when we choose larger pages.

For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped onto a single 256M page. On a system that doesn't support large pages, it will be mapped onto 32,768 8K pages.
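
If you would rather request a page size explicitly than rely on the OOB defaults, ppgsz(1) can set the preferred heap or stack page size when launching a program. A minimal sketch, assuming the testprog binary used below and a size reported by pagesize -a:

$ ppgsz -o heap=4M ./testprog

The page size is advice to the kernel; if a 4M page cannot be provided, smaller pages are used instead.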

The pmap command displays the page sizes of memory mappings within the address space of a process. The -sx option directs pmap to show the page size for each mapping.

sol10# pmap -sx `pgrep testprog`
2909:  ./testprog
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  dev:277,83 ino:114875
00020000       8       8       8       -   8K rwx--  dev:277,83 ino:114875
00022000  131088  131088  131088       -   8K rwx--    [ heap ]
FF280000     120     120       -       -   8K r-x--  libc.so.1
FF29E000     136     128       -       -    - r-x--  libc.so.1
FF2C0000      72      72       -       -   8K r-x--  libc.so.1
FF2D2000     192     192       -       -    - r-x--  libc.so.1
FF302000     112     112       -       -   8K r-x--  libc.so.1
FF31E000      48      32       -       -    - r-x--  libc.so.1
FF33A000      24      24      24       -   8K rwx--  libc.so.1
FF340000       8       8       8       -   8K rwx--  libc.so.1
FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
FF3B0000       8       8       8       -   8K rwx--    [ anon ]
FF3C0000     152     152       -       -   8K r-x--  ld.so.1
FF3F6000       8       8       8       -   8K rwx--  ld.so.1
FFBFA000      24      24      24       -   8K rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb  132024  132000  131168       -

There are instances where OOB large pages may hurt performance, or even crash an application that makes improper assumptions about page sizes. If you run into this scenario, adjustments can be made in /etc/system to enable or disable OOB support, as sketched below.
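
A minimal example, using tunable names documented in the Solaris Tunable Parameters Reference Manual; verify them against your release before applying (changes to /etc/system take effect at the next reboot):

* /etc/system: disable OOB large pages for process heap and stack
set use_brk_lpg = 0
set use_stk_lpg = 0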

It can also introduce challenges around TLB coherency. In a multi-threaded application on a CMP or SMP system, threads from a single PID can be dispatched to different CPUs, each holding its own TTEs in its TLB. When a thread unmaps virtual memory, a cleanup must be performed: a stale mapping left in another CPU's TLB would point at a now-invalid physical location and, if allowed to remain, could lead to corruption. During a munmap, Solaris cross-calls only those CPUs that have actually run the process rather than broadcasting to all running CPUs, but as we add more processors this cleanup can still take longer. If you think this may be occurring after a migration to a larger system, consider using processor sets or binding the process to specific CPUs to see if that allows some relief, as sketched below.
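
A minimal sketch of both approaches, assuming a hypothetical PID of 2909 (the testprog process from the pmap example), CPU IDs 4 through 7, and a new processor set that happens to be assigned ID 1:

example# pbind -b 4 2909
example# psrset -c 4 5 6 7
example# psrset -b 1 2909

The first command binds the process to CPU 4 with pbind(1M); the second and third create a processor set from CPUs 4-7 and bind the process to it with psrset(1M), which limits the set of TLBs that can ever hold its TTEs.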

We can use the trapstat command to gain some insight into the performance of our dTLB and iTLB hit rates. By specifying the -T option, trapstat shows TLB misses broken down by page size. In this example, CPU 0 is spending 7.9 percent of its time handling user-mode TLB misses on 8K pages, and another 2.3 percent of its time handling user-mode TLB misses on 64K pages.

example# trapstat -T -c 0
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|      1300  0.1        15  0.0 |    104897  7.9        90  0.0 | 8.0
  0 u  64k|         0  0.0         0  0.0 |     29935  2.3         7  0.0 | 2.3
  0 u 512k|         0  0.0         0  0.0 |      3569  0.2         2  0.0 | 0.2
  0 u   4m|         0  0.0         0  0.0 |       233  0.0         2  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|        13  0.0         0  0.0 |     71733  6.5       110  0.0 | 6.5
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0       206  0.1 | 0.1
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |      1313  0.1        15  0.0 |    210367 17.1       417  0.2 |17.5

By specifying the -e option, trapstat displays statistics for only specific trap types. Using this option minimizes the probe effect when seeking specific data. This example yields statistics for only the dtlb-prot and syscall-32 traps on CPUs 12 through 15:

example# trapstat -e dtlb-prot,syscall-32 -c 12-15
vct  name               |    cpu12    cpu13    cpu14    cpu15
------------------------+------------------------------------
 6c dtlb-prot           |      817      754     1018      560
108 syscall-32          |     1426     1647     2186     1142

vct  name               |    cpu12    cpu13    cpu14    cpu15
------------------------+------------------------------------
 6c dtlb-prot           |     1085      996      800      707
108 syscall-32          |     2578     2167     1638     1452

cpustat provides another way to monitor events on the CPU, including the workings of the TLB. The following command displays the three CPUs with the highest DTLB_miss rates.

example% cpustat -c DTLB_miss -k DTLB_miss -n 3 1 1

 time cpu event DTLB_miss
1.040 115  tick       107
1.006  18  tick        98
1.045 126  tick        31
1.046  96 total       236

event DTLB_miss
total       236

There is quite a bit more to think about relating to MPSS and the TLB; I hope this post serves as a starting point for those running CMP/SMP systems with multi-threaded applications to perform a deeper dive.

Take a look at pmap -sx, ppgsz, pagesize, and mpss.so.1 for additional direction.
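
As one final illustration of that last item: mpss.so.1 can be preloaded to request page sizes for an unmodified binary through its MPSSHEAP and MPSSSTACK environment variables (the sizes must appear in the pagesize -a output):

$ LD_PRELOAD=mpss.so.1 MPSSHEAP=4M MPSSSTACK=64K ./testprog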