.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over
the years and has accumulated more and more functionality to support
a variety of systems, from MMU-less microcontrollers to
supercomputers. The memory management for systems without an MMU is
called ``nommu`` and it definitely deserves a dedicated document,
which hopefully will eventually be written. Yet, although some of the
concepts are the same, here we assume that an MMU is available and
the CPU can translate a virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is
not necessarily contiguous; it might be accessible as a set of
distinct address ranges. Besides, different CPU architectures, and
even different implementations of the same architecture, have
different views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex
and to avoid this complexity the concept of virtual memory was
developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or writes)
from (or to) the system memory, it translates the `virtual` address
encoded in that instruction to a `physical` address that the memory
controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an appropriate
kernel configuration option.
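
The selected page size is visible to user space. For example, a
program can query it at run time with the ``sysconf(3)`` interface; a
minimal sketch:

.. code-block:: c

   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* Report the base page size the kernel was configured with. */
       long page_size = sysconf(_SC_PAGESIZE);

       if (page_size < 0) {
           perror("sysconf");
           return 1;
       }

       printf("base page size: %ld bytes\n", page_size);
       return 0;
   }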

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from the virtual addresses used by programs to the real
addresses in the physical memory. The page tables are organized
hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index
into that level's page table. The lowest bits of the virtual address
define the offset inside the actual page.
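
The number of levels and the width of each index are architecture
specific. For illustration only, the sketch below splits a virtual
address the way a four-level page table with 4K pages (as on x86-64)
would: 9 index bits per level and a 12-bit offset inside the page:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   /* Illustrative values: x86-64, 4K pages, four levels of page tables. */
   #define PAGE_SHIFT 12
   #define LEVEL_BITS 9
   #define LEVEL_MASK ((1UL << LEVEL_BITS) - 1)

   int main(void)
   {
       uint64_t vaddr = 0x00007f1234567890ULL;  /* arbitrary example */

       unsigned int offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
       unsigned int pte = (vaddr >> PAGE_SHIFT) & LEVEL_MASK;
       unsigned int pmd = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;
       unsigned int pud = (vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
       unsigned int pgd = (vaddr >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK;

       printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
              pgd, pud, pmd, pte, offset);
       return 0;
   }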

Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or TLB). The
TLB is usually a scarce resource and applications with a large memory
working set will experience a performance hit because of TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on
x86, it is possible to map 2M and even 1G pages using entries in the
second and the third level page tables. In Linux such pages are
called `huge`. Usage of huge pages significantly reduces pressure on
the TLB, improves TLB hit-rate and thus improves overall system
performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described
at :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
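
A process can also obtain memory from the hugetlb pool without
mounting hugetlbfs by passing ``MAP_HUGETLB`` to mmap(2). A minimal
sketch, assuming 2M huge pages have been reserved beforehand (for
example via ``/proc/sys/vm/nr_hugepages``):

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <string.h>
   #include <sys/mman.h>

   #define LENGTH (2UL * 1024 * 1024)  /* one 2M huge page */

   int main(void)
   {
       /*
        * The mapping is served from the hugetlb pool; this fails unless
        * huge pages were reserved, e.g. via /proc/sys/vm/nr_hugepages.
        */
       void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
       if (addr == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       memset(addr, 0, LENGTH);  /* touch the mapping */
       munmap(addr, LENGTH);
       return 0;
   }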

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts
of the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
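
When THP is enabled in the ``madvise`` mode, an application can hint
that a range of anonymous memory is a good candidate for huge pages
using madvise(2). A minimal sketch:

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <sys/mman.h>

   #define LENGTH (8UL * 1024 * 1024)

   int main(void)
   {
       void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (addr == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       /*
        * Hint that this range is a good candidate for transparent huge
        * pages; the kernel may or may not honour the hint.
        */
       if (madvise(addr, LENGTH, MADV_HUGEPAGE))
           perror("madvise");

       munmap(addr, LENGTH);
       return 0;
   }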

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL
will contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not
all architectures define all zones, and requirements for DMA are
different for different platforms.
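
The zones present on a running system, along with their per-zone
counters and watermarks, can be inspected through ``/proc/zoneinfo``.
A small sketch that prints only the zone names:

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
       FILE *f = fopen("/proc/zoneinfo", "r");
       char line[256];

       if (!f) {
           perror("/proc/zoneinfo");
           return 1;
       }

       /* Each per-zone block starts with e.g. "Node 0, zone   Normal". */
       while (fgets(line, sizeof(line), f))
           if (!strncmp(line, "Node", 4))
               fputs(line, stdout);

       fclose(f);
       return 0;
   }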

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node
Linux constructs an independent memory management subsystem. A node
has its own set of zones, lists of free and used pages and various
statistics counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
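
Applications that care about memory placement typically use the user
space ``libnuma`` library (linked with ``-lnuma``) rather than the
raw system calls. A minimal sketch that queries the topology and
allocates memory on node 0:

.. code-block:: c

   #include <stdio.h>
   #include <numa.h>  /* libnuma; link with -lnuma */

   int main(void)
   {
       if (numa_available() < 0) {
           fprintf(stderr, "NUMA is not available on this system\n");
           return 1;
       }

       printf("highest node number: %d\n", numa_max_node());

       /* Allocate 1M of memory placed on node 0 (assumed to exist). */
       void *buf = numa_alloc_onnode(1 << 20, 0);
       if (!buf) {
           perror("numa_alloc_onnode");
           return 1;
       }

       numa_free(buf, 1 << 20);
       return 0;
   }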

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read,
the data is put into the `page cache` to avoid expensive disk access
on subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when
Linux decides to reuse them for other purposes, it makes sure to
synchronize the file contents on the device with the updated data.
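
Whether the pages of a file-backed mapping are currently resident in
the page cache can be observed with mincore(2). A minimal sketch that
maps the file given as its first argument and counts the resident
pages (for brevity the residency vector is kept on the stack, which
is only suitable for small files):

.. code-block:: c

   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>

   int main(int argc, char **argv)
   {
       struct stat st;
       int fd;

       if (argc < 2)
           return 1;

       fd = open(argv[1], O_RDONLY);
       if (fd < 0 || fstat(fd, &st) || st.st_size == 0)
           return 1;

       long page = sysconf(_SC_PAGESIZE);
       size_t pages = (st.st_size + page - 1) / page;
       unsigned char vec[pages];

       void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
       if (addr == MAP_FAILED || mincore(addr, st.st_size, vec))
           return 1;

       /* Bit 0 of each vector byte is set if the page is resident. */
       size_t resident = 0;
       for (size_t i = 0; i < pages; i++)
           resident += vec[i] & 1;

       printf("%zu of %zu pages resident in the page cache\n",
              resident, pages);
       return 0;
   }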

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. A read access
will result in the creation of a page table entry that references a
special physical page filled with zeroes. When the program performs a
write, a regular physical page will be allocated to hold the written
data. The page will be marked dirty and if the kernel decides to
repurpose it, the dirty page will be swapped out.
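
An anonymous mapping can be created explicitly with mmap(2); a
minimal sketch (the shared zero page and the allocation on the first
write described above happen inside the kernel and are not directly
visible to the program, which simply sees zero-initialized memory):

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <sys/mman.h>

   int main(void)
   {
       size_t len = 1UL << 20;  /* 1M */

       /* No file descriptor: the mapping is anonymous and starts zeroed. */
       char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       printf("before write: %d\n", p[0]);  /* read hits the zero page */
       p[0] = 42;  /* the first write allocates a real physical page */
       printf("after write:  %d\n", p[0]);

       munmap(p, len);
       return 0;
   }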

Reclaim
=======

Throughout the system lifetime, a physical page can be used for
storing different types of data. It can be kernel internal data
structures, DMA'able buffers for device driver use, data read from a
filesystem, memory allocated by user space processes etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data that is available elsewhere, for instance on
a hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of
reclaimable pages are the page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in
certain circumstances, even pages occupied with kernel data
structures can be reclaimed. For instance, in-memory caches of
filesystem metadata can be re-read from the storage device and
therefore it is possible to discard them from the main memory when
the system is under memory pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
supply of free pages. As the load increases, the amount of free pages
goes down and when it reaches a certain threshold (the high
watermark), an allocation request will awaken the ``kswapd``
daemon. It will asynchronously scan memory pages and either just free
them if the data they contain is available elsewhere, or evict them
to the backing storage device (remember those dirty pages?). As
memory usage increases even more and reaches another threshold - the
min watermark - an allocation will trigger `direct reclaim`. In this
case the allocation is stalled until enough memory pages are
reclaimed to satisfy the request.
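
The activity of both reclaim paths is reflected in the counters
exported through ``/proc/vmstat``. A small sketch that prints the
page scan and page steal counters (the exact counter names vary
between kernel versions):

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
       FILE *f = fopen("/proc/vmstat", "r");
       char line[128];

       if (!f) {
           perror("/proc/vmstat");
           return 1;
       }

       /* pgscan_kswapd reflects background reclaim, pgscan_direct stalls. */
       while (fgets(line, sizeof(line), f))
           if (!strncmp(line, "pgscan_", 7) || !strncmp(line, "pgsteal_", 8))
               fputs(line, stdout);

       fclose(f);
       return 0;
   }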

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes
it is necessary to allocate large physically contiguous memory
areas. Such a need may arise, for instance, when a device driver
requires a large buffer for DMA, or when THP allocates a huge
page. Memory `compaction` addresses the fragmentation issue. This
mechanism moves occupied pages from the lower part of a memory zone
to free pages in the upper part of the zone. When a compaction scan
is finished, the free pages are grouped together at the beginning of
the zone and allocations of large physically contiguous areas become
possible.

Like reclaim, the compaction may happen asynchronously in the
``kcompactd`` daemon or synchronously as a result of a memory
allocation request.
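
On kernels built with ``CONFIG_COMPACTION``, an administrator can
also trigger compaction of all zones explicitly by writing to
``/proc/sys/vm/compact_memory``. A minimal sketch (root privileges
are required):

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
       /* Requires root and a kernel built with CONFIG_COMPACTION. */
       FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

       if (!f) {
           perror("/proc/sys/vm/compact_memory");
           return 1;
       }

       fputs("1", f);  /* any value triggers compaction of all zones */
       return fclose(f) ? 1 : 0;
   }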

OOM killer
==========

It may happen that, on a loaded machine, memory will be
exhausted. When the kernel detects that the system has run out of
memory (OOM), it invokes the `OOM killer`. Its mission is simple: all
it has to do is to select a task to sacrifice for the sake of the
overall system health. The selected task is killed in the hope that
after it exits enough memory will be freed to continue normal
operation.
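
User space can influence which task is chosen through
``/proc/<pid>/oom_score_adj``, which accepts values from -1000 (never
select this task) to 1000 (prefer this task). A small sketch that
makes the calling process a preferred victim:

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
       /* Values range from -1000 (exempt) to 1000 (preferred victim). */
       FILE *f = fopen("/proc/self/oom_score_adj", "w");

       if (!f) {
           perror("/proc/self/oom_score_adj");
           return 1;
       }

       fputs("500", f);  /* make this process more likely to be chosen */
       return fclose(f) ? 1 : 0;
   }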