.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over
the years and has accumulated more and more functionality to support
a variety of systems, from MMU-less microcontrollers to
supercomputers. The memory management for systems without an MMU is
called ``nommu`` and it definitely deserves a dedicated document,
which hopefully will eventually be written. Yet, although some of the
concepts are the same, here we assume that an MMU is available and
the CPU can translate a virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is
not necessarily contiguous; it might be accessible as a set of
distinct address ranges. Besides, different CPU architectures, and
even different implementations of the same architecture, have
different views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex
and to avoid this complexity the concept of virtual memory was
developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or writes)
from (or to) the system memory, it translates the `virtual` address
encoded in that instruction to a `physical` address that the memory
controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an appropriate
kernel configuration option.
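
The selected page size is visible to user space. For example, a
program can query it at run time with the ``sysconf(3)`` interface; a
minimal sketch:

.. code-block:: c

   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* Report the base page size the kernel was configured with. */
       long page_size = sysconf(_SC_PAGESIZE);

       if (page_size < 0) {
           perror("sysconf");
           return 1;
       }

       printf("base page size: %ld bytes\n", page_size);
       return 0;
   }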

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from the virtual addresses used by programs to the real
addresses in the physical memory. The page tables are organized
hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index
into that level's page table. The lowest bits of the virtual address
define the offset inside the actual page.
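
The number of levels and the width of each index are architecture
specific. For illustration only, the sketch below splits a virtual
address the way a four-level page table with 4K pages (as on x86-64)
would: 9 index bits per level and a 12-bit offset inside the page:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>

   /* Illustrative values: x86-64, 4K pages, four levels of page tables. */
   #define PAGE_SHIFT 12
   #define LEVEL_BITS 9
   #define LEVEL_MASK ((1UL << LEVEL_BITS) - 1)

   int main(void)
   {
       uint64_t vaddr = 0x00007f1234567890ULL;  /* arbitrary example */

       unsigned int offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
       unsigned int pte = (vaddr >> PAGE_SHIFT) & LEVEL_MASK;
       unsigned int pmd = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;
       unsigned int pud = (vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK;
       unsigned int pgd = (vaddr >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK;

       printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
              pgd, pud, pmd, pte, offset);
       return 0;
   }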

Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or TLB). The
TLB is usually a scarce resource and applications with a large memory
working set will experience a performance hit because of TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on
x86, it is possible to map 2M and even 1G pages using entries in the
second and the third level page tables. In Linux such pages are
called `huge`. Usage of huge pages significantly reduces pressure on
the TLB, improves TLB hit-rate and thus improves overall system
performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described
at :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
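
A process can also obtain memory from the hugetlb pool without
mounting hugetlbfs by passing ``MAP_HUGETLB`` to mmap(2). A minimal
sketch, assuming 2M huge pages have been reserved beforehand (for
example via ``/proc/sys/vm/nr_hugepages``):

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <string.h>
   #include <sys/mman.h>

   #define LENGTH (2UL * 1024 * 1024)  /* one 2M huge page */

   int main(void)
   {
       /*
        * The mapping is served from the hugetlb pool; this fails unless
        * huge pages were reserved, e.g. via /proc/sys/vm/nr_hugepages.
        */
       void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
       if (addr == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       memset(addr, 0, LENGTH);  /* touch the mapping */
       munmap(addr, LENGTH);
       return 0;
   }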

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts
of the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
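
When THP is enabled in the ``madvise`` mode, an application can hint
that a range of anonymous memory is a good candidate for huge pages
using madvise(2). A minimal sketch:

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <sys/mman.h>

   #define LENGTH (8UL * 1024 * 1024)

   int main(void)
   {
       void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (addr == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       /*
        * Hint that this range is a good candidate for transparent huge
        * pages; the kernel may or may not honour the hint.
        */
       if (madvise(addr, LENGTH, MADV_HUGEPAGE))
           perror("madvise");

       munmap(addr, LENGTH);
       return 0;
   }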

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL
will contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not
all architectures define all zones, and requirements for DMA are
different for different platforms.
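
The zones present on a running system, along with their per-zone
counters and watermarks, can be inspected through ``/proc/zoneinfo``.
A small sketch that prints only the zone names:

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
       FILE *f = fopen("/proc/zoneinfo", "r");
       char line[256];

       if (!f) {
           perror("/proc/zoneinfo");
           return 1;
       }

       /* Each per-zone block starts with e.g. "Node 0, zone   Normal". */
       while (fgets(line, sizeof(line), f))
           if (!strncmp(line, "Node", 4))
               fputs(line, stdout);

       fclose(f);
       return 0;
   }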

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node
Linux constructs an independent memory management subsystem. A node
has its own set of zones, lists of free and used pages and various
statistics counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
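
Applications that care about memory placement typically use the user
space ``libnuma`` library (linked with ``-lnuma``) rather than the
raw system calls. A minimal sketch that queries the topology and
allocates memory on node 0:

.. code-block:: c

   #include <stdio.h>
   #include <numa.h>  /* libnuma; link with -lnuma */

   int main(void)
   {
       if (numa_available() < 0) {
           fprintf(stderr, "NUMA is not available on this system\n");
           return 1;
       }

       printf("highest node number: %d\n", numa_max_node());

       /* Allocate 1M of memory placed on node 0 (assumed to exist). */
       void *buf = numa_alloc_onnode(1 << 20, 0);
       if (!buf) {
           perror("numa_alloc_onnode");
           return 1;
       }

       numa_free(buf, 1 << 20);
       return 0;
   }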

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read,
the data is put into the `page cache` to avoid expensive disk access
on subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when
Linux decides to reuse them for other purposes, it makes sure to
synchronize the file contents on the device with the updated data.
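
Whether the pages of a file-backed mapping are currently resident in
the page cache can be observed with mincore(2). A minimal sketch that
maps the file given as its first argument and counts the resident
pages (for brevity the residency vector is kept on the stack, which
is only suitable for small files):

.. code-block:: c

   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>

   int main(int argc, char **argv)
   {
       struct stat st;
       int fd;

       if (argc < 2)
           return 1;

       fd = open(argv[1], O_RDONLY);
       if (fd < 0 || fstat(fd, &st) || st.st_size == 0)
           return 1;

       long page = sysconf(_SC_PAGESIZE);
       size_t pages = (st.st_size + page - 1) / page;
       unsigned char vec[pages];

       void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
       if (addr == MAP_FAILED || mincore(addr, st.st_size, vec))
           return 1;

       /* Bit 0 of each vector byte is set if the page is resident. */
       size_t resident = 0;
       for (size_t i = 0; i < pages; i++)
           resident += vec[i] & 1;

       printf("%zu of %zu pages resident in the page cache\n",
              resident, pages);
       return 0;
   }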

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. A read access
will result in the creation of a page table entry that references a
special physical page filled with zeroes. When the program performs a
write, a regular physical page will be allocated to hold the written
data. The page will be marked dirty and if the kernel decides to
repurpose it, the dirty page will be swapped out.
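
An anonymous mapping can be created explicitly with mmap(2); a
minimal sketch (the shared zero page and the allocation on the first
write described above happen inside the kernel and are not directly
visible to the program, which simply sees zero-initialized memory):

.. code-block:: c

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <sys/mman.h>

   int main(void)
   {
       size_t len = 1UL << 20;  /* 1M */

       /* No file descriptor: the mapping is anonymous and starts zeroed. */
       char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED) {
           perror("mmap");
           return 1;
       }

       printf("before write: %d\n", p[0]);  /* read hits the zero page */
       p[0] = 42;  /* the first write allocates a real physical page */
       printf("after write:  %d\n", p[0]);

       munmap(p, len);
       return 0;
   }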

Reclaim
=======

Throughout the system lifetime, a physical page can be used for
storing different types of data. It can be kernel internal data
structures, DMA'able buffers for device driver use, data read from a
filesystem, memory allocated by user space processes etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data that is available elsewhere, for instance on
a hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of
reclaimable pages are the page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in
certain circumstances, even pages occupied with kernel data
structures can be reclaimed. For instance, in-memory caches of
filesystem metadata can be re-read from the storage device and
therefore it is possible to discard them from the main memory when
the system is under memory pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
supply of free pages. As the load increases, the amount of free pages
goes down and when it reaches a certain threshold (the high
watermark), an allocation request will awaken the ``kswapd``
daemon. It will asynchronously scan memory pages and either just free
them if the data they contain is available elsewhere, or evict them
to the backing storage device (remember those dirty pages?). As
memory usage increases even more and reaches another threshold - the
min watermark - an allocation will trigger `direct reclaim`. In this
case the allocation is stalled until enough memory pages are
reclaimed to satisfy the request.
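
The activity of both reclaim paths is reflected in the counters
exported through ``/proc/vmstat``. A small sketch that prints the
page scan and page steal counters (the exact counter names vary
between kernel versions):

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
       FILE *f = fopen("/proc/vmstat", "r");
       char line[128];

       if (!f) {
           perror("/proc/vmstat");
           return 1;
       }

       /* pgscan_kswapd reflects background reclaim, pgscan_direct stalls. */
       while (fgets(line, sizeof(line), f))
           if (!strncmp(line, "pgscan_", 7) || !strncmp(line, "pgsteal_", 8))
               fputs(line, stdout);

       fclose(f);
       return 0;
   }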

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes
it is necessary to allocate large physically contiguous memory
areas. Such a need may arise, for instance, when a device driver
requires a large buffer for DMA, or when THP allocates a huge
page. Memory `compaction` addresses the fragmentation issue. This
mechanism moves occupied pages from the lower part of a memory zone
to free pages in the upper part of the zone. When a compaction scan
is finished, the free pages are grouped together at the beginning of
the zone and allocations of large physically contiguous areas become
possible.

Like reclaim, the compaction may happen asynchronously in the
``kcompactd`` daemon or synchronously as a result of a memory
allocation request.
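
On kernels built with ``CONFIG_COMPACTION``, an administrator can
also trigger compaction of all zones explicitly by writing to
``/proc/sys/vm/compact_memory``. A minimal sketch (root privileges
are required):

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
       /* Requires root and a kernel built with CONFIG_COMPACTION. */
       FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

       if (!f) {
           perror("/proc/sys/vm/compact_memory");
           return 1;
       }

       fputs("1", f);  /* any value triggers compaction of all zones */
       return fclose(f) ? 1 : 0;
   }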

OOM killer
==========

It may happen that, on a loaded machine, memory will be
exhausted. When the kernel detects that the system has run out of
memory (OOM), it invokes the `OOM killer`. Its mission is simple: all
it has to do is to select a task to sacrifice for the sake of the
overall system health. The selected task is killed in the hope that
after it exits enough memory will be freed to continue normal
operation.
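
User space can influence which task is chosen through
``/proc/<pid>/oom_score_adj``, which accepts values from -1000 (never
select this task) to 1000 (prefer this task). A small sketch that
makes the calling process a preferred victim:

.. code-block:: c

   #include <stdio.h>

   int main(void)
   {
       /* Values range from -1000 (exempt) to 1000 (preferred victim). */
       FILE *f = fopen("/proc/self/oom_score_adj", "w");

       if (!f) {
           perror("/proc/self/oom_score_adj");
           return 1;
       }

       fputs("500", f);  /* make this process more likely to be chosen */
       return fclose(f) ? 1 : 0;
   }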