Swap is useless
False, even if a system has a lot of memory, swap allows for better memory
utilisation by swapping out allocated but rarely used anonymous memory pages.
Swap is going to slow down your system by its mere presence
False, as long as the system has enough memory, there will be very little or no swap-related I/O, so there is no slowdown.
It is really bad if you have some memory swapped out 1
False, the kernel swapped out some unused pages, and the memory can be
allocated for something more useful, like the file cache.
Swap is going to wear out your SSD 2
False, as long as there is no swap-related I/O, there is no wearing out of the
SSD. And modern SSDs have enough resources to handle swap-related I/O anyway.
Swap is an emergency solution for out-of-memory conditions
False, once your working set exceeds actual physical memory, swap makes things
worse, causing swap thrashing and delaying the OOM trigger.
Swap allows running workloads that exceed the system’s physical memory
False, once your active working set exceeds actual physical memory, the performance degradation is exponential. If we assume a 10³ (1000x) difference in latency between RAM and SSD with random 4K page access, then we can calculate that a 0.1% excess causes a 2x degradation, a 1% excess a 10x degradation, and a 10% excess a 100x degradation.
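As a rough back-of-the-envelope sketch of that arithmetic (the 1000x latency ratio is the assumption from above, not a measured number):

```python
# Toy model: a fraction `excess` of page accesses miss RAM and are served
# from swap, each costing ~1000x more than a RAM access (assumption above).
def slowdown(excess, latency_ratio=1000):
    """Average access cost relative to an all-in-RAM workload."""
    return (1 - excess) + excess * latency_ratio

for excess in (0.001, 0.01, 0.10):
    print(f"{excess:.1%} of accesses go to swap -> ~{slowdown(excess):.0f}x slower")
# 0.1% of accesses go to swap -> ~2x slower
# 1.0% of accesses go to swap -> ~11x slower
# 10.0% of accesses go to swap -> ~101x slower
```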
Swap causes gradual performance degradation
Under stable workloads, as long as the active working set fits in physical memory, swap improves performance by freeing up unused memory. Once the working set exceeds physical memory, performance degradation is exponential and the system becomes unresponsive very quickly.
On a desktop system, when an inactive application gets swapped out, switching back to it would feel slow.
Kernel evicts program executable pages, making the system unresponsive.
Active executable pages are explicitly excluded from reclaim. The kernel reclaims inactive file cache and inactive anonymous pages first, proportionally to the vm.swappiness setting. Then it starts cannibalizing active file cache and anonymous pages, but executable pages are explicitly excluded from that. The system becomes unresponsive due to I/O starvation, because all file cache is dropped. Also, when free memory in a zone drops below the low watermark, the kernel starts background reclaim via kswapd; if it drops further, below the min watermark, any task that needs a free memory page goes into direct reclaim and is blocked until free pages are found.
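If you want to check whether a system is actually in that state, the swap and reclaim counters in /proc/vmstat are a reasonable place to look. A minimal sketch (counter names can differ slightly between kernel versions):

```python
# Minimal sketch: sample swap and reclaim counters from /proc/vmstat over
# a few seconds. Counter names can differ between kernel versions, so
# missing keys are simply skipped.
import time

WATCH = ("pswpin", "pswpout", "pgmajfault", "pgscan_kswapd", "pgscan_direct")

def read_vmstat():
    with open("/proc/vmstat") as f:
        return {key: int(value) for key, value in (line.split() for line in f)}

before = read_vmstat()
time.sleep(5)
after = read_vmstat()

for key in WATCH:
    if key in before and key in after:
        # Growing pgscan_direct means tasks are stalling in direct reclaim;
        # growing pswpin/pswpout means real swap I/O is happening.
        print(f"{key}: +{after[key] - before[key]} in 5s")
```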
Swap size should be double the amount of physical memory.3
False, unless the system has megabytes of memory instead of gigabytes. If you
allocate more than a few GB of swap, you are in for a long swap-thrashing
session when you run out of memory, before OOM gets triggered.
The proper rule of thumb is to make the swap large enough to hold all inactive anonymous pages after the workload has stabilized, but not so large that it causes swap thrashing and a delayed OOM kill if a fast memory leak happens. For an average system with 2 GB - 128 GB of RAM, start by adding a 256 MB swap file, and if it fills up entirely, increase it by another 256 MB. The step can be increased to 512 MB - 1 GB on larger systems with faster storage.
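One way to eyeball the "inactive anonymous pages" part of that rule of thumb is /proc/meminfo, keeping in mind (see the last section of this post) that the active/inactive numbers are only estimates. A sketch:

```python
# Sketch: compare current swap provisioning with inactive anonymous memory.
# The /proc/meminfo numbers are estimates, so treat the output as a hint.
def meminfo_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are in kB
    return info

m = meminfo_kb()
inactive_anon_mb = m.get("Inactive(anon)", 0) // 1024
swap_total_mb = m.get("SwapTotal", 0) // 1024
swap_used_mb = (m.get("SwapTotal", 0) - m.get("SwapFree", 0)) // 1024

print(f"Inactive(anon): {inactive_anon_mb} MB")
print(f"Swap used:      {swap_used_mb} of {swap_total_mb} MB")
if swap_total_mb and swap_used_mb >= 0.95 * swap_total_mb:
    print("Swap is nearly full -> consider growing it by another 256 MB step")
```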
Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for vm.swappiness=40. 4
False. Before the introduction of the split-LRU design in kernel version 2.6.28
in 2008, there used to be a different algorithm that used the percentage of allocated memory, but it was more complicated: with vm.swappiness=40 it wouldn't start swapping even if all memory was allocated by processes, and with the default vm.swappiness=60 it would start swapping at 80% memory allocation. This algorithm is no longer in use.
Swap aggressiveness is configured using vm.swappiness and it is linear between 0 and 100 5
False. vm.swappiness was first described in the kernel documentation in
2009 with the following text:
This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.
It doesn’t say that the relation between vm.swappiness and aggressiveness is
linear, but people made assumptions.
This description is still present in some texts on kernel.org (the file
isn't present in the kernel tree anymore and hasn't been updated since 2019).
The documentation was updated in 2020 with a more appropriate description, and values up to 200 were allowed.
With vm.swappiness=0 kernel won’t swap
False, if the kernel hits the low watermark in any zone, then it is going to swap anyway.
With vm.swappiness=100 kernel is going to swap out everything from memory right away
False, if there is no memory pressure, the kernel isn’t going to swap anything.
vm.swappiness=60 is too aggressive 6
False, the vm.swappiness value 60 means that anon_prio is assigned the
value of 60 and file_prio the value of 200 - 60 = 140. The resulting ratio
140/60 means that the kernel would evict 2.33 times more pages from the
page cache than it swaps out anonymous pages.
The default value of 60 was chosen with the assumption that the file I/O
operations, which tend to be sequential, are more effective than random swap
I/O, but this applies to rotating media like HDDs only. For SSDs,
vm.swappiness=100 is more appropriate.
As the documentation states:
For in-memory swap, like zram or zswap, as well as hybrid setups that have
swap on faster devices than the filesystem, values beyond 100 can be
considered.
vm.swappiness=10 is just the right setting and makes your system fast
This value gives a 19x preference for discarding page cache over swapping out
anonymous pages. Your system is going to keep a lot of unused anon pages sitting
in RAM while churning through file cache pages, making memory use less effective.
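A tiny sketch of that anon_prio/file_prio arithmetic for the values discussed above (the real reclaim code also scales these priorities by recent scan and rotation history, so this is only the static part):

```python
# Static part of the split-LRU balance: anon_prio = vm.swappiness,
# file_prio = 200 - vm.swappiness. The actual reclaim code also scales
# these by recent scan/rotation history, so this is a first approximation.
def file_to_anon_ratio(swappiness):
    anon_prio = swappiness
    file_prio = 200 - swappiness
    return file_prio / anon_prio

for s in (10, 60, 100):
    print(f"vm.swappiness={s}: ~{file_to_anon_ratio(s):.2f}x preference for "
          f"reclaiming file cache over swapping out anon pages")
# vm.swappiness=10:  ~19.00x
# vm.swappiness=60:  ~2.33x
# vm.swappiness=100: ~1.00x
```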
Swap won’t happen if there is some free RAM.
False. If a process runs within a cgroup with defined memory limits, it can be
swapped out, even though the system still has a lot of free memory. Swap and
OOM can also be triggered due to memory fragmentation when high-order
allocations fail, even though there are a lot of free low-order pages.
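As an illustration of the cgroup case, a sketch using the cgroup v2 memory controller; it assumes cgroup v2 is mounted at /sys/fs/cgroup with the memory controller enabled in the root's subtree_control, requires root, and the group name "demo" is made up for this example:

```python
# Hypothetical illustration (cgroup v2, run as root): the group name "demo"
# is made up, /sys/fs/cgroup is the usual mount point, and the memory
# controller is assumed to be enabled in the root cgroup's subtree_control.
import os
from pathlib import Path

cg = Path("/sys/fs/cgroup/demo")
cg.mkdir(exist_ok=True)
(cg / "memory.high").write_text("256M\n")  # reclaim this group above 256 MB
(cg / "memory.max").write_text("512M\n")   # hard limit; cgroup OOM above this
(cg / "cgroup.procs").write_text(f"{os.getpid()}\n")

# Anything this process allocates beyond memory.high is now a reclaim target,
# so its anonymous pages can be swapped out even with gigabytes of free RAM.
```

On systemd machines the same effect can be had without hand-editing cgroup files, e.g. with systemd-run -p MemoryHigh=256M -p MemoryMax=512M.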
Swap happens just randomly, when the kernel has nothing to do
False. Swap happens when memory allocation brings the number of free memory
pages below the low watermark specified for a memory zone. See /proc/zoneinfo
and this question on Unix.StackExchange.
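A quick way to look at those per-zone watermarks is to parse /proc/zoneinfo; a sketch (field layout as in current kernels):

```python
# Sketch: print free pages vs. the min/low/high watermarks for each zone
# from /proc/zoneinfo. kswapd starts reclaiming (and swapping) when free
# pages drop below "low"; falling below "min" forces direct reclaim.
import re

def show(zone, values):
    if zone:
        print(f"{zone}: {values}")

zone, values = None, {}
with open("/proc/zoneinfo") as f:
    for line in f:
        match = re.match(r"Node (\d+), zone\s+(\w+)", line)
        if match:
            show(zone, values)
            zone, values = f"node{match.group(1)}/{match.group(2)}", {}
            continue
        parts = line.split()
        if parts[:2] == ["pages", "free"]:
            values["free"] = int(parts[2])
        elif len(parts) == 2 and parts[0] in ("min", "low", "high"):
            values[parts[0]] = int(parts[1])
show(zone, values)
```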
Swapping over NFS is a good idea. 7
False. It is very slow, and any packet lost/delayed on the network would cause the system to hang.
OOM won’t trigger if there is swap enabled. 8
False. OOM is triggered regardless of swap being enabled or disabled, full or empty.
OOM won’t trigger if there is some free RAM.
False. Swap and OOM can be triggered due to memory fragmentation when
high-order allocation fails, even though there are a lot of free low-order
pages. 9
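To see the fragmentation behind that, /proc/buddyinfo lists how many free blocks of each order remain per zone; a sketch:

```python
# Sketch: /proc/buddyinfo shows free block counts per order (order 0 = 4 KB,
# order 10 = 4 MB on x86-64). Plenty of free low-order blocks with empty
# high-order columns means the memory is there, but fragmented.
with open("/proc/buddyinfo") as f:
    for line in f:
        parts = line.split()        # "Node 0, zone Normal <order0> ... <order10>"
        node, zone = parts[1].rstrip(","), parts[3]
        counts = [int(x) for x in parts[4:]]
        free_pages = sum(count << order for order, count in enumerate(counts))
        print(f"node {node} zone {zone}: ~{free_pages} free pages, "
              f"per-order free blocks {counts}")
```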
OOM kills a random process.
The current Linux kernel just kills the process with the largest RSS+swap usage
(with a per-process OOM score adjustable through /proc). In v5.1 (2019), it dropped the heuristic of preferring to sacrifice a child instead of the parent. In v4.17 (2018), CAP_SYS_ADMIN processes lost their 3% bonus. Before v2.6.36 (2010), it used to be much more complicated and involved factors like forking, process runtime, and nice values, but at least this is what is described in the current man 5 proc. Enabling the vm.oom_kill_allocating_task sysctl, however, can result in killing an effectively random process, because any process can happen to be the one trying to allocate memory and failing when OOM triggers.
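You can see the kernel's current ranking yourself by reading /proc/<pid>/oom_score, the badness value derived from RSS+swap and oom_score_adj; a sketch:

```python
# Sketch: rank processes by the kernel's own badness value from
# /proc/<pid>/oom_score. The highest score is the most likely OOM victim;
# /proc/<pid>/oom_score_adj shifts the score per process.
import os

candidates = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/oom_score") as f:
            score = int(f.read())
        with open(f"/proc/{pid}/comm") as f:
            name = f.read().strip()
        candidates.append((score, pid, name))
    except OSError:  # process exited or permission denied
        continue

for score, pid, name in sorted(candidates, reverse=True)[:10]:
    print(f"{score:6d}  {pid:>7}  {name}")
```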
You can predict OOM. 10
People sometimes assume that you can get metrics from /proc/meminfo and elsewhere, make some calculations, and predict OOM. Even Kubernetes does some naive calculations trying to determine the working set size.
But you can't predict OOM. The kernel itself can't predict OOM. There is no precise information readily available to make this prediction. The kernel doesn't know how much memory is reclaimable. It doesn't know the exact working set size or how much memory is active and inactive, despite having appropriate fields in /proc/meminfo (see another blog post for details). The hardware doesn't provide this information, and it is too expensive to track in the kernel from a performance point of view. The kernel has to go through the reclaim process and check whether each memory page has the Accessed flag set before reclaiming it. Only after failing the reclaim process multiple times does the kernel invoke OOM.
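For illustration, here is the kind of naive estimate such tools compute from /proc/meminfo, next to the kernel's own MemAvailable heuristic; both are estimates, and neither of them predicts OOM:

```python
# Sketch: two common "available memory" estimates from /proc/meminfo.
# Neither predicts OOM; both are heuristics about how much of the cache
# and inactive memory is actually reclaimable.
def meminfo_kb():
    with open("/proc/meminfo") as f:
        return {line.split(":")[0]: int(line.split()[1]) for line in f}

m = meminfo_kb()
naive_kb = m["MemFree"] + m["Buffers"] + m["Cached"]  # "free plus cache"
kernel_kb = m.get("MemAvailable", 0)                  # kernel's own heuristic

print(f"naive estimate: {naive_kb // 1024} MB")
print(f"MemAvailable:   {kernel_kb // 1024} MB")
```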