Swap is useless
False, even if a system has a lot of memory, swap allows for better memory
utilisation by swapping out allocated but rarely used anonymous memory pages.
Swap is going to slow down your system by its mere presence
False, as long as the system has enough memory, there will be very little or no swap-related I/O, so there is no slowdown.
It is really bad if you have some memory swapped out 1
False, the kernel swapped out some unused pages, and the memory can be
allocated for something more useful, like the file cache.
Swap is going to wear out your SSD 2
False, as long as there is no swap-related I/O, there is no wearing out of the
SSD. And modern SSDs have enough resources to handle swap-related I/O anyway.
Swap is an emergency solution for out-of-memory conditions
False, once your working set exceeds actual physical memory, swap makes things
worse, causing swap thrashing and delaying the OOM trigger.
Swap allows running workloads that exceed the system’s physical memory
False, once your active working set exceeds actual physical memory, the performance degradation is exponential. If we assume a 10³ (1000x) difference in latency between RAM and SSD with random 4K page access, then we can calculate that a 0.1% excess causes a 2x degradation, a 1% excess a 10x degradation, and a 10% excess a 100x degradation.
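As a rough back-of-the-envelope sketch of that arithmetic (the 1000x latency ratio is the assumption from above, not a measured number):

```python
# Toy model: a fraction `excess` of page accesses miss RAM and are served
# from swap, each costing ~1000x more than a RAM access (assumption above).
def slowdown(excess, latency_ratio=1000):
    """Average access cost relative to an all-in-RAM workload."""
    return (1 - excess) + excess * latency_ratio

for excess in (0.001, 0.01, 0.10):
    print(f"{excess:.1%} of accesses go to swap -> ~{slowdown(excess):.0f}x slower")
# 0.1% of accesses go to swap -> ~2x slower
# 1.0% of accesses go to swap -> ~11x slower
# 10.0% of accesses go to swap -> ~101x slower
```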
Swap causes gradual performance degradation
Under stable workloads, as long as the active working set fits in physical memory, swap improves performance by freeing up unused memory. Once the working set exceeds physical memory, performance degradation is exponential and the system becomes unresponsive very quickly.
On a desktop system, when an inactive application gets swapped out, switching back to it would feel slow.
Kernel evicts program executable pages, making the system unresponsive.
Active executable pages are explicitly excluded from reclaim. The kernel reclaims inactive file cache and inactive anonymous pages first, proportionally to the vm.swappiness setting. Then it starts cannibalizing active file cache and anonymous pages, but executable pages are explicitly excluded from that. The system becomes unresponsive due to I/O starvation, because all file cache is dropped. Also, when free memory in a zone drops below the low watermark, the kernel starts background reclaim via kswapd; if it drops further, below the min watermark, any task that needs a free memory page goes into direct reclaim and is blocked until free pages are found.
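If you want to check whether a system is actually in that state, the swap and reclaim counters in /proc/vmstat are a reasonable place to look. A minimal sketch (counter names can differ slightly between kernel versions):

```python
# Minimal sketch: sample swap and reclaim counters from /proc/vmstat over
# a few seconds. Counter names can differ between kernel versions, so
# missing keys are simply skipped.
import time

WATCH = ("pswpin", "pswpout", "pgmajfault", "pgscan_kswapd", "pgscan_direct")

def read_vmstat():
    with open("/proc/vmstat") as f:
        return {key: int(value) for key, value in (line.split() for line in f)}

before = read_vmstat()
time.sleep(5)
after = read_vmstat()

for key in WATCH:
    if key in before and key in after:
        # Growing pgscan_direct means tasks are stalling in direct reclaim;
        # growing pswpin/pswpout means real swap I/O is happening.
        print(f"{key}: +{after[key] - before[key]} in 5s")
```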
Swap size should be double the amount of physical memory.3
False, unless the system has megabytes of memory instead of gigabytes. If you
allocate more than a few GB of swap, you are in for a long swap-thrashing
session when you run out of memory, before OOM gets triggered.
The proper rule of thumb is to make the swap large enough to hold all inactive anonymous pages after the workload has stabilized, but not so large that it causes swap thrashing and a delayed OOM kill if a fast memory leak happens. For an average system with 2 GB - 128 GB of RAM, start by adding a 256 MB swap file, and if it fills up entirely, increase it by another 256 MB. The step can be increased to 512 MB - 1 GB on larger systems with faster storage.
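One way to eyeball the "inactive anonymous pages" part of that rule of thumb is /proc/meminfo, keeping in mind (see the last section of this post) that the active/inactive numbers are only estimates. A sketch:

```python
# Sketch: compare current swap provisioning with inactive anonymous memory.
# The /proc/meminfo numbers are estimates, so treat the output as a hint.
def meminfo_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are in kB
    return info

m = meminfo_kb()
inactive_anon_mb = m.get("Inactive(anon)", 0) // 1024
swap_total_mb = m.get("SwapTotal", 0) // 1024
swap_used_mb = (m.get("SwapTotal", 0) - m.get("SwapFree", 0)) // 1024

print(f"Inactive(anon): {inactive_anon_mb} MB")
print(f"Swap used:      {swap_used_mb} of {swap_total_mb} MB")
if swap_total_mb and swap_used_mb >= 0.95 * swap_total_mb:
    print("Swap is nearly full -> consider growing it by another 256 MB step")
```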
Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for vm.swappiness=40. 4
False. Before the introduction of the split-LRU design in kernel version 2.6.28
in 2008, there used to be a different algorithm that used the percentage of allocated memory, but it was more complicated: with vm.swappiness=40 it wouldn't start swapping even if all memory was allocated by processes, and with the default vm.swappiness=60 it would start swapping at 80% memory allocation. This algorithm is no longer in use.
Swap aggressiveness is configured using vm.swappiness and it is linear between 0 and 100 5
False. vm.swappiness was first described in the kernel documentation in
2009 with the following text:
This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.
It doesn’t say that the relation between vm.swappiness and aggressiveness is
linear, but people made assumptions.
This description is still present in some texts on kernel.org (the file
isn't present in the kernel tree anymore and hasn't been updated since 2019).
The documentation was updated in 2020 with a more appropriate description, and values up to 200 were allowed.
With vm.swappiness=0 kernel won’t swap
False, if the kernel hits the low watermark in any zone, then it is going to swap anyway.
With vm.swappiness=100 kernel is going to swap out everything from memory right away
False, if there is no memory pressure, the kernel isn’t going to swap anything.
vm.swappiness=60 is too aggressive 6
False, the vm.swappiness value 60 means that anon_prio is assigned the
value of 60 and file_prio the value of 200 - 60 = 140. The resulting ratio
140/60 means that the kernel would evict 2.33 times more pages from the
page cache than it swaps out anonymous pages.
The default value of 60 was chosen with the assumption that the file I/O
operations, which tend to be sequential, are more effective than random swap
I/O, but this applies to rotating media like HDDs only. For SSDs,
vm.swappiness=100 is more appropriate.
As the documentation states:
For in-memory swap, like zram or zswap, as well as hybrid setups that have
swap on faster devices than the filesystem, values beyond 100 can be
considered.
vm.swappiness=10 is just the right setting and makes your system fast
This value gives a 19x preference for discarding page cache over swapping out
anonymous pages. Your system is going to keep a lot of unused anon pages sitting
in RAM while churning through file cache pages, making memory use less effective.
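A tiny sketch of that anon_prio/file_prio arithmetic for the values discussed above (the real reclaim code also scales these priorities by recent scan and rotation history, so this is only the static part):

```python
# Static part of the split-LRU balance: anon_prio = vm.swappiness,
# file_prio = 200 - vm.swappiness. The actual reclaim code also scales
# these by recent scan/rotation history, so this is a first approximation.
def file_to_anon_ratio(swappiness):
    anon_prio = swappiness
    file_prio = 200 - swappiness
    return file_prio / anon_prio

for s in (10, 60, 100):
    print(f"vm.swappiness={s}: ~{file_to_anon_ratio(s):.2f}x preference for "
          f"reclaiming file cache over swapping out anon pages")
# vm.swappiness=10:  ~19.00x
# vm.swappiness=60:  ~2.33x
# vm.swappiness=100: ~1.00x
```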
Swap won’t happen if there is some free RAM.
False. If a process runs within a cgroup with defined memory limits, it can be
swapped out, even though the system still has a lot of free memory. Swap and
OOM can also be triggered due to memory fragmentation when high-order
allocations fail, even though there are a lot of free low-order pages.
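As an illustration of the cgroup case, a sketch using the cgroup v2 memory controller; it assumes cgroup v2 is mounted at /sys/fs/cgroup with the memory controller enabled in the root's subtree_control, requires root, and the group name "demo" is made up for this example:

```python
# Hypothetical illustration (cgroup v2, run as root): the group name "demo"
# is made up, /sys/fs/cgroup is the usual mount point, and the memory
# controller is assumed to be enabled in the root cgroup's subtree_control.
import os
from pathlib import Path

cg = Path("/sys/fs/cgroup/demo")
cg.mkdir(exist_ok=True)
(cg / "memory.high").write_text("256M\n")  # reclaim this group above 256 MB
(cg / "memory.max").write_text("512M\n")   # hard limit; cgroup OOM above this
(cg / "cgroup.procs").write_text(f"{os.getpid()}\n")

# Anything this process allocates beyond memory.high is now a reclaim target,
# so its anonymous pages can be swapped out even with gigabytes of free RAM.
```

On systemd machines the same effect can be had without hand-editing cgroup files, e.g. with systemd-run -p MemoryHigh=256M -p MemoryMax=512M.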
Swap happens just randomly, when the kernel has nothing to do
False. Swap happens when memory allocation brings the number of free memory
pages below the low watermark specified for a memory zone. See /proc/zoneinfo
and this question on Unix.StackExchange.
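A quick way to look at those per-zone watermarks is to parse /proc/zoneinfo; a sketch (field layout as in current kernels):

```python
# Sketch: print free pages vs. the min/low/high watermarks for each zone
# from /proc/zoneinfo. kswapd starts reclaiming (and swapping) when free
# pages drop below "low"; falling below "min" forces direct reclaim.
import re

def show(zone, values):
    if zone:
        print(f"{zone}: {values}")

zone, values = None, {}
with open("/proc/zoneinfo") as f:
    for line in f:
        match = re.match(r"Node (\d+), zone\s+(\w+)", line)
        if match:
            show(zone, values)
            zone, values = f"node{match.group(1)}/{match.group(2)}", {}
            continue
        parts = line.split()
        if parts[:2] == ["pages", "free"]:
            values["free"] = int(parts[2])
        elif len(parts) == 2 and parts[0] in ("min", "low", "high"):
            values[parts[0]] = int(parts[1])
show(zone, values)
```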
Swapping over NFS is a good idea. 7
False. It is very slow, and any packet lost/delayed on the network would cause the system to hang.
OOM won’t trigger if there is swap enabled. 8
False. OOM is triggered regardless of swap being enabled or disabled, full or empty.
OOM won’t trigger if there is some free RAM.
False. Swap and OOM can be triggered due to memory fragmentation when
high-order allocation fails, even though there are a lot of free low-order
pages. 9
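To see the fragmentation behind that, /proc/buddyinfo lists how many free blocks of each order remain per zone; a sketch:

```python
# Sketch: /proc/buddyinfo shows free block counts per order (order 0 = 4 KB,
# order 10 = 4 MB on x86-64). Plenty of free low-order blocks with empty
# high-order columns means the memory is there, but fragmented.
with open("/proc/buddyinfo") as f:
    for line in f:
        parts = line.split()        # "Node 0, zone Normal <order0> ... <order10>"
        node, zone = parts[1].rstrip(","), parts[3]
        counts = [int(x) for x in parts[4:]]
        free_pages = sum(count << order for order, count in enumerate(counts))
        print(f"node {node} zone {zone}: ~{free_pages} free pages, "
              f"per-order free blocks {counts}")
```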
OOM kills a random process.
The current Linux kernel just kills the process with the largest RSS+swap usage
(with a per-process OOM score adjustable through /proc). In v5.1 (2019), it dropped the heuristic of preferring to sacrifice a child instead of the parent. In v4.17 (2018), CAP_SYS_ADMIN processes lost their 3% bonus. Before v2.6.36 (2010), it used to be much more complicated and involved factors like forking, process runtime, and nice values, but at least this is what is described in the current man 5 proc. Enabling the vm.oom_kill_allocating_task sysctl, however, can result in killing an effectively random process, because any process can happen to be the one trying to allocate memory and failing when OOM triggers.
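You can see the kernel's current ranking yourself by reading /proc/<pid>/oom_score, the badness value derived from RSS+swap and oom_score_adj; a sketch:

```python
# Sketch: rank processes by the kernel's own badness value from
# /proc/<pid>/oom_score. The highest score is the most likely OOM victim;
# /proc/<pid>/oom_score_adj shifts the score per process.
import os

candidates = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/oom_score") as f:
            score = int(f.read())
        with open(f"/proc/{pid}/comm") as f:
            name = f.read().strip()
        candidates.append((score, pid, name))
    except OSError:  # process exited or permission denied
        continue

for score, pid, name in sorted(candidates, reverse=True)[:10]:
    print(f"{score:6d}  {pid:>7}  {name}")
```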
You can predict OOM. 10
People sometimes assume that you can get metrics from /proc/meminfo and elsewhere, make some calculations, and predict OOM. Even Kubernetes does some naive calculations trying to determine the working set size.
But you can't predict OOM. The kernel itself can't predict OOM. There is no precise information readily available to make this prediction. The kernel doesn't know how much memory is reclaimable. It doesn't know the exact working set size or how much memory is active and inactive, despite having appropriate fields in /proc/meminfo (see another blog post for details). The hardware doesn't provide this information, and it is too expensive to track in the kernel from a performance point of view. The kernel has to go through the reclaim process and check whether each memory page has the Accessed flag set before reclaiming it. Only after failing the reclaim process multiple times does the kernel invoke OOM.
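For illustration, here is the kind of naive estimate such tools compute from /proc/meminfo, next to the kernel's own MemAvailable heuristic; both are estimates, and neither of them predicts OOM:

```python
# Sketch: two common "available memory" estimates from /proc/meminfo.
# Neither predicts OOM; both are heuristics about how much of the cache
# and inactive memory is actually reclaimable.
def meminfo_kb():
    with open("/proc/meminfo") as f:
        return {line.split(":")[0]: int(line.split()[1]) for line in f}

m = meminfo_kb()
naive_kb = m["MemFree"] + m["Buffers"] + m["Cached"]  # "free plus cache"
kernel_kb = m.get("MemAvailable", 0)                  # kernel's own heuristic

print(f"naive estimate: {naive_kb // 1024} MB")
print(f"MemAvailable:   {kernel_kb // 1024} MB")
```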