Linux Page Reclaim and OOM in a Cgroup

Sep 30th, 2025

Page reclaim is triggered when the kernel tries to allocate a page, but the charge would exceed memory.current > memory.high or memory.current > memory.max.

If the kernel is unable to reclaim enough pages when memory.current > memory.max, then the OOM killer is invoked, and by default, the largest process in the cgroup is killed. The kernel does 16 attempts to reclaim the pages before invoking the OOM killer.

If the kernel is unable to reclaim enough pages when memory.current > memory.high, then the OOM killer is never invoked, but the allocating process (not the entire cgroup) is throttled proportionally to the number of pages above memory.high limit.

With the classic split-LRU, the reclaim process checks pages at the tails of the inactive file LRU and, if swap is enabled, of the inactive anon LRU.

If a page doesn’t have its “accessed” bit set by the CPU, then it is considered inactive and can be reclaimed.
If a page in the inactive file LRU isn’t dirty and doesn’t need to be written back to disk, then it is discarded and reclaimed. If it is dirty, then it is scheduled for writeback but can’t be reclaimed yet in this reclaim pass.
If a page in the inactive anon LRU already has a copy in swap (it was swapped in but wasn’t modified by the process), then it is discarded; otherwise, it is scheduled for swap out.

The kernel scans the inactive file and the inactive anon LRUs proportionally to vm.swappiness settings. If the kernel needs to reclaim 200 pages and vm.swappiness has the default 60 value, the kernel tries to reclaim 60 pages from the inactive anon LRU and 140 pages from the inactive file LRU. If the OOM is about to be invoked, then this proportion is ignored and the kernel reclaims whatever it can get.

The swap usage for a cgroup is controlled with memory.swap.max and memory.swap.high limits. memory.swap.max is the hard limit for swap usage and exceeding memory.swap.high causes throttling. Default values are max, so the limits are disabled.

When inactive LRUs are reduced by the reclaim process below a specific proportion between active and inactive LRU size, then the kernel shrinks an active LRU by moving pages from its tail almost unconditionally to the corresponding inactive LRU to restore the minimal proportion.

Originally published as an answer on serverfault.com.

MariaDB Gets Unexpected OOM

Aug 30th, 2025

MariaDB has recently surprised me by getting itself OOM-killed, even though the VM had a couple of GB of spare memory (as I thought). At first, I suspected it was some kind of memory spike, but the VM memory graph showed that it had been sitting just a few MB from the memory limit for some time before the OOM. That was unexpected, too. Started digging around and found that pmm-agent ate 1GB of RAM, but it is still not enough. Finally, checked the OOM task dump in the logs, and, indeed, MariaDB was using about 25% more memory than my calculations estimated. The 11 GB InnoDB with the default other settings gave me an expected 12 GB, and even the memory_used system variable reported the same 12 GB, but the actual process RSS was 15 GB.

Digging through the web, mailing lists and documentation, I discovered that it is a known and documented problem with the default system malloc. The solution is to replace it with jemalloc or tcmalloc, which is also documented. In my case, memory consumption dropped by 40%.

Curiously, it survived without OOMs for months after the last InnoDB size change. As I discovered, MariaDB can monitor for memory pressure events through /proc/pressure/memory and, on detecting memory pressure, it drops all non-dirty pages from the InnoDB buffer pool and marks them with MADV_FREE. That hovering just a few MB from the memory limit was the result of this behaviour, as the kernel reclaimed only the small amount of pages. After the OOM, I added the MemoryHigh limit to the MariaDB systemd unit to give MariaDB a bit of early memory pressure, and was surprised that it wasn’t able to fill its InnoDB buffer. The logs indicated that the InnoDB buffer was reset multiple times per hour. That behaviour to drop almost the entire InnoDB buffer is a bit of an overreaction, and it was disabled by default in MariaDB 10.11.12 (not yet in the Bookworm).

Disappointments With AWS Database Migration Service and Macie

Jul 5th, 2025

AWS Database Migration Service is probably my worst experience with AWS services so far. Wasted half a day just trying to start a replication instance. It appeared to have been stuck in the Starting... state. There is no progress indicator, no logs. I was blindly changing security groups, adding VPC endpoints, subnets, tweaking IAM roles and attached policies, searching CloudWatch and CloudTrail. I read the entire sections on replication instances and troubleshooting in the AWS DMS documentation through and through, hoping to find any hints on how to turn on the logs for the instance creation.

Then I found an old GitHub issue for the Terraform AWS provider that suggested increasing default timeouts to 15 minutes. Given that the DMS replication instances use the same classes as EC2, I expected them to start within a few minutes and waited for 10 minutes before assuming that it hung. It actually took 17 minutes to start an instance in my case. The current default timeout for the aws_dms_replication_instance terraform resource is 40 minutes.

When I got the instance running, the fun didn’t end. First, it didn’t like the database password due to some special characters. The database accepted that password. The AWS Secrets Manager also accepted it. DMS - nope, there are some characters I can’t handle. Ok, no problem, just generated a long alphanumeric password. Then I tried to run ‘Pre-migration assessments’. One of the tests failed to run at all, but no error logs again. Try to run it again - the assessment fails to run at all. Ok, not important, move on. Try to run the migration task itself. It failed. CloudWatch logs contain a lot of output, but don’t tell why the task failed. Ok, when you click on the task status in the console, it shows that it ran out of memory. With a database that can fit entirely into the memory even on t3.micro. Ok, another 20 minutes of waiting to change the instance class. Try again, run out of memory again. Didn’t log the memory failure into CloudWatch logs. Another 20 minutes of waiting.

When I finally got the tables dumped into S3 in Parquet format and moved to the final goal of this adventure: run AWS Macie to detect sensitive information. Another disappointment - it detected only two entries, and one was a false positive. In the end, it was more efficient to review the SQL dump manually - I spent half the time I wasted on DMS and Macie and found more than 30 columns in different tables with sensitive information.

CrowdSec SQLite to MySQL Migration

Jun 30th, 2025

How to convert CrowdSec Local API database from SQLite to MySQL or MariaDB.

The official CrowdSec documentation doesn’t provide instructions on data migration from SQLite to MySQL or MariaDB, and expects the user to re-register all machines and bouncers, which is inconvenient.

Let’s try a straightforward approach: dump the SQLite database into plain SQL and import it into MariaDB (we are running Ubuntu 24.04 Noble).

We need to create a database for CrowdSec first. We can follow the instructions from the documentation:

mysql> CREATE DATABASE crowdsec;
mysql> CREATE USER 'crowdsec'@'%' IDENTIFIED BY '<password>';
mysql> GRANT ALL PRIVILEGES ON crowdsec.* TO 'crowdsec'@'%';
mysql> FLUSH PRIVILEGES;

We will need mysql and sqlite3 on the host where we have CrowdSec installed:

sudo apt install mariadb-client sqlite3

Check the location of the CrowdSec SQLite database file:

cscli config show --key "Config.DbConfig.DbPath"

Let’s try the dump-pipe-import:

# sqlite3 `cscli config show --key "Config.DbConfig.DbPath"` '.dump' |\
    mysql -h mysqlhost -u crowdsec -p crowdsec
Enter password:
--------------
PRAGMA foreign_keys=OFF
--------------

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'PRAGMA foreign_keys=OFF' at line 1

Unfortunately, SQLite and MySQL/MariaDB have incompatible syntax.

Let’s look for a solution. There is a Python tool, sqlite3-to-mysql, which does exactly what we need: transfer data from SQLite3 to MySQL.

apt install python3-venv
python3 -m venv .venv
source .venv/bin/activate
pip install sqlite3-to-mysql

Now we can convert the data, but stop the CrowdSec first, so we don’t get inconsistent data.

systemctl stop crowdsec

Now convert the data:

sqlite3mysql -f `cscli config show --key "Config.DbConfig.DbPath"` -h mysqlhost -d crowdsec  -u crowdsec -p

The data conversion runs without errors, and we get the data into MariaDB.

Now we can update the /etc/crowdsec/config.yaml to use MariaDB (See the documentation):

db_config:
  log_level: info
  # type: sqlite
  # db_path: /var/lib/crowdsec/data/crowdsec.db
  type: mysql
  db_name: crowdsec
  user: crowdsec
  password: <password>
  host: mysqlhost
  port: 3306

But when we try to start CrowdSec, it fails with a fatal error:

level=fatal msg="unable to create database client: failed creating schema resources: sql/schema: modify \"machines\" table: Error 1833: Cannot
 change column 'id': used in a foreign key constraint 'alerts_FK_0_0' of table 'crowdsec.alerts'"

There seems to be a difference in CrowdSec database schemas between SQLite and MySQL/MariaDB. Let’s try to work around that.

First, recreate the database to clear all the imported data.

DROP DATABASE crowdsec;
CREATE DATABASE crowdsec;

Start CrowdSec with systemctl start crowdsec. It is going to fail, but it will create the correct database schema. (Backup your local_api_credentials.yaml - it overwrote it with new credentials, but I can’t repeat this in a clean environment.)

Now we can try to convert the data again, but without creating the schema (option -K) this time (it also uses a hardcoded path to the SQLite database):

sqlite3mysql -K -f /var/lib/crowdsec/data/crowdsec.db -h mysqlhost -d crowdsec  -u crowdsec -p

Now we can start CrowdSec with systemctl start crowdsec, and it works without any issues.

Falsehoods People (and LLMs) Believe About Linux Swap and OOM.

May 15th, 2025

Swap is useless

False, even if a system has a lot of memory, swap allows for better memory utilisation by swapping out allocated but rarely used anonymous memory pages.

Swap is going to slow down your system by its mere presence

False, as long as the system has enough memory, there would be very little or no swap-related I/O, so there is no slowdown.

It is really bad if you have some memory swapped out ¹

False, the kernel swapped out some unused pages, and the memory can be allocated for something more useful, like file cache.

swap is going to wear out your SSD ²

False, as long as there is no swap-related I/O, there is no wearing out of the SSD. And modern SSDs have enough resources to handle swap-related I/O anyway.

Swap is an emergency solution for out-of-memory conditions

False, once your working set exceeds actual physical memory, swap makes things worse, causing swap thrashing and a slower OOM trigger.

Swap allows running workloads that exceed the system’s physical memory

False, once your active working set exceeds actual physical memory, you are for some swap thrashing.

Swap size should be double the amount of the physical memory.³

False. Unless the system has megabytes of memory instead of gigabytes. If you allocate more than a few GB of swap size, you are going for a long swap thrashing session when you run out of memory and before OOM gets triggered.

Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for `vm.swappiness=40`. ⁴

False. Before the introduction of the split-LRU design in kernel version 2.6.28 in 2008, there used to be a different algorithm that used the percentage of allocated memory, but it was more complicated and with the vm.swappiness=40, it wouldn’t start swapping even if all memory was allocated from processes and with the default vm.swappiness=60 it would start swapping at 80% memory allocation. This algorithm is no longer in use.

Swap aggressiveness is configured using vm.swappiness and it is linear between and 100 ⁵

False. vm.swappiness was first described in the kernel documentation in 2009 with the following text:

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

It doesn’t say that the relation between vm.swappiness and aggressiveness is linear but people made assumptions.

This description is still present in some texts on kernel.org (this file isn’t present in the kernel tree anymore, and it wasn’t updated since 2019).

The documentation was updated in 2020 to a more appropriate description and the values up to 200 were allowed.

With vm.swappiness=0 kernel won’t swap

False, if the kernel hits the high water mark in any zone, then it is going to swap anyway.

With vm.swappiness=100 kernel is going to swap out everything from memory right away

False, if there is no memory pressure, the kernel isn’t going to swap anything.

vm.swappiness=60 is too agressive ⁶

False, the vm.swappiness value 60 means that anon_prio is assigned the value of 60 and file_prio the value of 200 - 60 = 140. The resulting ratio 140/60 means that the kernel would evict 2.33 times more pages from the page cache than swap out anonymous pages.

The default value of 60 was chosen with the assumption that the file I/O operations, which tend to be sequential, are more effective than random swap I/O, but this applies to rotating media like HDDs only. For SSDs, vm.swappiness=100 is more appropriate.

As the documentation states:

For in-memory swap, like zram or zswap, as well as hybrid setups that have swap on faster devices than the filesystem, values beyond 100 can be considered

vm.swappiness=10 is just the right setting and makes your system fast

This value gives a ratio of 19 times preference for discarding page cache over swapping out. Your system is going to have a lot of unused anon pages sitting around while churning through file cache pages, making it less effective.

Swap won’t happen if there is some free RAM.

False. If a process runs within a cgroup with defined memory limits, it can be swapped out, even though the system still has a lot of free memory. Swap and OOM can also be triggered due to memory fragmentation when high-order allocation fails, even though there are a lot of free low-order pages.

Swap happens just randomly, when kernel has nothing to do

False. Swap happens when memory allocation brings the number of free memory pages below the low watermark specified for a memory zone. See /proc/zoneinfo and this question on Unix.StackExchange.

Swapping over NFS is a good idea. ⁷

False. It is very slow, and any packet lost/delayed on the network would cause the system to hang.

OOM won’t trigger if there is swap enabled. ⁸

False. OOM is triggered regardless of swap being enabled or disabled, full or empty.

OOM won’t trigger if there is some free RAM.

False. Swap and OOM can be triggered due to memory fragmentation when high-order allocation fails, even though there are a lot of free low-order pages. ⁹

OOM kills a random process.

The current Linux kernel just kills a process with the largest RSS+swap usage (with per-process OOM score adjustable through /proc). In v5.1 (2019) it dropped the heuristic to prefer to sacrifice a child instead of the parent, in v4.17 (2018) CAP_SYS_ADMIN processes lost their 3% bonus. Before v2.6.36 (2010) it used to be much more complicated and involved factors like forking, process runtime, nice values but at least this is described in the current man 5 proc. But enabling vm.oom_kill_allocating_task sysctl can cause killing a random process because the random process can be the last one trying to allocate memory and failing.

Linux Inactive Memory

May 13th, 2025

Since the introduction of the split-LRU design in version 2.6.28, the kernel maintains 4 (actually 5) LRU (least recently used) lists for memory pages:

Active anon pages
Active file pages
Inactive anon pages
Inactive file pages
(unevictabled pages)

The total sizes of these pages are reported in vmstat and /proc/meminfo

$ vmstat -s | head -5
    131669088 K total memory
     69020576 K used memory
     56007124 K active memory
     29691936 K inactive memory
      6140772 K free memory

$ cat /proc/meminfo | grep -i active
Active:         56007676 kB
Inactive:       29691936 kB
Active(anon):   34381776 kB
Inactive(anon): 16134108 kB
Active(file):   21625900 kB
Inactive(file): 13557828 kB

These lists are maintained for the purpose of the page reclaim. When the kernel needs to allocate some memory pages and there are not enough free pages, the kernel goes through the LRU lists, trying to find pages that can be reclaimed. The pages are organised in LRU lists so the kernel can reclaim these pages that weren’t used recently and are supposedly least likely to be used soon.

The kernel goes through both inactive anon and inactive file LRU lists, starting from their tails while maintaining the proportion specified by vm.swappiness between the reclaimed anon and file pages. With the default value vm.swappiness=60, when the kernel needs 200 pages, it is going to reclaim 60 pages from inactive anon LRU and 140 pages from inactive file LRU, but if there are not enough pages available, it won’t keep the balance and is going to reclaim whatever it finds.

When a file-backed page is reclaimed, it can be discarded right away if it isn’t modified, or it needs to be written back into the file. An anon page always needs to be written into a swap when reclaimed, so it is more expensive to reclaim anon pages.

So far, so good - we established the purpose of the separation of the memory into active/inactive, so the kernel knows which pages to reclaim.

Now the main question: how exactly does the kernel decide which page is active and which is not?

The problem is that the kernel can’t really track all memory access, and the hardware provides only one single bit of information - if the page was ever accessed. It doesn’t know when it was accessed and how many times. The kernel works around this by periodically scanning the pages and clearing the ‘Accessed’ bit and looking at this bit again on the next scan. But it doesn’t scan all the pages all the time. Only when the kernel needs some memory pages, it scans the tail of the inactive LRU, and it stops the scan when it finds enough pages. So, most of the pages don’t get activity information updated.

To illustrate this: a newly allocated page starts its life on the head of the inactive LRU with the ‘Accessed’ bit reset to 0. As the other new pages are allocated, our page is getting pushed back to the tail of the inactive LRU. If the page gets accessed by the user space program, the CPU sets the ‘Accessed’ bit to 1. The page can be accessed many times, but it is still considered inactive until the reclaim scan reaches this page position in the inactive LRU from the tail. Pages get promoted to the active LRU on the second (some are on the first) scan if they have the ‘Accessed’ bit set. If a page doesn’t get promoted to the active LRU, it gets reclaimed if it is at the tail end of the inactive LRU and the kernel needs some free pages. Once a page gets to the active LRU, it is considered active even if it isn’t accessed there at all. Until the active list grows too large or the inactive list gets too small by the reclaim, there is no change to the active list and pages in the active list are not scanned at all and more active pages are never moved to the head within the active LRU.

The kernel maintains a specific ratio between the active and inactive LRU, depending on the memory size (since v4.10):

 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB

When the kernel needs to grow the inactive LRU, it moves pages from the tail of the active LRU regardless of their activity status, except for file-backed executable pages, which get promoted back to the head of the active LRU.

Even though the design of active/inactive LRUs is to reference active and inactive pages, the kernel intentionally spends minimal effort to maintain these lists (for performance reasons), so the information about the pages in these lists is outdated most of the time for most of the pages.

What can we actually tell about the pages in the LRUs?

The page at the head of the active list was accessed just recently. It was accessed once or twice, or a million times.
The page at the tail of the active list was accessed some time ago when it was added to the head of the list. It could have been accessed a million times or zero times while being in the active list.
The page at the head of the inactive list was just allocated or it was just pushed out of the active list where it spent unknown time and has been accessed zero or million times.
The page in the middle of the inactive list was allocated some time ago, and we don’t know if it has been accessed yet. Until it is scanned, it can be accessed zero times or a million times.
The pages at the tail of the inactive list are only pages that have accurate information, but only when they are scanned. We also still don’t know if they were accessed more than once. We also don’t know if they are going to be accessed again, but at least this is expected.

So the more appropriate name for the ‘Inactive’ LRU would be ‘candidates for the reclaim’, and for ‘active’ - ‘not considered for the reclaim yet’. The numbers derived from the lengths of these LRUs and reported as Active/Inactive memory in /proc/meminfo have little relation to the actual working set size.

AI Fails With DevOps Tasks

Mar 1st, 2025

Did a quick evaluation and comparison between ChatGPT and Claude on my typical task to decide if the Claude subscription discount offer is worth it. All models failed miserably on a simple and straightforward task of creating a single Terraform resource. The caveat is that this resource was implemented relatively recently by AWS (in 2021) and in the Terraform AWS provider (10 months ago).

The initial question is simple: “How to add rDNS for AWS EIP with Terraform”. All models answered that Terraform doesn’t support it natively and offered a workaround with “local-exec” and call to “aws ec2 modify-address-attribute”. Claude gave the correct parameter “—domain-name”, both 4o and o3-mini-high hallucinated parameter names “—reverse-dns-name” and “—reverse-dns”.

Given a correction that Terraform does support this natively, the models started hallucinating by inventing or repurposing “aws_eip” resource attributes. 4o suggested using the “domain” attribute, which is not related to DNS. o3-mini-high invented “reverse_dns” block for “aws_eip” resource. Claude suggested assigning the “reverse_dns” attribute, which doesn’t exist. Interestingly, with web search enabled, 4o was able to find the correct “aws_eip_domain_name” resource. Both Claude and o3 went back to suggesting using “local-exec” and inventing random resource names like “aws_ec2_address”, “aws_ec2_address_attribute”, “aws_eip_reverse_dns”.

I have noticed that o3 is much more stubborn, and if it goes a wrong way, it is almost impossible to correct - a few weeks ago, it tried to correct me that MySQL 9 doesn’t exist. Not sure if the new Claude works the same way, but at least it is much more cheerful. Sill gave that subscription option a pass as there is no improvement, and these tasks are still too challenging for AIs.

One Infrastructure Migration Equals Two Disaster Recoveries

Feb 8th, 2025

To paraphrase a proverb, “One infrastructure migration equals two disaster recoveries” ¹. Well, I had to move not two but 25 services. Some are dockerised, some are running in legacy VMs managed with Chef, and one very old VM that used to be a dedicated server more than a decade ago, with a jumble of legacy half-abandoned websites and mail services.

While all the services are internal and not customer-facing, there are some critical for the team and infrastructure. To avoid much service disruption and downtime, I decided to move services one at a time so I could deploy and test the service on the new infrastructure, shut down the old one, re-sync the data and switch the DNS (almost forgot to lower DNS TTL before the move) with the option to roll-back to the old service if anything goes wrong. Beforehand, I created a table in Notion with all service resource requirements, SLAs, inter-dependencies, priorities and nice color labels. Had this table open in front of me for almost two months.

The initial infrastructure preparation was a clean greenfield - made it all modern, fresh and shiny. While I had time, I started with the most complex and important services, refreshed them, threw away a lot of organic-growth stuff, and re-wrote some Chef cookbooks to Ansible. When the deadline started approaching, it became a kind of a slog - move the service, search all the configuration spread for IPs to update², get the service working, repeat, preferably early in the morning. The team also dropped on me two other projects(including a major OS upgrade) with the same deadline to keep the pressure. Had a lot of overtime and even broke my anti-burnout rule of not working full-time on weekends once. Ironically, I managed to follow Parkinson’s law and finished by retiring the old infrastructure exactly at the deadline, the last day of the month.

But there are still a lot of small broken parts to fix and old dust and junk to clear. Still sorting out and prioritising my ToDo list as it added more than a hundred items during the migration. Two major things to re-work - remove as much as possible from the Chef attributes and re-organize packet filter rules management.

Also, had some interesting and surprising moments:

The first one was when the hosting provider moved an IP from one server to another, and I started seeing incoming traffic for the IP on both the old and the new servers simultaneously. I was monitoring the traffic with tcpdump on both servers to catch the moment when the IP switch was about to happen. At first, I suspected their network engineer somehow managed to mirror the traffic and even sent them a message, as I was clearly getting ‘host unreachable’ from the old server public IP, but I didn’t stop investigating. The IP wasn’t responding even locally, though I double-checked that it was assigned to a VM and that a route was configured properly. When I got a ‘host unreachable’ message from an internal IP from the old server, it all became clear. There was an IPSec tunnel between the old and the new servers, and the IP was still routed to the old server through the tunnel. The confusing part was that the packets coming in through the IPSec tunnel on the old server were seen as coming through normal ethernet. The irony is that I had already prepared IPSec reconfiguration, and it was the next step in my checklist.

The second was a chicken-and-egg problem. To move the Chef server, I needed to reconfigure load balancers, but the load balancers’ configuration was managed with Chef. Ended up changing the configuration manually.

The third was a little adventure with a new secondary server at the old hosting provider. They delayed its delivery for a month. When I got access to its console, it wasn’t even re-imaged with a fresh Debian; it was a used node from the provider’s cloud solution. Their netboot image was running Alpine version from 2019, and their netboot installer didn’t have anything newer than Debian Buster. Their netboot didn’t support UEFI boot with GPT, and I found it the hard way. It also explained why they failed to re-image the server. Re-partitioned the disks with MBR, installed Debian 10, and followed with two immediate upgrades to Bullseye and Bookworm. Almost entire day wasted doing someone else work. And then I had to wait 24h for them to re-route an IP to this server. Oh, and the disks had 50K power-on hours on a supposedly new server. My expectations were adjusted lower with every interaction with these guys. But in the end, most services were moved from this provider and we got twice the capacity for a lower price.

Notes:

I discovered that in the English proverb, two or three moves equal one house fire, but in the Russian version, the proportion is opposite - one move equals two house fires.↩
There is an anti-pattern with Chef in that it accumulates a lot of information in attributes, which are spread along different places - node/role/environments, and they are not versioned and hard to search globally. Luckily, I have a daily dump of all roles/nodes/environments/data bags as JSON files, so I have an option to grep through them.↩

Why Is It Always Just a Single SQL Statement Causing a Major Performance Regression?

Oct 19th, 2024

A few weeks ago, I had to investigate a batch job that was taking more than 3 hours to run every night. The DB was the obvious bottleneck, as the job was hitting it so hard that I noticed excessive load even before the business started complaining. Before looking into the details, I assumed that the application code was doing some loops internally, retrieving mostly the same data again and again. I started preparing myself to dive into the code for a week to untangle data flows, imagining the horrors of multi-page SQL statements.

But “Premature optimisation is the root of all evil”, so first things first: enable detailed monitoring and collect query statistics for a day. The next morning, there is some data: the code is hitting mostly a single table with a simple query but with WHERE on a column without an index. Add an index, maybe it will help improve the performance before I have to dive into the code.

The next morning, I check the DB graphs, and there is no load at all. Did the job run? Did someone disable it? Did I change something, causing the job to crash?

Checking the application logs. The job did run. And completed successfully. In 10 minutes. One single index reduced run time from more than 3 hours to 10 minutes. The classic “one hit with the hammer but you need to know where to hit”.

On the other hand, I’m trying to imagine what someone who “doesn’t know where to hit” and has no visibility into the database and application performance would do to solve this issue. Just crank up the instance size? I suspect they would end up paying ten times more.

Building Vagrant-based Development Environment

Jul 2nd, 2014

Over the course of the last few months I have built three different custom Vagrant boxes to create local development environment for two different applications — one is WordPress based and another is Rails one with a few PHP parts.

The problem which Vagrant solved was that both applications are too complex to setup manually. Even when working with WordPress developers didn’t work locally but instead used to edit files directly on live server and even when we imported all code into git they started using integration server for day-to-day development and their workflow looked terrible — change a line, commit, push, wait for deploy script to run, check integration server for results, repeat. Moreover, as a result of this workflow git history looked ugly — myriad of one-line commits with no commit messages, which are painful to merge. For Rails app we needed some CSS/HTML tweaks and there is just no way average front-end developer can setup Rails development environment on Windows.

At first I thought about distributing binary Vagrant box but I still needed to distribute application source code as git repository and Vagrantfile to configure sharing and I was too lazy to setup password protected directory on web server to download binary box and hand out all credentials to individual developers (Vagrant Cloud for organizations hadn’t been available yet). So I decided to make one single git repo with Vagrant configuration and cookbooks and source code repo included as submodule.

It was a while since I did chef cookbook development so I googled a lot at first trying to find what is the best current approach and which tools to use. Cookbook development completely changed over last year or two — there are now test-kitchen, berkshelf, serverspec etc and all these tools are changing very fast — almost any tutorial older than a few months is obsolete.

So far I have found following blog posts as the most current:

Getting Started Writing Chef Cookbooks the Berkshelf Way (ignore parts 1 and 2)
Automating Cookbook Testing with Test-Kitchen, Berkshelf, Vagrant and Guard

In my setup I have followed the second one and cross-checked with the first article. I chose to include in my toolbox test-kitchen, berkshelf, serverspec, chefspec, foodcritic, rubocop and wrap everything with guard (but later disabled test-kitchen run from guard as it was failing). In the beginning I started preparing custom Vagrant base box with veewee but dropped it as I didn’t really need anything custom and standard chef/debian box from vagrantcloud.com worked well.

My main repo has very simple structure — Gemfile with berkshelf, Berksfile with all necessary cookbooks, Vagrantfile and INSTALL file with step-by-step instructions for developers. In www sub-directory I have site source code as git submodule and in cookbooks sub-directory all depended cookbooks vendored using berks vendor cookbooks. At first I used to include my own cookbooks as git submodule too into site-cookbooks but as berks vendor retrieves them anyway I dropped this. Also I decided not to use vagrant-berkshelf plugin to maintain cookbooks as it is deprecated.

For each application I created individual cookbook and one cookbook for common configuration. Each cookbook has own git repo and follows standard layout created by berks cookbook. I have also decided to rely on community cookbooks for all dependencies like MySQL, PHP etc, even though I didn’t do much customization but this decision caused a bit of pain — I had to fork cookbooks for MySQL and monit to support Debian squeeze and had to use alternative cookbook for PHP as phpmyadmin cookbook depends on it. Each cookbook has multiple recipes: for Vagrant setup, for integration server setup and for live server setup as there is some differences between them — SSL support and while integration server runs php-fpm live server still uses mod_php.

At first I followed quite strict TDD/BDD loop — create serverspec tests, then chefspec and then write recipe but after a while dropped chefspec tests as I find writing expect(chef_run).to include_recipe('apache2') and then include_recipe 'apache2' a bit boring. Also running kitchen converge && kitchen verify is quite slow even with a lot of RAM and on SSD disk. I tried to speed up things by switching to LXC but kitchen-lxc seems to be broken and unsupported and using vagrant-lxc with test-kitchen isn’t documented very well and requires building LXC base boxes manually using outdated instructions — some links to configuration templates are 404 and after you build base boxes recent Vagrant complains about outdated box format. My attempts to use more up to date scripts to build base box failed as these scripts just segfaulted on me and I didn’t have time to fix them as manually built base boxes already working. Another issue is that my Linux Mint box had sudo configuration setting which caused vagrant-lxc to fail when used with test-kitchen and a couple weeks passed before I found time to find a solution so all cookbooks were developed slowly using VirtualBox.

But overall development went quite smoothly except for few PHP/WordPress surprises in the end — e.g. PHP with short_open_tag switched fails with syntax error pointing to the end of huge 5K LOC .php file without any hint of real error cause or WordPress shows just blank front page without any error messages in error logs if some plugin fails or missing. But real adventure was still ahead. When all cookbooks were ready and fully tested locally on Linux and Mac OS X it was time to deploy to Windows boxes where everything failed just at the very beginning — Vagrant was launching VirtualBox VMs but unable to ssh into them. Few days of remote debugging using email and I had found that even vagrant init hashicorp/precise failed to work on Windows so I got idea and tried to switch to 32-bit OS image which worked. Later I got RDP access to Window 8 box and launched VirtualBox directly which complained that VT-x is disabled (it needs to be enabled in BIOS and this feature is unavailable on Celeron processors) and it can’t launch 64-bit image. Once I switched images to 32-bit all Windows users were able to use them without much problems, except occasional cases when developers didn’t read documentation and forgot to use git clone --recursive and similar issues.

Another quite problematic issue with Windows was that it is impossible to create symbolic links on shared file system with default settings and Rails app were deployed capistrano style and relied on symbolic links heavily. I had to revamp whole recipe for Rails app and remove all symbolic links to get it working on Vagrant under Windows. Another Rails specific issue is that rvm cookbooks needs special recipe rvm::vagrant to be included before any other recipe if it is run in Vagrant VM.

← Older Blog Archives

Alex L. Demidov

DevOps/SRE consultant

Linux Page Reclaim and OOM in a Cgroup

MariaDB Gets Unexpected OOM

Disappointments With AWS Database Migration Service and Macie

CrowdSec SQLite to MySQL Migration

Falsehoods People (and LLMs) Believe About Linux Swap and OOM.

Swap is useless

Swap is going to slow down your system by its mere presence

It is really bad if you have some memory swapped out ¹

swap is going to wear out your SSD ²

Swap is an emergency solution for out-of-memory conditions

Swap allows running workloads that exceed the system’s physical memory

Swap size should be double the amount of the physical memory.³

Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for `vm.swappiness=40`. ⁴

Swap aggressiveness is configured using vm.swappiness and it is linear between and 100 ⁵

With vm.swappiness=0 kernel won’t swap

With vm.swappiness=100 kernel is going to swap out everything from memory right away

vm.swappiness=60 is too agressive ⁶

vm.swappiness=10 is just the right setting and makes your system fast

Swap won’t happen if there is some free RAM.

Swap happens just randomly, when kernel has nothing to do

Swapping over NFS is a good idea. ⁷

OOM won’t trigger if there is swap enabled. ⁸

OOM won’t trigger if there is some free RAM.

OOM kills a random process.

Linux Inactive Memory

AI Fails With DevOps Tasks

One Infrastructure Migration Equals Two Disaster Recoveries

Why Is It Always Just a Single SQL Statement Causing a Major Performance Regression?

Building Vagrant-based Development Environment

Swap is useless

Swap is going to slow down your system by its mere presence

It is really bad if you have some memory swapped out 1

swap is going to wear out your SSD 2

Swap is an emergency solution for out-of-memory conditions

Swap allows running workloads that exceed the system’s physical memory

Swap size should be double the amount of the physical memory.3

Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for vm.swappiness=40. 4

Swap aggressiveness is configured using vm.swappiness and it is linear between and 100 5

With vm.swappiness=0 kernel won’t swap

With vm.swappiness=100 kernel is going to swap out everything from memory right away

vm.swappiness=60 is too agressive 6

vm.swappiness=10 is just the right setting and makes your system fast

Swap won’t happen if there is some free RAM.

Swap happens just randomly, when kernel has nothing to do

Swapping over NFS is a good idea. 7

OOM won’t trigger if there is swap enabled. 8

OOM won’t trigger if there is some free RAM.

OOM kills a random process.

It is really bad if you have some memory swapped out ¹

swap is going to wear out your SSD ²

Swap size should be double the amount of the physical memory.³

Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for `vm.swappiness=40`. ⁴

Swap aggressiveness is configured using vm.swappiness and it is linear between and 100 ⁵

vm.swappiness=60 is too agressive ⁶

Swapping over NFS is a good idea. ⁷

OOM won’t trigger if there is swap enabled. ⁸