Alex L. Demidov

DevOps/SRE consultant

Falsehoods People (and LLMs) Believe About Linux Swap and OOM.

Swap is useless

False, even if a system has a lot of memory, swap allows for better memory utilisation by swapping out allocated but rarely used anonymous memory pages.

Swap is going to slow down your system by its mere presence

False, as long as the system has enough memory, there would be very little or no swap-related I/O, so there is no slowdown.

It is really bad if you have some memory swapped out 1

False, the kernel swapped out some unused pages, and the memory can be allocated for something more useful, like file cache.

Swap is going to wear out your SSD 2

False, as long as there is no swap-related I/O, there is no wearing out of the SSD. And modern SSDs have enough resources to handle swap-related I/O anyway.

Swap is an emergency solution for out-of-memory conditions

False, once your working set exceeds actual physical memory, swap makes things worse, causing swap thrashing and delaying the OOM trigger.

Swap allows running workloads that exceed the system’s physical memory

False, once your active working set exceeds actual physical memory, you are in for some swap thrashing.

Swap size should be double the amount of the physical memory.3

False, unless the system has megabytes of memory instead of gigabytes. If you allocate more than a few GB of swap, you are in for a long swap-thrashing session when you run out of memory, before OOM gets triggered.

Swap use begins based on the vm.swappiness threshold, e.g. when 40% of RAM remains for vm.swappiness=40. 4

False. Before the introduction of the split-LRU design in kernel 2.6.28 (2008), a different algorithm based on the percentage of allocated memory was used, but it was more complicated than that: with vm.swappiness=40 it wouldn't start swapping even with all memory allocated by processes, and with the default vm.swappiness=60 it would start swapping at 80% memory allocation. That algorithm is no longer in use.

Swap aggressiveness is configured using vm.swappiness, and its effect is linear between 0 and 100 5

False. vm.swappiness was first described in the kernel documentation in 2009 with the following text:

This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.

It doesn’t say that the relation between vm.swappiness and aggressiveness is linear but people made assumptions.

This description is still present in some texts on kernel.org (this file isn't present in the kernel tree anymore and hasn't been updated since 2019).

The documentation was updated in 2020 to a more appropriate description, and values up to 200 are now allowed.

With vm.swappiness=0 the kernel won't swap

False: if the amount of free and file-backed pages in a zone falls below the high watermark, the kernel is going to swap anyway.

With vm.swappiness=100 the kernel is going to swap out everything from memory right away

False, if there is no memory pressure, the kernel isn’t going to swap anything.

vm.swappiness=60 is too aggressive 6

False, the vm.swappiness value 60 means that anon_prio is assigned the value of 60 and file_prio the value of 200 - 60 = 140. The resulting ratio 140/60 means that the kernel would evict 2.33 times more pages from the page cache than swap out anonymous pages.

The default value of 60 was chosen with the assumption that the file I/O operations, which tend to be sequential, are more effective than random swap I/O, but this applies to rotating media like HDDs only. For SSDs, vm.swappiness=100 is more appropriate.

As the documentation states:

For in-memory swap, like zram or zswap, as well as hybrid setups that have swap on faster devices than the filesystem, values beyond 100 can be considered
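
To make the ratio concrete, here is a minimal shell sketch (just the arithmetic described above, not kernel code) that derives the anon/file reclaim priorities from the current vm.swappiness value:

# Derive the reclaim priorities from the current vm.swappiness value.
swappiness=$(sysctl -n vm.swappiness)
anon_prio=$swappiness
file_prio=$((200 - swappiness))
echo "anon_prio=$anon_prio file_prio=$file_prio"
# With the default 60 this prints anon_prio=60 file_prio=140, i.e. roughly
# 2.33 file pages are reclaimed for every anonymous page swapped out.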

vm.swappiness=10 is just the right setting and makes your system fast

This value gives a ratio of 19 times preference for discarding page cache over swapping out. Your system is going to have a lot of unused anon pages sitting around while churning through file cache pages, making it less effective.

Swap won’t happen if there is some free RAM.

False. If a process runs within a cgroup with defined memory limits, it can be swapped out, even though the system still has a lot of free memory. Swap and OOM can also be triggered due to memory fragmentation when high-order allocation fails, even though there are a lot of free low-order pages.
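
On a cgroup v2 system the per-service limit is easy to see next to the host-wide numbers; a quick sketch (the unit name is illustrative, any cgroup with a memory.max limit behaves the same way):

cg=/sys/fs/cgroup/system.slice/myservice.service   # hypothetical unit
cat $cg/memory.max $cg/memory.current $cg/memory.swap.current
grep MemFree /proc/meminfo   # the host can still have plenty of free memory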

Swap happens just randomly, when the kernel has nothing to do

False. Swap happens when memory allocation brings the number of free memory pages below the low watermark specified for a memory zone. See /proc/zoneinfo and this question on Unix.StackExchange.
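
A rough sketch for inspecting those watermarks (values are in pages; the exact layout of /proc/zoneinfo varies slightly between kernel versions):

awk '/^Node/                 { zone = $0 }
     /pages free/            { print zone, "free:", $3 }
     $1 ~ /^(min|low|high)$/ { print zone, $1 ":", $2 }' /proc/zoneinfo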

Swapping over NFS is a good idea. 7

False. It is very slow, and any packet lost/delayed on the network would cause the system to hang.

OOM won’t trigger if there is swap enabled. 8

False. OOM is triggered regardless of swap being enabled or disabled, full or empty.

OOM won’t trigger if there is some free RAM.

False. Swap and OOM can be triggered due to memory fragmentation when high-order allocation fails, even though there are a lot of free low-order pages. 9

OOM kills a random process.

The current Linux kernel simply kills the process with the largest RSS+swap usage (with a per-process OOM score adjustable through /proc). In v5.1 (2019) it dropped the heuristic of preferring to sacrifice a child instead of the parent; in v4.17 (2018) CAP_SYS_ADMIN processes lost their 3% bonus. Before v2.6.36 (2010) the heuristic was much more complicated and involved factors like forking, process runtime and nice values, but at least that is what is described in the current man 5 proc. Enabling the vm.oom_kill_allocating_task sysctl, however, can result in an effectively random process being killed, because the victim is simply whichever process happens to be allocating memory when the allocation fails.
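
To see who the OOM killer would pick right now, /proc/<pid>/oom_score holds the kernel's current badness score and /proc/<pid>/oom_score_adj is the per-process adjustment mentioned above; a small sketch:

# List processes by their current OOM score, most likely victim first.
for p in /proc/[0-9]*; do
    echo "$(cat "$p/oom_score" 2>/dev/null) ${p##*/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head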

Linux Inactive Memory

Since the introduction of the split-LRU design in version 2.6.28, the kernel maintains 4 (actually 5) LRU (least recently used) lists for memory pages:

  • Active anon pages
  • Active file pages
  • Inactive anon pages
  • Inactive file pages
  • (unevictable pages)

The total sizes of these lists are reported in vmstat and /proc/meminfo:

$ vmstat -s | head -5
    131669088 K total memory
     69020576 K used memory
     56007124 K active memory
     29691936 K inactive memory
      6140772 K free memory

$ cat /proc/meminfo | grep -i active
Active:         56007676 kB
Inactive:       29691936 kB
Active(anon):   34381776 kB
Inactive(anon): 16134108 kB
Active(file):   21625900 kB
Inactive(file): 13557828 kB

These lists are maintained for the purpose of page reclaim. When the kernel needs to allocate some memory pages and there are not enough free ones, it goes through the LRU lists trying to find pages that can be reclaimed. The pages are organised in LRU lists so that the kernel can reclaim the pages that weren't used recently and are supposedly the least likely to be used soon.

The kernel goes through both inactive anon and inactive file LRU lists, starting from their tails while maintaining the proportion specified by vm.swappiness between the reclaimed anon and file pages. With the default value vm.swappiness=60, when the kernel needs 200 pages, it is going to reclaim 60 pages from inactive anon LRU and 140 pages from inactive file LRU, but if there are not enough pages available, it won’t keep the balance and is going to reclaim whatever it finds.

When a file-backed page is reclaimed, it can be discarded right away if it isn't modified, or it needs to be written back to the file. An anon page always needs to be written out to swap when reclaimed, so reclaiming anon pages is more expensive.

So far, so good - we established the purpose of the separation of the memory into active/inactive, so the kernel knows which pages to reclaim.

Now the main question: how exactly does the kernel decide which page is active and which is not?

The problem is that the kernel can't really track all memory accesses, and the hardware provides only a single bit of information: whether the page was ever accessed. It doesn't know when it was accessed or how many times. The kernel works around this by periodically scanning pages, clearing the 'Accessed' bit and looking at the bit again on the next scan. But it doesn't scan all the pages all the time. Only when the kernel needs some memory pages does it scan the tail of the inactive LRU, and it stops the scan once it finds enough pages. So most pages don't get their activity information updated.

To illustrate this: a newly allocated page starts its life at the head of the inactive LRU with the 'Accessed' bit reset to 0. As other new pages are allocated, our page gets pushed back towards the tail of the inactive LRU. If the page gets accessed by a user-space program, the CPU sets the 'Accessed' bit to 1. The page can be accessed many times, but it is still considered inactive until the reclaim scan reaches its position in the inactive LRU from the tail. Pages get promoted to the active LRU on the second scan (some on the first) if they have the 'Accessed' bit set. If a page doesn't get promoted to the active LRU, it gets reclaimed once it is at the tail end of the inactive LRU and the kernel needs some free pages. Once a page gets to the active LRU, it is considered active even if it isn't accessed there at all. Until the active list grows too large or the reclaim shrinks the inactive list too much, the active list doesn't change: its pages are not scanned at all, and more active pages are never moved towards the head within the active LRU.

The kernel maintains a specific ratio between the active and inactive LRU, depending on the memory size (since v4.10):

 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
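
The table above is quoted from the comment in mm/vmscan.c; the target ratio there is the integer square root of 10 times the memory size in gigabytes, which a quick check reproduces:

# ratio = int_sqrt(10 * gigabytes); compare with the table above.
for gb in 1 10 100 1024 10240; do
    awk -v gb=$gb 'BEGIN { printf "%6d GB -> target ratio %d\n", gb, int(sqrt(10 * gb)) }'
done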

When the kernel needs to grow the inactive LRU, it moves pages from the tail of the active LRU regardless of their activity status, except for file-backed executable pages, which get promoted back to the head of the active LRU.

Even though the active/inactive LRUs are meant to hold active and inactive pages respectively, the kernel intentionally spends minimal effort maintaining these lists (for performance reasons), so for most pages the information in these lists is outdated most of the time.

What can we actually tell about the pages in the LRUs?

  • The page at the head of the active list was accessed just recently. It may have been accessed once or twice, or a million times.
  • The page at the tail of the active list was accessed some time ago, when it was added to the head of the list. It could have been accessed a million times or zero times while sitting in the active list.
  • The page at the head of the inactive list was either just allocated or just pushed out of the active list, where it spent an unknown amount of time and was accessed zero or a million times.
  • The page in the middle of the inactive list was allocated some time ago, and we don't know whether it has been accessed yet. Until it is scanned, it may have been accessed zero times or a million times.
  • The pages at the tail of the inactive list are the only ones with accurate information, and only at the moment they are scanned. We still don't know whether they were accessed more than once, or whether they are going to be accessed again, but at least the expectation is that they won't be.

So a more appropriate name for the 'inactive' LRU would be 'candidates for reclaim', and for the 'active' one, 'not considered for reclaim yet'. The numbers derived from the lengths of these LRUs and reported as Active/Inactive memory in /proc/meminfo have little relation to the actual working set size.
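
If you need an actual working set estimate for a single process, a more direct (if heavy-handed) approach than reading the LRU counters is the clear_refs/smaps mechanism; a rough sketch, with an illustrative PID and interval:

pid=1234                                  # illustrative PID
echo 1 > /proc/$pid/clear_refs            # needs root: clear the per-page referenced bits
sleep 60                                  # let the process do some representative work
grep Referenced /proc/$pid/smaps_rollup   # memory actually touched during the interval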

AI Fails With DevOps Tasks

Did a quick evaluation and comparison between ChatGPT and Claude on my typical task to decide if the Claude subscription discount offer is worth it. All models failed miserably on a simple and straightforward task of creating a single Terraform resource. The caveat is that this resource was implemented relatively recently by AWS (in 2021) and in the Terraform AWS provider (10 months ago).

The initial question is simple: “How to add rDNS for AWS EIP with Terraform”. All models answered that Terraform doesn’t support it natively and offered a workaround with “local-exec” and a call to “aws ec2 modify-address-attribute”. Claude gave the correct parameter “--domain-name”; both 4o and o3-mini-high hallucinated parameter names “--reverse-dns-name” and “--reverse-dns”.

Given a correction that Terraform does support this natively, the models started hallucinating by inventing or repurposing “aws_eip” resource attributes. 4o suggested using the “domain” attribute, which is not related to DNS. o3-mini-high invented “reverse_dns” block for “aws_eip” resource. Claude suggested assigning the “reverse_dns” attribute, which doesn’t exist. Interestingly, with web search enabled, 4o was able to find the correct “aws_eip_domain_name” resource. Both Claude and o3 went back to suggesting using “local-exec” and inventing random resource names like “aws_ec2_address”, “aws_ec2_address_attribute”, “aws_eip_reverse_dns”.

I have noticed that o3 is much more stubborn, and if it goes the wrong way, it is almost impossible to correct: a few weeks ago, it tried to correct me that MySQL 9 doesn’t exist. Not sure if the new Claude works the same way, but at least it is much more cheerful. Still gave that subscription option a pass, as there is no improvement and these tasks are still too challenging for AIs.

One Infrastructure Migration Equals Two Disaster Recoveries

To paraphrase a proverb, “One infrastructure migration equals two disaster recoveries” 1. Well, I had to move not two but 25 services. Some are dockerised, some are running in legacy VMs managed with Chef, and one is a very old VM that used to be a dedicated server more than a decade ago, with a jumble of legacy half-abandoned websites and mail services.

While all the services are internal and not customer-facing, some are critical for the team and the infrastructure. To avoid too much service disruption and downtime, I decided to move services one at a time, so I could deploy and test a service on the new infrastructure, shut down the old one, re-sync the data and switch the DNS (almost forgot to lower the DNS TTL before the move), with the option to roll back to the old service if anything went wrong. Beforehand, I created a table in Notion with all service resource requirements, SLAs, inter-dependencies, priorities and nice color labels. Had this table open in front of me for almost two months.

The initial infrastructure preparation was a clean greenfield: made it all modern, fresh and shiny. While I had time, I started with the most complex and important services, refreshed them, threw away a lot of organic-growth stuff, and re-wrote some Chef cookbooks in Ansible. When the deadline started approaching, it became a kind of slog: move the service, search all the configuration spread for IPs to update2, get the service working, repeat, preferably early in the morning. The team also dropped two other projects on me (including a major OS upgrade) with the same deadline, just to keep up the pressure. Had a lot of overtime and even once broke my anti-burnout rule of not working full-time on weekends. Ironically, I managed to follow Parkinson’s law and finished by retiring the old infrastructure exactly at the deadline, the last day of the month.

But there are still a lot of small broken parts to fix and old dust and junk to clear. Still sorting out and prioritising my ToDo list, as it gained more than a hundred items during the migration. Two major things to re-work: remove as much as possible from the Chef attributes and re-organize packet filter rules management.

Also, had some interesting and surprising moments:

The first one was when the hosting provider moved an IP from one server to another, and I started seeing incoming traffic for the IP on both the old and the new servers simultaneously. I was monitoring the traffic with tcpdump on both servers to catch the moment when the IP switch was about to happen. At first, I suspected their network engineer somehow managed to mirror the traffic and even sent them a message, as I was clearly getting ‘host unreachable’ from the old server public IP, but I didn’t stop investigating. The IP wasn’t responding even locally, though I double-checked that it was assigned to a VM and that a route was configured properly. When I got a ‘host unreachable’ message from an internal IP from the old server, it all became clear. There was an IPSec tunnel between the old and the new servers, and the IP was still routed to the old server through the tunnel. The confusing part was that the packets coming in through the IPSec tunnel on the old server were seen as coming through normal ethernet. The irony is that I had already prepared IPSec reconfiguration, and it was the next step in my checklist.

The second was a chicken-and-egg problem. To move the Chef server, I needed to reconfigure load balancers, but the load balancers’ configuration was managed with Chef. Ended up changing the configuration manually.

The third was a little adventure with a new secondary server at the old hosting provider. They delayed its delivery for a month. When I got access to its console, it wasn’t even re-imaged with a fresh Debian; it was a used node from the provider’s cloud solution. Their netboot image was running an Alpine version from 2019, and their netboot installer didn’t have anything newer than Debian Buster. Their netboot didn’t support UEFI boot with GPT, and I found that out the hard way. It also explained why they had failed to re-image the server. Re-partitioned the disks with MBR, installed Debian 10, and followed with two immediate upgrades to Bullseye and Bookworm. Almost an entire day wasted doing someone else’s work. And then I had to wait 24h for them to re-route an IP to this server. Oh, and the disks had 50K power-on hours on a supposedly new server. My expectations were adjusted lower with every interaction with these guys. But in the end, most services were moved away from this provider and we got twice the capacity for a lower price.

Notes:


  1. I discovered that in the English proverb, two or three moves equal one house fire, but in the Russian version, the proportion is opposite - one move equals two house fires.
  2. There is an anti-pattern with Chef in that it accumulates a lot of information in attributes, which are spread across different places (nodes/roles/environments), are not versioned and are hard to search globally. Luckily, I have a daily dump of all roles/nodes/environments/data bags as JSON files, so I can grep through them.

Why Is It Always Just a Single SQL Statement Causing a Major Performance Regression?

A few weeks ago, I had to investigate a batch job that was taking more than 3 hours to run every night. The DB was the obvious bottleneck, as the job was hitting it so hard that I noticed excessive load even before the business started complaining. Before looking into the details, I assumed that the application code was doing some loops internally, retrieving mostly the same data again and again. I started preparing myself to dive into the code for a week to untangle data flows, imagining the horrors of multi-page SQL statements.

But “Premature optimisation is the root of all evil”, so first things first: enable detailed monitoring and collect query statistics for a day. The next morning, there is some data: the code is hitting mostly a single table with a simple query but with WHERE on a column without an index. Add an index, maybe it will help improve the performance before I have to dive into the code.

The next morning, I check the DB graphs, and there is no load at all. Did the job run? Did someone disable it? Did I change something, causing the job to crash?

Checking the application logs. The job did run. And completed successfully. In 10 minutes. One single index reduced run time from more than 3 hours to 10 minutes. The classic “one hit with the hammer but you need to know where to hit”.
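
The fix itself was a one-liner. The engine, table and column names below are hypothetical, since the post doesn't name them, but the shape of the change is just an index on the column used in the hot query's WHERE clause:

# Hypothetical names, for illustration only.
mysql appdb -e 'CREATE INDEX idx_items_processed_at ON items (processed_at);'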

On the other hand, I’m trying to imagine what someone who “doesn’t know where to hit” and has no visibility into the database and application performance would do to solve this issue. Just crank up the instance size? I suspect they would end up paying ten times more.

Building Vagrant-based Development Environment

Over the course of the last few months I have built three different custom Vagrant boxes to create local development environments for two different applications — one is WordPress-based and the other is a Rails one with a few PHP parts.

The problem which Vagrant solved was that both applications are too complex to set up manually. Even when working with WordPress, the developers didn’t work locally but instead used to edit files directly on the live server, and even after we imported all the code into git they started using the integration server for day-to-day development, and their workflow looked terrible — change a line, commit, push, wait for the deploy script to run, check the integration server for results, repeat. Moreover, as a result of this workflow the git history looked ugly — a myriad of one-line commits with no commit messages, which are painful to merge. For the Rails app we needed some CSS/HTML tweaks, and there is just no way an average front-end developer can set up a Rails development environment on Windows.

At first I thought about distributing a binary Vagrant box, but I still needed to distribute the application source code as a git repository plus a Vagrantfile to configure sharing, and I was too lazy to set up a password-protected directory on a web server for the binary box and hand out credentials to individual developers (Vagrant Cloud for organizations hadn’t been available yet). So I decided to make one single git repo with the Vagrant configuration and cookbooks, with the source code repo included as a submodule.

It had been a while since I did Chef cookbook development, so at first I googled a lot trying to find the best current approach and which tools to use. Cookbook development has completely changed over the last year or two — there are now test-kitchen, berkshelf, serverspec etc., and all these tools are changing so fast that almost any tutorial older than a few months is obsolete.

So far I have found the following blog posts to be the most current:

In my setup I followed the second one and cross-checked with the first article. I chose to include test-kitchen, berkshelf, serverspec, chefspec, foodcritic and rubocop in my toolbox and wrapped everything with guard (but later disabled the test-kitchen run from guard as it kept failing). In the beginning I started preparing a custom Vagrant base box with veewee but dropped it, as I didn’t really need anything custom and the standard chef/debian box from vagrantcloud.com worked well.

My main repo has a very simple structure — a Gemfile with berkshelf, a Berksfile with all necessary cookbooks, a Vagrantfile and an INSTALL file with step-by-step instructions for developers. In the www sub-directory I have the site source code as a git submodule, and in the cookbooks sub-directory all dependent cookbooks vendored using berks vendor cookbooks. At first I also included my own cookbooks as git submodules in site-cookbooks, but as berks vendor retrieves them anyway I dropped this. I also decided not to use the vagrant-berkshelf plugin to maintain cookbooks, as it is deprecated.
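
To make the setup tangible, this is roughly the developer-side workflow for that repo layout (the repository URL is illustrative; the commands match the tools described above):

git clone --recursive git@example.com:devenv.git   # pulls in the www submodule too
cd devenv
bundle install                         # installs berkshelf and friends from the Gemfile
bundle exec berks vendor cookbooks     # vendors all dependent cookbooks into cookbooks/
vagrant up                             # boots the chef/debian base box and provisions the VM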

For each application I created an individual cookbook, plus one cookbook for common configuration. Each cookbook has its own git repo and follows the standard layout created by berks cookbook. I also decided to rely on community cookbooks for all dependencies like MySQL, PHP etc., even though I didn’t do much customization, but this decision caused a bit of pain — I had to fork the MySQL and monit cookbooks to support Debian squeeze and had to use an alternative cookbook for PHP, as the phpmyadmin cookbook depends on it. Each cookbook has multiple recipes: for the Vagrant setup, for the integration server and for the live server, as there are some differences between them — SSL support, and while the integration server runs php-fpm, the live server still uses mod_php.

At first I followed a fairly strict TDD/BDD loop — create serverspec tests, then chefspec, then write the recipe — but after a while I dropped the chefspec tests, as I find writing expect(chef_run).to include_recipe('apache2') and then include_recipe 'apache2' a bit boring. Also, running kitchen converge && kitchen verify is quite slow even with a lot of RAM and an SSD disk. I tried to speed things up by switching to LXC, but kitchen-lxc seems to be broken and unsupported, and using vagrant-lxc with test-kitchen isn’t documented very well and requires building LXC base boxes manually using outdated instructions — some links to configuration templates are 404, and after you build the base boxes, recent Vagrant complains about an outdated box format. My attempts to use more up-to-date scripts to build a base box failed, as these scripts just segfaulted on me and I didn’t have time to fix them since the manually built base boxes were already working. Another issue is that my Linux Mint box had a sudo configuration setting which caused vagrant-lxc to fail when used with test-kitchen, and a couple of weeks passed before I found time to find a solution, so all cookbooks were developed slowly using VirtualBox.

But overall development went quite smoothly except for a few PHP/WordPress surprises at the end — e.g. PHP with short_open_tag switched off fails with a syntax error pointing to the end of a huge 5K LOC .php file without any hint of the real cause, or WordPress shows just a blank front page without any error messages in the error logs if some plugin fails or is missing. But the real adventure was still ahead. When all cookbooks were ready and fully tested locally on Linux and Mac OS X, it was time to deploy to Windows boxes, where everything failed at the very beginning — Vagrant was launching VirtualBox VMs but was unable to ssh into them. A few days of remote debugging over email and I found that even vagrant init hashicorp/precise failed to work on Windows, so I got an idea and tried switching to a 32-bit OS image, which worked. Later I got RDP access to a Windows 8 box and launched VirtualBox directly, which complained that VT-x is disabled (it needs to be enabled in the BIOS, and this feature is unavailable on Celeron processors) and it can’t launch a 64-bit image. Once I switched the images to 32-bit, all Windows users were able to use them without many problems, except occasional cases when developers didn’t read the documentation and forgot to use git clone --recursive, and similar issues.

Another quite problematic issue with Windows was that it is impossible to create symbolic links on a shared file system with default settings, and the Rails app was deployed capistrano-style and relied heavily on symbolic links. I had to revamp the whole recipe for the Rails app and remove all symbolic links to get it working on Vagrant under Windows. Another Rails-specific issue is that the rvm cookbook needs the special recipe rvm::vagrant to be included before any other recipe if it runs in a Vagrant VM.

WordPress Site Performance Optimization

Spent about a week optimizing the performance of a WordPress-based web application. While the site already had some optimizations in place, like W3 Total Cache backed by APC and mod_pagespeed installed, there were still complaints that the site loads very slowly.

Before taking any action I started by measuring actual performance and gathered metrics using New Relic and the Chrome Developer Tools Audit tab. New Relic gave a few critical insights into the performance troubles.

The first one was two widgets in the footer of every page, each making requests to external API services taking on average ~600 ms for one and ~1500 ms for the other. As the second service was our own custom service, I quickly optimized it by adding a counter cache field to the table instead of making a select on dependent records on each request, and the request time went down to around ~150 ms. But these requests were still made on every page load, so I patched both widgets to cache responses from the external APIs in memcached for 5 minutes. Average page generation time went down to around ~800 ms.

Another thing to optimize was W3 Total Cache. First I turned off its minify option, as it was sometimes taking up 2-3 seconds according to New Relic. Next I switched cache storage from APC to memcached, as APC was constantly being reset every minute by some rogue code somewhere on this server: grep -r apc_clear_cache showed about a hundred matches. This issue also affected PHP opcode caching, so I decided to switch to Zend OPcache for opcode caching. For the page cache I chose to switch from memcached storage to extended disk, so that if a page was cached, Apache would serve it directly without hitting any PHP code. As database queries were taking an insignificant percentage of total page generation time, I switched the DB cache completely off to avoid the PHP overhead and instead cranked up MySQL’s own query cache memory limits.

With all these optimizations page generation time stabilized at around ~600 ms, against ~2500 ms a week ago, and I called it a day - I don’t think I can squeeze more performance out of WordPress without analyzing the performance impact of each plugin.

The next step in the site optimization was tuning mod_pagespeed settings. At first it looked like it was not working at all. After checking the mod_pagespeed logs I found that it doesn’t work with SSL by default, and we have an https-only site. Another obstacle, on which I spent a good half hour, is that the W3 Total Cache page cache interferes with mod_pagespeed; it looks like the latter ignores static HTML files. After that I enabled most of the mod_pagespeed CoreFilters, focusing on CSS and JS optimization, as we had about 30-40 assets of each kind, and was able to reduce the number of external CSS/JS files down to 9-10 per page. I also tried to optimize page loading with the defer_javascript filter, which moves JS files down to the page footer, but had to turn it off later as it broke some JS navigation menus. Overall page load time went down from around 10 seconds to 5 seconds on average.

Octopress Revival

Resurrected my standalone blog for the third time, this time again on Octopress and still on version 2.0. I didn’t intend to do this, but there is still no good blogging platform with code highlighting support.

I set it up pretty quickly, but the first problem was that I wanted to keep the old content and I didn’t have the source code for it anymore. Converting by hand seemed tedious, so I thought about hiring someone on oDesk, but then, after a few Google searches, I found a tool to convert HTML back to markdown – the reverse_markdown ruby gem. On the first attempt it did no conversion, but after stripping all the HTML code around the actual post content (most importantly, removing the surrounding article tags) it produced nice markdown, which I put back into Octopress.

After the initial import I did some cleanup – removed the unnecessary /blog prefix from post permalinks, fixed links in old content pointing to my old-old MovableType blog and imported static files into Octopress. To check all the links I installed the link-checker ruby gem – it works fine but seems to have problems with some https:// links.

Once all the content was in good shape I tweaked the CSS colors back to my old palette, added a Stack Exchange badge, enabled Disqus comments and updated the Google Analytics JavaScript to the latest universal code.

After comparing the generated HTML with the old blog using diff, I found a bug in Octopress: the canonical link for category pages is broken by default – it has a missing /; see Octopress issue #949 for the fix. Once I was satisfied with the content I deployed it to the server using rsync.

Octopress version 3.0 is currently in development and close to a final release, but it seems to be quite different from version 2.0 in concept, and as its author says:

For those currently using Octopress, it will be a while before the new releases can compete with the features of the old version.

Hunt for the Bug

Spent three days last week hunting for a mysterious bug which caused factory_girl factories to randomly fail with a "Trait not registered: class" message during a full test suite run; yet when you run all controller or model tests separately, everything is fine and all the tests that were failing during the full run work perfectly.

At first I ignored this issue – I had just added two new factories, which coincidentally used the class parameter in their definition to specify the generated class explicitly, and I needed these factories to test the code I was working on, so I thought I would fix or just remove them later.

But as it usually happens, it wasn’t as simple as I thought. Suddenly I discovered that tests started failing with the same symptoms on the common develop branch, not only on my topic branch. And I had already broken tests in two other places, so a cleanup was really needed.

The first two days I spent trying to find out what was happening in factory_girl internals using old-school print logging and later pry-debugger, without much success except that I was able to locate a single spec file in spec/workers/ which caused the failure of all consecutive factory calls. Then I started looking at the git history trying to find the commit which introduced this issue. Luckily, in spite of heavy rebasing and a few backported commits, my master branch didn’t have this issue and I was able to pinpoint it to a single commit. At first glance this commit looked almost innocent – it just extracted code from a model and moved it to app/workers/. But there were two tests added to the failing spec file, and they were the tests which caused the cascading failure of all remaining tests in the suite. After reviewing the code under test I found that the real culprit was memory-leak debugging code I had quickly slapped in without running the tests:

counts = Hash.new { 0 }
ObjectSpace.each_object do |o|
  counts[o.class] += 1
end
counts.reject! { |_k, v| v < 100 }

It seems that FactoryGirl::DefinitionsProxy undefines all methods, including class, and method_missing in this class registers any call as a trait on the factory, so walking through ObjectSpace and calling class on every object wreaks havoc on the factories.

Building the Russian Version of Movable Type

To build the Russified version of Movable Type we again need the sources from the svn repository. Check them out as described in the previous article about working with the translation file, and apply the patch-rubuild.gz patch to the checked-out sources; it adds the ability to build the Russian version.

The next step is to apply to the sources all the changes needed for Russian language support:

  • patch-rudate41.gz — adds Russian date formats
  • patch-rudirify41.gz — adds Russian characters to the tables used to convert post titles into file names (unlike Alexey Tutubalin's patch, it also changes the JavaScript code, and the transliteration of Russian characters to Latin follows GOST 7.79-2000)
  • patch-nofollow.gz — adds support for a tag required by Russian search engines (author: Alexey Tutubalin)
  • patch-monday-mt41.gz — makes Monday the first day of the week in the calendar (author: Alexey Tutubalin)

After applying the changes to the code and templates, all that remains is to add the translation file itself, lib/MT/L10N/ru.pm (either the movable-type.ru project's translation or my own, less complete one), the stylesheet mt-static/styles_ru.css and two HTML files — index.html.ru and readme.html.ru (they are available in the Russian Movable Type repository on Google Code). The translation file needs to be processed according to the instructions in the previous article. Finally, you can change some default settings for the package being built (time zone, encodings, links to news, the portal and tech support) by editing the file build/mt-dists/ru.mk.

With all the necessary changes in place, build the MTOS package:

env LANG=C ./build/exportmt.pl --local --pack=MTOS --lang=ru --prod

Running this command produces two archives with the Russian version of Movable Type — MTOS-4.1-ru.tar.gz and MTOS-4.1-ru.zip.