This is about some fun I had while putting together a small Unix-y utility to measure RAM and swap usage - psm. I was able to decrease runtime by more than 65% during a 3-hour train ride with the help of some standard go tools.

Background

At my place of employment we’ve had memory-usage problems over the past month, following a restructuring in how we distribute long-running jobs between ec2 instances. In the course of several minutes to several hours, on each machine the memory usage of 8 celery jobs would grow first to fill all available RAM, and eventually all RAM + swap - invoking the wrath of the Linux OOM killer. This was an unfortunate series of events, particularly so because it only happened with any frequency on production.

While investigating I made frequent use of the excellent ps_mem.py by pixelbeat (source). This script is fantastic, especially since I can run it remotely with:

$cat which ps_mem | ssh$REMOTE_HOST -- sudo python


The two (minor) downsides are that it requires a custom sudo rule for non-admin users to be able to use it:

%developers ALL= NOPASSWD:/usr/bin/ps_mem.py


And it is somewhat slow, especially if the server is under load (due to memory pressure/swapping) or the number of total mappings is large.

After our immediate issues were resolved (a separate story), I spent some time over the holidays seeing if I could implement something similar in Go which would run quickly and be installable as setuid root, so that devs who don’t have root on our production boxes could quickly see who was using what RAM and swap. In addition, it would be a fun way to dig in to the proc filesystem, learn a bit more about how the kernel accounts for memory, and write some Go. Wins all around.

I thought hard about trying to speed up ps_mem.py itself, but since this was a spare time fun hack, I gave into my latent Not Invented Here syndrome so that I could write something fresh up in Go.

World, meet psm

“It’s like ps_mem.py, but easier to type”

The end result is a small program, psm, which produces output similar to ps_mem.py, but with the addition of a column for swap usage:

bpowers@python-worker-01:~$psm -filter=celery MB RAM SHARED SWAPPED PROCESS (COUNT) 60.6 1.1 134.2 [celeryd@notifications:MainProcess] (1) 62.6 1.1 [celeryd@health:MainProcess] (1) 113.7 1.2 [celeryd@uploads:MainProcess] (1) 155.1 1.1 [celeryd@triggers:MainProcess] (1) 176.7 1.2 [celeryd@updates:MainProcess] (1) 502.9 1.2 [celeryd@lookbacks:MainProcess] (1) 623.8 1.2 28.5 [celeryd@stats:MainProcess] (1) 671.3 1.2 [celeryd@default:MainProcess] (1) # 2366.7 164.7 TOTAL USED BY PROCESSES  It took me a leisurely day at a coffee shop of studying ps_mem.py, the kernel’s fs/proc/task_mmu.c, and hacking to get some useful output. The optimization was all done on an Amtrak ride from NY to DC, and the rest of my time on the project has been spent on small cleanups, documentation and packaging. The GitHub page for psm has installation instructions, but for brevity if you’re on Ubuntu 12.04 and interested in trying it out you can simply do: # if apt-add-repository isn't installed:$ sudo apt-get install python-software-properties

$sudo add-apt-repository ppa:bobbypowers/psm$ sudo apt-get update
$sudo apt-get install psm  How does it work? Each running process on a Linux machine is associated with a set of virtual-memory areas known as VMAs, and each VMA has an entry in /proc/$PID/smaps like so:

7fff6b915000-7fff6b937000 rw-p 00000000 00:00 0                  [stack]
Size:                140 kB
Rss:                  12 kB
Pss:                  12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me gd ac


...
$go tool pprof ./psm cpuprof.1 (pprof) web Total: 81 samples Loading web page file:////tmp/pprof31371.0.svg  There are several ways to explore pprof data. One of the easiest is the graphviz-dependent web command, which pops up an SVG in your browser: Cool, so 93% of our time is spent in procMem where we total each process’s memory usage, and most of that is in turn spent in the splitSpaces utility function. Looks like a good place to start. When we start optimizing, percent of total-time isn’t comparable between profiles, but number of samples is. Digging in Each line in /proc/$PID/smaps has several columns with an arbitrary number of spaces between them. As donio from HN pointed out, bytes.Fields does what I am looking for. At the time I wrote psm, I someohow completely overlooked that and threw my own splitSpaces together. The initial implementation was pretty naive:

splitSpaces is called for each line in /proc/\$PID/smaps. How many lines is that? Currently on my laptop:

[root@fina /]# cat /proc/**/smaps | wc
406608 1370315 13109167


Okay, so 400k lines of text, totaling 12 MB. Of all the (83) times the Go pprof thread woke up to record data about psm, 47 of them were in splitSpaces or a function called from it. bytes.SplitN and bytes.TrimSpace let me quickly implement the thing, but my usage of them is too inefficient. My take-two (v0.2) was:

This is pretty mind-numbingly complicated with its multiple doubly-nested loops and conditional rewind (I had to add tests to make sure it was working correctly), but is it faster?

All right! So, yes, splitSpaces is down from 47 to 13, a 3.6× improvement. A big part of this improvement came when I switched from calling unicode.IsSpace() to checking b[i] == 0 directly. I made a mental note to come up with something more sane for splitSpaces, but for now lets move onto other low hanging fruit.

Now, 38 of procMem’s samples are in bufio.Reader.ReadLine(), and 30 of those 38 samples are in syscall.Syscall() itself. That means that 79% of the runtime of ReadLine() is simply spent waiting for smaps data from the kernel. Not much I can change in my userspace program to get around that, so lets move on. The next biggest user of CPU time was the 9 samples spent in runtime.slicebytetostring() - the routine called for code like string([]byte{'a', 'b', 'c'}).

There are 2 places in procMem() where we convert byte slices to strings. The fist is where we figure out what type of smap line we’re on:

I like how clean switch statements look, but its a little unnecessary to do ~400k malloc + memcpy’s (one for every line in smaps) just to compare some bytes. This should remove that overhead:

The other place in procMem where slicebytetostring is called is when we convert a []byte field to an integer. The routines in strconv only accept string arguments, so we must convert:

Unfortunately, the only way I could see to get around this was to fork (copy/paste) the ParseInt function from strconv, and change its declaration to accept a []byte argument. I copied the file in with 33d7fe8a, and modified it in 77e0408f. I was on a roll at that point, so that second commit also removes a O(n) string search and replaces it with an O(1) length check for whether an smaps line should be skipped.

The results?

0.24user 0.28system 0:00.17elapsed 292%CPU (0avgtext+0avgdata 5048maxresident)k
0inputs+8outputs (0major+941minor)pagefaults 0swaps

real	0m0.186s
user	0m0.246s
sys	0m0.285s


Much better! We’ve gotten more than twice as fast. The last thing left to do is see if I can make splitSpaces sane while keeping it fast. I played around with that for a while, even creating a micro-benchmark, but the benchmark ended up not being very useful. Techniques that improved overall runtime weren’t necessarily reflected in the benchmark results. The current version of splitSpaces is:

And improves performance ~ 9% over the previous implementation. Finally, I removed the tracking of shared-memory (since it can be derived as Pss - private), which shaved a few more percentage points off. The final profile looks like:

0.26user 0.25system 0:00.16elapsed 319%CPU (0avgtext+0avgdata 5164maxresident)k
0inputs+8outputs (0major+906minor)pagefaults 0swaps

real	0m0.170s
user	0m0.266s
sys	0m0.258s


Memory usage

I had trouble getting useful memory allocation info out of pprof apart from the side-effects visible on CPU profiles, like the byteslicetostring() discussion above. time(1) consistently measured psm’s max memory usage as at or under 5 MB, so it doesn’t seem to be much of an issue.

Wrapping up

I love programming in go. It provides an environment where I can quickly produce production-quality code with awesome auto-generated docs. On top of that it provides easy, built-in tools for profiling, making optimization (even more) fun.

UPDATE: I edited the part where I tried to claim that the Go standard library didn’t have anything that worked like splitSpaces. bytes.Fields does indeed already exist, as pointed out on Hacker News. I would still have come to the same conclusion - bytes.Fields is much slower than my final splitSpaces function due to its use of unicode.IsSpace (.249 seconds runtime vs .158, currently).

Optimizing Real World Go - January 5, 2013 - Bobby Powers