Tales from the Dark Side - Optimising Perl Memory Usage

I have recently had to optimise the memory usage of a Perl script, and learned a number of lessons that I thought worth sharing.

The script in question is part of a web interface to Bacula, and needs to build an in-memory tree representing the filesystem that was backed up (from the database indexes). I originally wrote it more than a year ago, replacing my first attempt, a PHP implementation of the same thing which itself used far too much memory. Since then, it's been operating on backup sets of up to 2 million or so files/directories without any noticeable difficulty.

Then we deployed to a new site, which has substantially more files than anywhere else. Interestingly, the data volume is the same, but now we've got a single backup with 8 million files/directories in it. On top of that, when processing a restore from a few days into a weekly cycle, there are 8 million files from the initial Full backup, plus 3 million or more from each of the Differentials since then. I got a call from the site saying they were getting an Out of Memory error while trying to initiate a restore. The server had plenty of RAM, so I was a little confused until I tried it myself, and watched the script crash and burn when it hit 3GB. Oh. 32-bit OS. 3GB per-process memory limit. Bother.

Hold on. A 32-bit OS in 2011? Yeah, hysterical raisins[1]. Upgrading to 64-bit isn't an immediate option (it's coming, but we need restores working sooner than that). So I truly did have to optimise the memory usage of this thing, and I've pretty much succeeded. Three days later, here are some things I learned:

As with any optimisation, measure, measure, and measure again. Yes, you can and should use your experience and knowledge to point yourself in the right direction, but prove it. Check your assumptions with empirical evidence, because they might be wrong, and if you spend time optimising the wrong section, it'll be wasted effort. In this case I had some guesses, but it actually took 75% of the time just to find out where the main memory usage was. As silly as that sounds, part of it was figuring out the right measurement techniques to use in this context.

Devel::Size can be both useful and not useful. For figuring out how much space a hash itself is using, not counting the actual contents, it's very useful (use "size"). When you've got some complex data structures and you're trying to find which subsection is using all the memory, it's not so easy. In this case, some alternative ways of measuring indicated that more memory was being used than Devel::Size was claiming. I suspect that's because I wasn't measuring at the right point with Devel::Size, rather than any problem with the module itself. It's also not very fast; asking it to do total_size on hashes with multi-million entries can take quite some time, slowing down your turnaround on experiments.
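To illustrate the distinction, here's a minimal sketch of the two calls (the hash here is just a stand-in for the real tree data):

    use Devel::Size qw(size total_size);

    # Stand-in for the real data: 100k entries of key => arrayref.
    my %files = map { ("/some/path/file$_" => [ $_, $_ * 2 ]) } 1 .. 100_000;

    # size() counts the hash structure itself (buckets and keys), not the
    # data its values point at.
    printf "hash structure:   %d bytes\n", size(\%files);

    # total_size() walks the whole structure, following every reference;
    # on multi-million-entry hashes that walk itself takes a while.
    printf "hash plus values: %d bytes\n", total_size(\%files);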

I was able to get a better handle on memory usage by selectively disabling bits of the code within the loop. For example, to prove exactly how much memory a certain hash required, I measured memory use as per normal, then again without adding the objects to the hash. That worked in this case because I didn't need the contents of the hash until the last step, after building the tree. I was able to get away with this in a couple of other places in the code as well. Although those results weren't as comprehensive (the run couldn't get further than 60% through, for algorithm-specific reasons), they pointed me in the right direction for some additional optimisations.
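If you want to compare such runs numerically, one simple way on Linux is to snapshot the process's resident set size before and after the section in question. This is only a sketch of that idea (the helper name is mine, and it assumes /proc is available):

    # Snapshot the resident set size from /proc/self/status (Linux only).
    sub current_rss_kb {
        open my $fh, '<', '/proc/self/status' or return;
        while (my $line = <$fh>) {
            return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
        }
        return;
    }

    my $before = current_rss_kb();
    # ... build the tree, with or without the suspect hash insertions ...
    my $after = current_rss_kb();
    printf "that section grew the process by %d kB\n", $after - $before;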

Tie::SubstrHash is very cool. But when testing, remember that it allocates all the memory up front, so make sure you run the program to completion and measure memory all the way through the run. For the first 20% of the processing time it was using more memory than the old code; I'm glad I left it to run to completion, by which time it was saving almost 1GB.
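For reference, using it looks roughly like this: you declare fixed key and value lengths plus a table size, and the whole table is allocated when you tie. The lengths and count below are illustrative, not the ones from my script:

    use Tie::SubstrHash;

    # 8-byte keys, 16-byte values, 8 million slots: all allocated right now,
    # which is why it looks worse than a plain hash early in the run.
    my %packed_tree;
    tie %packed_tree, 'Tie::SubstrHash', 8, 16, 8_000_000;

    # Every key and value must be exactly the declared length, or the tie
    # croaks; pack() is the natural way to build them.
    $packed_tree{ pack('N N', 42, 7) } = pack('N N N N', 1, 2, 3, 4);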

Perl isn't very memory efficient, although PHP is worse. Neither is really an optimal choice for this implementation from a memory-usage perspective. However, given some other local constraints (the likely future maintainers of this code are not particularly well placed to work in C/C++), it's still the most reasonable choice in this situation. It may well not be the best choice in others.

pack/unpack are Larry's gift to something vaguely approaching efficiency of storage. Learn to use them; you'll be glad you did. Just remember that you've still got the overhead of the string value itself (the bytes for storing reference counts, pointers, and whatever other bookkeeping is required), so the more you can store in the packed string, the better the relative efficiency. But it certainly saved my bacon, storing one 64-bit value and two 32-bit values in 16 bytes plus one lot of overhead, rather than in one string and two ints with four lots of overhead.
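As a sketch, that kind of packing looks like this (the field names and values are made up, and the 'Q' template needs a perl built with 64-bit integer support; see the next point for what happens without it):

    # One 64-bit value plus two 32-bit values in a single 16-byte scalar,
    # instead of three separate scalars each carrying its own overhead.
    my ($file_index, $uid, $gid) = ('4294967396', 1000, 1000);

    my $record = pack('Q L L', $file_index, $uid, $gid);   # exactly 16 bytes

    # ... later, when the values are actually needed ...
    my ($idx, $owner, $group) = unpack('Q L L', $record);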

64-bit integers on 32-bit platforms. Don't expect this to work well. Or even at all. Or with anything in the way of a warning. Got a string representing a number larger than 32 bits? Good. Cast it to an int or try to pack it? Watch it get truncated to 32 bits without warning. To pack/unpack them on 32-bit hardware you need something like Bit::Vector (its new_Dec constructor) and Math::BigInt (see http://www.perlmonks.org/?node_id=163123 for some useful code).
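The gist of the workaround is to split the value into two 32-bit halves that a 32-bit perl can handle natively. This is only a sketch of that idea using Bit::Vector (the helper names are mine; the perlmonks node above has more complete code):

    use Bit::Vector;

    # Pack a 64-bit value, held as a decimal string, into 8 big-endian bytes
    # without needing native 64-bit integers.
    sub pack_u64 {
        my ($dec) = @_;
        my $v = Bit::Vector->new_Dec(64, $dec);
        return pack('N N', $v->Chunk_Read(32, 32), $v->Chunk_Read(32, 0));
    }

    # And back again, returning a decimal string.
    sub unpack_u64 {
        my ($bytes) = @_;
        my ($hi, $lo) = unpack('N N', $bytes);
        my $v = Bit::Vector->new(64);
        $v->Chunk_Store(32, 32, $hi);
        $v->Chunk_Store(32, 0, $lo);
        return $v->to_Dec();
    }

    # unpack_u64(pack_u64('5000000000')) gives back '5000000000'.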

Optimising for memory usage can affect CPU usage in *either* direction. In this case, it increased runtime by 2-3 times, so I had to do at least a little investigation into CPU efficiency afterwards, to see if there were some simple things I'd done suboptimally. I'm still doing that, but so far Devel::NYTProf seems pretty cool. And easy to use too, although it increased my runtime about 10 times, and the 2.9GB profile file took quite a bit of time to process afterwards (1hr+). It's worth the wait though. NB: curious how far through reading the profile data it is? Find the process id with ''ps'', then find which file descriptor is the profile data file (ls -l /proc/<pid>/fd/, although it will probably be '4'), then see what file position it's at with cat /proc/<pid>/fdinfo/<fd>. "pos" is the current file pointer position in bytes.

Hopefully this is of some use to someone, somewhere, although if you really need it, you do have my sympathies. It's fun to do, particularly when you've succeeded, but it's also frustrating when something you thought was a really good idea turns out not to change anything.

[1] Historical Reasons