This page describes how to use Valgrind's Helgrind tool to find data races and other threading errors. For details on using Valgrind's Memcheck tool to find memory errors, see here.
Helgrind is a tool for debugging threaded programs. It detects three categories of errors:
- data races -- memory accessed by more than one thread, but without adequate synchronisation
- lock ordering inconsistencies -- these are potential deadlocks
- various misuses of the POSIX pthreads API -- unlocking somebody else's lock, etc.
Most of the complexity and difficulty is in detection of data races, so the discussion below focusses on that aspect.
Data race detection is a much harder problem than Memcheck-style memory error detection. Consequently, using Helgrind on Firefox requires somewhat more care and patience than using Memcheck on Firefox. Here are some hints to smooth the way. You need three things:
- Markup for the Mozilla code base. This describes to Helgrind the effect of some synchronisation events it doesn't understand, primarily the behaviour of release methods in thread-safe refcounted classes. It also stops Helgrind complaining about some harmless races in the JS engine. Get it from bug 551155 comment 19.
- A suppression file that hides error reports in system libraries, at bug 551155 comment 20.
- A development version of Helgrind. A stock 3.6.1 installation won't work -- you won't be able to compile the marked-up Mozilla tree against the 3.6.1 headers. SVN trunk will work, but you'll miss out on some of the new goodies described below. You can check out and build the development version as follows:
svn co svn://svn.valgrind.org/valgrind/branches/HGDEV2 hgdev2
cd hgdev2
./autogen.sh && ./configure
make && make install
You need to use Linux. Helgrind hasn't been stress tested on MacOS to nearly the same extent. Besides, the use of suppression files for system libraries is somewhat platform specific.
If you want to run any serious workload -- anything much more than a startup and immediate quit of the browser -- a 64-bit build is recommended. In a 32-bit environment, Helgrind will quickly eat up your 3GB of address space and die. If you're doing something less demanding, for example checking a standalone build of the JS engine, a 32-bit build is OK.
If you're race checking just the JS engine, you first need to build the entire marked-up browser, in order to create a suitably marked-up NSPR. Then build the JS shell, but link it against the NSPR you just created. The system NSPR won't suffice, because some of the markup applies to NSPR.
What to expect
The markup patch tries to silence or fix enough races so that a startup to a blank page and immediate quit produces no errors, at least on 64 bit Ubuntu 10.04. Unfortunately, Helgrind reports (rightly or wrongly) many errors in system libraries, especially the Gnome libraries. Any deviation outside the startup/quit above tends to produce false errors.
The first thing to expect is a number of errors that have nothing to do with your code. You can suppress these in the normal way, by using --gen-suppressions=all and putting the resulting bits of text in a suppression file. A bit of time spent assembling a suppression file for errors that seem irrelevant quickly ameliorates this problem.
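As a rough sketch of that workflow, an entry copied from the --gen-suppressions=all output and loosened by hand might be stored and used as shown below. The file name local-helgrind.supp, the entry name, the library pattern and the wrapper name hg_suppressed are all placeholders chosen for illustration; none of them come from the Mozilla suppression file.

```shell
# Save a suppression entry to a local file. The entry name and the obj:
# pattern are hypothetical; real entries come from --gen-suppressions=all output.
cat > local-helgrind.supp <<'EOF'
{
   hypothetical-system-library-race
   Helgrind:Race
   obj:*/libglib-2.0.so*
}
EOF

# Hypothetical wrapper: run a program under Helgrind with the local
# suppressions applied, still printing suppression text for remaining errors.
hg_suppressed() {
    valgrind --tool=helgrind \
        --suppressions=local-helgrind.supp \
        --gen-suppressions=all \
        "$@"
}
```

Calling hg_suppressed followed by the program and its arguments then hides matching reports, and any new suppression text it prints can be appended to the same file.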
The second thing to expect is that you won't necessarily get exactly the same set of error reports from identical runs -- you might, or you might not. Helgrind uses a race detection algorithm which is unfortunately scheduling sensitive, so multiple identical runs produce overlapping subsets of the full set of detectable races.
The third thing to expect is that Helgrind will run slowly and eat large amounts of memory. The next-but-one section discusses ways around this.
Just in case you're feeling discouraged, bear in mind that even with these difficulties, it's easily possible to get something useful out of Helgrind.
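Because identical runs can report different subsets of races, it may help to run the same workload several times and keep every report. The helper below is only a sketch: the name hg_repeat and the hg-run-N.log file names are made up for this example, while --log-file is a standard Valgrind option.

```shell
# Hypothetical helper: run the same command under Helgrind several times,
# writing each run's report to its own log file (hg-run-1.log, hg-run-2.log, ...).
# Usage: hg_repeat <number-of-runs> <program> [args...]
hg_repeat() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        valgrind --tool=helgrind --log-file="hg-run-$i.log" "$@"
        i=$((i + 1))
    done
}
```

After something like hg_repeat 3 followed by the program, the logs can be compared by hand to see which reports recur and which show up only occasionally.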
Differences from the SVN trunk Helgrind
You may notice that this development branch of Helgrind produces error messages in a different format from the SVN trunk or 3.6.1. In particular, whenever it shows a stack for a thread involved in a race, it also shows you the set of locks held by the thread at that point. This makes it much easier to reason about who held what lock when, whether two threads agreed on the lock to use, etc.
This branch can also report races where one thread accesses heap memory whilst another one frees it, and there is no synchronisation event to guarantee that the access happens before the free. This is disabled by default; --free-is-write=yes enables it.
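A minimal sketch of turning this check on, assuming a Helgrind build that accepts the flag (the wrapper name hg_free_races is made up for this example):

```shell
# Hypothetical wrapper: also report accesses that race against a free()
# of the same heap block.
hg_free_races() {
    valgrind --tool=helgrind --free-is-write=yes "$@"
}
```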
Cranking reasonable performance out of Helgrind
At its default settings, Helgrind collects a huge amount of data about the history of your program's run, which is expensive in space and time. To get around this, there are command line options to selectively disable some of that data collection.
A useful way to approach the resource problem is to differentiate the activities of (1) detecting the presence of a race from (2) collecting enough information to diagnose the cause of a race. (1) is what we need to do when checking code for races, and when verifying that a proposed patch really does fix a race. (2) is what we need to do when investigating a race report.
The good news is that (1) is much cheaper than (2). By default, Helgrind tries to report both stacks involved in a race. That is expensive because it means collecting a stack trace for, in effect, every memory reference, just in case it finds a later memory reference that it races against. It is nearly impossible to make sense of race reports without having stack traces for both accesses involved, and that in turn requires collecting just such a huge set of backtraces. This is what makes (2) expensive.
Hence, the following scheme is recommended.
When doing (1), use the flag --history-level=none. This disables the collection of old backtraces, which easily doubles the speed of Helgrind. It means that Helgrind can only report a stack for one of the accesses in a race -- the later observed one -- so you can tell the race is there, but you can't tell what it is racing against.
When you want to investigate in detail, cut the workload down as much as possible, and then re-enable the history mechanism, either by simply omitting --history-level=none, or by giving the default setting --history-level=full. That should give you both stacks involved in the race. If it doesn't, you may have to throw even more memory at the problem via the --conflict-cache-size= flag (try valgrind --tool=helgrind --help for details). This controls how much historical data Helgrind accumulates. You may also like to try --history-level=approx, which tries to strike a balance between these two extremes.
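The two modes described above could be wrapped roughly as follows. The function names, the HG_CACHE_SIZE variable and the fallback value 5000000 are illustrative assumptions, not recommendations; check valgrind --tool=helgrind --help for the range your build accepts.

```shell
# (1) Detection: no old backtraces are kept, so only the later access in a
# race is reported with a stack. Cheap enough for routine checking.
hg_detect() {
    valgrind --tool=helgrind --history-level=none "$@"
}

# (2) Diagnosis: full history, with an enlarged conflict cache.
# HG_CACHE_SIZE is a hypothetical override; 5000000 is an arbitrary fallback.
hg_diagnose() {
    cache_size=${HG_CACHE_SIZE:-5000000}
    valgrind --tool=helgrind --history-level=full \
        --conflict-cache-size="$cache_size" "$@"
}
```

A typical cycle would be hg_detect on the full workload first, then hg_diagnose on a cut-down workload that still triggers the race.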
There are two other flags for controlling resource use.
--check-stack-refs=no tells Helgrind not to race-check references to thread stacks. Since stack accesses constitute a significant fraction of the total data accesses done, it's worth quite a bit in performance terms, 30% perhaps. This obviously means it won't detect races on thread stacks. Allowing one thread to access another's stack sounds pretty dubious, and it doesn't appear to happen much: most reported races are to the heap or global variables, so this is quite a good tradeoff.
--track-lockorders=no disables checking for inconsistencies in lock acquisition order. Normally that doesn't consume much in the way of CPU or memory, but we have seen some bad cases, so if you're pushed on memory, it's worth disabling.
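Putting the resource-saving flags together, a lowest-cost detection run might look like the sketch below. The wrapper name hg_lean is hypothetical, and whether to give up stack and lock-order checking depends on what you are hunting for.

```shell
# Hypothetical wrapper: cheapest configuration discussed on this page.
# Trades away old backtraces, stack race checks and lock-order checks.
hg_lean() {
    valgrind --tool=helgrind \
        --history-level=none \
        --check-stack-refs=no \
        --track-lockorders=no \
        "$@"
}
```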