This page describes how to use Valgrind's Helgrind tool to find data races and other threading errors. Valgrind's Memcheck tool, which finds memory errors, is documented on a separate page.
Helgrind is a tool for debugging threaded programs. It detects three categories of errors:
* data races: memory accessed by more than one thread, but without
adequate synchronisation
* lock ordering inconsistencies, which are potential deadlocks
* various misuses of the POSIX pthreads API, such as unlocking
somebody else's lock
Most of the complexity and difficulty is in detection of data races, so the discussion below focusses on that aspect.
Data race detection is a much harder problem than Memcheck-style memory error detection. Consequently, using Helgrind on Firefox requires somewhat more care and patience than using Memcheck on it. Here are some hints to smooth the way.
You need three things
* Markup for the Mozilla code base, that tells Helgrind about
synchronisation events it doesn't understand, and about some harmless
races. Get it from
bug 551155 comment 19
* A suppression file that hides error reports in system libraries, at
bug 551155 comment 20
* A development version of Helgrind, which you can check out and build
make && make install
You need to use Linux. Helgrind hasn't been stress
tested on MacOS to nearly the same extent. Besides, the use of
suppression files for system libraries is somewhat platform specific.
If you want to run any serious workload (e.g., anything
much more than a startup and immediate quit of the browser),
a 64-bit build is strongly recommended. In a 32-bit environment,
Helgrind will quickly eat up your 3GB of address space and die.
If you're doing something less demanding, for example checking a
standalone build of the JS engine, a 32-bit build is OK.
If you're race checking just the JS shell, you first need to
build the entire marked-up browser in order to create a suitably
marked-up NSPR. Then build the JS shell, linking it against the
NSPR you just created. The system NSPR won't suffice, because
some of the markup applies to NSPR itself.
What to expect
The markup patch tries to silence or fix enough races so that
a startup to a blank page and immediate quit produces no errors,
at least on 64 bit Ubuntu 10.04. Unfortunately, Helgrind reports
(rightly or wrongly) many errors in system libraries, especially
the GNOME libraries. Any deviation from the startup/quit scenario
above tends to produce false errors.
Hence the first thing to expect is possibly a number of errors
that have nothing to do with your code. You can suppress these
in the normal way, by using --gen-suppressions=all and putting
the resulting bits of text in a suppression file. A bit of time
assembling a suppression file for errors that seem irrelevant
quickly ameliorates this problem.
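The suppression workflow described above can be sketched as two commands. This is a dry run that only prints the commands. The target path ./firefox and the file name my-helgrind.supp are hypothetical placeholders, and pasting the generated entries into the file is a manual step between the two runs.

```shell
# Dry-run sketch: prints the commands instead of running them.
# TARGET and SUPP_FILE are hypothetical placeholders, not names from this page.
TARGET="./firefox"
SUPP_FILE="my-helgrind.supp"

# Step 1: ask Valgrind to print a ready-made suppression after each error.
GEN_CMD="valgrind --tool=helgrind --gen-suppressions=all $TARGET"

# Step 2: after pasting the irrelevant entries into $SUPP_FILE by hand,
# load that file on later runs so those reports are hidden.
USE_CMD="valgrind --tool=helgrind --suppressions=$SUPP_FILE $TARGET"

echo "$GEN_CMD"
echo "$USE_CMD"
```

Remove the echo wrappers and run the commands directly once the placeholders match your build.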
The second thing to expect is that you won't necessarily get
exactly the same set of error reports from identical runs: you might,
or you might not. Helgrind uses a race detection algorithm which
is unfortunately scheduling sensitive, and multiple identical
runs produce overlapping subsets of the full set of detectable races.
The third thing to expect is that Helgrind will run slowly
and eat large amounts of memory. A later section discusses ways
to reduce these costs.
Now, just in case you're feeling discouraged, bear in mind that even
with these difficulties, it's entirely possible to get something
useful out of Helgrind.
Differences from the SVN trunk Helgrind
You may notice that this development branch of Helgrind produces
error messages in a different format from the SVN trunk or 3.6.1.
In particular, whenever it shows a stack for a thread involved
in a race, it also shows you the set of locks held by the thread
at that point. This makes it much easier to reason about who
held what lock when, whether the two threads agreed on the lock
to use, etc.
This branch can also report races where one thread accesses heap
memory whilst another one frees it, and there is no synchronisation
event to guarantee that the access happens before the free. This
check is disabled by default; enable it with --free-is-write=yes.
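As a minimal sketch, enabling the access-versus-free check might look like the command below. It is a dry run that only prints the command, and the target path ./js is a hypothetical placeholder for whatever binary you are checking.

```shell
# Dry-run sketch: prints the command rather than running it.
# TARGET is a hypothetical placeholder for the program under test.
TARGET="./js"

# Treat each free() as a write, so an unsynchronised access that
# races with the free is reported.
FREE_CHECK_CMD="valgrind --tool=helgrind --free-is-write=yes $TARGET"

echo "$FREE_CHECK_CMD"
```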
Cranking reasonable performance out of Helgrind
Left to itself Helgrind collects a huge amount of data about the
history of your program's run, which costs both time and memory.
To get around this, there are command line options to selectively
disable some of that data collection.
A useful way to approach the resource problem is to differentiate
the activities of (1) detecting the presence of a race from (2)
collecting enough information to diagnose the cause of a race. (1) is what
we need to do when checking code for races, and when verifying that
a proposed patch really does fix a race. (2) is what we need to do
when investigating a race report.
The good news is that (1) is much cheaper than (2). By default,
Helgrind tries to report both stacks involved in a race. That
is expensive because it means collecting a stack trace for,
in effect, every memory reference, just in case it finds a
later memory reference that it races against. It is nearly
impossible to make sense of race reports without having stack
traces for both accesses involved, but reporting those requires
collecting just such a huge set of backtraces. This is what
makes (2) expensive.
Hence, the following scheme is recommended.
When doing (1), use the flag --history-level=none. This disables
the collection of old backtraces, which easily doubles the speed
of Helgrind. It means that Helgrind can only report a stack for
one of the accesses in a race -- the later observed one -- so you
can tell the race is there, but you can't tell what it is racing against.
When you want to investigate in detail, cut the workload down
as much as possible, and then re-enable the history mechanism,
either by simply omitting --history-level=none, or giving the
default setting --history-level=full. That should give you both
stacks involved in the race. If it doesn't, you may have to
throw even more memory at the problem via the --conflict-cache-size=
flag (try valgrind --tool=helgrind --help for details). This controls
how much historical data Helgrind accumulates.
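The two modes above can be sketched as a pair of commands. This is a dry run that only prints them. The target path ./firefox is a hypothetical placeholder, and the cache size of 5000000 is an arbitrary illustrative value rather than a recommendation from this page; check the --help output for the range your build accepts.

```shell
# Dry-run sketch: prints the commands instead of running them.
# TARGET is a hypothetical placeholder for the program under test.
TARGET="./firefox"

# Mode (1): cheap detection. Only the later access in a race gets a stack.
DETECT_CMD="valgrind --tool=helgrind --history-level=none $TARGET"

# Mode (2): full diagnosis on a reduced workload. The cache size here is
# an arbitrary example; raise it if the second stack is still missing.
DIAGNOSE_CMD="valgrind --tool=helgrind --history-level=full --conflict-cache-size=5000000 $TARGET"

echo "$DETECT_CMD"
echo "$DIAGNOSE_CMD"
```

A typical loop would be to run the first command over a broad workload, then rerun only the racy scenario with the second.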
There are two other flags for controlling resource use.
--check-stack-refs=no tells Helgrind not to race-check
references to thread stacks. Since stack accesses constitute a
significant fraction of the total data accesses done, it's worth
quite a bit in performance terms, perhaps around a 30% improvement.
This obviously means it won't detect races on thread stacks.
Allowing one thread to access another's stack sounds
pretty dubious, and it doesn't seem to happen much:
most reported races are to the heap or global variables,
so this is quite a good tradeoff.
--track-lockorders=no disables checking for inconsistencies in
lock order acquisition. Normally that doesn't consume much in
the way of CPU or memory, but we have seen some bad cases, so
if you're pushed on memory, it's worth disabling.
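Putting the resource-saving flags from this section together, a lowest-cost detection run might look like the sketch below. It is a dry run that only prints the command, the target path ./firefox is a hypothetical placeholder, and combining all three flags is just one possible choice: it gives up old stacks, stack-race checking, and lock order checking in exchange for speed and memory.

```shell
# Dry-run sketch: prints the command rather than running it.
# TARGET is a hypothetical placeholder for the program under test.
TARGET="./firefox"

# Cheapest configuration discussed above:
#   --history-level=none     no old backtraces are collected
#   --check-stack-refs=no    accesses to thread stacks are not race-checked
#   --track-lockorders=no    lock order inconsistencies are not checked
LEAN_FLAGS="--history-level=none --check-stack-refs=no --track-lockorders=no"
LEAN_CMD="valgrind --tool=helgrind $LEAN_FLAGS $TARGET"

echo "$LEAN_CMD"
```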