Revision 92425 of Debugging Mozilla with Helgrind

  • Creator: jseward@acm.org

This page describes how to use Valgrind's Helgrind tool to find data races and other threading errors.  For details on using Valgrind's Memcheck tool to find memory errors, see Debugging Mozilla with Valgrind (https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind).

Helgrind is a tool for debugging threaded programs.  It detects three categories of errors:

* data races -- memory accessed by more than one thread, but without adequate synchronisation

* lock ordering inconsistencies -- which are potential deadlocks

* various misuses of the POSIX pthreads API -- unlocking somebody else's lock, etc.

Most of the complexity and difficulty is in detection of data races, so the discussion below focusses on that aspect.

Prerequisites

Data race detection is a much harder problem than Memcheck-style memory error detection.  Consequently, using Helgrind on Firefox requires somewhat more care and patience than using Memcheck on it.  Here are some hints to smooth the way.

You need three things:

* Markup for the Mozilla code base, which tells Helgrind about
synchronisation events it doesn't understand, and about some harmless
races.  Get it from bug 551155 comment 19:
https://bugzilla.mozilla.org/show_bug.cgi?id=551155#c19

* A suppression file that hides error reports in system libraries, at
bug 551155 comment 20:
https://bugzilla.mozilla.org/show_bug.cgi?id=551155#c20

* A development version of Helgrind, which you can check out and build
as follows:

    svn co svn://svn.valgrind.org/valgrind/branches/HGDEV2 hgdev2
    cd hgdev2
    ./autogen.sh
    ./configure --prefix=`pwd`/Inst
    make && make install
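
Once the build has finished, the new binary should sit under the install prefix chosen above.  A minimal sketch of a helper that locates it follows; HGDEV2_PREFIX and hgdev2_valgrind are names made up for this example, and the default assumes you run it from the directory that contains hgdev2.

```shell
# Hypothetical helper: print the path of the valgrind binary installed by
# the build steps above.  HGDEV2_PREFIX is an assumed variable name; if it
# is unset, fall back to ./hgdev2/Inst, matching --prefix=`pwd`/Inst.
hgdev2_valgrind() {
    hgdev2_prefix="${HGDEV2_PREFIX:-$PWD/hgdev2/Inst}"
    printf '%s\n' "$hgdev2_prefix/bin/valgrind"
}
```

Running "$(hgdev2_valgrind)" --version is then a quick way to check that you are picking up the development build rather than a system-installed Valgrind.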



Platform

You need to use Linux.  Helgrind hasn't been stress-tested
on MacOS to nearly the same extent.  Besides, suppression
files for system libraries are somewhat platform-specific.

If you want to run any serious workload (e.g. anything
much more than a startup and immediate quit of the browser),
a 64-bit build is strongly recommended.  In a 32-bit environment,
Helgrind will quickly eat up your 3GB of address space and die.
If you're doing something less demanding, for example checking a
standalone build of the JS engine, a 32-bit build is OK.

If you're race-checking just the JS shell, you first need to
build the entire marked-up browser, in order to create a
suitably marked-up NSPR.  Then build the JS shell, but
link it against the NSPR you just created.  The system NSPR won't
suffice, because some of the markup applies to NSPR.


What to expect

The markup patch tries to silence or fix enough races so that
a startup to a blank page and immediate quit produces no errors,
at least on 64-bit Ubuntu 10.04.  Unfortunately, Helgrind reports
(rightly or wrongly) many errors in system libraries, especially
the Gnome libraries.  Any deviation from the startup/quit scenario
above tends to produce false errors.

Hence the first thing to expect is possibly a number of errors
that have nothing to do with your code.  You can suppress these
in the normal way, by using --gen-suppressions=all and putting
the resulting bits of text in a suppression file.  A bit of time
spent assembling a suppression file for errors that seem irrelevant
quickly ameliorates this problem.
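
The suppression workflow described above could be wrapped as follows.  This is a sketch, not part of the original instructions: the function names and the VALGRIND variable (which should point at the Helgrind build you want to use) are assumptions made for the example.  --gen-suppressions=all and --suppressions= are standard Valgrind options.

```shell
# Assumed variable: the valgrind binary to run (defaults to the one on PATH).
VALGRIND="${VALGRIND:-valgrind}"

# Pass 1: have Valgrind print a ready-made suppression entry after
# every error it reports.
hg_gen_supps() {
    "$VALGRIND" --tool=helgrind --gen-suppressions=all "$@"
}

# Pass 2: rerun with a suppression file.  The first argument is the
# file; the remaining arguments are the program and its arguments.
hg_with_supps() {
    hg_supp_file="$1"
    shift
    "$VALGRIND" --tool=helgrind --suppressions="$hg_supp_file" "$@"
}
```

For example, run hg_gen_supps ./firefox, copy the entries for errors you consider irrelevant into a file (say local.supp, a hypothetical name), then run hg_with_supps local.supp ./firefox.  Valgrind accepts more than one --suppressions= option, so the file from bug 551155 can be passed as a further argument before the program name.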

The second thing to expect is that you won't necessarily get
exactly the same set of error reports from identical runs -- you might,
or you might not.  Helgrind uses a race detection algorithm which
is unfortunately scheduling-sensitive, and multiple identical
runs produce overlapping subsets of the full set of detectable
races.
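
Because each run may report a different subset, one way to cope is to repeat the same workload several times and keep one log per run.  This is an inference from the paragraph above rather than something the original page prescribes.  In the sketch below, hg_repeat, the VALGRIND variable and the hg-run-N.log file names are assumptions; --log-file= is a standard Valgrind option.

```shell
# Assumed variable: the valgrind binary to run (defaults to the one on PATH).
VALGRIND="${VALGRIND:-valgrind}"

# Run the same command under Helgrind several times, writing each run's
# reports to its own log file (hg-run-1.log, hg-run-2.log, ...).
# First argument: number of runs; the rest: the program and its arguments.
hg_repeat() {
    hg_runs="$1"
    shift
    hg_i=1
    while [ "$hg_i" -le "$hg_runs" ]; do
        # Keep going even if a run exits with a non-zero status.
        "$VALGRIND" --tool=helgrind --log-file="hg-run-$hg_i.log" "$@" || true
        hg_i=$((hg_i + 1))
    done
}
```

hg_repeat 5 ./firefox would leave five logs to compare; a race that shows up in any of them is worth a look, even if it is absent from the others.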

The third thing to expect is that Helgrind will run slowly
and eat large amounts of memory.  The next section discusses ways
around this.

Now, just in case you're feeling discouraged: bear in mind that even
with these difficulties, it's quite possible to get something
useful out of Helgrind.


Differences from the SVN trunk Helgrind

You may notice that this development branch of Helgrind produces
error messages in a different format from the SVN trunk or 3.6.1.
In particular, whenever it shows a stack for a thread involved
in a race, it also shows you the set of locks held by the thread
at that point.  This makes it much easier to reason about who
held what lock when, whether the two threads agreed on the lock
to use, etc.

This branch can also report races where one thread accesses heap
memory whilst another one frees it, and there is no synchronisation
event to guarantee that the access happens before the free.  This
check is disabled by default; enable it with --free-is-write=yes.



Cranking reasonable performance out of Helgrind

Left to itself, Helgrind collects a huge amount of data about the
history of your program's run, which costs both time and memory.
To get around this, there are command line options to selectively
disable some of that data collection.

A useful way to approach the resource problem is to differentiate
the activities of (1) detecting the presence of a race from
(2) collecting enough information to diagnose the cause of a race.
(1) is what we need to do when checking code for races, and when
verifying that a proposed patch really does fix a race.  (2) is what
we need to do when investigating a race report.

The good news is that (1) is much cheaper than (2).  By default,
Helgrind tries to report both stacks involved in a race.  That
is expensive because it means collecting a stack trace for,
in effect, every memory reference, just in case it finds a
later memory reference that it races against.  It is nearly
impossible to make sense of race reports without having stack
traces for both accesses involved, but reporting those requires
collecting just such a huge set of backtraces.  This is what
makes (2) expensive.

Hence, the following scheme is recommended.

When doing (1), use the flag --history-level=none.  This disables
the collection of old backtraces, which easily doubles the speed
of Helgrind.  It means that Helgrind can only report a stack for
one of the accesses in a race -- the later observed one -- so you
can tell the race is there, but you can't tell what it is racing
against.

When you want to investigate in detail, cut the workload down
as much as possible, and then re-enable the history mechanism,
either by simply omitting --history-level=none, or by giving the
default setting --history-level=full.  That should give you both
stacks involved in the race.  If it doesn't, you may have to
throw even more memory at the problem via the --conflict-cache-size=
flag (try valgrind --tool=helgrind --help for details).  This controls
how much historical data Helgrind accumulates.
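
The two modes could be captured in a pair of shell functions.  This is a sketch under stated assumptions: hg_detect, hg_diagnose and the VALGRIND variable are names invented for the example, and the cache size is left as a parameter because a suitable value depends on the workload.

```shell
# Assumed variable: the valgrind binary to run (defaults to the one on PATH).
VALGRIND="${VALGRIND:-valgrind}"

# Mode (1): cheap detection.  Only the later access in each race gets a
# stack trace, which is enough to see whether a race is present.
hg_detect() {
    "$VALGRIND" --tool=helgrind --history-level=none "$@"
}

# Mode (2): full diagnosis on a reduced workload.  The first argument is
# the value passed to --conflict-cache-size=; the rest is the program.
hg_diagnose() {
    hg_cache_size="$1"
    shift
    "$VALGRIND" --tool=helgrind --history-level=full \
        --conflict-cache-size="$hg_cache_size" "$@"
}
```

A typical sequence would be hg_detect ./firefox to confirm the race exists, followed by hg_diagnose N on a cut-down test case, raising N if the second stack is still missing.  Check the --help output for the range of values your build accepts.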



There are two other flags for controlling resource use.

--check-stack-refs=no tells Helgrind not to race-check
references to thread stacks.  Since stack accesses constitute a
significant fraction of the total data accesses done, it's worth
quite a bit in performance terms: perhaps a 30% improvement.
The obvious cost is that races on thread stacks go undetected.
But allowing one thread to access another's stack sounds
pretty dubious, and it doesn't seem to happen much:
most reported races are to the heap or global variables,
so this is quite a good tradeoff.

--track-lockorders=no disables checking for inconsistencies in
lock acquisition order.  Normally that doesn't consume much in
the way of CPU or memory, but we have seen some bad cases, so
if you're pushed for memory, it's worth disabling.
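
Putting the resource-saving flags from this section together gives the cheapest configuration discussed here.  A sketch, assuming you only need to know whether races exist; hg_lean and the VALGRIND variable are names invented for the example, and the flags are spelled as they are on this page.

```shell
# Assumed variable: the valgrind binary to run (defaults to the one on PATH).
VALGRIND="${VALGRIND:-valgrind}"

# Lowest-cost race detection: no history of old accesses, no checking of
# stack references, no lock order tracking.
hg_lean() {
    "$VALGRIND" --tool=helgrind \
        --history-level=none \
        --check-stack-refs=no \
        --track-lockorders=no \
        "$@"
}
```

Bear in mind that hg_lean ./firefox reports neither lock ordering problems nor races on thread stacks, so drop the corresponding flag when you need either of those checks.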
