This article needs a technical review. How you can help.
This page explains how to read a performance profile using the 'Cleopatra' front-end. Performance profiles containing a lot of data and finding performance bottlenecks are not easy. Thus, Cleopatra is a complicated tool that requires practice to master! Here's some useful information to get started.
For some, the best way to learn is by watching. Here's an example of Benoit reading a B2G profile. This is an example of how to read a profile and finding the significant bottlenecks. It can take a lot of practice to get better at collecting good quality profiles (i.e. recording the thing that is slow) and how to analyze it and find the key samples. You may have to rewatch the video several times.
The very top is the breadcrumb bar shows your current active selection. Currently, we're showing the 'Complete Profile', but we have an inactive 'Sample Range [x, y]' selection.
The first part of the UI, below the breadcrumb bar, are the timelines. At the very top, you will see the 'Frames' view. This is a summary of the painting pipeline, similar to what you'd see from the DevTools profiler. This gives you an overview of the painting pipeline of a particular process. In this case the 'GeckoMain' (Chrome) process. The painting pipeline is only active if potential rendering changes are made. Therefore, as seen near the start of the Frames timeline, there may be a long period of inactivity. This is fine.
Thread/Process Timelines: Then we have a list of profiled threads. These threads may come from different processes. In this case, we have the 'GeckoMain' (Chrome) process's main thread, the plugin process's main thread, and an e10s content process's main thread. Each of these timelines is lined up by wallclock time. So for example, when a thread is blocked, like 'Content', on a thread like GeckoMain/Chrome main thread we can see what's occurring on the main thread that is preventing it from responding.
X (Time) & Y (Stack) axis: The timelines go from right to left as wall clock time increases along the X axis. The Y axis is the stack depth, not the CPU activity. The change in stack height is useful to find a pattern like long blocking call (long flatlines), very tall spiky blocks (recursive calls and JS). With more experience you can read profiles faster by recognizing patterns.
Red samples: The samples fade from black to red when the event loop is being unresponsive. The plugin thread is red because its event loop is not being monitored correctly. Gaps in the timeline indicate that the profiler was unable to collect a 'sample' in time and that there is not data during this period.
Green samples: Green samples are samples that are selected by the tree view. This is a very critical feature in the profiler. It lets you see if the execution time is clustered in a single blocking chunk of time, or split across the timeline.
Selection & Zooming: The grey rectangle shows a selection area, which is highlighted in the breadcrumb bar at the very top. You can drag over any region of the timeline and zoom into this time range to look at it in more detail. This will push an inactive breadcrumb entry in the timeline. To focus this view you must activate the breadcrumb entry. When zooming in to your new selection, the entire UI will update to the new view. This will also update the Tree View (see below).
2. Tree View
The Tree View shows the samples organized by 'Running Time'. This will show the data by wall clock time, and thus, a significant portion of time is spent in idle, blocking calls like waiting for events, which is ideal for a responsive application to be ready to service incoming events.
As we progress into a more specific part of the tree, you'll notice that the 'Running time' decreases. This happens when a function has 2 or more non-trivial calls, the running time will be split between its children.
On the right-hand size, you'll see a 'Most time spent in'. This is strictly a shortcut display of the heaviest left-most extensions of the tree and can be ignored.
3. Profile Summary
Profile Info: This will display information about the build in which the profile was collected and which profiler settings were selected.
Selection Info: This shows: 1) The event lag, a measurement of how long it takes for events to get through the event loop. An ideal measurement in ~10ms (the measurement frequency). 2) The Real Interval is a measurement of the average time between samples. Ideally, this number should match the selected 'Interval' as listed in the 'Profile Info' section above. Since this is a userspace profiler, excessive system load can cause the profiler not to fire within its deadline. In this case, an ideal 'Real Interval' would be 1 ms + or - 0, which 1.01 ms + or - 2 is close enough. In this case, the +2 indicates that we went up to 3ms without taking a sample which is normal.
Pre Filtering: Allows your to filter using a case insensitive substring search over the Tree View and Timeline. This is useful if you're looking for a particular symbol.
View Options: This will allow you to change which and how the data is presented in the tree view. In particular 'Invert Callstack' will show a 'Heavy' view. This is useful if you're spending a lot of time in a function but have many possible ways to get into that function. This may not be shown clearly in the normal view but should be more obvious in the 'Invert callstack' view.
Share With URL: This section allows you to download the profile to a file to share offline or upload the file publicly to a web server so that you may easily share the profile over email, Bugzilla or IRC and ask for help or share your findings.
Compare: An experimental feature to compare to profiles. For this feature to be effective, the profiles must be collected until the [tbd].
If you profile the Compositor thread in the main b2g process, you get a frames view each time a new frame is produced. It will look something like this:
We want to take a look at the Frames view. We see a couple of things:
- Vsync [Orange] - This is a vsync marker. This is available when the hardware composer on the device is supplying vsync events. Since vsyncs occur at a single point in time, vsyncs are a block only for visual readability. The actual vsync time occurred at the start of the vsync box. See Project Silk for why this is important.
- Refresh [Green] - This is what's occurring during an nsRefreshDriver::Tick phase.
- DisplayList [Red] - The amount of time taken to create a display list.
- Reflow [Green] - Amount of time spent during reflow.
- Composite [Purple] - Amount of time spent performing a composite.
- LayerTranscaction - Amount of time spent performing a layer transaction.
In the results above we can see that we're spending ~113 milliseconds in
cp_handleEvent() when running,
BrowserElementPanning.js line 77. That means 113 ms for
cp_handleEvent and ALL child functions that are called. We spend 56 ms of that in
cp_onTouchStart. We then spend 41 ms in
cp_getPannable, 40 ms of which is spent in
cp_findPannable. Essentially, you're looking for the processes that are taking the most time; then you can figure out how to optimize them.