Figure 5.6: Zoom to compute iterations of trace C
trace C is so short that it is barely visible. Zooming into the compute iterations
of trace C would make them visible but would also reveal only the “MPI Init” phase of
traces A and B, see Figure 5.6. In order to compare the compute iterations, the trace
files need to be aligned properly. This process is described in Section 5.3.
5.3 Alignment of Multiple Trace Files
The “Compare View” functionality to shift individual trace files in time makes it possible
to compare areas of the data that did not occur at the same time. For instance, in order
to compare the compute iterations of the three example trace files, these areas need to
be aligned to each other. This is necessary because the initialization of the application
took a different amount of time on each of the three machines.
Figure 5.7: Context menu controlling the time offset
Figure 5.8: Alignment in the Navigation Toolbar
There are several ways to shift the trace files in time. One way is to use the context
menu of the “Navigation Toolbar”. A right click on the toolbar reveals the menu shown
in Figure 5.7. Here the entry “Set Time Offset” allows the user to set the time offset
for the trace file manually. The entry “Reset Time Offset” resets the offset.
The easiest way to achieve a coarse alignment is to drag the trace file in the “Navigation
Toolbar” itself. While holding down the “Ctrl” (“Cmd” on Mac OS X) modifier key, the
trace can be dragged to the desired position with the left mouse button. In Figure 5.8
the compute iterations of all example trace files are coarsely aligned.
Figure 5.9: Alignment in the Master Timeline
After the coarse shifting, a finer alignment can be done in the “Master Timeline”. To do
so, the user needs to zoom into the area to compare. Then, while keeping the “Ctrl”
(“Cmd” on Mac OS X) modifier key pressed, the trace can be dragged with the left
mouse button in the “Master Timeline”. Figure 5.9 depicts the process of dragging
trace B to the compute iterations of trace A. As can be seen in Figure 5.9, although
the initialization of trace A took the longest time, this machine was the fastest in
computing the iterations.
5.4 Usage of Predefined Markers
The Open Trace Format (OTF) allows the definition of markers pointing to particular
places of interest in the trace data. These markers can be used to navigate in the trace
files. For trace file comparison, markers are especially useful because they make it
possible to quickly locate places in large trace data. With the help of markers it is
possible to find the same location in multiple trace files with just a few clicks.
Figure 5.10: Open Marker View
The first step in using markers is to open the “Marker View”. Figure 5.10 shows a
“Compare View” with an open “Marker View” for each trace file. After a click on a
marker in the “Marker View”, the selected marker is highlighted in the “Master Timeline”
and the “Process Timeline”.
Another way to navigate to a marker in the timeline displays is to use the Vampir zoom.
If the user has zoomed to the desired level in the “Master Timeline” or “Process
Timeline”, a click on a marker in the “Marker View” will shift the timeline zoom
to the marker position. Thus, the marker appears in the center of the timeline display,
see Figure 5.11.
Figure 5.11: Jump to marker in the Master Timeline
6 Customization
The appearance of the trace file and various other application settings can be altered
in the preferences, accessible via the main menu entry “File → Preferences”. Settings
concerning the trace file itself, e.g., layout or function group colors, are saved individually
next to the trace file in a file with the extension “.vsettings”. This way it is possible to
adjust the colors for individual trace files without interfering with others.
The options “Import Preferences” and “Export Preferences” allow loading and saving
the preferences of arbitrary trace files.
6.1 General Preferences
The “General” settings allow changing application- and trace-specific values.
“Show time as” determines whether the time format for the trace analysis is based on
seconds or ticks.
With the “Automatically open context view” option disabled, Vampir does not open the
context view after the selection of an item, like a message or function.
“Use color gradient in charts” allows switching off the color gradient used in the
performance charts.
The next option allows changing the style and size of the font.
“Show source code” enables the internal source code viewer. This viewer shows the
source code corresponding to selected locations in the trace file. In order to open a
source file first click on the intended function in the “Master Timeline” and then on the
source code path in the “Context View”. For the source code location to work properly,
you need a trace file with source code location support. The path to the source file
can be adjusted in the “Preferences” dialog. A limit for the size of the source file to be
opened can be set, too.
In the “Analysis” section the number of analysis threads can be chosen. If this option
is disabled, Vampir determines the number automatically from the number of cores, e.g.,
two analysis threads on a dual-core machine.
In the “Updates” section the user can decide if Vampir should check automatically for
new versions.
Figure 6.1: General Settings
Vampir also features a color blindness support mode.
On Linux systems the “Document layout” option is also available. If this option is
enabled, all open “Trace View” windows have to stay in one main window. If it is
disabled, the “Trace View” windows can be moved freely across the desktop.
6.2 Appearance
In the “Appearance” settings of the “Preferences” dialog there are six different types of
objects for which the color options can be changed: functions/function groups, markers,
counters, collectives, messages, and I/O events. Choose an entry and click on its color
to make a modification. A color picker dialog opens where it is possible to adjust the
color. For messages and collectives the line width can be changed as well.
In order to quickly find the desired item a search box is provided at the bottom of the
dialog.
Figure 6.2: Appearance Settings
6.3 Saving Policy
Vampir detects whenever changes to the various settings are made. In the “Saving
Policy” dialog it is possible to adjust the saving behavior of the different components to
your own needs.
In the “Saving Behavior” dialog you tell Vampir what to do in case of changed
preferences. The user can choose the categories of settings, e.g., the layout, that
should be affected by the selected behavior. Possible options are that the application
“Always” or “Never” saves changes automatically. The default option is to have Vampir
ask you whether to save or discard changes.
Usually the settings are stored in the folder of the trace file. If the user has no write
access to it, they can alternatively be placed in the “Application Data Folder”.
All settings stored this way are listed in the tab “Locally Stored Preferences” with their
creation and modification dates.
Note: On loading Vampir always favors settings in the “Application Data Folder”.
Figure 6.3: Saving Policy Settings
“Default Preferences” offers to save the preferences of the current trace file as default
settings, which are then used for trace files without their own settings. Another option
is to restore the default settings, which reverts the current preferences of the trace file.
7 A Use Case
This chapter explains by example how Vampir can be used to discover performance
problems in your code and how to correct them.
7.1 Introduction
In many cases the Vampir suite has been successfully applied to identify performance
bottlenecks and assist in their correction. To show in which ways the provided toolset
can be used to find performance problems in program code, one optimization process
is illustrated in this chapter. The following example is a three-part optimization of a
weather forecast model including a simulation of cloud microphysics. Every run of the
code has been performed on 100 cores with manual function instrumentation, MPI
communication instrumentation, and recording of the number of L2 cache misses.
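The manual does not list the measurement commands used. Assuming the trace was generated with VampirTrace, the setup described above might look roughly like this (the binary and source file names are invented, and the wrapper options and PAPI counter name should be verified against the local installation):

```shell
# Build with manual instrumentation via the VampirTrace compiler wrapper
# (invented file names; wrapper options may differ between versions):
vtcc -vt:inst manual -o forecast forecast.c

# Ask VampirTrace to record L2 total cache misses through PAPI:
export VT_METRICS=PAPI_L2_TCM

# One MPI process per core on 100 cores:
mpirun -np 100 ./forecast
```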
Figure 7.1: Master Timeline and Function Summary showing an overview of the program run
Getting a grasp of the program’s overall behavior is a reasonable first step. In Figure 7.1
Vampir has been set up to provide such a high-level overview of the model’s code.
This layout can be achieved through two simple manipulations. Set up the Master
Timeline to adjust the process bar height to fit the chart height. All 100 processes
are now arranged into one view. Likewise, change the event category in the Function
Summary to show function groups. This way the many functions are condensed into
fewer function groups.
One run of the instrumented program took 290 seconds to finish. The first half of
the trace (Figure 7.1 A) is the initialization part. Processes get started and synced,
input is read and distributed among these processes. The preparation of the cloud
microphysics (function group: MP) is done here as well.
The second half is the iteration part, where the actual weather forecasting takes place.
In a normal weather simulation this part would be much larger. But in order to keep
the recorded trace data and the overhead introduced by tracing as small as possible,
only a few iterations have been recorded. This is sufficient since they all do the
same work anyway. Therefore the simulation has been configured to forecast the
weather only 20 seconds into the future. The iteration part consists of two “large” iterations
(Figure 7.1 B and C), each calculating 10 seconds of forecast. Each of these in turn is
partitioned into several “smaller” iterations.
For our observations we focus on only two of these small, inner iterations, since this
is the part of the program where most of the time is spent. The initialization work
does not increase with a higher forecast duration and would only take a relatively small
amount of time in a real world run. The constant part at the beginning of each large
iteration takes less than a tenth of the whole iteration time. Therefore, by far the most
time is spent in the small iterations. Thus they are the most promising candidates for
optimization.
All screenshots starting with Figure 7.2 are arranged in a before-and-after fashion to
point out what changed after applying the specific improvements.
7.2 Identified Problems and Solutions
7.2.1 Computational Imbalance
Varying sizes of work packages (and thus varying processing times) cause waiting
time in subsequent synchronization routines. This section points out two easy
ways to recognize this problem.
Problem
As can be seen in Figure 7.2, each occurrence of the MICROPHYSICS-routine (purple
color) starts at the same time on all processes inside one iteration, but takes between
1.3 and 1.7 seconds to finish. This imbalance leads to idle time in subsequent
synchronization calls on processes 1 to 4, because they have to wait for process 0 to
finish its work (marked parts in Figure 7.2). This is wasted time which could be used for
Figure 7.2: Before Tuning: Master Timeline and Function Summary identifying MICROPHYSICS (purple color) as predominant and unbalanced
Figure 7.3: After Tuning: Timeline and Function Summary showing an improvement in
communication behavior
computational work if all calls to MICROPHYSICS had the same duration. Another
hint at this synchronization overhead is the fact that the MPI receive routine
uses 17.6% of the time of one iteration (Function Summary in Figure 7.2).
Solution
To even out this asymmetry, the code which determines the size of the work packages
for each process had to be changed. To achieve the desired effect, an improved version
of the domain decomposition has been implemented. Figure 7.3 shows that all
occurrences of the MICROPHYSICS-routine are now vertically aligned, thus balanced.
Additionally, the MPI receive routine calls are now clearly smaller than before. Comparing
the Function Summaries of Figure 7.2 and Figure 7.3 shows that the relative time spent
in MPI receive has decreased, and in turn the time spent inside MICROPHYSICS
has increased greatly. This means that we now spend more time computing and
less time communicating, which is exactly what we want.
7.2.2 Serial Optimization
Inlining of frequently called functions and elimination of invariant calculations inside
loops are two ways to improve the serial performance. This section shows how to
detect candidate functions for serial optimization and suggests measures to speed
them up.
Problem
All performance charts in Vampir show information for the time span currently selected
in the timeline. Thus the most time-intensive routine of one iteration can be determined
by zooming into one or more iterations and having a look at the Function Summary.
The function with the largest bar takes up the most time. In this example (Figure 7.2)
the MICROPHYSICS-routine can be identified as the most costly part of an iteration.
Therefore it is a good candidate for gaining speedup through serial optimization
techniques.
Solution
In order to get a fine-grained view of the MICROPHYSICS-routine’s inner workings, we
had to trace the program using full function instrumentation. Only then was it possible
to inspect and measure the subroutines and sub-subroutines of MICROPHYSICS. This
way the most time-consuming subroutines have been spotted and could be analyzed
for optimization potential.
The review showed that there were a couple of small functions which were called very
often, so we simply inlined them. With Vampir you can determine how often a function
is called by changing the metric of the Function Summary to the number of invocations.
The second inefficiency we discovered was invariant calculations being done inside
loops. We simply moved them in front of the respective loops.
Figure 7.3 sums up the tuning of the computational imbalance and the serial
optimization. In the timeline you can see that the duration of the MICROPHYSICS-routine
is now equal among all processes. Through serial optimization the duration has been
decreased from about 1.5 to 1.0 seconds. A decrease in duration of about 33% is quite
good given the simplicity of the changes made.
7.2.3 High Cache Miss Rate
The latency gap between cache and main memory is about a factor of 8. Therefore
optimizing for cache usage is crucial for performance. If data is not accessed in the
linear fashion the cache expects, so-called cache misses occur and the affected
instructions have to suspend execution until the requested data arrives from main
memory. A high cache miss rate therefore indicates that performance might be
improved by reordering the memory access pattern to match the cache layout of
the platform.
Problem
As can be seen in the Counter Data Timeline (Figure 7.4), the CLIPPING-routine (light
blue) causes a high number of L2 cache misses. Its duration is also long enough to
make it a candidate for inspection. These inefficiencies in cache usage were caused
by nested loops which accessed data in a very random, non-linear fashion. Data
access can only profit from the cache if subsequent reads access data in the vicinity
of the previously accessed data.
Solution
After reordering the nested loops to match the memory order, the tuned version of the
CLIPPING-routine needs only a fraction of the original time (Figure 7.5).
Figure 7.4: Before Tuning: Counter Data Timeline revealing a high amount of L2 cache
misses inside the CLIPPING-routine (light blue)
Figure 7.5: After Tuning: Visible improvement of the cache usage
7.3 Conclusion
By using the Vampir toolkit, three problems have been identified. As a consequence of
addressing each problem, the duration of one iteration has been decreased from 3.5
seconds to 2.0 seconds.
Figure 7.6: Overview showing a significant overall improvement
As shown by the Ruler (Section 4.1) in Figure 7.6, two large iterations now take 84
seconds to finish, whereas at first (Figure 7.1) they took roughly 140 seconds, a
total speed gain of 40%.
This huge improvement has been achieved by using the insight into the program’s
runtime behavior, provided by the Vampir toolkit, to optimize the inefficient parts of the
code.