Figure 5.6: Zoom to compute iterations of trace C
trace C is so short that it is barely visible. Zooming into the compute iterations
of trace C would make them visible but would also reveal only the “MPI Init” phase of
traces A and B, see Figure 5.6. In order to compare the compute iterations, the trace
files need to be aligned properly. This process is described in Section 5.3.
5.3 Alignment of Multiple Trace Files
The “Compare View” functionality to shift individual trace files in time makes it possible
to compare areas of the data that did not occur at the same time. For instance, in order
to compare the compute iterations of the three example trace files, these areas need to
be aligned to each other. This is necessary because the initialization of the application
took a different amount of time on each of the three machines.
Figure 5.7: Context menu controlling the time offset
Figure 5.8: Alignment in the Navigation Toolbar
There are several ways to shift the trace files in time. One way is to use the context
menu of the “Navigation Toolbar”. A right click on the toolbar reveals the menu shown
in Figure 5.7. Here the entry “Set Time Offset” allows the user to set the time offset
for the trace file manually. The entry “Reset Time Offset” resets the offset.
The easiest way to achieve a coarse alignment is to drag the trace file in the “Navigation
Toolbar” itself. While holding down the “Ctrl” (“Cmd” on Mac OS X) modifier key, the
trace can be dragged to the desired position with the left mouse button. In Figure 5.8
the compute iterations of all example trace files are coarsely aligned.
Figure 5.9: Alignment in the Master Timeline
After the coarse shifting, a finer alignment can be done in the “Master Timeline”. To do
so, the user needs to zoom into the area to compare. Then, while keeping the “Ctrl”
(“Cmd” on Mac OS X) modifier key pressed, the trace can be dragged with the left
mouse button in the “Master Timeline”. Figure 5.9 depicts the process of dragging
trace B to the compute iterations of trace A. As can be seen in Figure 5.9, although
the initialization of trace A took the longest time, this machine was the fastest in
computing the iterations.
5.4 Usage of Predefined Markers
The Open Trace Format (OTF) allows the definition of markers pointing to particular
places of interest in the trace data. These markers can be used to navigate in the trace
files. For trace file comparison, markers are especially useful because they make it
possible to quickly locate places in large trace data. With the help of markers it is
possible to find the same location in multiple trace files with just a few clicks.
Figure 5.10: Open Marker View
The first step in using markers is to open the “Marker View”. Figure 5.10 shows a
“Compare View” with an open “Marker View” for each trace file. After a click on a
marker in the “Marker View”, the selected marker is highlighted in the “Master Timeline”
and the “Process Timeline”.
Another way to navigate to a marker in the timeline displays is to use the Vampir zoom.
If the user has zoomed to the desired level in the “Master Timeline” or “Process
Timeline”, a click on a marker in the “Marker View” will shift the timeline zoom
to the marker position. Thus, the marker appears in the center of the timeline display,
see Figure 5.11.
Figure 5.11: Jump to marker in the Master Timeline
6 Customization
The appearance of the trace file and various other application settings can be altered
in the preferences, accessible via the main menu entry “File → Preferences”. Settings
concerning the trace file itself, e.g., layout or function group colors, are saved individually
next to the trace file in a file with the extension “.vsettings”. This way it is possible to
adjust the colors for individual trace files without interfering with others.
The options “Import Preferences” and “Export Preferences” allow loading and saving
the preferences of arbitrary trace files.
6.1 General Preferences
The “General” settings allow changing application- and trace-specific values.
“Show time as” determines whether the time format for the trace analysis is based on
seconds or ticks.
With the “Automatically open context view” option disabled, Vampir does not open the
context view after the selection of an item, like a message or function.
“Use color gradient in charts” allows switching off the color gradient used in the
performance charts.
The next option allows changing the style and size of the font.
“Show source code” enables the internal source code viewer. This viewer shows the
source code corresponding to selected locations in the trace file. In order to open a
source file first click on the intended function in the “Master Timeline” and then on the
source code path in the “Context View”. For the source code location to work properly,
you need a trace file with source code location support. The path to the source file
can be adjusted in the “Preferences” dialog. A limit for the size of the source file to be
opened can be set, too.
In the “Analysis” section the number of analysis threads can be chosen. If this option
is disabled, Vampir determines the number automatically from the number of cores, e.g.,
two analysis threads on a dual-core machine.
In the “Updates” section the user can decide if Vampir should check automatically for
new versions.
Figure 6.1: General Settings
Vampir also features a color blindness support mode.
On Linux systems the “Document layout” option is also available. If this option is
enabled, all open “Trace View” windows have to stay in one main window. If it is
disabled, the “Trace View” windows can be moved freely across the desktop.
6.2 Appearance
In the “Appearance” settings of the “Preferences” dialog there are six different types of
objects for which the color options can be changed: functions/function groups, markers,
counters, collectives, messages, and I/O events. Choose an entry and click on its color
to make a modification. A color picker dialog opens where it is possible to adjust the
color. For messages and collectives the line width can be changed as well.
In order to quickly find the desired item a search box is provided at the bottom of the
dialog.
Figure 6.2: Appearance Settings
6.3 Saving Policy
Vampir detects whenever changes to the various settings are made. In the “Saving
Policy” dialog it is possible to adjust the saving behavior of the different components to
your own needs.
In the “Saving Behavior” dialog you tell Vampir what to do in case of changed
preferences. The user can choose the categories of settings, e.g., the layout, that
should be affected by the selected behavior. Possible options are that the application
“Always” or “Never” saves changes automatically. The default option is to have Vampir
ask you whether to save or discard changes.
Usually the settings are stored in the folder of the trace file. If the user has no write
access to it, they can alternatively be placed in the “Application Data Folder”.
All settings stored this way are listed in the tab “Locally Stored Preferences” with their
creation and modification dates.
Note: On loading Vampir always favors settings in the “Application Data Folder”.
Figure 6.3: Saving Policy Settings
“Default Preferences” offers to save the preferences of the current trace file as default
settings, which are then used for trace files without their own settings. Another option
is to restore the default settings, which reverts the current preferences of the trace file.
7 A Use Case
This chapter explains by example how Vampir can be used to discover performance
problems in your code and how to correct them.
7.1 Introduction
In many cases the Vampir suite has been successfully applied to identify performance
bottlenecks and assist in their correction. To show in which ways the provided toolset
can be used to find performance problems in program code, one optimization process
is illustrated in this chapter. The following example is a three-part optimization of a
weather forecast model including a simulation of cloud microphysics. Every run of the
code has been performed on 100 cores with manual function instrumentation, MPI
communication instrumentation, and recording of the number of L2 cache misses.
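The manual does not list the measurement commands used. Assuming the trace was generated with VampirTrace, the setup described above might look roughly like this (the binary and source file names are invented, and the wrapper options and PAPI counter name should be verified against the local installation):

```shell
# Build with manual instrumentation via the VampirTrace compiler wrapper
# (invented file names; wrapper options may differ between versions):
vtcc -vt:inst manual -o forecast forecast.c

# Ask VampirTrace to record L2 total cache misses through PAPI:
export VT_METRICS=PAPI_L2_TCM

# One MPI process per core on 100 cores:
mpirun -np 100 ./forecast
```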
Figure 7.1: Master Timeline and Function Summary showing an overview of the program run
Getting a grasp of the program’s overall behavior is a reasonable first step. In Figure 7.1
Vampir has been set up to provide such a high-level overview of the model’s code.
This layout can be achieved through two simple manipulations. Set up the Master
Timeline to adjust the process bar height to fit the chart height. All 100 processes
are now arranged into one view. Likewise, change the event category in the Function
Summary to show function groups. This way the many functions are condensed into
fewer function groups.
One run of the instrumented program took 290 seconds to finish. The first half of
the trace (Figure 7.1 A) is the initialization part. Processes get started and synced,
input is read and distributed among these processes. The preparation of the cloud
microphysics (function group: MP) is done here as well.
The second half is the iteration part, where the actual weather forecasting takes place.
In a normal weather simulation this part would be much larger. But in order to keep
the recorded trace data and the overhead introduced by tracing as small as possible,
only a few iterations have been recorded. This is sufficient since they all do the
same work anyway. Therefore the simulation has been configured to forecast the
weather only 20 seconds into the future. The iteration part consists of two “large” iterations
(Figure 7.1 B and C), each calculating 10 seconds of forecast. Each of these in turn is
partitioned into several “smaller” iterations.
For our observations we focus on only two of these small, inner iterations, since this
is the part of the program where most of the time is spent. The initialization work
does not increase with a higher forecast duration and would only take a relatively small
amount of time in a real world run. The constant part at the beginning of each large
iteration takes less than a tenth of the whole iteration time. Therefore, by far the most
time is spent in the small iterations. Thus they are the most promising candidates for
optimization.
All screenshots starting with Figure 7.2 are arranged in a before-and-after fashion to
point out what changed after applying the specific improvements.
7.2 Identified Problems and Solutions
7.2.1 Computational Imbalance
Varying sizes of work packages (and thus varying processing times) cause waiting
time in subsequent synchronization routines. This section points out two easy
ways to recognize this problem.
Problem
As can be seen in Figure 7.2, each occurrence of the MICROPHYSICS-routine (purple
color) starts at the same time on all processes inside one iteration, but takes between
1.3 and 1.7 seconds to finish. This imbalance leads to idle time in subsequent
synchronization calls on processes 1 to 4, because they have to wait for process 0 to
finish its work (marked parts in Figure 7.2). This is wasted time which could be used for
Figure 7.2: Before Tuning: Master Timeline and Function Summary identifying MICROPHYSICS (purple color) as predominant and unbalanced
Figure 7.3: After Tuning: Timeline and Function Summary showing an improvement in
communication behavior
computational work if all calls to MICROPHYSICS had the same duration. Another
hint at this synchronization overhead is the fact that the MPI receive routine
uses 17.6% of the time of one iteration (Function Summary in Figure 7.2).
Solution
To even out this asymmetry, the code which determines the size of the work packages
for each process had to be changed. To achieve the desired effect, an improved version
of the domain decomposition has been implemented. Figure 7.3 shows that all
occurrences of the MICROPHYSICS-routine are now vertically aligned, thus balanced.
Additionally, the MPI receive routine calls are now clearly smaller than before. Comparing
the Function Summaries of Figure 7.2 and Figure 7.3 shows that the relative time spent
in MPI receive has decreased, and in turn the time spent inside MICROPHYSICS
has increased greatly. This means that we now spend more time computing and
less time communicating, which is exactly what we want.
7.2.2 Serial Optimization
Inlining of frequently called functions and elimination of invariant calculations inside
loops are two ways to improve the serial performance. This section shows how to
detect candidate functions for serial optimization and suggests measures to speed
them up.
Problem
All performance charts in Vampir show information for the time span currently selected
in the timeline. Thus the most time-intensive routine of one iteration can be determined
by zooming into one or more iterations and having a look at the Function Summary.
The function with the largest bar takes up the most time. In this example (Figure 7.2)
the MICROPHYSICS-routine can be identified as the most costly part of an iteration.
Therefore it is a good candidate for gaining speedup through serial optimization
techniques.
Solution
In order to get a fine-grained view of the MICROPHYSICS-routine’s inner workings, we
had to trace the program using full function instrumentation. Only then was it possible
to inspect and measure the subroutines and sub-subroutines of MICROPHYSICS. This
way the most time-consuming subroutines have been spotted and could be analyzed
for optimization potential.
The review showed that there were a couple of small functions which were called very
often, so we simply inlined them. With Vampir you can determine how often a function
is called by changing the metric of the Function Summary to the number of invocations.
The second inefficiency we discovered was invariant calculations being done inside
loops. We simply moved them in front of the respective loops.
Figure 7.3 sums up the tuning of the computational imbalance and the serial
optimization. In the timeline you can see that the duration of the MICROPHYSICS-routine
is now equal among all processes. Through serial optimization the duration has been
decreased from about 1.5 to 1.0 seconds. A decrease in duration of about 33% is quite
good given the simplicity of the changes made.
7.2.3 High Cache Miss Rate
The latency gap between cache and main memory is about a factor of 8. Therefore
optimizing for cache usage is crucial for performance. If data is not accessed in the
linear fashion the cache expects, so-called cache misses occur and the affected
instructions have to suspend execution until the requested data arrives from main
memory. A high cache miss rate therefore indicates that performance might be
improved by reordering the memory access pattern to match the cache layout of
the platform.
Problem
As can be seen in the Counter Data Timeline (Figure 7.4), the CLIPPING-routine (light
blue) causes a high number of L2 cache misses. Its duration is also long enough to
make it a candidate for inspection. These inefficiencies in cache usage were caused
by nested loops which accessed data in a very random, non-linear fashion. Data
access can only profit from the cache if subsequent reads access data in the vicinity
of the previously accessed data.
Solution
After reordering the nested loops to match the memory order, the tuned version of the
CLIPPING-routine needs only a fraction of the original time (Figure 7.5).
Figure 7.4: Before Tuning: Counter Data Timeline revealing a high amount of L2 cache
misses inside the CLIPPING-routine (light blue)
Figure 7.5: After Tuning: Visible improvement of the cache usage
7.3 Conclusion
By using the Vampir toolkit, three problems have been identified. As a consequence of
addressing each problem, the duration of one iteration has been decreased from 3.5
seconds to 2.0 seconds.
Figure 7.6: Overview showing a significant overall improvement
As shown by the Ruler (Section 4.1) in Figure 7.6, two large iterations now take 84
seconds to finish, whereas at first (Figure 7.1) they took roughly 140 seconds, a
total speed gain of 40%.
This huge improvement has been achieved by using the insight into the program’s
runtime behavior, provided by the Vampir toolkit, to optimize the inefficient parts of the
code.