Performance Technology for Complex Parallel Systems
Part 3 – Alternative Tools and Frameworks
Bernd Mohr

Goals

  • Learn about commercial performance analysis products for complex parallel systems

    • Vampir event trace visualization and analysis tool
    • Vampirtrace event trace recording library
    • GuideView OpenMP performance analysis tool
    • VGV (integrated Vampir / GuideView environment)
  • Learn about future advanced components for automatic performance analysis and guidance

    • EXPERT automatic event trace analyzer
  • Discuss plans for performance tool integration


Vampir

  • Visualization and Analysis of MPI PRograms

  • Originally developed by Forschungszentrum Jülich

  • Current development by Technical University Dresden

  • Distributed by PALLAS, Germany

Vampir: General Description

  • Offline trace analysis for message passing trace files

  • Convenient user–interface / easy customization

  • Scalability in time and processor–space

  • Excellent zooming and filtering

  • Display and analysis of MPI and application events:

    • User subroutines
    • Point–to–point communication
    • Collective communication
    • MPI–2 I/O operations
  • Large variety of customizable (via context menus) displays for ANY part of the trace

Vampir: Main Window

  • Trace file loading can be

    • Interrupted at any time
    • Resumed
    • Started at a specified time offset
  • Provides main menu

    • Access to global and process local displays
    • Preferences
    • Help
  • Trace file can be re–written (re–grouped symbols)

Vampir: Timeline Diagram

Vampir: Timeline Diagram (Message Info)

  • Source–code references are displayed if recorded in trace

Vampir: Support for Collective Communication

  • For each process: locally mark operation

  • Connect start/stop points by lines

Vampir: Collective Communication Display

Vampir: MPI-I/O Support

  • MPI I/O operations are shown as message lines to a separate I/O system timeline

Vampir: Execution Statistics Displays

  • Aggregated profiling information: execution time, # calls, inclusive/exclusive

  • Available for all/any group (activity)

  • Available for all routines (symbols)

  • Available for any trace part (select in timeline diagram)
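
The aggregated profile described above can be derived from enter/exit events alone. The following is a minimal sketch of that aggregation, not Vampir's implementation: the event tuple layout and the `profile` function are assumptions made for illustration, and the trace is assumed to be time-ordered, properly nested, and free of recursion.

```python
from collections import defaultdict

def profile(events):
    """Aggregate (time, kind, routine) events of one process into per-routine stats.

    kind is "enter" or "exit" (hypothetical record layout).
    """
    stats = defaultdict(lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0})
    stack = []  # entries: [routine, enter_time, time_spent_in_children]
    for time, kind, routine in events:
        if kind == "enter":
            stack.append([routine, time, 0.0])
            continue
        name, start, child_time = stack.pop()
        if name != routine:
            raise ValueError(f"unbalanced trace: exit {routine}, expected {name}")
        inclusive = time - start
        entry = stats[name]
        entry["calls"] += 1
        entry["inclusive"] += inclusive               # time including callees
        entry["exclusive"] += inclusive - child_time  # time excluding callees
        if stack:
            stack[-1][2] += inclusive                 # charge the caller's child time
    return dict(stats)

trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (2.0, "enter", "MPI_Send"),
    (3.0, "exit", "MPI_Send"),
    (5.0, "exit", "solve"),
    (6.0, "exit", "main"),
]
result = profile(trace)
```

Restricting the statistics to a selected trace part would amount to filtering `events` by time before aggregation.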

Vampir: Communication Statistics Displays

  • Bytes sent/received for collective operations

  • Message length statistics

  • Available for any trace part
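
As a rough illustration of what such statistics contain, the sketch below sums bytes per sender/receiver pair and builds a message length histogram. The message tuple layout, the bucket boundaries, and the function name are assumptions, not part of Vampir.

```python
from bisect import bisect_right
from collections import Counter

def communication_stats(messages, bucket_edges=(0, 1024, 65536)):
    """messages: iterable of (sender, receiver, nbytes) tuples (hypothetical layout)."""
    bytes_by_pair = Counter()
    length_histogram = Counter()
    for sender, receiver, nbytes in messages:
        bytes_by_pair[(sender, receiver)] += nbytes
        # The last edge that is <= nbytes names the bucket.
        bucket = bucket_edges[bisect_right(bucket_edges, nbytes) - 1]
        length_histogram[bucket] += 1
    return bytes_by_pair, length_histogram

messages = [(0, 1, 512), (0, 1, 2048), (1, 0, 100000), (1, 2, 8)]
bytes_by_pair, length_histogram = communication_stats(messages)
```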

Vampir: Other Features

  • Parallelism display

  • Powerful filtering and trace comparison features

  • All diagrams highly customizable (through context menus)

Vampir: Process Displays

  • Activity chart

Vampir: New Features

  • New Vampir versions (3 and 4)

    • New core (dramatic timeline speedup, significantly reduced memory footprint)
    • Load–balance analysis display
    • Hardware counter value displays
    • Thread analysis
    • Display of hardware and grouping structure
    • Improved statistics displays
    • Raised scalability limits: can now analyse hundreds of processes/threads

Vampir: Load Balance Analysis

  • State Chart display

  • Aggregated profiling information: execution time, # calls, inclusive/exclusive

  • For all/any group (activity)

  • For all routines (symbols)

  • For any trace part

Vampir: HPM Counter

  • Counter Timeline Display

Vampir: Cluster Timeline

  • Display of whole system

Vampir: Cluster Timeline

  • SMP or Grid Nodes Display

Vampir: Cluster Timeline (2)

  • Display of messages between nodes enabled

Vampir: Improved Message Statistics Display

  • Process View

Release Schedule

  • Vampir/SX and Vampirtrace/SX

    • Version 1 available via NEC Japan
    • Version 2 is ready for release
  • Vampir/SC and Vampirtrace/SC

    • Version 3 is available from Pallas
    • Version 4 scheduled for Q4/2001
  • Vampir and Vampirtrace

    • Version 3 is scheduled for Q4/2001
    • Version 4 will follow in 2002

Vampir Feature Matrix


Vampirtrace

  • Commercial product of Pallas, Germany

  • Library for Tracing of MPI and Application Events

    • Records MPI point-to-point communication
    • Records MPI collective communication
    • Records MPI–2 I/O operations
    • Records user subroutines (on request)
    • Records source–code information (some platforms)
    • Support for shmem (Cray T3E)
  • Uses the PMPI profiling interface

  • http://www.pallas.de/pages/vampirt.htm

Vampirtrace: Usage

  • Record MPI–related information

    • Re–link a compiled MPI application (no re-compilation)
      • {f90,cc,CC} *.o -o myprog -L$(VTHOME)/lib -lVT -lpmpi -lmpi
    • Re-link with -vt option to MPICH compiler scripts
      • {mpif90,mpicc,mpiCC} -vt *.o -o myprog
    • Execute MPI binary as usual
  • Record user subroutines

    • Insert calls to Vampirtrace API (portable, but inconvenient)
    • Use automatic instrumentation (NEC SX, Fujitsu VPP, Hitachi SR)
    • Use instrumentation tool (Cray PAT, dyninst, ...)

Vampirtrace Instrumentation API (C / C++)

  • Calls for recording user subroutines

  • VT calls can only be used between MPI_Init and MPI_Finalize!

  • Event numbers used must be globally unique

  • Selective tracing: VT_traceoff(),VT_traceon()

VT++.h – C++ Class Wrapper for Vampirtrace

  • Same tricks can be used to wrap other C++ tracing APIs

  • Usage:

Vampirtrace Instrumentation API (Fortran)

  • Calls for recording user subroutines

  • Selective tracing: VTTRACEOFF(),VTTRACEON()

Vampirtrace: Runtime Configuration

  • Trace file collection and generation can be controlled by using a configuration file

    • Trace file name, location, size, flush behavior
    • Activation/deactivation of trace recording for specific processes, activities (groups of symbols), and symbols
  • Activate a configuration file with environment variables

    • VT_CONFIG: name of the configuration file (use an absolute pathname if possible)
    • VT_CONFIG_RANK: MPI rank of the process that should read and process the configuration file
  • Reduce trace file sizes

    • Restrict event collection in a configuration file
    • Use selective tracing functions
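
One way to apply the two environment variables consistently is to build the environment in a small launcher script. This is only a sketch: `vampirtrace_env` and the file name `trace.conf` are hypothetical, and how the environment reaches the MPI processes depends on the site's launcher.

```python
import os

def vampirtrace_env(config_file, reader_rank=0, base_env=None):
    """Return a copy of the environment with the Vampirtrace variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["VT_CONFIG"] = os.path.abspath(config_file)  # absolute path, as recommended
    env["VT_CONFIG_RANK"] = str(reader_rank)         # rank that reads the file
    return env

env = vampirtrace_env("trace.conf", reader_rank=0, base_env={})
# env could then be passed on, e.g. subprocess.run(launch_command, env=env),
# where launch_command is whatever the local MPI installation requires.
```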

Vampirtrace: Configuration File Example

  • Be careful to record complete message transfers!

  • See Vampirtrace User's Guide for complete description

New Features – Tracing

  • New Vampirtrace versions (3 and 4)

    • New core (significantly reduced memory and runtime overhead)
    • Better control of trace buffering and flush files
    • New filtering options
    • Event recording by thread
    • Support of MPI–I/O
    • Hardware counter data recording (PAPI)
    • Support of process/thread groups

Vampirtrace Feature Matrix


GuideView

  • Commercial product of KAI

  • OpenMP Performance Analysis Tool

  • Part of KAP/Pro Toolset for OpenMP

  • Looks for OpenMP performance problems

    • Load imbalance, synchronization, false sharing
  • Works from execution trace(s)

  • Compile with Guide, link with instrumented library

    • guidec++ -WGstats myprog.cpp -o myprog
    • guidef90 -WGstats myprog.f90 -o myprog
    • Run with real input data sets
    • View traces with guideview
  • http://www.kai.com/parallel/kappro/

GuideView: Whole Application View

  • Different

    • Number of processors
    • Datasets
    • Platforms

GuideView: Per Thread View

GuideView: Per Section View

GuideView: Analysis of hybrid Applications

  • Generate different Guide execution traces for each node

    • Run with node-local file system as current directory
    • Set trace file name with environment variable KMP_STATSFILE
      • Point to a file in the node-local file system: KMP_STATSFILE=/node-local/guide.gvs
      • Use special meta-character sequences (%H: hostname, %I: pid, %P: number of threads used): KMP_STATSFILE=guide-%H.gvs
  • Use "compare-multiple-run" feature to display together

  • Just a hack, better: use VGV!
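
To see which file names a KMP_STATSFILE pattern would produce on each node, the meta-character sequences listed above can be mimicked as follows. The expansion itself is done by the Guide runtime; `expand_statsfile` is a hypothetical helper that only reproduces the three sequences named on the slide.

```python
import os
import socket

def expand_statsfile(pattern, num_threads, hostname=None, pid=None):
    """Mimic %H (hostname), %I (pid), %P (number of threads) expansion."""
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    replacements = {"%H": hostname, "%I": str(pid), "%P": str(num_threads)}
    for key, value in replacements.items():
        pattern = pattern.replace(key, value)
    return pattern

per_host = expand_statsfile("guide-%H.gvs", 4, hostname="node07", pid=1234)
per_process = expand_statsfile("guide-%H-%I-%P.gvs", 4, hostname="node07", pid=1234)
```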

VGV – Architecture

  • Combine well–established tools

    • Guide and GuideView from KAI/Intel
    • Vampir/Vampirtrace from Pallas
  • Guide compiler inserts instrumentation

  • Guide runtime system collects thread–statistics

  • PAPI is used to collect HPM data

  • Vampirtrace handles event–based performance data acquisition and storage

  • Vampir is extended by GuideView–style displays

VGV – Architecture

VGV – Usage

  • Use Guide compilers by KAI

    • guidef77, guidef90
    • guidec, guidec++
  • Include instrumentation flags (links with Guide RTS and Vampirtrace)

  • Instrumentation can record

    • Parallel regions
    • MPI activity
    • Application routine calls
    • HPM data
  • Trace file collection and generation controlled by configuration file

Vampir: MPI Performance Analysis

GuideView: OpenMP Performance Analysis

Vampir: Detailed Thread Analysis

Availability and Roadmap

  • Beta version available (register with Pallas or KAI/Intel)

    • IBM SP running AIX
    • IA 32 running Linux
    • Compaq Alpha running Tru64
  • General release scheduled for Q1/2002

  • Improvements in the pipeline

    • Scalability enhancements
    • Ports to other platforms

KOJAK Overview

  • Kit for Objective Judgement and Automatic Knowledge-based detection of bottlenecks

  • Long-term goal: design and implementation of a portable, generic, automatic performance analysis environment

  • Current Focus

    • Event Tracing
    • Clusters of SMP
    • MPI, OpenMP, and Hybrid Programming Model
  • http://www.fz-juelich.de/zam/kojak/

Motivation Automatic Performance Analysis

  • Traditional Tools:

Motivation Automatic Performance Analysis (2)

Automatic Analysis Example: Late Sender

Automatic Analysis Example (2): Wait at NxN

EXPERT: Current Architecture

Event Tracing

  • Event Processing, Investigation, and LOGging (EPILOG)

  • Open (public) event trace format and API for reading/writing trace records

  • Event types: region enter and exit, collective region enter and exit, message send and receive, parallel region fork and join, and lock acquire and release

  • Supports

    • Hierarchical cluster hardware
    • Source code information
    • Performance counter values
  • Thread-safe implementation


  • Instrument user application with EPILOG calls

  • Done: basic instrumentation

    • User functions and regions:
      • undocumented PGI compiler (and manual) instrumentation
    • MPI calls:
      • wrapper library utilizing PMPI
    • OpenMP:
      • source-to-source instrumentation
  • Future work:

    • Tools for Fortran, C, C++ user function instrumentation
    • Object code and dynamic instrumentation

Instrumentation of OpenMP Constructs

  • OPARI: OpenMP Pragma And Region Instrumentor

  • Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions

  • Done: Supports

    • Fortran77 and Fortran90, OpenMP 2.0
    • C and C++, OpenMP 1.0
    • POMP Extensions
    • EPILOG and TAU POMP implementations
    • Preserves source code information (#line line file)
  • Work in Progress: Investigating standardization through OpenMP Forum

POMP OpenMP Performance Tool Interface

  • OpenMP Instrumentation

    • OpenMP Directive Instrumentation
    • OpenMP Runtime Library Routine Instrumentation
  • POMP Extensions

    • Runtime Library Control (init, finalize, on, off)
    • (Manual) User Code Instrumentation (begin, end)
    • Conditional Compilation (#ifdef _POMP, !$P)
    • Conditional / Selective Transformations ([no]instrument)


OpenMP API Instrumentation

  • Transform

    • omp_#_lock() → pomp_#_lock()
    • omp_#_nest_lock() → pomp_#_nest_lock()
    • [ # = init | destroy | set | unset | test ]
  • POMP version

    • Calls omp version internally
    • Can do extra stuff before and after call
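
The wrapping idea is independent of OpenMP. As an analogy only, the sketch below wraps Python's `threading.Lock` the way a pomp_#_lock routine wraps its omp_#_lock counterpart: the original operation is called internally and measurements are taken around it. The class `TimedLock` and its counters are illustrative assumptions, not part of POMP.

```python
import threading
import time

class TimedLock:
    """Stand-in for a pomp_*_lock wrapper: same operations plus measurement."""

    def __init__(self):
        self._lock = threading.Lock()  # the wrapped "omp version"
        self.wait_time = 0.0
        self.acquisitions = 0

    def set(self):
        start = time.perf_counter()    # extra work before the call
        self._lock.acquire()           # call the original operation
        self.wait_time += time.perf_counter() - start  # extra work after
        self.acquisitions += 1

    def unset(self):
        self._lock.release()

    def test(self):
        acquired = self._lock.acquire(blocking=False)
        if acquired:
            self.acquisitions += 1
        return acquired

lock = TimedLock()
lock.set()
busy = lock.test()       # lock is already held, so this fails
lock.unset()
free = lock.test()       # now the lock can be taken
lock.unset()
```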

Example: TAU POMP Implementation

    TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

    void pomp_for_enter(OMPRegDescr* r) {
    #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_START(tfor)
    #endif
    #ifdef TAU_OPENMP_REGION_VIEW
      TauStartOpenMPRegionTimer(r);
    #endif
    }

    void pomp_for_exit(OMPRegDescr* r) {
    #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_STOP(tfor)
    #endif
    #ifdef TAU_OPENMP_REGION_VIEW
      TauStopOpenMPRegionTimer(r);
    #endif
    }

OPARI: Basic Usage (f90)

  • Reset OPARI state information

    • rm -f opari.rc
  • Call OPARI for each input source file

    • opari file1.f90
    • ...
    • opari fileN.f90
  • Generate OPARI runtime table, compile it with an ANSI C compiler

    • opari -table opari.tab.c
    • cc -c opari.tab.c
  • Compile modified files *.mod.f90 using OpenMP

  • Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL
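
The command sequence above can be generated for an arbitrary list of source files. The sketch covers only the commands that appear on the slide; the OpenMP compile and the final link step are left out because their exact command lines depend on the compiler and the TAU installation. `opari_commands` is a hypothetical helper.

```python
def opari_commands(sources):
    """Return the OPARI preprocessing commands for the given source files."""
    commands = [["rm", "-f", "opari.rc"]]                # reset OPARI state
    commands += [["opari", src] for src in sources]      # instrument each file
    commands.append(["opari", "-table", "opari.tab.c"])  # generate runtime table
    commands.append(["cc", "-c", "opari.tab.c"])         # compile the table
    return commands

cmds = opari_commands(["file1.f90", "file2.f90"])
for cmd in cmds:
    print(" ".join(cmd))
```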

OPARI: Makefile Template (C/C++)

OPARI: Makefile Template (Fortran)

Automatic Analysis

  • EXtensible PERformance Tool (EXPERT)

  • Programmable, extensible, flexible performance property specification

  • Based on event patterns

  • Analyzes along three hierarchical dimensions

    • Performance properties (general → specific)
    • Dynamic call tree position
    • Location (machine → node → process → thread)
  • Done: fully functional demonstration prototype

Example: Late Sender (blocked receiver)

Example: Late Sender (2)

    class LateSender(Pattern):    # derived from class Pattern

        def parent(self):         # "logical" parent at property level
            return "P2P"

        def recv(self, recv):     # callback for recv events
            recv_start = self._trace.event(recv['enterptr'])
            if (self._trace.region(recv_start['regid'])['name']
                    == "MPI_Recv"):
                send = self._trace.event(recv['sendptr'])
                send_start = self._trace.event(send['enterptr'])
                if (self._trace.region(send_start['regid'])['name']
                        == "MPI_Send"):
                    idle_time = send_start['time'] - recv_start['time']
                    if idle_time > 0:
                        locid = recv_start['locid']
                        cnode = recv_start['cnodeptr']
                        self._severity.add(cnode, locid, idle_time)
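
The pattern above depends on EXPERT's `Pattern` base class and trace object, which are not shown. The following self-contained sketch applies the same test to a tiny in-memory trace so that the idle time computation can be followed end to end. The event dictionaries, the region table, and the use of list indices for 'enterptr' and 'sendptr' are simplifying assumptions, not the EPILOG format.

```python
from collections import defaultdict

REGIONS = {0: "MPI_Send", 1: "MPI_Recv"}

# 'enterptr' and 'sendptr' are indices into this list (assumed layout).
EVENTS = [
    {"type": "enter", "time": 1.0, "locid": 1, "regid": 1, "cnodeptr": "main/MPI_Recv"},
    {"type": "enter", "time": 4.0, "locid": 0, "regid": 0, "cnodeptr": "main/MPI_Send"},
    {"type": "send", "time": 4.1, "locid": 0, "enterptr": 1},
    {"type": "recv", "time": 4.5, "locid": 1, "enterptr": 0, "sendptr": 2},
]

def late_sender(events, regions):
    """Sum receiver idle time per (call tree node, location)."""
    severity = defaultdict(float)
    for event in events:
        if event["type"] != "recv":
            continue
        recv_start = events[event["enterptr"]]
        send = events[event["sendptr"]]
        send_start = events[send["enterptr"]]
        if (regions[recv_start["regid"]] == "MPI_Recv"
                and regions[send_start["regid"]] == "MPI_Send"):
            # The receiver was blocked from entering MPI_Recv until the send began.
            idle_time = send_start["time"] - recv_start["time"]
            if idle_time > 0:
                severity[(recv_start["cnodeptr"], recv_start["locid"])] += idle_time
    return dict(severity)

severity = late_sender(EVENTS, REGIONS)
```

Here the receiver (location 1) enters MPI_Recv at time 1.0 and the sender enters MPI_Send at 4.0, so 3.0 time units are attributed to the receiver's call tree node.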

Performance Properties (1)

  • [$100\% = (t_{\text{last event}} - t_{\text{1st event}}) \times \text{number of locations}$]

  • Total # Execution + Idle Threads time

    • Execution # Sum of exclusive time spent in each region
    • Idle Threads # Time wasted in idle threads while executing “sequential” code
  • Execution

    • MPI # Time spent in MPI functions
    • OpenMP # Time spent in OpenMP regions and API functions
    • I/O # Time spent in (sequential) I/O
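
The 100% reference value and the share of each property follow directly from the definition above. A small worked sketch with invented numbers (the severities are illustrative, not measured data):

```python
def total_time(first_event_time, last_event_time, num_locations):
    """100% reference: trace duration multiplied by the number of locations."""
    return (last_event_time - first_event_time) * num_locations

def as_percentages(severities, total):
    """Express each property's accumulated time as a share of the total."""
    return {name: 100.0 * value / total for name, value in severities.items()}

total = total_time(0.0, 50.0, 4)
shares = as_percentages(
    {"Execution": 150.0, "Idle Threads": 50.0, "MPI": 30.0},  # invented values
    total,
)
```

Execution and Idle Threads add up to the total, while MPI is a sub-property of Execution and is therefore reported as part of its share.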

Performance Properties (2)

  • MPI

    • Communication # Sum of Collective, P2P, 1-sided
      • Collective # Time spent in MPI collective communication operations
      • P2P # Time spent in MPI point-to-point communication operations
      • 1-sided # Time spent in MPI one-sided communication operations
    • I/O # Time spent in MPI parallel I/O functions (MPI_File*)
    • Synchronization # Time spent in MPI_Barrier

Performance Properties (3)

  • Collective

    • Early Reduce # Time wasted in root of N-to-1 operation by waiting for 1st sender (MPI_Gather, MPI_Gatherv, MPI_Reduce)
    • Late Broadcast # Time wasted by waiting for root sender in 1-to-N operation (MPI_Scatter, MPI_Scatterv, MPI_Bcast)
    • Wait at N x N # Time spent waiting for last participant at NxN operation (MPI_All*, MPI_Scan, MPI_Reduce_scatter)
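
The three collective properties can be approximated from the times at which each location enters the operation. The sketch below works under that simplifying assumption; the function names and the dictionary input are hypothetical, and EXPERT's actual pattern definitions may use additional event information.

```python
def wait_at_nxn(enter_times):
    """Each location waits until the last participant has entered."""
    last = max(enter_times.values())
    return {loc: last - t for loc, t in enter_times.items()}

def late_broadcast(enter_times, root):
    """Non-root locations wait if they enter before the root sender."""
    root_time = enter_times[root]
    return {loc: max(0.0, root_time - t)
            for loc, t in enter_times.items() if loc != root}

def early_reduce(enter_times, root):
    """The root waits if it enters before the first sender."""
    first_sender = min(t for loc, t in enter_times.items() if loc != root)
    return max(0.0, first_sender - enter_times[root])

enter_times = {0: 10.0, 1: 12.5, 2: 11.0}  # location id -> enter time
nxn = wait_at_nxn(enter_times)
bcast = late_broadcast(enter_times, root=1)
reduce_wait = early_reduce(enter_times, root=0)
```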

Performance Properties (4)

  • P2P

    • Late Receiver # Blocked sender
      • Messages in Wrong Order # Receiver too late because waiting for another message from same sender
    • Late Sender # Blocked receiver
      • Messages in Wrong Order # Receiver blocked because waiting for another message from same sender
    • Patterns related to non-blocking communication
    • Too many small messages

Performance Properties (5)

  • OpenMP

    • Synchronization # Time spent in OpenMP barrier and lock operations
      • Barrier # Time spent in OpenMP barrier operations
        • Implicit
          • Load Imbalance at Parallel Do, Single, Workshare
          • Not Enough Sections
        • Explicit
      • Lock Competition # Time wasted in omp_set_lock by waiting for lock release
    • Flush

Expert Result Presentation

  • Interconnected weighted tree browser

  • Scalable, yet still accurate

  • Each node has weight

    • Percentage of CPU allocation time
    • I.e. time spent in subtree of call tree
  • Displayed weight depends on state of node

    • Collapsed (including weight of descendants)
    • Expanded (without weight of descendants)
  • Displayed using

    • Color: makes hot spots (bottlenecks) easy to identify
    • Numerical value: Detailed comparison
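
The collapsed/expanded rule can be captured in a few lines. This is a sketch of the idea only, with a hypothetical `Node` class and invented weights; it is not the browser's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    own_weight: float                  # weight excluding descendants
    children: list = field(default_factory=list)
    expanded: bool = False

    def total_weight(self):
        """Weight of this node plus all of its descendants."""
        return self.own_weight + sum(c.total_weight() for c in self.children)

    def displayed_weight(self):
        # Collapsed nodes stand in for their whole subtree; expanded nodes
        # show only their own share because the children are visible.
        return self.own_weight if self.expanded else self.total_weight()

mpi = Node("MPI", 2.0, [Node("Communication", 10.0), Node("Synchronization", 3.0)])
collapsed = mpi.displayed_weight()
mpi.expanded = True
expanded = mpi.displayed_weight()
```

With this rule the weights visible on screen always add up to the same total, whichever nodes are expanded.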

Performance Properties View

Dynamic Call Tree View

  • Property “Idle Threads”

    • Mapped to call graph location of master thread
    • Highlights phases of “sequential” execution

Locations View

  • Supports locations up to Grid scale

  • Easily allows exploration of load balance problems on different levels

  • [ Of course, Idle Thread Problem only applies to slave threads ]

Performance Properties View (2)

  • Interconnected weighted trees:

  • Selecting another node in one tree affects the tree displays to the right of it

Dynamic Call Tree View

Locations View (2): Relative View

Automatic Performance Analysis

  • Automatic Performance Analysis: Resources and Tools

  • http://www.fz-juelich.de/apart/

  • ESPRIT Working Group 1999 - 2000

  • IST Working Group 2001 - 2004

  • 16 members worldwide

  • Prototype Tools (Paradyn, Kappa-PI, Aurora, Peridot, KOJAK/EXPERT, TAU)

Performance Analysis Tool Integration

  • Complex systems pose challenging performance analysis problems that require robust methodologies and tools

  • New performance problems will arise

    • Instrumentation and measurement
    • Data analysis and presentation
    • Diagnosis and tuning
  • No one performance tool can address all concerns

  • Look towards an integration of performance technologies

    • Evolution of technology frameworks to address problems
    • Integration support to link technologies to create performance problem solving environments
