MSc in High Performance Computing Computational Chemistry Module Introduction to Molecular Dynamics Bill Smith Computational Science and Engineering STFC Daresbury Laboratory Warrington WA4 4AD
MSc in High Performance Computing Computational Chemistry Module Lecture 4 – Parallel Performance Paul Sherwood and Huub J J van Dam CCLRC Daresbury Laboratory p.sherwood@daresbury.ac.uk
Outline
- Parallel vs. Serial
- Identifying bottlenecks
- Analysing bottlenecks
Performance analysis: Parallel vs. Serial
Performance analysis of parallel codes is similar to that of serial codes, but the new dimension in parallel codes is the performance as a function of the number of processors.
In parallel codes the performance is determined by:
- The fraction of serial work (i.e. lack of parallelism, see Amdahl’s law below)
- The efficiency of the communication
- The balance between communication and work
The impact of the above depends not only on the number of processors but also on the problem size.
Remember Amdahl’s law: T = 1 / ((1 - P) + P / S), where
- P is the proportion of the total time spent in a given step
- S is the speedup achieved on this particular step
- T is then the overall speedup
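To see how quickly the serial fraction starts to dominate, the short sketch below evaluates Amdahl’s law in its usual form, T = 1 / ((1 - P) + P / S), for a range of processor counts. The function name `amdahl_speedup`, the value P = 0.95 and the assumption that the parallel step speeds up ideally (S equal to the processor count) are illustrative choices, not taken from the lecture.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup T when a fraction p of the runtime is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    p = 0.95  # illustrative assumption: 95% of the runtime parallelises
    for nproc in (1, 8, 64, 512, 4096):
        # Assumes ideal speedup of the parallel step, i.e. s == nproc.
        t = amdahl_speedup(p, nproc)
        print(f"{nproc:5d} processors -> overall speedup {t:6.2f}")
    # The speedup can never exceed 1 / (1 - p), however many processors are used.
    print(f"upper bound 1/(1-p) = {1.0 / (1.0 - p):.1f}")
```

With P = 0.95 the overall speedup can never exceed 20, which is why a step that looks negligible on a few processors can become the bottleneck on many.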
Identifying bottlenecks
As in the serial case, the most gain is to be had by optimising the most expensive steps. Unlike in the serial case, a step that is unimportant on low processor counts may prove to be the bottleneck on high processor counts.
A few tools are available, including:
- gprof and xprofiler
- Built-in timers (in codes developed by people who are serious about performance)
More tools are available if you are prepared to instrument the code, but then it makes sense to choose an instrumentation approach that also helps in analysing communications.
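As a rough illustration of what built-in timers can look like, the sketch below accumulates wall-clock time and call counts per named section. The class `SectionTimers` and the section names are hypothetical and only show the general pattern; production codes typically implement this in the application language and collect the timings from every process.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimers:
    """Accumulate wall-clock time and call counts per named code section."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        """Return (name, seconds, calls) tuples, most expensive section first."""
        rows = [(n, self.seconds[n], self.calls[n]) for n in self.seconds]
        return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    timers = SectionTimers()
    for _ in range(3):
        with timers.section("forces"):   # hypothetical section name
            sum(i * i for i in range(200_000))
        with timers.section("update"):   # hypothetical section name
            sum(range(20_000))
    for name, secs, calls in timers.report():
        print(f"{name:10s} {secs:10.6f} s in {calls} calls")
```

In a parallel run, comparing the minimum and maximum time per section across processes is a simple first check for load imbalance.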
Analysing bottlenecks
Finding out why certain sections of the program take so long.
Often communication turns out to be a major component, so it helps to have a tool that:
- Reports on the performance of the communications
- Can show the communication behaviour in relation to relevant sections of the program
On HPCx some of this information is accessible through mpiprof. For detailed information, however, you’ll need to instrument your code.
Tools available include:
- Paraver (http://www.cepba.upc.es/)
- Vampir (was http://www.pallas.com/ now Intel Trace Analyzer)
- OPT (http://www.allinea.com/)
- No free tools though
Using Vampir: introduction
Vampir
Vampirtrace
- Trace library for MPI applications
- Uses the MPI profiling interface and is therefore independent of a given MPI implementation
- Includes instrumentation functions to identify code sections
- Has extensions to instrument one-sided communication
- Various filters available to reduce trace-file sizes
- Uses MPI to gather data from all processors, so you always need some MPI to be able to use it!
Vampirtrace API
Switching tracing on/off
- SUBROUTINE VTTRACEOFF( )
- SUBROUTINE VTTRACEON( )
Specifying user-defined states
- SUBROUTINE VTCLASSDEF(CLASSNAME, CLASSHANDLE, IERR)
- SUBROUTINE VTFUNCDEF(FUNCNAME, CLASSHANDLE, STATEHANDLE, IERR)
Entering/leaving user-defined states
- SUBROUTINE VTBEGIN(STATEHANDLE, IERR)
- SUBROUTINE VTEND(STATEHANDLE, IERR)
Logging message send/receive events (undocumented)
- SUBROUTINE VTLOGSENDMSG(IME, ITO, ICNT, ITAG, ICOMMID, IERR)
- SUBROUTINE VTLOGRECVMSG(IME, IFRM, ICNT, ITAG, ICOMMID, IERR)
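The calling pattern suggested by the routines above (define a class once, define states within it, then bracket code regions with begin/end calls) can be modelled with a toy in-memory tracer. This is only a sketch of the pattern: `ToyTracer` and its methods are hypothetical stand-ins written for illustration. They are not bindings to Vampirtrace and do not reproduce its behaviour or file output.

```python
import time


class ToyTracer:
    """Toy in-memory stand-in for a state tracer (NOT the Vampirtrace library)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._classes = []    # class handle -> class name
        self._states = []     # state handle -> (state name, class handle)
        self._open = []       # stack of (state handle, start time)
        self.enabled = True   # plays the role of tracing on/off
        self.intervals = []   # (class name, state name, start, end)

    def class_def(self, name):
        """Register an activity class and return its handle."""
        self._classes.append(name)
        return len(self._classes) - 1

    def func_def(self, name, class_handle):
        """Register a state within a class and return its handle."""
        self._states.append((name, class_handle))
        return len(self._states) - 1

    def begin(self, state_handle):
        self._open.append((state_handle, self._clock()))

    def end(self, state_handle):
        handle, start = self._open.pop()
        if handle != state_handle:
            raise RuntimeError("begin/end calls are not properly nested")
        stop = self._clock()
        if self.enabled:
            name, class_handle = self._states[handle]
            self.intervals.append((self._classes[class_handle], name, start, stop))


if __name__ == "__main__":
    tracer = ToyTracer()
    md = tracer.class_def("MD")              # hypothetical class name
    forces = tracer.func_def("forces", md)   # hypothetical state name
    tracer.begin(forces)
    sum(i * i for i in range(100_000))       # stand-in for real work
    tracer.end(forces)
    print(tracer.intervals)
```

Handles are defined once, outside the timed regions, so the cost inside a hot loop is limited to the begin/end pair.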
Instrumenting single-sided memory access
Approach 1 (requires instrumenting the data server):
- Advantage: robust and accurate
- Disadvantage: one does not always have access to the source of the data server
Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages
- Advantage: no instrumentation of the data server required
- Disadvantage: timings of the messages are inaccurate in case of non-blocking operations
Runtime tracing options
The tracing of states can be modified at runtime through a configuration file. Tracing of messages cannot be changed.
VTTRACEON and VTTRACEOFF should be used sparingly.
Gotcha: if you don’t have VTTRACEOFF/VTTRACEON in your code, no states will be traced (but messages will).
The location of the configuration file can be specified through the environment variable VT_CONFIG.
Using Vampir
After instrumenting your code you simply run it as normal, but you’ll see that it produces a number of files of the name .stf*
Launch vampir to bring up an initial timeline view.
To get the full functionality working, load the whole trace file (this may take a little while):
- Right-click on the timeline
- Go to “Load”
- Select “Whole Trace”
Vampir views
Identifying bottlenecks
- Summary chart: summarises the time spent in each class of activity.
- Summary timeline: shows how many processes are busy with a particular class of activity in a sequence of time bins.
Analysing bottlenecks
- Global timeline: detailed view of all the activities as well as all the messages being passed.
- Activity chart: shows the time spent in the different activities for each processor.
- Message statistics: can display various statistics about messages being passed between pairs of processors.
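To make concrete what some of these views aggregate, the sketch below computes a summary chart, an activity chart and message statistics from a tiny hand-written trace. The record layouts (`(rank, activity, start, end)` for intervals and `(sender, receiver, nbytes)` for messages) and all the numbers are made up for illustration and do not reflect Vampir’s actual trace format.

```python
from collections import defaultdict


def summary_chart(intervals):
    """Total time per activity class, summed over all processes."""
    totals = defaultdict(float)
    for _rank, activity, start, end in intervals:
        totals[activity] += end - start
    return dict(totals)


def activity_chart(intervals):
    """Time per activity class for each process."""
    per_rank = defaultdict(lambda: defaultdict(float))
    for rank, activity, start, end in intervals:
        per_rank[rank][activity] += end - start
    return {rank: dict(acts) for rank, acts in per_rank.items()}


def message_statistics(messages):
    """Message count and total bytes for each (sender, receiver) pair."""
    stats = defaultdict(lambda: [0, 0])
    for sender, receiver, nbytes in messages:
        stats[(sender, receiver)][0] += 1
        stats[(sender, receiver)][1] += nbytes
    return {pair: tuple(values) for pair, values in stats.items()}


if __name__ == "__main__":
    # Made-up toy trace: two processes, two activity classes.
    intervals = [
        (0, "Application", 0.0, 3.0), (0, "MPI", 3.0, 4.0),
        (1, "Application", 0.0, 2.0), (1, "MPI", 2.0, 4.0),
    ]
    messages = [(0, 1, 800), (0, 1, 800), (1, 0, 64)]
    print(summary_chart(intervals))
    print(activity_chart(intervals))
    print(message_statistics(messages))
```

In this toy trace process 1 spends twice as long in MPI as process 0, the kind of imbalance that the per-processor activity chart makes visible but the global summary chart hides.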
The Vampir software sits in /usr/local/packages/vampir
Detailed documentation (PDFs) is in vampir/doc
The vampir analyser lives in vampir/bin
There are 2 sets of vampirtrace libraries:
- For 32 bit codes use vampir/lib
- For 64 bit codes (compiled with -q64) use vampir/lib64
Working examples are available in vampir/examples
Use the mpcc_r and mpxlf_r compilers and link libVT.a as the last library on your link line.