

Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers

  • Martin C. Rinard

  • Pedro C. Diniz

  • University of California, Santa Barbara

  • Santa Barbara, California 93106

  • {martin,pedro}@cs.ucsb.edu

  • http://www.cs.ucsb.edu/~{martin,pedro}


Goal

  • Develop a Parallelizing Compiler for Object-Oriented Computations

  • Current Focus

    • Irregular Computations
    • Dynamic Data Structures
  • Future

    • Persistent Data
    • Distributed Computations
  • New Analysis Technique: Commutativity Analysis



Structure of Talk

  • Model of Computation

  • Example

  • Commutativity Testing

  • Steps To Practicality

  • Experimental Results

  • Conclusion



Model of Computation



Graph Traversal Example

  class graph {
    int val, sum;
    graph *left, *right;
  };

  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }



Parallel Traversal



Commuting Operations in Parallel Traversal



Model of Computation

  • Operations: Method Invocations

    • In Example: Invocations of graph::traverse
      • left->traverse(3)
      • right->traverse(2)
  • Objects: Instances of Classes

    • In Example: Graph Nodes
  • Instance Variables Implement Object State

    • In Example: val, sum, left, right



Separable Operations

  • Each Operation Consists of Two Sections

    • Object Section: Performs All Accesses to the Receiver's Instance Variables
    • Invocation Section: Invokes Other Operations



Basic Approach

  • Compiler Chooses A Computation to Parallelize

    • In Example: Entire graph::traverse Computation
  • Compiler Computes Extent of the Computation

    • Representation of all Operations in Computation
    • Current Representation: Set of Methods
    • In Example: { graph::traverse }
  • Do All Pairs of Operations in Extent Commute?



Code Generation For Each Method in Parallel Computation

  • Augments Class Declaration With Mutual Exclusion Lock

  • Generates Driver Version of Method

    • Invoked from Serial Code to Start Parallel Execution
    • Invokes Parallel Version of Operation
    • Waits for Entire Parallel Computation to Finish
  • Generates Parallel Version of Method



Driver Version
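
A minimal sketch of what the driver version of graph::traverse could look like, assuming a hypothetical runtime barrier wait_for_all_tasks():

  // Driver version: invoked from serial code; starts the parallel
  // execution and waits for the entire computation to finish.
  void graph::traverse(int v) {
    this->parallel_traverse(v);  // invoke parallel version of the operation
    wait_for_all_tasks();        // hypothetical runtime barrier
  }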



Parallel Version In Example
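
A matching sketch of the parallel version, assuming the compiler-added lock field is named mutex and tasks are created with a hypothetical spawn primitive:

  // Parallel version: the update to sum executes inside the object's
  // mutual exclusion lock; the recursive calls become concurrent tasks.
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    // val is written by no operation in the extent, so it can be
    // read outside the lock
    if (left != NULL) spawn(left, &graph::parallel_traverse, val);
    if (right != NULL) spawn(right, &graph::parallel_traverse, val);
  }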



Compiler Structure



Traditional Approach

  • Data Dependence Analysis

    • Analyzes Reads and Writes
    • Independent Pieces of Code Execute in Parallel
  • Demonstrated Success for Array-Based Programs



Data Dependence Analysis in Example

  • For Data Dependence Analysis To Succeed in Example

    • left->traverse and right->traverse Must Be Independent
    • left and right Subgraphs Must Be Disjoint
    • Graph Must Be a Tree
  • Depends on Global Topology of Data Structure

    • Analyze Code that Builds Data Structure
    • Extract and Propagate Topology Information
  • Fails For Graphs



Properties of Commutativity Analysis

  • Oblivious to Data Structure Topology

    • Local Analysis
    • Simple Analysis
  • Wide Range of Computations

  • Introduces Synchronization

  • Relies on Commuting Operations



Commutativity Testing

  • Commutativity Testing



Commutativity Testing Conditions

  • Do Two Operations A and B Commute?

  • Compiler Considers Two Execution Orders

    • A;B - A executes before B
    • B;A - B executes before A
  • Compiler Must Check Two Conditions

    • Instance Variables: New Values Must Be the Same in Both Execution Orders
    • Invoked Operations: Both Orders Must Invoke the Same Multiset of Operations




Commutativity Testing Algorithm

  • Symbolic Execution

    • Compiler Executes Operations
    • Computes with Expressions, not Values
  • Compiler Symbolically Executes Operations in Both Execution Orders

    • Expressions for New Values of Instance Variables
    • Expressions for Multiset of Invoked Operations


Expression Simplification and Comparison

  • Compiler Applies Rewrite Rules to Simplify Expressions

    • a*(b+c) → (a*b)+(a*c)
    • b+(a+c) → (a+b+c)
    • a+if(b,c,d) → if(b,a+c,a+d)
  • Compiler Compares Corresponding Expressions

    • If All Equal - Operations Commute
    • If Not All Equal - Operations May Not Commute


Commutativity Testing Example

  • Two Operations

    • r->traverse(v1) and r->traverse(v2)
  • In Order r->traverse(v1);r->traverse(v2)
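
Reconstructing the symbolic execution directly from the example's code:

  • In Order r->traverse(v1); r->traverse(v2): sum' = (sum + v1) + v2
  • In Order r->traverse(v2); r->traverse(v1): sum' = (sum + v2) + v1
  • After Simplification, Both Equal sum + v1 + v2; val, left, and right Are Unchanged
  • Both Orders Invoke the Same Multiset of Operations: { left->traverse(val), right->traverse(val) } (for Non-NULL Links)
  • Therefore the Two Operations Commute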




Important Special Case

  • Independent Operations Commute

  • Conditions for Independence

    • Operations Have Different Receivers
    • Neither Operation Writes an Instance Variable that Other Operation Accesses
  • Detecting Independent Operations

    • In Type-Safe Languages
      • Class Declarations
      • Instance Variable Accesses
    • Pointer or Alias Analysis
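  • In Example: r1->traverse(v1) and r2->traverse(v2) Are Independent Whenever r1 and r2 Refer to Different Nodes, Because traverse Accesses Only Its Receiver's Instance Variables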


Analysis in Current Compiler

  • Dependence Analysis

    • Operations on Objects of Different Classes
    • Independent Operations on Objects of Same Class
  • Symbolic Commutativity Testing

    • Dependent Operations on Objects of Same Class
  • Future

    • Integrate Pointer or Alias Analysis
    • Integrate Array Data Dependence Analysis


Steps to Practicality

  • Steps to Practicality



Programming Model Extensions

  • Extensions for Read-Only Data

    • Allow Operations to Freely Access Read-Only Data
    • Enhances Ability of Compiler to Represent Expressions
    • Increases Set of Programs that Compiler can Analyze
  • Analysis Granularity Extensions

    • Integrate Operations into Callers for Analysis Purposes
    • Coarsens Commutativity Testing Granularity
      • Reduces Number of Pairs Tested for Commutativity
      • Enhances Effectiveness of Commutativity Testing


Optimizations

  • Synchronization Optimizations

    • Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
    • Reduce Number of Acquire and Release Constructs
  • Parallel Loop Optimization

  • Suppress Exploitation of Excess Concurrency



Extent Constants

  • Motivation: Allow Parallel Operations to Freely Access Read-Only Data

  • Extent Constant Variable: Global Variable or Instance Variable Written by No Operation in the Extent

  • Extent Constant Expression: Expression Whose Value Depends Only on Extent Constant Variables or Parameters

  • Extent Constant Value: Value Computed by an Extent Constant Expression

  • Extent Constant: Automatically Generated Opaque Constant Used to Represent an Extent Constant Value

  • Requires: Interprocedural Data Usage Analysis

    • Result Summarizes How Operations Access Instance Variables
    • Interprocedural Pointer Analysis for Reference Parameters


Extent Constant Variables In Example
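
  • In Example: graph::traverse Writes Only sum, so val, left, and right Are Extent Constant Variables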



Advantages of Extent Constants

  • Extent Constants Extend Programming Model

    • Enable Direct Global Variable Access
    • Enable Direct Access of Objects other than Receiver
  • Extent Constants Make Compiler More Effective

    • Enable Compact Representations of Large Expressions
    • Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs


Auxiliary Operations

  • Motivation: Coarsen Granularity of Commutativity Testing

  • An Operation Is an Auxiliary Operation if Its Entire Computation

    • Computes Only Extent Constant Values
    • Makes Externally Visible Writes Only to Local Variables of Its Caller
  • Auxiliary Operations are Conceptually Part of Caller

  • Requires:

    • Interprocedural Data Usage Analysis
    • Interprocedural Pointer Analysis for Reference Parameters
    • Intraprocedural Reaching Definition Analysis


Auxiliary Operation Example

  int graph::square_and_add(int v) {
    return (val*val + v);
  }

  void graph::traverse(int v) {
    sum += square_and_add(v);
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }



Advantages of Auxiliary Operations

  • Coarsen Granularity of Commutativity Testing

    • Reduces Number of Pairs Tested for Commutativity
    • Enhances Effectiveness of Commutativity Testing Algorithm
  • Support Modular Programming



Synchronization Optimizations

  • Goal: Eliminate or Reduce Synchronization Overhead

  • Synchronization Elimination

    • Eliminate Synchronization in Methods that Only Access Read-Only Data
  • Lock Coarsening

    • Data Lock Coarsening: Protect Multiple Objects with One Lock
    • Computation Lock Coarsening: Hold a Lock Across Multiple Operations



Data Lock Coarsening Example
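
A minimal before/after sketch of the idea (the coarsened_* names and NDIM are illustrative, not the compiler's actual output):

  // Before data lock coarsening: both the body and its nested vector
  // carry locks, so one update pays for two acquire/release pairs.
  class vector { lock mutex; double v[NDIM]; /* add(double*), ... */ };
  class body   { lock mutex; double phi; vector acc; };

  // After: the nested vector is protected by its enclosing body's lock,
  // so a single acquire/release covers updates to both phi and acc.
  class coarsened_vector { double v[NDIM]; /* add(double*), ... */ };
  class coarsened_body   { lock mutex; double phi; coarsened_vector acc; };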



Computation Lock Coarsening Example

  class body {
    lock mutex;
    double phi;
    vector acc;
  };

  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }

  void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
  }
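
A sketch of the transformed code under computation lock coarsening (gravsub_nolock is a hypothetical name for the lock-free variant): loopsub acquires the lock once instead of N times:

  // The caller holds the lock across the whole loop and calls a
  // variant of gravsub that assumes the lock is already held.
  void body::gravsub_nolock(body *b) {
    double p, v[NDIM];
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
  }

  void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++) {
      this->gravsub_nolock(b+i);
    }
    mutex.release();
  }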



Parallel Loops

  • Goal: Generate Efficient Code for Parallel Loops

  • If a Loop Is in the Following Form

  for (i = exp1; i < exp2; i += exp3) {
    exp4->op(exp5, exp6, ...);
  }

    • Where exp1, exp2, ... Are Extent Constant Expressions
  • Then Compiler Generates Parallel Loop Code



Parallel Loop Optimization

  • Without Parallel Loop Optimization

    • Each Loop Iteration Generates a Task
    • Tasks are Created and Scheduled Sequentially
    • Each Iteration Incurs Task Creation and Scheduling Overhead
  • With Parallel Loop Optimization

    • Generated Code Immediately Exposes All Iterations
    • Scheduler Operates on Chunks of Loop Iterations
    • Each Chunk of Iterations Incurs Scheduling Overhead
  • Advantages

    • Enables Compact Representation for Loop Computation
    • Reduces Task Creation and Scheduling Overhead
    • Parallelizes Overhead
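
A hypothetical sketch of chunked execution for the loop form above, assuming spawn_task and wait_for_all_tasks runtime primitives and an illustrative chunk size (exp1 ... exp6 are the schematic expressions from the loop form):

  const int CHUNK = 16;  // illustrative chunk size
  for (int base = exp1; base < exp2; base += CHUNK * exp3) {
    spawn_task([=] {
      // One task per chunk: scheduling overhead is paid once per
      // chunk rather than once per iteration.
      int limit = base + CHUNK * exp3;
      if (limit > exp2) limit = exp2;
      for (int i = base; i < limit; i += exp3)
        exp4->op(exp5, exp6);
    });
  }
  wait_for_all_tasks();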


Suppressing Excess Concurrency

  • Goal: Reduce Overhead of Exploiting Parallelism

  • Goal Achieved by Generating Computations that

    • Execute Operations Serially with No Parallelization Overhead
    • Use the Synchronization Required to Execute Safely in a Parallel Context
  • Mechanism: Mutex Versions of Methods
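
A sketch of the generated method versions under this scheme (the version names are hypothetical):

  void graph::traverse(int v);           // serial version: no locks, no tasks
  void graph::mutex_traverse(int v);     // executes serially, but acquires locks,
                                         //   so it runs safely in a parallel context
  void graph::parallel_traverse(int v);  // spawns tasks and acquires locks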



Experimental Results

  • Experimental Results



Methodology

  • Built Prototype Compiler

  • Built Run Time System

    • Concurrency Generation and Task Management
    • Dynamic Load Balancing
    • Synchronization
  • Acquired Two Complete Applications

    • Barnes-Hut N-Body Solver
    • Water Code
  • Automatically Parallelized Applications

  • Ran Applications on Stanford DASH Machine

  • Compared Performance with Highly Tuned, Explicitly Parallel Versions from the SPLASH-2 Benchmark Suite



Prototype Compiler

  • Clean Subset of C++

  • Uses Sage++ as the Front End

  • Structured As a Source-To-Source Translator

    • Analysis Finds Parallel Loops and Methods
    • Compiler Generates Annotation File
      • Identifies Parallel Loops and Methods
      • Classes to Augment with Locks
    • Code Generator Reads Annotation File
      • Generates Parallel Versions of Methods
      • Inserts Synchronization and Parallelization Code
  • Parallelizes Unannotated Programs



Major Restrictions

  • Motivation: Simplify Implementation of Prototype

  • No Virtual Methods

  • No Operator or Method Overloading

  • No Multiple Inheritance or Templates

  • No typedef, struct, union or enum types

  • Global Variables must be Class Types

  • No Static Members or Pointers to Members

  • No Default Arguments or Variable Numbers of Arguments

  • No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits



Run Time Library

  • Motivation: Provide Basic Concurrency Management

  • Single Program, Multiple Data Execution Model

    • Single Address Space
    • Alternate Serial and Parallel Phases
  • Library Provides

    • Task Creation and Synchronization Primitives
    • Dynamic Load Balancing
  • Implemented

    • Stanford DASH Shared-Memory Multiprocessor
    • SGI Shared-Memory Multiprocessors


Applications

  • Barnes-Hut

    • O(N lg N) N-Body Solver
    • Space Subdivision Tree
    • 1500 Lines of C++ Code
  • Water

    • Simulates Liquid Water
    • O(N^2) Algorithm
    • 1850 Lines of C++ Code


Obtaining Serial C++ Version of Barnes-Hut

  • Started with Explicitly Parallel Version (SPLASH-2)

  • Removed Parallel Constructs to get Serial C

  • Converted to Clean Object-Based C++

  • Major Structural Changes

    • Eliminated Scheduling Code and Data Structures
    • Split a Loop in Force Computation Phase
    • Introduced New Field into Particle Data Structure


Obtaining Serial C++ Version of Water

  • Started with Serial C translated from FORTRAN

  • Converted to Clean Object-Based C++

  • Major Structural Change

    • Auxiliary Objects for O(N^2) phases


Commutativity Statistics for Barnes-Hut



Auxiliary Operation Statistics for Barnes-Hut



Performance Results for Barnes-Hut



Performance Analysis

  • Motivation: Understand Behavior of Parallelized Program

  • Instrumented Code to Measure Execution Time Breakdowns

    • Parallel Idle - Time Spent Idle in a Parallel Section
    • Serial Idle - Time Spent Idle in a Serial Section
    • Blocked - Time Spent Waiting to Acquire a Lock Held by Another Processor
    • Parallel Compute - Time Spent Doing Useful Work in a Parallel Section
    • Serial Compute - Time Spent Doing Useful Work in a Serial Section


Performance Analysis for Barnes-Hut



Performance Results for Water



Performance Results for Computation Replication Version of Water



Commutativity Statistics for Water



Auxiliary Operation Statistics for Water



Performance Analysis for Water



Future Work

  • Relative Commutativity

  • Integrate Other Analysis Frameworks

    • Pointer or Alias Analysis
    • Array Data Dependence Analysis
  • Analysis Problems

    • Synchronization Optimizations
    • Analysis Granularity Optimizations
  • Generation of Self-Tuning Code

  • Message Passing Implementation



Related Work

  • Bernstein (IEEE Transactions on Computers 1966)

  • Dependence Analysis for Pointer-Based Data Structures

  • Reduction Analysis

    • Ghuloum and Fisher (PPOPP 95)
    • Pinter and Pinter (POPL 92)
    • Callahan (LCPC 91)
  • Commuting Operations in Parallel Languages

    • Rinard and Lam (PPOPP 91)
    • Steele (POPL 90)
    • Barth, Nikhil and Arvind (FPCA 91)




Conclusion

  • Commutativity Analysis

    • New Analysis Framework for Parallelizing Compilers
  • Basic Idea

    • Recognize Commuting Operations
    • Generate Parallel Code
  • Current Focus

    • Dynamic, Pointer-Based Data Structures
    • Good Initial Results
  • Future

    • Persistent Data
    • Distributed Computations


Latest Version of Paper

  • http://www.cs.ucsb.edu/~martin/paper/pldi96.ps



What if Operations Do Not Commute?

  • Parallel Tree Traversal

  • Example: Distance of Node from Root
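
A hedged sketch of what such a non-commuting traversal could look like (the depth field and compute_depth method are illustrative, not taken from the talk's figures):

  // Each node records its distance from the root. The write
  // "depth = d" overwrites its target rather than accumulating,
  // so two such operations on the same node do not commute:
  // the final value depends on which one executes last.
  void graph::compute_depth(int d) {
    depth = d;
    if (left != NULL) left->compute_depth(d + 1);
    if (right != NULL) right->compute_depth(d + 1);
  }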



Equivalent Computation with Commuting Operations



Theoretical Result

  • For Any Tree Traversal on Data With

    • A Commutative Operator (for example +) that has
    • A Zero Element (for example 0)
  • There Exists A Program P such that

    • P Computes the Traversal
    • Commutativity Analysis Can Automatically Parallelize P
  • Complexity Results:

    • Program P Is Asymptotically Optimal if the Data Structure Is a Perfectly Balanced Tree
    • Program P Has Complexity O(N^2) if the Data Structure Is a Linked List


Pure Object-Based Model of Computation

  • Goal

    • Obtain a Powerful, Clean Model of Computation
    • Enable Compiler to Analyze Program
  • Objects: Instances of Classes

    • Implement State with Instance Variables
      • Primitive Types from Underlying Language (int, ...)
      • References to Other Objects
      • Nested Objects
  • Operations: Invocations of Methods

    • Each Operation Has Single Receiver Object

