Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
Martin C. Rinard and Pedro C. Diniz
University of California, Santa Barbara
Santa Barbara, California 93106
{martin,pedro}@cs.ucsb.edu
http://www.cs.ucsb.edu/~{martin,pedro}
Goal: Develop a Parallelizing Compiler for Object-Oriented Computations
Current Focus
- Irregular Computations
- Dynamic Data Structures
Future
- Persistent Data
- Distributed Computations
New Analysis Technique: Commutativity Analysis
Structure of Talk
- Model of Computation
- Example
- Steps to Practicality
- Experimental Results
- Conclusion
Model of Computation
Graph Traversal Example

class graph {
  int val, sum;
  graph *left, *right;
};

void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Parallel Traversal
Commuting Operations in Parallel Traversal
Model of Computation
Operations: Method Invocations
- In Example: Invocations of graph::traverse
  - left->traverse(3)
  - right->traverse(2)
Objects: Instances of Classes
- Instance Variables Implement Object State
- In Example: val, sum, left, right
Separable Operations
Each Operation Consists of Two Sections
- Object Section: Performs All Accesses to the Receiver
- Invocation Section: Invokes Other Operations and Does Not Access the Receiver
Basic Approach
Compiler Chooses a Computation to Parallelize
- In Example: Entire graph::traverse Computation
Compiler Computes Extent of the Computation (see the sketch after this slide)
- Representation of All Operations in Computation
- Current Representation: Set of Methods
- In Example: { graph::traverse }
Do All Pairs of Operations in Extent Commute?
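The talk does not show how the extent is computed, so the following is a minimal sketch under stated assumptions: methods are identified by name, and a toy callees() helper stands in for the compiler's call graph. The extent is the least set of methods containing the root method and closed under invocation.

#include <set>
#include <string>
#include <vector>

// Toy call-graph query for the example: traverse invokes itself.
std::vector<std::string> callees(const std::string &m) {
  if (m == "graph::traverse") return {"graph::traverse"};
  return {};
}

// Compute the extent: all methods reachable from the root method
// of the parallelized computation, via a worklist fixed point.
std::set<std::string> computeExtent(const std::string &root) {
  std::set<std::string> extent;
  std::vector<std::string> worklist = {root};
  while (!worklist.empty()) {
    std::string m = worklist.back();
    worklist.pop_back();
    if (extent.insert(m).second) {      // newly added method
      for (const std::string &c : callees(m))
        worklist.push_back(c);          // visit its callees
    }
  }
  return extent;                        // e.g. { "graph::traverse" }
}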
Code Generation
For Each Method in Parallel Computation
- Augments Class Declaration with Mutual Exclusion Lock
- Generates Driver Version of Method
  - Invoked from Serial Code to Start Parallel Execution
  - Invokes Parallel Version of Operation
  - Waits for Entire Parallel Computation to Finish
- Generates Parallel Version of Method
Driver Version
Parallel Version In Example
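The generated code on these two slides is not reproduced here; what follows is a minimal sketch of what the compiler might emit for graph::traverse, assuming hypothetical runtime primitives spawn (create a task) and wait (block until the parallel computation finishes) and the lock class from the run time library.

class graph {
  lock mutex;                      // lock added by the compiler
  int val, sum;
  graph *left, *right;
public:
  void traverse(int v);            // driver version
  void parallel_traverse(int v);   // parallel version
};

// Driver version: invoked from serial code; starts the parallel
// computation, then waits for the whole computation to finish.
void graph::traverse(int v) {
  parallel_traverse(v);
  wait();
}

// Parallel version: the update to the receiver executes under the
// mutual exclusion lock; recursive invocations become tasks.
void graph::parallel_traverse(int v) {
  mutex.acquire();
  sum += v;
  mutex.release();
  if (left != NULL) spawn(left, &graph::parallel_traverse, val);
  if (right != NULL) spawn(right, &graph::parallel_traverse, val);
}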
Compiler Structure
Traditional Approach
Data Dependence Analysis
- Analyzes Reads and Writes
- Independent Pieces of Code Execute in Parallel
Demonstrated Success for Array-Based Programs
Data Dependence Analysis in Example
For Data Dependence Analysis to Succeed in Example
- left and right Traversals Must Be Independent
- left and right Subgraphs Must Be Disjoint
- Graph Must Be a Tree
Depends on Global Topology of Data Structure
- Analyze Code that Builds Data Structure
- Extract and Propagate Topology Information
Fails For Graphs
Properties of Commutativity Analysis
Advantages
- Oblivious to Data Structure Topology
  - Local Analysis
  - Simple Analysis
- Handles a Wide Range of Computations
Disadvantages
- Introduces Synchronization
- Relies on Commuting Operations
Commutativity Testing
Commutativity Testing Conditions
Do Two Operations A and B Commute?
Compiler Considers Two Execution Orders
- A;B - A Executes Before B
- B;A - B Executes Before A
Compiler Must Check Two Conditions
- Instance Variables Condition: Both Execution Orders Must Produce the Same New Values for the Instance Variables of the Receiver
- Invoked Operations Condition: Both Execution Orders Must Directly Invoke the Same Multiset of Operations
Commutativity Testing Algorithm
Symbolic Execution:
- Compiler Executes Operations
- Computes with Expressions, not Values
Compiler Symbolically Executes Operations in Both Execution Orders
- Expressions for New Values of Instance Variables
- Expressions for Multiset of Invoked Operations
Expression Simplification and Comparison
Compiler Applies Rewrite Rules to Simplify Expressions
- a*(b+c) → (a*b)+(a*c)
- b+(a+c) → (a+b+c)
- a+if(b,c,d) → if(b,a+c,a+d)
Compiler Compares Corresponding Expressions - If All Equal - Operations Commute
- If Not All Equal - Operations May Not Commute
Commutativity Testing Example
Two Operations
- r->traverse(v1) and r->traverse(v2)
In Order r->traverse(v1); r->traverse(v2)
- New Value of sum: (sum+v1)+v2
- Invoked Operations: left->traverse(val) and right->traverse(val) from Each Invocation
In Order r->traverse(v2); r->traverse(v1)
- New Value of sum: (sum+v2)+v1
- Invoked Operations: Same Multiset (val, left, and right Are Unchanged)
Both Values of sum Simplify to sum+v1+v2, so the Operations Commute
Important Special Case
Independent Operations Commute
Conditions for Independence (see the example after this slide)
- Operations Have Different Receivers
- Neither Operation Writes an Instance Variable that Other Operation Accesses
Detecting Independent Operations
- In Type-Safe Languages
  - Class Declarations
  - Instance Variable Accesses
- Pointer or Alias Analysis
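A minimal sketch illustrating these conditions; the classes here are hypothetical, not from the talk. The two invocations have different receivers and neither writes an instance variable the other accesses, so they are independent and commute.

class counter {
  int count;
public:
  void inc() { count++; }             // writes only this->count
};

class accumulator {
  double total;
public:
  void add(double v) { total += v; }  // writes only this->total
};

void example(counter *c, accumulator *a) {
  // Different receivers of different classes: neither operation
  // writes an instance variable the other accesses, so the two
  // invocations are independent and may execute in parallel.
  c->inc();
  a->add(3.0);
}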
Analysis in Current Compiler
Dependence Analysis
- Operations on Objects of Different Classes
- Independent Operations on Objects of Same Class
Symbolic Commutativity Testing
- Dependent Operations on Objects of Same Class
Future - Integrate Pointer or Alias Analysis
- Integrate Array Data Dependence Analysis
Steps to Practicality
Programming Model Extensions
Extensions for Read-Only Data
- Allow Operations to Freely Access Read-Only Data
- Enhances Ability of Compiler to Represent Expressions
- Increases Set of Programs that Compiler can Analyze
Analysis Granularity Extensions
- Integrate Operations into Callers for Analysis Purposes
- Coarsens Commutativity Testing Granularity
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing
Optimizations
Synchronization Optimizations
- Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
- Reduce Number of Acquire and Release Constructs
Parallel Loop Optimization
Suppress Exploitation of Excess Concurrency
Extent Constants
Motivation: Allow Parallel Operations to Freely Access Read-Only Data
- Extent Constant Variable: Global Variable or Instance Variable Written by No Operation in Extent
- Extent Constant Expression: Expression Whose Value Depends Only on Extent Constant Variables or Parameters
- Extent Constant Value: Value Computed by an Extent Constant Expression
- Extent Constant: Automatically Generated Opaque Constant Used to Represent an Extent Constant Value
Requires:
- Interprocedural Data Usage Analysis
  - Result Summarizes How Operations Access Instance Variables
- Interprocedural Pointer Analysis for Reference Parameters
Extent Constant Variables in Example
- graph::traverse Writes Only sum
- val, left, and right Are Written by No Operation in the Extent, so They Are Extent Constant Variables
Advantages of Extent Constants
Extent Constants Extend Programming Model
- Enable Direct Global Variable Access
- Enable Direct Access of Objects other than Receiver
Extent Constants Make Compiler More Effective
- Enable Compact Representations of Large Expressions
- Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs
Auxiliary Operations
Motivation: Coarsen Granularity of Commutativity Testing
An Operation Is an Auxiliary Operation if Its Entire Computation
- Only Computes Extent Constant Values
- Only Externally Visible Writes are to Local Variables of Caller
Auxiliary Operations Are Conceptually Part of Caller
Requires:
- Interprocedural Data Usage Analysis
- Interprocedural Pointer Analysis for Reference Parameters
- Intraprocedural Reaching Definition Analysis
Auxiliary Operation Example

int graph::square_and_add(int v) {
  return (val*val + v);
}

void graph::traverse(int v) {
  sum += square_and_add(v);
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Advantages of Auxiliary Operations
Coarsen Granularity of Commutativity Testing
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing Algorithm
Support Modular Programming
Synchronization Optimizations
Goal: Eliminate or Reduce Synchronization Overhead
- Synchronization Elimination
- Data Lock Coarsening
- Computation Lock Coarsening
Data Lock Coarsening Example
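The code on this slide is not reproduced here; the following is a minimal sketch of the transformation using the body and vector classes of the next slide. Data lock coarsening replaces the lock of a nested object with the lock of its enclosing object, so one acquire/release pair protects updates to both phi and acc.

// Before coarsening: the nested vector carries its own lock, so
// body::gravsub acquires two locks per update.
class vector {
  lock mutex;            // removed by the optimization
  double val[NDIM];
public:
  void add(double *v);   // acquires and releases vector's lock
};

// After coarsening: vector has no lock of its own; the enclosing
// body's mutex, acquired once in gravsub, protects phi and acc
// together (as in the declaration on the next slide).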
Computation Lock Coarsening Example

class body {
  lock mutex;
  double phi;
  vector acc;
};

void body::gravsub(body *b) {
  double p, v[NDIM];
  mutex.acquire();
  p = computeInter(b,v);
  phi -= p;
  acc.add(v);
  mutex.release();
}

void body::loopsub(body *b) {
  int i;
  for (i = 0; i < N; i++) {
    this->gravsub(b+i);
  }
}
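The transformed code on this slide is likewise not reproduced; a minimal sketch of the result follows. Because every iteration of loopsub acquires the same receiver's lock, the acquire/release pair can be hoisted out of the loop around a lock-free variant of gravsub; the name gravsub_nolock is illustrative.

// Sketch of computation lock coarsening: acquire the receiver's
// lock once for the whole loop instead of once per gravsub call.
void body::gravsub_nolock(body *b) {
  double p, v[NDIM];
  p = computeInter(b,v);   // caller already holds mutex
  phi -= p;
  acc.add(v);
}

void body::loopsub(body *b) {
  int i;
  mutex.acquire();
  for (i = 0; i < N; i++) {
    this->gravsub_nolock(b+i);
  }
  mutex.release();
}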
Parallel Loops
Goal: Generate Efficient Code for Parallel Loops
If a Loop Is in the Following Form

for (i = exp1; i < exp2; i += exp3) {
  exp4->op(exp5, exp6, ...);
}

- Where exp1, exp2, ... Are Extent Constant Expressions
Then Compiler Generates Parallel Loop Code
Parallel Loop Optimization
Without Parallel Loop Optimization
- Each Loop Iteration Generates a Task
- Tasks are Created and Scheduled Sequentially
- Each Iteration Incurs Task Creation and Scheduling Overhead
With Parallel Loop Optimization (see the sketch after this slide)
- Generated Code Immediately Exposes All Iterations
- Scheduler Operates on Chunks of Loop Iterations
- Each Chunk of Iterations Incurs Scheduling Overhead
Advantages
- Enables Compact Representation for Loop Computation
- Reduces Task Creation and Scheduling Overhead
- Parallelizes Overhead
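A minimal sketch of the shape of the generated code; the talk does not specify the runtime interface, so the scheduler entry point and all names here are hypothetical.

void iteration(int i);   // one loop iteration: exp4->op(exp5, exp6, ...)

// Hypothetical scheduler entry point: exposes the entire iteration
// space at once and hands chunks of iterations to processors.
void schedule_chunks(int lo, int hi, int step, void (*body)(int));

// Worker side: a chunk executes its iterations serially, so task
// creation and scheduling overhead is paid once per chunk rather
// than once per iteration.
void run_chunk(int lo, int hi, int step, void (*body)(int)) {
  for (int i = lo; i < hi; i += step)
    body(i);
}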
Suppressing Excess Concurrency
Goal: Reduce Overhead of Exploiting Parallelism
Goal Achieved by Generating Computations that
- Execute Operations Serially with No Parallelization Overhead
- Use Synchronization Required to Execute Safely in Parallel Context
Mechanism: Mutex Versions of Methods
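A minimal sketch of a mutex version of graph::traverse, under the same assumptions as the earlier generated-code sketch: control flow stays serial (no task creation), but the lock is retained so the method can run safely while parallel operations access the same objects.

// Mutex version: serial control flow, parallel-safe updates.
void graph::mutex_traverse(int v) {
  mutex.acquire();
  sum += v;
  mutex.release();
  if (left != NULL) left->mutex_traverse(val);   // plain calls, no spawn
  if (right != NULL) right->mutex_traverse(val);
}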
Experimental Results
Methodology
Built Prototype Compiler
Built Run Time System
- Concurrency Generation and Task Management
- Dynamic Load Balancing
- Synchronization
Acquired Two Complete Applications
- Barnes-Hut N-Body Solver
- Water Code
Automatically Parallelized Applications
Ran Applications on Stanford DASH Machine
Compared Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite
Prototype Compiler
- Clean Subset of C++
- Sage++ Front End
- Structured as a Source-To-Source Translator
  - Analysis Finds Parallel Loops and Methods
  - Compiler Generates Annotation File
    - Identifies Parallel Loops and Methods
    - Identifies Classes to Augment with Locks
  - Code Generator Reads Annotation File
    - Generates Parallel Versions of Methods
    - Inserts Synchronization and Parallelization Code
- Parallelizes Unannotated Programs
Clean Subset of C++
Motivation: Simplify Implementation of Prototype
- No Virtual Methods
- No Operator or Method Overloading
- No Multiple Inheritance or Templates
- No typedef, struct, union, or enum Types
- Global Variables Must Be of Class Types
- No Static Members or Pointers to Members
- No Default Arguments or Variable Numbers of Arguments
- No Operation May Access a Variable Declared in a Class from which Its Receiver Class Inherits
Run Time Library
Motivation: Provide Basic Concurrency Management
Single Program, Multiple Data Execution Model
- Single Address Space
- Alternate Serial and Parallel Phases
Library Provides
- Task Creation and Synchronization Primitives
- Dynamic Load Balancing
Implemented On
- Stanford DASH Shared-Memory Multiprocessor
- SGI Shared-Memory Multiprocessors
Applications
Barnes-Hut
- O(N lg N) N-Body Solver
- Space Subdivision Tree
- 1500 Lines of C++ Code
Water
- Simulates Liquid Water
- O(N^2) Algorithm
- 1850 Lines of C++ Code
Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to Get Serial C
- Converted to Clean Object-Based C++
Major Structural Changes
- Eliminated Scheduling Code and Data Structures
- Split a Loop in Force Computation Phase
- Introduced New Field into Particle Data Structure
Obtaining Serial C++ Version of Water
- Converted to Clean Object-Based C++
Major Structural Change
- Auxiliary Objects for O(N^2) Phases
Commutativity Statistics for Barnes-Hut
Auxiliary Operation Statistics for Barnes-Hut
Performance Results for Barnes-Hut
Performance Analysis
Motivation: Understand Behavior of Parallelized Program
Instrumented Code to Measure Execution Time Breakdowns
- Parallel Idle - Time Spent Idle in a Parallel Section
- Serial Idle - Time Spent Idle in a Serial Section
- Blocked - Time Spent Waiting to Acquire a Lock Held by Another Processor
- Parallel Compute - Time Spent Doing Useful Work in a Parallel Section
- Serial Compute - Time Spent Doing Useful Work in a Serial Section
Performance Analysis for Barnes-Hut
Performance Results for Water
Performance Results for Computation Replication Version of Water
Commutativity Statistics for Water
Auxiliary Operation Statistics for Water
Performance Analysis for Water
Future Work
- Relative Commutativity
- Integrate Other Analysis Frameworks
  - Pointer or Alias Analysis
  - Array Data Dependence Analysis
- Analysis Problems
  - Synchronization Optimizations
  - Analysis Granularity Optimizations
- Generation of Self-Tuning Code
- Message Passing Implementation
Related Work
- Bernstein (IEEE Transactions on Computers 1966)
- Dependence Analysis for Pointer-Based Data Structures
- Reduction Analysis
  - Ghuloum and Fisher (PPOPP 95)
  - Pinter and Pinter (POPL 92)
  - Callahan (LCPC 91)
- Commuting Operations in Parallel Languages
  - Rinard and Lam (PPOPP 91)
  - Steele (POPL 90)
  - Barth, Nikhil and Arvind (FPCA 91)
Conclusion
Commutativity Analysis
- New Analysis Framework for Parallelizing Compilers
Basic Idea
- Recognize Commuting Operations
- Generate Parallel Code
Current Focus
- Dynamic, Pointer-Based Data Structures
- Good Initial Results
Future
- Persistent Data
- Distributed Computations
Latest Version of Paper
http://www.cs.ucsb.edu/~martin/paper/pldi96.ps
What if Operations Do Not Commute?
Parallel Tree Traversal
Example: Distance of Node from Root
Equivalent Computation with Commuting Operations
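The figure from these two slides is not preserved; the following is a minimal sketch of the idea in the style of the earlier graph example. Assigning an absolute distance does not commute, but accumulating path lengths with a commutative operator (+, with zero element 0) computes the same result on a tree, and commutativity analysis can parallelize the rewritten traversal. The class and field names are illustrative.

class node {
  int depth;            // initialized to 0 (the zero element)
  node *left, *right;
public:
  void set_depth(int d);
  void add_depth(int d);
};

// Does not commute: two invocations with different arguments leave
// depth equal to whichever value was assigned last.
void node::set_depth(int d) {
  depth = d;
  if (left != NULL) left->set_depth(d+1);
  if (right != NULL) right->set_depth(d+1);
}

// Equivalent computation with commuting operations: += produces the
// same new value of depth in either execution order, and both orders
// invoke the same multiset of operations.
void node::add_depth(int d) {
  depth += d;
  if (left != NULL) left->add_depth(d+1);
  if (right != NULL) right->add_depth(d+1);
}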
Theoretical Result
For Any Tree Traversal on Data with
- A Commutative Operator (for example, +) that Has
- A Zero Element (for example, 0)
There Exists a Program P such that
- P Computes the Traversal
- Commutativity Analysis Can Automatically Parallelize P
Complexity Results:
- Program P Is Asymptotically Optimal if the Data Structure Is a Perfectly Balanced Tree
- Program P Has Complexity O(N^2) if the Data Structure Is a Linked List
Pure Object-Based Model of Computation
Goal
- Obtain a Powerful, Clean Model of Computation
- Enable Compiler to Analyze Program
Objects: Instances of Classes
- Implement State with Instance Variables
  - Primitive Types from Underlying Language (int, ...)
  - References to Other Objects
  - Nested Objects
Operations: Invocations of Methods
- Each Operation Has a Single Receiver Object
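A minimal sketch of what these instance variable kinds look like in C++; the class names are illustrative, not from the talk.

class vec {              // nested object type
  double x, y, z;
};

class particle {
  double mass;           // primitive type from the underlying language
  particle *next;        // reference to another object
  vec pos;               // nested object
public:
  // An operation: a method invocation with a single receiver
  // object (the particle it is invoked on).
  void scale(double s) { mass *= s; }
};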