Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers
Martin C. Rinard and Pedro C. Diniz
University of California, Santa Barbara
Santa Barbara, California 93106
{martin,pedro}@cs.ucsb.edu
http://www.cs.ucsb.edu/~{martin,pedro}
Goal: Develop a Parallelizing Compiler for Object-Oriented Computations
Current Focus
- Irregular Computations
- Dynamic Data Structures
Future
- Persistent Data
- Distributed Computations
New Analysis Technique: Commutativity Analysis
Structure of Talk
- Model of Computation
- Example
- Steps to Practicality
- Experimental Results
- Conclusion
Model of Computation
Graph Traversal Example

class graph {
  int val, sum;
  graph *left, *right;
};

void graph::traverse(int v) {
  sum += v;
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Parallel Traversal
Commuting Operations in Parallel Traversal
Model of Computation
Operations: Method Invocations
- In Example: Invocations of graph::traverse
  - left->traverse(3)
  - right->traverse(2)
Objects: Instances of Classes
- Instance Variables Implement Object State
- In Example: val, sum, left, right
Separable Operations
Each Operation Consists of Two Sections
- Object Section: Performs All Accesses to the Receiver
- Invocation Section: Invokes Other Operations and Does Not Access the Receiver
Basic Approach
Compiler Chooses a Computation to Parallelize
- In Example: Entire graph::traverse Computation
Compiler Computes Extent of the Computation (see the sketch after this slide)
- Representation of All Operations in Computation
- Current Representation: Set of Methods
- In Example: { graph::traverse }
Do All Pairs of Operations in Extent Commute?
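The talk does not show how the extent is computed, so the following is a minimal sketch under stated assumptions: methods are identified by name, and a toy callees() helper stands in for the compiler's call graph. The extent is the least set of methods containing the root method and closed under invocation.

#include <set>
#include <string>
#include <vector>

// Toy call-graph query for the example: traverse invokes itself.
std::vector<std::string> callees(const std::string &m) {
  if (m == "graph::traverse") return {"graph::traverse"};
  return {};
}

// Compute the extent: all methods reachable from the root method
// of the parallelized computation, via a worklist fixed point.
std::set<std::string> computeExtent(const std::string &root) {
  std::set<std::string> extent;
  std::vector<std::string> worklist = {root};
  while (!worklist.empty()) {
    std::string m = worklist.back();
    worklist.pop_back();
    if (extent.insert(m).second) {      // newly added method
      for (const std::string &c : callees(m))
        worklist.push_back(c);          // visit its callees
    }
  }
  return extent;                        // e.g. { "graph::traverse" }
}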
Code Generation
For Each Method in Parallel Computation
- Augments Class Declaration with Mutual Exclusion Lock
- Generates Driver Version of Method
  - Invoked from Serial Code to Start Parallel Execution
  - Invokes Parallel Version of Operation
  - Waits for Entire Parallel Computation to Finish
- Generates Parallel Version of Method
Driver Version
Parallel Version In Example
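The generated code on these two slides is not reproduced here; what follows is a minimal sketch of what the compiler might emit for graph::traverse, assuming hypothetical runtime primitives spawn (create a task) and wait (block until the parallel computation finishes) and the lock class from the run time library.

class graph {
  lock mutex;                      // lock added by the compiler
  int val, sum;
  graph *left, *right;
public:
  void traverse(int v);            // driver version
  void parallel_traverse(int v);   // parallel version
};

// Driver version: invoked from serial code; starts the parallel
// computation, then waits for the whole computation to finish.
void graph::traverse(int v) {
  parallel_traverse(v);
  wait();
}

// Parallel version: the update to the receiver executes under the
// mutual exclusion lock; recursive invocations become tasks.
void graph::parallel_traverse(int v) {
  mutex.acquire();
  sum += v;
  mutex.release();
  if (left != NULL) spawn(left, &graph::parallel_traverse, val);
  if (right != NULL) spawn(right, &graph::parallel_traverse, val);
}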
Compiler Structure
Traditional Approach
Data Dependence Analysis
- Analyzes Reads and Writes
- Independent Pieces of Code Execute in Parallel
Demonstrated Success for Array-Based Programs
Data Dependence Analysis in Example
For Data Dependence Analysis to Succeed in Example
- left and right Traversals Must Be Independent
- left and right Subgraphs Must Be Disjoint
- Graph Must Be a Tree
Depends on Global Topology of Data Structure
- Analyze Code that Builds Data Structure
- Extract and Propagate Topology Information
Fails For Graphs
Properties of Commutativity Analysis
Advantages
- Oblivious to Data Structure Topology
  - Local Analysis
  - Simple Analysis
- Handles a Wide Range of Computations
Disadvantages
- Introduces Synchronization
- Relies on Commuting Operations
Commutativity Testing
Commutativity Testing Conditions
Do Two Operations A and B Commute?
Compiler Considers Two Execution Orders
- A;B - A Executes Before B
- B;A - B Executes Before A
Compiler Must Check Two Conditions
- Instance Variables Condition: Both Execution Orders Must Produce the Same New Values for the Instance Variables of the Receiver
- Invoked Operations Condition: Both Execution Orders Must Directly Invoke the Same Multiset of Operations
Commutativity Testing Algorithm
Symbolic Execution:
- Compiler Executes Operations
- Computes with Expressions, not Values
Compiler Symbolically Executes Operations in Both Execution Orders
- Expressions for New Values of Instance Variables
- Expressions for Multiset of Invoked Operations
Expression Simplification and Comparison
Compiler Applies Rewrite Rules to Simplify Expressions
- a*(b+c) → (a*b)+(a*c)
- b+(a+c) → (a+b+c)
- a+if(b,c,d) → if(b,a+c,a+d)
Compiler Compares Corresponding Expressions - If All Equal - Operations Commute
- If Not All Equal - Operations May Not Commute
Commutativity Testing Example
Two Operations
- r->traverse(v1) and r->traverse(v2)
In Order r->traverse(v1); r->traverse(v2)
- New Value of sum: (sum+v1)+v2
- Invoked Operations: left->traverse(val) and right->traverse(val) from Each Invocation
In Order r->traverse(v2); r->traverse(v1)
- New Value of sum: (sum+v2)+v1
- Invoked Operations: Same Multiset (val, left, and right Are Unchanged)
Both Values of sum Simplify to sum+v1+v2, so the Operations Commute
Important Special Case
Independent Operations Commute
Conditions for Independence (see the example after this slide)
- Operations Have Different Receivers
- Neither Operation Writes an Instance Variable that Other Operation Accesses
Detecting Independent Operations
- In Type-Safe Languages
  - Class Declarations
  - Instance Variable Accesses
- Pointer or Alias Analysis
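A minimal sketch illustrating these conditions; the classes here are hypothetical, not from the talk. The two invocations have different receivers and neither writes an instance variable the other accesses, so they are independent and commute.

class counter {
  int count;
public:
  void inc() { count++; }             // writes only this->count
};

class accumulator {
  double total;
public:
  void add(double v) { total += v; }  // writes only this->total
};

void example(counter *c, accumulator *a) {
  // Different receivers of different classes: neither operation
  // writes an instance variable the other accesses, so the two
  // invocations are independent and may execute in parallel.
  c->inc();
  a->add(3.0);
}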
Analysis in Current Compiler
Dependence Analysis
- Operations on Objects of Different Classes
- Independent Operations on Objects of Same Class
Symbolic Commutativity Testing
- Dependent Operations on Objects of Same Class
Future - Integrate Pointer or Alias Analysis
- Integrate Array Data Dependence Analysis
Steps to Practicality
Programming Model Extensions
Extensions for Read-Only Data
- Allow Operations to Freely Access Read-Only Data
- Enhances Ability of Compiler to Represent Expressions
- Increases Set of Programs that Compiler can Analyze
Analysis Granularity Extensions
- Integrate Operations into Callers for Analysis Purposes
- Coarsens Commutativity Testing Granularity
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing
Optimizations
Synchronization Optimizations
- Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data
- Reduce Number of Acquire and Release Constructs
Parallel Loop Optimization
Suppress Exploitation of Excess Concurrency
Extent Constants
Motivation: Allow Parallel Operations to Freely Access Read-Only Data
- Extent Constant Variable: Global Variable or Instance Variable Written by No Operation in Extent
- Extent Constant Expression: Expression Whose Value Depends Only on Extent Constant Variables or Parameters
- Extent Constant Value: Value Computed by an Extent Constant Expression
- Extent Constant: Automatically Generated Opaque Constant Used to Represent an Extent Constant Value
Requires:
- Interprocedural Data Usage Analysis
  - Result Summarizes How Operations Access Instance Variables
- Interprocedural Pointer Analysis for Reference Parameters
Extent Constant Variables in Example
- graph::traverse Writes Only sum
- val, left, and right Are Written by No Operation in the Extent, so They Are Extent Constant Variables
Advantages of Extent Constants
Extent Constants Extend Programming Model
- Enable Direct Global Variable Access
- Enable Direct Access of Objects other than Receiver
Extent Constants Make Compiler More Effective
- Enable Compact Representations of Large Expressions
- Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs
Auxiliary Operations
Motivation: Coarsen Granularity of Commutativity Testing
An Operation Is an Auxiliary Operation if Its Entire Computation
- Only Computes Extent Constant Values
- Only Externally Visible Writes are to Local Variables of Caller
Auxiliary Operations Are Conceptually Part of Caller
Requires:
- Interprocedural Data Usage Analysis
- Interprocedural Pointer Analysis for Reference Parameters
- Intraprocedural Reaching Definition Analysis
Auxiliary Operation Example

int graph::square_and_add(int v) {
  return (val*val + v);
}

void graph::traverse(int v) {
  sum += square_and_add(v);
  if (left != NULL) left->traverse(val);
  if (right != NULL) right->traverse(val);
}
Advantages of Auxiliary Operations
Coarsen Granularity of Commutativity Testing
- Reduces Number of Pairs Tested for Commutativity
- Enhances Effectiveness of Commutativity Testing Algorithm
Support Modular Programming
Synchronization Optimizations
Goal: Eliminate or Reduce Synchronization Overhead
- Synchronization Elimination
- Data Lock Coarsening
- Computation Lock Coarsening
Data Lock Coarsening Example
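The code on this slide is not reproduced here; the following is a minimal sketch of the transformation using the body and vector classes of the next slide. Data lock coarsening replaces the lock of a nested object with the lock of its enclosing object, so one acquire/release pair protects updates to both phi and acc.

// Before coarsening: the nested vector carries its own lock, so
// body::gravsub acquires two locks per update.
class vector {
  lock mutex;            // removed by the optimization
  double val[NDIM];
public:
  void add(double *v);   // acquires and releases vector's lock
};

// After coarsening: vector has no lock of its own; the enclosing
// body's mutex, acquired once in gravsub, protects phi and acc
// together (as in the declaration on the next slide).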
Computation Lock Coarsening Example

class body {
  lock mutex;
  double phi;
  vector acc;
};

void body::gravsub(body *b) {
  double p, v[NDIM];
  mutex.acquire();
  p = computeInter(b,v);
  phi -= p;
  acc.add(v);
  mutex.release();
}

void body::loopsub(body *b) {
  int i;
  for (i = 0; i < N; i++) {
    this->gravsub(b+i);
  }
}
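The transformed code on this slide is likewise not reproduced; a minimal sketch of the result follows. Because every iteration of loopsub acquires the same receiver's lock, the acquire/release pair can be hoisted out of the loop around a lock-free variant of gravsub; the name gravsub_nolock is illustrative.

// Sketch of computation lock coarsening: acquire the receiver's
// lock once for the whole loop instead of once per gravsub call.
void body::gravsub_nolock(body *b) {
  double p, v[NDIM];
  p = computeInter(b,v);   // caller already holds mutex
  phi -= p;
  acc.add(v);
}

void body::loopsub(body *b) {
  int i;
  mutex.acquire();
  for (i = 0; i < N; i++) {
    this->gravsub_nolock(b+i);
  }
  mutex.release();
}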
Parallel Loops
Goal: Generate Efficient Code for Parallel Loops
If a Loop Is in the Following Form

for (i = exp1; i < exp2; i += exp3) {
  exp4->op(exp5, exp6, ...);
}

- Where exp1, exp2, ... Are Extent Constant Expressions
Then Compiler Generates Parallel Loop Code
Parallel Loop Optimization
Without Parallel Loop Optimization
- Each Loop Iteration Generates a Task
- Tasks are Created and Scheduled Sequentially
- Each Iteration Incurs Task Creation and Scheduling Overhead
With Parallel Loop Optimization (see the sketch after this slide)
- Generated Code Immediately Exposes All Iterations
- Scheduler Operates on Chunks of Loop Iterations
- Each Chunk of Iterations Incurs Scheduling Overhead
Advantages
- Enables Compact Representation for Loop Computation
- Reduces Task Creation and Scheduling Overhead
- Parallelizes Overhead
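A minimal sketch of the shape of the generated code; the talk does not specify the runtime interface, so the scheduler entry point and all names here are hypothetical.

void iteration(int i);   // one loop iteration: exp4->op(exp5, exp6, ...)

// Hypothetical scheduler entry point: exposes the entire iteration
// space at once and hands chunks of iterations to processors.
void schedule_chunks(int lo, int hi, int step, void (*body)(int));

// Worker side: a chunk executes its iterations serially, so task
// creation and scheduling overhead is paid once per chunk rather
// than once per iteration.
void run_chunk(int lo, int hi, int step, void (*body)(int)) {
  for (int i = lo; i < hi; i += step)
    body(i);
}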
Suppressing Excess Concurrency
Goal: Reduce Overhead of Exploiting Parallelism
Goal Achieved by Generating Computations that
- Execute Operations Serially with No Parallelization Overhead
- Use Synchronization Required to Execute Safely in Parallel Context
Mechanism: Mutex Versions of Methods
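A minimal sketch of a mutex version of graph::traverse, under the same assumptions as the earlier generated-code sketch: control flow stays serial (no task creation), but the lock is retained so the method can run safely while parallel operations access the same objects.

// Mutex version: serial control flow, parallel-safe updates.
void graph::mutex_traverse(int v) {
  mutex.acquire();
  sum += v;
  mutex.release();
  if (left != NULL) left->mutex_traverse(val);   // plain calls, no spawn
  if (right != NULL) right->mutex_traverse(val);
}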
Experimental Results
Methodology
Built Prototype Compiler
Built Run Time System
- Concurrency Generation and Task Management
- Dynamic Load Balancing
- Synchronization
Acquired Two Complete Applications
- Barnes-Hut N-Body Solver
- Water Code
Automatically Parallelized Applications
Ran Applications on Stanford DASH Machine
Compared Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite
Prototype Compiler
- Clean Subset of C++
- Sage++ Front End
- Structured as a Source-To-Source Translator
  - Analysis Finds Parallel Loops and Methods
  - Compiler Generates Annotation File
    - Identifies Parallel Loops and Methods
    - Identifies Classes to Augment with Locks
  - Code Generator Reads Annotation File
    - Generates Parallel Versions of Methods
    - Inserts Synchronization and Parallelization Code
- Parallelizes Unannotated Programs
Clean Subset of C++
Motivation: Simplify Implementation of Prototype
- No Virtual Methods
- No Operator or Method Overloading
- No Multiple Inheritance or Templates
- No typedef, struct, union, or enum Types
- Global Variables Must Be of Class Types
- No Static Members or Pointers to Members
- No Default Arguments or Variable Numbers of Arguments
- No Operation May Access a Variable Declared in a Class from which Its Receiver Class Inherits
Run Time Library
Motivation: Provide Basic Concurrency Management
Single Program, Multiple Data Execution Model
- Single Address Space
- Alternate Serial and Parallel Phases
Library Provides
- Task Creation and Synchronization Primitives
- Dynamic Load Balancing
Implemented On
- Stanford DASH Shared-Memory Multiprocessor
- SGI Shared-Memory Multiprocessors
Applications
Barnes-Hut
- O(N lg N) N-Body Solver
- Space Subdivision Tree
- 1500 Lines of C++ Code
Water
- Simulates Liquid Water
- O(N^2) Algorithm
- 1850 Lines of C++ Code
Obtaining Serial C++ Version of Barnes-Hut
- Started with Explicitly Parallel Version (SPLASH-2)
- Removed Parallel Constructs to Get Serial C
- Converted to Clean Object-Based C++
Major Structural Changes
- Eliminated Scheduling Code and Data Structures
- Split a Loop in Force Computation Phase
- Introduced New Field into Particle Data Structure
Obtaining Serial C++ Version of Water
- Converted to Clean Object-Based C++
Major Structural Change
- Auxiliary Objects for O(N^2) Phases
Commutativity Statistics for Barnes-Hut
Auxiliary Operation Statistics for Barnes-Hut
Performance Results for Barnes-Hut
Performance Analysis
Motivation: Understand Behavior of Parallelized Program
Instrumented Code to Measure Execution Time Breakdowns
- Parallel Idle - Time Spent Idle in a Parallel Section
- Serial Idle - Time Spent Idle in a Serial Section
- Blocked - Time Spent Waiting to Acquire a Lock Held by Another Processor
- Parallel Compute - Time Spent Doing Useful Work in a Parallel Section
- Serial Compute - Time Spent Doing Useful Work in a Serial Section
Performance Analysis for Barnes-Hut
Performance Results for Water
Performance Results for Computation Replication Version of Water
Commutativity Statistics for Water
Auxiliary Operation Statistics for Water
Performance Analysis for Water
Future Work
- Relative Commutativity
- Integrate Other Analysis Frameworks
  - Pointer or Alias Analysis
  - Array Data Dependence Analysis
- Analysis Problems
  - Synchronization Optimizations
  - Analysis Granularity Optimizations
- Generation of Self-Tuning Code
- Message Passing Implementation
Related Work
- Bernstein (IEEE Transactions on Computers 1966)
- Dependence Analysis for Pointer-Based Data Structures
- Reduction Analysis
  - Ghuloum and Fisher (PPOPP 95)
  - Pinter and Pinter (POPL 92)
  - Callahan (LCPC 91)
- Commuting Operations in Parallel Languages
  - Rinard and Lam (PPOPP 91)
  - Steele (POPL 90)
  - Barth, Nikhil and Arvind (FPCA 91)
Conclusion
Commutativity Analysis
- New Analysis Framework for Parallelizing Compilers
Basic Idea
- Recognize Commuting Operations
- Generate Parallel Code
Current Focus
- Dynamic, Pointer-Based Data Structures
- Good Initial Results
Future
- Persistent Data
- Distributed Computations
Latest Version of Paper
http://www.cs.ucsb.edu/~martin/paper/pldi96.ps
What if Operations Do Not Commute?
Parallel Tree Traversal
Example: Distance of Node from Root
Equivalent Computation with Commuting Operations
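The figure from these two slides is not preserved; the following is a minimal sketch of the idea in the style of the earlier graph example. Assigning an absolute distance does not commute, but accumulating path lengths with a commutative operator (+, with zero element 0) computes the same result on a tree, and commutativity analysis can parallelize the rewritten traversal. The class and field names are illustrative.

class node {
  int depth;            // initialized to 0 (the zero element)
  node *left, *right;
public:
  void set_depth(int d);
  void add_depth(int d);
};

// Does not commute: two invocations with different arguments leave
// depth equal to whichever value was assigned last.
void node::set_depth(int d) {
  depth = d;
  if (left != NULL) left->set_depth(d+1);
  if (right != NULL) right->set_depth(d+1);
}

// Equivalent computation with commuting operations: += produces the
// same new value of depth in either execution order, and both orders
// invoke the same multiset of operations.
void node::add_depth(int d) {
  depth += d;
  if (left != NULL) left->add_depth(d+1);
  if (right != NULL) right->add_depth(d+1);
}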
Theoretical Result
For Any Tree Traversal on Data with
- A Commutative Operator (for example, +) that Has
- A Zero Element (for example, 0)
There Exists a Program P such that
- P Computes the Traversal
- Commutativity Analysis Can Automatically Parallelize P
Complexity Results:
- Program P Is Asymptotically Optimal if the Data Structure Is a Perfectly Balanced Tree
- Program P Has Complexity O(N^2) if the Data Structure Is a Linked List
Pure Object-Based Model of Computation
Goal
- Obtain a Powerful, Clean Model of Computation
- Enable Compiler to Analyze Program
Objects: Instances of Classes
- Implement State with Instance Variables
  - Primitive Types from Underlying Language (int, ...)
  - References to Other Objects
  - Nested Objects
Operations: Invocations of Methods
- Each Operation Has a Single Receiver Object
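A minimal sketch of what these instance variable kinds look like in C++; the class names are illustrative, not from the talk.

class vec {              // nested object type
  double x, y, z;
};

class particle {
  double mass;           // primitive type from the underlying language
  particle *next;        // reference to another object
  vec pos;               // nested object
public:
  // An operation: a method invocation with a single receiver
  // object (the particle it is invoked on).
  void scale(double s) { mass *= s; }
};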