Hy-C: A Compiler Retargetable for Single-Chip Heterogeneous Multiprocessors

Philip Sweany

8/27/2010

No single architecture solves all power problems

[Figure: Hard-wired proxy, General Purpose Processor, and Software Programmable DSP, with scale labels 10X and 100X]

•Industry has debated merits of each architecture for decades…

•Combination of all approaches optimizes power and performance

Retargetable Compilation

•Why?

•Rocket

–C compiler, written in C++

–Retargetable for ILP computers

–Single machine description file

–Development 1989-2000

•GNU

Hybrid Computing

•Heterogeneous processors on single chip

–“CPU”

–FPGA

–ASIC

–N “CPU”s, M FPGAs, K ASICs

•Tradeoffs of performance, power, flexibility

Generic Hybrid Architecture

[Figure: Multi-CPU block (CPU 1, CPU 2, ..., CPU m), Multi-FPGA block (FPGA 1, FPGA 2, ..., FPGA n), and Shared Memory]

Generic Hy-C Tools

[Figure: blocks labeled System Specification, Source Code, Partitioning, CPU Compiler, FPGA Synthesis, CPU Power-Performance Model, FPGA Power-Performance Model, Optimization Control, Objectives/Constraints]

Intermediate Representations

•3-address form

•Control flow graph

•SSA: static single assignment
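
As a small illustration of 3-address form, the sketch below lowers a nested expression into instructions that each have at most one operator and two operands. The tuple encoding of expressions and the temporary names `t1`, `t2`, ... are assumptions made for this example; they are not taken from Hy-C or Rocket.

```python
import itertools


def to_three_address(expr):
    """Lower a nested (op, left, right) tuple into 3-address instructions.

    Leaves are plain variable names (strings). Returns the instruction
    list and the name that holds the final value.
    """
    code = []
    counter = itertools.count(1)

    def walk(e):
        if isinstance(e, str):          # a leaf: already a simple name
            return e
        op, left, right = e
        a = walk(left)
        b = walk(right)
        temp = f"t{next(counter)}"      # hypothetical temporary name
        code.append(f"{temp} = {a} {op} {b}")
        return temp

    result = walk(expr)
    return code, result


# a + b * c
code, result = to_three_address(("+", "a", ("*", "b", "c")))
for line in code:
    print(line)
```

Each emitted line has one destination and at most two sources, which is what makes later analyses (CFG construction, SSA renaming) straightforward.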

Control Flow Graph

•Nodes are Basic Blocks

–Single entry, single exit

–No branch except (possibly) at bottom

•Edges represent one possible flow of execution between two basic blocks

•Whole CFG represents a function
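
The definition above can be sketched as a small CFG builder. This is a minimal illustration, assuming a toy instruction encoding (`label`, `goto`, `if`, `op`, `ret` tuples) invented for this example; it is not the representation used by any particular compiler.

```python
def build_cfg(instrs):
    """Split a linear instruction list into basic blocks and CFG edges.

    Assumed toy encoding: ("label", name), ("goto", target),
    ("if", cond, target), ("op", text), ("ret",).
    """
    # A leader starts a block: the first instruction, every label,
    # and every instruction that follows a branch or return.
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins[0] == "label":
            leaders.add(i)
        if ins[0] in ("goto", "if", "ret") and i + 1 < len(instrs):
            leaders.add(i + 1)
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    blocks = [instrs[s:e] for s, e in zip(starts, ends)]

    # Labels are always leaders, so a label can only be a block's first entry.
    label_block = {b[0][1]: n for n, b in enumerate(blocks) if b[0][0] == "label"}

    edges = set()
    for n, b in enumerate(blocks):
        last = b[-1]
        if last[0] in ("goto", "if"):            # branch target edge
            edges.add((n, label_block[last[-1]]))
        if last[0] not in ("goto", "ret") and n + 1 < len(blocks):
            edges.add((n, n + 1))                # fall-through edge
    return blocks, sorted(edges)


instrs = [
    ("op", "i = 0"),
    ("label", "loop"),
    ("if", "i >= n", "done"),
    ("op", "s = s + i"),
    ("op", "i = i + 1"),
    ("goto", "loop"),
    ("label", "done"),
    ("ret",),
]
blocks, edges = build_cfg(instrs)
print(len(blocks), edges)
```

Every block ends either at a branch or just before the next label, so the only branch in a block is at its bottom, matching the definition on the slide.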

1/26/2016

Static Single Assignment

•SSA: A program is in SSA form iff

–Each variable is statically defined exactly once, and

–Each use of a variable is dominated by that variable’s definition.
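
For a single basic block the two conditions can be met by simple renaming, sketched below. This is a deliberately limited illustration: it assumes straight-line code with no branches, so no φ-functions are needed, and the `name_version` naming scheme is an assumption for this example. Arbitrary programs need φ-functions at control flow merge points, which this sketch does not attempt.

```python
def rename_straight_line(stmts):
    """Rename one basic block into SSA form.

    stmts: list of (target, [operand names]).
    Variables read before any definition in the block get version 0.
    """
    version = {}
    out = []
    for target, operands in stmts:
        # Each use refers to the most recent definition seen so far.
        new_ops = [f"{v}_{version.get(v, 0)}" for v in operands]
        # Each definition creates a fresh version of the target.
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}_{version[target]}", new_ops))
    return out


block = [("x", ["a", "b"]), ("x", ["x", "c"]), ("y", ["x"])]
ssa = rename_straight_line(block)
for target, ops in ssa:
    print(target, "=", ops)
```

After renaming, every target name appears exactly once, and in straight-line code each definition trivially dominates the uses that follow it.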

Example

•In general, how to transform an arbitrary program into SSA form?

•Does the definition of X2 dominate its use in the example?

[Figure: control flow graph fragment with the definitions "X2 =" and "X4 =" and the merge "X3 = φ(X1, X2)"]

SSA: Motivation

• Provide a uniform basis of an IR to solve a wide range of classical dataflow problems

• Encode both dataflow and control flow information

• An SSA form can be constructed and maintained efficiently

• It's popular

•GCC uses SSA

Software Pipelining

•Schedule operations from multiple iterations of a loop in parallel

•Hides latency

•Compiler “reorders” loop code to include:

–Prelude

–Kernel

–Postlude
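
The reordering can be illustrated with a hand-pipelined loop. The sketch below assumes a three-stage loop body (load, multiply, store) chosen for this example; Python runs the statements sequentially, so it shows only the code shape and that the result is unchanged, not the parallel issue an ILP machine would provide.

```python
def plain(a, k):
    """Original loop: each iteration runs load, multiply, store in order."""
    out = [0] * len(a)
    for i in range(len(a)):
        x = a[i]        # stage 1: load
        y = x * k       # stage 2: multiply
        out[i] = y      # stage 3: store
    return out


def pipelined(a, k):
    """Same computation with prelude, kernel, and postlude."""
    n = len(a)
    assert n >= 2, "this sketch needs at least two iterations"
    out = [0] * n
    # Prelude: start iterations 0 and 1 to fill the pipeline.
    x = a[0]            # load for iteration 0
    y = x * k           # multiply for iteration 0
    x = a[1]            # load for iteration 1
    # Kernel: one trip holds work from three different iterations,
    # which a wide machine could issue in the same cycle.
    for i in range(n - 2):
        out[i] = y      # store for iteration i
        y = x * k       # multiply for iteration i + 1
        x = a[i + 2]    # load for iteration i + 2
    # Postlude: drain the iterations still in flight.
    out[n - 2] = y
    y = x * k
    out[n - 1] = y
    return out


data = [3, 1, 4, 1, 5, 9]
print(pipelined(data, 2) == plain(data, 2))
```

The three statements in the kernel belong to different iterations and do not depend on one another within a trip, which is what lets the hardware overlap them and hide latency.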

Software Pipeline Benefit for “Typical” Architecture and MMult

•“Typical” Architecture

–8-wide Instruction-Level Parallel (ILP)

•Assuming 3000 x 3000 matrices

–Original requires 45 million cycles

–Pipelined version requires 3 million + 15 cycles
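
One reading that reproduces these two totals is sketched below. The iteration count, the 15-cycle iteration latency, and the initiation interval of one cycle are assumptions chosen to match the figures on the slide; the slide does not state them.

```python
# Assumed values (not stated on the slide), chosen to match its totals.
iterations = 3_000_000   # loop iterations executed
latency = 15             # cycles for one iteration run start to finish
initiation_interval = 1  # cycles between starting successive iterations

# Without pipelining, each iteration finishes before the next starts.
original_cycles = iterations * latency

# With pipelining, a new iteration starts every interval; the latency
# is paid once to fill and drain the pipeline.
pipelined_cycles = iterations * initiation_interval + latency

print(original_cycles, pipelined_cycles)
print(round(original_cycles / pipelined_cycles, 2))
```

Under these assumptions the speedup approaches the iteration latency (about 15x), since the fill and drain cost is negligible next to the iteration count.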

Current Compiler Projects

•Hy-C

–Build tools

–Partition algorithms

–Retargetability and constraint specification

–OMAP project

•Thread-level parallelism in imperative code

–Limit study

–Improved identification of threads

•Fast compiler-controlled memory

OMAP4 Sub-System Encapsulation

[Figure: labels Application, Imaging, Video, Audio, Chiron, Tesla, Ducati, Multi-CPU, Shared Memory]

OMAP Resources

OMAP Processor Resources

•Chiron

–2 x 600 MHz (2 symmetric processors each at 600 MHz with shared L2)

–Power 600 µW/MHz

•Tesla

–DSP Sub-System (C64x derivative); 400 MHz, 8-wide ILP

–Power 200 µW/MHz

•Ducati

–200 MHz (targeted for control, low latency code)

–Power 100 µW/MHz
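
The per-MHz figures can be turned into rough peak power numbers, which is the kind of input a power-performance model might use. This is a back-of-the-envelope sketch: it assumes the power figure applies per core, scales linearly with clock, and that Tesla and Ducati each count as one core. None of these assumptions is stated on the slide.

```python
# Clock and power figures from the slide; core counts for Tesla and
# Ducati are assumptions made for this example.
RESOURCES = {
    "Chiron": {"cores": 2, "mhz": 600, "uw_per_mhz": 600},
    "Tesla": {"cores": 1, "mhz": 400, "uw_per_mhz": 200},
    "Ducati": {"cores": 1, "mhz": 200, "uw_per_mhz": 100},
}


def peak_power_mw(resource):
    """Peak power in milliwatts, assuming linear scaling with MHz per core."""
    microwatts = resource["cores"] * resource["mhz"] * resource["uw_per_mhz"]
    return microwatts / 1000.0


for name, res in RESOURCES.items():
    print(name, peak_power_mw(res), "mW")
```

Under these assumptions the three resources span well over an order of magnitude in peak power, which is the tradeoff a partitioner has to weigh against performance.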

Hy-C for OMAP

[Figure: blocks labeled System Specification, Source Code, Partitioning, Veyron, Ducati, Tesla, Optimization Control, Objectives/Constraints]

OMAP Project, Current State

•Use gcc to generate “readable” SSA graphs for C programs

•Developing translator to convert SSA graphs to Hy-C internal Control, Data Dependence Graphs (CDDGs).

•Translator to Hy-C CDDGs successfully tested on small C programs

Partition Algorithm

•Examine Control Flow Graph (CFG) for a function

–Identify software pipelining possibility

–Build Dependence Graph (combining data and control dependence)

•Choose one of three resources for the function

Partition Algorithm (cont.)

•If software pipelining profitable, place function on C64 DSP resource

•Else examine Dependence Graph

–if ( number of nodes / critical path length ) > 1.5, place on double-issue ARM

–else place on single-issue ARM
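
The decision rule above can be sketched as follows. This is a minimal illustration, not the Hy-C implementation: it assumes the dependence graph is acyclic, that every node has unit latency (so critical path length is counted in nodes), and that pipelining profitability is decided elsewhere and passed in as a flag. The function and parameter names are hypothetical.

```python
from functools import lru_cache


def critical_path_length(nodes, edges):
    """Longest path through an acyclic dependence graph, counted in nodes."""
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(n):
        # 1 for this node plus the longest chain among its successors.
        return 1 + max((longest_from(s) for s in succs[n]), default=0)

    return max(longest_from(n) for n in nodes)


def choose_resource(nodes, edges, pipelining_profitable):
    """Apply the slide's three-way rule to one function."""
    if pipelining_profitable:
        return "C64 DSP"
    # Ratio of total work to the longest dependence chain: a rough
    # measure of how much parallelism a second issue slot could use.
    ratio = len(nodes) / critical_path_length(nodes, edges)
    return "double-issue ARM" if ratio > 1.5 else "single-issue ARM"


chain = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
wide = (["a", "b", "c", "d"], [("a", "d"), ("b", "d"), ("c", "d")])
print(choose_resource(*chain, pipelining_profitable=False))
print(choose_resource(*wide, pipelining_profitable=False))
```

A purely serial chain has ratio 1.0 and gains nothing from a second issue slot, while three independent operations feeding one result give ratio 2.0 and cross the 1.5 threshold.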

Long-Term Future

•Automatic Code Generation (I don’t believe in software)

•Visual Programming of Components