Hy-C
A Retargetable Compiler for Single-Chip
Heterogeneous Multiprocessors
Philip Sweany
8/27/2010
No single architecture solves all power
problems
[Figure: Hard-wired proxy, Software Programmable DSP, and General Purpose Processor, with 10X and 100X labels]
•
Industry has debated the merits of each architecture for decades…
•
Combination of all approaches optimizes power and performance
Retargetable Compilation
•
Why?
•
Rocket
–
C compiler, written in C++
–
Retargetable for ILP computers
–
Single machine description file
–
Development 1989-2000
•
GNU
Hybrid Computing
•
Heterogeneous processors on single chip
–
“CPU”
–
FPGA
–
ASIC
–
N “CPU”s, M FPGAs, K ASICs
•
Tradeoffs of performance, power, flexibility
Generic Hybrid Architecture
[Figure: Multi-CPU (CPU 1, CPU 2, CPU m), Multi-FPGA (FPGA 1, FPGA 2, FPGA n), and Shared Memory]
Generic Hy-C Tools
[Figure: System Specification, Source Code, Optimization Control, Objectives/Constraints, Partitioning, CPU Compiler, FPGA Synthesis, CPU Power-Performance Model, FPGA Power-Performance Model]
Intermediate Representations
•
3-address form
•
Control flow graph
•
SSA --- static single assignment
Control Flow Graph
•
Nodes are Basic Blocks
–
Single entry, single exit
–
No branch except (possibly) at bottom
•
Edges represent one possible flow of
execution between two basic blocks
•
Whole CFG represents a function
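The definition above can be sketched in a few lines of Python. This is a minimal illustration, not Hy-C's implementation: the tuple encoding of 3-address instructions (`"label"`, `"goto"`, `"if"`, `"op"`) and the names `BasicBlock` and `build_cfg` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    name: str
    instrs: list = field(default_factory=list)
    succs: list = field(default_factory=list)  # names of successor blocks


def build_cfg(instrs):
    """Split 3-address instructions into basic blocks and add flow edges.

    Hypothetical encoding: ("label", name), ("goto", target),
    ("if", cond, target), ("op", text).
    """
    if not instrs:
        return []
    # A leader starts a block: the first instruction, any label,
    # and any instruction that follows a branch.
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins[0] == "label":
            leaders.add(i)
        if ins[0] in ("goto", "if") and i + 1 < len(instrs):
            leaders.add(i + 1)
    starts = sorted(leaders)
    blocks = []
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(instrs)
        blocks.append(BasicBlock(f"B{n}", instrs[start:end]))
    label_to_block = {b.instrs[0][1]: b.name
                      for b in blocks if b.instrs[0][0] == "label"}
    # Edges: branch targets plus fall-through (no fall-through after goto).
    for n, b in enumerate(blocks):
        last = b.instrs[-1]
        if last[0] == "goto":
            b.succs.append(label_to_block[last[1]])
            continue
        if last[0] == "if":
            b.succs.append(label_to_block[last[2]])
        if n + 1 < len(blocks):
            b.succs.append(blocks[n + 1].name)
    return blocks


# A small summation loop, written in the hypothetical encoding.
prog = [
    ("op", "i = 0"),
    ("label", "loop"),
    ("if", "i >= n", "done"),
    ("op", "s = s + i"),
    ("op", "i = i + 1"),
    ("goto", "loop"),
    ("label", "done"),
    ("op", "return s"),
]
cfg = build_cfg(prog)
edges = {b.name: b.succs for b in cfg}
```

Each block ends at its only branch, so the "no branch except at bottom" property holds by construction.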
1/26/2016
Static Single Assignment
•
SSA: A program is in SSA form iff
–
Each variable is statically defined exactly
once, and
–
Each use of a variable is dominated by that
variable’s definition.
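Both conditions can be checked mechanically at basic-block granularity. The sketch below is an illustration under stated assumptions: the CFG is a dict of successor lists, definitions and uses are given as (variable, block) pairs, a definition is assumed to precede any use in the same block, and φ operands are not modeled. The diamond-shaped CFG is hypothetical and is not the figure from the slides.

```python
from collections import Counter


def dominators(succs, entry):
    """dom[b] = set of blocks that lie on every path from entry to b."""
    blocks = set(succs)
    preds = {b: set() for b in blocks}
    for b, targets in succs.items():
        for t in targets:
            preds[t].add(b)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b in blocks - {entry}:
            pred_doms = [dom[p] for p in preds[b]]
            new = {b} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


def is_ssa(defs, uses, succs, entry):
    """defs and uses are lists of (variable, block) pairs."""
    # Condition 1: every variable has exactly one static definition.
    if any(c != 1 for c in Counter(v for v, _ in defs).values()):
        return False
    # Condition 2: every use is dominated by its variable's definition.
    def_block = dict(defs)
    dom = dominators(succs, entry)
    return all(v in def_block and def_block[v] in dom[b] for v, b in uses)


# Hypothetical diamond: entry branches to left/right, which merge at join.
diamond = {"entry": ["left", "right"], "left": ["join"],
           "right": ["join"], "join": []}
defs = [("x1", "left"), ("x2", "right"), ("x3", "join")]

ok = is_ssa(defs, [("x3", "join")], diamond, "entry")
not_dominated = is_ssa(defs, [("x2", "join")], diamond, "entry")
defined_twice = is_ssa(defs + [("x1", "right")], [], diamond, "entry")
```

In the diamond, `x2` is defined on only one of the two paths into `join`, so a plain use of `x2` there fails the dominance condition.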
Example
•
In general, how to transform
an arbitrary program into SSA
form?
•
Does the definition of X2
dominate its use in the
example?
[Figure: example flow graph with statements involving X1, X2, and X4, and the merge X3 = φ(X1, X2)]
SSA: Motivation
•
Provide a uniform basis of an IR to solve a wide
range of classical dataflow problems
•
Encode both dataflow and control flow information
•
An SSA form can be constructed and maintained
efficiently
•
It's popular
•
GCC uses SSA
Software Pipelining
•
Schedule operations from multiple iterations
of a loop in parallel
•
Hides latency
•
Compiler “reorders” loop code to include:
–
Prelude
–
Kernel
–
Postlude
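The prelude/kernel/postlude structure can be seen in an idealized schedule. The sketch below assumes each loop iteration splits into single-cycle stages and that a new iteration starts every cycle (initiation interval of 1); the stage names and the function `software_pipeline` are hypothetical, and a real scheduler must also respect resource and dependence constraints.

```python
def software_pipeline(stages, n_iters):
    """Return (prelude, kernel, postlude) for an idealized pipelined loop.

    Each row is one cycle and holds (stage, iteration) pairs that execute
    in parallel. In cycle t, stage k works on iteration t - k.
    """
    depth = len(stages)
    if n_iters < depth:
        raise ValueError("need at least as many iterations as stages")
    rows = []
    for t in range(n_iters + depth - 1):
        rows.append([(stages[k], t - k)
                     for k in range(depth) if 0 <= t - k < n_iters])
    prelude = rows[:depth - 1]       # pipeline fills
    kernel = rows[depth - 1:n_iters]  # every stage busy each cycle
    postlude = rows[n_iters:]         # pipeline drains
    return prelude, kernel, postlude


prelude, kernel, postlude = software_pipeline(["load", "mul", "store"], 5)
for label, rows in (("prelude", prelude), ("kernel", kernel),
                    ("postlude", postlude)):
    for row in rows:
        print(f"{label:8s}", row)
```

Every kernel row contains one operation from each of three different iterations, which is how the transformation hides latency: the multiply of one iteration overlaps the load of the next.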
Software Pipeline Benefit for “Typical”
Architecture and MMult
•
“Typical” Architecture
–
8-wide Instruction-Level Parallel (ILP)
•
Assuming 3000 x 3000 matrices
–
Original requires 45 million cycles
–
Pipelined version requires 3 million + 15 cycles
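Taking the two cycle counts above at face value, the implied speedup works out as follows (plain arithmetic on the slide's figures, not an additional measurement):

```latex
\[
\text{speedup}
  = \frac{45 \times 10^{6}}{3 \times 10^{6} + 15}
  \approx 15
\]
```

The fixed 15-cycle term is negligible next to the 3 million cycles, so the ratio is within a rounding error of 15.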
Current Compiler Projects
•
Hy-C
–
Build tools
–
Partition algorithms
–
Retargetability and constraint specification
–
OMAP project
•
Thread-level parallelism in imperative code
–
Limit study
–
Improved identification of threads
•
Fast compiler-controlled memory
OMAP4 Sub-System Encapsulation
[Figure: OMAP Resources. Application labels Imaging, Video, and Audio; processors Chiron, Tesla, and Ducati; Multi-CPU; Shared Memory]
OMAP Processor Resources
•
Chiron
–
2 x 600 MHz (2 symmetric processors each at 600 MHz
with shared L2)
–
Power 600uW / MHz
•
Tesla
–
DSP Sub-System (C64x derivative); 400 MHz, 8-wide ILP
–
Power 200uW / MHz
•
Ducati
–
200 MHz (targeted for control, low latency code)
–
Power 100uW / MHz
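As a rough comparison, the per-MHz figures above can be turned into power at each resource's full clock rate. This sketch assumes power scales linearly with clock frequency, which the slides do not state, and it does not resolve whether Chiron's 600 uW/MHz is per core or for both cores together.

```python
# (clock in MHz, power in uW per MHz), copied from the list above.
RESOURCES = {
    "Chiron": (600, 600),
    "Tesla": (400, 200),
    "Ducati": (200, 100),
}


def full_clock_power_mw(clock_mhz, uw_per_mhz):
    """Power in mW at full clock, assuming linear scaling with frequency."""
    return clock_mhz * uw_per_mhz / 1000.0


power_mw = {name: full_clock_power_mw(*spec)
            for name, spec in RESOURCES.items()}
for name, mw in power_mw.items():
    print(f"{name}: {mw:.0f} mW")
```

Under that assumption the three resources span more than an order of magnitude in power, which is the tradeoff the partitioner is meant to exploit.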
Hy-C for OMAP
[Figure: System Specification, Source Code, Optimization Control, Objectives/Constraints, Partitioning, Veyron, Ducati, Tesla]
OMAP Project, Current State
•
Use gcc to generate “readable” SSA graphs for C
programs
•
Developing translator to convert SSA graphs to
Hy-C internal Control, Data Dependence Graphs
(CDDGs).
•
Translator to Hy-C CDDGs successfully tested on
small C programs
Partition Algorithm
•
Examine Control Flow Graph (CFG) for a function
–
Identify software pipelining possibility
–
Build Dependence Graph (combining data and control dependence)
•
Choose one of three resources for the function
Partition Algorithm (cont.)
•
If software pipelining profitable, place
function on C64 DSP resource
•
Else examine Dependence Graph
–
if ( number of nodes / critical path length ) > 1.5,
place on double-issue ARM
–
else place on single-issue ARM
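The decision rule above can be sketched as follows. Several details are assumptions, since the slides do not define them: the critical path is measured in nodes with unit latency, the dependence graph is acyclic, and the software pipelining profitability test is passed in as a boolean. The function names are hypothetical.

```python
def critical_path_length(nodes, edges):
    """Longest path, counted in nodes, through a dependence DAG."""
    succs = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    depth = {n: 1 for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while ready:  # process nodes in topological order
        n = ready.pop()
        visited += 1
        for m in succs[n]:
            depth[m] = max(depth[m], depth[n] + 1)
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if visited != len(nodes):
        raise ValueError("dependence graph contains a cycle")
    return max(depth.values(), default=0)


def choose_resource(nodes, edges, pipelining_profitable):
    """Pick one of the three resources for a function."""
    if pipelining_profitable:
        return "C64 DSP"
    cpl = critical_path_length(nodes, edges)
    # nodes / critical path length approximates available parallelism.
    if cpl and len(nodes) / cpl > 1.5:
        return "double-issue ARM"
    return "single-issue ARM"


# A serial chain has ratio 3/3 = 1.0; the wider graph has ratio 4/2 = 2.0.
chain = choose_resource(["a", "b", "c"], [("a", "b"), ("b", "c")], False)
wide = choose_resource(["a", "b", "c", "d"], [("a", "c"), ("b", "c")], False)
loop = choose_resource(["a", "b"], [("a", "b")], True)
```

The ratio of node count to critical path length is a cheap estimate of how many operations could issue per cycle, which is why the 1.5 threshold separates the double-issue and single-issue cores.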
Long-Term Future
•
Automatic Code Generation (I don’t believe in
software)
•
Visual Programming of Components