The Cosmic Cube

Charles L. Seitz

Presented By: Jason D. Robey

2 APR 03

Agenda

•Introduction

•Message Passing

•Process Oriented

•Concurrency Paradigm

•Hardware Description

•Software Considerations

•Measurements

•Future Work

•Summary

Introduction

•How do we get a whole bunch of processors to work together on the same problem in a scalable way?

•Test bed developed at Caltech for what they hoped to be a VLSI implementation

•Programmer controls data sharing, not cache coherency mechanisms

•Techniques for certain problems that give close to linear speed-up

Message Passing

•Communication and synchronization primitives seen by programmer

–Barrier

–Blocking sends and receives

–Broadcasts and node to node message passing

–Explicit sharing of data through sending messages

–Programmer decides when updates are necessary

•Hardware structure is memory/processor node

–Separate consideration for memory vs. inter-process communication

–Optimize each

–Memory is closer to where it will be used
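The primitives listed above can be sketched with threads standing in for nodes. This is a minimal illustration only, not the Cosmic Cube kernel interface: the names `send`, `recv`, `node_program`, and `run`, and the node count, are hypothetical. For simplicity `send` is buffered (it does not block), while `recv` blocks until a message arrives.

```python
import queue
import threading

NUM_NODES = 4  # hypothetical node count for this sketch

# One inbox per node; Queue.get() blocks until a message is available.
inboxes = [queue.Queue() for _ in range(NUM_NODES)]
barrier = threading.Barrier(NUM_NODES)
results = [None] * NUM_NODES


def send(dest, payload):
    """Deliver a message to another node's inbox (buffered, non-blocking)."""
    inboxes[dest].put(payload)


def recv(node):
    """Blocking receive: wait until a message for this node arrives."""
    return inboxes[node].get()


def node_program(node):
    local = node + 1  # each node owns its data; nothing is shared implicitly
    if node == 0:
        total = local
        for _ in range(NUM_NODES - 1):       # gather one value per other node
            total += recv(0)
        for dest in range(1, NUM_NODES):     # broadcast the combined result
            send(dest, total)
    else:
        send(0, local)                       # explicit sharing by message
        total = recv(node)                   # wait for the broadcast
    barrier.wait()                           # every node synchronizes here
    results[node] = total


def run():
    threads = [threading.Thread(target=node_program, args=(i,))
               for i in range(NUM_NODES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(results)
```

Each node decides when to send an update, so no coherency mechanism is needed to keep copies of data consistent.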

Message Passing

•Hyper-cube communications

–Scales well

•O(n lg n) cost

•O(lg n) worst case message delivery

–Simple routing

•Discrete, 2-valued, n-tuple

•Process address gives routing instructions

–Clustering

•Can use “spheres” of nodes for separate problems
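A small sketch of why routing is simple when each node address is a binary n-tuple: the bits in which source and destination differ name the dimensions a message must cross. The function names are hypothetical, and correcting the lowest differing bit first is an assumption for illustration; the slides do not say which order the machine uses.

```python
def hop_count(src, dst):
    """Number of links crossed: the count of differing address bits."""
    return bin(src ^ dst).count("1")


def route(src, dst):
    """List the nodes visited, correcting differing bits from lowest to highest."""
    path = [src]
    current = src
    diff = src ^ dst
    bit = 0
    while diff:
        if diff & 1:
            current ^= 1 << bit   # cross the link along this dimension
            path.append(current)
        diff >>= 1
        bit += 1
    return path


def link_count(dimension):
    """Total links in a hypercube: each of 2**d nodes has d links, shared in pairs."""
    return dimension * 2 ** dimension // 2
```

For a 6-cube (64 nodes), no message needs more than 6 hops, and `link_count(6)` gives 192 links, which matches the O(n lg n) cost and O(lg n) worst-case delivery stated above.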

Process Oriented

•Abstraction from direct hardware targeting

•Processes mapped to nodes

–Multiple processes interleaved in single nodes

–Unique addresses

–Unique message channels

–Programmer not concerned w/ actual number of nodes and node addresses

•Kernel required on each node

–Provides routing services

–Provides process management services

–Requires processing time

Process Oriented

•Caltech disallows process node switching

–Prevents effective run-time load balancing

•Programmer responsibility

–Allows node ID to be included w/ process ID

•Can take advantage of hyper-cube routing simplifications

•Issue: Interleaving may be bad in certain cases

–Context switch for message passing
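Because a process never moves off its node, the node address can be carried inside the process ID. The sketch below shows one way to pack the two; the field width `LOCAL_BITS` and the function names are hypothetical, not the actual Cosmic Cube encoding.

```python
LOCAL_BITS = 8  # hypothetical width of the per-node process number


def make_process_id(node, local):
    """Pack a node address and a node-local process number into one ID."""
    return (node << LOCAL_BITS) | local


def node_of(process_id):
    """Recover the node address, which a router can use directly."""
    return process_id >> LOCAL_BITS


def local_of(process_id):
    """Recover the process number within its node."""
    return process_id & ((1 << LOCAL_BITS) - 1)
```

With such an encoding, a kernel can route on `node_of(process_id)` without a lookup table.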

Concurrency Paradigm

•Programmer must explicitly deal with concurrency

•Different from other approaches where compiler or hardware is expected to find parallelization

•Requires a restructuring of single-processor ideas

–Bubble sort becomes a linear solution

–A lot of solutions need to be redesigned altogether
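One standard way to see bubble sort become linear is odd-even transposition sort, shown here as an assumption; the slides do not name the algorithm. With one value per node, every compare-exchange in a phase touches a disjoint pair and could run at the same time, so n phases take O(n) parallel time. This sketch simulates the phases sequentially.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating even and odd compare-exchange phases."""
    cells = list(values)          # cells[i] stands for the value held by node i
    n = len(cells)
    for phase in range(n):        # n phases are enough to sort n values
        start = phase % 2         # even phases pair (0,1),(2,3)...; odd (1,2),(3,4)...
        # The pairs below are disjoint, so on a parallel machine each
        # exchange would be one message swap between neighboring nodes.
        for i in range(start, n - 1, 2):
            if cells[i] > cells[i + 1]:
                cells[i], cells[i + 1] = cells[i + 1], cells[i]
    return cells
```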

Concurrency Paradigm

•Techniques

–Exploit outer loop unrolling

•Sparse/Predictable messaging

•Good for science and engineering problems

–Regular loops

–Predictable flow

–SIMD—Same thing on a whole lotta data

Hardware Description

•64-node hyper-cube

–5 ft., 700 watts, $80,000

–Linear projection

–Simulation results led to hyper-cube choice

–Allowed for slow network links compared to CPU speed

•Node

–8086 processor w/ 8087 coprocessor

•Needed good floating-point operations

•Slowed from 8 MHz to 5 MHz for 8087

–128K RAM—Spend money on other things

–8K ROM for initialization and POSTs

Hardware Description

•Developed prototype as test bed and resource raiser

•1981-1982 for first prototype 2-cube

•Summer of 1983 to 6-cube

•First year: 560,000 node hours

–2 hard errors

–1 soft error/several days

Software Considerations

•Development and testing done on traditional machines

•Initialization had to deal with node checks in addition to RAM checks

•Extensions to C had to be developed to facilitate the machine’s use by other researchers

•Kernel must be developed

–Deal with message passing constructs

–Must manage requests from intermediate host (IH)

–probe: Allows process access to message layer

–spy: Allows IH to examine and modify kernel execution data

Measurements

•Speedup = T(1)/T(N)

•Efficiency = Speedup / N

–1 is good

–<= 1/N is bad

•Only really useful to measure scalability of an algorithm with problems requiring a lot more processes than nodes available
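The two definitions above, worked as a short calculation. Here T(1) is the run time on one node and N is the number of nodes; the function names and the timings in the usage note are hypothetical and are not measurements from the Cosmic Cube.

```python
def speedup(t_one, t_n):
    """Speedup: single-node run time divided by N-node run time."""
    return t_one / t_n


def efficiency(t_one, t_n, nodes):
    """Efficiency: speedup divided by the number of nodes used."""
    return speedup(t_one, t_n) / nodes
```

For example, a job taking 64.0 s on one node and 1.25 s on 64 nodes has a speedup of about 51.2 and an efficiency of about 0.8. An efficiency of 1/N means the N-node run was no faster than a single node.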

Measurements

•What affects efficiency? (Overhead)

–Load balancing problems

–Message start-up latency

•Big messages vs. small messages

–Hop latency

–Processor time used in message routing functions
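A rough cost model can show why message start-up latency favors a few big messages over many small ones. The linear model and all parameter names and values below are assumptions for illustration, not figures from the machine.

```python
def message_time(size_bytes, hops, startup, per_hop, per_byte):
    """Assumed model: fixed start-up cost, plus hop cost, plus transfer cost."""
    return startup + hops * per_hop + size_bytes * per_byte


def total_time(total_bytes, num_messages, hops, startup, per_hop, per_byte):
    """Time to move total_bytes split evenly across num_messages messages."""
    size = total_bytes / num_messages
    return num_messages * message_time(size, hops, startup, per_hop, per_byte)
```

With hypothetical costs of 100 units start-up, 10 per hop, and 1 per byte, sending 1024 bytes over 2 hops costs 1144 units as one message but 2944 units as 16 messages, because the start-up and hop costs are paid 16 times.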

Measurements

•Performance

–Some apps achieved max of 3 MIPS in floating-point ops

–Many other apps reached optimal speed-up compared to VAX11/780 with overheads of .025 - .5

•Low message frequency?

Future Work

•Move routing functions to network device

•Experiment with hybrid shared memory approach

•Allow for dynamic load-balancing

•Experiment with more programmer control of process to node assignments

•Try different problem areas to expand message protocol

•Make interface more programmer friendly

Summary

•New programming paradigm required

•Offers lots of advantages in the scientific and engineering problem set

•May be interesting to apply to other domains

•Achieved what appears to be excellent scalability

•Good success in limited domain

Questions? Comments? Snide Remarks?