Parallel Programming in C with MPI and OpenMP

SIMD and Associative Computational Models

Parallel & Distributed Algorithms

Part I: SIMD Model

Flynn’s Taxonomy

•The best known classification scheme for parallel computers.

•Classifies computers by the parallelism they exhibit in their

–Instruction streams

–Data streams

•A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream)

•The instruction stream (I) and the data stream (D) can be either single (S) or multiple (M)

•Four combinations: SISD, SIMD, MISD, MIMD

Flynn’s Taxonomy (cont.)

•SISD

–Single Instruction Stream, Single Data Stream

–Most important member is a sequential computer

–Some argue that other models are included as well.

•SIMD

–Single Instruction Stream, Multiple Data Streams

–One of the two most important in Flynn’s Taxonomy

•MISD

–Multiple Instruction Streams, Single Data Stream

–Relatively unused terminology. Some argue that this includes pipeline computing.

•MIMD

–Multiple Instruction Streams, Multiple Data Streams

–An important classification in Flynn’s Taxonomy

The SIMD Computer & Model

Consists of two types of processors:

•A front-end or control unit

–Stores a copy of the program

–Has a program control unit to execute program

–Broadcasts parallel program instructions to the array of processors.

•An array of simple processors that are functionally more like ALUs.

–Do not store a copy of the program or have a program control unit.

–Execute, in parallel, the commands sent by the front end.

SIMD (cont.)

•On a memory access, all active processors must access the same location in their local memory.

•All active processors execute the same instruction synchronously, but on different data

•The sequence of different data items is often referred to as a vector.
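
The model above can be sketched as a sequential C simulation. All names here (`PE`, `broadcast_add`, `NUM_PES`, `LOCAL_WORDS`) are hypothetical and chosen only for illustration. A real SIMD performs the loop body on all PEs in one lockstep step rather than iterating.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PES 8      /* assumed array size, for illustration only */
#define LOCAL_WORDS 4  /* assumed words of local memory per PE */

/* One processing element: local memory plus an activity flag. */
typedef struct {
    int mem[LOCAL_WORDS];
    int active;  /* nonzero if the PE takes part in the current step */
} PE;

/*
 * The front end broadcasts "mem[dst] = mem[src1] + mem[src2]".
 * Every active PE uses the same local addresses but its own data.
 * The loop stands in for what the hardware does in one lockstep step.
 */
void broadcast_add(PE pes[], size_t n, size_t dst, size_t src1, size_t src2)
{
    assert(dst < LOCAL_WORDS && src1 < LOCAL_WORDS && src2 < LOCAL_WORDS);
    for (size_t i = 0; i < n; i++) {
        if (pes[i].active) {
            pes[i].mem[dst] = pes[i].mem[src1] + pes[i].mem[src2];
        }
    }
}
```

Word `k` of every PE's local memory, taken together, plays the role of one vector: calling `broadcast_add(pes, NUM_PES, 2, 0, 1)` adds "vector 0" to "vector 1" and stores the result in "vector 2" on the active PEs only.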

Alternate Names for SIMDs

•Recall that all active processors of a SIMD computer must simultaneously access the same memory location.

•The value in the i-th processor can be viewed as the i-th component of a vector.

•SIMD machines are sometimes called vector computers [Jordan et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.

Alternate Names (cont.)

•In particular, Quinn, the textbook for this course, calls a SIMD a processor array.

•Quinn and a few others also consider a pipelined vector processor to be a SIMD

–This is a somewhat non-standard use of the term.

–An example is the Cray-1

How to View a SIMD Machine

•Think of soldiers all in a unit.

•A commander selects certain soldiers as active – for example, every even-numbered row.

•The commander barks out an order that all the active soldiers should do and they execute the order synchronously.

SIMD Execution Style

–Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel.

–When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel.

–Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.

SIMD Computers

•SIMD computers that focus on vector operations

–Support some vector and possibly matrix operations in hardware

–Usually limit or provide less support for non-vector type operations involving data in the “vector components”.

•General purpose SIMD computers

–Support more traditional type operations (e.g., other than for vector/matrix data types).

–Usually also provide some vector and possibly matrix operations in hardware.

Possible Architecture for a Generic SIMD

Interconnection Networks for SIMDs

•No specific interconnection network is specified.

•The 2D mesh has been used more frequently than others.

•Even hybrid networks (e.g., cube-connected cycles) have been used.

Example of a 2-D Processor Interconnection Network in a SIMD

Each VLSI chip has 16 processing elements.

Each PE can simultaneously send a value to a specific neighbor (e.g., its left neighbor).

PE = processing element
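
The simultaneous send to a left neighbor can be sketched in C as a shift on a small mesh. The mesh size, the name `shift_left`, and the handling of the rightmost column (no wraparound, filled with a caller-supplied value) are assumptions made for this sketch, not properties of any particular machine.

```c
#include <assert.h>
#include <string.h>

#define ROWS 4  /* assumed mesh dimensions, for illustration only */
#define COLS 4

/*
 * Every PE sends its value to its left neighbor in the same step.
 * A temporary copy models the synchronous exchange: all sends read
 * the old values before any PE's value is overwritten.
 * Assumption: no wraparound; the rightmost column receives `fill`.
 */
void shift_left(int mesh[ROWS][COLS], int fill)
{
    int old[ROWS][COLS];

    assert(mesh != NULL);
    memcpy(old, mesh, sizeof old);
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            mesh[r][c] = (c + 1 < COLS) ? old[r][c + 1] : fill;
        }
    }
}
```

Because the destination is fixed by the instruction itself, no PE needs to attach routing information to the value it sends.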

SIMD Execution Style

•The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]):

–The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit.

–The front end is a general purpose CPU that stores the program and the data that is not manipulated in parallel.

–The front end normally executes the sequential portions of the program.

–Each processing element has a local memory that cannot be directly accessed by the host or other processing elements.

Masking on Processor Arrays

•All the processors work in lockstep except those that are masked out (by setting a mask register).

•The parallel if-then-else is frequently used in SIMDs to set masks:

–Every active processor tests to see if its data meets the negation of the boolean condition.

–If it does, it sets its mask bit so those processors will not participate in the operation initially.

–Next, the unmasked processors execute the THEN part.

–Afterwards, the mask bits (for the original set of active processors) are flipped and the unmasked processors perform the ELSE part.

•Note: differs from the sequential version of “If”

if (COND) then A else B
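
The masking steps above can be sketched as a sequential C simulation. The condition (`x < 0`) and the two branch bodies are hypothetical and chosen only for illustration, and the sketch assumes every PE is active on entry. Each `for` loop stands in for one lockstep step across all PEs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Parallel if-then-else on n PEs, simulated sequentially.
 * Hypothetical example: if (x < 0) then x = -x else x = 2 * x.
 * mask[i] == 1 means PE i is masked out (idle) in the current phase.
 * Assumption: all n PEs are active on entry.
 */
void parallel_if_then_else(int x[], unsigned char mask[], size_t n)
{
    assert(x != NULL && mask != NULL);

    /* Each PE tests the negation of the condition and sets its mask bit. */
    for (size_t i = 0; i < n; i++)
        mask[i] = !(x[i] < 0);

    /* THEN phase: only the unmasked PEs execute. */
    for (size_t i = 0; i < n; i++)
        if (!mask[i]) x[i] = -x[i];

    /* Flip the mask bits, then run the ELSE phase on the newly unmasked PEs. */
    for (size_t i = 0; i < n; i++)
        mask[i] = !mask[i];
    for (size_t i = 0; i < n; i++)
        if (!mask[i]) x[i] = 2 * x[i];
}
```

Unlike the sequential "if", both the THEN and the ELSE parts are issued, one after the other. A PE does useful work in only one of the two phases.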

Data Parallelism (A strength for SIMDs)

•All tasks (or processors) apply the same set of operations to different data.

•Example:

    for i ← 0 to 99 do
        a[i] ← b[i] + c[i]
    endfor

•Accomplished on SIMDs by having all active processors execute the operations synchronously

•MIMDs can also handle data parallel execution, but must synchronize more frequently.
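
The loop in the example can be written in C as follows. The function name `vector_add` is hypothetical. The code runs sequentially and only illustrates why the loop is data parallel: no iteration reads a value that another iteration writes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * a[i] = b[i] + c[i] for i = 0 .. n-1.
 * Iteration i plays the role of PE i; no iteration depends on another,
 * so a SIMD machine could perform all of them in one lockstep step.
 */
void vector_add(int a[], const int b[], const int c[], size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

With one element per PE, the 100 additions of the example would take a single parallel step instead of 100 sequential ones.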

Functional/Control/Job Parallelism (A Strictly-MIMD Paradigm)

•Independent tasks apply different operations to different data elements

    a ← 2
    b ← 3
    m ← (a + b) / 2
    s ← (a² + b²) / 2
    v ← s − m²

•First and second statements execute concurrently

•Third and fourth statements execute concurrently
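
The dependences among the five statements can be made explicit in a short C sketch. The code is sequential; the comments mark which statements are independent and could run concurrently on a MIMD. The names `Stats` and `compute_stats` are hypothetical.

```c
#include <assert.h>

/* Results of the five statements; field names follow the slide. */
typedef struct { double a, b, m, s, v; } Stats;

Stats compute_stats(void)
{
    Stats r;

    /* Stage 1: independent of each other, so they could run concurrently. */
    r.a = 2.0;
    r.b = 3.0;

    /* Stage 2: both need a and b, but neither needs the other. */
    r.m = (r.a + r.b) / 2.0;
    r.s = (r.a * r.a + r.b * r.b) / 2.0;

    /* Stage 3: needs both s and m, so it must wait for stage 2. */
    r.v = r.s - r.m * r.m;

    assert(r.v >= 0.0);  /* v is the variance of {a, b}, never negative */
    return r;
}
```

For a = 2 and b = 3 this gives m = 2.5, s = 6.5, and v = 6.5 − 6.25 = 0.25. The two statements within a stage are different operations, which is why this form of parallelism does not map onto a single instruction stream.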

SIMD Machines

•An early SIMD computer designed for vector and matrix processing was the Illiac IV computer

–built at the University of Illinois

–See Jordan et al., pg 7

•The MPP, DAP, the Connection Machines CM-1 and CM-2, MasPar MP-1 and MP-2 are examples of SIMD computers

–See Akl pg 8-12 and [Quinn, 94]

SIMD Machines

•Quinn [1994, pg 63-67] discusses the CM-2 Connection Machine and a smaller & updated CM-200.

•Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO

–ASPRO is a small second generation STARAN used by the Navy in spy planes.

•Professor Batcher is best known architecturally for the MPP, which is at the Smithsonian Institution & currently displayed at a D.C. airport.

Today’s SIMDs

•Many SIMDs are being embedded in SISD machines.

•Others are being built as part of hybrid architectures.

•Others are being built as special purpose machines, although some of them could be classified as general purpose.

•Much of the recent work with SIMD architectures is proprietary.

A Company Building Inexpensive SIMD

WorldScape is producing a COTS (commodity off-the-shelf) SIMD

•Not a traditional SIMD, as

–The PEs are full-fledged CPUs

–The hardware doesn’t synchronize every step.

•Hardware design supports efficient synchronization

•Their machine is programmed like a SIMD.

•The U.S. Navy has observed that their machines process radar data an order of magnitude faster than others.

•There is quite a bit of information about their work at http://www.wscape.com

An Example of a Hybrid SIMD

•Embedded Massively Parallel Accelerators

–Systola 1024: PC add-on board with 1024 processors

–Fuzion 150: 1536 processors on a single chip

–Other accelerators: DeCypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

(This and next three slides are due to Prabhakar R. Gudla (U of Maryland) at a CMSC 838T Presentation, 4/23/2003.)

Hybrid Architecture

•Combines the SIMD and MIMD paradigms within a parallel architecture → Hybrid Computer

[Figure: hybrid architecture; labels: high-speed Myrinet switch, Systola 1024]

Architecture of Systola 1024

•Instruction Systolic Array:

–32 × 32 mesh of processing elements

–wavefront instruction execution

SIMDs Embedded in SISDs

•Intel's Pentium 4 includes what they call MMX technology to gain a significant performance boost

•IBM and Motorola incorporated the technology into their G4 PowerPC chip in what they call their Velocity Engine.

•Both MMX technology and the Velocity Engine are the chip manufacturers' names for their proprietary SIMD processors and parallel extensions to their operating code.

•This same approach is used by NVidia and Evans & Sutherland to dramatically accelerate graphics rendering.

Special Purpose SIMDs in the Bioinformatics Arena

•Paracel

–Acquired by Celera Genomics in 2000

–Products include the sequence supercomputer GeneMatcher, which has a high throughput sequence analysis capability

•Supported over a million processors earlier

–GeneMatcher was used by Celera in their race with the U.S. government to complete the description of the human genome sequence

•TimeLogic, Inc

–Has DeCypher, a reconfigurable SIMD

Advantages of SIMDs

•Reference: [Roosta, pg 10]

•Less hardware than MIMDs as they have only one control unit.

–Control units are complex.

•Less memory needed than MIMD

–Only one copy of the instructions needs to be stored

–Allows more data to be stored in memory.

•Less startup time in communicating between PEs.

Advantages of SIMDs

•Single instruction stream and synchronization of PEs make SIMD applications easier to program, understand, & debug.

–Similar to sequential programming

•Control flow operations and scalar operations can be executed on the control unit while PEs are executing other instructions.

•MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.

Advantages of SIMDs

•During a communication operation between PEs,

–PEs send data to a neighboring PE in parallel and in lock step

–No need to create a header with routing information as “routing” is determined by program steps.

–The entire communication operation is executed synchronously

–A tight (worst case) upper bound for the time for this operation can be computed.

•Less complex hardware in SIMD since no message decoder is needed in PEs

–MIMDs need a message decoder in each PE.

SIMD Shortcomings (with some rebuttals)

•Claims are from our textbook by Quinn.

–Similar statements are found in one of our “primary reference books”, by Grama et al. [13].

•Claim 1: Not all problems are data-parallel

–While true, most problems seem to have data parallel solutions.

–In [Fox et al.], the observation was made in their study of large parallel applications that most were data parallel by nature, but often had points where significant branching occurred.

SIMD Shortcomings (with some rebuttals)

•Claim 2: Speed drops for conditionally executed branches

–Processors in both MIMD & SIMD normally have to do a significant amount of ‘condition’ testing

–MIMD processors can execute multiple branches concurrently.

–For an if-then-else statement with execution times for the “then” and “else” parts being roughly equal, about ½ of the SIMD processors are idle during its execution

•With additional branching, the average number of inactive processors can become even higher.

•With SIMDs, only one of these branches can be executed at a time.

•This reason justifies the study of multiple SIMDs (or MSIMDs).
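
The "about ½ idle" figure can be checked with a small cost model in C. The model and the name `simd_branch_utilization` are assumptions made for this sketch: the SIMD runs the THEN and ELSE parts one after the other, each PE is busy only during its own part, and the cost of setting masks is ignored.

```c
#include <assert.h>

/*
 * Fraction of PE-time spent on useful work during one if-then-else
 * on a SIMD, under the simple model described in the lead-in.
 * n: active PEs on entry; k: PEs that take the THEN branch;
 * t_then, t_else: execution times of the two parts (any consistent unit).
 * Elapsed time is t_then + t_else because the parts run in sequence.
 */
double simd_branch_utilization(int n, int k, double t_then, double t_else)
{
    assert(n > 0 && k >= 0 && k <= n);
    assert(t_then >= 0.0 && t_else >= 0.0 && t_then + t_else > 0.0);

    double useful = k * t_then + (n - k) * t_else;
    double total = n * (t_then + t_else);
    return useful / total;
}
```

When `t_then` equals `t_else`, the result is 0.5 for any split `k`, which matches the claim. Under this model a MIMD with the same inputs would finish in max(t_then, t_else) instead of the sum.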

SIMD Shortcomings (with some rebuttals)

•Claim 2 (cont.): Speed drops for conditionally executed code

–In [Fox et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8.

–The cost of the extremely simple processors used in a SIMD is extremely low

•Programmers used to worry about ‘full utilization of memory’ but stopped this after memory cost became insignificant overall.

SIMD Shortcomings (with some rebuttals)

•Claim 3: Don’t adapt to multiple users well.

–This is true to some degree for all parallel computers.

–If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by ‘sharing’

–This reason also justifies the study of multiple SIMDs (or MSIMDs).

–SIMD architecture has not received the attention that MIMD has received and can greatly benefit from further research.

SIMD Shortcomings (with some rebuttals)

•Claim 4: Do not scale down well to “starter” systems that are affordable.

–This point is arguable and its ‘truth’ is likely to vary rapidly over time

–WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC.

SIMD Shortcomings (with some rebuttals)

Claim 5: Requires customized VLSI for processors, and the expense of control units has dropped

•Reliance on COTS (commodity off-the-shelf) parts has dropped the price of MIMDs

•Expense of PCs (with control units) has dropped significantly

•However, reliance on COTS has fueled the success of the ‘low level parallelism’ provided by clusters and has restricted new innovative parallel architecture research for well over a decade.

SIMD Shortcomings (with some rebuttals)

Claim 5 (cont.)

•There is strong evidence that the period of continual dramatic increases in speed of PCs and clusters is ending.

•Continued rapid increases in parallel performance in the future will be necessary in order to solve important problems that are beyond our current capabilities

•Additionally, with the appearance of the very economical COTS SIMDs, this claim no longer appears to be relevant.