Slide 1

Memory

MAR

Controlunit

MBR

Figure 1 Data flow, fetch cycle

Address Data Control

Bus Bus Bus

Memory

MAR

Controlunit

MBR

Figure 2 Data flow, indirect cycle

Address Data Control

Bus Bus Bus

It has been decided to go into the increasingly lucrative Sport Utility Vehicle (SUV) manufacturingbusiness. After some intense research, we determined that there are five stages in the SUV buildingprocess, as follows:

Stage 1: build the chassis.

Stage 2: drop the engine in the chassis.

Stage 3: put doors, a hood, and coverings on the chassis.

Stage 4: attach the wheels.

Stage 5: paint the SUV.

After stage one builds the chassis they pass it to stage 2 to insert the engine and

then immediately start to build another chassis. This way all stages are working

all of the time

Code

Comments

A = A +B

Add the contents of registers A and B and place the result in A, overwritingwhatever's there.

The ALU in our simple computer would perform the following series of steps:

1. Read the contents of registers A and B.

2. Add the contents of A and B.

3. Write the result back to register A.

The four stages in a classic RISC pipeline. Here are the four stages in their abbreviated form,the form in which you'll most often see them:

•Fetch

•Decode

•Execute

•Write

• The amount of time that it takes to complete one pipeline stage is exactly one CPUclock cycle.

• Thus a faster clock means that each of the individual pipeline stages take less time.

• In terms of our assembly line analogy, we could imagine that we have a strictlyenforced rule that each crew must take at most one hour and not a minute more tocomplete their work

• If we speed up the clock by a few minutes an hour, then if we shorten each "hour"by 10 mins and the assembly line will move faster and will produce one SUV every50 minutes.

Pipelining and Branching

(i) Fetch instruction (Fl): Read the next expected instruction into a buffer.

(ii)Decode instruction (DI): Determine the opcode and the operand specifiers.

(iii)Calculate operands (CO): Calculate the effective address of each source

operand. This may involve displacement, register indirect, indirect, or other

forms of address calculation.

(vi)Fetch operands (FO): Fetch each operand from memory. Operands inregisters need not be fetched.

(v)Execute instruction (El): Perform the indicated operation and store theresult, if any, in the specified destination operand location.

(vi)Write operand (WO): Store the result in memory.

Time 

Instruction 1

Instruction 2

Instruction 3

Instruction 4

Instruction 5

Instruction 6

Instruction 7

Instruction 8

Instruction 9

Figure 6

Time   Branch penalty 

Instruction 1

Instruction 2

Instruction 3

Instruction 4

Instruction 5

Instruction 6

Instruction 7

Instruction 15

Instruction 16

Superscalar architecture

• The term superscalar is used to describe CPU’s which have multiple executionunits and are capable of completing more than one instruction during each clockcycle.

• The essence of the superscalar approach is the ability to execute instructions indifferent pipelines.

• The concept can be further exploited by allowing instructions to be executed in anorder different from the program order.

•Superscalar processing adds a bit of complexity to the processor'scontrol unit because it's now tasked not only with fetching anddecoding instructions, but with re-ordering the linear instructionstream so that some of its individual instructions could execute inparallel.

•Furthermore, once executed the instructions must be put back in theorder in which they were originally fetched, so that both theprogrammer and the rest of the system have no idea that theinstructions weren't executed in their proper sequence.

Limitations to Superscalar Execution

Three categories of limitations are:

•1. Resource conflicts:- They occur if two or more instructions competefor the same resource (register, memory, functional unit) at the sametime.

– Introducing several parallel pipelined units, superscalararchitectures try to reduce a part of possible resource conflicts.

•2. Control (procedural) dependency:- The presence of branches createsmajor problems in assuring an optimal parallelism.

–Superscalar techniques work most efficiently on to RISCarchitectures, with fixed instruction length and format.

•3. Data conflicts: - Data conflicts are produced by data dependenciesbetween instructions in the program.

–Because superscalar architectures provide a great liberty in theorder in which instructions can be issued and completed, datadependencies have to be considered with much attention.

•This interface consists of a data bus and two types of ports: the read ports andthe write ports. In order to read a value from a single register in the registerfile, the ALU accesses the register file's read port and requests that the datafrom a specific register be placed on the special internal data bus that theregister file shares with the ALU. Likewise, writing to the register file is donethrough the file's write port.

•You can’t execute the 1st 2 instructions in parallel or the 3rd and 4th .

•Out of order execution in a superscalar CPU would normally allow the 1st and3rd instructions to execute concurrently and then the 2nd and 4th instructionscould also execute concurrently.

•However, a data hazard, of sorts, also exists between the first and thirdinstructions since they use the same register – could use different register

•CPU could support an array of EAX registers; e.g EAX[0], EAX[1], EAX[2],etc.and EBX[0]..EBX[n], ECX[0]..ECX[n], etc.

•CPU can automatically choose a different register array element if doing sowould not change the overall computation and doing so could speed up theexecution of the program.

mov( 0, eax );

mov( eax, i );

mov( 50, eax );

mov( eax, j );

mov( 0, eax[0] );

mov( eax[0], i );

mov( 50, eax[1] );

mov( eax[1], j );

Since EAX[0] and EAX[1] are different registers, the CPU can execute the first andthird instructions concurrently. Likewise, the CPU can execute the second and fourthinstructions concurrently

•Very long instruction word (VLIW) allows the compiler or pre-processor tobreak program instruction down into basic operations that can be performedby the processor in parallel

•These operations are put into a very long instruction word which the processorcan then take apart without further analysis, handing each operation to anappropriate functional unit.

Very Long Instruction Word Architecture

::::Desktop:Screen shot 2011-10-10 at 17.02.20.png

VLIW chips combine two or moreinstructions into a single bundleor packet. The compilerprearranges the bundles so theVLIW chip can quickly execute theinstructions in parallel, freeing themicroprocessor from having toperform the complex andcontinual runtime analysis thatsuperscalar RISC and CISC chipsmust do.

The compilers are responsible forarranging the instructions in themost efficient manner.

Macintosh HD:Users:brianthompson:Desktop:Screen shot 2011-10-10 at 16.16.41.png

But even with the bestcompilers, there are limits tohow much parallelism a VLIWprocessor can exploit. Thisprinciple works fine in morespecial processors such asDSPs. These chip perform thesame operations over and overagain. A good RISC or CISCdesign might do just as wellwith the software that mostusers run.

Dealing with branches

One of the major problems in designing an instruction pipeline is assuring the steady flow ofinstructions to the initial stages of the pipeline. One of the main problems is the conditionalbranch. A variety of approaches have been taken for dealing with conditional branches:

Multiple streams

Prefetch branch target

Loop buffer

Branch prediction

Delayed branch

Multiple Streams

A simple pipeline suffers a penalty for a branch instruction because it must choose oneof two instructions to fetch next and may make the wrong choice. A brute-forceapproach is to replicate the initial portions of the pipeline and allow the pipeline to fetchboth instructions, making use of two streams. There are two problems with thisapproach:

(i)With multiple pipelines there are contention delays for access to the registers and tomemory.

(ii)Additional branch instructions may enter the pipeline (either stream) before theoriginal branch decision is resolved. Each such instruction needs an additional stream.

Despite these drawbacks, this strategy can improve performance.

Prefetch Branch Target

When a conditional branch is recognised, the target of the branch is prefetched, in additionto the instruction following the branch. This target is then saved until the branch instructionis executed. If the branch is taken, the target has already been prefetched.

Loop Buffer

A loop buffer is a small, very-high-speed memory maintained by the instruction fetch stage ofthe pipeline and containing the n most recently fetched instructions, in sequence. If a branchis to be taken, the hardware first checks whether the branch target is within the buffer. If so,the next instruction is fetched from the buffer. The loop buffer has three benefits:

1.With the use of prefetching, the loop buffer will contain some instructions sequentially aheadof the current instruction fetch address. Thus, instructions fetched in sequence will beavailable without the usual memory access time.

2.If a branch occurs to a target just a few locations ahead of the address of the branchinstruction, the target will already be in the buffer. This is useful for the rather commonoccurrence of IF-THEN and IF-THEN-ELSE sequences.

3. This strategy is particularly well suited to dealing with loops, or iterations; hence the nameloop buffer. If the loop buffer is large enough to contain all the instructions in a loop, thenthose instructions need to be fetched from memory only once, for the first iteration. Forsubsequent iterations, all the needed instructions are already in the buffer.