

**STAM Center**  
SECURE, TRUSTED, AND ASSURED MICROELECTRONICS

**ASU** Arizona State University **Engineering**  
Arizona State University

**CSE 520**  
**Computer Architecture II**

CPU Performance Evaluation

Prof. Michel A. Kinsky

---



**Performance Measurement**

- Processor performance:
  - Execution time
  - Area
  - Logic complexity
  - Power

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

- In this class we will focus on Execution time

---



**Datapath for Memory Instructions**

- Should program and data memory be separate?
  - Harvard style: separate (Aiken and Mark 1 influence)
    - Read-only program memory
    - Read/write data memory
  - Princeton style: the same (von Neumann's influence)
    - Single read/write memory for program and data
    - Executing a Load or Store instruction requires accessing the memory more than once

---



**Two-State Controller**

- In the Princeton microarchitecture, a flip-flop can be used to remember the current phase (instruction fetch or execute)

---

**Hardwired Controller**

---

**Clock Period**

- Princeton architecture (fetch and execute in separate cycles)
  - $t_{C-Princeton} > \max \{t_M, t_{RF} + t_{ALU} + t_M + t_{WB}\}$
  - which simplifies to $t_{C-Princeton} > t_{RF} + t_{ALU} + t_M + t_{WB}$
- Hardwired Harvard architecture (fetch and execute in the same cycle)
  - $t_{C-Harvard} > t_M + t_{RF} + t_{ALU} + t_M + t_{WB}$
- Which will execute instructions faster?

---

### Clock Rate vs CPI

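- Suppose $t_M \gg t_{RF} + t_{ALU} + t_{WB}$
  - Then $t_{C-Princeton} \approx t_M$ and $t_{C-Harvard} \approx 2\,t_M$, so $t_{C-Princeton} = 0.5 \times t_{C-Harvard}$
  - $CPI_{Princeton} = 2$
  - $CPI_{Harvard} = 1$
- No difference in performance: time per instruction is $CPI \times t_C$, and $2 \times 0.5\,t_{C-Harvard} = 1 \times t_{C-Harvard}$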
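
The clock-period bounds and CPI values above can be checked numerically. The following is a minimal sketch: the function names and the picosecond delays are assumptions chosen for illustration (memory-dominated, as the slide supposes), not values from the course.

```python
def princeton_clock(t_m, t_rf, t_alu, t_wb):
    """Lower bound on the clock period of the two-phase Princeton design."""
    fetch_phase = t_m                          # phase 1: instruction fetch
    execute_phase = t_rf + t_alu + t_m + t_wb  # phase 2: execute (may access memory)
    return max(fetch_phase, execute_phase)


def harvard_clock(t_m, t_rf, t_alu, t_wb):
    """Lower bound on the clock period of the single-cycle Harvard design."""
    return t_m + t_rf + t_alu + t_m + t_wb


def time_per_instruction(cpi, clock_period):
    """Average time per instruction = CPI x clock period."""
    return cpi * clock_period


# Hypothetical delays in picoseconds (assumed, memory access dominates).
delays = dict(t_m=1000, t_rf=10, t_alu=20, t_wb=10)

t_princeton = time_per_instruction(2, princeton_clock(**delays))
t_harvard = time_per_instruction(1, harvard_clock(**delays))
print(t_princeton, t_harvard)  # 2080 2040
```

With these assumed delays the two designs differ by about 2%, and the gap shrinks toward zero as the memory delay grows relative to the other stages.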

---



### Princeton Microarchitecture

- Can we overlap instruction fetch and execute?

- Only one of the phases is active in any cycle
  - A lot of datapath is not in use at any given time

---



- Compiler Effects on Performance
  - CPU time = Instruction count $\times$ CPI / Clock rate
  - For compiler 1:
    - $CPI_1 = (5 \times 1 + 1 \times 2 + 1 \times 3) / (5 + 1 + 1) = 10 / 7 = 1.43$
    - $CPU\ time_1 = ((50 + 10 + 10) \times 10^6 \times 1.43) / (100 \times 10^6) = 1\ second$
  - For compiler 2:
    - $CPI_2 = (10 \times 1 + 1 \times 2 + 1 \times 3) / (10 + 1 + 1) = 15 / 12 = 1.25$
    - $CPU\ time_2 = ((100 + 10 + 10) \times 10^6 \times 1.25) / (100 \times 10^6) = 1.5\ seconds$
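
The two compiler cases can be reproduced with a short script. This is a sketch under stated assumptions: the helper name `cpu_time` is hypothetical, and the 100 MHz clock rate is the value that the stated results of 1 and 1.5 seconds imply.

```python
def cpu_time(instruction_counts, class_cpis, clock_rate_hz):
    """Return (average CPI, CPU time in seconds) for an instruction mix."""
    total_cycles = sum(n * cpi for n, cpi in zip(instruction_counts, class_cpis))
    total_instructions = sum(instruction_counts)
    return total_cycles / total_instructions, total_cycles / clock_rate_hz


CLASS_CPIS = (1, 2, 3)         # cycles per instruction for the three classes
CLOCK_RATE_HZ = 100_000_000    # assumed 100 MHz clock

cpi_1, time_1 = cpu_time((50_000_000, 10_000_000, 10_000_000), CLASS_CPIS, CLOCK_RATE_HZ)
cpi_2, time_2 = cpu_time((100_000_000, 10_000_000, 10_000_000), CLASS_CPIS, CLOCK_RATE_HZ)

print(round(cpi_1, 2), time_1)  # 1.43 1.0
print(round(cpi_2, 2), time_2)  # 1.25 1.5
```

Compiler 2 achieves the lower CPI but the longer CPU time, because it executes more instructions overall.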

---

**Processor Performance**

- Speedup equations for pipelining

$$CPI_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline Depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{Unpipelined}}}{\text{Clock Cycle}_{\text{Pipelined}}}$$

- If Ideal CPI = 1
  - Speedup $\leq$ Pipeline Depth

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{Unpipelined}}}{\text{Clock Cycle}_{\text{Pipelined}}}$$

---

## Illustrative Example

- We want to compare the performance of two machines. Which machine is faster?
  - Machine A: dual-ported memory, so there are no memory stalls
  - Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Assumptions
  - Ideal CPI = 1 for both
  - Loads are 40% of instructions executed
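
One way to work this comparison is to apply the pipelining speedup equation to both machines. The sketch below makes two assumptions that the slide does not state: each load on the single-ported machine costs one stall cycle, and both machines share the same pipeline depth (the value 5 is arbitrary, since it cancels in the ratio). The function name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x clock ratio."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5                # assumed pipeline depth, cancels in the final ratio
LOAD_FRACTION = 0.4      # loads are 40% of executed instructions
STALLS_PER_LOAD = 1      # assumption: one stall cycle per load on machine B

speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
speedup_b = pipeline_speedup(
    DEPTH,
    stall_cpi=LOAD_FRACTION * STALLS_PER_LOAD,
    clock_ratio=1.05,    # machine B's clock is 1.05 times faster
)

ratio = speedup_a / speedup_b
print(round(ratio, 2))  # 1.33
```

Under these assumptions machine A comes out about 1.33 times faster: the stalls on machine B cost more than its 5% clock advantage recovers.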

---

## Amdahl's Law

- By Gene Amdahl
- This law answers the critical question:
  - How much speedup can one get from a given architectural improvement/enhancement?
  - The performance enhancement possible from a given design improvement is limited by the fraction of time the improved feature is used
- Performance improvement, or speedup, due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$

- Suppose that enhancement E accelerates a fraction F of the execution time by a factor S, and the remainder of the time is unaffected. Then:
  - Execution Time with E = $((1-F) + F/S) \times$ Execution Time without E
  - Hence the speedup is given by:

$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{1}{(1 - F) + F/S}$$

---

### Amdahl's Law

- For the RISC machine with the following instruction composition:

| Op     | Freq | Cycles | CPI(i) | % Time |
|--------|------|--------|--------|--------|
| ALU    | 50%  | 1      | .5     | 23%    |
| Load   | 20%  | 5      | 1.0    | 45%    |
| Store  | 10%  | 3      | .3     | 14%    |
| Branch | 20%  | 2      | .4     | 18%    |

- If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?


- Fraction enhanced = $F = 45\%$ or $.45$ (the share of execution time spent in loads)
- Unaffected fraction = $1 - F = 100\% - 45\% = 55\%$ or $.55$
- Factor of enhancement = $S = 5/2 = 2.5$

$$\text{Speedup}(E) = \frac{1}{(1 - F) + F/S} = \frac{1}{.55 + .45/2.5} = 1.37$$
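
The calculation above is a direct application of the Amdahl's Law formula. A minimal sketch follows; the function name `amdahl_speedup` is an assumption introduced for illustration.

```python
def amdahl_speedup(fraction_enhanced, enhancement_factor):
    """Overall speedup when a fraction F of execution time is sped up by a factor S."""
    unaffected = 1.0 - fraction_enhanced
    return 1.0 / (unaffected + fraction_enhanced / enhancement_factor)


# Loads account for 45% of execution time and become 5/2 = 2.5 times faster.
load_speedup = amdahl_speedup(0.45, 5 / 2)
print(round(load_speedup, 2))  # 1.37
```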


- Equivalently, compare the average CPI before and after the enhancement:

$$\text{Old CPI} = .5 \times 1 + .2 \times 5 + .1 \times 3 + .2 \times 2 = 2.2$$

$$\text{New CPI} = .5 \times 1 + .2 \times 2 + .1 \times 3 + .2 \times 2 = 1.6$$

$$\text{Speedup}(E) = \frac{\text{Original Execution Time}}{\text{New Execution Time}} = \frac{\text{Instruction count} \times \text{old CPI} \times \text{clock cycle}}{\text{Instruction count} \times \text{new CPI} \times \text{clock cycle}}$$

$$= \frac{\text{old CPI}}{\text{new CPI}} = \frac{2.2}{1.6} \approx 1.37$$
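
The CPI-based route can be sketched the same way, using the frequencies and cycle counts from the table. The dictionary layout and the helper name `average_cpi` are assumptions made for this illustration.

```python
def average_cpi(mix):
    """Weighted CPI: sum of frequency x cycles over all instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


# (frequency, cycles) per instruction class, taken from the table.
old_mix = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}
new_mix = dict(old_mix, Load=(0.2, 2))  # the enhancement cuts load CPI from 5 to 2

old_cpi = average_cpi(old_mix)
new_cpi = average_cpi(new_mix)

# Instruction count and clock cycle are unchanged, so they cancel in the ratio.
speedup = old_cpi / new_cpi
```

The ratio evaluates to about 1.375, in line with the 1.37 obtained from the Amdahl's Law formula using the rounded fraction of 45%.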

---

### Amdahl's Law

- A program takes 100 seconds to execute on a machine with load operations responsible for 80 seconds of this time. By how much must the load operation be improved to make the program four times faster?

$$\text{Desired speedup} = 4 = \frac{100}{\text{Execution Time with enhancement}}$$

$$\text{Execution time with enhancement} = 100 \times (1/4) = 25 \text{ seconds}$$

$$\begin{aligned} \rightarrow 25 \text{ seconds} &= (100 - 80 \text{ seconds}) + 80 \text{ seconds} / n \\ \rightarrow 25 \text{ seconds} &= 20 \text{ seconds} + 80 \text{ seconds} / n \\ \rightarrow 5 \text{ seconds} &= 80 \text{ seconds} / n \\ \rightarrow n &= 80/5 = 16 \end{aligned}$$

- The load operation must be 16 times faster to achieve a speedup of 4!

---

### Amdahl's Law

- A program takes 100 seconds to execute on a machine with load operations responsible for 80 seconds of this time. By how much must the load operation be improved to make the program five times faster?

$$\text{Desired speedup} = 5 = \frac{100}{\text{Execution Time with enhancement}}$$

Execution time with enhancement =  $100 * (1/5) = 20$  seconds

$$\begin{aligned} \rightarrow 20 \text{ seconds} &= (100 - 80 \text{ seconds}) + 80 \text{ seconds} / n \\ \rightarrow 20 \text{ seconds} &= 20 \text{ seconds} + 80 \text{ seconds} / n \\ \rightarrow 0 &= 80 \text{ seconds} / n \end{aligned}$$

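- No amount of load operation improvement can achieve this speedup: even if loads took zero time, the remaining 20 seconds cap the speedup at $100/20 = 5$, which is reached only in the limit $n \rightarrow \infty$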
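
Both versions of this exercise solve the same equation for $n$. The sketch below works in seconds, as the slides do; the function name `required_factor` and the choice to return `None` for an unreachable target are assumptions made for illustration.

```python
def required_factor(total_time, affected_time, target_speedup):
    """Factor by which the affected part must improve to reach target_speedup.

    Returns None when the target cannot be reached by any finite improvement.
    """
    target_time = total_time / target_speedup
    unaffected_time = total_time - affected_time
    budget = target_time - unaffected_time  # time left over for the improved part
    if budget <= 0:
        return None
    return affected_time / budget


print(required_factor(100, 80, 4))  # 16.0
print(required_factor(100, 80, 5))  # None
```

The second call returns `None` because the unaffected 20 seconds already use up the entire 20-second target.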

---

### Multiple Enhancements

- Suppose that enhancement $E_i$ accelerates a fraction $F_i$ of the execution time by a factor $S_i$, that the enhanced fractions do not overlap, and that the remainder of the time is unaffected. Then:

$$\text{Speedup} = \frac{\text{Original Execution Time}}{\left( (1 - \sum_i F_i) + \sum_i \frac{F_i}{S_i} \right) \times \text{Original Execution Time}}$$

$$\text{Speedup} = \frac{1}{\left( (1 - \sum_i F_i) + \sum_i \frac{F_i}{S_i} \right)}$$


---

### Multiple Enhancements

- Three CPU performance enhancements are proposed with the following speedups and percentages of the code execution time affected:

| Enhancement | Speedup $S_i$ | Fraction of execution time $F_i$ |
|-------------|---------------|----------------------------------|
| 1           | 10            | 20%                              |
| 2           | 15            | 15%                              |
| 3           | 30            | 10%                              |

- While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time.
- What is the resulting overall speedup?

$$\text{Speedup} = \frac{1}{\left( (1 - \sum_i F_i) + \sum_i \frac{F_i}{S_i} \right)}$$

$$\begin{aligned} \text{Speedup} &= 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30] \\ &= 1 / [.55 + .0333] \\ &= 1 / .5833 = 1.71 \end{aligned}$$
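
The multiple-enhancement formula generalizes to any number of non-overlapping enhancements. The sketch below represents each enhancement as a `(fraction, factor)` pair; the function name `multi_speedup` and the input check are assumptions made for illustration.

```python
def multi_speedup(enhancements):
    """Overall speedup for non-overlapping (fraction F_i, factor S_i) pairs."""
    enhanced_fraction = sum(f for f, _ in enhancements)
    if enhanced_fraction > 1.0:
        raise ValueError("enhanced fractions must not exceed 100% of execution time")
    enhanced_time = sum(f / s for f, s in enhancements)
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_time)


overall = multi_speedup([(0.20, 10), (0.15, 15), (0.10, 30)])
print(round(overall, 2))  # 1.71
```

The unaffected 55% of execution time dominates the denominator, which is why three large individual speedups yield an overall speedup below 2.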

---

### Amdahl's Law

- **Key Insights**
  - The performance of any system is constrained by the speed or capacity of the slowest point
  - The impact of an effort to improve the performance of a program is primarily constrained by the amount of time that the program spends in parts of the program NOT TARGETED by the effort
  - Amdahl's Law is a statement of the maximum theoretical speed-up you can ever hope to achieve
  - The actual speed-ups are always less than the speed-up predicted by Amdahl's Law

---

### Amdahl's Law

- Software and hardware engineers MUST have a very deep understanding of Amdahl's Law if they are to avoid unrealistic performance expectations
  - For systems folks: this law allows you to estimate the net performance benefit a new hardware feature will add to program executions
  - For software folks: this law allows you to estimate the amount of parallelism your program/algorithm can achieve before you start writing your parallel code

---




### CPU Performance

- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and cycle time
    - Determined by CPU hardware
- Longest delay determines clock period
  - Critical path: load instruction

---




### CPU Performance

- Longest delay determines clock period
  - Critical path: load instruction
    - Instruction memory
    - Register file read
    - ALU operation
    - Data memory access
    - Register file writeback
- Performance can be improved by pipelining

---




Next Learning Module

- Branch Prediction
