## 科目: 計算機系統(5009) 校系所組:中大通訊工程學系乙組 ## 清大電機工程學系丙組 (5%) We are interested in two implementations of a machine, one with and one without special floating-point hardware. Consider a program *P*, with the following mix of operations: floating-point multiply 10% floating-point add 15% floating-point divide 5% integer instructions 70% Machine MFP (Machine with Floating Point) has floating-point hardware and can therefore implement the floating point operations directly. It requires the following number of clock cycles for each instruction class: floating-point multiply 6 floating-point add 4 floating-point divide 20 integer instructions 2 Machine MNFP (Machine with No Floating Point) has no floating-point hardware and so must emulate the floating-point operations using integer instructions. The integer instructions all take 2 clock cycles. The number of integer instructions needed to implement each of the floating-point operations is as follows: floating-point multiply 30 floating-point add 20 floating-point divide 50 Both machines have a clock rate of 1000 MHz. Please find the native MIPS (Million Instructions per Second) ratings for both machines. - \_\_ · (15%) Answer the following questions briefly. (Mark you answers clearly) - (-) Consider the following MIPS instruction that loads an unsigned byte from the memory to a register, **Ibu \$s1**, 100(\$s2). Its meaning can be translated into an assignment statement: \$s1 = Memory[x], where x is an expression. What is x? (5%) - ( $\square$ ) Consider the addition of two 8-bit 2's-complement integers. Assuming that the sum needs to be fit into an 8-bit binary representation, we will say that there is an overflow when the sum is larger than max. Then, what is the value of max in its decimal form? (5%) - (≡) What does MIPS instruction srl mean? Give you answer in one sentence. (5%) - = `(15%) Consider a sequence of MIPS instructions that can be used to discover overflow when adding two signed integers stored in registers \$t1 and \$t2. In this sequence, A and C denote two instructions' names and B denotes an operand. What are A, B, and C, respectively? (Note: instruction slt performs set-on-less-than operation). (Mark you answers clearly) ``` addu $t0, $t1, $t2 // $t0 =sum of $1 and $2, but don't trap $t3, $t1, $t2 // An instruction useful to check if two operands' signs are different slt $t3, $t3, $zero // Set $t3=1 if two operands' signs are indeed different <u>C</u>____ \$t3, \underline{\textbf{\textit{B}}}, \quad NO \ overflow // if two operands' signs are different, go to No_overflow $t3, $t0, $t1 A // if two operands' signs are the same; sign of sum matches too? slt $t3, $t3, $zero // Set 13 = 1 if the sum's sign differs from that of 12 <u>C</u> $t3, B, Overflow // if overflow condition is indeed satisfied; go to Overflow ``` 注:背面有試題 科目: 計算機系統(5009) 校系所組:中大通訊工程學系乙組 清大電機工程學系丙組 (10%) During the execution of the following fragment of C code for (i = 0; i < 100; i = i + 1) { a[i] = a[i] + 5;}, the first array element is loaded from memory, added with constant 5, and then stored back to memory. This process is repeated until all elements are executed. Assume a is an array of words residing in contiguous memory locations and a[0] is the head word of a cache block. Assume only one level of cache is used and none of the array elements are in the cache initially. Also assume the cache is much larger than the size of the array. - (—)If the cache block size is 2 words, list the first 5 array elements that their accesses may exhibit cold misses. Briefly explain your answer. (Note: you will not receive any score if you do not briefly explain your answer.) - (二) Do you expect to see improvement in execution time if the cache block size is increased from 2 words to 4 words? Again, briefly explain your answer. - 五、(25%) Assume the following instruction frequencies: 24% loads, 10% stores, 10% branches, 2% jumps, and 54% ALU instructions. - (-)(5%) (For multicycle implementation) Assume the number of clock cycles for each instruction class is the following: Loads: 5 Stores: 4 ALU instructions: 4 Branches: 3 Jumps: 3 What is the CPI for this implementation? - (二) (10%) (For pipelined implementation) Assume a MIPS 5-stage pipeline with forwarding is used. Assume half of the load instructions are immediately followed by an instruction that uses the result and the load-use penalty is 1 cycle. Assume branch delay penalty on misprediction is one clock cycle, and that one-quarter of the branches are mispredicted. Also assume jumps always pay 1 full clock cycle of delay, so their average time is 2 clock cycles. Ignore cache misses and any other hazards. What is the CPI for this pipelined design? - ( $\equiv$ ) (10%) (Effect of cache misses) We have assumed no cache misses at all in ( $\cong$ ). However, in reality, we can never have a perfect cache. Cache misses can stall the pipeline further. Assume only one level of cache is used and the cache miss penalty is 40 cycles. Assume an instruction miss rate of 2% and a data access miss rate of 10%. Compute the additional CPI caused by cache misses. 注:背面有試題 科目:計算機系統(5009) 校系所組:中大通訊工程學系乙組 清大電機工程學系丙組 - $\dot{\uparrow}$ $\dot{\uparrow}$ (20%) Assume that, in MIPS processor, the times required to perform the operations of memory access, ALU operations, and register access are 150 picoseconds (ps), 80 ps, and 50 ps, respectively. And the functions of instruction execution can be divided into 5 units as instruction fetch (IF), decode and register read (ID), ALU operation (EX), data memory access (MEM), and register write back (WB). There are five instruction classes, which are R-type (ALU instructions), load, store, conditional branch, and unconditional jump, and the execution of each instruction class needs to perform part or the whole of the above functional units. Assume the following instruction mix: 30% ALU instruction, 20% load, 20% store, 20% conditional branch, and 10% unconditional jump and answer the following questions. - (--) (5%) Please specify which functional units are required for the execution of each instruction class. - (二) (5%) What are the CPI and average execution time per instruction for the single-cycle and fixed clock cycle length implementation approach? - (三) (5%) What are the CPI and average execution time per instruction for the single-cycle but variable-length clock approach (i.e., the length of clock cycle is only as long as the instruction needs to be)? - ([4]) (5%) What are the CPI and average execution time per instruction for the multi-cycle (not pipeline), where each functional unit requires one clock cycle, and fixed clock cycle length approach? - 七、(10%) Answer the following questions by considering a pipeline processor with 5 stages as instruction fetch (IF), register reading and instruction decode (ID), execution (EX), memory access (MEM), and register writing back (WB). - (-) (5%) Explain why the stall can not be solved if an instruction tries to read a register following a load instruction that writes the same register? - (二) (5%) What is the 2-bit prediction scheme used for? And please describe its operation briefly.