Computer Architecture Quiz Answers
This post collects the quiz answers for the Computer Architecture course.
Offered by "Princeton University"
Week 5
Midterm
1.
Question 1
What Out-Of-Order processor hardware structure can be used to enforce that instructions commit in order?
5 points
Re-order buffer
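As a rough illustration of why a re-order buffer yields in-order commit, the sketch below models it as a FIFO that only releases entries from its head. The class and method names (`ReorderBuffer`, `allocate`, `mark_done`, `commit`) are hypothetical and chosen for this example; they are not taken from the course material.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries are allocated in program order and
    leave only from the head, so commit order equals program order."""

    def __init__(self):
        self.entries = deque()  # program order, oldest entry on the left
        self.done = set()       # ids of instructions that have written back

    def allocate(self, instr_id):
        self.entries.append(instr_id)

    def mark_done(self, instr_id):
        # Writeback may happen in any order.
        self.done.add(instr_id)

    def commit(self):
        """Commit finished instructions at the head; stop at the first
        unfinished one, even if younger instructions are already done."""
        committed = []
        while self.entries and self.entries[0] in self.done:
            committed.append(self.entries.popleft())
        return committed


rob = ReorderBuffer()
for i in range(4):
    rob.allocate(i)

rob.mark_done(2)       # a younger instruction finishes first
first = rob.commit()   # nothing commits: instruction 0 is not done yet
rob.mark_done(0)
rob.mark_done(1)
second = rob.commit()  # instructions 0, 1, 2 commit in program order
print(first, second)
```

Instruction 2 finishes early but has to wait behind instructions 0 and 1, which is the behavior the question asks about.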
========================================================
2.
Question 2
Register renaming is able to overcome which of the following data hazards? Select all that apply.
5 points
- RAW
- RAR
- WAR
- WAW
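To make the hazard types concrete, here is a minimal sketch that classifies hazards by register name before and after renaming. It assumes an unlimited pool of physical registers (`P0`, `P1`, ...), and the helper names `classify_hazards` and `rename` as well as the four-instruction program are made up for illustration.

```python
def classify_hazards(program):
    """Name-based hazard check over every ordered pair of instructions.
    Each instruction is (dest, [sources]). Returns (kind, earlier, later)."""
    hazards = set()
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest_i in srcs_j:
                hazards.add(("RAW", i, j))  # later reads what earlier wrote
            if dest_j in srcs_i:
                hazards.add(("WAR", i, j))  # later overwrites an earlier source
            if dest_j == dest_i:
                hazards.add(("WAW", i, j))  # both write the same register
    return hazards


def rename(program):
    """Give every destination a fresh physical register (assumed unlimited)."""
    mapping = {}  # architectural name -> latest physical name
    renamed = []
    for n, (dest, srcs) in enumerate(program):
        new_srcs = [mapping.get(s, s) for s in srcs]  # read the latest mapping
        mapping[dest] = f"P{n}"
        renamed.append((mapping[dest], new_srcs))
    return renamed


# Hypothetical program, not the quiz's code sequence.
program = [
    ("R1", ["R2", "R3"]),
    ("R4", ["R1", "R5"]),  # RAW on R1
    ("R5", ["R6", "R7"]),  # WAR on R5
    ("R1", ["R8", "R9"]),  # WAW on R1, WAR on R1
]

before = classify_hazards(program)
after = classify_hazards(rename(program))
print(sorted(before))
print(sorted(after))
```

In this example only the true dependence (RAW) survives renaming; the name dependences (WAR and WAW) disappear because every write gets its own register.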
========================================================
3.
Question 3
How many SRAM bits are needed to implement an 8KB two-way set associative cache with 64B block size? Assume that each line (entry) has a single valid bit and no dirty bits. There is one bit per set for true LRU. Assume that the address size of the machine is 32 bits and that the machine allows for byte addressing.
12 points
68288
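The arithmetic behind this answer can be checked with a short sketch. The function name `sram_bits` and its parameters are illustrative; it assumes power-of-two sizes so that the bit counts come out as whole numbers.

```python
def sram_bits(cache_bytes, ways, block_bytes, addr_bits,
              lru_bits_per_set, valid_bits=1, dirty_bits=0):
    """Total SRAM bits for a set-associative cache (data + tag + state)."""
    lines = cache_bytes // block_bytes               # 8192 / 64 = 128 lines
    sets = lines // ways                             # 128 / 2 = 64 sets
    offset_bits = (block_bytes - 1).bit_length()     # log2(64) = 6
    index_bits = (sets - 1).bit_length()             # log2(64) = 6
    tag_bits = addr_bits - index_bits - offset_bits  # 32 - 6 - 6 = 20
    bits_per_line = block_bytes * 8 + tag_bits + valid_bits + dirty_bits
    return lines * bits_per_line + sets * lru_bits_per_set


total = sram_bits(cache_bytes=8 * 1024, ways=2, block_bytes=64,
                  addr_bits=32, lru_bits_per_set=1)
print(total)  # 128 * (512 + 20 + 1) + 64 = 68288
```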
========================================================
4.
Question 4
Which of the following two processors will execute a program with the given instruction mix faster?
Name                          Processor A   Processor B
Frequency                     1 GHz         2 GHz
CPI for ALU Instructions      1             1.5
CPI for Branch Instructions   2             3
CPI for Memory Instructions   1             2
Instruction Mix:
50% ALU Instructions
10% Branch Instructions
40% Memory Instructions
12 points
- Processor A
- Processor B
- Both processors execute the program in the same amount of time
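One way to compare the two processors is to compute the average CPI weighted by the instruction mix and divide by the clock frequency. The sketch below does exactly that with the numbers from the table; the dictionary layout and the function name `time_per_instruction_ns` are assumptions made for this example.

```python
# Instruction mix and per-class CPI values taken from the question.
mix = {"alu": 0.5, "branch": 0.1, "memory": 0.4}
processors = {
    "A": {"freq_hz": 1e9, "cpi": {"alu": 1.0, "branch": 2.0, "memory": 1.0}},
    "B": {"freq_hz": 2e9, "cpi": {"alu": 1.5, "branch": 3.0, "memory": 2.0}},
}


def time_per_instruction_ns(proc):
    """Average time per instruction = weighted CPI / clock frequency."""
    avg_cpi = sum(mix[kind] * proc["cpi"][kind] for kind in mix)
    return avg_cpi / proc["freq_hz"] * 1e9


for name, proc in processors.items():
    print(name, round(time_per_instruction_ns(proc), 3), "ns per instruction")

faster = min(processors, key=lambda n: time_per_instruction_ns(processors[n]))
print("faster:", faster)
```

Processor A averages 1.1 cycles at 1 GHz (1.1 ns per instruction), while Processor B averages 1.85 cycles at 2 GHz (0.925 ns per instruction), so the higher clock rate more than offsets the higher CPI.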
========================================================
5.
Question 5
In a pipelined processor, a single instruction takes the following synchronous exceptions (interrupts): Divide-by-Zero fault and Invalid Opcode. What should the interrupt cause be loaded with?
8 points
- Bad trap
- Invalid opcode
- Divide-by-zero
- TLB miss
- The faulting PC
========================================================
6.
Question 6
Use the following architecture for questions 6-9:
Given a 3-wide in-order processor, draw the optimal pipeline diagram and answer questions 6-9, showing for each instruction what stage of the pipeline it is in for each cycle for the execution of the code sequence below. Assume full bypassing of values from the respective instruction completion stage to the Decode stage. Assume that pipeline X can execute branches and ALU operations, pipeline Y can execute loads, stores, and ALU operations, and pipeline Z can execute loads, stores, and ALU operations.
Code sequence for questions 6-9:
Which instructions stall due to a data hazard? Check all that apply.
4 points
- 4: ADD R14, R11, R15
- 7: LW R22, 4(R19)
- 8: LW R24, 8(R19)
- 10: LW R26, 16(R19)
- 11: OR R11, R26, R18
- 13: ADDIU R16, R17, 3
========================================================
7.
Question 7
Use the following architecture for questions 6-9:
Given a 3-wide in-order processor, draw the optimal pipeline diagram and answer questions 6-9, showing for each instruction what stage of the pipeline it is in for each cycle for the execution of the code sequence below. Assume full bypassing of values from the respective instruction completion stage to the Decode stage. Assume that pipeline X can execute branches and ALU operations, pipeline Y can execute loads, stores, and ALU operations, and pipeline Z can execute loads, stores, and ALU operations. Loads have a latency of two cycles and ALU operations have a latency of one cycle. Branches are resolved in X0 and the machine has no branch delay slots and always predicts the fallthrough path. The machine can fetch three instructions per cycle, decode three instructions per cycle, execute three instructions per cycle, and write back three instructions per cycle, but maintains data dependencies. The operand steering logic can steer any operand to any ALU to enable any instruction to reach any pipeline, but the pipelines have restrictions on what instructions each can execute as described above. Assume that there are no alignment restrictions on instructions which can be simultaneously fetched from the instruction memory. Also, assume that instructions stall in the decode stage if there are structural or data hazards and that stalling one pipeline does not inhibit the fetching of future instructions. The figure below shows the pipeline with pipeline stage names underlined.
Code sequence for questions 6-9:
Which instructions stall due to a structural hazard? Select all that apply.
4 points
- 4: ADD R14, R11, R15
- 7: LW R22, 4(R19)
- 9: LW R25, 12(R19)
- 11: OR R11, R26, R18
- 12: AND R13, R17, R29
========================================================
8.
Question 8
Code sequence for questions 6-9:
Which instructions stall in the fetch stage? Select all that apply
4 points
- 6: AND R19, R20, R21
- 7: LW R22, 4(R19)
- 11: OR R11, R26, R18
- 12: AND R13, R17, R29
- 13: ADDIU R16, R17, 3
========================================================
9.
Question 9
Code sequence for questions 6-9:
Which instructions stall in the decode stage? Select all that apply
4 points
- 4: ADD R14, R11, R15
- 6: AND R19, R20, R21
- 7: LW R22, 4(R19)
- 9: LW R25, 12(R19)
- 10: LW R26, 16(R19)
- 11: OR R11, R26, R18
========================================================
10.
Question 10
Use the following architecture for questions 10-14:
Draw the optimal pipeline diagram for the following code executing on the IO3 processor from lecture as shown below and answer questions 10-14. The IO3 processor fetches instructions in-order, issues instructions out-of-order, writes back results out-of-order, and commits instructions out-of-order. Assume the processor can fetch one instruction per cycle, decode one instruction per cycle, issue one instruction per cycle, and write back one result per cycle. Assume full bypassing of values from the respective instruction completion stage to the Decode stage. Assume that pipeline X can execute branches and ALU operations, pipeline M can execute loads and stores, and pipeline Y can execute multiply operations. Loads have a latency of two cycles and ALU operations have a latency of one cycle. Branches are resolved in X0 and the machine has no branch delay slots and always predicts the fallthrough path. Multiply instructions have a latency of four cycles. Use the named pipeline stages in the figure for your pipeline diagram. The register file has only one write port. Use a lower-case 'i' to denote that an instruction enters the issue queue but does not immediately issue. Assume that the issue queue can hold 16 instructions and begins empty.
Code sequence for questions 10-17:
Which instructions stall in the issue queue (IQ)? Select all that apply.
4 points
- 3: MUL R5, R1, R4
- 4: MUL R7, R5, R6
- 5: ADDIU R18, R11, 1
- 6: ADDIU R14, R18, 1
- 7: ADDIU R13, R18, 2
========================================================
11.
Question 11
Code sequence for questions 10-17:
Of those that stall in the issue queue (IQ), which instructions stall for at least one cycle due to a structural hazard? Select all that apply.
4 points
- 3: MUL R5, R1, R4
- 4: MUL R7, R5, R6
- 5: ADDIU R18, R11, 1
- 6: ADDIU R14, R18, 1
- 7: ADDIU R13, R18, 2
========================================================
12.
Question 12
Code sequence for questions 10-17:
On what cycle does Instruction 4 write back its results into R7? (Assume that Instruction 0 is fetched on cycle 0 and writes back on cycle 4, Instruction 1 is fetched on cycle 1 and writes back on cycle 5, etc.)
4 points
14
========================================================
13.
Question 13
Code sequence for questions 10-17:
On what cycle does Instruction 6 write back its results into R14? (Assume that Instruction 0 is fetched on cycle 0 and writes back on cycle 4, Instruction 1 is fetched on cycle 1 and writes back on cycle 5, etc.)
4 points
12
========================================================
14.
Question 14
Code Sequence for Questions 10-17:
Would adding register renaming logic enable faster completion of the code sequence used in Questions 10-17 on an IO3 architecture?
5 points
- Yes
- No
========================================================
15.
Question 15
Use the following architecture for questions 15-17:
Draw the optimal pipeline diagram for the following code executing on the IO2I processor from lecture as shown below. The IO2I processor fetches instructions in-order, issues instructions out-of-order, writes back results out-of-order, and commits instructions in-order. Assume the processor can fetch one instruction per cycle, decode one instruction per cycle, issue one instruction per cycle, write back one result per cycle, and commit one instruction per cycle. Assume full bypassing of values from the respective instruction completion stage to the Decode stage. Assume that pipeline X can execute branches and ALU operations, pipeline L executes loads, pipeline S executes stores, and pipeline Y can execute multiply operations. Loads have a latency of two cycles and ALU operations have a latency of one cycle. Branches are resolved in X0 and the machine has no branch delay slots and always predicts the fallthrough path. Multiply instructions have a latency of four cycles. Use the named pipeline stages in the figure for your pipeline diagram. The register file has only one write port. Use a lower-case 'i' to denote that an instruction enters the issue queue but does not immediately issue. Use a lower-case 'r' to denote that an instruction enters the reorder buffer but does not immediately commit. Assume that the issue queue can hold 16 instructions and begins empty.
Code Sequence for Questions 10-17:
Which instructions spend multiple cycles waiting to commit after being written back into the ROB? Select all that apply
5 points
- 3: MUL R5, R1, R4
- 4: MUL R7, R5, R6
- 5: ADDIU R18, R11, 1
- 6: ADDIU R14, R18, 1
- 7: ADDIU R13, R18, 2
========================================================
16.
Question 16
Code Sequence for Questions 10-17:
On what cycle does the final instruction commit? (Assume that Instruction 0 is fetched on cycle 0 and writes back on cycle 4, Instruction 1 is fetched on cycle 1 and writes back on cycle 5, etc.)
5 points
18
========================================================
17.
Question 17
Code Sequence for Questions 10-17:
Would adding register renaming logic enable faster completion of the code sequence used in Questions 10-17 on an IO2I architecture?
1 point
- Yes
- No
========================================================
18.
Question 18
The following code is to be executed on a processor with 32 architectural registers. The processor is able to issue instructions out-of-order. The processor is a single-issue machine. The processor has different functional unit latencies, with multiply instructions having a latency of 4 cycles, ALU operations having a latency of 1 cycle, and loads and stores having a latency of 2 cycles. The processor stalls on WAW and WAR dependencies. Pretend that you are the compiler and make changes to the following code to increase its performance when executing on this out-of-order processor. Assume that all registers not used are free to be used by the compiler.
Problem 18 Code Sequence:
Which of the following code sequences would increase the performance of the code on this OoO processor? Select all that apply.
10 points
Option 1:
MUL R5, R6, R7
ADD R8, R5, R6
MUL R18, R13, R8
SW R12, 0(R18)
SUB R18, R6, R4
MUL R17, R18, R15
ADDIU R15, R5, 1

Option 2:
MUL R5, R6, R7
ADD R8, R5, R6
MUL R18, R13, R8
SW R12, 0(R18)
SUB R10, R6, R4
MUL R17, R10, R15
ADDIU R19, R5, 1

Option 3:
MUL R5, R6, R7
ADD R8, R5, R6
MUL R10, R13, R8
SW R12, 0(R10)
SUB R18, R6, R4
MUL R17, R18, R15
ADDIU R19, R5, 1

Option 4:
MUL R19, R6, R7
ADD R8, R19, R6
MUL R10, R13, R8
SW R12, 0(R10)
SUB R10, R6, R4
MUL R17, R10, R15
ADDIU R15, R19, 1
Week 11
Final Exam
1.
Question 1
Please check all answers that apply to Branch Target Buffer (BTB).
5 points
- The BTB is indexed by the current PC.
- The BTB allows the fetch stage of the pipeline to predict the address of the next instruction.
- The fetch stage with BTB needs to decode the instruction before predicting address.
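For readers who want to see what PC-indexed prediction looks like, here is a minimal sketch of a direct-mapped branch target buffer. The table size of 16 entries, the 4-byte instruction size, and the class and method names are all assumptions chosen for illustration, not details from the course.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: looked up with the PC alone during fetch."""

    def __init__(self, entries=16, instr_bytes=4):
        self.entries = entries
        self.instr_bytes = instr_bytes
        self.table = [None] * entries  # each slot: (branch_pc, target) or None

    def _index(self, pc):
        return (pc // self.instr_bytes) % self.entries

    def predict(self, pc):
        """Return the predicted next fetch address using only the PC."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]            # hit: predicted taken target
        return pc + self.instr_bytes  # miss: fall through to the next PC

    def update(self, pc, target):
        """Record the target once a taken branch at pc has resolved."""
        self.table[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer()
miss = btb.predict(0x40)     # no entry yet, so predict PC + 4
btb.update(0x40, 0x100)      # branch at 0x40 resolved as taken to 0x100
hit = btb.predict(0x40)      # now predicts the stored target
alias = btb.predict(0x80)    # same index as 0x40, but the tag differs
print(hex(miss), hex(hit), hex(alias))
```

Note that `predict` never inspects the instruction bits: the lookup uses only the PC, which is what lets it run in the fetch stage.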
========================================================
2.
Question 2
For Problems 2 through 8, use the following Code Sequence:
The above program sums the odd and even numbers in an array and outputs them to two registers. Conceptually, what values are kept in R2 and R3? What do R5 and R6 contain when the loop exits?
3 points
- R2 contains the loop iteration variable
- R3 contains the pointer that points to the different elements in the array
- R5 contains the sum of the even numbers in the array
- R6 contains the sum of the odd numbers in the array
- R2 contains the pointer that points to the different elements in the array
========================================================
3.
Question 3
For Problems 2 through 8, use the following Code Sequence:
In the table above, how many actual outcomes are NT when the predicted outcome is T? (Include rows 1 and 2.)
2 points
5
========================================================
4.
Question 4
For Problems 2 through 8, use the following Code Sequence:
How many not-taken (NT) entries are in the predicted outcome column? (Include rows 1 and 2.)
2 points
3
========================================================
5.
Question 5
For Problems 2 through 8, use the following Code Sequence:
Choose the correct answer for rows 15 and 16.
3 points
========================================================
6.
Question 6
For Problems 2 through 8, use the following Code Sequence:
What is the branch prediction accuracy for the above code execution for branch b_1? Put your answer in the form of a decimal rounded to the nearest 0.01.
1 point
0.25
========================================================
7.
Question 7
For Problems 2 through 8, use the following Code Sequence:
What is the branch prediction accuracy for the above code execution for branch b_2? Put your answer in the form of a decimal rounded to the nearest 0.01.
1 point
0.75
========================================================
8.
Question 8
For Problems 2 through 8, use the following Code Sequence:
The architecture used in the previous questions has a 1-cycle branch mispredict penalty and no penalty for correctly predicted branches. Jumps are always predicted correctly. Assuming that the above architecture is modified to support partial predication (conditional move with both move-if-zero and move-if-not-zero), the above code can be changed to:
Would it be fruitful to transform branch b_1 with partial predication?
5 points
- Yes
- No
- Yes, but only when branch prediction accuracy for b_1 is above 50%
- Yes, but only when branch prediction accuracy for b_1 is greater than or equal to the value found in question 6
========================================================
9.
Question 9
For questions 9 through 12, consider the following VLIW architecture:
The VLIW processor has 6 total functional units. The VLIW processor has two integer (ALU) units (X, Y), two multiply units (M, N), one load unit (L) and one store unit (S). Assume that branches and jumps must execute in the Y pipeline and that branches have no delay slots. The processor has full bypassing but no scoreboard (the VLIW processor does not stall on data dependencies). ALU instructions have a latency of 1 (result can be used next cycle), multiply instructions have a latency of 2 cycles, loads have a latency of 2 cycles, and stores have a latency of 2 cycles.
Questions 9 through 12 will ask about the resulting schedule of the following task:
Optimally schedule and bundle the following (sequential) code for this processor using the Equals (EQ) scheduling model. Unroll the code where appropriate and show prolog and epilog code. Perform code motion and register renaming where appropriate. Note that the loop has a fixed number of iterations. Also, achieve peak performance while minimizing instruction storage. As a secondary constraint, minimize register usage.
Code sequence for questions 9 through 12:
What is the loop unrolling factor for the optimal scheduling of this code?
3 points
1
========================================================
10.
Question 10
Code sequence for questions 9 through 12:
Which of the following is a valid loop for the optimally scheduled code?
3 points
========================================================
11.
Question 11
Code sequence for questions 9 through 12:
Which of the following is a valid prologue for the optimally scheduled code?
3 points
========================================================
12.
Question 12
Code sequence for questions 9 through 12:
Which of the following is a valid epilogue for the optimally scheduled code? Assume that the prologue is the one chosen in question 11.
3 points
========================================================
13.
Question 13
For questions 13 through 15, consider the following vector processor architecture:
The 3-lane VMIPS vector processor is shown below. The vector processor fetches instructions in-order, issues instructions in-order, and writes back results out-of-order. Assume the processor can fetch one instruction per cycle and decode one instruction per cycle. Assume that the processor has two read ports per lane and one write port per lane for a total of 6 read ports and 3 write ports. Assume that the processor has a scoreboard to detect hazards. Assume that the processor can only bypass values through the register file and that values written to the register file in one cycle are not available until the next cycle. Assume that pipeline X can execute branches and ALU operations, pipeline L can execute loads, pipeline S can execute stores, and pipeline Y can execute all multiply operations. Loads have a latency of two cycles and ALU operations have a latency of one cycle. Branches are resolved in X0 and the machine has no branch delay slots and always predicts the fallthrough path. Multiply instructions have a latency of four cycles. Use the named pipeline stages in the figure for your pipeline diagram. The processor has a dedicated register read stage, denoted as "R" in the figure.
[Figure showing Three Lane VMIPS Processor Pipeline]
Questions 13 through 15 will ask about the resulting pipeline diagram of the following task:
Draw the optimal pipeline diagram for the vector portion of the following code executing on this vector processor.
Problem 13 through 15 Code Sequence:
Which instructions stall due to a data hazard? Check all that apply.
========================================================
14.
Question 14
Problem 13 through 15 Code Sequence:
Which instructions stall in the fetch stage due to a structural hazard? Check all that apply.
========================================================
15.
Question 15
How many cycles does it take to execute the code above?
4 points
Enter answer here
22 |
========================================================
16.
Question 16
What problem does vector stripmining solve? How is the Vector Length Register (VLR) involved with stripmining? Please check all that apply.
5 points
- VLR is set to the maximum for most of the iterations of the loop.
- Vector stripmining helps with loop execution speed.
- Vector stripmining allows loops larger than the Vector Length Register (VLR) to execute.
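The stripmining idea behind this question can be sketched in Python. This is only an illustrative model, not VMIPS code: the function name stripmine_lengths and the maximum vector length of 64 are assumptions made for the example. It lists the value the VLR would hold on each pass over a loop of n elements, with the odd-sized remainder handled first and every later pass running at the maximum length.

```python
def stripmine_lengths(n, mvl=64):
    """Return the VLR value used on each strip of an n-element loop.

    mvl is the hardware maximum vector length (64 is an assumed value).
    """
    lengths = []
    remainder = n % mvl
    if remainder:
        # The first strip handles the leftover elements with a short VLR.
        lengths.append(remainder)
    # Every remaining strip runs with VLR set to the maximum.
    lengths.extend([mvl] * (n // mvl))
    return lengths


print(stripmine_lengths(200))
```

For a 200-element loop this gives one short strip followed by three full-length strips, which is why the VLR sits at its maximum for most iterations.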
========================================================
17.
Question 17
For questions 17 through 19, consider the following instruction sequence table:
There are four processors executing the code as interleaved below. Assume a 128-byte cache line (block size). Assume all cores contain a direct mapped data cache of size 2KB.
Questions 17 through 19 will ask about the resulting state table of the following task:
Show for each cache line and per cache what state it is in on every cycle for the code above. You do not need to show lines which are invalid in the cache, but you should remove them from the table when the lines transition to the Invalid state. Assume that the caches implement a snoopy, bus-based, cache coherence protocol implementing the MESI protocol. MESI stands for the four states, Modified, Exclusive, Shared, and Invalid. Assume all lines are Invalid in all caches at the beginning of execution. Be sure to duplicate the information from time step to following time step if the data stays valid in the cache and show the state that the data is in.
Which of the following tables shows the correct state at time 9?
4 points
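As an aid for filling in the state table, here is a minimal Python sketch of the MESI transitions for a single cache line. It is a simplification under stated assumptions: the class MesiLine and its methods are hypothetical names, data values and bus timing are ignored, and replacement in the direct-mapped cache is modeled only by an explicit evict call.

```python
class MesiLine:
    """MESI state of one cache line across several snooping caches."""

    def __init__(self, num_caches):
        self.state = ["I"] * num_caches

    def read(self, cpu):
        if self.state[cpu] != "I":
            return  # read hit: M, E, and S all satisfy a load locally
        others_have_copy = False
        for other, st in enumerate(self.state):
            if other != cpu and st != "I":
                # A snooping holder downgrades to S (an M holder also writes back).
                self.state[other] = "S"
                others_have_copy = True
        self.state[cpu] = "S" if others_have_copy else "E"

    def write(self, cpu):
        # Any write leaves the writer in M and every other copy invalid.
        for other in range(len(self.state)):
            if other != cpu:
                self.state[other] = "I"
        self.state[cpu] = "M"

    def evict(self, cpu):
        # Conflict replacement: the line leaves this cache (M would write back).
        self.state[cpu] = "I"


line = MesiLine(4)
line.read(0)
line.read(1)
line.write(1)
print(line.state)
```

Running one MesiLine per 128-byte block and calling evict whenever a different block maps to the same index of the 2KB cache reproduces the kind of per-cycle table the question asks for.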
========================================================
18.
Question 18
Which of the following tables shows the correct state at time 11?
4 points
========================================================
19.
Question 19
At which of the following times does a cache conflict miss happen? Check all that apply.
========================================================
20.
Question 20
For questions 20 through 27, consider the following processor spec:
You are designing a processor with a 64KB data cache that is four-way set associative with a 128-byte block size and a not-most recently used (NMRU) replacement policy. This processor has a virtual memory system with a software managed TLB. Virtual addresses on the machine are 32-bits and physical addresses are 32-bits. The TLB contains 16 entries and is 8-way set associative. Assume that the architecture flushes the TLB and cache on process swap. Assume that the page size of the machine is 16KB. Assume that each page table entry requires a valid bit and a dirty bit. Likewise, each TLB entry contains a valid bit and a dirty bit. Assume that the architecture is byte addressable. Assume that the cache is virtually indexed and physically tagged. Assume that the TLB uses 3 bits to implement a NMRU replacement policy.
For the above described processor, what is the bit width of the tag for the data cache?
1 point
Enter answer here
18 |
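The arithmetic behind the cache field widths can be checked with a short Python sketch. The helper name cache_fields is an assumption made for illustration; the inputs are taken from the processor spec above (64KB, four-way, 128-byte blocks, 32-bit addresses).

```python
def cache_fields(cache_bytes, ways, block_bytes, addr_bits):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache."""
    sets = cache_bytes // (ways * block_bytes)
    # Sizes are powers of two, so bit_length() - 1 equals log2 exactly.
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


print(cache_fields(64 * 1024, 4, 128, 32))
```

The index and block offset together take 14 bits, the same as the page offset of a 16KB page, so the virtual index lies entirely within the untranslated part of the address.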
========================================================
21.
Question 21
For questions 20 through 27, consider the processor spec given in Question 20.
For the above described processor, what is the bit width of the index for the data cache?
2 points
Enter answer here
7 |
========================================================
22.
Question 22
For questions 20 through 27, consider the processor spec given in Question 20.
For the above described processor, what is the size in bytes of the data field for the data cache?
2 points
Enter answer here
128 |
========================================================
23.
Question 23
For questions 20 through 27, consider the processor spec given in Question 20.
For the above described processor, what is the bit width of the page offset for the TLB?
2 points
Enter answer here
14 |
========================================================
24.
Question 24
For questions 20 through 27, consider the processor spec given in Question 20.
For the above described processor, what is the bit width of the physical page number for the TLB?
2 points
Enter answer here
18 |
========================================================
25.
Question 25
For questions 20 through 27, consider the processor spec given in Question 20.
For the above described processor, what is the total TLB data payload width in bits (not including the tag, valid bit, or NMRU bits)?
2 points
Enter answer here
19 |
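The TLB answers for the page offset, physical page number, and payload width can be reproduced with the following Python sketch. The helper name tlb_fields is hypothetical, and treating the payload as the physical page number plus the single dirty bit is an interpretation of the question that matches the answer given, not something stated in the spec.

```python
def tlb_fields(phys_addr_bits, page_bytes, status_bits):
    """Return (page_offset_bits, ppn_bits, payload_bits) for a TLB entry."""
    # The page size is a power of two, so bit_length() - 1 equals log2 exactly.
    page_offset_bits = page_bytes.bit_length() - 1
    ppn_bits = phys_addr_bits - page_offset_bits
    # Assumed payload: physical page number plus status bits (here the dirty bit).
    payload_bits = ppn_bits + status_bits
    return page_offset_bits, ppn_bits, payload_bits


print(tlb_fields(32, 16 * 1024, 1))
```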
========================================================
26.
Question 26
For questions 20 through 27, consider the processor spec given in Question 20.
How many levels does this page table structure have?
========================================================
27.
Question 27
For questions 20 through 27, consider the processor spec given in Question 20.
How many bits are in each level? Check all that apply.
2 points
========================================================
28.
Question 28
Which of the following code sequences exhibits false sharing?
Assume that the code is executing on a shared memory multiprocessor system where the cache block size is 32 bytes and each load/store handles 4 bytes. Assume the cache is direct mapped with 32 indices.
5 points
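A quick way to test a pair of accesses for false sharing is to compare their block and word numbers. The Python sketch below uses the 32-byte block and 4-byte access sizes from the question; the function name and the example addresses are made up for illustration. It only checks the address condition. Real false sharing also requires the two accesses to come from different processors, with at least one of them being a store.

```python
BLOCK_BYTES = 32  # cache block size from the question
WORD_BYTES = 4    # bytes touched by each load/store


def falsely_shared(addr_a, addr_b):
    """True if two accesses hit the same cache block but different words."""
    same_block = addr_a // BLOCK_BYTES == addr_b // BLOCK_BYTES
    same_word = addr_a // WORD_BYTES == addr_b // WORD_BYTES
    return same_block and not same_word


print(falsely_shared(0x100, 0x104))  # same block, different words
print(falsely_shared(0x100, 0x120))  # different blocks
```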
========================================================
29.
Question 29
Two threads are executing on two independent processors. The data stored at address ‘p’ is initialized to the value 5 and the data stored at address ‘q’ is initialized to the value 6 before the threads begin executing. Note that SW R0, 0(q) stores the value zero at address ‘q’. After both threads complete execution, is the state where the data stored at address ‘p’ contains the value 31 and the data stored at address ‘q’ contains the value 10 a sequentially consistent execution?
5 points
- Yes, it is a sequentially consistent execution.
- No, it is not a sequentially consistent execution.
========================================================
30.
Question 30
For questions 30 through 32, consider the following code:
For proper operation, does the code above need locking?
2 points
- Yes
- No
========================================================
31.
Question 31
For questions 30 through 32, consider the following code:
If locking is needed, where should we insert the function AcquireLock(&lock)? Check all that apply.
2 points
========================================================
32.
Question 32
For questions 30 through 32, consider the following code:
If locking is needed, where should we insert the function ReleaseLock(&lock)? Check all that apply.
3 points
========================================================
33.
Question 33
What is the bisection bandwidth of the following network? Give your answer rounded to the nearest 0.1 Gbps.
A 4-node shared multi-drop bus which is 64 bits wide and clocked at 200 MHz.
2 points
Enter answer here
12.8 |
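The bus figure follows from width times clock rate, since any cut of a shared bus is crossed by the single bus itself. A minimal Python check, with a hypothetical helper name:

```python
def bus_bisection_gbps(width_bits, clock_mhz):
    """Bisection bandwidth of a shared bus: only one transfer crosses per cycle."""
    return width_bits * clock_mhz / 1000  # Mbps -> Gbps


print(bus_bisection_gbps(64, 200))
```

The node count does not enter the formula, because all four nodes share the same wires.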
========================================================
34.
Question 34
What is the bisection bandwidth of the following network? Give your answer rounded to the nearest 0.1 Gbps.
A 2-ary 3-cube where each link is 16 bits wide, with 8 bits in each direction, and the links are clocked at 100 MHz.
3 points
Enter answer here
6.4 |
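The cube answer can be reproduced with a short Python sketch under two assumptions that match the answer given: the 2-ary 3-cube is treated as a 3-dimensional hypercube with 8 nodes, and both directions of every cut link are counted (the full 16 bits). The function name is hypothetical.

```python
def hypercube_bisection_gbps(dimensions, link_bits, clock_mhz):
    """Bisection bandwidth of a 2-ary n-cube, counting both link directions."""
    nodes = 2 ** dimensions
    # Splitting the cube in half along one dimension cuts one link per node pair.
    cut_links = nodes // 2
    return cut_links * link_bits * clock_mhz / 1000  # Mbps -> Gbps


print(hypercube_bisection_gbps(3, 16, 100))
```

Four links cross the cut, each carrying 16 bits per cycle at 100 MHz.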