In this lab we explain how to configure and use the performance counters available in MIPSfpga, which are described in detail in Section 6.2.47 of the microAptiv Software User's Manual (MD00942). Performance counters are a valuable resource for testing functionality and finding bottlenecks in a system. They allow us to measure different microarchitectural events in a program, such as the number of cycles, instructions, cache accesses or misses, taken branches, stall cycles, and many others. The figure below shows a simple example in which we monitor a program that performs matrix addition, counting the number of cycles and the number of instructions.
CONFIGURE COUNTERS FOR MEASURING # OF CYCLES AND # OF INSTRUCTIONS
for (i=0; i<N; i++) C[i]=A[i]+B[i];
Figure . Example of performance counters use.
1. Performance counters in microAptiv
In this section we describe the performance counters available in microAptiv and explain how we can configure and interact with this resource.
Performance counters available
The microAptiv processor provides two performance counters (PerfCounter-0 and PerfCounter-1). Each one has two CP0 registers associated with it: the control register, used for configuring the counter (selection of the mode, event to measure, etc.), and the counter register, which stores the count of the event (number of cycles, number of instructions, number of branch instructions, etc.) selected by the control register. The specific register is chosen by means of the select number, as defined in Table .
Table . Performance counters available in microAptiv
Register   Select   Name
25         0        Control Register 0
25         1        Counter Register 0
25         2        Control Register 1
25         3        Counter Register 1
The control registers, which are selected with select numbers 0 and 2 for PerfCounter-0 and PerfCounter-1 respectively (as shown in Table ), allow the user to configure a wide range of parameters. Figure illustrates the fields that make up the control register (see the field descriptions in Table 6.55 of the microAptiv Software User's Manual). The Event field (bits 5 to 10) determines the event measured by the performance counter. Table lists the first 24 performance counter events (12 events per counter) available in microAptiv and their encoding (the whole list can be viewed in Table 6.56 of the same manual, and Table 6.57 describes each event in detail). In this lab and in the labs related to the memory system we will mainly use the number of cycles (event 0 in both counters), the number of instructions completed (event 1 in both counters), the number of D$ accesses (PerfCounter-0, event 10), and the number of D$ misses (event 11 in both counters).
Instructions and macros for accessing the performance counters
In this subsection we first describe two instructions available to configure and read the performance counters, and then we explain a more convenient way of working with these instructions by using two macros defined in file …/Toolchains/mips-mti-elf/2015.06-05/mips-mti-elf/include/mips/m32c0.h.
mtc0 rt, rd, sel
Description: Moves the contents of a general purpose register (Rt) to a Coprocessor-0 register, identified by the pair (Rd, Sel)
mfc0 rt, rd, sel
Description: Moves the contents of a Coprocessor-0 register, identified by the pair (Rd, Sel), to a general purpose register (Rt)
Using these instructions directly is one possible way of accessing the performance counter registers. A more convenient option, though, is to use the following macros.
_m32c0_mtc0(reg, sel, value)
Description: Moves value to the Coprocessor-0 register identified by the pair (reg, sel). For instance, the following line configures PerfCounter-1 by writing its control register so that it counts in kernel mode (K field of the control register shown in Figure ) and measures the number of instructions completed (Event field of the control register shown in Figure ).
_m32c0_mtc0($25, 2, (1 << 5) | (1 << 1));
_m32c0_mfc0(reg, sel)
Description: Assigns the value of the Coprocessor-0 register identified by the pair (reg, sel) to a variable. For instance, the following line reads the value of the counter register of PerfCounter-0 into variable cntval.
cntval = _m32c0_mfc0($25, 1);
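The encoding used in the mtc0 example above can be made explicit with a few helpers. The following is only a sketch in portable C: the field positions come from the text (Event in bits 5 to 10, K in bit 1), while the names perfctl_value, perf_control_sel and perf_counter_sel are hypothetical and are not part of m32c0.h.

```c
#include <stdint.h>

/* Field positions taken from the text: the Event field occupies bits 5 to 10
 * and the K (kernel mode) enable is bit 1, as in (1 << 5) | (1 << 1). */
#define PERFCTL_EVENT_SHIFT 5u
#define PERFCTL_EVENT_MASK  0x3Fu      /* six bits: 5..10 */
#define PERFCTL_K           (1u << 1)

/* Hypothetical helper: build the value to write to a control register. */
static uint32_t perfctl_value(uint32_t event, int count_in_kernel_mode)
{
    uint32_t value = (event & PERFCTL_EVENT_MASK) << PERFCTL_EVENT_SHIFT;
    if (count_in_kernel_mode)
        value |= PERFCTL_K;
    return value;
}

/* Select numbers within CP0 register 25 for PerfCounter-n (n = 0 or 1):
 * control registers use selects 0 and 2, counter registers 1 and 3. */
static unsigned perf_control_sel(unsigned n) { return 2u * n; }
static unsigned perf_counter_sel(unsigned n) { return 2u * n + 1u; }
```

With these definitions, perfctl_value(1, 1) yields the same value as (1 << 5) | (1 << 1). Note that mtc0 and mfc0 encode the register and select numbers in the instruction itself, so the macros need compile-time constants there; the select helpers are meant for host-side checks and documentation rather than as macro arguments.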
In this section we describe the skeleton code that you will use in this and forthcoming labs. This code is provided in folder Lab13_PerfCntrs\SimulationSources-Skeleton. Observe first that we have removed all optimization options in the makefile for this skeleton code. Then, open and analyze file main.c.
Two macros for initializing and reading the performance counters are defined: INIT_PERF_COUNTS() and READ_PERF_COUNTS() respectively. Note that, by default, the performance counters are configured to count the number of cycles and the number of instructions completed, using the _m32c0_mtc0 macro (C0_PERFCNT is defined as register $25 in file m32c0.h).
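The skeleton's own definitions are in main.c and are not reproduced here. As a rough sketch only, macros of this kind could look like the following. It assumes kernel-mode counting (K bit) as in the earlier mtc0 example, and it adds a host-side stand-in so that the logic can also be compiled off-target; CP0_WRITE, CP0_READ and fake_perfcnt are hypothetical names introduced for this illustration.

```c
#include <stdint.h>

#ifdef __mips__
#include <mips/m32c0.h>   /* _m32c0_mtc0, _m32c0_mfc0, C0_PERFCNT */
#define CP0_WRITE(sel, val) _m32c0_mtc0(C0_PERFCNT, sel, val)
#define CP0_READ(sel)       _m32c0_mfc0(C0_PERFCNT, sel)
#else
/* Host-side stand-in (assumption): four words model selects 0..3. */
static uint32_t fake_perfcnt[4];
#define CP0_WRITE(sel, val) (fake_perfcnt[(sel)] = (uint32_t)(val))
#define CP0_READ(sel)       (fake_perfcnt[(sel)])
#endif

#define EVENT_SHIFT 5          /* Event field starts at bit 5 */
#define MODE_KERNEL (1u << 1)  /* K bit */

/* Clear both counter registers, then select event 0 (cycles) on
 * PerfCounter-0 and event 1 (instructions completed) on PerfCounter-1. */
#define INIT_PERF_COUNTS() do {                      \
    CP0_WRITE(1, 0);                                 \
    CP0_WRITE(3, 0);                                 \
    CP0_WRITE(0, (0u << EVENT_SHIFT) | MODE_KERNEL); \
    CP0_WRITE(2, (1u << EVENT_SHIFT) | MODE_KERNEL); \
} while (0)

/* Copy the two counter registers into the given lvalues. */
#define READ_PERF_COUNTS(e1, e2) do { \
    (e1) = CP0_READ(1);               \
    (e2) = CP0_READ(3);               \
} while (0)
```

The intended pattern is to call INIT_PERF_COUNTS() immediately before the code under test and READ_PERF_COUNTS(a, b) immediately after it, where a and b are the variables that receive the two counts.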
A struct called test_result_t is defined, which stores the two values measured by the performance counters: event1 and event2.
Then, function main is implemented.
Initially, the code that we want to test is included (you must replace the comment “Place your test here, either in C or in MIPS assembly language” with your program). Note that the performance counters are initialized before that code (macro INIT_PERF_COUNTS()) and read after it (macro READ_PERF_COUNTS(test_result->event1,test_result->event2)).
Then, an infinite while loop shows the results of the performance counter events on the 7-segment displays. For that purpose, function writeValTo7Segs, also implemented in file main.c, is invoked. Depending on the value encoded on the switches, event1, event2 or 0 is displayed. Recall that MIPSfpga does not include support for the 7-segment displays by default, so you first have to modify the MIPSfpga system according to Lab 5.
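The selection made inside the display loop can be sketched as a small pure function. This is an illustration rather than the skeleton's code: the field names follow the text, but the 32-bit field type, the helper name value_for_switches, and the mapping of switch value 0 to event1 and 1 to event2 are assumptions.

```c
#include <stdint.h>

/* Field names follow the text; the 32-bit type is an assumption. */
typedef struct {
    uint32_t event1;   /* first measured event (cycles by default) */
    uint32_t event2;   /* second measured event (instructions completed by default) */
} test_result_t;

/* Hypothetical helper: choose the value to display for a switch setting. */
static uint32_t value_for_switches(uint32_t switches, const test_result_t *r)
{
    switch (switches) {
    case 0:  return r->event1;
    case 1:  return r->event2;
    default: return 0;         /* any other setting shows 0 */
    }
}
```

Keeping the selection separate from the display routine makes it easy to check on a host before running on the board.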
In the following exercises you will use the performance counters for evaluating the behavior of several programs.
Exercise 1. Analytical and experimental study of Exercise 7.33 in Harris and Harris
Exercise 7.33 from Harris and Harris (“Digital Design and Computer Architecture”) asks the student to calculate the CPI of the following code, taking into account the microarchitectural characteristics of the processor explained in that book.
add $s0, $0, $0 # i = 0
add $s1, $0, $0 # sum = 0
addi $t0, $0, 1000 # $t0 = 1000
slt $t1, $s0, $t0 # if (i < 1000), $t1 = 1, else $t1 = 0
Compute analytically the CPI of this program both in the pipelined processor from Harris and Harris and in microAptiv. Recall that there are several differences between microAptiv and the processor from that book. Specifically:
MicroAptiv does not stall due to a RAW dependence between an Arithmetic-Logic instruction and a subsequent beq instruction.
MicroAptiv implements delayed branches.
The Instruction Memory is not ideal in microAptiv, thus, I$ misses introduce some delay.
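For the analytical part, the dynamic instruction count is a natural starting point. The helper below (the name and parameters are illustrative) makes the arithmetic explicit for a loop whose exit test sits at the top, as in the program of this exercise.

```c
#include <stdint.h>

/* Dynamic instruction count of a loop with its exit test at the top:
 *   setup      instructions executed once before the loop
 *   per_iter   instructions executed in each full iteration
 *   iterations number of full iterations
 *   exit_path  instructions executed by the final pass that leaves the loop
 */
static uint64_t dynamic_instructions(uint64_t setup, uint64_t per_iter,
                                     uint64_t iterations, uint64_t exit_path)
{
    return setup + per_iter * iterations + exit_path;
}
```

For this program there are 3 setup instructions, 5 instructions per iteration (slt, beq, add, addi, j), 1000 iterations, and 2 instructions on the exit path (slt, beq), so dynamic_instructions(3, 5, 1000, 2) gives 5005. If a nop occupies the delay slot of each beq and j, the per-iteration count becomes 7 and the exit path 3, which gives 7006. The cycle count then depends on the stalls and penalties of each processor, which is the part left to the analysis.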
Once you have performed the analytical study, solve the same exercise empirically by means of the performance counters, and compare the results of both approaches. You can follow these steps:
Copy the skeleton code (folder Lab13_PerfCntrs\SimulationSources-Skeleton) into a new folder named OriginalCode.
Go into the new folder and open file main.c.
Replace the comment “// Place your test here, either in C or in MIPS assembly language” with the following lines, which define the code that we are going to evaluate:
" add $s0, $0, $0;"
" add $s1, $0, $0;"
" addi $t0, $0, 1000;"
"loop:"
" slt $t1, $s0, $t0;"
" beq $t1, $0, done;"
" add $s1, $s1, $s0;"
" addi $s0, $s0, 1;"
" j loop;"
"done:"
To compile this program, open a shell (e.g., cmd.exe from the Start menu), go into the new folder, and type “make” in the shell (as we said above, the makefile does not use any optimization options). You can see the compiled program in file program.dis, if you look for “:”. Observe that the assembler introduces a nop instruction in the delay slot of every branch/jump instruction. Also note that, when using performance counters, some instructions are added before and after the program as a result of the accounting instructions.
Use the provided bitfile (Lab13_PerfCntrs\mfp_nexys4_ddr.bit) or generate your own bitfile by creating a new project in Vivado as explained in Lab 1 – Step 1. As we warned you in Section 3, it is essential to modify the original MIPSfpga system according to Lab 5, in order to expand the capability of the system so that it can write to the 7-segment displays.
Compile the project as explained in Lab 1 – Step 3: Click on the Generate Bitstream button at the top of the window. Now wait for synthesis, placement, routing, and bitstream generation to complete. This typically takes around 10-20 minutes or more, depending on your computer speed.
Program the FPGA board, as explained in Lab 1 – Step 4: Click on Open Hardware Manager in the Flow Navigator window on the left. Make sure that the Nexys4 DDR FPGA board is turned on and connected to your computer, and click on Open Target → Auto Connect. Finally, click on Program Device → xc7a100t_0, select the bitfile if it is not selected yet, and click on Program.
Download the program to the board using the script loadMIPSfpga.bat as explained in Section 7.5 of the Getting Started Guide. Once the program has executed, setting the switches to 0 or 1 shows the corresponding performance counter result (cycles or instructions completed, respectively) on the 7-segment displays.
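The two values read from the displays can be turned into a measured CPI. Because the accounting instructions themselves add some cycles and instructions, one possible refinement (an assumption of this sketch, not a step of the lab) is to subtract the counts obtained from a run with an empty test body. The type and function names are illustrative.

```c
#include <stdint.h>

/* Counter values read after a run; names are illustrative. */
typedef struct {
    uint32_t cycles;
    uint32_t instructions;
} perf_sample_t;

/* Remove the cost of the measurement code itself, estimated from a run
 * in which the test body is left empty. */
static perf_sample_t net_sample(perf_sample_t run, perf_sample_t empty)
{
    perf_sample_t net;
    net.cycles = run.cycles - empty.cycles;
    net.instructions = run.instructions - empty.instructions;
    return net;
}

/* Cycles per instruction; returns 0.0 when no instructions were counted. */
static double cpi(perf_sample_t s)
{
    if (s.instructions == 0)
        return 0.0;
    return (double)s.cycles / (double)s.instructions;
}
```

Comparing cpi(net_sample(run, empty)) with the analytical CPI shows how much of the difference comes from effects that the analytical model leaves out, such as I$ misses.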
Once you have completed this exercise, manually reorder the program shown above, trying to fill the delay slots with useful instructions instead of nops, and redo the same analytical and empirical analysis. To do so, insert the directive ".set noreorder;" right before your assembly program and the directive ".set reorder;" right after it. The noreorder directive tells the assembler that the programmer is in control, so it must not move instructions around (i.e., the assembler will not insert nop instructions after the branch/jump instructions but will keep the instruction that we place after them).
Finally, use the -O3 optimization option, briefly analyze the assembly program generated by the compiler, and evaluate it in MIPSfpga as explained above, comparing the results with the previous versions.
Exercise 2. Analytical and experimental study of a simple loop
The following program (available in folder SimulationSources_Exercise2) computes the addition of the elements of an array (test_array) into variable Addition. Before executing this program the array is initialized (it is brought into the D$), thus the lw and the sw instructions will never miss (we can assume ideal instruction and data memories in MIPSfpga).
LUI $t0, 0x8000
ADDIU $t0, $t0, test_array
Loop1: BEQ $t3,$t4,OutLoop1
OutLoop1: LUI $t3, 0x8000
ADDIU $t3, $t3, Addition
Compute the CPI for the execution of this program both in the processor from Harris and Harris and in microAptiv. Then, execute the program on the board, and test it without compiler optimizations, with compiler optimizations (for example, the -O3 option), and with manual reordering. Compare and explain all these analyses and experiments.
Observe in file Lab13_PerfCntrs\SimulationSources_Exercise2\main.c that we are using the fast debug channel (using fdc_printf()) in order to test the correctness of the solution.
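When checking correctness, it helps to have an independent reference value to compare with the one printed through the fast debug channel. A minimal sketch in C follows; the function name, the unsigned 32-bit element type and the explicit length parameter are assumptions, since the array declaration is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference result for the array-sum loop. The sum wraps modulo 2^32,
 * like 32-bit register arithmetic without overflow traps. */
static uint32_t reference_sum(const uint32_t *array, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += array[i];
    return sum;
}
```

The value returned for test_array can be compared with the contents of Addition after the assembly loop has run, both before and after reordering the code.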
Exercise 3. Analytical and experimental study of an array computation
The following program (available in folder Lab13_PerfCntrs\SimulationSources_Exercise3) computes the third part of the array of integers test_array based on the first two parts of the same array. Before executing this program the array is initialized (it is brought into the D$), thus the lw and the sw instructions contained in the loop will never miss.
lui $t6, 0x8000;
addiu $t6, $t6, test_array;
LOOP: lw $t2,0($t6);
Evaluate the behavior of this program in the processor from Harris and Harris and in MIPSfpga, first analytically and then experimentally. Test the original program in MIPSfpga with and without compiler optimizations, and then reorder the code to try to optimize performance. Compare and justify the results.
Check with the fast debug channel (using fdc_printf()) if your reordered program obtains the correct solution.
 “MIPS32® microAptiv™ UP Processor Core Family Software User’s Manual -- MD00942”.
 “Digital Design and Computer Architecture”. David Money Harris and Sarah L. Harris. Morgan Kaufmann.