Writing a CPU Simulator in AWK

2020-09-25

/posts/awk_cpu/ map[name:Charles Daniels]

Table of Contents

Recently, I implemented a RISC-V simulator and assembler in AWK: awk-riscv. This started as a joke, evolved into wanting to prove a point (namely, that AWK being a Turing-complete language, can compute anything which is compute-able) , and ended up being a lot of fun. I will provide a narrative overview of the riscv.awk program, and also discuss more broadly the basics of writing a CPU simulator.

Even if you don’t know AWK, but want to learn about CPU simulators, you’ll probably be able to follow this article. AWK is an interpreted, record-oriented language with a C-like syntax, and I think you will agree that most of the code is quite readable to anyone with experience in C-like languages. Writing a CPU simulator is a great learning experience, and I think every programmer should do it at least once.

This article does assume basic familiarity with computer processors, and with basic assembly programming. If you have worked with MIPS or ARM in the past, you should have little trouble with RV32I.

NOTE: At time of writing, the current commit hash of the awk-riscv project is ce30c930be0f157a14442b738993e54d8d0dc52f. If you are reading this in the future, you might want to roll back to that commit.

# Anatomy of a CPU Simulator

CPU simulators are quite simple at their core, consisting of two major components:

One to store and interact with the state of the CPU at a given instant.
One to calculate the next state of the CPU based on the current state.

The time between the current state and the next state constitutes one “clock cycle” in hardware terms. When implementing a processor as a digital logic circuit, this will be limited by the propagation delay of the slowest path through the circuit (critical path). In a software simulator, our limitation is simply how fast our code can run.

When writing a cycle-accurate, pipelined processor, one must be careful to store various values of intermediary registers and update them in the proper order – simulating a digital logic circuit where the values of all output signals are calculated simultaneously by different circuit paths (in a typical programming language where each statement happens in sequence, one after the other) can prove very challenging.

Fortunately for us, awk-riscv and this article is only an ISA (Instruction Set Architecture) level simulator for RISC-V, it does not simulate the pipeline, which greatly simplifies our task. If you want to build a pipeline-accurate simulator, I would encourage you to take a graduate level computer architecture course, and/or read Computer Architecture: A Quantitative Approach 6th edition by Hennessy and Patterson (older editions are fine if you just want to learn the basics of pipeline CPU architectures, but 6/e as been re-worked to be based around RISC-V).

Thus, our state-storage mechanism is very simple, we have an “array” mem which stores each byte of memory, indexed by its address, an array regs which stores the value of each of RISC-V’s 31 general-purpose 32-bit registers, one “zero” register which always contains he value zero, and a variable PC which stores the current Program Counter value (the PC is the memory address of the instruction to be executed this cycle). I put arrays in quotes because AWK has only two data types: strings, and hash tables, although math operations are well supported (internally, AWK most likely has separate types in its interpreter for numbers and text, but this is an implementation detail). It also does not require us to pre-declare any variables, thus we can say mem[123] = 7 out of the blue and this will work as one would expect. This has the happy side-effect of allowing us to address the 2³² possible addresses that RISC-V (specifically, the RV32I instruction set) can address without having to allocate a whole 4GB of memory up front.

The actual next-state logic for our processor is also quite straightforward. An RISC-V instruction can operate on some combinations of a destination register, one or two source registers, an immediate value, a source memory address, or a destination memory address (not all of these at once though!). Each instruction identifies some subset of these possible operands, and some arithmetic or logical operation to be performed with them.

The more difficult aspect of creating our next-state logic is instruction decoding. Processors are usually implemented with digital logic circuits, and are specified using a Hardware Description Language (HDL) such as Verilog, and an HDL AWK is not. In Verilog, it is very straightforward to pull specific bits out of instructions and put them back together in a different order, and to deal with values of unusual widths like 12 bits, or 7 bits.

All ISAs, including RV32I, will specify one or more instruction encoding. RV32I includes 6 types of instruction: R, I, S, B, U, and J. Each type trades off between what operands can be specified and the size of immediate values that can be used. I will not cover all 6 types in this article, but they can be found on page 12 of the RISC-V specification if you want to learn more.

Each RISC-V instruction is stored as a 32-bit value, divided into fields of fewer bits. How the instruction is divided up depends on the instruction type. Let’s consider I-type as an example. This type of instruction is used to describe operations involving a single source register, a 12-bit immediate value, and a destination register. A 32-bit I-type instruction is divided up into the following fields:

Bits 31 to 20 (12 bits): immediate value (imm)
Bits 19 to 15 (5 bits): source register 1 (rs1)
Bits 14 to 12 (3 bits): function code (funct3)
Bits 11 to 7 (5 bits): destination register (rd)
Bits 6 to 0 (7 bits): operation code (opcode)

Meanwhile for comparison, an R-type instruction would use the bits comprising the 12-bit immediate value to instead specify an additional source register (rs2) and an additional 7-bit function code (funct7). The instruction type can be unambiguously determined by the instruction’s opcode field; the opcode field always occupies the lowest-order 7 bits of the instruction, irrespective of type.

Determining how to extract all of these fields, identify the operation, instruction type, and so on, is accomplished using lookup tables. In some languages, that might mean case statements, in others big chains of if else blocks, or an actual table.

Now that we have a general idea of how a CPU simulator works, let’s dive further into riscv.awk…

# Instruction Decoding

## Extracting Fields

To decode each instruction, riscv.awk defines a collection of functions, each named after a field that it extracts. For example, the opcode() function returns the opcode of an instruction, while the immI() field extracts the immediate value of an I-type instruction (there are several possible ways immediate values can be encoded for various instruction types).

The immI function looks like this:

function immI(v)   { return signextend(rshift(and(v, 0xfff00000), 20), 11) }

Notice that it simply operates on a value – if one were to pass, for example, an R-type instruction to immI(), it would extract the exact same set of bits as it would for an I-type instruction… except that the result would be meaningless junk rather than a useful value.

Let’s take a look at the instruction addi x5 x0 0x123, which sums the value of the zero register (x0, always evaluates to 0) and the immediate value 0x123, storing the result in register 5. This instruction is encoded as 0x12300293.

Observe how the field is extracted:

and(0x12300293, 0xfff00000) = 0x12300000 – this masks out just the bits containing the immediate value.
rshift(0x12300000, 20) = 0x123 – shift those bits to be right-aligned, so the LSB (least significant bit) of the immediate value is the LSB of the whole value now.
signextend(0x123, 11) = 0x123 – signextend is a helper method I wrote which performs two’s complement sign-extension. The 11 is the index at which it is to apply the sign-extension.

Sign-extension is used when storing a smaller two’s complement value in a larger “slot” (such as a field, variable, etc.). It simply copies the most-significant bit (MSB) of the smaller value into all of the higher-order fields of the larger space in which it is to be stored, thus preserving its sign.

For example, the value -1 could be represented as 0xf in a 4-bit two’s complement integer. If we wished to store this in an 8-bit field, we would duplicate the MSB (1) into the four new higher-order fields, obtaining 0xff as the result. Why this works is out of scope of this article, but you can trust that it does.

immI() is a straightforward enough function, but some of the field-extraction functions can contain a lot of surface area for off-by-one errors. For example, consider this function that decodes the immediate value from J-type functions (used for jump instructions):

function immJ(v)   { return signextend(or( \
 rshift(and(v, 0x80000000), 11), \
 rshift(and(v, 0x7fe00000), 20),  \
 rshift(and(v, 0x00100000), 9), \
 rshift(and(v, 0x000ff000), 0)), 20) }

In RISC-V, J-type immediates are split into multiple separate fields, which are also out of order, so our simulator must mask each one out, shift them into the right positions, and re-combine all of them into a final value.

Determining an Instruction's Type ¶

The next essential component of instruction decoding is the ability to determine an instruction’s type by its opcode – this is critical because we do not know what fields we can decode from the instruction until we know it’s type.

riscv.awk solves this problem with a function type() which takes in an instruction, and returns a string corresponding to one of the six types. To avoid human error, I generated the body of this function with a simple script (which uses this file as input). It looks something like this:

function type(v) {
  if (opcode(v) == 0x37) { return "U" }
  if (opcode(v) == 0x17) { return "U" }
  // ... trimmed for brevity
}

This builds on top of the field-extraction functions we discussed in the previous section.

## Determine an Instruction’s Operation

Much like determining the instruction type, its operation is also decided by a (much larger) lookup table, also generated by script.

The op2str() function looks like this:

function op2str(v) {
 if (opcode(v) == 0x37) { return "LUI" }
 if (opcode(v) == 0x17) { return "AUIPC" }
 if (opcode(v) == 0x6f) { return "JAL" }
 if ((opcode(v) == 0x67) && (funct3(v) == 0x0)) {return "JALR"}
 if ((opcode(v) == 0x63) && (funct3(v) == 0x0)) {return "BEQ"}
 // ... trimmed for brevity ...
}

In RISC-V, an instructions operation is determined by its opcode, but also potentially by its function code fields, funct3 and funct7 for instruction types that have them. For example, the ADD and SUB (add and subtract) instructions differ only in the funct7 (0x0 and 0x20, respectively), but have the same values for opcode and fucnt3 (0x33, and 0x0, respectively).

# Calculating The Next State

The final critical piece of our puzzle is the next-state function. Here, we use the op2str() function to determine which instruction is being executed, then perform an appropriate operation. The nextstate() function is hand-written, though it is also structured like a big lookup table much like instruction decoding functions are.

Let’s consider how the ADD instruction is implemented:

  // formatting changed for readability
  else if (op2str(inst) == "ADD") {
    regwrite(
      rd(inst),
      dec2two(two2dec(regread(rs1(inst))) + two2dec(regread(rs2(inst))))
    )
  }

rs1(inst) and rs2(inst) extract the source register 1, and source register 2 fields of the instruction, respectively. two2dec() is a helper method which converts a two’s complement number into a decimal representation – this is important because on many modern 64-bit systems, AWK will have stored our values as either strings, or as 64-bit integers, and therefore a value like -1 packed into a 32-bit two’s complement value, but interpreted as a 64-bit integer would be treated as 268435455, which is not what we want.

rd(inst) extracts the destination register of our instruction, and regwrite() is used to modify a register’s value. Access to the regs and mem arrays discussed earlier is mediated by the functions regread(), regwrite(), memread(), and memwrite(), which are used in both tracing and to catch value-out-of-bounds errors.

# Using the Simulator

The simulator processes standard input, one line at a time. In “normal mode” (the default), each line is treated as a “directive”, or command, split into fields along whitespace characters. How this works internally would be a lesson in AWK, rather than in writing CPU simulators, and so is omitted.

## Running A Program

To load a program into the simulator, we can use the poke directive, which accepts a memory address, and a value to write to it. So if we had the program:

addi x5 x0 1
addi x6 x0 0xff
addi x10 x5 1

We could load it into the simulator’s memory via the following poke commands:

poke 0x00 0x00100293   # addi x5 x0 1
poke 0x04 0x0ff00313   # addi x6 x0 0xff
poke 0x08 0x0012f513   # andi x10 x5 1

Of course, if we feed this as input into risc.awk, it won’t produce any output – we need to use the step directive to tell it to advance the simulation by a specific number of cycles. If add step 5 to the input, it will advance through the end of our program. It still won’t output anything though. In true UNIX tradition, riscv.awk assumes you may want to parse its output with another program (in some cases, you may even want riscv.awk to parse its own output). Thus, it won’t print anything until we ask it to. We can do this with rpeek 10, which will output the contents of register 10. Let’s put all this together:

$ cat directives.txt
poke 0x00 0x00100293   # addi x5 x0 1
poke 0x04 0x0ff00313   # addi x6 x0 0xff
poke 0x08 0x0012f513   # andi x10 x5 1
step 5
rpeek 10
$ awk -f riscv.awk < directives.txt
0x00000001

Not very exciting, but at least it prints the correct result! We could have also used the showregs directive to display the whole register file, rather than just register x10.

When debugging either a RISC-V program, or riscv.awk itself, it’s often useful to see what’s happening step-by-step. For this purpose, a number of tracing modes are available. traceregs will display all register reads and writes, traceinst will display each instruction that is run, and tracemem displays all memory accesses. Let’s try turning these on and see what happens:

$ cat directives.txt
traceregs 1
traceinst 1
tracemem 1
poke 0x00 0x00100293   # addi x5 x0 1
poke 0x04 0x0ff00313   # addi x6 x0 0xff
poke 0x08 0x0012f513   # andi x10 x5 1
step 5
rpeek 10
$ awk -f riscv.awk < directives.txt
# memwrite(0x00000000, 0x00100293)
# memwrite(0x00000004, 0x0ff00313)
# memwrite(0x00000008, 0x0012f513)
# memread(0x00000000) -> 0x00100293
# PC=0x00000000, inst=0x00100293, type=I, disasm='0x00100293: ADDI x5 x0 1', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x01, rd=0x05, opcode=0x13, imm=0x00000001
# regread(x0) -> 0x00000000
# regwrite(x5, 0x00000001)
# memread(0x00000004) -> 0x0ff00313
# PC=0x00000004, inst=0x0ff00313, type=I, disasm='0x0ff00313: ADDI x6 x0 255', funct7=0x07, funct3=0x0, rs1=0x00, rs2=0x1f, rd=0x06, opcode=0x13, imm=0x000000ff
# regread(x0) -> 0x00000000
# regwrite(x6, 0x000000ff)
# memread(0x00000008) -> 0x0012f513
# PC=0x00000008, inst=0x0012f513, type=I, disasm='0x0012f513: ANDI x10 x5 1', funct7=0x00, funct3=0x7, rs1=0x05, rs2=0x01, rd=0x0a, opcode=0x13, imm=0x00000001
# regread(x5) -> 0x00000001
# regwrite(x10, 0x00000001)
# memread(0x0000000c) -> 0x00000000
# PC=0x0000000c, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# memread(0x00000010) -> 0x00000000
# PC=0x00000010, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# regread(x10) -> 0x00000001
0x00000001

Notice that all of the lines of tracing begin with # . This is so that when processing the output of riscv.awk with itself, they will be treated as comments. Let’s break this down line by line:

# memwrite(0x00000000, 0x00100293)
# memwrite(0x00000004, 0x0ff00313)
# memwrite(0x00000008, 0x0012f513)

First, we see our three poke commands writing into memory. Remember that in RISC-V, memory is byte-addressable, hence why we need to advance by 4 addresses at a time rather than 1.

# memread(0x00000000) -> 0x00100293
# PC=0x00000000, inst=0x00100293, type=I, disasm='0x00100293: ADDI x5 x0 1', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x01, rd=0x05, opcode=0x13, imm=0x00000001
# regread(x0) -> 0x00000000
# regwrite(x5, 0x00000001)

Here we see our first instruction, addi x5 x0 1 being executed. It is first fetched from memory address 0x0, which we can see in the memread(), as well as in the following line where the PC is 0. The second line in this block shows us the values of each field in the instruction. Notice that even though it’s an I-type, it still prints out the funct7 and rs2 fields. This might seem strange, but it’s useful to see all the fields while debugging, especially if an un-implemented instruction is encountered. Finally, we see that this instruction read from register 0 (the rs1 register), and wrote to register 5.

# PC=0x00000004, inst=0x0ff00313, type=I, disasm='0x0ff00313: ADDI x6 x0 255', funct7=0x07, funct3=0x0, rs1=0x00, rs2=0x1f, rd=0x06, opcode=0x13, imm=0x000000ff
# regread(x0) -> 0x00000000
# regwrite(x6, 0x000000ff)
# memread(0x00000008) -> 0x0012f513
# PC=0x00000008, inst=0x0012f513, type=I, disasm='0x0012f513: ANDI x10 x5 1', funct7=0x00, funct3=0x7, rs1=0x05, rs2=0x01, rd=0x0a, opcode=0x13, imm=0x00000001
# regread(x5) -> 0x00000001
# regwrite(x10, 0x00000001)

These lines are very similar to the previous chunk we looked at, but pertain to the following two instructions.

# memread(0x0000000c) -> 0x00000000
# PC=0x0000000c, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# memread(0x00000010) -> 0x00000000
# PC=0x00000010, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# regread(x10) -> 0x00000001

Here we see that execution has “fallen off the end” of our program. risv.awk assumes memory is just a big 4GB blob, and thus doesn’t know that the program has finished. However, unset memory values are treated as 0, which produces an invalid instruction, which risv.awk treats as a NO-OP. Thus these two spurious instructions don’t actually have any effect.

0x00000001

Finally, we see the result of our rpeek instruction, just as we did before we turned on tracing.

## Assembling a Program

You might wonder how we got the three instructions we poke-ed into memory in the previous section. For that, we need an assembler. We could use a pre-existing one such as RARS, but what’s the fun in that? riscv.awk actually includes a complete assembler, with label resolution. How the assembler works is beyond the scope of this article, but it sure comes in handy when using the simulator.

To assemble a chunk of code, we set riscv.awk into assemble mode, which will cause it begin interpreting further lines of input as assembly instructions, to be assembled and inserted sequentially into memory starting at 0x0. Let’s look at an example:

$ cat directives.txt
mode assemble

# this program sums the numbers 0 ... 10
addi x10 x0 0  # i = 0
addi x11 x0 10 # k = 10
addi x12 x0 0  # n = 0

loop:
add x12 x12 x10  # n += i
addi x10 x10 1   # i++
bge x11 x10 loop # if (k >= i) { goto loop; }

# exit assemble mode
mode normal

# resolve the jump
resolve our branch target

# dump the contents of memory, so we can see the result of assembling our
# program
dump

# run the program and print the resulting value of n in register 12
step 100
rpeek 12
$ awk -f riscv.awk < directives.txt
# BEGINNING OF RISCV.AWK DUMP
pcpoke 0x00000000
poke 0x00000000 0x00000513   # 0x00000513: ADDI x10 x0 0
poke 0x00000004 0x00a00593   # 0x00a00593: ADDI x11 x0 10
poke 0x00000008 0x00000613   # 0x00000613: ADDI x12 x0 0
poke 0x0000000c 0x00a60633   # 0x00a60633: ADD x12 x12 x10
poke 0x00000010 0x00150513   # 0x00150513: ADDI x10 x10 1
poke 0x00000014 0xfea5dce3   # 0xfea5dce3: BGE x11 x10 0xfffffff8
# END OF RISCV.AWK DUMP
0x00000037

As you can see, the program is assembled and placed in memory address 0, and the branch target for the bge instruction is resolved to -8.

# Conclusion

This has been a high-level overview of what goes into writing a basic CPU simulator, using my own RV32I simulator written in AWK as a case study. If you want more, I would suggest reading the source code for yourself; I have tried to comment it well and to structure it clearly, so I hope it will be easy to follow.

I found AWK to be a surprisingly pleasant language in which to tackle this project, and it turned out easier than I expected. That said, I wouldn’t suggest using this code in production, as I’ve left a lot of things you would want from a real CPU simulator out, such as interrupts and exceptions. Although AWK wasn’t bad for this purpose, Python, Go, or even C would have produced an easier to extend codebase that would be easier to integrate into other projects as a library.