Writing a CPU Simulator in AWK
Recently, I implemented a
RISC-V
simulator and assembler in AWK:
awk-riscv. This started as a joke,
evolved into wanting to prove a point (namely, that AWK being a Turing-complete
language, can compute anything which is compute-able) , and ended up being a
lot of fun. I will provide a narrative overview of the
riscv.awk
program, and also discuss more broadly the basics of writing a CPU simulator.
Even if you don’t know AWK, but want to learn about CPU simulators, you’ll probably be able to follow this article. AWK is an interpreted, record-oriented language with a C-like syntax, and I think you will agree that most of the code is quite readable to anyone with experience in C-like languages. Writing a CPU simulator is a great learning experience, and I think every programmer should do it at least once.
This article does assume basic familiarity with computer processors, and with basic assembly programming. If you have worked with MIPS or ARM in the past, you should have little trouble with RV32I.
NOTE: At time of writing, the current commit hash of the awk-riscv project is ce30c930be0f157a14442b738993e54d8d0dc52f. If you are reading this in the future, you might want to roll back to that commit.
Anatomy of a CPU Simulator
CPU simulators are quite simple at their core, consisting of two major components:
- One to store and interact with the state of the CPU at a given instant.
- One to calculate the next state of the CPU based on the current state.
The time between the current state and the next state constitutes one “clock cycle” in hardware terms. When implementing a processor as a digital logic circuit, this will be limited by the propagation delay of the slowest path through the circuit (critical path). In a software simulator, our limitation is simply how fast our code can run.
When writing a cycle-accurate, pipelined processor, one must be careful to store various values of intermediary registers and update them in the proper order – simulating a digital logic circuit where the values of all output signals are calculated simultaneously by different circuit paths (in a typical programming language where each statement happens in sequence, one after the other) can prove very challenging.
Fortunately for us, awk-riscv and this article is only an ISA (Instruction Set Architecture) level simulator for RISC-V, it does not simulate the pipeline, which greatly simplifies our task. If you want to build a pipeline-accurate simulator, I would encourage you to take a graduate level computer architecture course, and/or read Computer Architecture: A Quantitative Approach 6th edition by Hennessy and Patterson (older editions are fine if you just want to learn the basics of pipeline CPU architectures, but 6/e as been re-worked to be based around RISC-V).
Thus, our state-storage mechanism is very simple, we have an “array” mem
which stores each byte of memory, indexed by its address, an array regs
which
stores the value of each of RISC-V’s 31 general-purpose 32-bit registers, one
“zero” register which always contains he value zero, and a variable PC
which
stores the current Program Counter value (the PC is the memory address of the
instruction to be executed this cycle). I put arrays in quotes because AWK has
only two data types: strings, and hash tables, although math operations are
well supported (internally, AWK most likely has separate types in its
interpreter for numbers and text, but this is an implementation detail). It
also does not require us to pre-declare any variables, thus we can say
mem[123] = 7
out of the blue and this will work as one would expect. This has
the happy side-effect of allowing us to address the 232 possible
addresses that RISC-V (specifically, the RV32I instruction set) can address
without having to allocate a whole 4GB of memory up front.
The actual next-state logic for our processor is also quite straightforward. An RISC-V instruction can operate on some combinations of a destination register, one or two source registers, an immediate value, a source memory address, or a destination memory address (not all of these at once though!). Each instruction identifies some subset of these possible operands, and some arithmetic or logical operation to be performed with them.
The more difficult aspect of creating our next-state logic is instruction decoding. Processors are usually implemented with digital logic circuits, and are specified using a Hardware Description Language (HDL) such as Verilog, and an HDL AWK is not. In Verilog, it is very straightforward to pull specific bits out of instructions and put them back together in a different order, and to deal with values of unusual widths like 12 bits, or 7 bits.
All ISAs, including RV32I, will specify one or more instruction encoding. RV32I includes 6 types of instruction: R, I, S, B, U, and J. Each type trades off between what operands can be specified and the size of immediate values that can be used. I will not cover all 6 types in this article, but they can be found on page 12 of the RISC-V specification if you want to learn more.
Each RISC-V instruction is stored as a 32-bit value, divided into fields of fewer bits. How the instruction is divided up depends on the instruction type. Let’s consider I-type as an example. This type of instruction is used to describe operations involving a single source register, a 12-bit immediate value, and a destination register. A 32-bit I-type instruction is divided up into the following fields:
- Bits 31 to 20 (12 bits): immediate value (
imm
) - Bits 19 to 15 (5 bits): source register 1 (
rs1
) - Bits 14 to 12 (3 bits): function code (
funct3
) - Bits 11 to 7 (5 bits): destination register (
rd
) - Bits 6 to 0 (7 bits): operation code (
opcode
)
Meanwhile for comparison, an R-type instruction would use the bits comprising
the 12-bit immediate value to instead specify an additional source register
(rs2
) and an additional 7-bit function code (funct7
). The instruction type
can be unambiguously determined by the instruction’s opcode
field; the
opcode
field always occupies the lowest-order 7 bits of the instruction,
irrespective of type.
Determining how to extract all of these fields, identify the operation,
instruction type, and so on, is accomplished using lookup tables. In some
languages, that might mean case statements, in others big chains of if else
blocks, or an actual table.
Now that we have a general idea of how a CPU simulator works, let’s dive
further into riscv.awk
…
Instruction Decoding
Extracting Fields
To decode each instruction, riscv.awk
defines a collection of
functions,
each named after a field that it extracts. For example, the opcode()
function
returns the opcode
of an instruction, while the immI()
field extracts the
immediate value of an I-type instruction (there are several possible ways
immediate values can be encoded for various instruction types).
The immI
function looks like this:
function immI(v) { return signextend(rshift(and(v, 0xfff00000), 20), 11) }
Notice that it simply operates on a value – if one were to pass, for example,
an R-type instruction to immI()
, it would extract the exact same set of bits
as it would for an I-type instruction… except that the result would be
meaningless junk rather than a useful value.
Let’s take a look at the instruction addi x5 x0 0x123
, which sums the value
of the zero register (x0
, always evaluates to 0) and the immediate value
0x123
, storing the result in register 5. This instruction is encoded as
0x12300293
.
Observe how the field is extracted:
and(0x12300293, 0xfff00000) = 0x12300000
– this masks out just the bits containing the immediate value.rshift(0x12300000, 20) = 0x123
– shift those bits to be right-aligned, so the LSB (least significant bit) of the immediate value is the LSB of the whole value now.signextend(0x123, 11) = 0x123
–signextend
is a helper method I wrote which performs two’s complement sign-extension. The11
is the index at which it is to apply the sign-extension.
Sign-extension is used when storing a smaller two’s complement value in a larger “slot” (such as a field, variable, etc.). It simply copies the most-significant bit (MSB) of the smaller value into all of the higher-order fields of the larger space in which it is to be stored, thus preserving its sign.
For example, the value -1 could be represented as 0xf
in a 4-bit two’s
complement integer. If we wished to store this in an 8-bit field, we would
duplicate the MSB (1) into the four new higher-order fields, obtaining 0xff
as the result. Why this works is out of scope of this article, but you can
trust that it does.
immI()
is a straightforward enough function, but some of the field-extraction
functions can contain a lot of surface area for off-by-one errors. For example,
consider this function that decodes the immediate value from J-type functions
(used for jump instructions):
function immJ(v) { return signextend(or( \
rshift(and(v, 0x80000000), 11), \
rshift(and(v, 0x7fe00000), 20), \
rshift(and(v, 0x00100000), 9), \
rshift(and(v, 0x000ff000), 0)), 20) }
In RISC-V, J-type immediates are split into multiple separate fields, which are also out of order, so our simulator must mask each one out, shift them into the right positions, and re-combine all of them into a final value.
The next essential component of instruction decoding is the ability to determine an instruction’s type by its opcode – this is critical because we do not know what fields we can decode from the instruction until we know it’s type.
riscv.awk
solves this problem with a function
type()
which takes in an instruction, and returns a string corresponding to one of the
six types. To avoid human error, I generated the body of this function with a
simple
script
(which uses this
file
as input). It looks something like this:
function type(v) {
if (opcode(v) == 0x37) { return "U" }
if (opcode(v) == 0x17) { return "U" }
// ... trimmed for brevity
}
This builds on top of the field-extraction functions we discussed in the previous section.
Determine an Instruction’s Operation
Much like determining the instruction type, its operation is also decided by a (much larger) lookup table, also generated by script.
The
op2str()
function looks like this:
function op2str(v) {
if (opcode(v) == 0x37) { return "LUI" }
if (opcode(v) == 0x17) { return "AUIPC" }
if (opcode(v) == 0x6f) { return "JAL" }
if ((opcode(v) == 0x67) && (funct3(v) == 0x0)) {return "JALR"}
if ((opcode(v) == 0x63) && (funct3(v) == 0x0)) {return "BEQ"}
// ... trimmed for brevity ...
}
In RISC-V, an instructions operation is determined by its opcode
, but also
potentially by its function code fields, funct3
and funct7
for instruction
types that have them. For example, the ADD
and SUB
(add and subtract)
instructions differ only in the funct7
(0x0 and 0x20, respectively), but have
the same values for opcode
and fucnt3
(0x33, and 0x0, respectively).
Calculating The Next State
The final critical piece of our puzzle is the next-state function. Here, we use
the op2str()
function to determine which instruction is being executed, then
perform an appropriate operation. The
nextstate()
function is hand-written, though it is also structured like a big lookup table
much like instruction decoding functions are.
Let’s consider how the ADD
instruction is implemented:
// formatting changed for readability
else if (op2str(inst) == "ADD") {
regwrite(
rd(inst),
dec2two(two2dec(regread(rs1(inst))) + two2dec(regread(rs2(inst))))
)
}
rs1(inst)
and rs2(inst)
extract the source register 1, and source register
2 fields of the instruction, respectively. two2dec()
is a helper method
which converts a two’s complement number into a decimal representation –
this is important because on many modern 64-bit systems, AWK will have stored
our values as either strings, or as 64-bit integers, and therefore a value
like -1
packed into a 32-bit two’s complement value, but interpreted as a
64-bit integer would be treated as 268435455, which is not what we want.
rd(inst)
extracts the destination register of our instruction, and
regwrite()
is used to modify a register’s value. Access to the regs
and
mem
arrays discussed earlier is mediated by the functions regread()
,
regwrite()
, memread()
, and memwrite()
, which are used in both tracing and
to catch value-out-of-bounds errors.
Using the Simulator
The simulator processes standard input, one line at a time. In “normal mode” (the default), each line is treated as a “directive”, or command, split into fields along whitespace characters. How this works internally would be a lesson in AWK, rather than in writing CPU simulators, and so is omitted.
Running A Program
To load a program into the simulator, we can use the poke
directive, which
accepts a memory address, and a value to write to it. So if we had the program:
addi x5 x0 1
addi x6 x0 0xff
addi x10 x5 1
We could load it into the simulator’s memory via the following poke commands:
poke 0x00 0x00100293 # addi x5 x0 1
poke 0x04 0x0ff00313 # addi x6 x0 0xff
poke 0x08 0x0012f513 # andi x10 x5 1
Of course, if we feed this as input into risc.awk
, it won’t produce any
output – we need to use the step
directive to tell it to advance the
simulation by a specific number of cycles. If add step 5
to the input, it
will advance through the end of our program. It still won’t output anything
though. In true UNIX tradition, riscv.awk
assumes you may want to parse its
output with another program (in some cases, you may even want riscv.awk
to
parse its own output). Thus, it won’t print anything until we ask it to. We can
do this with rpeek 10
, which will output the contents of register 10. Let’s
put all this together:
$ cat directives.txt
poke 0x00 0x00100293 # addi x5 x0 1
poke 0x04 0x0ff00313 # addi x6 x0 0xff
poke 0x08 0x0012f513 # andi x10 x5 1
step 5
rpeek 10
$ awk -f riscv.awk < directives.txt
0x00000001
Not very exciting, but at least it prints the correct result! We could have
also used the showregs
directive to display the whole register file, rather
than just register x10
.
When debugging either a RISC-V program, or riscv.awk
itself, it’s often
useful to see what’s happening step-by-step. For this purpose, a number of
tracing modes are available. traceregs
will display all register reads and
writes, traceinst
will display each instruction that is run, and tracemem
displays all memory accesses. Let’s try turning these on and see what happens:
$ cat directives.txt
traceregs 1
traceinst 1
tracemem 1
poke 0x00 0x00100293 # addi x5 x0 1
poke 0x04 0x0ff00313 # addi x6 x0 0xff
poke 0x08 0x0012f513 # andi x10 x5 1
step 5
rpeek 10
$ awk -f riscv.awk < directives.txt
# memwrite(0x00000000, 0x00100293)
# memwrite(0x00000004, 0x0ff00313)
# memwrite(0x00000008, 0x0012f513)
# memread(0x00000000) -> 0x00100293
# PC=0x00000000, inst=0x00100293, type=I, disasm='0x00100293: ADDI x5 x0 1', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x01, rd=0x05, opcode=0x13, imm=0x00000001
# regread(x0) -> 0x00000000
# regwrite(x5, 0x00000001)
# memread(0x00000004) -> 0x0ff00313
# PC=0x00000004, inst=0x0ff00313, type=I, disasm='0x0ff00313: ADDI x6 x0 255', funct7=0x07, funct3=0x0, rs1=0x00, rs2=0x1f, rd=0x06, opcode=0x13, imm=0x000000ff
# regread(x0) -> 0x00000000
# regwrite(x6, 0x000000ff)
# memread(0x00000008) -> 0x0012f513
# PC=0x00000008, inst=0x0012f513, type=I, disasm='0x0012f513: ANDI x10 x5 1', funct7=0x00, funct3=0x7, rs1=0x05, rs2=0x01, rd=0x0a, opcode=0x13, imm=0x00000001
# regread(x5) -> 0x00000001
# regwrite(x10, 0x00000001)
# memread(0x0000000c) -> 0x00000000
# PC=0x0000000c, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# memread(0x00000010) -> 0x00000000
# PC=0x00000010, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# regread(x10) -> 0x00000001
0x00000001
Notice that all of the lines of tracing begin with #
. This is so that when
processing the output of riscv.awk
with itself, they will be treated as
comments. Let’s break this down line by line:
# memwrite(0x00000000, 0x00100293)
# memwrite(0x00000004, 0x0ff00313)
# memwrite(0x00000008, 0x0012f513)
First, we see our three poke commands writing into memory. Remember that in RISC-V, memory is byte-addressable, hence why we need to advance by 4 addresses at a time rather than 1.
# memread(0x00000000) -> 0x00100293
# PC=0x00000000, inst=0x00100293, type=I, disasm='0x00100293: ADDI x5 x0 1', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x01, rd=0x05, opcode=0x13, imm=0x00000001
# regread(x0) -> 0x00000000
# regwrite(x5, 0x00000001)
Here we see our first instruction, addi x5 x0 1
being executed. It is first
fetched from memory address 0x0
, which we can see in the memread()
, as well
as in the following line where the PC
is 0. The second line in this block
shows us the values of each field in the instruction. Notice that even though
it’s an I-type, it still prints out the funct7
and rs2
fields. This might
seem strange, but it’s useful to see all the fields while debugging, especially
if an un-implemented instruction is encountered. Finally, we see that this
instruction read from register 0 (the rs1
register), and wrote to register 5.
# PC=0x00000004, inst=0x0ff00313, type=I, disasm='0x0ff00313: ADDI x6 x0 255', funct7=0x07, funct3=0x0, rs1=0x00, rs2=0x1f, rd=0x06, opcode=0x13, imm=0x000000ff
# regread(x0) -> 0x00000000
# regwrite(x6, 0x000000ff)
# memread(0x00000008) -> 0x0012f513
# PC=0x00000008, inst=0x0012f513, type=I, disasm='0x0012f513: ANDI x10 x5 1', funct7=0x00, funct3=0x7, rs1=0x05, rs2=0x01, rd=0x0a, opcode=0x13, imm=0x00000001
# regread(x5) -> 0x00000001
# regwrite(x10, 0x00000001)
These lines are very similar to the previous chunk we looked at, but pertain to the following two instructions.
# memread(0x0000000c) -> 0x00000000
# PC=0x0000000c, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# memread(0x00000010) -> 0x00000000
# PC=0x00000010, inst=0x00000000, type=E, disasm='0x00000000: UNKNOWN', funct7=0x00, funct3=0x0, rs1=0x00, rs2=0x00, rd=0x00, opcode=0x00
# regread(x10) -> 0x00000001
Here we see that execution has “fallen off the end” of our program. risv.awk
assumes memory is just a big 4GB blob, and thus doesn’t know that the program
has finished. However, unset memory values are treated as 0, which produces an
invalid instruction, which risv.awk
treats as a NO-OP. Thus these two
spurious instructions don’t actually have any effect.
0x00000001
Finally, we see the result of our rpeek
instruction, just as we did before we
turned on tracing.
Assembling a Program
You might wonder how we got the three instructions we poke
-ed into memory in
the previous section. For that, we need an assembler. We could use a
pre-existing one such as RARS, but
what’s the fun in that? riscv.awk
actually includes a complete assembler,
with label resolution. How the assembler works is beyond the scope of this
article, but it sure comes in handy when using the simulator.
To assemble a chunk of code, we set riscv.awk
into assemble
mode, which
will cause it begin interpreting further lines of input as assembly
instructions, to be assembled and inserted sequentially into memory starting at
0x0
. Let’s look at an example:
$ cat directives.txt
mode assemble
# this program sums the numbers 0 ... 10
addi x10 x0 0 # i = 0
addi x11 x0 10 # k = 10
addi x12 x0 0 # n = 0
loop:
add x12 x12 x10 # n += i
addi x10 x10 1 # i++
bge x11 x10 loop # if (k >= i) { goto loop; }
# exit assemble mode
mode normal
# resolve the jump
resolve our branch target
# dump the contents of memory, so we can see the result of assembling our
# program
dump
# run the program and print the resulting value of n in register 12
step 100
rpeek 12
$ awk -f riscv.awk < directives.txt
# BEGINNING OF RISCV.AWK DUMP
pcpoke 0x00000000
poke 0x00000000 0x00000513 # 0x00000513: ADDI x10 x0 0
poke 0x00000004 0x00a00593 # 0x00a00593: ADDI x11 x0 10
poke 0x00000008 0x00000613 # 0x00000613: ADDI x12 x0 0
poke 0x0000000c 0x00a60633 # 0x00a60633: ADD x12 x12 x10
poke 0x00000010 0x00150513 # 0x00150513: ADDI x10 x10 1
poke 0x00000014 0xfea5dce3 # 0xfea5dce3: BGE x11 x10 0xfffffff8
# END OF RISCV.AWK DUMP
0x00000037
As you can see, the program is assembled and placed in memory address 0, and
the branch target for the bge
instruction is resolved to -8
.
Conclusion
This has been a high-level overview of what goes into writing a basic CPU simulator, using my own RV32I simulator written in AWK as a case study. If you want more, I would suggest reading the source code for yourself; I have tried to comment it well and to structure it clearly, so I hope it will be easy to follow.
I found AWK to be a surprisingly pleasant language in which to tackle this project, and it turned out easier than I expected. That said, I wouldn’t suggest using this code in production, as I’ve left a lot of things you would want from a real CPU simulator out, such as interrupts and exceptions. Although AWK wasn’t bad for this purpose, Python, Go, or even C would have produced an easier to extend codebase that would be easier to integrate into other projects as a library.