Forist | The Depressed Milkman

Introduction

Forist is an instruction-set architecture (ISA) designed for modern desktop CPUs. By encoding a better understanding of how programs are executed into the instruction set, Forist aims to improve the usefulness of the information provided to CPUs.

In particular, here are some distinguishing features of Forist’s model of execution:

Instructions are executed out-of-order. Rather than waiting for each instruction to finish before executing the next, instructions are executed in parallel; they begin executing once the instructions they depend upon finish.
Conditional execution is speculated. CPUs predict the code they have to execute, then execute it speculatively. If a prediction is incorrect, then the effects of the speculation are canceled, to the greatest extent possible.
Memory accesses are cached. If a portion of RAM is accessed, then it is cached so that nearby accesses are temporarily fast. Accesses to data that is already cached can be canceled in the context of speculative execution.
CPUs are multicore. Cores operate independently, but can communicate and synchronize with one another. All cores share access to RAM, but can have independent caches – they may have different cache data for the same memory.
CPUs execute statically-verified input. The instructions passed to CPUs are guaranteed to be safe to execute: however, they can make use of side channels to attack the system. Programs are expected to protect secret data of their own accord so that it is not leaked through side channels.

Forist is an ISA (instruction set architecture) designed for modern desktop CPUs.

Features

Forist is designed around the following premises.

Speculative execution is difficult. The Spectre vulnerability showed that instructions with side effects (such as interactions with memory) are very difficult to speculatively execute safely. However, speculative execution is critical to performance – it is the CPU’s ability to dynamically learn and apply information about code. Forist provides safe alternatives to many cases where the speculative execution of side effects is necessary – for example, it supports cache-guaranteeing pointers, reading from and writing to which is free of side effects. Forist also allows “disabling” speculative execution for a branch – if the branch condition is known far enough in advance, performance is unaffected.
The use of registers hinders out-of-order execution. Modern CPUs begin executing instructions even before previous instructions are complete, by tracking how different instructions depend upon each other. However, register-based ISAs such as x86 communicate data between instructions indirectly (via registers), which obfuscates this information. Forist does not use registers – instead, instructions directly specify which instructions need their results. This significantly simplifies the CPU’s work, and provides even more information than is available with register-based ISAs.
Register renaming obviates register allocation. Compilers for register-based ISAs have to make decisions about which values to place in which registers. However, modern CPUs are making the same decisions dynamically anyways, due to the need for additional registers for out-of-order execution. In Forist, the CPU is wholly responsible for register allocation, and is expected to automatically store data on the stack when necessary. Compilers for Forist thus skip the problem of register allocation entirely.
Most values are used once or twice. When complex calculations are performed, many intermediary values are calculated which are each used once. Forist’s instruction encoding is specially optimized for this: instructions can only have one or two destination instructions (the copy instruction allows specifying more destinations). If just one destination is used, the space for the second destination is reused for storing additional data.
Atomics are difficult to implement and use. Atomic memory operations have subtle semantics that are difficult to understand. In hardware, they are often too powerful: they require cooperation between different cores and usually operate at cache-line granularity, which is larger than intended. This leads to unexpected and complex performance implications. Forist operates upon a completely different model, based on inter-process communication. CPUs are informed about running tasks, and special instructions are provided to communicate individual values between them. Cache coherency is not maintained automatically: two different tasks may believe that two different values are stored at the same memory location. Explicit instructions for flushing the cache are provided, which tasks are expected to use before and after communicating with each other.
Untrusted code should not be executed. Traditional architectures support the execution of untrusted and possibly-malicious code, by distinguishing between different contexts (typically just “kernel space”, for the operating system, and “user space”, for user applications). However, transitions between these modes are difficult and slow because they clean up exploitable side channels like caches. Forist does not support the execution of untrusted code at all: instead, it is expected that static analysis and language-based software verification (as in Rust) are used to verify the safety of programs before they are executed.

The Basics

CPUs execute instructions. Instructions are instances of operations: they associate operations with a set of inputs and a set of outputs. Operations denote transformations from a set of input values to a set of output values. Values are individual units of data, and have fixed types which specify their size, semantics, and storage. Standard types are integers and pointers, both of which have unspecified sizes. Operations are generic: they accept various numbers and types of inputs, and the exact semantics of the operation and the types of the outputs can be determined from them. Instructions need not use all the outputs of the operation, and can copy the output values to multiple destination instructions.

To be executed by the CPU, instructions have to be encoded into a format called machine code. Machine code is designed to be easy for CPUs to execute, but not necessarily easy for other uses of instructions. Within a compiler, instructions may be represented differently: their functionality is fundamentally the same.

Instructions can be represented as a graph, where each node is an instruction, every incoming edge is an input, and every outgoing edge is an output. This is the mathematical representation of the instructions, and it is useful for mathematical analyses of the properties of the instructions. Depending upon the context, various constraints may be placed upon the instructions in this graph format: for example, when encoded, instructions can only have at most 4 inputs, and can only have 2 outputs (including copies).

Instructions are organized into blocks, which are organized into functions. Both functions and blocks represent independent groups of instructions which can be executed multiple times. Blocks can only be called by other blocks within the same function, whereas functions can be called from anywhere.

Types

Forist has a number of types for values. The semantics of operations vary depending upon the types of the inputs involved, in a manner equivalent to function overloading in C++ and other such languages.

void is the special type which has exactly one valid value. It is the equivalent of the return type void in C-like languages. It takes up no space. Instructions ignore any void inputs they receive, but the instructions computing voids will be evaluated before the instructions using the voids are.
bool is a regular boolean value. It effectively takes up just one bit of space, and must be either true or false.
xnat is a category of integer types, all of which have a “natural width”. This means that they all have the same unknown number of bits N (commonly 32 or 64). Integers of this size can be operated upon most efficiently by the implementation.
- unat and snat are the unsigned and signed integer types, respectively. unat has a range of 0 .. 2^N and snat has a range of -2^(N-1) .. +2^(N-1) (in both cases, the lower bound is inclusive and the upper bound is exclusive).
  Both of these integer types are range-checked; by default, if any computation resulting in a unat or snat value that is out of their respective ranges, then an overflow exception occurs, and the executing program is terminated.
- cnat is the carrying integer type. This is intended to be used for arithmetic with multi-word values; the most-significant word should be of unat or snat type, which all other words should be of cnat type.
  Unlike unat and snat, cnat effectively acts like an integer modulo 2^N. When requested, addition and subtraction with cnat outputs a carry bit, which can be used to perform regular multi-word arithmetic. Although it is fairly similar to a unat, arithmetic with it will not fail upon overflow.

Operations

Forist’s operations can be organized into the following categories:

Computation: integer arithmetic, bitwise operations, and comparisons.
Control Flow: conditional selection, branch predication, and blocks.
Memory: reading, writing, stack allocations, and cache control.

More operations (and more categories) can be defined in the future.

Computation

Addition.
- add lhs, rhs
  Add together lhs and rhs.
- adc lhs, rhs, carry
  Add together lhs and rhs, and increment the result if carry is true.
- add.mul lhs, rhs, mul
  Adds lhs to the product of rhs and mul.
  This is an alias for mul.add rhs, mul, lhs.
- add.shl lhs, rhs, shl
  Adds lhs to the left-shift of rhs by shl.
  This is an alias for shl.add rhs, shl, lhs.
Subtraction.
- sub lhs, rhs
  Subtract lhs from rhs.
- sbb lhs, rhs, borrow
  Subtract rhs from lhs, and decrement the result if borrow is true.
- sub.mul lhs, rhs, mul
  Subtract the product of rhs and mul from lhs.
- sub.shl lhs, rhs, shl
  Subtract the left-shift of rhs by shl from lhs.
Multiplication.
- mul lhs, rhs
  Multiply lhs by rhs.
- mul.add lhs, rhs, add
  Multiply lhs by rhs, then add add.
- mul.sub lhs, rhs, sub
  Multiply lhs by rhs, then subtract sub`.
Division.
- div lhs, rhs
  Divide lhs by rhs.
- div.dbl lhs, top, rhs
  Divide the double-word consisting of lhs (low word) and top (high word) by rhs.
- div.add lhs, add, rhs
  Divide the sum of lhs and add by rhs.
- div.sub lhs, sub, rhs
  Divide the difference between lhs and sub by rhs.
Remainder.
- rem lhs, rhs
  Compute the remainder of dividing lhs by rhs.
- rem.dbl lhs, top, rhs
  Compute the remainder of dividing the double-word consisting of lhs (low word) and top (high word) by rhs.
Shift left.
- shl lhs, rhs
  Shift lhs to the left by rhs.
- shl.add lhs, rhs, add
  Shift lhs to the left by rhs, then add add.
- shl.sub lhs, rhs, sub
  Shift lhs to the left by rhs, then subtract sub`.
Shift right.
- shr lhs, rhs
  Shift lhs to the right by rhs.
- shr.add lhs, add, rhs
  Shift the sum of lhs and add to the right by rhs.
- shr.sub lhs, sub, rhs.
  Shift the difference between lhs and sub to the right by rhs.
Extract trailing bits.
- etb lhs, rhs
  Extract the rhs least-significant bits of lhs.
Elementary integer arithmetic.
- inc val
  Increment val.
- dec val
  Decrement val
- neg val
  Negate val.
- abs val
  Compute the signed absolute value of val.
- min lhs, rhs
  Compute the minimum of lhs and rhs.
- max lhs, rhs
  Compute the maximum of lhs and rhs.
Common bitwise arithmetic.
- not val
  Compute the bitwise NOT of val.
- rev val
  Reverse the bits in val.
- and lhs, rhs
  Compute the bitwise AND of lhs and rhs.
- and.not lhs, rhs
  Compute the bitwise AND of lhs and the bitwise NOT of rhs.
- not.and lhs, rhs
  Compute the bitwise NOT of the bitwise AND of lhs and rhs.
- ior lhs, rhs
  Compute the bitwise inclusive-OR of lhs and rhs.
- ior.not lhs, rhs
  Compute the bitwise inclusive-OR of lhs and the bitwise NOT of rhs.
- not.ior lhs, rhs
  Compute the bitwise NOT of the bitwise inclusive-OR of lhs and rhs.
- xor lhs, rhs
  Compute the bitwise exclusive-OR of lhs and rhs.
- xor.not lhs, rhs
  Compute the bitwise exclusive-OR of lhs and the bitwise NOT of rhs.
- not.xor lhs, rhs
  Compute the bitwise NOT of the bitwise exclusive-OR of lhs and rhs.
  This is an alias for xor.not lhs, rhs.
Bit counting.
- cnt.ao val
  Count the number of one-bits in val.
- cnt.az val
  Count the number of zero-bits in val.
- cnt.tz val
  Count the number of trailing consecutive least-significant zero-bits in val.
- cnt.lz val
  Count the number of leading consecutive most-significant zero-bits in val.
Comparisons.
- cmp.iz val
  Checks if val is zero.
- cmp.nz val
  Checks if val is not zero.
- cmp.eq lhs, rhs
  Checks if lhs and rhs are equal.
- cmp.ne lhs, rhs
  Checks if lhs and rhs are not equal.
- cmp.lt lhs, rhs
  Checks if lhs is strictly lesser than rhs.
- cmp.gt lhs, rhs
  Checks if lhs is strictly greater than rhs.
- cmp.le lhs, rhs
  Checks if lhs is lesser than or equal to rhs.
- cmp.ge lhs, rhs
  Checks if lhs is greater than or equal to rhs.
- cmp.in lhs, mid, rhs
  Checks that lhs is lesser than or equal to mid and that mid is strictly lesser than rhs; i.e. that mid fits in the range between lhs and rhs.

Control Flow

Value handling.
- dup val, ...
  Return val, ignoring all other inputs.
- cst type, val, ...
  Return val converted to type type, ignoring all other inputs.
Branch predication.
Forist provides the ability to conditionally execute branches which do not have any side effects without leaking the branch condition through side-channels. This is achieved by laying out the code of all involved branches together, but using the instruction asif to transparently disable the code for all branches except one. When executed, the code for all branches will be evaluated, but only the output from the enabled branch will be visible, as if no other branch was computed (their results were discarded).
To achieve this, all values must be accompanied by an additional predication bit, which indicates whether each value has not been disabled by an asif. Instructions will be executed as usual, but if the predication bit of any input is not set, then that of the instruction’s outputs will not be set either.
- asif cnd
  This instruction has two void outputs; the predication bit of the first will be set if cnd is true, and the predication bit of the second will be set if cnd is false.
Conditional selection.
- sel cnd, lhs, rhs
  Select between lhs and rhs, based on cnd. If it is true, lhs is selected; otherwise rhs is.
Function calls.
A major advantage of Forist is its ability to select registers entirely dynamically, so that data does not have to be pushed to the stack unnecessarily. To this end, limiting the number of parameters that can be passed directly to functions is unhelpful, because it forces additional parameters to always be pushed to the stack rather than determining where to place them at run-time. However, all other instructions in Forist have a fixed number of inputs.
To work around this, the call.n instruction declares a new block; these instructions will be executed even as the function being called is loaded, preventing the occurrence of CPU pipeline stalls if the block is big enough (i.e. if the function in question is known early enough). This block contains multiple return instructions; the values that these instructions return are the parameters passed to the called function.
If a call instruction returns from the containing block, then it is a tail call, and is guaranteed to be executed as such. This is important for preventing stack overflows.
- call.n func, num
  Call the function func, passing it num arguments. The next num returns after this instruction will be used as the parameters to pass to func.

NOTE: This section is incomplete.

Instruction Encoding

When encoded, each instruction takes up 4 bytes (32 bits) of space, and is treated as a little-endian integer. Each instruction must be aligned to 4 bytes. There are multiple encoding formats: each one assigns different meanings to the bits in the instruction. The format of an instruction can be deduced from its bits.

Here is an overview of all instruction encoding formats:

+-----------------------------------------+
| 0000 XXXX XXXX XXXX XXXX XXXX XXXX XXXX | nop
| 0001 ???? ???? ???? ???? ???? ???? ???? | spec
| 001I AAAA AABB BBBB 0000 CCCX XXDD DXXX | comp-m-m
| 001I AAAA AABB BBBB TT00 CCCE EEDD DXXX | comp-mm-m
| 001I AAAA AABB BBBB 00TT CCCX XXDD DFFF | comp-m-mm
| 001I AAAA AABB BBBB TTTT CCCE EEDD DFFF | comp-mm-mm
| 010I ROOO OOOO OOOO SSSS SSXX XXSS SSSS | 2out-ss
| 011I ROOO OOOO OOOO LLLL LLLL LLSS SSSS | 2out-ls
| 100I OOOO OOOO OOOO SSSS SSXX XXXX XXXX | 1out-s
| 101I OOOO OOOO OOOO LLLL LLLL LLXX XXXX | 1out-l
| 110I OOOO OOOO OOOO SSSS SSXX XXSS SSSS | 1out-copy-ss
| 111I OOOO OOOO OOOO LLLL LLLL LLSS SSSS | 1out-copy-ls
+-----------------------------------------+

Across all formats, bits labelled X are used as immediate values, via a process called constant smearing; they are not used by the instruction itself. A bit labelled I specifies that the instruction actually uses an immediate value as input. Specifics about each format are discussed below.

nop instructions have no purpose except to provide immediate value bits for future instructions.
spec instructions have undefined semantics. The meanings of the bits in this format have not yet been specified, and so they must not be used.
Regular instructions are categorized by the number of outputs the internal operation has. If the internal operation has two outputs, then it follows the 2out format. If the operation has one output, and this output is only used once, then the 1out format is used. If the operation has one output, but it is copied and used twice, the 1out-copy format is used.
Each of these formats comes in -s (short) and -l (long) variants, determining how many bits are in the first output operand of each instruction. Short variants only have 6 bits; long variants have 10 bits, providing a range 16 times larger.
```
| 010I ROOO OOOO OOOO SSSS SSXX XXSS SSSS | 2out-ss
| 011I ROOO OOOO OOOO LLLL LLLL LLSS SSSS | 2out-ls
| 100I OOOO OOOO OOOO SSSS SSXX XXXX XXXX | 1out-s
| 101I OOOO OOOO OOOO LLLL LLLL LLXX XXXX | 1out-l
| 110I OOOO OOOO OOOO SSSS SSXX XXSS SSSS | 1out-copy-ss
| 111I OOOO OOOO OOOO LLLL LLLL LLSS SSSS | 1out-copy-ls
```
The O bits specify the generic operation the instruction executes. The S bits represent data from a short output operand. The L bits represent data from a long output operand. The R bit, provided only for the 2out format, specify that the outputs of the internal operation are reversed (so the second output of the internal operation is the first output of the instruction).
comp instructions are compressed; they contain two independent instructions within them, and are useful for encoding common instructions. Each sub-instruction has all the same data as a regular uncompressed instruction, except that it has less space to store that data. Instead of long and short output operands, it only has mini output operands, which have 3 bits.
```
| 001I AAAA AABB BBBB 0000 CCCX XXDD DXXX | comp-m-m
| 001I AAAA AABB BBBB TT00 CCCE EEDD DXXX | comp-mm-m
| 001I AAAA AABB BBBB 00UU CCCX XXDD DFFF | comp-m-mm
| 001I AAAA AABB BBBB TTUU CCCE EEDD DFFF | comp-mm-mm
```
The A and B bits specify the generic operation for each instruction. The C and D bits specify the data for the first output operand, and the E and F bits specify the data for the second output operand (if any). The T and U bits specify the number and handling of the output operands: 00 matches 1out, and 01 matches 1out-copy; 10 matches 2out, and 11 with outputs reversed.
Like regular instructions, comp instructions can take an immediate value; however, this immediate is provided specifically to the first sub-instruction.

Output operands specify destination instructions, which the associated value will be sent to. They are simple non-negative offsets relative to the current instruction; thus an offset of one refers to the next instruction. A zero offset, which would refer to the current instruction itself and thus be nonsensical, is interpreted differently. If the first output operand of a regular instruction is a zero offset, then the instruction terminates the containing bock, and its output value is the return value. The second output operand of a regular/compressed instruction is interpreted as an offset relative to the instruction referenced by the first output; if it is a zero offset, it refers to the same instruction as the first output. However, compressed instructions do not have this zero-offset interpretation; instead, a zero offset in the first output of either sub-part of a compressed instruction is interpreted as a offset of eight instructions.

Instructions which do not use immediate values can still specify bits for immediates in their format. These extraneous immediate value bits are accumulated, and the first following instruction to use an immediate value will receive these accumulated bits in addition to any it carries. This system is called constant smearing, and it reduces the wastage of unused bits in instructions.

More specifically, during decoding, all the immediate value bits in the instruction are extracted (in the same order they appear) and shifted in as the new least-significant bits of a special immediate value accumulator register. When an instruction which uses an immediate value is being decoded, this register’s contents are used as the immediate value, and the register is zeroed. If the register is zero when new bits are inserted, then the most-significant new bit is duplicated to all the remaining register bits; this is known as sign-extension.

Output operands do not specify which input of the destination instruction their computed output will be; they only specify which instruction the output will be sent to. As such, the inputs to each instruction are accumulated in the order they appear, and this is the order in which the instruction receives them. Operations with multiple inputs thus need many variants to account for the various orderings their inputs could appear in.