Introduction
Forist is an instruction-set architecture (ISA) designed for modern desktop CPUs. By encoding a better understanding of how programs are executed into the instruction set, Forist aims to improve the usefulness of the information provided to CPUs.
In particular, here are some distinguishing features of Forist’s model of execution:
Instructions are executed out-of-order. Rather than waiting for each instruction to finish before executing the next, instructions are executed in parallel; they begin executing once the instructions they depend upon finish.
Conditional execution is speculated. CPUs predict the code they have to execute, then execute it speculatively. If a prediction is incorrect, then the effects of the speculation are canceled, to the greatest extent possible.
Memory accesses are cached. If a portion of RAM is accessed, then it is cached so that nearby accesses are temporarily fast. Accesses to data that is already cached can be canceled in the context of speculative execution.
CPUs are multicore. Cores operate independently, but can communicate and synchronize with one another. All cores share access to RAM, but can have independent caches – they may have different cache data for the same memory.
CPUs execute statically-verified input. The instructions passed to CPUs are guaranteed to be safe to execute: however, they can make use of side channels to attack the system. Programs are expected to protect secret data of their own accord so that it is not leaked through side channels.
Forist is an ISA (instruction set architecture) designed for modern desktop CPUs.
Features
Forist is designed around the following premises.
Speculative execution is difficult. The Spectre vulnerability showed that instructions with side effects (such as interactions with memory) are very difficult to speculatively execute safely. However, speculative execution is critical to performance – it is the CPU’s ability to dynamically learn and apply information about code. Forist provides safe alternatives to many cases where the speculative execution of side effects is necessary – for example, it supports cache-guaranteeing pointers, reading from and writing to which is free of side effects. Forist also allows “disabling” speculative execution for a branch – if the branch condition is known far enough in advance, performance is unaffected.
The use of registers hinders out-of-order execution. Modern CPUs begin executing instructions even before previous instructions are complete, by tracking how different instructions depend upon each other. However, register-based ISAs such as x86 communicate data between instructions indirectly (via registers), which obfuscates this information. Forist does not use registers – instead, instructions directly specify which instructions need their results. This significantly simplifies the CPU’s work, and provides even more information than is available with register-based ISAs.
Register renaming obviates register allocation. Compilers for register-based ISAs have to make decisions about which values to place in which registers. However, modern CPUs are making the same decisions dynamically anyways, due to the need for additional registers for out-of-order execution. In Forist, the CPU is wholly responsible for register allocation, and is expected to automatically store data on the stack when necessary. Compilers for Forist thus skip the problem of register allocation entirely.
Most values are used once or twice. When complex calculations are performed, many intermediary values are calculated which are each used once. Forist’s instruction encoding is specially optimized for this: instructions can only have one or two destination instructions (the
copy
instruction allows specifying more destinations). If just one destination is used, the space for the second destination is reused for storing additional data.Atomics are difficult to implement and use. Atomic memory operations have subtle semantics that are difficult to understand. In hardware, they are often too powerful: they require cooperation between different cores and usually operate at cache-line granularity, which is larger than intended. This leads to unexpected and complex performance implications. Forist operates upon a completely different model, based on inter-process communication. CPUs are informed about running tasks, and special instructions are provided to communicate individual values between them. Cache coherency is not maintained automatically: two different tasks may believe that two different values are stored at the same memory location. Explicit instructions for flushing the cache are provided, which tasks are expected to use before and after communicating with each other.
Untrusted code should not be executed. Traditional architectures support the execution of untrusted and possibly-malicious code, by distinguishing between different contexts (typically just “kernel space”, for the operating system, and “user space”, for user applications). However, transitions between these modes are difficult and slow because they clean up exploitable side channels like caches. Forist does not support the execution of untrusted code at all: instead, it is expected that static analysis and language-based software verification (as in Rust) are used to verify the safety of programs before they are executed.
The Basics
CPUs execute instructions. Instructions are instances of operations: they associate operations with a set of inputs and a set of outputs. Operations denote transformations from a set of input values to a set of output values. Values are individual units of data, and have fixed types which specify their size, semantics, and storage. Standard types are integers and pointers, both of which have unspecified sizes. Operations are generic: they accept various numbers and types of inputs, and the exact semantics of the operation and the types of the outputs can be determined from them. Instructions need not use all the outputs of the operation, and can copy the output values to multiple destination instructions.
To be executed by the CPU, instructions have to be encoded into a format called machine code. Machine code is designed to be easy for CPUs to execute, but not necessarily easy for other uses of instructions. Within a compiler, instructions may be represented differently: their functionality is fundamentally the same.
Instructions can be represented as a graph, where each node is an instruction, every incoming edge is an input, and every outgoing edge is an output. This is the mathematical representation of the instructions, and it is useful for mathematical analyses of the properties of the instructions. Depending upon the context, various constraints may be placed upon the instructions in this graph format: for example, when encoded, instructions can only have at most 4 inputs, and can only have 2 outputs (including copies).
Instructions are organized into blocks, which are organized into functions. Both functions and blocks represent independent groups of instructions which can be executed multiple times. Blocks can only be called by other blocks within the same function, whereas functions can be called from anywhere.
Types
Forist has a number of types for values. The semantics of operations vary depending upon the types of the inputs involved, in a manner equivalent to function overloading in C++ and other such languages.
void
is the special type which has exactly one valid value. It is the equivalent of the return typevoid
in C-like languages. It takes up no space. Instructions ignore anyvoid
inputs they receive, but the instructions computingvoid
s will be evaluated before the instructions using thevoid
s are.bool
is a regular boolean value. It effectively takes up just one bit of space, and must be eithertrue
orfalse
.xnat
is a category of integer types, all of which have a “natural width”. This means that they all have the same unknown number of bitsN
(commonly 32 or 64). Integers of this size can be operated upon most efficiently by the implementation.unat
andsnat
are the unsigned and signed integer types, respectively.unat
has a range of0 .. 2^N
andsnat
has a range of-2^(N-1) .. +2^(N-1)
(in both cases, the lower bound is inclusive and the upper bound is exclusive).Both of these integer types are range-checked; by default, if any computation resulting in a
unat
orsnat
value that is out of their respective ranges, then an overflow exception occurs, and the executing program is terminated.cnat
is the carrying integer type. This is intended to be used for arithmetic with multi-word values; the most-significant word should be ofunat
orsnat
type, which all other words should be ofcnat
type.Unlike
unat
andsnat
,cnat
effectively acts like an integer modulo2^N
. When requested, addition and subtraction withcnat
outputs a carry bit, which can be used to perform regular multi-word arithmetic. Although it is fairly similar to aunat
, arithmetic with it will not fail upon overflow.
Operations
Forist’s operations can be organized into the following categories:
- Computation: integer arithmetic, bitwise operations, and comparisons.
- Control Flow: conditional selection, branch predication, and blocks.
- Memory: reading, writing, stack allocations, and cache control.
More operations (and more categories) can be defined in the future.
Computation
Addition.
add lhs, rhs
Add together
lhs
andrhs
.adc lhs, rhs, carry
Add together
lhs
andrhs
, and increment the result ifcarry
istrue
.add.mul lhs, rhs, mul
Adds
lhs
to the product ofrhs
andmul
.This is an alias for
mul.add rhs, mul, lhs
.add.shl lhs, rhs, shl
Adds
lhs
to the left-shift ofrhs
byshl
.This is an alias for
shl.add rhs, shl, lhs
.
Subtraction.
sub lhs, rhs
Subtract
lhs
fromrhs
.sbb lhs, rhs, borrow
Subtract
rhs
fromlhs
, and decrement the result ifborrow
istrue
.sub.mul lhs, rhs, mul
Subtract the product of
rhs
andmul
fromlhs
.sub.shl lhs, rhs, shl
Subtract the left-shift of
rhs
byshl
fromlhs
.
Multiplication.
mul lhs, rhs
Multiply
lhs
byrhs
.mul.add lhs, rhs, add
Multiply
lhs
byrhs
, then addadd
.mul.sub lhs, rhs, sub
Multiply
lhs
byrhs, then subtract
sub`.
Division.
div lhs, rhs
Divide
lhs
byrhs
.div.dbl lhs, top, rhs
Divide the double-word consisting of
lhs
(low word) andtop
(high word) byrhs
.div.add lhs, add, rhs
Divide the sum of
lhs
andadd
byrhs
.div.sub lhs, sub, rhs
Divide the difference between
lhs
andsub
byrhs
.
Remainder.
rem lhs, rhs
Compute the remainder of dividing
lhs
byrhs
.rem.dbl lhs, top, rhs
Compute the remainder of dividing the double-word consisting of
lhs
(low word) andtop
(high word) byrhs
.
Shift left.
shl lhs, rhs
Shift
lhs
to the left byrhs
.shl.add lhs, rhs, add
Shift
lhs
to the left byrhs
, then addadd
.shl.sub lhs, rhs, sub
Shift
lhs
to the left byrhs, then subtract
sub`.
Shift right.
shr lhs, rhs
Shift
lhs
to the right byrhs
.shr.add lhs, add, rhs
Shift the sum of
lhs
andadd
to the right byrhs
.shr.sub lhs, sub, rhs
.Shift the difference between
lhs
andsub
to the right byrhs
.
Extract trailing bits.
etb lhs, rhs
Extract the
rhs
least-significant bits oflhs
.
Elementary integer arithmetic.
inc val
Increment
val
.dec val
Decrement
val
neg val
Negate
val
.abs val
Compute the signed absolute value of
val
.min lhs, rhs
Compute the minimum of
lhs
andrhs
.max lhs, rhs
Compute the maximum of
lhs
andrhs
.
Common bitwise arithmetic.
not val
Compute the bitwise NOT of
val
.rev val
Reverse the bits in
val
.and lhs, rhs
Compute the bitwise AND of
lhs
andrhs
.and.not lhs, rhs
Compute the bitwise AND of
lhs
and the bitwise NOT ofrhs
.not.and lhs, rhs
Compute the bitwise NOT of the bitwise AND of
lhs
andrhs
.ior lhs, rhs
Compute the bitwise inclusive-OR of
lhs
andrhs
.ior.not lhs, rhs
Compute the bitwise inclusive-OR of
lhs
and the bitwise NOT ofrhs
.not.ior lhs, rhs
Compute the bitwise NOT of the bitwise inclusive-OR of
lhs
andrhs
.xor lhs, rhs
Compute the bitwise exclusive-OR of
lhs
andrhs
.xor.not lhs, rhs
Compute the bitwise exclusive-OR of
lhs
and the bitwise NOT ofrhs
.not.xor lhs, rhs
Compute the bitwise NOT of the bitwise exclusive-OR of
lhs
andrhs
.This is an alias for
xor.not lhs, rhs
.
Bit counting.
cnt.ao val
Count the number of one-bits in
val
.cnt.az val
Count the number of zero-bits in
val
.cnt.tz val
Count the number of trailing consecutive least-significant zero-bits in
val
.cnt.lz val
Count the number of leading consecutive most-significant zero-bits in
val
.
Comparisons.
cmp.iz val
Checks if
val
is zero.cmp.nz val
Checks if
val
is not zero.cmp.eq lhs, rhs
Checks if
lhs
andrhs
are equal.cmp.ne lhs, rhs
Checks if
lhs
andrhs
are not equal.cmp.lt lhs, rhs
Checks if
lhs
is strictly lesser thanrhs
.cmp.gt lhs, rhs
Checks if
lhs
is strictly greater thanrhs
.cmp.le lhs, rhs
Checks if
lhs
is lesser than or equal torhs
.cmp.ge lhs, rhs
Checks if
lhs
is greater than or equal torhs
.cmp.in lhs, mid, rhs
Checks that
lhs
is lesser than or equal tomid
and thatmid
is strictly lesser thanrhs
; i.e. thatmid
fits in the range betweenlhs
andrhs
.
Control Flow
Value handling.
dup val, ...
Return
val
, ignoring all other inputs.cst type, val, ...
Return
val
converted to typetype
, ignoring all other inputs.
Branch predication.
Forist provides the ability to conditionally execute branches which do not have any side effects without leaking the branch condition through side-channels. This is achieved by laying out the code of all involved branches together, but using the instruction
asif
to transparently disable the code for all branches except one. When executed, the code for all branches will be evaluated, but only the output from the enabled branch will be visible, as if no other branch was computed (their results were discarded).To achieve this, all values must be accompanied by an additional predication bit, which indicates whether each value has not been disabled by an
asif
. Instructions will be executed as usual, but if the predication bit of any input is not set, then that of the instruction’s outputs will not be set either.asif cnd
This instruction has two
void
outputs; the predication bit of the first will be set ifcnd
is true, and the predication bit of the second will be set ifcnd
is false.
Conditional selection.
sel cnd, lhs, rhs
Select between
lhs
andrhs
, based oncnd
. If it istrue
,lhs
is selected; otherwiserhs
is.
Function calls.
A major advantage of Forist is its ability to select registers entirely dynamically, so that data does not have to be pushed to the stack unnecessarily. To this end, limiting the number of parameters that can be passed directly to functions is unhelpful, because it forces additional parameters to always be pushed to the stack rather than determining where to place them at run-time. However, all other instructions in Forist have a fixed number of inputs.
To work around this, the
call.n
instruction declares a new block; these instructions will be executed even as the function being called is loaded, preventing the occurrence of CPU pipeline stalls if the block is big enough (i.e. if the function in question is known early enough). This block contains multiple return instructions; the values that these instructions return are the parameters passed to the called function.If a
call
instruction returns from the containing block, then it is a tail call, and is guaranteed to be executed as such. This is important for preventing stack overflows.call.n func, num
Call the function
func
, passing itnum
arguments. The nextnum
returns after this instruction will be used as the parameters to pass tofunc
.
NOTE: This section is incomplete.
Instruction Encoding
When encoded, each instruction takes up 4 bytes (32 bits) of space, and is treated as a little-endian integer. Each instruction must be aligned to 4 bytes. There are multiple encoding formats: each one assigns different meanings to the bits in the instruction. The format of an instruction can be deduced from its bits.
Here is an overview of all instruction encoding formats:
+-----------------------------------------+
| 0000 XXXX XXXX XXXX XXXX XXXX XXXX XXXX | nop
| 0001 ???? ???? ???? ???? ???? ???? ???? | spec
| 001I AAAA AABB BBBB 0000 CCCX XXDD DXXX | comp-m-m
| 001I AAAA AABB BBBB TT00 CCCE EEDD DXXX | comp-mm-m
| 001I AAAA AABB BBBB 00TT CCCX XXDD DFFF | comp-m-mm
| 001I AAAA AABB BBBB TTTT CCCE EEDD DFFF | comp-mm-mm
| 010I ROOO OOOO OOOO SSSS SSXX XXSS SSSS | 2out-ss
| 011I ROOO OOOO OOOO LLLL LLLL LLSS SSSS | 2out-ls
| 100I OOOO OOOO OOOO SSSS SSXX XXXX XXXX | 1out-s
| 101I OOOO OOOO OOOO LLLL LLLL LLXX XXXX | 1out-l
| 110I OOOO OOOO OOOO SSSS SSXX XXSS SSSS | 1out-copy-ss
| 111I OOOO OOOO OOOO LLLL LLLL LLSS SSSS | 1out-copy-ls
+-----------------------------------------+
Across all formats, bits labelled X
are used as immediate values, via a process called
constant smearing; they are not used by the instruction itself. A bit labelled I
specifies that the instruction actually uses an immediate value as input. Specifics about
each format are discussed below.
nop
instructions have no purpose except to provide immediate value bits for future instructions.spec
instructions have undefined semantics. The meanings of the bits in this format have not yet been specified, and so they must not be used.Regular instructions are categorized by the number of outputs the internal operation has. If the internal operation has two outputs, then it follows the
2out
format. If the operation has one output, and this output is only used once, then the1out
format is used. If the operation has one output, but it is copied and used twice, the1out-copy
format is used.Each of these formats comes in
-s
(short) and-l
(long) variants, determining how many bits are in the first output operand of each instruction. Short variants only have 6 bits; long variants have 10 bits, providing a range 16 times larger.| 010I ROOO OOOO OOOO SSSS SSXX XXSS SSSS | 2out-ss | 011I ROOO OOOO OOOO LLLL LLLL LLSS SSSS | 2out-ls | 100I OOOO OOOO OOOO SSSS SSXX XXXX XXXX | 1out-s | 101I OOOO OOOO OOOO LLLL LLLL LLXX XXXX | 1out-l | 110I OOOO OOOO OOOO SSSS SSXX XXSS SSSS | 1out-copy-ss | 111I OOOO OOOO OOOO LLLL LLLL LLSS SSSS | 1out-copy-ls
The
O
bits specify the generic operation the instruction executes. TheS
bits represent data from a short output operand. TheL
bits represent data from a long output operand. TheR
bit, provided only for the2out
format, specify that the outputs of the internal operation are reversed (so the second output of the internal operation is the first output of the instruction).comp
instructions are compressed; they contain two independent instructions within them, and are useful for encoding common instructions. Each sub-instruction has all the same data as a regular uncompressed instruction, except that it has less space to store that data. Instead of long and short output operands, it only has mini output operands, which have 3 bits.| 001I AAAA AABB BBBB 0000 CCCX XXDD DXXX | comp-m-m | 001I AAAA AABB BBBB TT00 CCCE EEDD DXXX | comp-mm-m | 001I AAAA AABB BBBB 00UU CCCX XXDD DFFF | comp-m-mm | 001I AAAA AABB BBBB TTUU CCCE EEDD DFFF | comp-mm-mm
The
A
andB
bits specify the generic operation for each instruction. TheC
andD
bits specify the data for the first output operand, and theE
andF
bits specify the data for the second output operand (if any). TheT
andU
bits specify the number and handling of the output operands:00
matches1out
, and01
matches1out-copy
;10
matches2out
, and11
with outputs reversed.Like regular instructions,
comp
instructions can take an immediate value; however, this immediate is provided specifically to the first sub-instruction.
Output operands specify destination instructions, which the associated value will be sent to. They are simple non-negative offsets relative to the current instruction; thus an offset of one refers to the next instruction. A zero offset, which would refer to the current instruction itself and thus be nonsensical, is interpreted differently. If the first output operand of a regular instruction is a zero offset, then the instruction terminates the containing bock, and its output value is the return value. The second output operand of a regular/compressed instruction is interpreted as an offset relative to the instruction referenced by the first output; if it is a zero offset, it refers to the same instruction as the first output. However, compressed instructions do not have this zero-offset interpretation; instead, a zero offset in the first output of either sub-part of a compressed instruction is interpreted as a offset of eight instructions.
Instructions which do not use immediate values can still specify bits for immediates in their format. These extraneous immediate value bits are accumulated, and the first following instruction to use an immediate value will receive these accumulated bits in addition to any it carries. This system is called constant smearing, and it reduces the wastage of unused bits in instructions.
More specifically, during decoding, all the immediate value bits in the instruction are extracted (in the same order they appear) and shifted in as the new least-significant bits of a special immediate value accumulator register. When an instruction which uses an immediate value is being decoded, this register’s contents are used as the immediate value, and the register is zeroed. If the register is zero when new bits are inserted, then the most-significant new bit is duplicated to all the remaining register bits; this is known as sign-extension.
Output operands do not specify which input of the destination instruction their computed output will be; they only specify which instruction the output will be sent to. As such, the inputs to each instruction are accumulated in the order they appear, and this is the order in which the instruction receives them. Operations with multiple inputs thus need many variants to account for the various orderings their inputs could appear in.