Forist

updated

Introduction

Forist is an instruction-set architecture (ISA) designed for modern desktop CPUs. By encoding a better understanding of how programs are executed into the instruction set, Forist aims to improve the usefulness of the information provided to CPUs.

In particular, here are some distinguishing features of Forist’s model of execution:


Forist is an ISA (instruction set architecture) designed for modern desktop CPUs.

Features

Forist is designed around the following premises.

  1. Speculative execution is difficult. The Spectre vulnerability showed that instructions with side effects (such as interactions with memory) are very difficult to speculatively execute safely. However, speculative execution is critical to performance – it is the CPU’s ability to dynamically learn and apply information about code. Forist provides safe alternatives to many cases where the speculative execution of side effects is necessary – for example, it supports cache-guaranteeing pointers, reading from and writing to which is free of side effects. Forist also allows “disabling” speculative execution for a branch – if the branch condition is known far enough in advance, performance is unaffected.

  2. The use of registers hinders out-of-order execution. Modern CPUs begin executing instructions even before previous instructions are complete, by tracking how different instructions depend upon each other. However, register-based ISAs such as x86 communicate data between instructions indirectly (via registers), which obfuscates this information. Forist does not use registers – instead, instructions directly specify which instructions need their results. This significantly simplifies the CPU’s work, and provides even more information than is available with register-based ISAs.

  3. Register renaming obviates register allocation. Compilers for register-based ISAs have to make decisions about which values to place in which registers. However, modern CPUs are making the same decisions dynamically anyways, due to the need for additional registers for out-of-order execution. In Forist, the CPU is wholly responsible for register allocation, and is expected to automatically store data on the stack when necessary. Compilers for Forist thus skip the problem of register allocation entirely.

  4. Most values are used once or twice. When complex calculations are performed, many intermediary values are calculated which are each used once. Forist’s instruction encoding is specially optimized for this: instructions can only have one or two destination instructions (the copy instruction allows specifying more destinations). If just one destination is used, the space for the second destination is reused for storing additional data.

  5. Atomics are difficult to implement and use. Atomic memory operations have subtle semantics that are difficult to understand. In hardware, they are often too powerful: they require cooperation between different cores and usually operate at cache-line granularity, which is larger than intended. This leads to unexpected and complex performance implications. Forist operates upon a completely different model, based on inter-process communication. CPUs are informed about running tasks, and special instructions are provided to communicate individual values between them. Cache coherency is not maintained automatically: two different tasks may believe that two different values are stored at the same memory location. Explicit instructions for flushing the cache are provided, which tasks are expected to use before and after communicating with each other.

  6. Untrusted code should not be executed. Traditional architectures support the execution of untrusted and possibly-malicious code, by distinguishing between different contexts (typically just “kernel space”, for the operating system, and “user space”, for user applications). However, transitions between these modes are difficult and slow because they clean up exploitable side channels like caches. Forist does not support the execution of untrusted code at all: instead, it is expected that static analysis and language-based software verification (as in Rust) are used to verify the safety of programs before they are executed.

The Basics

CPUs execute instructions. Instructions are instances of operations: they associate operations with a set of inputs and a set of outputs. Operations denote transformations from a set of input values to a set of output values. Values are individual units of data, and have fixed types which specify their size, semantics, and storage. Standard types are integers and pointers, both of which have unspecified sizes. Operations are generic: they accept various numbers and types of inputs, and the exact semantics of the operation and the types of the outputs can be determined from them. Instructions need not use all the outputs of the operation, and can copy the output values to multiple destination instructions.

To be executed by the CPU, instructions have to be encoded into a format called machine code. Machine code is designed to be easy for CPUs to execute, but not necessarily easy for other uses of instructions. Within a compiler, instructions may be represented differently: their functionality is fundamentally the same.

Instructions can be represented as a graph, where each node is an instruction, every incoming edge is an input, and every outgoing edge is an output. This is the mathematical representation of the instructions, and it is useful for mathematical analyses of the properties of the instructions. Depending upon the context, various constraints may be placed upon the instructions in this graph format: for example, when encoded, instructions can only have at most 4 inputs, and can only have 2 outputs (including copies).

Instructions are organized into blocks, which are organized into functions. Both functions and blocks represent independent groups of instructions which can be executed multiple times. Blocks can only be called by other blocks within the same function, whereas functions can be called from anywhere.

Types

Forist has a number of types for values. The semantics of operations vary depending upon the types of the inputs involved, in a manner equivalent to function overloading in C++ and other such languages.

Operations

Forist’s operations can be organized into the following categories:

More operations (and more categories) can be defined in the future.

Computation

  • Addition.

    • add lhs, rhs

      Add together lhs and rhs.

    • adc lhs, rhs, carry

      Add together lhs and rhs, and increment the result if carry is true.

    • add.mul lhs, rhs, mul

      Adds lhs to the product of rhs and mul.

      This is an alias for mul.add rhs, mul, lhs.

    • add.shl lhs, rhs, shl

      Adds lhs to the left-shift of rhs by shl.

      This is an alias for shl.add rhs, shl, lhs.

  • Subtraction.

    • sub lhs, rhs

      Subtract lhs from rhs.

    • sbb lhs, rhs, borrow

      Subtract rhs from lhs, and decrement the result if borrow is true.

    • sub.mul lhs, rhs, mul

      Subtract the product of rhs and mul from lhs.

    • sub.shl lhs, rhs, shl

      Subtract the left-shift of rhs by shl from lhs.

  • Multiplication.

    • mul lhs, rhs

      Multiply lhs by rhs.

    • mul.add lhs, rhs, add

      Multiply lhs by rhs, then add add.

    • mul.sub lhs, rhs, sub

      Multiply lhs by rhs, then subtract sub`.

  • Division.

    • div lhs, rhs

      Divide lhs by rhs.

    • div.dbl lhs, top, rhs

      Divide the double-word consisting of lhs (low word) and top (high word) by rhs.

    • div.add lhs, add, rhs

      Divide the sum of lhs and add by rhs.

    • div.sub lhs, sub, rhs

      Divide the difference between lhs and sub by rhs.

  • Remainder.

    • rem lhs, rhs

      Compute the remainder of dividing lhs by rhs.

    • rem.dbl lhs, top, rhs

      Compute the remainder of dividing the double-word consisting of lhs (low word) and top (high word) by rhs.

  • Shift left.

    • shl lhs, rhs

      Shift lhs to the left by rhs.

    • shl.add lhs, rhs, add

      Shift lhs to the left by rhs, then add add.

    • shl.sub lhs, rhs, sub

      Shift lhs to the left by rhs, then subtract sub`.

  • Shift right.

    • shr lhs, rhs

      Shift lhs to the right by rhs.

    • shr.add lhs, add, rhs

      Shift the sum of lhs and add to the right by rhs.

    • shr.sub lhs, sub, rhs.

      Shift the difference between lhs and sub to the right by rhs.

  • Extract trailing bits.

    • etb lhs, rhs

      Extract the rhs least-significant bits of lhs.

  • Elementary integer arithmetic.

    • inc val

      Increment val.

    • dec val

      Decrement val

    • neg val

      Negate val.

    • abs val

      Compute the signed absolute value of val.

    • min lhs, rhs

      Compute the minimum of lhs and rhs.

    • max lhs, rhs

      Compute the maximum of lhs and rhs.

  • Common bitwise arithmetic.

    • not val

      Compute the bitwise NOT of val.

    • rev val

      Reverse the bits in val.

    • and lhs, rhs

      Compute the bitwise AND of lhs and rhs.

    • and.not lhs, rhs

      Compute the bitwise AND of lhs and the bitwise NOT of rhs.

    • not.and lhs, rhs

      Compute the bitwise NOT of the bitwise AND of lhs and rhs.

    • ior lhs, rhs

      Compute the bitwise inclusive-OR of lhs and rhs.

    • ior.not lhs, rhs

      Compute the bitwise inclusive-OR of lhs and the bitwise NOT of rhs.

    • not.ior lhs, rhs

      Compute the bitwise NOT of the bitwise inclusive-OR of lhs and rhs.

    • xor lhs, rhs

      Compute the bitwise exclusive-OR of lhs and rhs.

    • xor.not lhs, rhs

      Compute the bitwise exclusive-OR of lhs and the bitwise NOT of rhs.

    • not.xor lhs, rhs

      Compute the bitwise NOT of the bitwise exclusive-OR of lhs and rhs.

      This is an alias for xor.not lhs, rhs.

  • Bit counting.

    • cnt.ao val

      Count the number of one-bits in val.

    • cnt.az val

      Count the number of zero-bits in val.

    • cnt.tz val

      Count the number of trailing consecutive least-significant zero-bits in val.

    • cnt.lz val

      Count the number of leading consecutive most-significant zero-bits in val.

  • Comparisons.

    • cmp.iz val

      Checks if val is zero.

    • cmp.nz val

      Checks if val is not zero.

    • cmp.eq lhs, rhs

      Checks if lhs and rhs are equal.

    • cmp.ne lhs, rhs

      Checks if lhs and rhs are not equal.

    • cmp.lt lhs, rhs

      Checks if lhs is strictly lesser than rhs.

    • cmp.gt lhs, rhs

      Checks if lhs is strictly greater than rhs.

    • cmp.le lhs, rhs

      Checks if lhs is lesser than or equal to rhs.

    • cmp.ge lhs, rhs

      Checks if lhs is greater than or equal to rhs.

    • cmp.in lhs, mid, rhs

      Checks that lhs is lesser than or equal to mid and that mid is strictly lesser than rhs; i.e. that mid fits in the range between lhs and rhs.

Control Flow

  • Value handling.

    • dup val, ...

      Return val, ignoring all other inputs.

    • cst type, val, ...

      Return val converted to type type, ignoring all other inputs.

  • Branch predication.

    Forist provides the ability to conditionally execute branches which do not have any side effects without leaking the branch condition through side-channels. This is achieved by laying out the code of all involved branches together, but using the instruction asif to transparently disable the code for all branches except one. When executed, the code for all branches will be evaluated, but only the output from the enabled branch will be visible, as if no other branch was computed (their results were discarded).

    To achieve this, all values must be accompanied by an additional predication bit, which indicates whether each value has not been disabled by an asif. Instructions will be executed as usual, but if the predication bit of any input is not set, then that of the instruction’s outputs will not be set either.

    • asif cnd

      This instruction has two void outputs; the predication bit of the first will be set if cnd is true, and the predication bit of the second will be set if cnd is false.

  • Conditional selection.

    • sel cnd, lhs, rhs

      Select between lhs and rhs, based on cnd. If it is true, lhs is selected; otherwise rhs is.

  • Function calls.

    A major advantage of Forist is its ability to select registers entirely dynamically, so that data does not have to be pushed to the stack unnecessarily. To this end, limiting the number of parameters that can be passed directly to functions is unhelpful, because it forces additional parameters to always be pushed to the stack rather than determining where to place them at run-time. However, all other instructions in Forist have a fixed number of inputs.

    To work around this, the call.n instruction declares a new block; these instructions will be executed even as the function being called is loaded, preventing the occurrence of CPU pipeline stalls if the block is big enough (i.e. if the function in question is known early enough). This block contains multiple return instructions; the values that these instructions return are the parameters passed to the called function.

    If a call instruction returns from the containing block, then it is a tail call, and is guaranteed to be executed as such. This is important for preventing stack overflows.

    • call.n func, num

      Call the function func, passing it num arguments. The next num returns after this instruction will be used as the parameters to pass to func.

NOTE: This section is incomplete.

Instruction Encoding

When encoded, each instruction takes up 4 bytes (32 bits) of space, and is treated as a little-endian integer. Each instruction must be aligned to 4 bytes. There are multiple encoding formats: each one assigns different meanings to the bits in the instruction. The format of an instruction can be deduced from its bits.

Here is an overview of all instruction encoding formats:

+-----------------------------------------+
| 0000 XXXX XXXX XXXX XXXX XXXX XXXX XXXX | nop
| 0001 ???? ???? ???? ???? ???? ???? ???? | spec
| 001I AAAA AABB BBBB 0000 CCCX XXDD DXXX | comp-m-m
| 001I AAAA AABB BBBB TT00 CCCE EEDD DXXX | comp-mm-m
| 001I AAAA AABB BBBB 00TT CCCX XXDD DFFF | comp-m-mm
| 001I AAAA AABB BBBB TTTT CCCE EEDD DFFF | comp-mm-mm
| 010I ROOO OOOO OOOO SSSS SSXX XXSS SSSS | 2out-ss
| 011I ROOO OOOO OOOO LLLL LLLL LLSS SSSS | 2out-ls
| 100I OOOO OOOO OOOO SSSS SSXX XXXX XXXX | 1out-s
| 101I OOOO OOOO OOOO LLLL LLLL LLXX XXXX | 1out-l
| 110I OOOO OOOO OOOO SSSS SSXX XXSS SSSS | 1out-copy-ss
| 111I OOOO OOOO OOOO LLLL LLLL LLSS SSSS | 1out-copy-ls
+-----------------------------------------+

Across all formats, bits labelled X are used as immediate values, via a process called constant smearing; they are not used by the instruction itself. A bit labelled I specifies that the instruction actually uses an immediate value as input. Specifics about each format are discussed below.

Output operands specify destination instructions, which the associated value will be sent to. They are simple non-negative offsets relative to the current instruction; thus an offset of one refers to the next instruction. A zero offset, which would refer to the current instruction itself and thus be nonsensical, is interpreted differently. If the first output operand of a regular instruction is a zero offset, then the instruction terminates the containing bock, and its output value is the return value. The second output operand of a regular/compressed instruction is interpreted as an offset relative to the instruction referenced by the first output; if it is a zero offset, it refers to the same instruction as the first output. However, compressed instructions do not have this zero-offset interpretation; instead, a zero offset in the first output of either sub-part of a compressed instruction is interpreted as a offset of eight instructions.

Instructions which do not use immediate values can still specify bits for immediates in their format. These extraneous immediate value bits are accumulated, and the first following instruction to use an immediate value will receive these accumulated bits in addition to any it carries. This system is called constant smearing, and it reduces the wastage of unused bits in instructions.

More specifically, during decoding, all the immediate value bits in the instruction are extracted (in the same order they appear) and shifted in as the new least-significant bits of a special immediate value accumulator register. When an instruction which uses an immediate value is being decoded, this register’s contents are used as the immediate value, and the register is zeroed. If the register is zero when new bits are inserted, then the most-significant new bit is duplicated to all the remaining register bits; this is known as sign-extension.

Output operands do not specify which input of the destination instruction their computed output will be; they only specify which instruction the output will be sent to. As such, the inputs to each instruction are accumulated in the order they appear, and this is the order in which the instruction receives them. Operations with multiple inputs thus need many variants to account for the various orderings their inputs could appear in.