Language Design

by tdmm

I love programming languages. In particular, I’m interested in the safety that languages can provide, specifically through typesystems. But I want to see this safety combined with performance. If you haven’t guessed it, Rust is among my favorites. But naturally, I have some of my own ideas that I would love to implement at some point. So here’s a (relatively quick) rundown of the most well-formed of them.

goodint

Integer overflow is a particularly difficult issue because it is extremely subtle and easy to overlook (a good example is (a + b) / 2 as a method for computing the average of two fixed-size integers; it is incorrect for roughly half of all inputs). I believe I’ve come up with a typesystem for integers, which I call goodint, that resolves this issue by making integer overflow explicit.
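
To make the pitfall concrete, here is a small Rust sketch (not goodint) showing the naive average going wrong once the sum no longer fits the operand type:

```rust
// The classic average bug, demonstrated with u8 (Rust, not goodint syntax).
fn main() {
    let a: u8 = 200;
    let b: u8 = 100;
    // 200 + 100 = 300 does not fit in a u8; with wrapping semantics the sum
    // becomes 44, so the computed "average" is 22 instead of 150.
    let wrong = a.wrapping_add(b) / 2;
    // Computing in a wider type avoids the overflow entirely.
    let right = (u16::from(a) + u16::from(b)) / 2;
    println!("wrong = {wrong}, right = {right}");
}
```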

The core idea of goodint is to guarantee that integer overflow will never happen implicitly. This is achieved by having the result of an integer operation have a different type from the operands. For example, adding a u4 to a u4 will result in a u5, because the range of the result does not fit within a u4.
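
Rust has no u4 or u5, but the flavour of the idea can be sketched with a newtype whose addition deliberately returns a wider type (u8 to u16 standing in for u4 to u5):

```rust
use std::ops::Add;

// A stand-in for a goodint-style unsigned integer: adding two of them yields
// a wider type, so the result can never overflow implicitly.
#[derive(Clone, Copy)]
struct U8(u8);

impl Add for U8 {
    // The sum of two 8-bit values always fits in 16 bits (9 would suffice).
    type Output = u16;
    fn add(self, rhs: U8) -> u16 {
        u16::from(self.0) + u16::from(rhs.0)
    }
}

fn main() {
    let sum: u16 = U8(255) + U8(255); // 510, representable without overflow
    println!("{sum}");
}
```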

These overflow guarantees are only made at the language level. Optimizing compilers can safely rewrite expressions such as (a + b) / 2 into equivalent ones which do not cause overflow, such as m + (M-m)/2 (where m and M are the minimum and maximum of a and b, respectively). This allows programmers to trust the language to do the right thing.
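
For reference, the rewritten midpoint stays within the operand type while producing the same result as the widened (a + b) / 2; a quick check in Rust:

```rust
// m + (M - m) / 2 never exceeds the operand type's range and rounds the same
// way as the mathematically exact (a + b) / 2.
fn midpoint(a: u8, b: u8) -> u8 {
    let (m, big) = if a <= b { (a, b) } else { (b, a) };
    m + (big - m) / 2
}

fn main() {
    assert_eq!(midpoint(200, 100), 150); // (200 + 100) / 2 would overflow a u8 sum
    assert_eq!(midpoint(255, 255), 255);
}
```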

However, there are cases where potential integer overflow must be handled. For example, when summing integers in a loop, the programmer may know (or assume) that the accumulator’s type can hold the final sum. In this case, the result of adding each element to the accumulator (which has a type wider than either operand) can be explicitly cast back to the accumulator’s type. By inserting the cast, the programmer indicates their awareness of the possibility of overflow. This translates into an explicit runtime overflow check (one which is kept even in release builds). More advanced and unsafe casts may also be available.
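
In today’s Rust the same discipline has to be spelled out by hand; the sketch below approximates it, with the widening explicit and the narrowing cast acting as the checked step goodint would require:

```rust
// The narrowing back to the accumulator type is an explicit, checked step,
// mirroring goodint's explicit cast with its runtime overflow check.
fn sum(values: &[u8]) -> u8 {
    let mut acc: u8 = 0;
    for &v in values {
        // The intermediate result is wider than either operand, as in goodint.
        let wide = u16::from(acc) + u16::from(v);
        acc = u8::try_from(wide).expect("sum overflowed the accumulator type");
    }
    acc
}

fn main() {
    assert_eq!(sum(&[1, 2, 3]), 6);
    // sum(&[200, 100]) would panic with the explicit overflow message above.
}
```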

The above semantics are useful for integer types where overflow is unwanted, but there are situations where overflow is necessary (e.g. modular arithmetic). To this end, there are actually four different classes of integers defined in goodint: unsigned integers (uN), signed integers (iN), modular integers (mN, which operate exclusively modulo 2^N), and bitstrings (bN, which support bitwise operations that do not have “nice” integer interpretations).
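
Rust’s Wrapping type is a rough analogue of a modular type like m8, where arithmetic modulo 2^N is the intended behaviour rather than an error:

```rust
use std::num::Wrapping;

fn main() {
    // Arithmetic is defined modulo 2^8, so "overflow" is simply wrap-around.
    let x = Wrapping(250u8) + Wrapping(10u8); // 260 mod 256 = 4
    assert_eq!(x.0, 4);
}
```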

There are also several options for integer type sizes. When the range of a variable is architecture-independent (or, I suppose, if code is being developed specifically for a single architecture, although this is not a great idea in most cases), fixed-size integer types should be used, where the exact size of the integer is specified in bits (e.g. u12 for an unsigned integer between 0 and 4,095, inclusive). For architecture-dependent ranges, several integer sizes are provided: len (which is capable of representing the length of any object in memory), ptr (which is capable of representing a pointer to memory), and nat (which matches the size of a “natural” machine register). Finally, there is the dyn integer size, used for dynamically-sized integers whose size is only known at runtime (akin to slices in Rust).
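
For comparison, today’s Rust collapses most of these roles into usize; the comments below map the proposed sizes onto their closest existing analogues (this mapping is my own, not part of any existing language):

```rust
// len ~ usize (large enough for any object length)
// ptr ~ the width of a raw pointer (usize on all current Rust targets)
// nat ~ a "natural" machine register, often but not always pointer-sized
// dyn ~ no built-in analogue; a runtime-sized integer
fn main() {
    assert_eq!(std::mem::size_of::<usize>(), std::mem::size_of::<*const u8>());
    println!("pointer-sized integers here: {} bits", usize::BITS);
}
```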

Languages like Zig provide bit-sized integer types, but no language I know of has an integer typesystem even remotely similar to goodint. General-purpose programming languages like Python often have dynamically-sized integer types which naturally prevent overflow but are less efficient.

Link-Time Evaluation

The standard build process for compiled languages like C is to compile each source file into an object file, then link the object files together into a shared library or executable. A dynamically linked executable is then run by loading the shared libraries it depends upon, resolving referenced symbols by patching addresses into the machine code, and then executing the code. Typically, optimizations occur at the compilation stage, but not at later stages. However, a newer trend is Link-Time Optimization (LTO), wherein compilers output an intermediate representation (typically that of LLVM) which is then optimized and compiled by the linker. The benefit of LTO is that more information is available to the linker than to the compiler, and so more aggressive optimizations are possible.

To facilitate greater control over what happens at compile-time, and over what code is evaluated and optimized by the compiler, some languages have introduced compile-time evaluation, where code can be explicitly written for execution at compile-time. Languages like C++, Zig, D, and Rust (at least partially) support this. However, because this occurs at the compilation stage, it does not have access to all the information that goes into the final executable: in particular, compile-time evaluation knows nothing about the external objects that will later be linked in. I propose link-time evaluation, wherein code can be executed at link-time, with access to all the information known at that stage.
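
As a point of reference, this is what compile-time evaluation looks like in Rust today: the table below is computed entirely by the compiler. Link-time evaluation would extend the same mechanism to values only the linker can see.

```rust
// A lookup table computed at compile-time with a const fn.
const fn squares() -> [u64; 8] {
    let mut out = [0u64; 8];
    let mut i = 0;
    while i < 8 {
        out[i] = (i as u64) * (i as u64);
        i += 1;
    }
    out
}

const TABLE: [u64; 8] = squares();

fn main() {
    assert_eq!(TABLE[7], 49);
}
```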

Because linking technically occurs twice (once when linking a library or executable, and again when shared libraries are loaded), link-time evaluation actually extends through the entire build-and-execute process. Link-time evaluation is useful, for example, for accessing internal information about the libraries used by an executable: the size of an internal structure of a library, which is part of its ABI but not its API, is exposed at link-time but not at compile-time. Libraries and executables can use such a structure directly, without knowing its size at compile-time, because it will be known at link-time, which is enough.
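
A hedged sketch of that scenario in today’s Rust (names like InternalHandle and HANDLE_SIZE are invented for illustration): the “library” exposes only the size of an internal structure, and the consumer uses it without ever seeing the layout. Today that value is only usable at runtime; link-time evaluation would let it participate in types and layouts as if it were a constant.

```rust
// --- library side (normally a separate crate or shared object) ---
mod lib_internal {
    #[allow(dead_code)]
    struct InternalHandle {
        _id: u64,
        _flags: u32,
    }
    // Stand-in for a linker-visible symbol carrying the struct's size:
    // part of the ABI, but not of the API.
    pub static HANDLE_SIZE: usize = std::mem::size_of::<InternalHandle>();
}

// --- consumer side ---
fn main() {
    // Storage for an opaque handle, sized without knowing the layout.
    let buf = vec![0u8; lib_internal::HANDLE_SIZE];
    println!("allocated {} bytes for an opaque handle", buf.len());
}
```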

Link-time evaluation is to link-time optimization what guaranteed compile-time evaluation is to the constant folding that optimizing C compilers already perform without the language guaranteeing it. To some extent, link-time evaluation is already available through link-time optimization, because simple integer expressions will almost certainly be evaluated during the optimization process. However, true link-time evaluation would turn this possibility into a guarantee and would give the programmer direct control over it. Regardless, the basic infrastructure necessary for link-time evaluation is already in place, because it is the same as that of link-time optimization.

Link-time evaluation is an extension of compile-time evaluation, and it subsumes all such use cases. As such, languages can implement link-time evaluation by treating link-time as a part of compile-time: I recommend this route, because it makes the language more powerful and simpler at the same time.

This is more of a change to the infrastructure around programming languages than to the languages themselves, but I believe it is just as important as the other, more language-specific ideas discussed here.

Enhanced References

Rust already has reference types, and I think they’re great. However, I wish there were two additional kinds of references: &own, which represents an owning reference, and &out, which represents an out-reference. Both have slightly niche but still useful use cases, and although they are not absolutely necessary, implementing them manually requires obtuse and unsafe code.

These two kinds of references represent different sides of the same coin: an owning reference refers to an initialized value that will be de-initialized, and an out-reference refers to a de-initialized value that will be initialized. In fact, an owning reference can be transformed into an out-reference (by extracting or destroying the referenced value), and an out-reference can be transformed into an owning reference (by inserting or constructing a new value at the referenced location).

At the same time, an out-reference is very dissimilar to the other kinds of references, because the rest all guarantee that the referenced location contains an initialized value. It would be the rarest kind of reference, used in niche cases like indirect initialization (think memcpy(3), read(2), or simply a replacement for an aggregate return type).

These kinds of references are useful in cases where data has to be re-initialized or de-initialized indirectly. They replace a large number of the use cases of the MaybeUninit wrapper type. With the appropriate design, they can be used on slices and other aggregate types as well.
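
To show what would be replaced, here is the MaybeUninit pattern an &out reference would subsume; the &out signature in the comment is hypothetical syntax, everything else is current Rust:

```rust
use std::mem::MaybeUninit;

// Conceptually `fn fill(out: &out u64)`: the callee receives uninitialized
// storage and is obliged to initialize it before returning.
fn fill(out: &mut MaybeUninit<u64>) {
    out.write(42);
}

fn main() {
    let mut slot = MaybeUninit::<u64>::uninit();
    fill(&mut slot);
    // With a real out-reference the compiler would track that `slot` is now
    // initialized; today the programmer has to assert it with unsafe code.
    let value = unsafe { slot.assume_init() };
    assert_eq!(value, 42);
}
```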

Adding these kinds of references would essentially separate the ownership of a value from the ownership of the memory location of that value. This may simplify the implementation of allocating container types such as Vec<T>, which currently use unsafe pointer types and the like.

Reference kinds can instead be viewed as references to different versions of a type. For every type T, there would be a mutable version mut T (identical to Cell<T>), an owning version own T (which does not appear to have an existing analogue), and an out-version out T (identical to MaybeUninit<T>). There would then be a single kind of reference, &T, with varying semantics depending upon the version of the type. I don’t know if this is a particularly good idea yet, but it is neat enough to warrant further investigation.

Static Stacks

The Zig programming language has some interesting typesystem considerations around asynchronous functions. Every function has a function frame type: for synchronous functions, this is a stub type with no information, but for asynchronous functions, it contains the state of the function between polls. The size of the function frame is static and known at compile-time. However, to allow asynchronous functions to be called indirectly (via function pointers), the size of this frame is also stored within the produced executable. This information allows the language to guarantee that enough space is allocated to hold the frame for asynchronous function calls. I propose extending this concept to measure the total amount of stack space that could potentially be used by a function.

Every function would be guaranteed to use no more than a fixed amount of stack space, and this size would be stored as part of the function’s ABI. When a function is called, its maximum stack usage contributes to that of the caller. Thus, every function would be associated with a maximum stack usage value. This information can be represented statically via traits and the like, and it is also available dynamically (i.e. when function pointers are used).
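
The accounting itself is simple; here is a toy illustration (not a real compiler pass) over a hand-written call graph with made-up frame sizes:

```rust
use std::collections::HashMap;

// A function's maximum stack usage is its own frame plus the largest usage
// among its callees. (A cycle in the graph, i.e. recursion, would make this
// unbounded, which is exactly the limitation discussed below.)
fn max_stack_usage(
    frames: &HashMap<&str, usize>,
    calls: &HashMap<&str, Vec<&str>>,
    func: &str,
) -> usize {
    let deepest_callee = calls
        .get(func)
        .into_iter()
        .flatten()
        .map(|callee| max_stack_usage(frames, calls, callee))
        .max()
        .unwrap_or(0);
    frames[func] + deepest_callee
}

fn main() {
    let frames = HashMap::from([("main", 64), ("parse", 128), ("render", 256), ("draw", 32)]);
    let calls = HashMap::from([("main", vec!["parse", "render"]), ("render", vec!["draw"])]);
    // Worst case: main -> render -> draw = 64 + 256 + 32 = 352 bytes.
    assert_eq!(max_stack_usage(&frames, &calls, "main"), 352);
}
```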

Information about maximum stack usage is very useful because it allows more accurate stack space allocation. By default, different systems allocate different amounts of stack space, and programs crash when they cross that limit. But if the maximum stack usage of the top-level function in a program is known, it becomes possible to allocate exactly as much space as is necessary, and thus guarantee the absence of stack overflows. This applies to threading as well: different threading libraries use different defaults for the stack sizes of new threads, and this can cause difficult-to-debug stack overflows.
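
Rust already lets a thread’s stack be sized by hand, which shows where the computed bound would plug in; today the number is a guess, which is exactly the problem:

```rust
use std::thread;

fn main() {
    let handle = thread::Builder::new()
        // Today: a guessed size. With static stacks: a bound computed by the
        // compiler for the thread's entry function.
        .stack_size(64 * 1024)
        .spawn(|| {
            let local = [0u8; 4096];
            local.iter().map(|&b| u64::from(b)).sum::<u64>()
        })
        .expect("failed to spawn thread");
    assert_eq!(handle.join().unwrap(), 0);
}
```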

However, taking advantage of these abilities would likely require a custom program interpreter that can find and use this information (which is not wholly unwarranted, particularly once link-time evaluation is taken into consideration).

Note that this disallows C-style variable-length arrays and, more importantly, recursion. It may still be possible to support recursion to a limited extent (e.g. by setting a maximum nesting depth on a per-function basis), but indirect function calls make the situation much worse. It may be possible to resolve this issue, but I don’t yet know how.