Newtyped Indices are Proofs
It's pretty well-known at this point that the solution for complex webs of ownership in Rust is just to use newtyped indices. Graphs are the prototypical example; the earliest post on this subject I'm aware of was posted by Niko Matsakis in 2015. For my part, I first came across the idea in a 2018 post from Matklad, which also points out that the std::ops::Index
trait can provide a uniform interface for this pattern.
But most discussion seems to stop here, concluding that newtyped indices are a nice way to get around some ownership constraints with careful API design. I'd like to extend this a little further, with the idea that we can reason about these types as proofs about their referents.
Setting Up
Briefly, let's put together a workable example. This code is fairly similar to a compiler project I've been working on, but with some irrelevant bits cut out. At this point in the compiler we've assembled all our ASTs into a single structure called the environment.
// i'll omit the usual derive macros, you know what they are
struct FunId(usize);
struct TypeId(usize);
struct Env {
fun_decls: HashMap<FunId, FunDecl>,
type_decls: HashMap<TypeId, TypeDecl>,
// ... some other fields
}
I'll leave FunDecl
as an opaque type for now, but we'll need to know some details about TypeDecl
for later sections.
enum TypeDecl {
Alias(TypeAliasDecl),
Extern(ExternTypeDecl),
Adt(AdtDecl),
}
// we also care a little about AdtDecl
struct AdtDecl {
name: Symbol // an interned string
constrs: Box<[ConstrDecl]>,
}
In all, we have two distinct kinds of declarations, and there are three subkinds of type declaration. One particular type declaration subkind has two known fields, but we have no idea about the other subkinds.
Semantics
As was pointed out to me by @without.boats on Bluesky, everything that follows here relies on the implicit assumption that there can only be at most one
Env
at any one time. If you have more than oneEnv
, you can use an index for one in the other, potentially crashing as a result. Using something like a GhostCell might be a solution, but it's arguably not worth the effort here.
If you're familiar with the Pointers Are Complicated posts by Ralf Jung, you might be familiar with the idea that pointers are more than just integers; they have an implicit zero-sized field whose type conveys their properties. The std::ptr
docs are also a good source for this idea, where that invisible field is characterised as the provenance of the pointer.
References are distinguished from pointers in that they can never be null—they always point to a proper instance of the referent type. That is, if you have a &T
, you are guaranteed that a corresponding T
value exists somewhere; in other words, it's a proof! The existence of a &T
is evidence of the fact that a T
value exists.
With that context in mind, what does it mean to have a FunId
value? I think it should be obvious that it isn't just a usize
, since there's no reason to create a FunId
value unless you intend for it to represent something about the usize
it contains. I'd argue1 that if you have a FunId
, you also have evidence about the existence of a FunDecl
in the Env
; it confers the same sort of provenance as a &FunDecl
. And just like a reference, a FunId
grants its holder the necessary permission to read data from the referent FunDecl
.
Basic Accessors
I'll write explicit methods here rather than using the Index
trait, since it makes the resulting code a little easier to understand at a glance. Let's start with FunId
and TypeId
, for which we have obvious access patterns.
impl Env {
fn get_fun(&self, id: FunId) -> &FunDecl {
self.fun_decls.get(&id).unwrap()
}
fn get_type(&self, id: TypeId) -> &TypeDecl {
self.type_decls.get(&id).unwrap()
}
}
The fact that we're calling Option::unwrap
here is incidental; if these accessor methods ever panic then something has already gone wrong in the creation of a TypeId
or FunId
. 2
From the perspective of proofs, these methods are theorems; for example Env::get_fun
says something to the effect of "given an Env
and a FunId
, there exists a FunDecl
in the Env
."
Composite Indices
Let's say we have some function that needs to deal with both kinds of declaration; we might handle this with a pair of types like the following.
enum DeclId {
Fun(FunId),
Type(TypeId),
}
enum DeclRef<'a> {
Fun(&'a FunDecl)
Type(&'a TypeDecl)
}
The corresponding access pattern is so obvious it's almost not worth mentioning (but here it is anyway).
impl Env {
fn get_decl(&self, id: DeclId) -> DeclRef<'_> {
match id {
DeclId::Fun(id) => DeclRef::Fun(self.get_fun(&id)),
DeclId::Type(id) => DeclRef::Type(self.get_type(&id)),
}
}
}
If you're familiar with some type theory, you might be aware that coproducts correspond to logical disjunction (the logical OR operation); consider that a DeclId
is either a FunId
or a TypeId
.
A DeclId
is therefore the disjunction of two existence proofs; it can be constructed with a proof that a FunDecl
exists (the Fun
variant) or a proof that a TypeDecl
exists (the Type
variant). This gives us a reasonable interpretation of get_decl
as a theorem: given an Env
and a DeclId
, either there exists a FunDecl
in the Env
or there exists a TypeDecl
in the Env
.
Is there an equivalent way to represent logical conjunction (the logical AND operator)? Yes! It's the product of TypeId
and FunId
, which we would write as (TypeId, FunId)
. 3
impl Env {
fn get_decl_pair(&self, id: (TypeId, FunId)) -> (&TypeDecl, &FunDecl) {
let type_decl = self.get_type(id.0);
let fun_decl = self.get_fun(id.1);
(type_decl, fun_decl)
}
}
Abstracted from this example, the broader point is that you can combine newtyped indices to get composite index types with reasonable meanings, and moreover that these composites inherit the semantics of their subcomponents.
An Involved Example
Let's imagine now we want to define a ConstrId
type to refer to some ConstrDecl
somewhere in the environment. We need to know its parent TypeDecl
, as well as the index it appears at in the constrs
field, so let's put those together in a struct.
struct ConstrId {
type_id: TypeId,
index: usize,
}
Maybe you can already see the problem? Let's try to write its accessor method.
impl Env {
fn get_constr(&self, id: ConstrId) -> &ConstrDecl {
let type_decl = self.get_type(id.type_id);
match type_decl {
TypeDecl::Alias(type_alias_decl) => todo!(),
TypeDecl::Extern(extern_type_decl) => todo!(),
TypeDecl::Adt(adt_decl) => todo!(),
}
}
}
And now we're stuck. We can't fill in the Alias
and Extern
branches, because we have no idea what they contain. The problem is that TypeId
guarantees the existence of a TypeDecl
, but says nothing about which variant it might be.
What we'd really like is a proof that a TypeDecl
exists, and that it's an AdtDecl
. But we can totally do that! All we have to do is create a new type, and declare that it has those semantics.
/// a TypeId that we guarantee refers to an AdtDecl
struct AdtId(TypeId);
impl Env {
fn get_adt(&self, id: AdtId) -> &AdtDecl {
let type_decl = self.get_type(id.0);
match type_decl {
TypeDecl::Adt(adt_decl) => adt_decl,
_ => unreachable!(),
}
}
}
At the risk of tripping over terminology, we can view AdtId
like an implicit dependent proof: it is a value (the TypeId
), and a proof about that value (the fact that the corresponding TypeDecl
is definitely an AdtDecl
). 4 We can use this to give a more precise definition of ConstrId
, and properly implement its accessor method.
struct ConstrId {
type_id: AdtId,
index: usize,
}
impl Env {
fn get_constr(&self, id: ConstrId) -> &ConstrDecl {
let adt_decl = self.get_adt(id.type_id);
adt_decl.constrs.get(id.index).unwrap()
}
}
Summary
Working on a compiler in Rust means defining dozens of these types, and then spending a great deal of time thinking about their precise semantics as you work with them. Treating them as proofs might seem odd if you haven't encountered the idea before, but the analogy with references is a useful one: newtyped indices are existence proofs in their simplest forms, and you can construct more complex proofs by building larger types out of them.
Newtyped indices are also access patterns; each one corresponds uniquely to a single accessor method. The fact that this correspondence is unique is why we can avoid names altogether with the std::ops::Index
trait, relying on the type itself to describe how it should be used.
If you're interested in looking at a larger implementation of this technique, the salsa
crate is probably the best example; its documentation book has a good overview chapter that covers the basics of its design.
And in fact this argument is the basic premise for every following point in the post, so it's a pretty important idea to buy into!↩
Should you replace
Option::unwrap
withOption::unwrap_unchecked
here? It's not necessarily an incorrect reading of the semantics, but I'd be inclined to leave them checked unless it was actually a performance issue.↩The ordering of this pair doesn't matter; we could just as well have written this as
(FunId, TypeId)
. As an analogy, think of how the AND operator is commutative: that is,x AND y
is the same asy AND x
.↩I'm fairly sure this is a coherent description? If you were going to do something similar in Lean,
AdtId
would have an additional field describing this proof, and it would depend on the value of the first field.↩