eikopf

Newtyped Indices are Proofs

It's pretty well-known at this point that the solution for complex webs of ownership in Rust is just to use newtyped indices. Graphs are the prototypical example; the earliest post on this subject I'm aware of was posted by Niko Matsakis in 2015. For my part, I first came across the idea in a 2018 post from Matklad, which also points out that the std::ops::Index trait can provide a uniform interface for this pattern.

But most discussion seems to stop here, concluding that newtyped indices are a nice way to get around some ownership constraints with careful API design. I'd like to extend this a little further, with the idea that we can reason about these types as proofs about their referents.

Setting Up

Briefly, let's put together a workable example. This code is fairly similar to a compiler project I've been working on, but with some irrelevant bits cut out. At this point in the compiler we've assembled all our ASTs into a single structure called the environment.

// i'll omit the usual derive macros, you know what they are
struct FunId(usize);
struct TypeId(usize);

struct Env {
  fun_decls: HashMap<FunId, FunDecl>,
  type_decls: HashMap<TypeId, TypeDecl>,
  // ... some other fields
}

I'll leave FunDecl as an opaque type for now, but we'll need to know some details about TypeDecl for later sections.

enum TypeDecl {
  Alias(TypeAliasDecl),
  Extern(ExternTypeDecl),
  Adt(AdtDecl),
}

// we also care a little about AdtDecl

struct AdtDecl {
  name: Symbol // an interned string
  constrs: Box<[ConstrDecl]>,
}

In all, we have two distinct kinds of declarations, and there are three subkinds of type declaration. One particular type declaration subkind has two known fields, but we have no idea about the other subkinds.

Semantics

As was pointed out to me by @without.boats on Bluesky, everything that follows here relies on the implicit assumption that there can only be at most one Env at any one time. If you have more than one Env, you can use an index for one in the other, potentially crashing as a result. Using something like a GhostCell might be a solution, but it's arguably not worth the effort here.

If you're familiar with the Pointers Are Complicated posts by Ralf Jung, you might be familiar with the idea that pointers are more than just integers; they have an implicit zero-sized field whose type conveys their properties. The std::ptr docs are also a good source for this idea, where that invisible field is characterised as the provenance of the pointer.

References are distinguished from pointers in that they can never be null—they always point to a proper instance of the referent type. That is, if you have a &T, you are guaranteed that a corresponding T value exists somewhere; in other words, it's a proof! The existence of a &T is evidence of the fact that a T value exists.

With that context in mind, what does it mean to have a FunId value? I think it should be obvious that it isn't just a usize, since there's no reason to create a FunId value unless you intend for it to represent something about the usize it contains. I'd argue1 that if you have a FunId, you also have evidence about the existence of a FunDecl in the Env; it confers the same sort of provenance as a &FunDecl. And just like a reference, a FunId grants its holder the necessary permission to read data from the referent FunDecl.

Basic Accessors

I'll write explicit methods here rather than using the Index trait, since it makes the resulting code a little easier to understand at a glance. Let's start with FunId and TypeId, for which we have obvious access patterns.

impl Env {
  fn get_fun(&self, id: FunId) -> &FunDecl {
    self.fun_decls.get(&id).unwrap()
  }

  fn get_type(&self, id: TypeId) -> &TypeDecl {
    self.type_decls.get(&id).unwrap()
  }
}

The fact that we're calling Option::unwrap here is incidental; if these accessor methods ever panic then something has already gone wrong in the creation of a TypeId or FunId. 2

From the perspective of proofs, these methods are theorems; for example Env::get_fun says something to the effect of "given an Env and a FunId, there exists a FunDecl in the Env."

Composite Indices

Let's say we have some function that needs to deal with both kinds of declaration; we might handle this with a pair of types like the following.

enum DeclId {
  Fun(FunId),
  Type(TypeId),
}

enum DeclRef<'a> {
  Fun(&'a FunDecl)
  Type(&'a TypeDecl)
}

The corresponding access pattern is so obvious it's almost not worth mentioning (but here it is anyway).

impl Env {
  fn get_decl(&self, id: DeclId) -> DeclRef<'_> {
    match id {
      DeclId::Fun(id) => DeclRef::Fun(self.get_fun(&id)),
      DeclId::Type(id) => DeclRef::Type(self.get_type(&id)),
    }
  }
}

If you're familiar with some type theory, you might be aware that coproducts correspond to logical disjunction (the logical OR operation); consider that a DeclId is either a FunId or a TypeId.

A DeclId is therefore the disjunction of two existence proofs; it can be constructed with a proof that a FunDecl exists (the Fun variant) or a proof that a TypeDecl exists (the Type variant). This gives us a reasonable interpretation of get_decl as a theorem: given an Env and a DeclId, either there exists a FunDecl in the Env or there exists a TypeDecl in the Env.

Is there an equivalent way to represent logical conjunction (the logical AND operator)? Yes! It's the product of TypeId and FunId, which we would write as (TypeId, FunId). 3

impl Env {
  fn get_decl_pair(&self, id: (TypeId, FunId)) -> (&TypeDecl, &FunDecl) {
    let type_decl = self.get_type(id.0);
    let fun_decl = self.get_fun(id.1);
    (type_decl, fun_decl)
  }
}

Abstracted from this example, the broader point is that you can combine newtyped indices to get composite index types with reasonable meanings, and moreover that these composites inherit the semantics of their subcomponents.

An Involved Example

Let's imagine now we want to define a ConstrId type to refer to some ConstrDecl somewhere in the environment. We need to know its parent TypeDecl, as well as the index it appears at in the constrs field, so let's put those together in a struct.

struct ConstrId {
  type_id: TypeId,
  index: usize,
}

Maybe you can already see the problem? Let's try to write its accessor method.

impl Env {
  fn get_constr(&self, id: ConstrId) -> &ConstrDecl {
    let type_decl = self.get_type(id.type_id);

    match type_decl {
      TypeDecl::Alias(type_alias_decl) => todo!(),
      TypeDecl::Extern(extern_type_decl) => todo!(),
      TypeDecl::Adt(adt_decl) => todo!(),
    }
  }
}

And now we're stuck. We can't fill in the Alias and Extern branches, because we have no idea what they contain. The problem is that TypeId guarantees the existence of a TypeDecl, but says nothing about which variant it might be.

What we'd really like is a proof that a TypeDecl exists, and that it's an AdtDecl. But we can totally do that! All we have to do is create a new type, and declare that it has those semantics.

/// a TypeId that we guarantee refers to an AdtDecl
struct AdtId(TypeId);

impl Env {
  fn get_adt(&self, id: AdtId) -> &AdtDecl {
    let type_decl = self.get_type(id.0);

    match type_decl {
      TypeDecl::Adt(adt_decl) => adt_decl,
      _ => unreachable!(),
    }
  }
}

At the risk of tripping over terminology, we can view AdtId like an implicit dependent proof: it is a value (the TypeId), and a proof about that value (the fact that the corresponding TypeDecl is definitely an AdtDecl). 4 We can use this to give a more precise definition of ConstrId, and properly implement its accessor method.

struct ConstrId {
  type_id: AdtId,
  index: usize,
}

impl Env {
  fn get_constr(&self, id: ConstrId) -> &ConstrDecl {
    let adt_decl = self.get_adt(id.type_id);
    adt_decl.constrs.get(id.index).unwrap()
  }
}

Summary

Working on a compiler in Rust means defining dozens of these types, and then spending a great deal of time thinking about their precise semantics as you work with them. Treating them as proofs might seem odd if you haven't encountered the idea before, but the analogy with references is a useful one: newtyped indices are existence proofs in their simplest forms, and you can construct more complex proofs by building larger types out of them.

Newtyped indices are also access patterns; each one corresponds uniquely to a single accessor method. The fact that this correspondence is unique is why we can avoid names altogether with the std::ops::Index trait, relying on the type itself to describe how it should be used.

If you're interested in looking at a larger implementation of this technique, the salsa crate is probably the best example; its documentation book has a good overview chapter that covers the basics of its design.

  1. And in fact this argument is the basic premise for every following point in the post, so it's a pretty important idea to buy into!

  2. Should you replace Option::unwrap with Option::unwrap_unchecked here? It's not necessarily an incorrect reading of the semantics, but I'd be inclined to leave them checked unless it was actually a performance issue.

  3. The ordering of this pair doesn't matter; we could just as well have written this as (FunId, TypeId). As an analogy, think of how the AND operator is commutative: that is, x AND y is the same as y AND x.

  4. I'm fairly sure this is a coherent description? If you were going to do something similar in Lean, AdtId would have an additional field describing this proof, and it would depend on the value of the first field.

#design #rust #types