Shopify · vinistock · Jan 8, 2026 · Morriar · Jan 9, 2026 · Morriar
@@ -1,10 +1,14 @@
 # Architecture
 
-Rubydex indexes Ruby codebases in two distinct stages: **Discovery** and **Resolution**. Understanding this separation is crucial for working with the codebase.
+Rubydex uses a graph to represent Ruby code, which gets populated and analyzed in multiple phases.
 
-## Core Concepts: Definition vs Declaration
+## Core Concepts
 
-A **Definition** represents a single source-level construct found at a specific location in the code. It captures exactly what the parser sees without making assumptions about runtime behavior.
+### Definition vs Declaration
+
+A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
+information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
+captured during extraction.
-A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
-information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
-captured during extraction.
+A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
+information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
+captured during indexing.
 
 A **Declaration** represents the global semantic concept of a name, combining all definitions that contribute to the same fully qualified name. Declarations are produced during resolution.
 
@@ -21,7 +25,7 @@ class Foo::Bar; end
 class Foo::Bar; end
 ```
 
-**Definitions** (4 total - what the indexer discovers):
+**Definitions** (4 total - what extraction discovers):
 
 1. Module definition for `Foo` in `foo.rb`
 2. Class definition for `Bar` (nested inside `Foo`) in `foo.rb`
@@ -33,60 +37,16 @@ class Foo::Bar; end
 1. `Foo` - A module that has a constant `Bar` under its namespace
 2. `Foo::Bar` - A class, composed of definitions 2, 3, and 4
 
-## Two-Stage Indexing Pipeline
-
-### Stage 1: Discovery
-
-Discovery walks the AST and extracts definitions from source code. It captures **only what is explicitly written**, making no assumptions about runtime behavior.
-
-**What Discovery does:**
-
-- Creates `Definition` objects for classes, modules, methods, constants, variables
-- Records source locations, comments, and lexical ownership (`owner_id`)
-- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
-- Records mixins (`include`, `prepend`, `extend`) on their containing class/module
-
-**What Discovery does NOT do:**
+### Documents
 
-- Compute fully qualified names
-- Resolve constant references to declarations
-- Determine inheritance hierarchies
-- Assign semantic membership
-
-#### Why No Assumptions During Discovery?
-
-Consider this example:
-
-```ruby
-module Bar; end
-
-class Foo
-  class Bar::Baz; end
-end
-```
-
-Without resolving constant references, it may appear that `Bar::Baz` is created under `Foo`. But it's actually not - `Bar` resolves to the top-level `Bar`, so the class is `Bar::Baz`, not `Foo::Bar::Baz`.
-
-Discovery cannot know this without first resolving `Bar`. This is why fully qualified names and semantic membership are computed during Resolution, not Discovery.
-
-### Stage 2: Resolution
-
-Resolution combines the discovered definitions to build a semantic understanding of the codebase.
-
-**What Resolution does:**
-
-- Compute fully qualified names for all definitions
-- Create `Declaration` objects that group definitions by fully qualified name
-- Resolve constant references to their target declarations
-- Linearize ancestor chains (including resolving mixins)
-- Assign semantic membership (which methods/constants belong to which class)
-- Create implicit singleton classes from `def self.method` patterns
+Documents represent a single resource in the codebase, which might be committed to disk or virtual. Documents are
+connected to all concepts extracted from it, like definitions.
 
 ## Graph Structure
 
 Rubydex represents the codebase as a graph, where entities are nodes and relationships are edges. The visualization below shows the conceptual structure (implemented as an adjacency list using IDs).
 
-[Open in Excalidraw](https://excalidraw.com/#json=hQiLSD8nJRVxONhuwtSn4,L78TkfeB4YL1HJTf5L0bvw)
+[Open in Excalidraw](https://excalidraw.com/#json=utleYxF0AaAgEMLpwp1LE,RrLk4AKECjnuhsVd32saxw)
 
 ![Graph visualization](images/graph.png)
 
@@ -103,5 +63,15 @@ Connections between nodes use hashed IDs defined in `ids.rs`:
 
 - `DefinitionId`: Hash of URI, byte offset, and name
 - `DeclarationId`: Hash of fully qualified name (e.g., `Foo::Bar` or `Foo#my_method`)
-- `NameId`: Hash of unqualified name (e.g., `Bar` instead of `Foo::Bar`)
-- `UriId`: Hash of file URI
+- `NameId`: Hash combining unqualified name, parent scope and nesting
+- `UriId`: Hash of document URI
+- `StringId`: Hash of an interned string
+
+## Phases of analysis
+
+The code analysis happens in phases, each documented in its own markdown file. Stages only group phases to help
+clarify the goal of the steps.
+
+- Indexing: stage for building the knowledge about the codebase
+    - Phase 1: [Extraction](extraction.md)
+    - Phase 2: [Resolution](resolution.md)
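
As a rough illustration of the hashed IDs described for `ids.rs` in this file, the sketch below derives a `DefinitionId` from a URI, a byte offset and a name. The struct layout, the constructor signature and the use of the standard library's `DefaultHasher` are all assumptions made for this example; the actual `ids.rs` may use a different hash function and different types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the `DefinitionId` in `ids.rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DefinitionId(u64);

impl DefinitionId {
    /// Hashes the URI, byte offset and name together.
    /// `DefaultHasher` is an assumption; the real hasher may differ.
    pub fn new(uri: &str, byte_offset: u32, name: &str) -> Self {
        let mut hasher = DefaultHasher::new();
        uri.hash(&mut hasher);
        byte_offset.hash(&mut hasher);
        name.hash(&mut hasher);
        DefinitionId(hasher.finish())
    }
}
```

Because the ID is derived only from its inputs, re-extracting the same document yields the same ID for the same definition, while two definitions of `Foo` at different offsets or in different documents get distinct IDs.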
@@ -0,0 +1,49 @@
+# Extraction
+
+During extraction, source code is parsed into ASTs and key information is recorded to be transformed and used in
+subsequent phases. The captured information is remembered as is, with no assumptions about runtime behavior or
+semantics.
+
+The intention is to be able to work backwards to the original code since the goal is to support many different tools.
+As an example, considering an `unless` as an `if !` is generally correct for static analysis. However, it would not be
+possible to write a linting rule prohibiting the use of `unless` if we transformed all of them into `if` statements
+during this phase.
+
+As a general rule, this phase tries to represent the code we found in documents with high fidelity. It's the
+[resolution phase](resolution.md) that performs more meaningful transformations on the data.
+
+**What Extraction does:**
+
+- Creates `Definition` objects for classes, modules, methods, constants, variables
+- Records source locations, comments, and lexical ownership (`owner_id`)
+- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
+- Records mixins (`include`, `prepend`, `extend`) on their containing class/module
+
+**What Extraction does NOT do:**
+
+- Compute fully qualified names
+- Resolve constant references to declarations
+- Determine inheritance hierarchies
+- Assign semantic membership
+
+#### Why No Assumptions During Extraction?
+
+Consider this example:
+
+```ruby
+# bar.rb
+module Bar; end
+
+# baz.rb
+class Foo
+  class Bar::Baz; end
+end
+```
+
+When extracting information from `baz.rb`, it may seem that the class being created is `Foo::Bar::Baz`. However,
+extraction only sees one document at a time and constants in Ruby are resolved globally, taking all of the information
+from the entire codebase into account.
+
+We can only discover that the class' true fully qualified name is actually `Bar::Baz` once we have extracted information
+from all of the files involved. This also affects constant ownership. At first glance, it seems that `Bar` is a member of
+the `Foo` class, when in reality it is defined at the top level (and therefore a member of `Object`).
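
The `Bar::Baz` example above can be made concrete with a deliberately simplified lookup. The function name `resolve_constant`, the set of known fully qualified names and the nesting representation are hypothetical and not part of Rubydex. Ancestor chains are ignored: the sketch only walks the lexical nesting from innermost to outermost and then falls back to the top level.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified constant lookup. `known` holds every fully
/// qualified constant name seen so far across all documents, and
/// `nesting` lists the enclosing lexical scopes from outermost to innermost.
pub fn resolve_constant(known: &HashSet<String>, nesting: &[&str], head: &str) -> Option<String> {
    // With nesting [A, B], try `A::B::head`, then `A::head`, then `head`
    for depth in (0..=nesting.len()).rev() {
        let mut parts: Vec<&str> = nesting[..depth].to_vec();
        parts.push(head);
        let candidate = parts.join("::");
        if known.contains(candidate.as_str()) {
            return Some(candidate);
        }
    }

    // Nothing matched: the answer may depend on documents not seen yet
    None
}
```

If only `baz.rb` has been seen, `known` contains just `Foo` and looking up `Bar` inside `Foo` returns `None`. Once `bar.rb` contributes `Bar`, the same lookup returns the top-level `Bar`, so the class is `Bar::Baz`. Had some document defined `Foo::Bar`, the lookup would return that instead, which is why no single document is enough to decide.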
@@ -0,0 +1,114 @@
+# Resolution
+
+Resolution combines the outputs of the extraction phase to create a global semantic representation of the codebase.
+This is the step that tries to understand the resulting structure of everything defined in a project, including all
+relationships and connections.
+
+**What Resolution does:**
+
+- Computes fully qualified names for all definitions
+- Creates `Declaration` objects that group definitions by fully qualified name
+- Resolves constant references to their target declarations
+- Linearizes ancestor chains (including resolving mixins)
+- Keeps track of descendants
+- Assigns semantic membership (which methods/constants belong to which class)
+- Creates implicit singleton classes (e.g., `def self.method` patterns or class instance variables)
+
+## The resolution loop
+
+### The problem of interdependencies
+
+To create global declarations, we need to fully qualify all names extracted from a codebase. However, determining the
+fully qualified name depends on resolving constants. Consider the same example as the one in
+[extraction](extraction.md).
+
+```ruby
+# bar.rb
+module Bar; end
+
+# baz.rb
+class Foo
+  class Bar::Baz
+    def qux; end
+  end
+end
+```
+
+The fully qualified name of the class defined in `baz.rb` is `Bar::Baz`. Therefore, the fully qualified name of the
+method is `Bar::Baz#qux`. We can only determine that correctly if we already resolved the `Bar` constant reference
+involved in the class' name.
+
+To further increase the complexity, constants have interdependencies. To resolve a constant reference, we need to:
+
+- Search the surrounding lexical scopes
+- Search the ancestor chain of the lexical scope where the reference was found
+- Fall back to the top level
+
+Considering that other constants are involved in the lexical scopes and ancestor chains, you get even more cross
+dependencies. We can even have dependencies within the same ancestor chain. Consider this other example:
+
+```ruby
+module Foo
+  module Bar
+  end
+end
+
+class Baz
+  include Foo
+  include Bar
+end
+```
+
+When we include `Foo`, it makes `Bar` available through inheritance, which then allows us to also include it. This
+means that in order to fully resolve this example we need to:
+
+- Create global declarations for `Foo`, `Bar` and `Baz`
+- Correctly assign membership so that `Bar` is owned by `Foo`
+- Partially linearize the ancestor chain of `Baz`, leaving a todo for `Bar` since we can't yet resolve it
+- Finally, fully linearize the ancestors now that we processed `Foo` and know `Bar` is available through inheritance
+
+### The loop
+
+Trying to create a tree of dependencies ahead of time to resolve constants is difficult because some dependencies are
+only identified when we are in the middle of performing resolution. Instead of taking that approach, the resolution
+loop is an optimistically sorted worklist algorithm (inspired by [Sorbet's](https://github.com/sorbet/sorbet) approach).
+
+The basic idea is to sort the worklist in an order that's likely to succeed most of the time, minimizing the number of
+times we need to retry. If we fail to resolve something, we re-enqueue it to try again. The loop has passes (or epochs)
+where we go through the list of work. If we exhaust the worklist or fail to make any progress in a pass, then we
+finalize the loop.
+
+```rust
+// See the actual implementation in resolution.rs
+
+pub fn resolve_all(graph: &mut Graph) {
+    // Partition and sort all of the work ahead of time
+    let (mut unit_queue, other_ids) = sorted_units(graph);
+
+    // Outer loop that controls the passes
+    loop {
+        // Keep track if we made progress this pass
+        let mut made_progress = false;
+
+        // Resolution pass. We go through the full length of the queue at this time, which automatically excludes
+        // retries that we find during this pass
+        for _ in 0..unit_queue.len() {
+            let Some(unit_id) = unit_queue.pop_front() else {
+                break;
+            };
+
+            // Perform different work depending on what item we found
+            match unit_id {
+                Unit::Definition(id) => { /* handle constant definitions */ }
+                Unit::Reference(id) => { /* handle constant references */ }
+                Unit::Ancestors(id) => { /* handle ancestor linearization retries */ }
+            }
+        }
+
+        // If we're no longer able to advance the analysis or if we finished all of the work, break out of the loop
+        if !made_progress || unit_queue.is_empty() {
+            break;
+        }
+    }
+}
+```
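
To see the passes in action, here is a self-contained toy model of the loop. It is not the logic in `resolution.rs`: the `Unit` struct, its `depends_on` list and the `run_worklist` function are hypothetical, and "resolving" a unit is reduced to checking that all of its dependencies are already resolved.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical unit of work: it can only be resolved once every name it
/// depends on has been resolved.
pub struct Unit {
    pub name: &'static str,
    pub depends_on: Vec<&'static str>,
}

/// Runs passes until the worklist is exhausted or a full pass makes no progress.
/// Returns the resolution order and the units that could not be resolved.
pub fn run_worklist(units: Vec<Unit>) -> (Vec<&'static str>, Vec<Unit>) {
    let mut queue: VecDeque<Unit> = units.into();
    let mut resolved: HashSet<&'static str> = HashSet::new();
    let mut order = Vec::new();

    loop {
        let mut made_progress = false;

        // Only visit the items that were queued when this pass started, so
        // retries enqueued during the pass wait for the next one
        for _ in 0..queue.len() {
            let Some(unit) = queue.pop_front() else {
                break;
            };

            if unit.depends_on.iter().all(|dep| resolved.contains(dep)) {
                resolved.insert(unit.name);
                order.push(unit.name);
                made_progress = true;
            } else {
                // Not resolvable yet: re-enqueue to retry in a later pass
                queue.push_back(unit);
            }
        }

        // Stop when everything is done or when a whole pass changed nothing
        if !made_progress || queue.is_empty() {
            break;
        }
    }

    (order, queue.into())
}
```

Modelling the `include Foo` / `include Bar` example as three units queued in the worst order (`Baz ancestors` depending on `Foo` and `Foo::Bar`, then `Foo::Bar` depending on `Foo`, then `Foo`), the loop needs three passes and resolves them as `Foo`, `Foo::Bar`, `Baz ancestors`. A unit whose dependency never appears is returned as unresolved after the first pass that makes no progress, which is the termination guarantee. Sorting the worklist well up front is what keeps the number of passes low.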