78 changes: 24 additions & 54 deletions docs/architecture.md
@@ -1,10 +1,14 @@
# Architecture

Rubydex indexes Ruby codebases in two distinct stages: **Discovery** and **Resolution**. Understanding this separation is crucial for working with the codebase.
Rubydex represents Ruby code as a graph, which is populated and analyzed in multiple phases.

## Core Concepts: Definition vs Declaration
## Core Concepts

A **Definition** represents a single source-level construct found at a specific location in the code. It captures exactly what the parser sees without making assumptions about runtime behavior.
### Definition vs Declaration

A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
captured during extraction.
Comment on lines +9 to +11
Contributor

I'm not sure I understand the last sentence. Are we talking about indexing?

Suggested change
A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
captured during extraction.
A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
captured during indexing.


A **Declaration** represents the global semantic concept of a name, combining all definitions that contribute to the same fully qualified name. Declarations are produced during resolution.

@@ -21,7 +25,7 @@ class Foo::Bar; end
class Foo::Bar; end
```

**Definitions** (4 total - what the indexer discovers):
**Definitions** (4 total - what extraction discovers):
Contributor

Okay, it looks like we're renaming the second phase.

What if we named the phases after what they create? 🤔

  1. listing becomes documents, or maybe uris or paths
  2. indexer becomes definitions
  3. resolution becomes declarations

Member Author

I'm honestly not a fan. I'd stick with names that try to describe the phase rather than just use its outputs.


1. Module definition for `Foo` in `foo.rb`
2. Class definition for `Bar` (nested inside `Foo`) in `foo.rb`
@@ -33,60 +37,16 @@ class Foo::Bar; end
1. `Foo` - A module that has a constant `Bar` under its namespace
2. `Foo::Bar` - A class, composed of definitions 2, 3, and 4
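
The relationship between definitions and declarations can be sketched as a simple grouping step. This is a minimal illustration, not Rubydex's implementation: the `Definition` struct, the `build_declarations` helper and the URIs are all hypothetical, and the fully qualified names are given as input here, whereas in Rubydex they are only known after resolution.

```ruby
# Hypothetical, simplified record. The real definition type carries more data.
Definition = Struct.new(:kind, :fully_qualified_name, :uri, keyword_init: true)

# Group every definition that contributes to the same fully qualified name.
# Each resulting entry plays the role of one declaration.
def build_declarations(definitions)
  definitions.group_by(&:fully_qualified_name)
end

# Placeholder data: one module definition and three class definitions that
# all reopen the same class from different (made up) files.
definitions = [
  Definition.new(kind: :module, fully_qualified_name: "Foo", uri: "file:///a.rb"),
  Definition.new(kind: :class, fully_qualified_name: "Foo::Bar", uri: "file:///a.rb"),
  Definition.new(kind: :class, fully_qualified_name: "Foo::Bar", uri: "file:///b.rb"),
  Definition.new(kind: :class, fully_qualified_name: "Foo::Bar", uri: "file:///c.rb"),
]

declarations = build_declarations(definitions)

declarations.each do |name, defs|
  puts "#{name}: #{defs.size} definition(s)"
end
```

Four definitions collapse into two declarations, which mirrors the counts in the example above.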

## Two-Stage Indexing Pipeline

### Stage 1: Discovery

Discovery walks the AST and extracts definitions from source code. It captures **only what is explicitly written**, making no assumptions about runtime behavior.

**What Discovery does:**

- Creates `Definition` objects for classes, modules, methods, constants, variables
- Records source locations, comments, and lexical ownership (`owner_id`)
- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
- Records mixins (`include`, `prepend`, `extend`) on their containing class/module

**What Discovery does NOT do:**
### Documents

- Compute fully qualified names
- Resolve constant references to declarations
- Determine inheritance hierarchies
- Assign semantic membership

#### Why No Assumptions During Discovery?

Consider this example:

```ruby
module Bar; end

class Foo
  class Bar::Baz; end
end
```

Without resolving constant references, it may appear that `Bar::Baz` is created under `Foo`. But it's actually not - `Bar` resolves to the top-level `Bar`, so the class is `Bar::Baz`, not `Foo::Bar::Baz`.

Discovery cannot know this without first resolving `Bar`. This is why fully qualified names and semantic membership are computed during Resolution, not Discovery.

### Stage 2: Resolution

Resolution combines the discovered definitions to build a semantic understanding of the codebase.

**What Resolution does:**

- Compute fully qualified names for all definitions
- Create `Declaration` objects that group definitions by fully qualified name
- Resolve constant references to their target declarations
- Linearize ancestor chains (including resolving mixins)
- Assign semantic membership (which methods/constants belong to which class)
- Create implicit singleton classes from `def self.method` patterns
A document represents a single resource in the codebase, which might be committed to disk or virtual. Each document is
connected to all concepts extracted from it, like definitions.

## Graph Structure

Rubydex represents the codebase as a graph, where entities are nodes and relationships are edges. The visualization below shows the conceptual structure (implemented as an adjacency list using IDs).

[Open in Excalidraw](https://excalidraw.com/#json=hQiLSD8nJRVxONhuwtSn4,L78TkfeB4YL1HJTf5L0bvw)
[Open in Excalidraw](https://excalidraw.com/#json=utleYxF0AaAgEMLpwp1LE,RrLk4AKECjnuhsVd32saxw)

![Graph visualization](images/graph.png)

@@ -103,5 +63,15 @@ Connections between nodes use hashed IDs defined in `ids.rs`:

- `DefinitionId`: Hash of URI, byte offset, and name
- `DeclarationId`: Hash of fully qualified name (e.g., `Foo::Bar` or `Foo#my_method`)
- `NameId`: Hash of unqualified name (e.g., `Bar` instead of `Foo::Bar`)
- `UriId`: Hash of file URI
- `NameId`: Hash combining unqualified name, parent scope and nesting
- `UriId`: Hash of document URI
- `StringId`: Hash of an interned string
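
To make the idea of hashed IDs concrete, here is a small sketch of how stable IDs can be derived from the listed components. It is an assumption-heavy illustration: SHA-256 and the helper names (`hashed_id`, `definition_id`, `declaration_id`) are stand-ins chosen for this example, and say nothing about the hash function or API actually used in `ids.rs`.

```ruby
require "digest"

# Illustration only: SHA-256 stands in for whatever hash `ids.rs` really uses.
# Parts are joined with a NUL byte so ("ab", "c") and ("a", "bc") differ.
def hashed_id(*parts)
  Digest::SHA256.hexdigest(parts.join("\0"))[0, 16]
end

# Hypothetical helper: a definition is identified by where it was found.
def definition_id(uri, byte_offset, name)
  hashed_id(uri, byte_offset, name)
end

# Hypothetical helper: a declaration is identified by its fully qualified name.
def declaration_id(fully_qualified_name)
  hashed_id(fully_qualified_name)
end

first = definition_id("file:///foo.rb", 0, "Foo")
again = definition_id("file:///foo.rb", 0, "Foo")
moved = definition_id("file:///foo.rb", 42, "Foo")

puts first == again # same inputs always give the same ID
puts first == moved # a different byte offset gives a different ID
```

Because the ID is a pure function of its inputs, two nodes can refer to each other by ID without holding direct references, which is what makes the adjacency list representation work.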

## Phases of analysis

The code analysis happens in phases, which are documented in their own markdown files. Stages are used just to help
clarify the goal of the steps.

- Indexing: stage for building the knowledge about the codebase
- Phase 1: [Extraction](extraction.md)
Contributor

There is a phase before where we collect all the files we'll index

- Phase 2: [Resolution](resolution.md)
49 changes: 49 additions & 0 deletions docs/extraction.md
@@ -0,0 +1,49 @@
# Extraction
Contributor

I'd rather have one long file than a few short ones. It makes it easier to search.


During extraction, source code is parsed into ASTs and key information is recorded to be transformed and used in
Contributor

Do we actually do any "transformation"?

subsequent phases. The captured information is remembered as is, with no assumptions about runtime behavior or
semantics.

The intention is to be able to work backwards to the original code since the goal is to support many different tools.
As an example, considering an `unless` as an `if !` is generally correct for static analysis. However, it would not be
Contributor

This may be confusing as we do not even look at method bodies.

What if we used the example of storing attributes as methods?

possible to write a linting rule prohibiting the use of `unless` if we transformed all of them into `if` statements
during this phase.

As a general rule, this phase tries to represent the code we found in documents with high fidelity. It's the
[resolution phase](resolution.md) that performs more meaningful transformations on the data.

**What Extraction does:**

- Creates `Definition` objects for classes, modules, methods, constants, variables
- Records source locations, comments, and lexical ownership (`owner_id`)
- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
Contributor

And also unresolved method references

- Records mixins (`include`, `prepend`, `extend`) on their containing class/module

**What Extraction does NOT do:**

- Compute fully qualified names
- Resolve constant references to declarations
- Determine inheritance hierarchies
- Assign semantic membership

#### Why No Assumptions During Extraction?

Consider this example:

```ruby
# bar.rb
module Bar; end

# baz.rb
class Foo
  class Bar::Baz; end
end
```

When extracting information from `baz.rb`, it may seem that the class being created is `Foo::Bar::Baz`. However,
extraction only sees one document at a time and constants in Ruby are resolved globally, taking all of the information
from the entire codebase into account.

We can only discover that the class's true fully qualified name is actually `Bar::Baz` once we have extracted information from
all of the files involved. This also affects constant ownership. At first glance, it seems that `Bar` is a member of the
`Foo` class, when in reality it is defined at the top level (and is therefore a member of `Object`).
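
The claim can be checked against Ruby itself. The snippet below puts both files from the example into a single script (an assumption made only to keep it runnable) and asks the runtime what was actually defined.

```ruby
# Contents of bar.rb
module Bar; end

# Contents of baz.rb
class Foo
  class Bar::Baz; end
end

# The class is named after the resolved `Bar`, not after its lexical parent.
puts Bar::Baz.name                    # => "Bar::Baz"

# `Foo` owns no constant called `Bar`; passing `false` skips inherited lookup.
puts Foo.const_defined?(:Bar, false)  # => false

# `Bar` lives at the top level, so it is owned by `Object`.
puts Object.const_defined?(:Bar, false) # => true
```

A tool that looks only at `baz.rb` has no way to produce these answers, which is why extraction records the name unresolved and leaves the decision to resolution.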
Binary file modified docs/images/graph.png
Contributor

I think Declaration should be broken down into Class, Module, etc.

114 changes: 114 additions & 0 deletions docs/resolution.md
@@ -0,0 +1,114 @@
# Resolution

Resolution combines the outputs of the extraction phase to create a global semantic representation of the codebase.
This is the step that tries to understand the resulting structure of everything defined in a project, including all
relationships and connections.

**What Resolution does:**

- Computes fully qualified names for all definitions
- Creates `Declaration` objects that group definitions by fully qualified name
- Resolves constant references to their target declarations
- Linearizes ancestor chains (including resolving mixins)
- Keeps track of descendants
- Assigns semantic membership (which methods/constants belong to which class)
- Creates implicit singleton classes (e.g.: `def self.method` patterns or class instance variables)
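
As a reminder of why singleton classes have to be created implicitly, the snippet below shows where Ruby itself stores a `def self.method`. The `Connection` class is a made up example. It only demonstrates Ruby's runtime behavior and makes no claim about how Rubydex models it internally.

```ruby
class Connection
  def self.open; end # singleton method: owned by Connection's singleton class
  def close; end     # instance method: owned by Connection itself
end

singleton = Connection.singleton_class

puts singleton.singleton_class?              # => true
p singleton.instance_methods(false)          # => [:open]
p Connection.instance_methods(false)         # => [:close]
```

The source never writes `class << self`, yet `open` needs an owner that is distinct from `Connection`, so the owner has to be synthesized while assigning membership.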

## The resolution loop

### The problem of interdependencies

To create global declarations, we need to fully qualify all names extracted from a codebase. However, determining the
fully qualified name depends on resolving constants. Consider the same example as the one in
[extraction](extraction.md).

```ruby
# bar.rb
module Bar; end

# baz.rb
class Foo
  class Bar::Baz
    def qux; end
  end
end
```

The fully qualified name of the class defined in `baz.rb` is `Bar::Baz`. Therefore, the fully qualified name of the
method is `Bar::Baz#qux`. We can only determine that correctly if we have already resolved the `Bar` constant reference
involved in the class's name.

To further increase the complexity, constants have interdependencies. To resolve a constant reference, we need to:

- Search the surrounding lexical scopes
- Search the ancestor chain of the lexical scope where the reference was found
- Fall back to the top level
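
The three lookup steps above can be observed directly in Ruby. In this sketch all constant and class names are invented for the demonstration. Each method reads a constant that is only reachable through one of the steps, plus one constant that is reachable through all three to show which step wins.

```ruby
LEVEL = :top_level
ONLY_TOP = :top_level

module Outer
  LEVEL = :lexical

  class Parent
    LEVEL = :inherited
    ONLY_PARENT = :inherited
  end

  class Child < Parent
    # Defined in all three places: the surrounding lexical scope wins.
    def self.level
      LEVEL
    end

    # Not in any lexical scope, so it is found through the ancestor chain.
    def self.only_parent
      ONLY_PARENT
    end

    # Found nowhere else, so lookup falls back to the top level.
    def self.only_top
      ONLY_TOP
    end
  end
end

p Outer::Child.level       # => :lexical
p Outer::Child.only_parent # => :inherited
p Outer::Child.only_top    # => :top_level
```

Note that resolving `Parent` in `class Child < Parent` is itself a constant lookup, which is a small instance of the interdependency problem this section describes.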

Considering that other constants are involved in the lexical scopes and ancestor chains, you get even more cross
dependencies. We can even have dependencies within the same ancestor chain. Consider this other example:

```ruby
module Foo
  module Bar
  end
end

class Baz
  include Foo
  include Bar
end
```

When we include `Foo`, it makes `Bar` available through inheritance, which then allows us to also include it. This
means that in order to fully resolve this example we need to:

- Create global declarations for `Foo`, `Bar` and `Baz`
- Correctly assign membership, recording that `Bar` is owned by `Foo`
- Partially linearize the ancestor chain of `Baz`, leaving a todo for `Bar` since we can't yet resolve it
- Finally, fully linearize the ancestors now that we processed `Foo` and know `Bar` is available through inheritance
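
Running the example confirms both the final ancestor chain and the ordering dependency. The extra `Qux` class is an addition made for this demonstration: it tries to include `Bar` without including `Foo` first.

```ruby
module Foo
  module Bar
  end
end

class Baz
  include Foo
  include Bar # resolves to Foo::Bar, reachable only because Foo was included
end

p Baz.ancestors.first(3) # => [Baz, Foo::Bar, Foo]

# Without `include Foo`, the same reference cannot be resolved at all.
error = begin
  class Qux
    include Bar
  end
  nil
rescue NameError => e
  e
end

p error.class # => NameError
```

The second `include` in `Baz` can only be resolved after the first one has been applied, so a static resolver cannot linearize the ancestors of `Baz` in a single step.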

### The loop

Trying to create a tree of dependencies ahead of time to resolve constants is difficult because some dependencies are
only identified when we are in the middle of performing resolution. Instead of taking that approach, the resolution
loop is an optimistically sorted worklist algorithm (inspired by [Sorbet's](https://github.com/sorbet/sorbet) approach).
Contributor

Let's point to the exact file for reference


The basic idea is to sort the worklist in an order that's likely to succeed most of the time, minimizing the number of
times we need to retry. If we fail to resolve something, we re-enqueue it to try again. The loop runs in passes (or
epochs), each going through the list of work. If we exhaust the worklist or fail to make any progress in a pass, then we
finalize the loop.

```rust
// See the actual implementation in resolution.rs

pub fn resolve_all(graph: &mut Graph) {
    // Partition and sort all of the work ahead of time
    let (mut unit_queue, other_ids) = sorted_units(graph);

    // Outer loop that controls the passes
    loop {
        // Keep track if we made progress this pass
        let mut made_progress = false;

        // Resolution pass. We only go through the items that were in the queue when the pass started, which
        // automatically excludes retries that we enqueue during this pass
        for _ in 0..unit_queue.len() {
            let Some(unit_id) = unit_queue.pop_front() else {
                break;
            };

            // Perform different work depending on what item we found. On success an arm sets `made_progress`,
            // on failure it re-enqueues the unit so that it is retried in a later pass
            match unit_id {
                Unit::Definition(id) => { /* handle constant definitions */ }
                Unit::Reference(id) => { /* handle constant references */ }
                Unit::Ancestors(id) => { /* handle ancestor linearization retries */ }
            }
        }

        // Once the pass is over: if we're no longer able to advance the analysis or if we finished all of the
        // work, break out of the outer loop
        if !made_progress || unit_queue.is_empty() {
            break;
        }
    }
}
```

Contributor

At some point a diagram with boxes and arrows may be easier to follow. But let's leave this for later once everything has stabilized.
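
To see the pass structure run end to end, here is a toy version of the worklist loop. It is a sketch under simplifying assumptions, not a port of `resolution.rs`: units are plain strings, a unit "resolves" as soon as all of its dependencies are resolved, and the `dependencies` table (including the deliberately unresolvable `Missing` entry) is made up for the example.

```ruby
# Toy worklist: `dependencies` maps each unit to the units it needs first.
def resolve_all(dependencies)
  queue = dependencies.keys # assume this is the optimistic initial order
  resolved = []
  passes = 0

  loop do
    made_progress = false
    passes += 1

    # Visit only the items queued when the pass started. Retries appended
    # during the pass wait for the next one.
    queue.size.times do
      unit = queue.shift

      if dependencies[unit].all? { |dep| resolved.include?(dep) }
        resolved << unit
        made_progress = true
      else
        queue.push(unit) # re-enqueue to retry in a later pass
      end
    end

    # Stop when the work is done or when a whole pass changed nothing.
    break if !made_progress || queue.empty?
  end

  { resolved: resolved, unresolved: queue, passes: passes }
end

dependencies = {
  "Baz ancestors" => ["Foo", "Foo::Bar"],
  "Foo::Bar" => ["Foo"],
  "Foo" => [],
  "Missing" => ["Nope"],
}

result = resolve_all(dependencies)
p result[:resolved]   # => ["Foo", "Foo::Bar", "Baz ancestors"]
p result[:unresolved] # => ["Missing"]
p result[:passes]     # => 4
```

The input order here is the worst case on purpose, so every pass resolves exactly one unit. With a better initial sort the same work finishes in fewer passes, which is the point of sorting optimistically. The final pass makes no progress and ends the loop, leaving `Missing` unresolved instead of retrying forever.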