diff --git a/docs/architecture.md b/docs/architecture.md
index d6ed0eaa..4cc29c6e 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -1,10 +1,14 @@
 # Architecture
 
-Rubydex indexes Ruby codebases in two distinct stages: **Discovery** and **Resolution**. Understanding this separation is crucial for working with the codebase.
+Rubydex uses a graph to represent Ruby code, which gets populated and analyzed in multiple phases.
 
-## Core Concepts: Definition vs Declaration
+## Core Concepts
 
-A **Definition** represents a single source-level construct found at a specific location in the code. It captures exactly what the parser sees without making assumptions about runtime behavior.
+### Definition vs Declaration
+
+A **Definition** represents a single source-level construct found at a specific location in the code. It captures key
+information from the AST without making major transformations or assumptions about runtime behavior. Definitions are
+captured during extraction.
 
 A **Declaration** represents the global semantic concept of a name, combining all definitions that contribute to the same fully qualified name. Declarations are produced during resolution.
 
@@ -21,7 +25,7 @@ class Foo::Bar; end
 class Foo::Bar; end
 ```
 
-**Definitions** (4 total - what the indexer discovers):
+**Definitions** (4 total - what extraction discovers):
 
 1. Module definition for `Foo` in `foo.rb`
 2. Class definition for `Bar` (nested inside `Foo`) in `foo.rb`
@@ -33,60 +37,16 @@ class Foo::Bar; end
 1. `Foo` - A module that has a constant `Bar` under its namespace
 2. `Foo::Bar` - A class, composed of definitions 2, 3, and 4
 
-## Two-Stage Indexing Pipeline
-
-### Stage 1: Discovery
-
-Discovery walks the AST and extracts definitions from source code. It captures **only what is explicitly written**, making no assumptions about runtime behavior.
-
-**What Discovery does:**
-
-- Creates `Definition` objects for classes, modules, methods, constants, variables
-- Records source locations, comments, and lexical ownership (`owner_id`)
-- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
-- Records mixins (`include`, `prepend`, `extend`) on their containing class/module
-
-**What Discovery does NOT do:**
+### Documents
 
-- Compute fully qualified names
-- Resolve constant references to declarations
-- Determine inheritance hierarchies
-- Assign semantic membership
-
-#### Why No Assumptions During Discovery?
-
-Consider this example:
-
-```ruby
-module Bar; end
-
-class Foo
-  class Bar::Baz; end
-end
-```
-
-Without resolving constant references, it may appear that `Bar::Baz` is created under `Foo`. But it's actually not - `Bar` resolves to the top-level `Bar`, so the class is `Bar::Baz`, not `Foo::Bar::Baz`.
-
-Discovery cannot know this without first resolving `Bar`. This is why fully qualified names and semantic membership are computed during Resolution, not Discovery.
-
-### Stage 2: Resolution
-
-Resolution combines the discovered definitions to build a semantic understanding of the codebase.
-
-**What Resolution does:**
-
-- Compute fully qualified names for all definitions
-- Create `Declaration` objects that group definitions by fully qualified name
-- Resolve constant references to their target declarations
-- Linearize ancestor chains (including resolving mixins)
-- Assign semantic membership (which methods/constants belong to which class)
-- Create implicit singleton classes from `def self.method` patterns
+A document represents a single resource in the codebase, which might be committed to disk or virtual. Documents are
+connected to all concepts extracted from them, like definitions.
 
 ## Graph Structure
 
 Rubydex represents the codebase as a graph, where entities are nodes and relationships are edges. The visualization below shows the conceptual structure (implemented as an adjacency list using IDs).
 
-[Open in Excalidraw](https://excalidraw.com/#json=hQiLSD8nJRVxONhuwtSn4,L78TkfeB4YL1HJTf5L0bvw)
+[Open in Excalidraw](https://excalidraw.com/#json=utleYxF0AaAgEMLpwp1LE,RrLk4AKECjnuhsVd32saxw)
 
 ![Graph visualization](images/graph.png)
 
@@ -103,5 +63,15 @@ Connections between nodes use hashed IDs defined in `ids.rs`:
 
 - `DefinitionId`: Hash of URI, byte offset, and name
 - `DeclarationId`: Hash of fully qualified name (e.g., `Foo::Bar` or `Foo#my_method`)
-- `NameId`: Hash of unqualified name (e.g., `Bar` instead of `Foo::Bar`)
-- `UriId`: Hash of file URI
+- `NameId`: Hash combining unqualified name, parent scope and nesting
+- `UriId`: Hash of document URI
+- `StringId`: Hash of an interned string
+
+## Phases of analysis
+
+The code analysis happens in phases, which are documented in their own markdown files. Stages are only used to help
+clarify the goal of the steps.
+
+- Indexing: stage for building the knowledge about the codebase
+  - Phase 1: [Extraction](extraction.md)
+  - Phase 2: [Resolution](resolution.md)
diff --git a/docs/extraction.md b/docs/extraction.md
new file mode 100644
index 00000000..381f522f
--- /dev/null
+++ b/docs/extraction.md
@@ -0,0 +1,49 @@
+# Extraction
+
+During extraction, source code is parsed into ASTs and key information is recorded to be transformed and used in
+subsequent phases. The captured information is remembered as is, with no assumptions about runtime behavior or
+semantics.
+
+The intention is to be able to work backwards to the original code since the goal is to support many different tools.
+As an example, considering an `unless` as an `if !` is generally correct for static analysis. However, it would not be
+possible to write a linting rule prohibiting the use of `unless` if we transformed all of them into `if` statements
+during this phase.
+
+As a general rule, this phase tries to represent the code we found in documents with high fidelity. It's the
+[resolution phase](resolution.md) that performs more meaningful transformations on the data.
+
+**What Extraction does:**
+
+- Creates `Definition` objects for classes, modules, methods, constants, variables
+- Records source locations, comments, and lexical ownership (`owner_id`)
+- Captures unresolved constant references (e.g., `Foo::Bar` as a `NameId`)
+- Records mixins (`include`, `prepend`, `extend`) on their containing class/module
+
+**What Extraction does NOT do:**
+
+- Compute fully qualified names
+- Resolve constant references to declarations
+- Determine inheritance hierarchies
+- Assign semantic membership
+
+## Why No Assumptions During Extraction?
+
+Consider this example:
+
+```ruby
+# bar.rb
+module Bar; end
+
+# baz.rb
+class Foo
+  class Bar::Baz; end
+end
+```
+
+When extracting information from `baz.rb`, it may seem that the class being created is `Foo::Bar::Baz`. However,
+extraction only sees one document at a time and constants in Ruby are resolved globally, taking all of the information
+from the entire codebase into account.
+
+We can only discover that the class' true fully qualified name is actually `Bar::Baz` once we have extracted
+information from all of the files involved. This also affects constant ownership. At first glance, it seems that `Bar`
+is a member of the `Foo` class, when in reality it is defined at the top level (and therefore a member of `Object`).
diff --git a/docs/images/graph.png b/docs/images/graph.png
index 784f8bf0..65b6933f 100644
Binary files a/docs/images/graph.png and b/docs/images/graph.png differ
diff --git a/docs/resolution.md b/docs/resolution.md
new file mode 100644
index 00000000..b39706d3
--- /dev/null
+++ b/docs/resolution.md
@@ -0,0 +1,114 @@
+# Resolution
+
+Resolution combines the outputs of the extraction phase to create a global semantic representation of the codebase.
+This is the step that tries to understand the resulting structure of everything defined in a project, including all
+relationships and connections.
+
+**What Resolution does:**
+
+- Computes fully qualified names for all definitions
+- Creates `Declaration` objects that group definitions by fully qualified name
+- Resolves constant references to their target declarations
+- Linearizes ancestor chains (including resolving mixins)
+- Keeps track of descendants
+- Assigns semantic membership (which methods/constants belong to which class)
+- Creates implicit singleton classes (e.g., `def self.method` patterns or class instance variables)
+
+## The resolution loop
+
+### The problem of interdependencies
+
+To create global declarations, we need to fully qualify all names extracted from a codebase. However, determining the
+fully qualified name depends on resolving constants. Consider the same example as the one in
+[extraction](extraction.md).
+
+```ruby
+# bar.rb
+module Bar; end
+
+# baz.rb
+class Foo
+  class Bar::Baz
+    def qux; end
+  end
+end
+```
+
+The fully qualified name of the class defined in `baz.rb` is `Bar::Baz`. Therefore, the fully qualified name of the
+method is `Bar::Baz#qux`. We can only determine that correctly if we have already resolved the `Bar` constant reference
+involved in the class' name.
+
+To further increase the complexity, constants have interdependencies. To resolve a constant reference, we need to:
+
+- Search the surrounding lexical scopes
+- Search the ancestor chain of the lexical scope where the reference was found
+- Fall back to the top level
+
+Considering that other constants are involved in the lexical scopes and ancestor chains, you get even more cross
+dependencies. We can even have dependencies within the same ancestor chain. Consider this other example:
+
+```ruby
+module Foo
+  module Bar
+  end
+end
+
+class Baz
+  include Foo
+  include Bar
+end
+```
+
+When we include `Foo`, it makes `Bar` available through inheritance, which then allows us to also include it. This
+means that in order to fully resolve this example we need to:
+
+- Create global declarations for `Foo`, `Bar` and `Baz`
+- Correctly assign membership so that `Bar` is owned by `Foo`
+- Partially linearize the ancestor chain of `Baz`, leaving a todo for `Bar` since we can't yet resolve it
+- Finally, fully linearize the ancestors now that we processed `Foo` and know `Bar` is available through inheritance
+
+### The loop
+
+Trying to create a tree of dependencies ahead of time to resolve constants is difficult because some dependencies are
+only identified when we are in the middle of performing resolution. Instead of taking that approach, the resolution
+loop is an optimistically sorted worklist algorithm (inspired by [Sorbet's](https://github.com/sorbet/sorbet) approach).
+
+The basic idea is to sort the worklist in an order that's likely to succeed most of the time, minimizing the number of
+times we need to retry. If we fail to resolve something, we re-enqueue to try again. The loop has passes (or epochs)
+where we go through the list of work. If we exhaust the worklist or fail to make any progress in a pass, then we
+finalize the loop.
+
+```rust
+// See the actual implementation in resolution.rs
+
+pub fn resolve_all(graph: &mut Graph) {
+    // Partition and sort all of the work ahead of time
+    let (mut unit_queue, other_ids) = sorted_units(graph);
+
+    // Outer loop that controls the passes
+    loop {
+        // Keep track if we made progress this pass
+        let mut made_progress = false;
+
+        // Resolution pass. We go through the full length of the queue at this time, which automatically excludes
+        // retries that we find during this pass
+        for _ in 0..unit_queue.len() {
+            let Some(unit_id) = unit_queue.pop_front() else {
+                break;
+            };
+
+            // Perform different work depending on what item we found
+            match unit_id {
+                Unit::Definition(id) => { /* handle constant definitions */ }
+                Unit::Reference(id) => { /* handle constant references */ }
+                Unit::Ancestors(id) => { /* handle ancestor linearization retries */ }
+            }
+        }
+
+        // If we're no longer able to advance the analysis or if we finished all of the work, break out of the loop
+        if !made_progress || unit_queue.is_empty() {
+            break;
+        }
+    }
+}
+```
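
The worklist loop described in `resolution.md` can be illustrated with a small, self-contained sketch. This is not the code in `resolution.rs`: the `Unit` struct, its `needs` field, and the `demo` helper are hypothetical names introduced only for illustration, and "resolving" a unit is reduced to checking that everything it depends on has already been resolved. The sample data loosely mirrors the `include Foo` / `include Bar` example, plus one unit that can never be resolved so that the "no progress" exit is exercised.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical unit of work: a name plus the names that must be resolved first.
#[derive(Debug, Clone)]
pub struct Unit {
    pub name: &'static str,
    pub needs: Vec<&'static str>,
}

/// Runs passes over the worklist until it is empty or a full pass makes no progress.
/// Returns the order in which units were resolved and the units left unresolved.
pub fn resolve_all(units: Vec<Unit>) -> (Vec<&'static str>, Vec<&'static str>) {
    let mut queue: VecDeque<Unit> = units.into();
    let mut resolved: HashSet<&'static str> = HashSet::new();
    let mut order = Vec::new();

    loop {
        let mut made_progress = false;

        // Visit only the items queued when the pass started; retries pushed
        // during this pass wait for the next one.
        for _ in 0..queue.len() {
            let Some(unit) = queue.pop_front() else { break };

            if unit.needs.iter().all(|dep| resolved.contains(dep)) {
                resolved.insert(unit.name);
                order.push(unit.name);
                made_progress = true;
            } else {
                // Not resolvable yet: re-enqueue and retry in a later pass.
                queue.push_back(unit);
            }
        }

        // Checked once per pass, after the whole queue has been visited.
        if !made_progress || queue.is_empty() {
            break;
        }
    }

    let unresolved = queue.into_iter().map(|unit| unit.name).collect();
    (order, unresolved)
}

/// Sample run with a deliberately unfavorable initial order.
pub fn demo() -> (Vec<&'static str>, Vec<&'static str>) {
    resolve_all(vec![
        Unit { name: "Baz ancestors", needs: vec!["Foo", "Foo::Bar"] },
        Unit { name: "Foo::Bar", needs: vec!["Foo"] },
        Unit { name: "Foo", needs: vec![] },
        Unit { name: "Qux", needs: vec!["Missing"] },
    ])
}
```

Calling `demo()` from a `main` function and printing both vectors shows the effect of ordering: with this input the loop needs several passes, whereas a worklist sorted so that `Foo` comes first would need fewer retries, which is the motivation for sorting optimistically ahead of time.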