Add incremental invalidation/resolution #589
st0012 wants to merge 4 commits into unresolve-primitives from
365da40 to 774d476
vinistock left a comment
I'm still trying to reason about the algorithm, so not done reviewing yet.
7c0e491 to e8cdc32
vinistock left a comment
This PR is highly complex. I think the mechanism works, but it's hard to ship so many changes in one go. Here's what I think we should do to minimize the changes and ship with confidence:
- Move the rename of `resolve_all` to `resolve` to a separate PR
- Create a PR implementing `unresolve_name` and `unresolve_reference` that can be shipped separately from the algorithm
- Add the `name_dependents` hashmap of `IdentityHashMap<NameId, IdentityHashSet<NameDependent>>` using a `NameDependent` enum. Populate this map during indexing when creating names, so that the global graph only has to merge the work at the end
With this foundation, it will be significantly easier for reviewers to focus on the algorithm. What do you think?
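A rough sketch of what the third bullet could look like, to make the proposal concrete. Everything here is an assumption for illustration: the ID newtypes are stand-ins, plain `HashMap`/`HashSet` replace `IdentityHashMap`/`IdentityHashSet`, and `record_dependent` and `merge_dependents` are hypothetical helper names, not functions from the crate.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real ID types defined in the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NameId(pub u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DefinitionId(pub u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ReferenceId(pub u64);

/// Something that depends on a name: a definition or a constant reference.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum NameDependent {
    Definition(DefinitionId),
    Reference(ReferenceId),
}

// Plain std collections stand in for IdentityHashMap / IdentityHashSet.
pub type NameDependents = HashMap<NameId, HashSet<NameDependent>>;

/// Hypothetical helper: called while indexing one document, when a name is created.
pub fn record_dependent(local: &mut NameDependents, name: NameId, dependent: NameDependent) {
    local.entry(name).or_default().insert(dependent);
}

/// Hypothetical helper: merges one document's map into the global graph's map.
pub fn merge_dependents(global: &mut NameDependents, local: NameDependents) {
    for (name, dependents) in local {
        global.entry(name).or_default().extend(dependents);
    }
}
```

With this shape, each document builds its own map during indexing and the global graph only unions the sets per `NameId` at the end.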
    if let Some(name_set) = self.declaration_to_names.get_mut(&declaration_id) {
        name_set.remove(&name_id);
        if name_set.is_empty() {
            self.declaration_to_names.remove(&declaration_id);
        }
    }
This is one of the reasons why I'm not a fan of the multiple hashmaps. We need to ensure that the data on the auxiliary maps is also consistent and the benefit is just avoiding tracing the graph from declaration -> definitions -> names. I'm not convinced tbh.
Also, to make sure we're making progress, I think you can probably ship a separate PR with just the `unresolve_name` and `unresolve_reference` methods.
I'll address all the feedback first and then see how we can split this PR.
rust/rubydex/src/model/graph.rs
Outdated
    /// Reverse index: for each `NameId`, which definitions and constant references use it.
    /// Eliminates O(D+R) scans during invalidation.
    name_users: IdentityHashMap<NameId, Vec<NameUser>>,

    /// Reverse index: for each `NameId`, which other names depend on it
    /// (via nesting or `parent_scope`). Used for cascade invalidation.
    name_dependents: IdentityHashMap<NameId, IdentityHashSet<NameId>>,
Why do we need these two maps? The idea of having an enum for name dependents is so that you can go NameId -> ReferenceId | DefinitionId -> NameId.
I merged the maps but I think the value should also include NameId. So it'd be NameId -> ReferenceId | DefinitionId | NameId.
I prototyped using reference/definition to look up the name, but it doesn't work well with `parent_scope`/nesting cases. For example:
    class Baz
      include Foo
      CONST
    end
The reference `CONST` creates a name with `nesting=baz_name`, but it doesn't create a member declaration under `Baz`. So `Baz.members()` is empty, and there's no path from `baz_name` to the reference's name through the declaration/definitions. When `Baz`'s ancestors change (e.g., a new `prepend Bar` is added), we need to re-evaluate `CONST`, but without the explicit `Name(NameId)` entry under `baz_name`, we can't discover it.
> there's no path from baz_name to the reference's name through the declaration/definitions

This is the only reason why we need `name_dependents`. In your example, this is what I would expect the hashmap to look like:
    # name_dependents
    {
      NameId(Baz) => Set[ReferenceId(Foo), ReferenceId(CONST)]
    }
This allows the graph to remember which references and definitions will be potentially impacted by a name change. We then trace name_dependents and the rest of the graph to unresolve names.
In this case, if we had to unresolve all names due to a change to `Baz`, I would expect the algorithm to do something like this:
1. `Baz` changed. Loop through name dependents
2. Regardless of whether the dependent is a reference or definition, get its `name_id`, pull the name from the graph and unresolve it
3. Now, unresolving the definition and reference may invalidate other things. Go back to 1. and invalidate the `name_id` for the reference/definition
This also involves invalidating ancestors, but you get the idea.
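A hedged sketch of those three steps as a worklist loop, leaving ancestor invalidation out. The `Graph` below is a toy with only the fields the loop needs. `definition_names`, `reference_names`, `resolved` and `unresolve_dependents_of` are hypothetical names for illustration, and plain `HashMap`/`HashSet` stand in for the identity collections.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-ins for the real ID types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NameId(pub u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DefinitionId(pub u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ReferenceId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum NameDependent {
    Definition(DefinitionId),
    Reference(ReferenceId),
}

/// Toy graph: every field here is illustrative, not the real struct layout.
#[derive(Default)]
pub struct Graph {
    pub name_dependents: HashMap<NameId, HashSet<NameDependent>>,
    pub definition_names: HashMap<DefinitionId, NameId>,
    pub reference_names: HashMap<ReferenceId, NameId>,
    /// Names that are currently resolved.
    pub resolved: HashSet<NameId>,
}

impl Graph {
    /// Unresolves every name reachable from `changed` through `name_dependents`
    /// and returns the names that were actually unresolved.
    pub fn unresolve_dependents_of(&mut self, changed: NameId) -> Vec<NameId> {
        let mut queue = VecDeque::from([changed]);
        let mut visited = HashSet::from([changed]);
        let mut unresolved = Vec::new();

        while let Some(name) = queue.pop_front() {
            // Step 1: loop through the dependents of the current name.
            let dependents: Vec<NameDependent> = self
                .name_dependents
                .get(&name)
                .map(|set| set.iter().copied().collect())
                .unwrap_or_default();

            for dependent in dependents {
                // Step 2: reference or definition, look up its name_id.
                let name_id = match dependent {
                    NameDependent::Definition(id) => self.definition_names.get(&id).copied(),
                    NameDependent::Reference(id) => self.reference_names.get(&id).copied(),
                };
                if let Some(name_id) = name_id {
                    // Skip names already handled so cycles terminate.
                    if visited.insert(name_id) {
                        if self.resolved.remove(&name_id) {
                            unresolved.push(name_id);
                        }
                        // Step 3: go back to step 1 with this name.
                        queue.push_back(name_id);
                    }
                }
            }
        }
        unresolved
    }
}
```

The `visited` set is what keeps the "go back to 1." step from looping forever when names depend on each other. The changed name itself is left untouched in this sketch; only its transitive dependents are unresolved.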
    /// Accumulated work items from update/delete operations.
    /// Drained by `take_pending_work()` before resolution.
    pending_work: Vec<Unit>,
We still need an answer for the ever-growing memory if users don't call `resolve`.
995854d to 746e90c
Consolidate two reverse index maps (name_users and name_dependents) into a single name_dependents map using a unified NameDependent enum with Definition, Reference, and Name variants. This reduces the Graph struct from 3 auxiliary maps to 2 and eliminates a class of consistency bugs where two maps keyed on NameId had to be cleaned up in sync. Addresses PR review feedback about multiple hashmaps requiring consistency maintenance.
a421b3c to 9ee49c6
Summary
Replace the full re-resolution strategy (`clear_declarations` + resolve everything from scratch) with incremental invalidation. Graph mutations (update/delete_document) now compute the minimal set of definitions, references, and ancestor chains that need re-resolution, and the resolver processes only that subset.
How it works
Graph mutations follow a three-step pipeline:
1. `invalidate`: Detaches old document data from declarations. Identifies which namespace declarations are affected (definition removed, new definition added, new mixin reference added). Runs `invalidate_ancestor_chains` → `unresolve_affected_references` → `cascade_name_invalidation` on the combined set. Collects empty declarations for tree removal.
2. `remove_document_data`: Removes raw refs/defs/names/strings from maps.
3. `extend`: Merges new `LocalGraph` data and queues new definitions/references for resolution.
Work items are accumulated as `PendingWork` (definitions, references, ancestors) and drained internally by the resolver. For the initial full index, this contains everything. For incremental updates, only the invalidated subset.
Reverse indices
Four reverse indices are added to `Graph` to avoid O(N) scans during invalidation:
- `declaration_to_names`: which names resolve to a given declaration (extracted to Add unresolve functions and `declaration_to_names` reverse index #627)
- `name_to_references`: which constant references use a given name
- `name_dependents`: which names depend on another name (via nesting/parent_scope)
- `name_to_definitions`: which definitions use a given name
These enable targeted BFS walks from affected declarations instead of scanning all names/references.
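As an illustration of the kind of walk these indices allow, here is a minimal sketch under stated assumptions: the IDs are plain `u64` aliases, `ReverseIndices`, `Affected` and `collect_affected` are hypothetical names, and std collections replace the identity maps. It is not the code in this PR.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// u64 aliases stand in for the real ID types (assumption for brevity).
pub type DeclarationId = u64;
pub type NameId = u64;
pub type ReferenceId = u64;
pub type DefinitionId = u64;

/// The four reverse indices, modeled with std collections.
#[derive(Default)]
pub struct ReverseIndices {
    pub declaration_to_names: HashMap<DeclarationId, HashSet<NameId>>,
    pub name_to_references: HashMap<NameId, HashSet<ReferenceId>>,
    pub name_to_definitions: HashMap<NameId, HashSet<DefinitionId>>,
    pub name_dependents: HashMap<NameId, HashSet<NameId>>,
}

/// Everything the walk found that needs re-resolution.
#[derive(Default, Debug)]
pub struct Affected {
    pub names: HashSet<NameId>,
    pub references: HashSet<ReferenceId>,
    pub definitions: HashSet<DefinitionId>,
}

/// BFS from the affected declarations. Only entries reachable through the
/// indices are visited, so unrelated names and references are never scanned.
pub fn collect_affected(indices: &ReverseIndices, declarations: &[DeclarationId]) -> Affected {
    let mut affected = Affected::default();
    let mut queue: VecDeque<NameId> = VecDeque::new();

    // Seed the queue with the names that resolve to the affected declarations.
    for declaration in declarations {
        if let Some(names) = indices.declaration_to_names.get(declaration) {
            for &name in names {
                if affected.names.insert(name) {
                    queue.push_back(name);
                }
            }
        }
    }

    while let Some(name) = queue.pop_front() {
        // Collect the references and definitions that use this name.
        if let Some(references) = indices.name_to_references.get(&name) {
            affected.references.extend(references.iter().copied());
        }
        if let Some(definitions) = indices.name_to_definitions.get(&name) {
            affected.definitions.extend(definitions.iter().copied());
        }
        // Cascade to names nested under, or scoped by, this name.
        if let Some(dependents) = indices.name_dependents.get(&name) {
            for &dependent in dependents {
                if affected.names.insert(dependent) {
                    queue.push_back(dependent);
                }
            }
        }
    }
    affected
}
```

The cost of the walk is proportional to the number of reachable entries rather than to the total number of names and references in the graph.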
Cascade invalidation
When a declaration's ancestors change, the invalidation cascades:
Compared to `main`
Correctness: identical. All declaration counts, definition counts, orphan rates, and linked/orphan breakdowns match exactly between main and the branch.
Performance (initial full index on 94,036 files):
Memory:
Resolution is 39% faster at the cost of 13.5% more memory from five auxiliary structures (`declaration_to_names`, `name_to_references`, `name_dependents`, `name_to_definitions`, `pending_work`).