@shivasurya

This PR completes the 3-pass algorithm for building Python call graphs by implementing the final pass that resolves call targets and constructs the complete graph structure with edges linking callers to callees.

Changes

Core Implementation (builder.go)

  1. BuildCallGraph(): Main entry point for Pass 3

    • Indexes all function definitions from code graph
    • Iterates through all Python files in the registry
    • Extracts imports and call sites for each file
    • Resolves each call site to its target function
    • Builds edges and stores call site details
    • Returns complete CallGraph with all relationships
  2. indexFunctions(): Function indexing

    • Scans code graph for all function/method definitions
    • Maps each function to its FQN using module registry
    • Populates CallGraph.Functions map for quick lookup
  3. getFunctionsInFile(): File-scoped function retrieval

    • Filters code graph nodes by file path
    • Returns only function/method definitions in that file
    • Used for finding containing functions of call sites
  4. findContainingFunction(): Call site parent resolution

    • Determines which function contains a given call site
    • Uses line number comparison with nearest-match algorithm
    • Finds function with highest line number ≤ call line
    • Returns empty string for module-level calls
  5. resolveCallTarget(): Core resolution logic

    • Handles simple names: sanitize() → myapp.utils.sanitize
    • Handles qualified names: utils.sanitize() → myapp.utils.sanitize
    • Resolves through import maps first
    • Falls back to same-module resolution
    • Validates FQNs against module registry
    • Returns (FQN, resolved bool) tuple
  6. validateFQN(): FQN validation

    • Checks if a fully qualified name exists in registry
    • Handles both modules and functions within modules
    • Validates parent module for function FQNs
  7. readFileBytes(): File reading helper

    • Reads source files for parsing
    • Handles absolute path conversion

Checklist:

  • Tests passing (gradle testGo)?
  • Lint passing (golangci-lint run; this requires golangci-lint)?

shivasurya and others added 6 commits October 25, 2025 22:47
Add foundational data structures for Python call graph construction:

New Types:
- CallSite: Represents function call locations with arguments and resolution status
- CallGraph: Maps functions to callees with forward/reverse edges
- ModuleRegistry: Maps Python file paths to module paths
- ImportMap: Tracks imports per file for name resolution
- Location: Source code position tracking
- Argument: Function call argument metadata

Features:
- 100% test coverage with comprehensive unit tests
- Bidirectional call graph edges (forward and reverse)
- Support for ambiguous short names in module registry
- Helper functions for module path manipulation
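
As a rough illustration of the bidirectional edges listed above, here is a minimal sketch. The field names `Edges` and `ReverseEdges`, the `AddEdge` method, and the duplicate check are assumptions made for this example and may not match the real `CallGraph` type.

```go
package main

import "fmt"

// CallGraph is a simplified, hypothetical version of the structure described
// above: Edges maps a caller FQN to its callees, ReverseEdges maps a callee
// FQN back to its callers.
type CallGraph struct {
	Edges        map[string][]string
	ReverseEdges map[string][]string
}

func NewCallGraph() *CallGraph {
	return &CallGraph{
		Edges:        make(map[string][]string),
		ReverseEdges: make(map[string][]string),
	}
}

// AddEdge records caller -> callee in both directions and ignores duplicates.
func (g *CallGraph) AddEdge(caller, callee string) {
	for _, existing := range g.Edges[caller] {
		if existing == callee {
			return
		}
	}
	g.Edges[caller] = append(g.Edges[caller], callee)
	g.ReverseEdges[callee] = append(g.ReverseEdges[callee], caller)
}

func main() {
	g := NewCallGraph()
	g.AddEdge("myapp.views.index", "myapp.utils.sanitize")
	g.AddEdge("myapp.api.create", "myapp.utils.sanitize")
	fmt.Println("callees of index:", g.Edges["myapp.views.index"])
	fmt.Println("callers of sanitize:", g.ReverseEdges["myapp.utils.sanitize"])
}
```

Keeping the reverse map alongside the forward one makes "who calls this function?" a single map lookup, at the cost of storing each edge twice.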

This establishes the foundation for the 3-pass call graph algorithm:
- Pass 1 (next PR): Module registry builder
- Pass 2 (next PR): Import extraction and resolution
- Pass 3 (next PR): Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implement the first pass of the call graph construction algorithm: building
a complete registry of Python modules by walking the directory tree.

New Features:
- BuildModuleRegistry: Walks directory tree and maps file paths to module paths
- convertToModulePath: Converts file system paths to Python import paths
- shouldSkipDirectory: Filters out venv, __pycache__, build dirs, etc.

Module Path Conversion:
- Handles regular files: myapp/views.py → myapp.views
- Handles packages: myapp/utils/__init__.py → myapp.utils
- Supports deep nesting: myapp/api/v1/endpoints/users.py → myapp.api.v1.endpoints.users
- Cross-platform: Normalizes Windows/Unix path separators
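
The conversion rules above can be sketched roughly as follows. `toModulePath` is a hypothetical stand-in for `convertToModulePath`, and it assumes the input path is already relative to the project root; the real function may take different arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// toModulePath converts a project-relative file path into a dotted Python
// module path. Hypothetical helper, not the PR's actual signature.
func toModulePath(relPath string) string {
	p := strings.ReplaceAll(relPath, "\\", "/") // normalize Windows separators
	p = strings.TrimSuffix(p, ".py")
	p = strings.TrimSuffix(p, "/__init__") // a package is named by its directory
	p = strings.Trim(p, "/")
	return strings.ReplaceAll(p, "/", ".")
}

func main() {
	for _, path := range []string{
		"myapp/views.py",
		"myapp/utils/__init__.py",
		`myapp\api\v1\endpoints\users.py`,
	} {
		fmt.Printf("%s -> %s\n", path, toModulePath(path))
	}
}
```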

Performance Optimizations:
- Skips 15+ common non-source directories (venv, __pycache__, .git, dist, build, etc.)
- Avoids scanning thousands of dependency files
- Indexes both full module paths and short names for ambiguity detection
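
A minimal sketch of directory skipping during the walk, assuming the standard library's `filepath.WalkDir`. The `skipDirs` set holds only the five names mentioned above (the full list is not reproduced here), and `shouldSkip` and `pythonFiles` are illustrative names rather than the PR's functions.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// skipDirs contains only the directory names named in the commit message.
var skipDirs = map[string]bool{
	"venv": true, "__pycache__": true, ".git": true, "dist": true, "build": true,
}

// shouldSkip reports whether a directory name is a known non-source directory.
func shouldSkip(name string) bool { return skipDirs[name] }

// pythonFiles walks root and returns every .py file outside skipped directories.
func pythonFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if shouldSkip(d.Name()) {
				return fs.SkipDir // prune the whole subtree
			}
			return nil
		}
		if strings.HasSuffix(d.Name(), ".py") {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	files, err := pythonFiles(".")
	if err != nil {
		fmt.Println("walk error:", err)
		return
	}
	fmt.Println(len(files), "Python files found")
}
```

Returning `fs.SkipDir` prunes the subtree before any of its entries are visited, which is what avoids scanning dependency files under directories such as `venv`.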

Test Coverage: 93%
- Comprehensive unit tests for all conversion scenarios
- Integration tests with real Python project structure
- Edge case handling: empty dirs, non-Python files, deep nesting, permissions
- Error path testing: walk errors, invalid paths, system errors
- Test fixtures: test-src/python/simple_project/ with realistic structure
- Documented: Remaining 7% are untestable OS-level errors (filepath.Abs failures)

This establishes Pass 1 of 3:
- ✅ Pass 1: Module registry (this PR)
- Next: Pass 2 - Import extraction and resolution
- Next: Pass 3 - Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm
Base Branch: shiva/callgraph-infra-1 (PR #1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This PR implements comprehensive import extraction for Python code using
tree-sitter AST parsing. It handles all three main import styles:

1. Simple imports: `import module`
2. From imports: `from module import name`
3. Aliased imports: `import module as alias` and `from module import name as alias`

The implementation uses direct AST traversal instead of tree-sitter queries
for better compatibility and control. It properly handles:
- Multiple imports per line (`from json import dumps, loads`)
- Nested module paths (`import xml.etree.ElementTree`)
- Whitespace variations
- Invalid/malformed syntax (fault-tolerant parsing)

Key functions:
- ExtractImports(): Main entry point that parses code and builds ImportMap
- traverseForImports(): Recursively traverses AST to find import statements
- processImportStatement(): Handles simple and aliased imports
- processImportFromStatement(): Handles from-import statements with proper
  module name skipping to avoid duplicate entries
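
To make the result of extraction concrete, the sketch below shows what an import map could hold for each import style. It skips the tree-sitter traversal entirely; the `Imports` field and the `Add` and `Resolve` methods are assumptions for illustration and may differ from the real `ImportMap`.

```go
package main

import "fmt"

// ImportMap is a simplified, hypothetical per-file map from the local name
// used in code to the fully qualified name it refers to.
type ImportMap struct {
	File    string
	Imports map[string]string
}

func NewImportMap(file string) *ImportMap {
	return &ImportMap{File: file, Imports: make(map[string]string)}
}

// Add records that local refers to fqn within this file.
func (m *ImportMap) Add(local, fqn string) { m.Imports[local] = fqn }

// Resolve looks up the FQN behind a local name.
func (m *ImportMap) Resolve(local string) (string, bool) {
	fqn, ok := m.Imports[local]
	return fqn, ok
}

func main() {
	m := NewImportMap("myapp/views.py")
	m.Add("os", "os")                     // import os
	m.Add("dumps", "json.dumps")          // from json import dumps, loads
	m.Add("loads", "json.loads")          //   (one entry per imported name)
	m.Add("ET", "xml.etree.ElementTree")  // import xml.etree.ElementTree as ET
	m.Add("to_json", "json.dumps")        // from json import dumps as to_json

	for _, name := range []string{"os", "loads", "ET", "to_json", "missing"} {
		fqn, ok := m.Resolve(name)
		fmt.Printf("%-8s -> %q (found=%v)\n", name, fqn, ok)
	}
}
```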

Test coverage: 92.8% overall, 90-95% for import extraction functions

Test fixtures include:
- simple_imports.py: Basic import statements
- from_imports.py: From import statements with multiple names
- aliased_imports.py: Aliased imports (both simple and from)
- mixed_imports.py: Mixed import styles

All tests passing, linting clean, builds successfully.

This is Pass 2 Part A of the 3-pass call graph algorithm.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This PR implements comprehensive relative import resolution for Python as part
of the 3-pass algorithm. It extends the import extraction system from PR #3 to handle
Python's relative import syntax with dot notation.

Key Changes:

1. **Added FileToModule reverse mapping to ModuleRegistry**
   - Enables O(1) lookup from file path to module path
   - Required for resolving relative imports
   - Updated AddModule() to maintain bidirectional mapping

2. **Implemented resolveRelativeImport() function**
   - Handles single dot (.) for current package
   - Handles multiple dots (.., ...) for parent/grandparent packages
   - Navigates package hierarchy using module path components
   - Clamps excessive dots to root package level
   - Falls back gracefully when file not in registry

3. **Enhanced processImportFromStatement() for relative imports**
   - Detects relative_import nodes in tree-sitter AST
   - Extracts import_prefix (dots) and optional module suffix
   - Resolves relative paths to absolute module paths before adding to ImportMap

4. **Comprehensive test coverage (94.5% overall)**
   - Unit tests for resolveRelativeImport with various dot counts
   - Integration tests with ExtractImports
   - Tests for deeply nested packages
   - Tests for mixed absolute and relative imports
   - Real fixture files with project structure

Relative Import Examples:
- `from . import utils` → "currentpackage.utils"
- `from .. import config` → "parentpackage.config"
- `from ..utils import helper` → "parentpackage.utils.helper"
- `from ...db import query` → "grandparent.db.query"
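
The dot-counting logic can be sketched as below. This is an approximation under stated assumptions: `resolveRelative` is a hypothetical name, the current file is treated as a regular module (not an `__init__.py`), and "clamping" is interpreted as never climbing above the top-level package.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveRelative turns a relative import (dots plus optional module suffix)
// into an absolute module path, given the importing file's module path.
func resolveRelative(currentModule string, dots int, suffix string) string {
	if dots < 1 {
		return suffix // not a relative import
	}
	parts := strings.Split(currentModule, ".")
	pkg := parts[:len(parts)-1] // drop the module's own name to get its package

	up := dots - 1 // one dot means "current package", each extra dot climbs one level
	if maxUp := len(pkg) - 1; up > maxUp {
		up = maxUp // clamp: assumed to stop at the root package
	}
	if up < 0 {
		up = 0
	}

	base := append([]string{}, pkg[:len(pkg)-up]...)
	if suffix != "" {
		base = append(base, suffix)
	}
	return strings.Join(base, ".")
}

func main() {
	cur := "myapp.submodule.handler"
	fmt.Println(resolveRelative(cur, 1, ""))      // from . import utils
	fmt.Println(resolveRelative(cur, 2, ""))      // from .. import config
	fmt.Println(resolveRelative(cur, 2, "utils")) // from ..utils import helper
	fmt.Println(resolveRelative(cur, 5, "db"))    // more dots than package levels
}
```

The imported name (for example `utils` in `from . import utils`) would then be appended to the returned module path to form the import map entry.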

Test Fixtures:
- Created myapp/submodule/handler.py with all relative import styles
- Created supporting package structure with __init__.py files
- Tests verify correct resolution across package hierarchy

All tests passing, linting clean, builds successfully.

This is Pass 2 Part B of the 3-pass call graph algorithm.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This PR implements call site extraction from Python source code using
tree-sitter AST parsing. It builds on the import resolution work from
PRs #3 and #4 to prepare for call graph construction in Pass 3.

## Changes

### Core Implementation (callsites.go)

1. **ExtractCallSites()**: Main entry point for extracting call sites
   - Parses Python source with tree-sitter
   - Traverses AST to find all call expressions
   - Returns slice of CallSite objects with location information

2. **traverseForCalls()**: Recursive AST traversal
   - Tracks function context while traversing
   - Updates context when entering function definitions
   - Finds and processes call expressions

3. **processCallExpression()**: Call site processing
   - Extracts callee name (function/method being called)
   - Parses arguments (positional and keyword)
   - Creates CallSite with source location
   - Parameters for importMap and caller reserved for Pass 3

4. **extractCalleeName()**: Callee name extraction
   - Handles simple identifiers: foo()
   - Handles attributes: obj.method(), obj.attr.method()
   - Recursively builds dotted names

5. **extractArguments()**: Argument parsing
   - Extracts all positional arguments
   - Preserves keyword arguments as "name=value" in Value field
   - Tracks argument position and variable status

6. **convertArgumentsToSlice()**: Helper for struct conversion
   - Converts []*Argument to []Argument for CallSite struct
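
The recursive dotted-name construction in `extractCalleeName()` can be illustrated without tree-sitter. The `Node` type below is a hypothetical stand-in for an AST node, and `calleeName` is an illustrative name; the real code works on tree-sitter nodes and may treat other node kinds differently.

```go
package main

import "fmt"

// Node is a tiny, hypothetical stand-in for an AST node.
type Node struct {
	Kind   string // "identifier" or "attribute"
	Name   string // identifier text, or the attribute name after the dot
	Object *Node  // for attributes: the expression before the dot
}

// calleeName builds a dotted name by recursing into the object of each
// attribute access. Unknown node kinds yield an empty string.
func calleeName(n *Node) string {
	if n == nil {
		return ""
	}
	switch n.Kind {
	case "identifier":
		return n.Name
	case "attribute":
		base := calleeName(n.Object)
		if base == "" {
			return n.Name
		}
		return base + "." + n.Name
	default:
		return ""
	}
}

func main() {
	foo := &Node{Kind: "identifier", Name: "foo"}
	obj := &Node{Kind: "identifier", Name: "obj"}
	attr := &Node{Kind: "attribute", Name: "attr", Object: obj}
	method := &Node{Kind: "attribute", Name: "method", Object: attr}

	fmt.Println(calleeName(foo))    // callee of foo()
	fmt.Println(calleeName(method)) // callee of obj.attr.method()
}
```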

### Comprehensive Tests (callsites_test.go)

Created 17 test functions covering:
- Simple function calls: foo(), bar()
- Method calls: obj.method(), self.helper()
- Arguments: positional, keyword, mixed
- Nested calls: foo(bar(x))
- Multiple functions in one file
- Class methods
- Chained calls: obj.method1().method2()
- Module-level calls (no function context)
- Source location tracking
- Empty files
- Complex arguments: expressions, lists, dicts, lambdas
- Nested method calls: obj.attr.method()
- Real file fixture integration

### Test Fixture (simple_calls.py)

Created realistic test file with:
- Function definitions with various call patterns
- Method calls on objects
- Calls with arguments (positional and keyword)
- Nested calls
- Class methods with self references

## Test Coverage

- Overall: 93.3%
- ExtractCallSites: 90.0%
- traverseForCalls: 93.3%
- processCallExpression: 83.3%
- extractCalleeName: 91.7%
- extractArguments: 87.5%
- convertArgumentsToSlice: 100.0%

## Design Decisions

1. **Keyword argument handling**: Store as "name=value" in Value field
   - Tree-sitter provides full keyword_argument node content
   - Preserves complete argument information for later analysis
   - Separating name/value would require additional parsing

2. **Caller context tracking**: Parameter reserved but not used yet
   - Will be populated in Pass 3 during call graph construction
   - Enables linking call sites to their containing functions

3. **Import map parameter**: Reserved for Pass 3 resolution
   - Will be used to resolve qualified names to FQNs
   - Enables cross-file call graph construction

4. **Location tracking**: Store exact position for each call site
   - File, line, column information
   - Enables precise error reporting and code navigation
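
Design decision 1 notes that separating a keyword argument's name from its value would need additional parsing. A minimal sketch of what such a later step might look like is shown below. `splitKeyword` and `isIdentifier` are hypothetical helpers, the identifier check is ASCII-only, and the approach is a text heuristic rather than a real parser.

```go
package main

import (
	"fmt"
	"strings"
)

// isIdentifier reports whether s looks like an ASCII Python identifier.
func isIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		isLetter := r == '_' || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
		isDigit := r >= '0' && r <= '9'
		if !isLetter && !(isDigit && i > 0) {
			return false
		}
	}
	return true
}

// splitKeyword splits a stored "name=value" argument into its parts.
// It returns ok=false for positional arguments, including expressions such
// as "a == b" that merely contain an equals sign.
func splitKeyword(value string) (name, val string, ok bool) {
	before, after, found := strings.Cut(value, "=")
	name = strings.TrimSpace(before)
	if !found || strings.HasPrefix(after, "=") || !isIdentifier(name) {
		return "", value, false
	}
	return name, strings.TrimSpace(after), true
}

func main() {
	for _, arg := range []string{"timeout=30", "x", "a == b", "data=build(k=1)"} {
		name, val, ok := splitKeyword(arg)
		fmt.Printf("%-18q keyword=%-5v name=%q value=%q\n", arg, ok, name, val)
	}
}
```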

## Testing Strategy

- Unit tests for each extraction function
- Integration tests with tree-sitter AST
- Real file fixture for end-to-end validation
- Edge cases: empty files, no context, nested structures

## Next Steps (PR #6)

Pass 3 will use this call site data to:
1. Build the complete call graph structure
2. Resolve call targets to function definitions
3. Link caller and callee through edges
4. Handle disambiguation for overloaded names

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This PR completes the 3-pass algorithm for building Python call graphs
by implementing the final pass that resolves call targets and constructs
the complete graph structure with edges linking callers to callees.

## Changes

### Core Implementation (builder.go)

1. **BuildCallGraph()**: Main entry point for Pass 3
   - Indexes all function definitions from code graph
   - Iterates through all Python files in the registry
   - Extracts imports and call sites for each file
   - Resolves each call site to its target function
   - Builds edges and stores call site details
   - Returns complete CallGraph with all relationships

2. **indexFunctions()**: Function indexing
   - Scans code graph for all function/method definitions
   - Maps each function to its FQN using module registry
   - Populates CallGraph.Functions map for quick lookup

3. **getFunctionsInFile()**: File-scoped function retrieval
   - Filters code graph nodes by file path
   - Returns only function/method definitions in that file
   - Used for finding containing functions of call sites

4. **findContainingFunction()**: Call site parent resolution
   - Determines which function contains a given call site
   - Uses line number comparison with nearest-match algorithm
   - Finds function with highest line number ≤ call line
   - Returns empty string for module-level calls

5. **resolveCallTarget()**: Core resolution logic
   - Handles simple names: sanitize() → myapp.utils.sanitize
   - Handles qualified names: utils.sanitize() → myapp.utils.sanitize
   - Resolves through import maps first
   - Falls back to same-module resolution
   - Validates FQNs against module registry
   - Returns (FQN, resolved bool) tuple

6. **validateFQN()**: FQN validation
   - Checks if a fully qualified name exists in registry
   - Handles both modules and functions within modules
   - Validates parent module for function FQNs

7. **readFileBytes()**: File reading helper
   - Reads source files for parsing
   - Handles absolute path conversion
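
The nearest-match rule described for `findContainingFunction()` (highest start line ≤ call line) can be sketched as follows. The `Function` type and `containingFunction` name are assumptions for this example, and line numbers are assumed to be 1-based.

```go
package main

import "fmt"

// Function is a hypothetical, minimal record of a function definition.
type Function struct {
	FQN  string
	Line int // line on which the definition starts (1-based)
}

// containingFunction returns the FQN of the function whose start line is the
// highest one not exceeding callLine, or "" if no function starts at or
// before the call.
func containingFunction(callLine int, funcs []Function) string {
	best := ""
	bestLine := 0
	for _, f := range funcs {
		if f.Line <= callLine && f.Line > bestLine {
			best, bestLine = f.FQN, f.Line
		}
	}
	return best
}

func main() {
	funcs := []Function{
		{FQN: "myapp.views.index", Line: 5},
		{FQN: "myapp.views.detail", Line: 20},
	}
	fmt.Printf("%q\n", containingFunction(2, funcs))  // before any function
	fmt.Printf("%q\n", containingFunction(12, funcs)) // inside index
	fmt.Printf("%q\n", containingFunction(25, funcs)) // inside detail
}
```

Under this rule alone, a module-level call that appears after a function definition would be attributed to that function unless end lines or indentation are also checked; the description here does not say whether the real implementation does so.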

### Comprehensive Tests (builder_test.go)

Created 15 test functions covering:

**Resolution Tests:**
- Simple imported function resolution
- Qualified import resolution (module.function)
- Same-module function resolution
- Unresolved method calls (obj.method)
- Non-existent function handling

**Validation Tests:**
- Module existence validation
- Function-in-module validation
- Non-existent module handling

**Helper Function Tests:**
- Function indexing from code graph
- Functions-in-file filtering
- Containing function detection with edge cases

**Integration Tests:**
- Simple single-file call graph
- Multi-file call graph with imports
- Real test fixture integration

## Test Coverage

- Overall: 91.8%
- BuildCallGraph: 80.8%
- indexFunctions: 87.5%
- getFunctionsInFile: 100.0%
- findContainingFunction: 100.0%
- resolveCallTarget: 85.0%
- validateFQN: 100.0%
- readFileBytes: 75.0%

## Algorithm Overview

Pass 3 ties together all previous work:

### Pass 1 (PR #2): BuildModuleRegistry
- Maps file paths to module paths
- Enables FQN generation

### Pass 2 (PRs #3-5): Import & Call Site Extraction
- ExtractImports: Maps local names to FQNs
- ExtractCallSites: Finds all function calls in AST

### Pass 3 (This PR): Call Graph Construction
- Resolves call targets using import maps
- Links callers to callees with edges
- Validates resolutions against registry
- Stores detailed call site information

## Resolution Strategy

The resolver uses a multi-step approach:

1. **Simple names** (no dots):
   - Check import map first
   - Fall back to same-module lookup
   - Return unresolved if neither works

2. **Qualified names** (with dots):
   - Split into base + rest
   - Resolve base through imports
   - Append rest to get full FQN
   - Try current module if not imported

3. **Validation**:
   - Check if target exists in registry
   - For functions, validate parent module exists
   - Mark resolution success/failure
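
A compact sketch of the strategy above, under several assumptions: the `resolver` type, its fields, and the method names are invented for illustration; import-based results are validated against a set of known modules (the module itself or its parent); and the same-module fallback is checked against a set of known function FQNs. The real `resolveCallTarget()` and `validateFQN()` may differ in all of these details.

```go
package main

import (
	"fmt"
	"strings"
)

// resolver bundles the inputs a resolution step needs (hypothetical type).
type resolver struct {
	imports   map[string]string // local name -> FQN, from the file's import map
	module    string            // module path of the file being analyzed
	modules   map[string]bool   // module paths known to the registry
	functions map[string]bool   // function FQNs indexed from the code graph
}

// validate accepts an FQN that names a known module, or whose parent
// (everything before the last dot) is a known module.
func (r resolver) validate(fqn string) bool {
	if r.modules[fqn] {
		return true
	}
	i := strings.LastIndex(fqn, ".")
	return i > 0 && r.modules[fqn[:i]]
}

// resolve maps a call target such as "sanitize" or "utils.sanitize" to an FQN
// and reports whether that FQN could be validated.
func (r resolver) resolve(target string) (string, bool) {
	base, rest, qualified := strings.Cut(target, ".")
	if imported, ok := r.imports[base]; ok {
		fqn := imported
		if qualified {
			fqn += "." + rest // append the part after the imported base
		}
		return fqn, r.validate(fqn) // imports win even if validation fails
	}
	fqn := r.module + "." + target // same-module fallback
	return fqn, r.functions[fqn]
}

func main() {
	r := resolver{
		imports: map[string]string{
			"sanitize": "myapp.utils.sanitize",
			"utils":    "myapp.utils",
		},
		module:    "myapp.views",
		modules:   map[string]bool{"myapp.utils": true, "myapp.views": true},
		functions: map[string]bool{"myapp.views.helper": true},
	}
	for _, target := range []string{"sanitize", "utils.sanitize", "helper", "obj.method"} {
		fqn, ok := r.resolve(target)
		fmt.Printf("%-16s -> %-26s resolved=%v\n", target, fqn, ok)
	}
}
```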

## Design Decisions

1. **Containing function detection**:
   - Uses nearest-match algorithm based on line numbers
   - Finds function with highest line number ≤ call line
   - Handles module-level calls by returning empty FQN

2. **Resolution priority**:
   - Import map takes precedence over same-module
   - Explicit imports always respected even if unresolved
   - Same-module only tried when not in imports

3. **Validation vs Resolution**:
   - Resolution finds FQN from imports/context
   - Validation checks if FQN exists in registry
   - Both pieces of information stored in CallSite

4. **Error handling**:
   - Continues processing even if some files fail
   - Marks individual call sites as unresolved
   - Returns partial graph instead of failing completely
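
The error-handling decision (keep going when a file fails, return a partial result) can be sketched as below. `buildPartial`, its `extract` callback, and `errUnreadable` are hypothetical names for this example; the real `BuildCallGraph()` reads and parses files itself and may report failures differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreadable is an illustrative error used by the example extractor.
var errUnreadable = errors.New("unreadable file")

// buildPartial runs extract on every file. Files that fail are recorded and
// skipped, so one bad file does not prevent results for the rest.
func buildPartial(files []string, extract func(path string) ([]string, error)) (calls map[string][]string, failed []string) {
	calls = make(map[string][]string)
	for _, f := range files {
		sites, err := extract(f)
		if err != nil {
			failed = append(failed, f)
			continue // keep processing the remaining files
		}
		calls[f] = sites
	}
	return calls, failed
}

func main() {
	extract := func(path string) ([]string, error) {
		if path == "broken.py" {
			return nil, errUnreadable
		}
		return []string{"sanitize", "helper"}, nil
	}
	calls, failed := buildPartial([]string{"views.py", "broken.py", "api.py"}, extract)
	fmt.Println("files with call sites:", len(calls))
	fmt.Println("failed files:", failed)
}
```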

## Next Steps

The call graph infrastructure is now complete. Future PRs will:

- PR #7: Add CFG data structures for control flow analysis
- PR #8: Implement pattern matching for security rules
- PR #9: Integrate into main initialization pipeline
- PR #10: Add comprehensive documentation and examples
- PR #11: Performance optimizations (caching, pooling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@shivasurya shivasurya self-assigned this Oct 26, 2025
@safedep

safedep bot commented Oct 26, 2025

SafeDep Report Summary

No dependency changes detected. Nothing to scan.

@shivasurya shivasurya added enhancement New feature or request go Pull requests that update go code labels Oct 26, 2025
@codecov

codecov bot commented Oct 26, 2025

Codecov Report

❌ Patch coverage is 79.77528% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.86%. Comparing base (5bc564e) to head (f5af0cc).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sourcecode-parser/graph/callgraph/builder.go | 79.77% | 10 Missing and 8 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #327      +/-   ##
==========================================
+ Coverage   74.70%   74.86%   +0.15%     
==========================================
  Files          27       28       +1     
  Lines        2799     2888      +89     
==========================================
+ Hits         2091     2162      +71     
- Misses        653      663      +10     
- Partials       55       63       +8     

Base automatically changed from shiva/callgraph-infra-5 to main October 29, 2025 02:28
@shivasurya shivasurya merged commit 3d65fad into main Oct 29, 2025
5 checks passed
@shivasurya shivasurya deleted the shiva/callgraph-infra-6 branch October 29, 2025 02:30