
Commit 5454e99

shivasurya and claude authored
cpf/enhancement: Implement import extraction with tree-sitter (#324)
* feat: Add core data structures for call graph (PR #1)

Add foundational data structures for Python call graph construction:

New Types:
- CallSite: Represents function call locations with arguments and resolution status
- CallGraph: Maps functions to callees with forward/reverse edges
- ModuleRegistry: Maps Python file paths to module paths
- ImportMap: Tracks imports per file for name resolution
- Location: Source code position tracking
- Argument: Function call argument metadata

Features:
- 100% test coverage with comprehensive unit tests
- Bidirectional call graph edges (forward and reverse)
- Support for ambiguous short names in module registry
- Helper functions for module path manipulation

This establishes the foundation for 3-pass call graph algorithm:
- Pass 1 (next PR): Module registry builder
- Pass 2 (next PR): Import extraction and resolution
- Pass 3 (next PR): Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Implement module registry - Pass 1 of 3-pass algorithm (PR #2)

Implement the first pass of the call graph construction algorithm: building a complete registry of Python modules by walking the directory tree.

New Features:
- BuildModuleRegistry: Walks directory tree and maps file paths to module paths
- convertToModulePath: Converts file system paths to Python import paths
- shouldSkipDirectory: Filters out venv, __pycache__, build dirs, etc.

Module Path Conversion:
- Handles regular files: myapp/views.py → myapp.views
- Handles packages: myapp/utils/__init__.py → myapp.utils
- Supports deep nesting: myapp/api/v1/endpoints/users.py → myapp.api.v1.endpoints.users
- Cross-platform: Normalizes Windows/Unix path separators

Performance Optimizations:
- Skips 15+ common non-source directories (venv, __pycache__, .git, dist, build, etc.)
- Avoids scanning thousands of dependency files
- Indexes both full module paths and short names for ambiguity detection

Test Coverage: 93%
- Comprehensive unit tests for all conversion scenarios
- Integration tests with real Python project structure
- Edge case handling: empty dirs, non-Python files, deep nesting, permissions
- Error path testing: walk errors, invalid paths, system errors
- Test fixtures: test-src/python/simple_project/ with realistic structure
- Documented: Remaining 7% are untestable OS-level errors (filepath.Abs failures)

This establishes Pass 1 of 3:
- ✅ Pass 1: Module registry (this PR)
- Next: Pass 2 - Import extraction and resolution
- Next: Pass 3 - Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm
Base Branch: shiva/callgraph-infra-1 (PR #1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Implement import extraction with tree-sitter - Pass 2 Part A

This PR implements comprehensive import extraction for Python code using tree-sitter AST parsing. It handles all three main import styles:

1. Simple imports: `import module`
2. From imports: `from module import name`
3. Aliased imports: `import module as alias` and `from module import name as alias`

The implementation uses direct AST traversal instead of tree-sitter queries for better compatibility and control. It properly handles:
- Multiple imports per line (`from json import dumps, loads`)
- Nested module paths (`import xml.etree.ElementTree`)
- Whitespace variations
- Invalid/malformed syntax (fault-tolerant parsing)

Key functions:
- ExtractImports(): Main entry point that parses code and builds ImportMap
- traverseForImports(): Recursively traverses AST to find import statements
- processImportStatement(): Handles simple and aliased imports
- processImportFromStatement(): Handles from-import statements with proper module name skipping to avoid duplicate entries

Test coverage: 92.8% overall, 90-95% for import extraction functions

Test fixtures include:
- simple_imports.py: Basic import statements
- from_imports.py: From import statements with multiple names
- aliased_imports.py: Aliased imports (both simple and from)
- mixed_imports.py: Mixed import styles

All tests passing, linting clean, builds successfully.

This is Pass 2 Part A of the 3-pass call graph algorithm.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
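The "Module Path Conversion" rules listed in the commit message can be illustrated with a small sketch. This is not the commit's `convertToModulePath` (its source is not part of this diff); `toModulePath` is a hypothetical helper that assumes the input path is already relative to the project root.

```go
package main

import (
	"fmt"
	"strings"
)

// toModulePath is a hypothetical illustration of the conversion rules in the
// commit message. It takes a path relative to the project root and returns
// the dotted Python module path, or false if the path is not a module.
func toModulePath(relPath string) (string, bool) {
	// Normalize Windows separators so the same logic works on every platform.
	p := strings.ReplaceAll(relPath, "\\", "/")
	if !strings.HasSuffix(p, ".py") {
		return "", false
	}
	p = strings.TrimSuffix(p, ".py")
	// A root-level __init__.py has no module name of its own in this sketch.
	if p == "__init__" {
		return "", false
	}
	// A package's __init__.py maps to the package itself.
	p = strings.TrimSuffix(p, "/__init__")
	return strings.ReplaceAll(p, "/", "."), true
}

func main() {
	for _, p := range []string{
		"myapp/views.py",
		"myapp/utils/__init__.py",
		"myapp/api/v1/endpoints/users.py",
	} {
		mod, _ := toModulePath(p)
		fmt.Println(p, "→", mod)
	}
}
```

The three paths in `main` are the examples from the commit message; the sketch maps them to `myapp.views`, `myapp.utils`, and `myapp.api.v1.endpoints.users` respectively.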
1 parent 4e21322 commit 5454e99

6 files changed

+577
-0
lines changed

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
package callgraph

import (
	"context"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/python"
)

// ExtractImports extracts all import statements from a Python file and builds an ImportMap.
// It handles three main import styles:
//  1. Simple imports: import module
//  2. From imports: from module import name
//  3. Aliased imports: import module as alias, from module import name as alias
//
// The resulting ImportMap maps local names (aliases or imported names) to their
// fully qualified module paths, enabling later resolution of function calls.
//
// Algorithm:
//  1. Parse source code with tree-sitter Python parser
//  2. Recursively traverse the AST to find all import statements
//  3. Process each import statement to extract module paths and aliases
//  4. Build ImportMap mapping local names to fully qualified names
//
// Parameters:
//   - filePath: absolute path to the Python file being analyzed
//   - sourceCode: contents of the Python file as a byte slice
//   - registry: module registry for resolving module paths (not yet used by this function)
//
// Returns:
//   - ImportMap: map of local names to fully qualified module paths
//   - error: if the tree-sitter parse call fails
//
// Example:
//
//	Source code:
//	    import os
//	    from myapp.utils import sanitize
//	    from myapp.db import query as db_query
//
//	Result ImportMap:
//	    {
//	        "os": "os",
//	        "sanitize": "myapp.utils.sanitize",
//	        "db_query": "myapp.db.query"
//	    }
func ExtractImports(filePath string, sourceCode []byte, registry *ModuleRegistry) (*ImportMap, error) {
	importMap := NewImportMap(filePath)

	// Parse with tree-sitter
	parser := sitter.NewParser()
	parser.SetLanguage(python.GetLanguage())
	defer parser.Close()

	tree, err := parser.ParseCtx(context.Background(), nil, sourceCode)
	if err != nil {
		return nil, err
	}
	defer tree.Close()

	// Traverse AST to find import statements
	traverseForImports(tree.RootNode(), sourceCode, importMap)

	return importMap, nil
}

// traverseForImports recursively traverses the AST to find import statements.
// Uses direct AST traversal instead of queries for better compatibility.
func traverseForImports(node *sitter.Node, sourceCode []byte, importMap *ImportMap) {
	if node == nil {
		return
	}

	nodeType := node.Type()

	// Process import statements
	switch nodeType {
	case "import_statement":
		processImportStatement(node, sourceCode, importMap)
		// Don't recurse into children - we've already processed this import
		return
	case "import_from_statement":
		processImportFromStatement(node, sourceCode, importMap)
		// Don't recurse into children - we've already processed this import
		return
	}

	// Recursively process children
	for i := 0; i < int(node.ChildCount()); i++ {
		child := node.Child(i)
		traverseForImports(child, sourceCode, importMap)
	}
}

// processImportStatement handles simple import statements: import module [as alias].
// Examples:
//   - import os → "os" = "os"
//   - import os as op → "op" = "os"
//   - import os, sys → "os" = "os", "sys" = "sys"
func processImportStatement(node *sitter.Node, sourceCode []byte, importMap *ImportMap) {
	// One statement can import several modules (import os, sys), and
	// ChildByFieldName("name") returns only the first of them, so inspect
	// every named child instead.
	for i := 0; i < int(node.NamedChildCount()); i++ {
		nameNode := node.NamedChild(i)
		if nameNode == nil {
			continue
		}

		switch nameNode.Type() {
		case "aliased_import":
			// import module as alias
			moduleNode := nameNode.ChildByFieldName("name")
			aliasNode := nameNode.ChildByFieldName("alias")

			if moduleNode != nil && aliasNode != nil {
				moduleName := moduleNode.Content(sourceCode)
				aliasName := aliasNode.Content(sourceCode)
				importMap.AddImport(aliasName, moduleName)
			}
		case "dotted_name":
			// Simple import: import module
			moduleName := nameNode.Content(sourceCode)
			importMap.AddImport(moduleName, moduleName)
		}
	}
}

// processImportFromStatement handles from-import statements: from module import name [as alias].
// Examples:
//   - from os import path → "path" = "os.path"
//   - from os import path as ospath → "ospath" = "os.path"
//   - from json import dumps, loads → "dumps" = "json.dumps", "loads" = "json.loads"
func processImportFromStatement(node *sitter.Node, sourceCode []byte, importMap *ImportMap) {
	// Get the module being imported from
	moduleNameNode := node.ChildByFieldName("module_name")
	if moduleNameNode == nil {
		return
	}

	moduleName := moduleNameNode.Content(sourceCode)

	// Each imported name might be:
	//  1. A dotted_name: from os import path
	//  2. An aliased_import: from os import path as ospath
	//  3. A wildcard_import: from os import *
	//     (matches neither branch below, so it is skipped)
	//
	// For multiple imports (from json import dumps, loads), tree-sitter
	// creates multiple child nodes, so we need to check all children
	for i := 0; i < int(node.ChildCount()); i++ {
		child := node.Child(i)

		// Skip the module_name node itself - we only want the imported names
		if child == moduleNameNode {
			continue
		}

		// Process each import name/alias
		if child.Type() == "aliased_import" {
			// from module import name as alias
			importNameNode := child.ChildByFieldName("name")
			aliasNode := child.ChildByFieldName("alias")

			if importNameNode != nil && aliasNode != nil {
				importName := importNameNode.Content(sourceCode)
				aliasName := aliasNode.Content(sourceCode)
				fqn := moduleName + "." + importName
				importMap.AddImport(aliasName, fqn)
			}
		} else if child.Type() == "dotted_name" || child.Type() == "identifier" {
			// from module import name
			importName := child.Content(sourceCode)
			fqn := moduleName + "." + importName
			importMap.AddImport(importName, fqn)
		}
	}
}
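
The doc comment on `ExtractImports` says the map of local names to fully qualified paths enables "later resolution of function calls". A minimal sketch of what that lookup could look like follows. The `ImportMap` type itself is not part of this diff, so `importTable` and its `resolve` method are hypothetical stand-ins built on a plain Go map, not the project's API.

```go
package main

import (
	"fmt"
	"strings"
)

// importTable is a hypothetical stand-in for ImportMap: it maps a local
// name (or alias) to the fully qualified path recorded at import time.
type importTable map[string]string

// resolve expands the longest imported prefix of a dotted name.
// For example, with "ospath" → "os.path", the call target "ospath.join"
// resolves to "os.path.join".
func (t importTable) resolve(name string) (string, bool) {
	parts := strings.Split(name, ".")
	// Try the longest prefix first so "xml.etree.ElementTree.parse"
	// matches an entry for "xml.etree.ElementTree" before one for "xml".
	for i := len(parts); i > 0; i-- {
		prefix := strings.Join(parts[:i], ".")
		fqn, ok := t[prefix]
		if !ok {
			continue
		}
		if i == len(parts) {
			return fqn, true
		}
		return fqn + "." + strings.Join(parts[i:], "."), true
	}
	return "", false
}

func main() {
	// Entries mirror the examples in the doc comments above.
	imports := importTable{
		"os":       "os",
		"sanitize": "myapp.utils.sanitize",
		"db_query": "myapp.db.query",
		"ospath":   "os.path",
	}
	for _, call := range []string{"db_query", "ospath.join", "sanitize"} {
		fqn, _ := imports.resolve(call)
		fmt.Println(call, "→", fqn)
	}
}
```

Longest-prefix matching is one possible design choice here; it keeps dotted imports such as `import xml.etree.ElementTree` usable as lookup keys without splitting them at import time.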
