A Node.js tool that prepares large codebases for analysis by Large Language Models (LLMs) such as Gemini Pro, Claude, or GPT-4. It combines all project files and folders into a single, well-structured Markdown document that can be fed into LLMs with large context windows (1M+ tokens).
When working with complex projects that span multiple files and directories, it becomes challenging to provide complete context to LLMs for code analysis, documentation, or refactoring tasks. This tool solves that problem by:
- Consolidating all project files into a single Markdown document
- Preserving the original directory structure for context
- Filtering unnecessary files to optimize token usage
- Formatting code with proper syntax highlighting
- Supporting intelligent LLM-based file filtering
- Simple extraction of all project files
- Directory structure visualization
- Comment removal option
- Automatic exclusion of common build artifacts
- Markdown formatting with syntax highlighting
- AI-powered file filtering using OpenAI GPT models
- 5-level filtering aggressiveness (from minimal to very aggressive)
- Custom analysis focus for targeted extraction
- Context-aware processing (reads README, documentation)
- Intelligent project naming suggestions
- Command-line interface with flexible options
- Clone or download the files to your project directory
- Install dependencies (for the advanced version):

```bash
npm install
```

The simple version requires minimal setup:

```bash
node contentExtractor.js
```

Configuration:
- Edit the `projectBasePath` variable in the script (default: `./files_to_extract/`)
- Set `shouldRemoveComments` to `true` if you want to strip code comments
- Modify the `EXCLUDE_PATTERNS` array to customize file exclusions
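As a rough illustration of the three settings above, the top of `contentExtractor.js` might look like the sketch below. The variable names come from this README; the example values (other than the default path) and the `stripComments` helper are assumptions, not the script's actual code.

```javascript
// Settings named in this README; the values here are illustrative.
const projectBasePath = './files_to_extract/'; // documented default
const shouldRemoveComments = true;             // strip comments from extracted code
const EXCLUDE_PATTERNS = ['.DS_Store', 'node_modules', 'dist'];

// Hypothetical helper: a deliberately naive comment stripper.
// It removes /* ... */ blocks and lines that contain only a // comment,
// so trailing // comments stay and URLs such as 'http://example.com' survive.
// It does not parse the code, so a "/*" inside a string literal would
// still be treated as the start of a comment.
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '')  // block comments
    .replace(/^[ \t]*\/\/.*$/gm, '')   // whole-line // comments
    .replace(/\n{3,}/g, '\n\n');       // collapse the blank lines left behind
}

const sample = "// header\nconst url = 'http://example.com'; /* note */\n";
console.log(`Reading from ${projectBasePath}, skipping ${EXCLUDE_PATTERNS.join(', ')}`);
console.log(shouldRemoveComments ? stripComments(sample) : sample);
```

A regex-based approach like this is only suitable for a quick token reduction; a real parser would be needed to strip comments safely in every language.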
Set up the environment (for LLM filtering):

```bash
# Create .env file
echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
```

The advanced version offers command-line options and AI-powered filtering:

```bash
# Basic usage (with default settings)
node llmContentExtractor.js

# Remove comments from code
node llmContentExtractor.js --deleteComments

# Apply LLM filtering (level 1-5)
node llmContentExtractor.js --filterLevel 3

# Focus on specific functionality
node llmContentExtractor.js --focus "authentication and user management logic"

# Combine all options
node llmContentExtractor.js -d -f 4 -o "API gateway and microservices architecture"
```

Command Line Options:
| Option | Alias | Description | Default |
|---|---|---|---|
| `--deleteComments` | `-d` | Remove comments from code | `false` |
| `--filterLevel` | `-f` | LLM filtering level (0-5) | `2` |
| `--focus` | `-o` | Custom analysis focus prompt | `""` |
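The options in the table can be parsed with Node's built-in `util.parseArgs`. This is a minimal sketch assuming Node 18.11 or later; the README does not say how `llmContentExtractor.js` actually parses its arguments, and the `parseCliOptions` helper name is hypothetical.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical helper: option names, aliases, and defaults mirror the table above.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      deleteComments: { type: 'boolean', short: 'd', default: false },
      // parseArgs only knows 'boolean' and 'string', so the level is validated below.
      filterLevel: { type: 'string', short: 'f', default: '2' },
      focus: { type: 'string', short: 'o', default: '' },
    },
  });

  if (!/^[0-5]$/.test(values.filterLevel)) {
    throw new RangeError('--filterLevel must be an integer from 0 to 5');
  }

  return {
    deleteComments: values.deleteComments,
    filterLevel: Number(values.filterLevel),
    focus: values.focus,
  };
}

// Equivalent to: node llmContentExtractor.js -d -f 4 -o "API gateway"
console.log(parseCliOptions(['-d', '-f', '4', '-o', 'API gateway']));
```

In a real script the argument list would come from `process.argv.slice(2)`. Unknown flags cause `parseArgs` to throw, which surfaces typos early.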
Project Path Configuration:
- Edit the `HARDCODED_PROJECT_PATH` variable in the script (default: `./files_to_extract/`)
The advanced version supports 5 levels of AI-powered file filtering:
| Level | Name | Description |
|---|---|---|
| 0 | Disabled | No LLM filtering, only static exclusions |
| 1 | Minimal | Exclude build artifacts, system files, IDE configs |
| 2 | Light | + Configuration files (prettier, eslint, webpack, etc.) |
| 3 | Medium | + Documentation, large static data, basic tests |
| 4 | Aggressive | + Non-core utilities, styles, demo scripts |
| 5 | Very Aggressive | Only absolutely critical business logic files |
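One way to read the table is that levels 1 to 4 are cumulative (each `+` row adds to the rows above it), while level 5 replaces the exclusion list with a keep-only rule. The sketch below encodes that reading; the cumulative interpretation, the `FILTER_LEVELS` structure, and the `describeFilter` helper are assumptions, not the tool's actual implementation.

```javascript
// Level data copied from the table above (the leading "+" is dropped).
const FILTER_LEVELS = [
  { level: 0, name: 'Disabled', excludes: null },
  { level: 1, name: 'Minimal', excludes: 'build artifacts, system files, IDE configs' },
  { level: 2, name: 'Light', excludes: 'configuration files (prettier, eslint, webpack, etc.)' },
  { level: 3, name: 'Medium', excludes: 'documentation, large static data, basic tests' },
  { level: 4, name: 'Aggressive', excludes: 'non-core utilities, styles, demo scripts' },
  { level: 5, name: 'Very Aggressive', excludes: null },
];

// Hypothetical helper: turn a level into a one-line instruction that could
// be included in the prompt sent to the filtering model.
function describeFilter(level) {
  if (!Number.isInteger(level) || level < 0 || level > 5) {
    throw new RangeError('filter level must be an integer from 0 to 5');
  }
  if (level === 0) return 'No LLM filtering, only static exclusions.';
  if (level === 5) return 'Keep only absolutely critical business logic files.';
  // Levels 1-4: assume each level also applies every lower level's exclusions.
  const excluded = FILTER_LEVELS.slice(1, level + 1).map((entry) => entry.excludes);
  return `Exclude: ${excluded.join('; ')}.`;
}

console.log(describeFilter(3));
```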
```text
your-project/
├── contentExtractor.js       # Basic version
├── llmContentExtractor.js    # Advanced version
├── files_to_extract/         # Your project files go here
├── promts/                   # Generated output files
└── .env                      # Environment variables (for advanced)
```
The tool generates a comprehensive Markdown file with:
# Project Analysis Prompt
## User-Defined Analysis Focus
> Your custom focus prompt (if provided)
## Project Directory Structure
```text
-- src/
  -- components/
    -- Header.js
    -- Footer.js
  -- utils/
    -- helpers.js
-- package.json
-- README.md
```
```javascript
import React from 'react';
// Your code here...
```
```json
{
  "name": "your-project",
  // Your package.json content...
}
```
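The output layout above can be produced with a small amount of string assembly. This is a sketch under stated assumptions: the two-space tree indentation, the `### <path>` heading before each file, the extension-to-language map, and the `renderTree` and `buildMarkdown` helpers are all hypothetical and may differ from what the tool emits.

```javascript
const path = require('node:path');

const FENCE = '`'.repeat(3); // built at runtime to avoid a literal fence in this example
const LANG_BY_EXT = { '.js': 'javascript', '.json': 'json', '.md': 'markdown' };

// Render relative file paths as an indented "-- name" tree.
function renderTree(filePaths) {
  const lines = [];
  const seenDirs = new Set();
  for (const filePath of [...filePaths].sort()) {
    const parts = filePath.split('/');
    parts.forEach((part, depth) => {
      const isDir = depth < parts.length - 1;
      const key = parts.slice(0, depth + 1).join('/');
      if (isDir && seenDirs.has(key)) return; // directory already printed
      if (isDir) seenDirs.add(key);
      lines.push(`${'  '.repeat(depth)}-- ${part}${isDir ? '/' : ''}`);
    });
  }
  return lines.join('\n');
}

// Assemble the full document: title, optional focus, tree, then one section per file.
function buildMarkdown(files, focus = '') {
  const out = ['# Project Analysis Prompt', ''];
  if (focus) out.push('## User-Defined Analysis Focus', `> ${focus}`, '');
  out.push('## Project Directory Structure', `${FENCE}text`);
  out.push(renderTree(files.map((file) => file.path)), FENCE, '');
  for (const { path: filePath, content } of files) {
    const lang = LANG_BY_EXT[path.extname(filePath)] || 'text';
    out.push(`### ${filePath}`, `${FENCE}${lang}`, content, FENCE, '');
  }
  return out.join('\n');
}

console.log(buildMarkdown(
  [
    { path: 'src/components/Header.js', content: "import React from 'react';" },
    { path: 'package.json', content: '{ "name": "your-project" }' },
  ],
  'authentication and user management logic',
));
```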
## Configuration
### Static Exclusions
Both versions automatically exclude common files that aren't useful for analysis:
- `node_modules/`, `.git/`, `.DS_Store`
- `package-lock.json`, `yarn.lock`
- `.gitignore`, `.prettierrc`
- Build and cache directories
### Custom Exclusions
Add patterns to the `EXCLUDE_PATTERNS` array:
```javascript
const EXCLUDE_PATTERNS = [
'.DS_Store',
'node_modules',
'dist', // Add custom exclusions
'coverage', // Test coverage reports
'*.log' // Log files
];
```
```text
I have a complex Node.js project. Please analyze the codebase and:
1. Identify the main architecture patterns
2. Suggest improvements for scalability
3. Find potential security vulnerabilities

[Paste the generated Markdown here]
```

```text
Here's my complete project codebase in Markdown format.
Focus on: "microservices communication patterns"

[Paste the generated Markdown here]

Please provide recommendations for:
- Service discovery improvements
- API versioning strategy
- Error handling patterns
```
The tool automatically excludes these common patterns:
- Dependencies: `node_modules/`, `vendor/`
- Version Control: `.git/`, `.svn/`
- Build Artifacts: `dist/`, `build/`, `*.min.js`
- System Files: `.DS_Store`, `Thumbs.db`
- Lock Files: `package-lock.json`, `yarn.lock`, `composer.lock`
- IDE Files: `.vscode/`, `.idea/`, `*.swp`
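The list above mixes three kinds of pattern: directory names (trailing `/`), exact file names, and `*` wildcards. The sketch below shows one way such patterns could be matched against a relative file path. The matching rules and the `globToRegExp` and `isExcluded` helpers are assumptions for illustration; the script's own matching may be stricter (for example, exact string comparison).

```javascript
// A subset of the patterns listed above.
const EXCLUDE_PATTERNS = [
  'node_modules/', '.git/', 'dist/',
  '.DS_Store', 'package-lock.json',
  '*.min.js', '*.swp',
];

// Convert a simple "*" wildcard into an anchored regular expression.
function globToRegExp(glob) {
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*/g, '[^/]*');               // "*" matches within one path segment
  return new RegExp(`^${escaped}$`);
}

// Hypothetical helper: relativePath uses forward slashes, e.g. "src/app.js".
function isExcluded(relativePath, patterns = EXCLUDE_PATTERNS) {
  const segments = relativePath.split('/');
  const fileName = segments[segments.length - 1];
  return patterns.some((pattern) => {
    if (pattern.endsWith('/')) {
      // Directory pattern: matches when any parent directory has that name.
      return segments.slice(0, -1).includes(pattern.slice(0, -1));
    }
    // File pattern: compared against the file name only.
    return globToRegExp(pattern).test(fileName);
  });
}

console.log(isExcluded('node_modules/react/index.js')); // true
console.log(isExcluded('src/vendor.min.js'));           // true
console.log(isExcluded('src/app.js'));                  // false
```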
```bash
# Use aggressive filtering for large codebases
node llmContentExtractor.js -f 4 -d -o "core business logic only"

# Target specific functionality
node llmContentExtractor.js -f 2 -o "user authentication and authorization system"

# Keep documentation but remove implementation details
node llmContentExtractor.js -f 1 -o "API documentation and interface definitions"
```

"OpenAI API key not configured"
- Create a `.env` file with your OpenAI API key
- Or set `filterLevel` to 0 to disable LLM filtering

"Directory not found"
- Check the `HARDCODED_PROJECT_PATH` in the script
- Ensure the target directory exists

"Too many tokens"
- Increase the `filterLevel` to be more aggressive
- Use a custom `--focus` to target specific areas
- Enable `--deleteComments` to reduce token count

"Files not being excluded"
- Check that file paths match exactly in the exclusion list
- Use forward slashes (`/`) in path patterns
- Test with a smaller `filterLevel` first
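The "OpenAI API key not configured" case above can be reproduced with a small pre-flight check. This is a sketch only: the `assertLlmFilteringConfigured` helper is hypothetical, and it assumes the key has already been loaded from `.env` into the environment, which this README does not describe in detail.

```javascript
// Hypothetical pre-flight check: LLM filtering (levels 1-5) needs an API key,
// while level 0 uses static exclusions only and needs none.
function assertLlmFilteringConfigured(filterLevel, env = process.env) {
  if (filterLevel > 0 && !env.OPENAI_API_KEY) {
    throw new Error(
      'OpenAI API key not configured. Add OPENAI_API_KEY to .env or set filterLevel to 0.',
    );
  }
}

// Level 0 never needs a key, so this call does not throw.
assertLlmFilteringConfigured(0, {});

try {
  assertLlmFilteringConfigured(3, {}); // no key available in this environment object
} catch (error) {
  console.error(error.message);
}
```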
- Start with level 2-3 filtering for most projects
- Use custom focus to target specific areas of interest
- Remove comments for token optimization
- Test with small projects before processing large codebases
Feel free to customize the exclusion patterns, add new filtering logic, or extend the LLM integration for other AI models.
This tool is provided as-is for educational and development purposes. Ensure you comply with OpenAI's API usage policies when using LLM filtering features.
Ready to analyze your codebase with AI? Drop your project files in `files_to_extract/` and run the extractor!