IntelliDoc is a persona-driven document intelligence system that extracts and ranks the most relevant sections from PDF documents based on a specific persona and their job-to-be-done. It helps users quickly find the information most pertinent to their needs and expertise level.
- Features
- Installation
- Usage
- Project Structure
- Architecture
- Input Format
- Output Format
- Docker Support
- Development
- License
- Multi-format Support: Extracts headings from both text-based and image-based PDFs
- Multilingual OCR: Supports multiple languages including English, Japanese, Chinese, Arabic, Hindi, and Korean
- Smart Heading Detection:
  - Identifies headings based on font size, weight, and style
  - Handles nested heading hierarchies
  - Distinguishes between main headings and subheadings
- Document Structure Analysis:
  - Automatically detects document structure
  - Identifies body text vs. headings
  - Handles complex layouts and columns
- Output Formats:
  - JSON output with hierarchical heading structure
  - Preserves page numbers and positions
  - Extracts text content under each heading
- Enhanced PDF text extraction with outline parsing
- OCR fallback for image-based PDFs
- Multilingual support (English, Japanese, Chinese, Arabic, Hindi, Korean)
- Intelligent section boundary detection
- Content type classification
- Table and figure extraction
- Automatic persona type identification (researcher, analyst, student, etc.)
- Domain focus detection (academic, business, technical, etc.)
- Experience level assessment
- Expertise area extraction
- Job requirement parsing
- Multi-dimensional scoring system:
  - Semantic Similarity (35%)
  - Keyword Overlap (25%)
  - Content Type Match (20%)
  - Expertise Alignment (15%)
  - Structural Importance (5%)
- Granular content extraction from top sections
- Paragraph-level relevance scoring
- Intelligent text chunking for optimal readability
- Context-aware text refinement
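The text-chunking step in the feature list above can be sketched as a simple paragraph-merging pass (an illustrative sketch, not the project's exact implementation; `max_chars` is an assumed parameter):

```python
def chunk_text(text: str, max_chars: int = 500) -> list:
    """Merge consecutive paragraphs into chunks no longer than max_chars.

    Illustrative sketch only; the project's actual chunker may differ.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```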
- Python 3.8 or higher
- Tesseract OCR (for image-based PDFs)
- Poppler (for PDF processing)
- Git (for cloning the repository)
```bash
# Clone the repository
git clone https://github.com/Codealpha07/IntelliDoc.git
cd IntelliDoc

# Create and activate a virtual environment
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On Unix or macOS:
source venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt
```

Install the system dependencies:

Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
```

macOS (using Homebrew):

```bash
brew install tesseract poppler
```

Windows:

- Download and install Tesseract OCR from UB Mannheim
- During installation, check "Add Tesseract to your system PATH"
- Download and install Poppler from poppler-windows
- Add Poppler to your system PATH

Verify the installation:

```bash
python -c "import fitz; print('PyMuPDF version:', fitz.__version__)"
python -c "import pytesseract; print('Tesseract version:', pytesseract.get_tesseract_version())"
```
1. Prepare Input Files
   - Place 3-10 PDF files in the `inputs/` directory
   - Create a configuration file (see Input Format)
   - Example directory structure:

     ```
     IntelliDoc/
     ├── inputs/
     │   ├── document1.pdf
     │   ├── document2.pdf
     │   └── config.json
     └── ...
     ```

2. Run the Application

   ```bash
   # Basic usage with default settings
   python main.py

   # With custom input/output directories
   python main.py --input-dir ./my_inputs --output-dir ./my_outputs

   # With verbose logging
   python main.py --verbose
   ```

3. View Results
   - Output will be saved in the `outputs/` directory by default
   - Main output file: `challenge1b_output.json`
   - Logs are available in `app.log` (when verbose mode is enabled)
```
python main.py [options]

Options:
  --input-dir PATH    Directory containing input PDFs (default: inputs/)
  --output-dir PATH   Directory to save output (default: outputs/)
  --config FILE       Path to configuration file (default: auto-detected)
  --verbose           Enable verbose logging (default: False)
  --debug             Enable debug mode with more detailed logging (default: False)
  --max-docs N        Maximum number of documents to process (default: 10)
  --no-ocr            Disable OCR processing (faster, but may miss text in images)
  --language LANG     Set the OCR language (default: eng)
```

Create a `config.json` file in your input directory with the following format:
```json
{
  "persona": "[Persona description]",
  "job_to_be_done": "[Detailed description of the task]"
}
```

Example:
```json
{
  "persona": "Senior Machine Learning Engineer with 5 years of experience in NLP",
  "job_to_be_done": "Research state-of-the-art transformer architectures for document understanding"
}
```

You can also configure the application using environment variables:
```bash
# Set input/output directories
export INTELLIDOC_INPUT_DIR=./my_inputs
export INTELLIDOC_OUTPUT_DIR=./my_outputs

# Set the logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
export INTELLIDOC_LOG_LEVEL=INFO

# Enable/disable features
export INTELLIDOC_ENABLE_OCR=true
export INTELLIDOC_MAX_DOCS=5
```

The system expects:
- 3-10 PDF files in the `/app/input/` directory
- A configuration file (one of):
  - `config.json`: JSON format with `persona` and `job_to_be_done` fields
  - `persona.json`: Alternative JSON configuration
  - `input.json`: Another alternative format
  - `*.txt`: Plain text with the persona on the first line and the job description following
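For the `*.txt` variant, a minimal parser could look like this (a sketch assuming only the format described above; the function name is hypothetical):

```python
def parse_text_config(text: str) -> dict:
    """Parse the plain-text config variant: persona on the first line,
    job description on the remaining lines. Hypothetical helper."""
    lines = [l.strip() for l in text.strip().splitlines() if l.strip()]
    return {
        "persona": lines[0],
        "job_to_be_done": " ".join(lines[1:]),
    }
```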
```json
{
  "persona": "PhD Researcher in Computational Biology with expertise in machine learning applications",
  "job_to_be_done": "Prepare a comprehensive literature review focusing on methodologies, datasets, and performance benchmarks for graph neural networks in drug discovery"
}
```

The system generates `challenge1b_output.json` with:
```json
{
  "metadata": {
    "input_documents": ["doc1.pdf", "doc2.pdf"],
    "persona": "...",
    "job_to_be_done": "...",
    "processing_timestamp": "2025-01-XX...",
    "total_sections_analyzed": 45,
    "documents_processed": 3
  },
  "extracted_sections": [
    {
      "document": "doc1.pdf",
      "page_number": 3,
      "section_title": "Graph Neural Network Architectures",
      "importance_rank": 1
    }
  ],
  "subsection_analysis": [
    {
      "document": "doc1.pdf",
      "page_number": 3,
      "refined_text": "Graph neural networks have emerged as..."
    }
  ]
}
```

The system recognizes 8 persona types:
- Researcher: Focus on methodology, literature, benchmarks
- Student: Emphasis on concepts, examples, fundamentals
- Analyst: Business metrics, trends, performance analysis
- Engineer: Technical implementation, architecture, systems
- Manager: Strategy, planning, execution, processes
- Consultant: Recommendations, best practices, optimization
- Journalist: Facts, events, context, impact analysis
- Entrepreneur: Market opportunities, innovation, scaling
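As an illustration (not the project's actual classifier, whose internals are not shown here), persona-type identification could start from simple keyword matching:

```python
# Illustrative keyword-based persona classifier; the keyword lists and
# the default fallback are assumptions, not the project's actual rules.
PERSONA_KEYWORDS = {
    "researcher": ["researcher", "phd", "scientist"],
    "student": ["student", "undergraduate", "learner"],
    "analyst": ["analyst", "analytics"],
    "engineer": ["engineer", "developer", "architect"],
    "manager": ["manager", "director", "lead"],
    "consultant": ["consultant", "advisor"],
    "journalist": ["journalist", "reporter", "editor"],
    "entrepreneur": ["entrepreneur", "founder", "startup"],
}

def classify_persona(description: str) -> str:
    text = description.lower()
    for persona_type, keywords in PERSONA_KEYWORDS.items():
        if any(k in text for k in keywords):
            return persona_type
    return "researcher"  # assumed default when nothing matches
```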
The system automatically identifies 6 content types:
- Methodology: Approaches, techniques, frameworks
- Results: Findings, data, measurements, outcomes
- Background: Context, literature, historical information
- Analysis: Evaluation, interpretation, discussion
- Examples: Cases, illustrations, applications
- Summary: Conclusions, abstracts, key points
```python
def calculate_score(section, persona, job):
    score = (
        0.35 * semantic_similarity(section, persona, job) +
        0.25 * keyword_overlap(section, persona, job) +
        0.20 * content_type_match(section, job) +
        0.15 * expertise_alignment(section, persona) +
        0.05 * structural_importance(section)
    )
    return min(score, 1.0)
```

- Efficient Text Processing: Optimized regular expressions and string operations
- Smart Caching: Avoid redundant computations
- Memory Management: Process documents sequentially to minimize RAM usage
- Early Filtering: Skip irrelevant content early in the pipeline
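The helper terms in `calculate_score` are project internals; as one illustration, the keyword-overlap term could be a Jaccard similarity over word tokens (a sketch, not the actual implementation):

```python
import re

def keyword_overlap(section_text: str, persona: str, job: str) -> float:
    """Jaccard overlap between section tokens and persona/job tokens.

    Sketch only; the project's real scoring helpers are not shown here.
    """
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z]+", s.lower()))

    section = tokenize(section_text)
    query = tokenize(persona) | tokenize(job)
    if not section or not query:
        return 0.0
    # Jaccard similarity: |intersection| / |union|
    return len(section & query) / len(section | query)
```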
1. Build the Docker image:

   ```bash
   docker build --platform linux/amd64 -t intellidoc:latest .
   ```

2. Verify the image was built successfully:

   ```bash
   docker images | grep intellidoc
   ```

3. Create input and output directories:

   ```bash
   mkdir -p ./input ./output
   ```

4. Place your PDF files and config in the input directory:

   ```bash
   cp your-document.pdf ./input/
   cp config.json ./input/
   ```

5. Run the container:

   ```bash
   docker run --rm \
     -v $(pwd)/input:/app/input \
     -v $(pwd)/output:/app/output \
     --network none \
     intellidoc:latest
   ```
For easier management, you can use Docker Compose:
1. Create a `docker-compose.yml` file:

   ```yaml
   version: '3.8'
   services:
     intellidoc:
       build: .
       volumes:
         - ./input:/app/input
         - ./output:/app/output
       environment:
         - INTELLIDOC_LOG_LEVEL=INFO
         - INTELLIDOC_MAX_DOCS=5
       deploy:
         resources:
           limits:
             cpus: '2'
             memory: 2G
   ```

2. Start the service:

   ```bash
   docker-compose up --build
   ```
- Resource Limits: The example above sets reasonable CPU and memory limits
- Volume Mounting: Use named volumes for production deployments
- Security: Runs with non-root user inside container
- Caching: Leverages Docker layer caching for faster builds
- Python 3.8+: Core programming language
- PyMuPDF (fitz): Advanced PDF processing and text extraction
- pytesseract: OCR capabilities for image-based PDFs
- Pillow: Image processing for OCR preprocessing
- numpy: Numerical computations and array operations
- pandas: Data manipulation and analysis
- scikit-learn: Machine learning utilities for text processing
- nltk: Natural language processing toolkit
- tqdm: Progress bars for long-running operations
- pytest: Testing framework
- black: Code formatting
- flake8: Linting
- mypy: Static type checking
- pylint: Code quality checking
- Tesseract OCR: For optical character recognition
- Poppler: For PDF rendering and processing
- Git: For version control
- Graceful Fallbacks: Automatically falls back to OCR for image-based PDFs
- Robust Parsing: Handles malformed PDF structures and recovers gracefully
- Input Validation: Validates all inputs before processing
- Configuration Defaults: Provides sensible defaults for missing configuration
- Resource Management: Properly manages file handles and system resources
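The graceful OCR fallback described above can be sketched generically; the extractor callables below stand in for the real PyMuPDF and pytesseract calls, which are not shown here:

```python
def extract_page_text(page, native_extract, ocr_extract) -> str:
    """Return native text when available, otherwise fall back to OCR.

    native_extract / ocr_extract are stand-ins for the real
    PyMuPDF and pytesseract calls; sketch only.
    """
    text = native_extract(page)
    # An image-based page typically yields empty or whitespace-only text
    if text and text.strip():
        return text
    return ocr_extract(page)
```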
Logs are written to `app.log` and can be configured via environment variables:

```python
import logging
import os

# Basic configuration
logging.basicConfig(
    level=os.getenv('INTELLIDOC_LOG_LEVEL', 'INFO'),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('app.log'),
        logging.StreamHandler()
    ]
)

# Get a logger for the current module
logger = logging.getLogger(__name__)
```
1. OCR Failures:
   - Ensure Tesseract is installed and on the PATH
   - Check that the required language packs are installed
   - Verify image quality if using scanned documents

2. PDF Processing Errors:
   - Corrupt PDFs may cause issues
   - Try opening the PDF in a viewer to verify its integrity
   - Consider converting it to a different format if problems persist

3. Performance Issues:
   - Large PDFs may require more memory
   - Consider splitting large documents
   - Enable logging to identify bottlenecks
| Metric | Value | Notes |
|---|---|---|
| Processing Time | ~30-45s | For 5 average-sized PDFs |
| Memory Usage | 400-600MB | Peak usage during processing |
| CPU Usage | 2 cores | Can be scaled based on workload |
| Model Size | <100MB | No large external models |
| Document Size | Up to 50MB | Per document |
| Batch Size | Up to 10 | Documents per run |
1. Vertical Scaling:
   - Increase CPU cores for faster processing
   - Add more RAM for larger documents or batches

2. Horizontal Scaling:
   - Process documents in parallel across multiple instances
   - Use a message queue for job distribution

3. Optimization Tips:
   - Enable caching for repeated documents
   - Disable OCR with `--no-ocr` when it is not needed
   - Process documents in smaller batches
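Caching for repeated documents could be as simple as keying extraction results by a content hash (an illustrative sketch; the names and structure are assumptions):

```python
import hashlib

_cache: dict = {}

def extract_with_cache(pdf_bytes: bytes, extract) -> str:
    """Cache extraction results by content hash so identical documents
    are processed only once. Sketch only; `extract` stands in for the
    real extraction pipeline."""
    key = hashlib.sha256(pdf_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = extract(pdf_bytes)
    return _cache[key]
```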
| Resource | Minimum | Recommended |
|---|---|---|
| CPU Cores | 1 | 2+ |
| RAM | 1GB | 2GB |
| Disk Space | 100MB | 1GB |
| OS | Linux/Windows/macOS | Linux |
The codebase includes comprehensive test coverage for:
- Document processing and text extraction
- Persona analysis and classification
- Relevance scoring algorithms
- Input/output handling
- Error conditions and edge cases
Run the full test suite:
```bash
pytest tests/ -v --cov=.
```
1. Unit Tests:
   - Individual component testing
   - Mocked dependencies
   - Edge case validation

2. Integration Tests:
   - End-to-end document processing
   - Configuration handling
   - File system operations

3. Performance Tests:
   - Processing time benchmarks
   - Memory usage profiling
   - Scaling characteristics
Test data is stored in `tests/test_data/` and includes:
- Sample PDF documents
- Configuration files
- Expected output files
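A unit test in this spirit might pin the scoring weights (a hypothetical test; the project's actual suite may organize this differently):

```python
def test_scoring_weights_sum_to_one():
    # The five weights from the relevance formula should sum to 1.0
    # so that calculate_score stays in [0, 1]. Values from the README.
    weights = {
        'semantic_similarity': 0.35,
        'keyword_overlap': 0.25,
        'content_type_match': 0.20,
        'expertise_alignment': 0.15,
        'structural_importance': 0.05,
    }
    assert abs(sum(weights.values()) - 1.0) < 1e-9
```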
The project includes a `.github/workflows/ci.yml` file that runs:
- Unit tests
- Type checking
- Linting
- Code coverage
1. Semantic Understanding:
   - Context-aware text analysis
   - Topic modeling
   - Entity recognition

2. Persona Adaptation:
   - Dynamic adjustment based on expertise level
   - Domain-specific processing
   - Customizable scoring weights

3. Content Intelligence:
   - Automatic section detection
   - Table and figure extraction
   - Citation analysis

4. Scoring Weights:

   ```python
   {
       'semantic_similarity': 0.35,
       'keyword_overlap': 0.25,
       'content_type_match': 0.20,
       'expertise_alignment': 0.15,
       'structural_importance': 0.05
   }
   ```

5. Plugins and Extensions:
   - Custom document processors
   - Specialized analyzers
   - Output formatters

6. API Access:
   - RESTful interface
   - Python package
   - Command-line interface
This project is licensed under the MIT License - see the LICENSE file for details.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
For support, please open an issue in the GitHub repository.
- Adobe for the original challenge
- Open source contributors
- The Python community
This implementation combines the robust PDF extraction capabilities from Round 1A with sophisticated intelligence layers to deliver highly relevant, persona-specific document insights.