A research tool that uses Claude Code for flexible, natural-language exploration of a local CTAHR (UH College of Tropical Agriculture and Human Resources) document collection. It indexes and searches PDF, DOCX, PPTX, and other file formats using SQLite FTS5 full-text search.
┌─────────────────────────────────────────────────────────────┐
│  Researcher enters natural language query:                  │
│  "Find documents about soil conservation in Hawaii"         │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│  AI Agent generates search terms and variations:            │
│  "soil conservation", "erosion control", "cover crop", etc. │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│  Searches local SQLite FTS5 index of document collection    │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│  Agent analyzes results, extracts text, identifies          │
│  relevant documents, and reports findings                   │
└─────────────────────────────────────────────────────────────┘
npm install
Place your CTAHR document collection in ctahr-pdfs/. Then build the index:
node index-docs.js
No browser or external services required. Everything runs locally.
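
As a rough illustration of how an index build can skip files that have not changed since the last run, the sketch below compares a file's current size and modification time with a previously saved record. The `needsReindex` helper and the `{ size, mtimeMs }` record shape are assumptions made for illustration; they are not taken from index-docs.js, which may detect changes differently.

```javascript
// Hypothetical helper: decide whether a file must be (re)indexed.
// `previous` is the record saved during the last run (undefined for a new file);
// `current` has the same fields as the object returned by fs.statSync().
function needsReindex(previous, current) {
  if (!previous) return true; // never indexed before
  return previous.size !== current.size || previous.mtimeMs !== current.mtimeMs;
}

// Plain objects stand in for fs.Stats values here.
const lastRun = { size: 1024, mtimeMs: 1700000000000 };
console.log(needsReindex(lastRun, { size: 1024, mtimeMs: 1700000000000 })); // false
console.log(needsReindex(lastRun, { size: 2048, mtimeMs: 1700000500000 })); // true
console.log(needsReindex(undefined, { size: 10, mtimeMs: 1 })); // true
```

Size plus modification time is cheap to check; a content hash would be stricter but requires reading every file.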
Start a Claude Code session and describe your research query in natural language:
"Find documents about bee habitat and pollinator health"
The agent will:
- Generate appropriate search terms
- Search the indexed document catalog
- Review and analyze results
- Report findings with document paths
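
To make the first two steps concrete, here is a minimal sketch of how a list of generated search phrases could be combined into a single FTS5 query string. The `buildFtsQuery` function is hypothetical and is not part of search-docs.js; it relies only on standard FTS5 syntax, where a double-quoted string is matched as a phrase and `OR` joins alternatives.

```javascript
// Hypothetical helper: turn search phrases into one FTS5 MATCH expression.
// Each phrase is wrapped in double quotes (FTS5 phrase syntax);
// embedded double quotes are escaped by doubling them.
function buildFtsQuery(terms) {
  return terms
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .map((t) => `"${t.replace(/"/g, '""')}"`)
    .join(" OR ");
}

console.log(buildFtsQuery(["soil conservation", "erosion control", "cover crop"]));
// "soil conservation" OR "erosion control" OR "cover crop"
```

Quoting every phrase keeps punctuation in a query from being read as FTS5 operators.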
Build/update the index:
node index-docs.js # incremental (skip unchanged)
node index-docs.js --rebuild # full rebuild
Search documents:
node search-docs.js "soil conservation"
node search-docs.js "ruminant nutrition" --limit 10
node search-docs.js "bee" --type pdf
node search-docs.js --stats
View a specific document:
node view-doc.js smarts2/bsipes # by catalog ID
node view-doc.js "ctahr-pdfs/SMARTS2/bsipes.pdf" # by path
node view-doc.js smarts2/bsipes --no-save

| Format | Extraction | Notes |
|---|---|---|
| PDF | pdf-parse | Full text + page count |
| DOCX | mammoth | Clean text extraction |
| PPTX | officeparser | Slide text (images not extracted) |
| XLSX | officeparser | Cell text |
| JPG/PNG | metadata only | Use Read tool for visual review |
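
The format table can be read as a lookup from file extension to extraction method. The sketch below shows one way such a dispatch could look; the `EXTRACTORS` map and `extractorFor` function are hypothetical names, and the code only returns the extractor's name rather than calling the libraries themselves.

```javascript
const path = require("node:path");

// Extension-to-extractor mapping copied from the table; the lookup itself is a sketch.
const EXTRACTORS = {
  ".pdf": "pdf-parse",
  ".docx": "mammoth",
  ".pptx": "officeparser",
  ".xlsx": "officeparser",
  ".jpg": "metadata only",
  ".png": "metadata only",
};

// Returns the extraction method for a file, or "unsupported" for unknown extensions.
function extractorFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return EXTRACTORS[ext] ?? "unsupported";
}

console.log(extractorFor("ctahr-pdfs/SMARTS2/bsipes.pdf")); // pdf-parse
```

Lowercasing the extension means `.PDF` and `.pdf` are treated the same way.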
├── CLAUDE.md # Instructions for AI agent
├── README.md # This file
├── RESEARCH_BRIEF_PROCESS.md # How to create new research briefs
│
├── index-docs.js # Build/update document index
├── search-docs.js # Search the catalog
├── view-doc.js # Extract + display document text
│
├── subagents/ # Sub-agent prompt templates
│ ├── triage-agent-prompt.md # Batch document triage agent
│ └── merge-agent-prompt.md # State file merge agent
│
├── docs/ # Technical documentation
│ ├── RESEARCH_WORKFLOW.md # End-to-end research process guide
│ ├── SUBAGENT_RUNNER.md # Sub-agent runner usage
│ └── SUBAGENT_TESTING.md # Sub-agent test verification
│
├── results/ # Research output files
├── research-briefs/ # Research query templates
│ ├── _template.md # Blank template
│ └── ctahr-general.md # General CTAHR collection brief
│
├── research-state/ # Session state files (JSON)
├── extracted-text/ # Saved extracted text (auto-created, gitignored)
├── ctahr-pdfs/ # Source documents (gitignored)
└── catalog.db # SQLite FTS5 index (gitignored)
For large-scale document review, the project includes a triage sub-agent that can process batches of documents in parallel within Claude Code:
- Search for relevant documents
- Collect paths/IDs from results
- Launch triage agents (using subagents/triage-agent-prompt.md) to extract, tier-assess, and capture findings
- Review agent output, then merge into the research state file
See docs/RESEARCH_WORKFLOW.md for the full process.
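
The batching and merging steps of this workflow can be sketched in a few lines. Everything below is an assumption for illustration: the `toBatches` and `mergeFindings` helpers, the batch size, and the `{ id, tier, notes }` shape of a finding are hypothetical, and the real state file format is defined in the project's own documentation.

```javascript
// Hypothetical helper: split document IDs into fixed-size batches, one per triage agent.
function toBatches(ids, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical merge step: fold per-agent findings into a state object keyed by document ID.
// Returns a new object; later findings for the same ID overwrite earlier ones.
function mergeFindings(state, findings) {
  const merged = { ...state };
  for (const f of findings) {
    merged[f.id] = { tier: f.tier, notes: f.notes };
  }
  return merged;
}

// Placeholder IDs; only "smarts2/bsipes" appears in this README.
const batches = toBatches(["smarts2/bsipes", "example/doc-a", "example/doc-b"], 2);
console.log(batches.length); // 2
```

Keeping the merge as a separate, pure step makes it easier to review agent output before it reaches the research state file.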
- Image-only PDFs (scanned without OCR) yield little extractable text
- PPTX charts and images are not captured in text extraction
- Complex table layouts in PDFs may not extract cleanly
- Use Claude's Read tool for visual review of any document with important non-text content
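
One way to spot the first limitation automatically is to check how much text was extracted per page. The sketch below is a hypothetical heuristic, not something the project's scripts are known to do; the `looksImageOnly` name and the default threshold of 50 characters per page are assumptions that would need tuning against the actual collection.

```javascript
// Hypothetical heuristic: flag a PDF as likely image-only (scanned without OCR)
// when the extracted text averages very few non-whitespace characters per page.
function looksImageOnly(text, pageCount, minCharsPerPage = 50) {
  if (pageCount <= 0) return true; // no pages reported, nothing to search
  const chars = text.replace(/\s+/g, "").length;
  return chars / pageCount < minCharsPerPage;
}

console.log(looksImageOnly("", 12)); // true
console.log(looksImageOnly("x".repeat(5000), 10)); // false
```

Documents flagged this way are candidates for visual review rather than text search.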
This is a research project for nonprofit and community partners. Contact the project maintainers for collaboration opportunities.
ISC