Open
Ticket: Develop Week 1–2 Core Content for Document Intelligence Foundations
Description
Create instructional content and hands-on exercises for Weeks 1–2 of the Document Intelligence course, covering foundational concepts and the first practical extraction pipeline. Content should follow the Content Development Guide and be aligned to the course Learning Objectives (LOs).
📘 Content Development Guide
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.2bxrvou21vrt
🎯 Relevant Learning Objectives (Reference)
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.hgmn7xwnkkmt
Scope of Work
Week 1, Session 1: Introduction to Document Intelligence
Concept Content
- Explain differences between structured, semi-structured, and unstructured documents
- Introduce OCR and NLP at a conceptual level
- Discuss tradeoffs between extraction approaches (accuracy vs. cost vs. speed)
Hands-On Exercise
- Provide sample invoice documents
- Guide students through inspecting invoices to identify where a regex-only approach may fail
- Emphasize real-world variability and ambiguity
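A possible starter snippet for this exercise is sketched below. It assumes three made-up invoice text samples (the vendor names, field labels, and values are hypothetical, not taken from the course materials) and a single pattern tuned to one layout, so students can see the pattern miss a different label and an OCR-corrupted label.

```python
import re

# Hypothetical text from three invoice layouts (invented for illustration).
SAMPLES = {
    "vendor_a": "Invoice No: INV-1001\nTotal: $1,250.00",
    "vendor_b": "INVOICE #2024-017\nAmount Due    USD 1250.00",
    "vendor_c": "Invoice No: INV-1O03\nTota1: $980.50",  # simulated OCR confusion: O/0, l/1
}

# A pattern tuned to vendor A's layout only.
TOTAL_RE = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")


def extract_total(text):
    """Return the total as a string, or None if the pattern does not match."""
    match = TOTAL_RE.search(text)
    return match.group(1) if match else None


results = {name: extract_total(text) for name, text in SAMPLES.items()}
for name, total in results.items():
    print(f"{name}: {total}")
```

Only the first sample matches. The second uses a different label and currency format, and the third has a character substitution in the label itself, which gives students two distinct failure modes to discuss.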
LO Alignment
- Distinguish document types
- Explain OCR/NLP roles
- Analyze tradeoffs in extraction approaches
Week 1, Session 2: OCR Basics
Concept Content
- Define Optical Character Recognition (OCR)
- Explain how OCR fits into a document intelligence pipeline
- Introduce common Python OCR tools (focus on pytesseract)
Hands-On Exercise
- Walk students through applying pytesseract to invoice images
- Include examples of noisy or imperfect OCR output
- Prompt students to evaluate OCR quality
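One way the walkthrough and the quality check could look is sketched here. It assumes `pytesseract` and Pillow are installed along with the Tesseract binary; `ocr_image` and `similarity` are hypothetical helper names, and the reference and noisy strings are invented examples. The similarity score from `difflib` is a rough stand-in for OCR quality, not a standard character error rate.

```python
from difflib import SequenceMatcher


def ocr_image(path):
    """Run Tesseract on one image file and return the recognized text."""
    # Imported lazily so the scoring helper below works without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def similarity(ocr_text, reference_text):
    """Return a 0..1 similarity ratio after collapsing runs of whitespace."""
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return SequenceMatcher(None, a, b).ratio()


# Invented strings: a hand-typed reference and a simulated noisy OCR result.
reference = "Invoice No: INV-1001 Total: $1,250.00"
noisy = "Invoice No: INV-1OO1 Tota1: $1,25O.OO"
score = similarity(noisy, reference)
print(f"similarity: {score:.2f}")
```

In class, students would replace `noisy` with `ocr_image("path/to/invoice.png")` (the path is a placeholder) and compare against a transcription they type by hand.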
LO Alignment
- Explain OCR fundamentals
- Apply OCR tools to real documents
- Analyze OCR limitations
Week 2, Session 1: OCR + Regex
Concept Content
- Introduce regular expressions in the context of document extraction
- Briefly introduce spaCy as an optional or exploratory NLP tool
Hands-On Exercise
- Build an invoice processing pipeline that:
  - Uses OCR to extract text
  - Applies regex to extract key fields
  - Outputs structured JSON
- Highlight brittle areas and failure cases
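A minimal sketch of such a pipeline follows, under stated assumptions: the field names, the patterns in `FIELD_PATTERNS`, and the sample text are all hypothetical, and the OCR step is injectable so the example can run on a text stub without Tesseract installed. It is meant as a starting point for the exercise, not a reference solution.

```python
import json
import re

# Hypothetical field patterns; real invoices will need their own.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def ocr_image(path):
    """Default OCR step: run Tesseract on an image file (needs pytesseract)."""
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def extract_fields(text):
    """Apply each pattern; fields that do not match are recorded as None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process_invoice(path, ocr=ocr_image):
    """OCR the file, extract fields with regex, and return a JSON string."""
    return json.dumps(extract_fields(ocr(path)), indent=2)


# Stub OCR output so the sketch runs without an image or Tesseract.
sample_text = "ACME Corp\nInvoice No: INV-1001\nDate: 2024-03-15\nTotal: $1,250.00\n"
print(process_invoice("sample.png", ocr=lambda _path: sample_text))
```

Two brittle spots are worth pointing out to students: the `total` pattern also matches inside "Subtotal", and any layout that labels the amount differently (for example "Amount Due") yields `null` in the JSON output.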
LO Alignment
- Apply regex for structured extraction
- Construct an OCR + rules-based pipeline
- Analyze limitations of rule-based approaches