Ticket: Develop Week 1–2 Core Content for Document Intelligence Foundations #2

@gcziprusz

Description

Create instructional content and hands-on exercises for Weeks 1–2 of the Document Intelligence course, covering foundational concepts and the first practical extraction pipeline. Content should follow the Content Development Guide and be aligned to the course Learning Objectives (LOs).

📘 Content Development Guide
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.2bxrvou21vrt

🎯 Relevant Learning Objectives (Reference)
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.hgmn7xwnkkmt

Scope of Work

Week 1, Session 1: Introduction to Document Intelligence

Concept Content

  • Explain differences between structured, semi-structured, and unstructured documents
  • Introduce OCR and NLP at a conceptual level
  • Discuss tradeoffs between extraction approaches (accuracy vs. cost vs. speed)
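
One way to make the three document categories concrete is to show the same invoice in each form. The sketch below is only illustrative: the invoice values, field names, and variable names are invented for this example and are not taken from the course materials.

```python
import csv
import io
import json

# Hypothetical invoice data, invented for illustration only.
STRUCTURED_CSV = "invoice_number,date,total\nINV-1001,2024-03-15,1250.00\n"
SEMI_STRUCTURED_JSON = (
    '{"invoice": {"number": "INV-1001", "date": "2024-03-15", '
    '"total": 1250.00, "notes": "Net 30"}}'
)
UNSTRUCTURED_TEXT = (
    "Thanks for your business! Invoice INV-1001, issued on March 15, 2024, "
    "comes to $1,250.00 and is due in 30 days."
)


def read_structured(text):
    """Fixed schema: every row has the same named columns."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Self-describing keys, but nesting and optional fields can vary."""
    return json.loads(text)["invoice"]


rows = read_structured(STRUCTURED_CSV)
doc = read_semi_structured(SEMI_STRUCTURED_JSON)

print(rows[0]["invoice_number"], rows[0]["total"])
print(doc["number"], doc["total"], doc.get("notes"))
# The unstructured version has no keys to look up: the fields must be
# located inside free text, which is where OCR, regex, and NLP come in.
print(UNSTRUCTURED_TEXT)
```

The structured and semi-structured versions can be read with a standard parser, while the unstructured one needs an extraction step before any field is available.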

Hands-On Exercise

  • Provide sample invoice documents
  • Guide students through inspecting invoices to identify where a regex-only approach may fail
  • Emphasize real-world variability and ambiguity
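
A small demonstration can help students see where a regex-only approach breaks. The sketch below applies one deliberately narrow pattern to several header lines. The sample lines and the pattern are hypothetical, written to mimic the kind of variation real invoices show.

```python
import re

# A deliberately narrow pattern: the literal label "Invoice No: " plus digits.
INVOICE_NO = re.compile(r"Invoice No: (\d+)")

# Hypothetical header lines, invented to mimic real-world variation.
SAMPLES = [
    "Invoice No: 48213",      # the layout the pattern was written for
    "Invoice #48213",         # different label
    "INVOICE NO. 48213",      # different case and punctuation
    "Invoice No: INV-48213",  # alphanumeric identifier
    "Inv0ice No: 48213",      # OCR confused the letter 'o' with the digit '0'
]


def extract_invoice_number(line):
    """Return the captured digits, or None when the pattern does not match."""
    match = INVOICE_NO.search(line)
    return match.group(1) if match else None


results = {line: extract_invoice_number(line) for line in SAMPLES}
for line, value in results.items():
    print(f"{line!r:28} -> {value}")
```

Only the first line matches. Each of the others fails for a different reason, which gives students a concrete list of variability sources to look for in the sample invoices.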

LO Alignment

  • Distinguish document types
  • Explain OCR/NLP roles
  • Analyze tradeoffs in extraction approaches

Week 1, Session 2: OCR Basics

Concept Content

  • Define Optical Character Recognition (OCR)
  • Explain how OCR fits into a document intelligence pipeline
  • Introduce common Python OCR tools (focus on pytesseract)
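
As a starting point for the pytesseract introduction, a minimal sketch of the OCR step might look like the following. It assumes the `pytesseract` and `Pillow` packages and the Tesseract binary are installed. The function names `ocr_image` and `clean_ocr_text` and the file name in the usage comment are hypothetical.

```python
def ocr_image(image_path, lang="eng"):
    """Return the raw text Tesseract recognizes in one image file."""
    # Imported lazily so the module still loads where the OCR stack is absent.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def clean_ocr_text(raw_text):
    """Strip surrounding whitespace and drop the blank lines OCR often emits."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Usage (assumes an image file exists at this hypothetical path):
# text = clean_ocr_text(ocr_image("invoice_001.png"))
```

In a pipeline, this step sits between image input and field extraction: everything downstream works on the text it returns, so OCR errors propagate into later stages.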

Hands-On Exercise

  • Walk students through applying pytesseract to invoice images
  • Include examples of noisy or imperfect OCR output
  • Prompt students to evaluate OCR quality
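
To give students something measurable when they evaluate OCR quality, one option is character error rate (CER): the edit distance between a hand-typed reference and the OCR output, divided by the reference length. The ticket does not prescribe a metric, so treat this as one possible choice. The sample strings below are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical ground truth and a noisy OCR reading of the same line.
reference = "Total Due: $1,250.00"
noisy = "Tota1 Due: $l,25O.00"
cer = character_error_rate(reference, noisy)
print(f"CER: {cer:.2%}")
```

A low CER can still hide serious problems: a single wrong digit in a total barely moves the rate but changes the extracted value, which is a useful discussion point.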

LO Alignment

  • Explain OCR fundamentals
  • Apply OCR tools to real documents
  • Analyze OCR limitations

Week 2, Session 1: OCR + Regex

Concept Content

  • Introduce regular expressions in the context of document extraction
  • Briefly introduce spaCy as an optional or exploratory NLP tool

Hands-On Exercise

  • Build an invoice processing pipeline that:
    • Uses OCR to extract text
    • Applies regex to extract key fields
    • Outputs structured JSON
  • Highlight brittle areas and failure cases
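
The pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the field names, regex patterns, sample text, and the `ocr` parameter are all hypothetical. The default OCR step assumes `pytesseract`, `Pillow`, and the Tesseract binary are installed.

```python
import json
import re

# Illustrative patterns only; real invoices need more variants than these.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"Date\s*:?\s*(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE
    ),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(
        r"\bTotal(?:\s+Due)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Apply each pattern; a field that is not found is reported as None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process_invoice(image_path, ocr=None):
    """OCR an image, extract fields with regex, and return a JSON string."""
    if ocr is None:
        # Default OCR step, imported lazily so a stub can replace it in tests.
        from PIL import Image
        import pytesseract

        def ocr(path):
            with Image.open(path) as image:
                return pytesseract.image_to_string(image)

    return json.dumps(extract_fields(ocr(image_path)), indent=2)


# Stand-in for OCR output so the example runs without Tesseract installed.
SAMPLE_OCR_TEXT = """ACME Supplies
Invoice No: INV-1001
Date: 2024-03-15
Subtotal: $1,150.00
Total Due: $1,250.00
"""

output = process_invoice("sample.png", ocr=lambda _path: SAMPLE_OCR_TEXT)
result = json.loads(output)
print(output)
```

Passing the OCR step in as a function lets students swap in canned text to test the regex layer on its own. The patterns remain brittle by design: removing the `\b` from the total pattern makes it capture the subtotal instead, and a label such as "Invoice Notes:" would be misread as an invoice number.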

LO Alignment

  • Apply regex for structured extraction
  • Construct an OCR + rules-based pipeline
  • Analyze limitations of rule-based approaches
