## Ticket: Develop Week 1–2 Core Content for Document Intelligence Foundations

### Description

Create instructional content and hands-on exercises for **Weeks 1–2** of the Document Intelligence course, covering foundational concepts and the first practical extraction pipeline. Content should follow the **Content Development Guide** and be aligned to the **course Learning Objectives (LOs)**.

📘 **Content Development Guide**  
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.2bxrvou21vrt  

🎯 **Relevant Learning Objectives (Reference)**  
https://docs.google.com/document/d/1kZkrEpwPW0UkHe0kvl7tEVbKe7cLH622wTnoF_XNb40/edit?tab=t.kpgyevktwssa#bookmark=id.hgmn7xwnkkmt  


### Scope of Work

#### Week 1, Session 1: Introduction to Document Intelligence
**Concept Content**
- Explain differences between structured, semi-structured, and unstructured documents
- Introduce OCR and NLP at a conceptual level
- Discuss tradeoffs between extraction approaches (accuracy vs. cost vs. speed)

**Hands-On Exercise**
- Provide sample invoice documents
- Guide students through inspecting invoices to identify where a regex-only approach may fail
- Emphasize real-world variability and ambiguity
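
One way the exercise could make the regex-only failure concrete is sketched below. The invoice snippets, the `TOTAL_RE` pattern, and the `extract_total` helper are all hypothetical placeholders, not course materials; the actual sample invoices would replace them.

```python
import re

# Hypothetical invoice snippets (assumptions for illustration only).
SAMPLES = {
    "clean": "Invoice No: INV-1001\nTotal: $1,250.00",
    "label_variant": "Invoice # INV-1002\nAmount Due: $980.50",
    "ocr_noise": "Invoice No: INV-1003\nTota1: $ 410.00",
}

# A deliberately naive pattern that only handles the "Total: $<amount>" layout.
TOTAL_RE = re.compile(r"Total:\s*\$([\d,]+\.\d{2})")


def extract_total(text):
    """Return the matched amount as a string, or None if the pattern misses."""
    match = TOTAL_RE.search(text)
    return match.group(1) if match else None


results = {name: extract_total(text) for name, text in SAMPLES.items()}
for name, total in results.items():
    print(f"{name}: {total}")
```

Only the `clean` sample matches. The other two return `None` because of a different label ("Amount Due") and a character-level misread ("Tota1"), which are the kinds of variability students would be asked to spot.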

**LO Alignment**
- Distinguish document types  
- Explain OCR/NLP roles  
- Analyze tradeoffs in extraction approaches  

---

#### Week 1, Session 2: OCR Basics
**Concept Content**
- Define Optical Character Recognition (OCR)
- Explain how OCR fits into a document intelligence pipeline
- Introduce common Python OCR tools (focus on pytesseract)

**Hands-On Exercise**
- Walk students through applying pytesseract to invoice images
- Include examples of noisy or imperfect OCR output
- Prompt students to evaluate OCR quality
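
A possible shape for this exercise is sketched below, assuming pytesseract, Pillow, and the Tesseract binary are installed. The helper names (`ocr_image`, `edit_distance`, `character_error_rate`) and the sample strings are hypothetical. Character error rate is one common way to score OCR output against a hand-typed reference; the ticket does not prescribe a specific metric.

```python
def ocr_image(path):
    """Run Tesseract on an image file and return the recognized text.

    Imports are local so the scoring helpers below work without OCR installed.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a hand-typed reference vs. noisy OCR output.
reference = "Total: 410.00"
noisy = "Tota1: 41O.00"
cer = character_error_rate(reference, noisy)
print(f"CER: {cer:.3f}")
```

In class, `noisy` would come from `ocr_image("some_invoice.png")` (a placeholder path) rather than a hard-coded string.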

**LO Alignment**
- Explain OCR fundamentals  
- Apply OCR tools to real documents  
- Analyze OCR limitations  

---

#### Week 2, Session 1: OCR + Regex
**Concept Content**
- Introduce regular expressions in the context of document extraction
- Briefly introduce spaCy as an optional or exploratory NLP tool

**Hands-On Exercise**
- Build an invoice processing pipeline that:
  - Uses OCR to extract text
  - Applies regex to extract key fields
  - Outputs structured JSON
- Highlight brittle areas and failure cases
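
The pipeline steps above could be sketched roughly as follows. The field names, patterns, and sample text are assumptions chosen for illustration; the OCR step assumes pytesseract and Pillow are available and is kept separate so the regex and JSON stages can be run on plain text.

```python
import json
import re

# Hypothetical field patterns; real invoices will need more variants.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # \b keeps "Subtotal" from being mistaken for "Total".
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def ocr_image(path):
    """OCR step: image file in, raw text out (requires pytesseract + Pillow)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


def extract_fields(text):
    """Regex step: map each field to its first match, or None if missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process_invoice(text):
    """Output step: serialize the extracted fields as a JSON string."""
    return json.dumps(extract_fields(text), indent=2)


# Hypothetical OCR output, used here in place of a real image.
sample_text = (
    "ACME Corp\n"
    "Invoice No: INV-1001\n"
    "Date: 2024-03-15\n"
    "Subtotal: $1,200.00\n"
    "Total: $1,250.00\n"
)
fields = extract_fields(sample_text)
print(process_invoice(sample_text))
```

Fields that fail to match are emitted as `null` rather than dropped, which makes brittle spots (a date in `03/15/2024` format, a label such as "Amount Due") easy to surface during the failure-case discussion.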

**LO Alignment**
- Apply regex for structured extraction  
- Construct an OCR + rules-based pipeline  
- Analyze limitations of rule-based approaches  

