This project implements a Retrieval-Augmented Generation (RAG) pipeline to solve Korean Criminal Law multiple-choice questions. The goal is to demonstrate how a "System" (Prompt Engineering + RAG) can improve the performance of a smaller, cost-efficient model (gpt-4o-mini) compared to a naive baseline.
Task: Given a question about Korean Criminal Law and four options (A, B, C, D), predict the correct answer.
Motivation: Legal QA requires strictly grounding answers in specific laws and precedents, avoiding hallucinations. A RAG system is ideal for this as it can retrieve exact legal texts. The focus is on optimizing a smaller model (gpt-4o-mini) to reduce costs while maintaining acceptable accuracy.
Input: A question string and four options. Output: A single character label {A, B, C, D}. Success Criterion: Accuracy on the test dataset.
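To make the I/O contract concrete, here is a minimal sketch of the task interface; the column names (`question`, option columns `A`–`D`, `answer`) are assumptions about the CSV layout, not confirmed by the dataset:

```python
import pandas as pd

LABELS = ["A", "B", "C", "D"]

def accuracy(preds: list[str], golds: list[str]) -> float:
    """Success criterion: fraction of exactly matching labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Assumed column layout; the actual KMMLU CSV schema may differ.
test_df = pd.read_csv("Criminal-Law-test.csv")  # 200 rows
```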
A dataset of Korean Criminal Law questions (Criminal-Law-test.csv) is used, consisting of:
- Source: Legal examination questions. (Dataset: KMMLU)
- Size: 200 test samples.
- Preprocessing:
  - The retrieval index was built using Criminal-Law-train.csv.
  - Questions and their correct answers were combined to form the knowledge base for retrieval (a sketch of the index build follows below).
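As a rough illustration of the index build, a minimal sketch using the OpenAI Python SDK; the CSV column names and the exact question/answer concatenation format are assumptions:

```python
import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

train_df = pd.read_csv("Criminal-Law-train.csv")

# Knowledge base: each training question concatenated with its correct option text.
# Column names here are assumed, not taken from the actual CSV schema.
docs = [
    f"{row['question']}\nAnswer: {row[row['answer']]}"
    for _, row in train_df.iterrows()
]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)  # one 1536-dim vector per knowledge-base entry
```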
A Random baseline was implemented.
- Method: Predict a label randomly from {A, B, C, D}.
- Reasoning: This represents the lower bound performance (chance level).
- Performance: 24.5% Accuracy (49/200).
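A minimal sketch of the baseline, reusing `LABELS` and `accuracy` from the task sketch above (the seed is arbitrary; the reported run's seed is not documented):

```python
import random

random.seed(42)  # arbitrary; the reported 24.5% came from an unspecified seed

def random_baseline(n: int) -> list[str]:
    """Chance-level predictor: uniform draw over the four labels."""
    return [random.choice(LABELS) for _ in range(n)]

preds = random_baseline(len(test_df))
print(f"{accuracy(preds, test_df['answer'].tolist()):.1%}")  # ~25% in expectation
```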
A Few-Shot RAG Pipeline was designed to maximize the efficiency of gpt-4o-mini.
Components:
- Retriever: `text-embedding-3-small` + k-NN (Top-K=10). Retrieves relevant legal precedents (see the sketch after this list).
- Generator: `gpt-4o-mini` with Chain-of-Thought (CoT) reasoning.
- Prompt Engineering:
  - Few-Shot: A worked example of the expected reasoning logic was provided.
  - CoT: The model was explicitly instructed to output its reasoning process (which is exposed in the results) before the final answer.
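A minimal retriever sketch under the same assumptions as the index build above; the few-shot block is an illustrative placeholder, not the project's actual exemplar:

```python
def retrieve(question: str, k: int = 10) -> list[str]:
    """Top-K nearest neighbours by cosine similarity over the index built above."""
    q = embed([question])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

# Placeholder few-shot exemplar; the real prompt used in the project is not shown here.
FEW_SHOT = (
    "Question: <example question>\n"
    "Options: A. ... / B. ... / C. ... / D. ...\n"
    "Reasoning: Article <n> of the Criminal Act states ..., so option B matches.\n"
    "Final Answer: B"
)
```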
Pipeline Stages:
- Retrieve: Fetch relevant laws/precedents for the input question.
- Reason: Model generates a step-by-step logical deduction based only on the retrieved context.
- Answer: Parse the final predicted label (A/B/C/D).
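Putting the three stages together, a hedged end-to-end sketch (the prompt wording, the `Final Answer:` parsing convention, and the fallback label are all assumptions):

```python
import re

def solve(question: str, options: dict[str, str]) -> str:
    context = "\n".join(retrieve(question))            # Stage 1: Retrieve
    prompt = (
        f"{FEW_SHOT}\n\n"
        f"Context (laws/precedents):\n{context}\n\n"
        f"Question: {question}\n"
        + "\n".join(f"{label}. {text}" for label, text in options.items())
        + "\nReason step by step using ONLY the context, "
          "then end with 'Final Answer: <A/B/C/D>'."
    )
    resp = client.chat.completions.create(             # Stage 2: Reason (CoT)
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    match = re.search(r"Final Answer:\s*([ABCD])", text)  # Stage 3: Answer (parse)
    return match.group(1) if match else "A"            # arbitrary fallback on parse failure
```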
The Naive Baseline, the proposed AI Pipeline, and a "Skyline" (Upper bound) model were compared.
| Method | Model | Retrieval | Technique | Accuracy |
|---|---|---|---|---|
| Naive Baseline | None | - | Random | 24.5% |
| AI Pipeline | `gpt-4o-mini` | Yes | Few-Shot + CoT | 40.0% |
| Control (Closed Book) | `gpt-4o` | No | Knowledge only | 70.0% |
Note: Pipeline accuracy is estimated on a subset of 50 samples due to time constraints.
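A sketch of how such a subset evaluation might look, reusing the helpers above (the sampling seed and method are assumptions; the actual 50-sample subset is not documented):

```python
subset = test_df.sample(n=50, random_state=0)  # assumed sampling; actual subset unknown
preds = [
    solve(row["question"], {label: row[label] for label in LABELS})
    for _, row in subset.iterrows()
]
print(f"Subset accuracy: {accuracy(preds, subset['answer'].tolist()):.1%}")
```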
Analysis:
- The AI Pipeline successfully outperforms the naive baseline (40.0% vs 24.5%).
- Using Few-Shot + CoT was critical; `gpt-4o-mini` without these techniques achieved only ~10% accuracy in preliminary tests.
Qualitative Output (Example):
- Question: "About the crime of ...?"
- Reasoning (Generated): "According to Article 250 in the context... and Precedent 2000도123... Therefore C is correct."
- Prediction: C (Correct)
Successes:
`gpt-4o-mini`'s performance was successfully boosted from near-random (~10%) to meaningful capability (~40%) using prompt engineering and RAG alone, without fine-tuning.
Limitations & Failure Analysis:
- Model Gap: Even with RAG, `gpt-4o-mini` (40.0%) lags significantly behind `gpt-4o` (70.0%). The smaller model struggles with complex legal logic even when the correct context is provided.
- Retrieval Noise: Sometimes irrelevant precedents are retrieved, confusing the smaller model.
Comparison with "Closed Book":
- Interestingly, `gpt-4o` alone (without RAG) achieved 70% accuracy. This suggests that for high-performance models, internal knowledge may suffice for this specific dataset. However, the pipeline shows that for cost-sensitive applications, "System" engineering can recover significant performance.