Career-HY/Experiment

Career-HY RAG Experiment Pipeline

A pipeline for systematically experimenting with the parameters of the Career-HY RAG system in order to optimize retrieval performance.

📋 Table of Contents

  1. Project Overview
  2. Key Features
  3. Installation and Setup
  4. Usage
  5. Architecture
  6. Production Service Integration Guide
  7. Experiment Results and Analysis
  8. References

1. Project Overview

Purpose

  • Retrieval performance optimization: compare embedding models, chunking strategies, and retrieval algorithms
  • Systematic experiments: YAML-based configuration provides a reproducible experiment environment
  • GT dataset generation: automatic generation of a clustering-based Ground Truth dataset
  • Production use: provides StructuredDocumentLoader and extended metadata support

Highlights

  • 🐳 Docker-based: guarantees a consistent experiment environment
  • 💾 Embedding caching: saves API cost when re-running an experiment with the same settings
  • 📊 Multiple evaluation metrics: Recall@k, Precision@k, MRR, MAP, nDCG@k
  • 🔧 Modular architecture: easy to extend and maintain
  • 📝 YAML configuration: adjust experiment parameters without changing code
  • 🎯 Section-based chunking: uses the structured information in job postings (preferred qualifications, required qualifications, main duties)
  • 📈 Automatic GT generation: clustering-based Ground Truth dataset generation
  • 🔄 Unified pipeline: all GT generation features are combined into a single pipeline
  • ⚡ Dynamic evaluation: top_k is adjusted automatically according to the number of ground-truth documents
  • 📅 Extended metadata: supports deadline, start_date, crawling_time

2. Key Features

2.1 RAG Experiment Pipeline

A comprehensive pipeline for systematic, reproducible RAG experiments.

Core capabilities

  • Multiple embedding models: OpenAI, Snowflake, and others
  • Multiple chunking strategies: Recursive, Fixed, No Chunk
  • Vector search backends: ChromaDB, FAISS
  • Comprehensive evaluation: retrieval performance plus LangSmith qualitative evaluation

Evaluation metrics

  • Recall@k: the fraction of all relevant documents that appear in the top-k results
  • Precision@k: the fraction of the top-k results that are actually relevant
  • MRR@k: the mean over queries of the reciprocal rank of the first relevant document (within the top k)
  • MAP: an overall score that takes the rank of every relevant document into account
  • nDCG@k: a score that weights relevant documents more heavily the higher they are ranked
  • R-recall: Recall@(number of ground-truth documents), i.e. recall computed at a cutoff equal to the number of ground-truth documents
  • Hit@k_count: the number of ground-truth documents among the top-k results
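The per-query metrics listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the project's evaluator: the function names are hypothetical, relevance is assumed to be binary, and `retrieved` is assumed to be a ranked list of unique document IDs. MRR and MAP are then the means of `reciprocal_rank_at_k` and `average_precision` over all queries.

```python
import math
from typing import Sequence, Set


def hit_count_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> int:
    """Hit@k_count: number of ground-truth documents in the top-k results."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    return hit_count_at_k(retrieved, relevant, k) / k if k > 0 else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return hit_count_at_k(retrieved, relevant, k) / len(relevant)


def r_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Recall at a cutoff equal to the number of ground-truth documents."""
    return recall_at_k(retrieved, relevant, len(relevant))


def reciprocal_rank_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each relevant rank, divided by the number of relevant docs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision@k divides by k even when fewer than k documents are returned, which keeps Hit@k_count equal to k × Precision@k for a single query.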

Evaluation system improvements

  • Dynamic top_k: top_k is adjusted automatically to the number of ground-truth documents so that R-recall can be computed
    • evaluation_top_k = min(max(base_top_k, gt_count), 60)
    • Queries with many ground-truth documents can still be evaluated accurately

LangSmith qualitative evaluation

  • Recommendation Quality: overall quality of the recommendations
  • Personalization Score: degree of personalization
  • Response Helpfulness: how helpful the response is
  • Profile Alignment: how well the response matches the user profile

2.2 StructuredDocumentLoader

A section-based chunking system that uses the structured information in job postings.

Highlights

  • Section-based chunking: preferred qualifications (preferred), required qualifications (qualifications), and main duties (job_duties) are processed separately
  • JobPostParser: PDF parsing built on the unstructured library
  • Metadata preservation: the original JSON metadata is kept in full (including deadline, start_date, crawling_time)
  • Context injection: [회사: ...] (company) and [직무: ...] (job role) tags are added to every chunk automatically
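Context injection can be illustrated with a small sketch. The README only states that `[회사: ...]` (company) and `[직무: ...]` (job role) tags are added to each chunk; the function name `inject_context`, the choice of the `company` and `title` metadata fields as tag sources, and the newline separator are assumptions made for this example.

```python
def inject_context(chunk_text: str, metadata: dict) -> str:
    """Prefix a chunk with company / job-role tags when the metadata has them."""
    tags = []
    company = metadata.get("company")  # assumed source field for the company tag
    title = metadata.get("title")      # assumed source field for the job-role tag
    if company:
        tags.append(f"[회사: {company}]")
    if title:
        tags.append(f"[직무: {title}]")
    if not tags:
        return chunk_text  # nothing to inject; leave the chunk unchanged
    return " ".join(tags) + "\n" + chunk_text
```

Prefixing each chunk this way keeps the posting's identity inside the embedded text, so a section chunk such as a list of preferred skills still carries which company and role it belongs to.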

Statistics

  • Total chunks: 2,039 (from 1,473 documents)
  • Distribution by section:
    • preferred: 1,036 (50.8%)
    • qualifications: 398 (19.5%)
    • job_duties: 279 (13.7%)
    • full_text (fallback): 326 (16.0%)

Usage example

from implementations.loaders.structured_loader import StructuredDocumentLoader

loader = StructuredDocumentLoader(
    strategy="fast",  # or "hi_res"
    target_sections=["preferred", "qualifications", "job_duties"],
    include_context=True
)
chunks = loader.load_from_documents(documents)

2.2.1 Extended Metadata

Extended metadata, including the time-related fields of a job posting, is supported.

Supported metadata fields

  • deadline: application deadline of the job posting
  • start_date: start date of the job posting
  • crawling_time: time at which the data was crawled
  • Base fields: rec_idx, title, company, url, tags, and so on

Metadata preservation

  • StructuredDocumentLoader: every chunk keeps the full original metadata
  • Fallback chunks: both the lightweight and the regular fallback keep the metadata
  • Vector DB storage: the full metadata is stored in ChromaDB
  • Response generation: the Response Generator makes use of the metadata

Usage example

# Inspect chunk metadata
chunk = chunks[0]
print(chunk.metadata.get("deadline"))       # application deadline
print(chunk.metadata.get("start_date"))     # start date
print(chunk.metadata.get("crawling_time"))  # crawl time

Use in prompts

  • Deadline information is included in the prompt automatically
  • Time information can be used during response generation

2.3 GT Generation Pipeline

A system that automatically generates a clustering-based Ground Truth dataset.

Pipeline stages

  1. Phase 1: create initial clusters from mid-level categories
  2. Phase 2: merge similar mid-level categories (rule-based)
  3. Phase 3: select representative documents (based on query-document similarity)
  4. Phase 4: save statistics and results

Input data

  • clustering_results_tag_based/*_classification.json: classification results per top-level category
  • similarity_rules_template.json: merge rules for similar mid-level categories

Output data

  • gt_generation_results/gt_clusters.json: cluster information and representative documents
  • gt_generation_results/gt_clusters_summary.csv: cluster summary
  • gt_generation_results/gt_generation_statistics.txt: statistics

How to run

# Run the full pipeline (default)
python gt_generation_pipeline.py

# Convert CSV → JSONL
python gt_generation_pipeline.py --convert-csv data/gt.csv data/output.jsonl

# Generate evaluation data
python gt_generation_pipeline.py --create-eval data/gt_analysis.csv data/eval.jsonl

# Run rule validation only
python gt_generation_pipeline.py --validate-rules

Integrated features

  • Similarity rule validation: validity check and coverage analysis of the rule file
  • CSV → JSONL conversion: converts the GT CSV into the format used by the evaluation pipeline
  • Evaluation data generation: converts the GT Analysis CSV into an evaluation JSONL file
  • Representative document selection: query-document similarity computed with SentenceTransformer
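Similarity-based selection of representative documents can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: the embeddings are assumed to be precomputed (for example with a SentenceTransformer model), cosine similarity is assumed as the similarity measure, and the names `cosine_similarity` and `select_representatives` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_representatives(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_n: int = 1,
) -> List[str]:
    """Return the IDs of the top_n documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:top_n]
```

For a cluster with embeddings `{"a": [1, 0], "b": [0, 1], "c": [1, 1]}` and the query vector `[1, 0]`, `select_representatives(..., top_n=2)` ranks "a" first and "c" second.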

2.4 Clustering Pipeline

A system that clusters job postings by job role, based on tags and titles.

Key features

  • Tag-based classification: top-level and mid-level categories are assigned automatically
  • Multi-category support: a single document can belong to several top-level categories
  • UMAP + HDBSCAN: dimensionality reduction and clustering

How to run

python job_clustering_pipeline.py

Output

  • clustering_results_tag_based/: classification results per top-level category
  • clustering_results/: clustering results and visualizations

3. Installation and Setup

3.1 Prerequisites

  • Python 3.8 or later
  • Docker and Docker Compose
  • AWS credentials (for S3 access)
  • OpenAI API key (for embeddings and the LLM)

3.2 Installation

1. Clone the repository

git clone <repository-url>
cd Experiment

2. Set environment variables

Create a .env file:

# OpenAI API
OPENAI_API_KEY=your_api_key

# AWS S3
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_DEFAULT_REGION=ap-northeast-2

# LangSmith (optional)
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key

3. Build and run with Docker

docker-compose build
docker-compose up -d

4. Install dependencies (for local runs)

pip install -r requirements.txt
# Additional dependencies for clustering / GT generation
pip install -r requirements_clustering.txt

3.3 Directory Structure

Experiment/
├── configs/                 # Experiment configuration files
│   ├── baseline_search.yaml
│   ├── new_eval_baseline.yaml
│   └── new_eval_baseline_recursive.yaml
├── core/                   # Core pipeline
│   ├── interfaces/        # Abstract interfaces (ABC)
│   ├── pipeline.py        # Main experiment pipeline
│   └── config.py         # Configuration management
├── implementations/       # Concrete implementations
│   ├── embedders/        # Embedding models
│   ├── chunkers/         # Chunking strategies
│   ├── retrievers/       # Retrieval backends
│   ├── evaluators/       # Evaluation metric computation
│   ├── loaders/          # StructuredDocumentLoader
│   └── parsers/          # JobPostParser
├── utils/                # Utilities
│   ├── data_loader.py    # S3 data loading
│   ├── embedding_cache.py # Embedding cache
│   └── factory.py        # Component factory
├── data/                 # Ground Truth data
│   └── gt_eval_fullquery_cluster_ids.jsonl
├── cache/                # Embedding cache storage
├── results/              # Experiment results
├── run_experiment.sh     # Main experiment run script
├── docker-compose.yml    # Docker Compose configuration
├── Dockerfile           # Docker image definition
└── requirements.txt     # Python dependencies

4. Usage

4.1 Running an Experiment

1. Write a YAML configuration file

Create an experiment configuration file in the configs/ directory:

# Basic experiment information
experiment_name: "baseline"
description: "Baseline settings identical to the current service"
output_dir: "results"

# Embedding settings
embedder:
  type: "openai"
  model_name: "text-embedding-ada-002"
  batch_size: 5

# Chunking settings
chunker:
  type: "no_chunk"
  chunk_size: null
  chunk_overlap: null

# Retriever settings
retriever:
  type: "chroma"
  collection_name: "job-postings-baseline"
  persist_directory: "/tmp/chroma_baseline"
  top_k: 10

# LLM settings
llm:
  type: "openai"
  model_name: "gpt-4o-mini"
  temperature: 0.7
  max_tokens: 1000

# Data settings
data:
  s3_bucket: "career-hi"
  pdf_prefix: "initial-dataset/pdf/"
  json_prefix: "initial-dataset/json/"
  test_queries_path: "data/gt_eval_fullquery_cluster_ids.jsonl"
  use_structured_loader: false  # whether to use StructuredDocumentLoader

# Evaluation settings
evaluation:
  mode: "retrieval_only"  # or "dual"
  metrics: ["recall@k", "precision@k", "mrr", "map", "ndcg@k"]
  k_values: [1, 3, 5, 10]

2. Run the experiment

# Docker environment
./run_experiment.sh configs/baseline_search.yaml

# Local environment
python run_experiment.py configs/baseline_search.yaml

3. Check the results

  • results/: experiment result JSON files
  • LangSmith web interface: https://smith.langchain.com

4.1.1 Configuration Files

The project ships with configuration files for several experiment scenarios.

baseline_search.yaml

  • Purpose: baseline retrieval performance evaluation
  • Characteristics:
    • Uses Recursive Chunking (chunk_size: 700, chunk_overlap: 100)
    • Does not use StructuredDocumentLoader (the full text is parsed)
    • Data version: v3
  • Scenario: measuring the baseline retrieval benchmark

new_eval_baseline.yaml

  • Purpose: section-based chunking experiment using StructuredDocumentLoader
  • Characteristics:
    • Uses StructuredDocumentLoader (section-based chunking)
    • Chunking method: no_chunk (text is split at section boundaries only)
    • Target sections: preferred, qualifications, job_duties
    • Data version: v4
  • Scenario: measuring the effect of structured, section-based chunking

new_eval_baseline_recursive.yaml

  • Purpose: hybrid experiment combining StructuredDocumentLoader with Recursive Chunking
  • Characteristics:
    • Uses StructuredDocumentLoader (section-based chunking)
    • Applies Recursive Chunking on top (chunk_size: 500, chunk_overlap: 75)
    • Target sections: preferred, qualifications, job_duties
    • Data version: v4
  • Scenario: evaluating the hybrid chunking strategy

4.2 Generating the GT Dataset

1. Run clustering

python job_clustering_pipeline.py

2. Run the GT generation pipeline

python gt_generation_pipeline.py

3. Check the results

  • gt_generation_results/gt_clusters.json: cluster information
  • gt_generation_results/gt_clusters_summary.csv: summary
  • gt_generation_results/gt_generation_statistics.txt: statistics

4.3 Running Clustering

Basic run

python job_clustering_pipeline.py

Expected duration

  • Data loading: ~5 seconds
  • Embedding generation: ~2-3 minutes (model download on first run: +30 seconds)
  • UMAP dimensionality reduction: ~30 seconds
  • HDBSCAN clustering: ~10 seconds
  • Total: about 3-4 minutes

Output

  • clustering_results_tag_based/: classification results per top-level category
  • clustering_results/: clustering results and visualizations

5. Architecture

5.1 Module Structure

Core modules

  • core/pipeline.py: main experiment pipeline
  • core/config.py: configuration management
  • core/interfaces/: abstract interface definitions

Implementation modules

  • implementations/embedders/: embedding model implementations
  • implementations/chunkers/: chunking strategy implementations
  • implementations/retrievers/: retrieval backend implementations
  • implementations/evaluators/: evaluation metric computation
  • implementations/loaders/: StructuredDocumentLoader
  • implementations/parsers/: JobPostParser

Utility modules

  • utils/data_loader.py: S3 data loading
  • utils/embedding_cache.py: embedding caching
  • utils/factory.py: component factory

5.2 Data Flow

Load data from S3
    ↓
StructuredDocumentLoader (section-based chunking)
    ↓
Generate embeddings (cached)
    ↓
Store in vector DB (ChromaDB/FAISS)
    ↓
Retrieve and evaluate
    ↓
Save results

5.3 Evaluation System

Retrieval evaluation

  • Measures retrieval performance over all queries
  • Supported metrics: Recall@k, Precision@k, MRR@k, MAP, nDCG@k, R-recall, Hit@k_count
  • Dynamic top_k: the retrieval depth is adjusted automatically to the number of ground-truth documents (up to 60)
  • Evaluation order: metrics are computed and printed in the order requested by the user

Generation evaluation (LangSmith)

  • Generates responses for sample queries and evaluates them qualitatively
  • Profile-based Sampling (15 unique profiles)
  • Four qualitative metrics:
    • Recommendation Quality: overall quality of the recommendations
    • Personalization Score: degree of personalization
    • Response Helpfulness: how helpful the response is
    • Profile Alignment: how well the response matches the user profile

5.5 Experiment Results and Analysis

5.5.1 Problems and Improvements

Previous problems

  1. Unreliable GT: the previous GT was limited to 5 ground-truth documents per query, which made it unreliable
  2. Noise: naive text chunking pulled in meaningless text, so the chunks were noisy

Improvements

  1. Tag-based clustering: ground-truth clusters are built from the tag information in the postings
  2. StructuredDocumentLoader: section-based chunking built on JobPostParser
    • Chunks are separated by section type: preferred, qualifications, job_duties
    • Every chunk carries its section type as metadata

5.5.2 Experiment Setup

1) Baseline: Recursive Chunking

  • Configuration file: configs/baseline_search.yaml
  • Chunker: recursive (chunk_size: 700, chunk_overlap: 100)
  • StructuredDocumentLoader: not used
  • Goal: measure the baseline retrieval benchmark
  • Data version: v3
  • Processed documents: 4,121

2) StructuredDocumentLoader (section-based chunking)

  • Configuration file: configs/new_eval_baseline.yaml
  • Chunker: no_chunk (chunking is done by StructuredDocumentLoader)
  • StructuredDocumentLoader: used (use_structured_loader: true)
  • Target sections: preferred, qualifications, job_duties
  • Goal: measure the effect of section-based chunking
  • Data version: v4
  • Processed documents: 2,503

3) StructuredDocumentLoader + Recursive Chunking

  • Configuration file: configs/new_eval_baseline_recursive.yaml
  • Chunker: recursive (chunk_size: 500, chunk_overlap: 75)
  • StructuredDocumentLoader: used (use_structured_loader: true)
  • Target sections: preferred, qualifications, job_duties
  • Goal: measure the effect of the section-based + Recursive chunking hybrid
  • Data version: v4
  • Processed documents: 3,124

5.5.3 Retrieval Performance Comparison

| Metric | Baseline | Section chunking | Section+Recursive | Best |
|---|---|---|---|---|
| ndcg@10 | 0.2205 | 0.2591 | 0.2547 | ✅ Section chunking |
| mrr@10 | 0.4227 | 0.4606 | 0.4468 | ✅ Section chunking |
| precision@3 | 0.2479 | 0.2308 | 0.2350 | ✅ Baseline |
| precision@5 | 0.2128 | 0.2205 | 0.2077 | ✅ Section chunking |
| precision@10 | 0.1628 | 0.1679 | 0.1667 | ✅ Section chunking |
| precision@20 | 0.1314 | 0.1353 | 0.1321 | ✅ Section chunking |
| recall@10 | 0.1061 | 0.1086 | 0.1090 | ✅ Section+Recursive |
| recall@20 | 0.1547 | 0.1609 | 0.1591 | ✅ Section chunking |
| r_recall | 0.1204 | 0.1345 | 0.1297 | ✅ Section chunking |
| hit@10_count | 1.628 | 1.680 | 1.667 | ✅ Section chunking |
| hit@20_count | 2.628 | 2.705 | 2.641 | ✅ Section chunking |

Conclusion: section-based chunking scores best on most retrieval metrics (9 of the 11 reported).

5.5.4 Generation Quality Comparison (LangSmith evaluation)

| Metric | Baseline | Section chunking | Section+Recursive | Best |
|---|---|---|---|---|
| recommendation_quality | 4.2 | 4.2 | 4.3 | ✅ Section+Recursive |
| personalization_score | 4.5 | 4.5 | 4.4 | ✅ Baseline / Section chunking |
| response_helpfulness | 4.4 | 4.2 | 4.3 | ✅ Baseline |
| profile_alignment | 4.0 | 4.0 | 4.1 | ✅ Section+Recursive |

Conclusion: generation quality is similar across all three methods. Section+Recursive is marginally ahead, but the differences are very small (at most 0.2 points).

5.5.5 GT Dataset Statistics

GT versions

Previous GT (v1): a GT generated by an Agent for quantitative RAG evaluation. For each GT entry, one job posting is picked at random from the initially collected job posting dataset, and the 4 most similar postings are added to it, giving 5 postings in total. The Agent reviews the 5 postings and generates a student profile that would plausibly have applied to them. During this step it uses, as a tool, an API that searches the Hanyang University course catalog.

New GT (v2): a GT generated through tag-based clustering. Every job posting in the same cluster is treated as a correct answer, so the GT size varies per query. It contains 19.8 ground-truth documents on average and up to 57. Because it covers the whole cluster, it is more reliable than the previous GT.

Dataset structure

  • File: data/gt_eval_fullquery_cluster_ids.jsonl
  • Total queries: 79
  • Average GT documents per query: 19.8
  • Maximum GT documents: 57
  • Minimum GT documents: 1

GT characteristics

  • Cluster-based: all documents in the same cluster are used as GT
  • Variable size: the number of GT documents differs per query
  • Overlap allowed: the same document can appear in the GT of several queries

Query text structure

The field labels in the query text are Korean. In order, they mean: question, major, job roles of interest, certifications, clubs/extracurricular activities, and course history.

질문: [user question]
전공: [major]
관심 직무: [job roles of interest]
자격증: [list of certifications]
동아리/대외활동: [activity history]
수강 이력:
[course name] | [core competencies] | [course overview] | [learning objectives]
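Assembling a query in this layout from a structured profile might look like the sketch below. The field labels follow the template above; the function name `build_query_text` and every dictionary key (`question`, `major`, `interests`, `certifications`, `activities`, `courses`, and the per-course keys) are hypothetical names chosen for this example, not the project's schema.

```python
def build_query_text(profile: dict) -> str:
    """Render a student profile in the query text layout shown above."""
    lines = [
        f"질문: {profile.get('question', '')}",
        f"전공: {profile.get('major', '')}",
        f"관심 직무: {profile.get('interests', '')}",
        f"자격증: {profile.get('certifications', '')}",
        f"동아리/대외활동: {profile.get('activities', '')}",
        "수강 이력:",
    ]
    # One line per course: name | core competencies | overview | objectives
    for course in profile.get("courses", []):
        lines.append(
            " | ".join(
                [
                    course.get("name", ""),
                    course.get("competencies", ""),
                    course.get("overview", ""),
                    course.get("objectives", ""),
                ]
            )
        )
    return "\n".join(lines)
```

Keeping the whole profile in one text block means the query is embedded with the same model and in the same way as the job posting chunks.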

5.5.6 Dynamic top_k

Problem

  • R-recall cannot be computed when the number of GT documents exceeds top_k
  • Example: with 32 GT documents and top_k=20, recall is at most 20/32 = 0.625

Solution

gt_count = len(ground_truth)
evaluation_top_k = min(max(base_top_k, gt_count), 60)
# Use the larger of the base top_k and the GT count, capped at 60

This makes it possible to compute R-recall accurately for a variable-size GT.
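The rule can be wrapped in a small helper to see how it behaves at the boundaries. The function name and the `cap` parameter are illustrative. The formula itself is the one given in this section.

```python
def evaluation_top_k(base_top_k: int, gt_count: int, cap: int = 60) -> int:
    """Retrieval depth for one query: at least base_top_k, at least gt_count, at most cap."""
    return min(max(base_top_k, gt_count), cap)


# GT larger than the base top_k: the depth grows to cover every GT document.
print(evaluation_top_k(20, 32))  # 32
# GT smaller than the base top_k: the base value is kept.
print(evaluation_top_k(20, 5))   # 20
# GT larger than the cap: the depth is clamped to the cap.
print(evaluation_top_k(20, 75))  # 60
```

With the cap at 60 and a maximum of 57 GT documents in the current dataset, the cap never binds, so every query can reach an R-recall of 1.0 in principle.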


6. Production Service Integration Guide

6.1 Overview

This section describes how to apply the features developed in ForkExperiment to the production Career-HY service.

6.2 Step-by-Step Integration

Step 1: Add the dependency

pip install "unstructured[pdf]>=0.10.0"

Step 2: Add files

  • implementations/loaders/structured_loader.py → loaders/structured_loader.py in the service
  • implementations/parsers/job_post_parser.py → parsers/job_post_parser.py in the service

Step 3: Update the data loader

  • Include deadline, start_date, crawling_time from the S3 JSON metadata
  • Keep the original metadata in full

Step 4: Change the chunking logic

from loaders.structured_loader import StructuredDocumentLoader

loader = StructuredDocumentLoader(
    strategy="fast",
    target_sections=["preferred", "qualifications", "job_duties"],
    include_context=True
)
chunks = loader.load_from_documents(documents)

Step 5: Update vector DB storage

  • Store the full metadata (primitive types only)
  • Check ChromaDB/FAISS compatibility
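Restricting metadata to primitive types could be done with a small filter applied before each chunk is written to the vector DB. This is a sketch under assumptions: the function name `sanitize_metadata` is hypothetical, and the choices to drop `None` values, join lists into a comma-separated string, and JSON-encode anything else are illustrative rather than the project's documented behavior.

```python
import json

# Value types assumed to be storable as vector DB metadata.
PRIMITIVE_TYPES = (str, int, float, bool)


def sanitize_metadata(metadata: dict) -> dict:
    """Return a copy of metadata that contains only primitive values."""
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue  # missing fields are omitted rather than stored as None
        if isinstance(value, PRIMITIVE_TYPES):
            clean[key] = value
        elif isinstance(value, (list, tuple)):
            # e.g. tags: ["backend", "python"] -> "backend, python"
            clean[key] = ", ".join(str(item) for item in value)
        else:
            # Nested structures are kept as a JSON string.
            clean[key] = json.dumps(value, ensure_ascii=False, default=str)
    return clean
```

Reading a field back with `metadata.get("deadline")` then returns `None` for postings where the field was absent, which matches the missing-metadata handling described in the precautions.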

Step 6: Update the Response Generator

  • Add deadline, start_date, crawling_time fields to the RecommendedJob model
  • Extract the metadata from the retrieval results

Step 7: Update the prompt

  • Include time information in the prompt

6.3 Migration Checklist

Preparation

  • Understand the structure of the current service codebase
  • Check the structure of the S3 JSON metadata
  • Back up the existing vector DB
  • Prepare a test environment

Code integration

  • Add the StructuredDocumentLoader file
  • Add the JobPostParser file
  • Update the data loader
  • Change the chunking logic
  • Check the vector DB storage logic
  • Update the Response Generator
  • Update the prompt builder

Testing

  • Write unit tests
  • Write integration tests
  • Run performance tests
  • Prepare A/B tests

Deployment

  • Deploy to the staging environment
  • Set up monitoring
  • Deploy to production
  • Prepare a rollback plan

6.4 Precautions

Performance considerations

  • Parsing speed: the fast strategy is faster than hi_res (roughly 10x)
  • Accuracy: hi_res is more accurate but slower
  • Recommendation: use fast in production

Compatibility

  • The existing vector DB may not be compatible with the new structure
  • A migration or a new collection is required

Error handling

  • A fallback is needed when PDF parsing fails
  • Missing metadata should be handled as None

See SERVICE_INTEGRATION_GUIDE.md for details (to be moved into the consolidated document later).


7. References

7.1 Documents

  • EXPERIMENT_SUMMARY_20251203.md: experiment summary and changes
  • GT_GENERATION_STRATEGY.md: details of the GT generation strategy
  • CLUSTERING_README.md: clustering guide
  • STRUCTURED_CHUNKS_REPORT.md: structured chunk report

7.2 External Resources

7.3 Data Sources

  • S3 bucket: career-hi
  • PDF path: initial-dataset/pdf/ (1,473 files)
  • JSON path: initial-dataset/json/ (1,473 files)
  • Ground Truth: data/gt_eval_fullquery_cluster_ids.jsonl (79 queries)

📝 License

This project was developed by the Career-HY team for internal experiments.


Date: 2025-12-03
Version: 2.0 (consolidated document)
