| layout | default |
|---|---|
| title | Chapter 4: RAG Implementation |
| nav_order | 4 |
| has_children | false |
| parent | Dify Platform Deep Dive |
Welcome to Chapter 4: RAG Implementation. In this part of Dify Platform: Deep Dive Tutorial, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Building retrieval-augmented generation systems with Dify's document processing and vector search capabilities
By the end of this chapter, you'll be able to:
- Understand the principles of retrieval-augmented generation (RAG)
- Implement document ingestion and processing pipelines
- Configure vector databases for semantic search
- Build RAG-enabled chatbots and Q&A systems
- Optimize retrieval quality and generation accuracy
RAG combines the strengths of retrieval-based and generation-based AI approaches.
Traditional LLMs are limited by their training data cutoff and lack of specific domain knowledge. RAG addresses this by:
- Dynamic Knowledge Access: Retrieve relevant information from external sources
- Reduced Hallucinations: Ground responses in factual, up-to-date information
- Domain Adaptation: Easily customize for specific industries or knowledge domains
- Cost Efficiency: Avoid expensive model retraining for new information
graph TD
subgraph "Ingestion Phase"
A[Documents] --> B[Text Extraction]
B --> C[Chunking]
C --> D[Embedding Generation]
D --> E[Vector Storage]
end
subgraph "Retrieval Phase"
F[User Query] --> G[Query Embedding]
G --> H[Vector Search]
H --> I[Relevant Chunks]
I --> J[Context Assembly]
end
subgraph "Generation Phase"
J --> K[Prompt Engineering]
K --> L[LLM Generation]
L --> M[Response]
end
E -.-> H
style A fill:#e1f5fe
style F fill:#e1f5fe
style M fill:#c8e6c9
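Read top to bottom, the diagram corresponds to three code-level steps. Below is a minimal, provider-agnostic sketch; the embed, vector_store, and llm arguments are hypothetical stand-ins, not Dify APIs:

# Minimal end-to-end RAG sketch. embed(), vector_store, and llm are hypothetical
# stand-ins for an embedding provider, a vector database client, and a chat model.

def ingest(documents, embed, vector_store, chunk_size=512):
    """Ingestion phase: extract -> chunk -> embed -> store."""
    for doc in documents:
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
        for chunk in chunks:
            vector_store.add(vector=embed(chunk), payload={"content": chunk})

def answer(query, embed, vector_store, llm, top_k=3):
    """Retrieval + generation phases: embed the query, search, assemble context, generate."""
    hits = vector_store.search(vector=embed(query), limit=top_k)
    context = "\n\n".join(hit["content"] for hit in hits)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm(prompt)

The sections below fill in each of these steps: ingestion and chunking, embeddings, vector storage, retrieval, and generation.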
Dify supports multiple document formats and ingestion methods:
# Document Ingestion Configuration
document_ingestion_config = {
"supported_formats": [
"pdf", "docx", "txt", "md", "html",
"csv", "json", "xml", "rtf"
],
"processing_options": {
"extract_metadata": True,
"preserve_formatting": False,
"handle_tables": True,
"process_images": False # Text-only extraction
},
"chunking_strategy": {
"method": "semantic", # semantic, fixed, sentence
"chunk_size": 512,
"overlap": 50,
"separator": "\n\n"
}
}

Effective chunking is crucial for retrieval quality:
def fixed_size_chunking(text, chunk_size=512, overlap=50):
    """Split text into fixed-size chunks with overlap."""
    chunks = []
    # Advance by chunk_size minus overlap so consecutive chunks share `overlap` characters
    step = max(1, chunk_size - overlap)
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def semantic_chunking(text, max_chunk_size=512):
"""Split text at semantic boundaries"""
import re
# Split on paragraph boundaries first
paragraphs = re.split(r'\n\s*\n', text)
chunks = []
current_chunk = ""
for paragraph in paragraphs:
# Check if adding this paragraph exceeds max size
if len(current_chunk + paragraph) > max_chunk_size:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = paragraph
else:
# Paragraph itself is too long, split by sentences
sentences = re.split(r'(?<=[.!?])\s+', paragraph)
temp_chunk = ""
for sentence in sentences:
if len(temp_chunk + sentence) > max_chunk_size:
if temp_chunk:
chunks.append(temp_chunk.strip())
temp_chunk = sentence
else:
temp_chunk += sentence + " "
if temp_chunk:
current_chunk = temp_chunk
else:
current_chunk += paragraph + "\n\n"
if current_chunk:
chunks.append(current_chunk.strip())
    return chunks

Dify supports multiple embedding providers:
# Embedding Configuration
embedding_config = {
"providers": {
"openai": {
"model": "text-embedding-ada-002",
"dimensions": 1536,
"max_tokens": 8191
},
"huggingface": {
"model": "sentence-transformers/all-MiniLM-L6-v2",
"dimensions": 384,
"max_tokens": 512
},
"cohere": {
"model": "embed-multilingual-v2.0",
"dimensions": 768,
"max_tokens": 2048
}
},
"default_provider": "openai",
"caching": {
"enabled": True,
"ttl_hours": 24
}
}

Dify integrates with popular vector databases:
# Vector Database Configuration
vector_db_config = {
"engine": "weaviate", # weaviate, pinecone, qdrant, chroma
"connection": {
"host": "localhost",
"port": 8080,
"api_key": "{{VECTOR_DB_API_KEY}}"
},
"index_settings": {
"metric": "cosine", # cosine, euclidean, dot_product
"ef_construction": 200,
"max_connections": 16
},
"collection_schema": {
"name": "documents",
"properties": [
{"name": "content", "type": "text"},
{"name": "metadata", "type": "object"},
{"name": "source", "type": "text"},
{"name": "chunk_id", "type": "int"}
]
}
}

A hybrid retriever combines semantic vector search with keyword (BM25) search; the sketch below assumes an embed_query() helper that wraps the configured embedding provider:

class HybridRetriever:
def __init__(self, vector_db, bm25_index):
self.vector_db = vector_db
self.bm25_index = bm25_index
self.weights = {"semantic": 0.7, "keyword": 0.3}
def retrieve(self, query, top_k=5):
"""Combine semantic and keyword search"""
# Semantic search
semantic_results = self.vector_db.search(
query_embedding=self.embed_query(query),
limit=top_k * 2 # Get more candidates
)
# Keyword search
keyword_results = self.bm25_index.search(query, limit=top_k * 2)
# Combine and rerank
combined_scores = {}
for doc_id, score in semantic_results:
combined_scores[doc_id] = score * self.weights["semantic"]
for doc_id, score in keyword_results:
if doc_id in combined_scores:
combined_scores[doc_id] += score * self.weights["keyword"]
else:
combined_scores[doc_id] = score * self.weights["keyword"]
# Return top-k results
sorted_results = sorted(
combined_scores.items(),
key=lambda x: x[1],
reverse=True
)
        return sorted_results[:top_k]

Retrieved candidates can then be re-ranked with a cross-encoder:

class ReRanker:
def __init__(self, cross_encoder_model):
self.model = cross_encoder_model
def rerank(self, query, documents, top_k=3):
"""Re-rank documents using cross-encoder"""
pairs = [[query, doc['content']] for doc in documents]
# Get relevance scores
scores = self.model.predict(pairs)
# Sort by scores
ranked_docs = [
doc for _, doc in sorted(
zip(scores, documents),
key=lambda x: x[0],
reverse=True
)
]
        return ranked_docs[:top_k]

Let's build a complete document Q&A system.
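Before assembling it, here is a rough smoke test of the hybrid retriever defined above. The FakeVectorDB and FakeBM25Index classes are throwaway stand-ins for real clients, and the stubbed embed_query supplies the helper the retriever assumes:

# Throwaway stand-ins so the hybrid retriever can be exercised without real services.
class FakeVectorDB:
    def search(self, query_embedding, limit):
        return [("doc-1", 0.92), ("doc-2", 0.85), ("doc-3", 0.40)][:limit]

class FakeBM25Index:
    def search(self, query, limit):
        return [("doc-2", 0.70), ("doc-4", 0.30)][:limit]

retriever = HybridRetriever(FakeVectorDB(), FakeBM25Index())
retriever.embed_query = lambda q: [0.0] * 384  # stub embedding function

print(retriever.retrieve("How do I configure chunking?", top_k=3))
# -> doc ids ranked by the weighted blend of semantic and keyword scores

In a real deployment the keyword scores should be normalized to the same range as the cosine scores before blending, and the re-ranker would then run on the fetched chunk contents. With the retrieval pieces in place, the overall architecture looks like this: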
graph TD
A[Upload Documents] --> B[Document Processor]
B --> C[Text Extractor]
C --> D[Chunk Generator]
D --> E[Embedding Service]
E --> F[Vector Database]
G[User Question] --> H[Query Processor]
H --> I[Embedding Service]
I --> J[Vector Search]
J --> K[Context Retriever]
K --> L[Prompt Builder]
L --> M[LLM Generator]
M --> N[Response Formatter]
N --> O[Final Answer]
document_qa_workflow = {
"name": "Document Q&A Assistant",
"nodes": [
{
"id": "document_processor",
"type": "document_processor",
"config": {
"input_type": "file_upload",
"supported_formats": ["pdf", "docx", "txt"],
"chunking_strategy": {
"method": "semantic",
"chunk_size": 512,
"overlap": 50
}
}
},
{
"id": "embedding_generator",
"type": "embedding",
"config": {
"provider": "openai",
"model": "text-embedding-ada-002"
}
},
{
"id": "vector_store",
"type": "vector_db",
"config": {
"engine": "weaviate",
"collection": "documents"
}
},
{
"id": "query_processor",
"type": "llm",
"config": {
"model": "gpt-3.5-turbo",
"prompt": "Rephrase this question for better retrieval: {user_question}",
"temperature": 0.1
}
},
{
"id": "retriever",
"type": "vector_search",
"config": {
"top_k": 3,
"score_threshold": 0.7,
"rerank": True
}
},
{
"id": "answer_generator",
"type": "llm",
"config": {
"model": "gpt-4",
"prompt_template": """
Answer the question based on the provided context.
Context:
{retrieved_documents}
Question: {original_question}
Instructions:
- Use only information from the provided context
- If the context doesn't contain enough information, say so
- Be concise but comprehensive
- Include relevant quotes from the context
Answer:
""",
"temperature": 0.3,
"max_tokens": 800
}
}
],
"edges": [
{"from": "document_processor", "to": "embedding_generator"},
{"from": "embedding_generator", "to": "vector_store"},
{"from": "query_processor", "to": "retriever", "data": ["rephrased_question"]},
{"from": "retriever", "to": "answer_generator", "data": ["context"]},
{"from": "answer_generator", "to": "output"}
]
}

Query expansion improves recall by generating alternative phrasings of the user's question:

class QueryExpander:
def __init__(self, llm_client):
self.llm = llm_client
def expand_query(self, original_query):
"""Generate multiple query variations"""
prompt = f"""
Generate 3 different phrasings of this question that would help
find relevant information in a document collection:
Original: {original_query}
Variations:
1.
2.
3.
"""
response = self.llm.generate(prompt, temperature=0.7)
variations = self.parse_variations(response)
return [original_query] + variations
def parse_variations(self, response):
"""Extract query variations from LLM response"""
lines = response.strip().split('\n')
variations = []
for line in lines:
if line.startswith(('1.', '2.', '3.')):
variation = line[3:].strip()
if variation:
variations.append(variation)
        return variations

Context compression trims retrieved documents so they fit within the model's context window:

class ContextCompressor:
def __init__(self, llm_client):
self.llm = llm_client
def compress_context(self, query, documents, max_length=2000):
"""Compress retrieved documents to fit context window"""
# Combine all documents
full_context = "\n\n".join([doc['content'] for doc in documents])
if len(full_context) <= max_length:
return full_context
# Use LLM to summarize and compress
prompt = f"""
Summarize the following documents to answer this question: {query}
Documents:
{full_context}
Provide a concise summary that captures all relevant information,
limited to {max_length} characters.
"""
compressed = self.llm.generate(prompt, max_tokens=max_length//4)
        return compressed.strip()

Multi-hop retrieval chains several retrieval rounds for questions that need intermediate answers; the sketch below calls a summarize_docs() helper, which is assumed to condense the retrieved chunks:

class MultiHopRetriever:
def __init__(self, retriever, generator):
self.retriever = retriever
self.generator = generator
def retrieve_multi_hop(self, query, max_hops=2):
"""Perform multi-hop retrieval and reasoning"""
context = []
current_query = query
for hop in range(max_hops):
# Retrieve documents for current query
docs = self.retriever.retrieve(current_query, top_k=3)
# Add to context
context.extend(docs)
# Generate follow-up question if not last hop
if hop < max_hops - 1:
follow_up_prompt = f"""
Based on the retrieved information, what additional
information do we need to fully answer: {query}
Retrieved so far: {summarize_docs(docs)}
Follow-up question:
"""
current_query = self.generator.generate(follow_up_prompt).strip()
        return context

Retrieval quality can be tracked with standard ranking metrics such as precision, recall, and NDCG (note the math import needed for the NDCG computation):

import math

class RAGEvaluator:
def __init__(self):
self.metrics = {}
def evaluate_retrieval(self, queries, ground_truth, retrieved_docs):
"""Evaluate retrieval quality"""
precision_scores = []
recall_scores = []
ndcg_scores = []
for query, truth, retrieved in zip(queries, ground_truth, retrieved_docs):
# Precision@K
precision = self.precision_at_k(retrieved, truth, k=5)
# Recall@K
recall = self.recall_at_k(retrieved, truth, k=5)
# NDCG@K
ndcg = self.ndcg_at_k(retrieved, truth, k=5)
precision_scores.append(precision)
recall_scores.append(recall)
ndcg_scores.append(ndcg)
return {
'precision@5': sum(precision_scores) / len(precision_scores),
'recall@5': sum(recall_scores) / len(recall_scores),
'ndcg@5': sum(ndcg_scores) / len(ndcg_scores)
}
def precision_at_k(self, retrieved, relevant, k):
"""Calculate precision at k"""
retrieved_at_k = retrieved[:k]
relevant_retrieved = len(set(retrieved_at_k) & set(relevant))
return relevant_retrieved / k if k > 0 else 0
def recall_at_k(self, retrieved, relevant, k):
"""Calculate recall at k"""
retrieved_at_k = retrieved[:k]
relevant_retrieved = len(set(retrieved_at_k) & set(relevant))
return relevant_retrieved / len(relevant) if relevant else 0
def ndcg_at_k(self, retrieved, relevant, k):
"""Calculate NDCG at k"""
retrieved_at_k = retrieved[:k]
dcg = 0
for i, doc in enumerate(retrieved_at_k):
if doc in relevant:
dcg += 1 / math.log2(i + 2)
# Calculate IDCG
idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
        return dcg / idcg if idcg > 0 else 0

Generation quality can be scored with an LLM-as-judge approach; the sketch assumes a parse_evaluation() helper that extracts the numeric scores from the judge's response:

class GenerationEvaluator:
def __init__(self, llm_judge):
self.llm_judge = llm_judge
def evaluate_answer_quality(self, question, answer, context):
"""Evaluate answer quality using LLM-as-judge"""
prompt = f"""
Evaluate the quality of this answer to the question.
Question: {question}
Answer: {answer}
Context: {context[:500]}...
Rate on a scale of 1-5 for:
1. Factual accuracy (correct information)
2. Completeness (answers the question fully)
3. Relevance (stays on topic)
4. Groundedness (based on provided context)
Provide scores and brief explanation.
"""
evaluation = self.llm_judge.generate(prompt)
# Parse scores from response
        return self.parse_evaluation(evaluation)

Estimated Time: 60 minutes
1. Build a Document Q&A System:
- Upload a collection of documents (PDFs, docs, or text files)
- Configure document processing and chunking
- Set up vector embeddings and storage
- Create a retrieval-augmented Q&A interface
2. Optimize Retrieval Quality:
- Experiment with different chunking strategies
- Try various embedding models
- Implement re-ranking and query expansion
- Compare retrieval metrics before and after optimization (see the evaluation sketch after this list)
3. Add Advanced Features:
- Implement multi-hop retrieval for complex questions
- Add context compression for long documents
- Create a feedback loop for continuous improvement
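For the retrieval-optimization exercise, here is a minimal sketch of how the RAGEvaluator above might be run on a small hand-labeled query set; the queries, ground-truth ids, and retrieved ids are made up for illustration:

evaluator = RAGEvaluator()

# For each query: the chunk ids a human judged relevant, and the ids your retriever returned.
queries = ["How does chunk overlap work?", "Which vector databases are supported?"]
ground_truth = [["chunk-12", "chunk-13"], ["chunk-40"]]
retrieved_docs = [["chunk-12", "chunk-07", "chunk-13"], ["chunk-40", "chunk-41"]]

print(evaluator.evaluate_retrieval(queries, ground_truth, retrieved_docs))
# -> {'precision@5': ..., 'recall@5': ..., 'ndcg@5': ...}

Running the same labeled set before and after a chunking or embedding change gives a concrete basis for the comparison asked for in the exercise.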
- RAG Architecture: Combines retrieval and generation for more accurate, up-to-date responses
- Document Processing: Effective chunking and preprocessing are crucial for retrieval quality
- Vector Search: Semantic similarity enables finding relevant information beyond keyword matching
- Optimization Techniques: Query expansion, re-ranking, and context compression improve performance
- Evaluation: Systematic metrics help measure and improve RAG system quality
With RAG fundamentals mastered, we're ready to explore Agent Framework in the next chapter, where we'll learn how to build autonomous agents that can use tools and make decisions.
Ready for autonomous agents? Continue to Chapter 5: Agent Framework
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for `self`, `query`, and `context` so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 4: RAG Implementation as an operating subsystem inside Dify Platform: Deep Dive Tutorial, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `documents`, `relevant`, and `retrieved` as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 4: RAG Implementation usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `self`.
- Input normalization: shape incoming data so `query` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `context`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
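One way to make those success/failure conditions explicit is to model the stages as an ordered list of small functions; the sketch below is generic Python, not a Dify API, and the stage bodies are placeholders:

# Generic staged control path with explicit per-stage success/failure reporting.
def run_pipeline(stages, ctx):
    for name, stage in stages:
        try:
            ctx = stage(ctx)
            print(f"[ok]   {name}")            # operational telemetry
        except Exception as exc:
            print(f"[fail] {name}: {exc}")     # explicit failure boundary
            raise
    return ctx

stages = [
    ("context bootstrap",   lambda ctx: {**ctx, "config": {"top_k": 3}}),
    ("input normalization", lambda ctx: {**ctx, "query": ctx["raw_query"].strip()}),
    ("core execution",      lambda ctx: {**ctx, "chunks": ["..."]}),  # retrieval + generation
    ("policy checks",       lambda ctx: ctx),                         # limits, auth scopes
    ("output composition",  lambda ctx: {**ctx, "answer": "..."}),
]

result = run_pipeline(stages, {"raw_query": "  what is RAG?  "})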
Use the following upstream sources to verify implementation details while reading this chapter:
- Dify. Why it matters: authoritative reference on `Dify` (github.com).
Suggested trace strategy:
- search upstream code for `self` and `query` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production