Commit 72def34

jefcodercforge42
authored and committed
update
1 parent 1d54132 commit 72def34

1 file changed

+27
-74
lines changed

tools/rag/hybrid_vector_graph_rag/hybrid_vector_graph_rag.ipynb

Lines changed: 27 additions & 74 deletions
@@ -32,90 +32,43 @@
  "metadata": {},
  "source": [
  "## System Components\n",
- "\n",
- "1. **Embedding and Storage Module:**\n",
- " - Utilizes ChromaDB to store text chunks along with their vector embeddings.\n",
- "\n",
- "2. **Graph Database Module:**\n",
- " - Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n",
- "\n",
- "3. **Text Processing Module:**\n",
- " - Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n",
- "\n",
- "4. **Summarization Module:**\n",
- " - Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n",
- "\n",
- "5. **Retrieval Module:**\n",
- " - Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n",
- "\n",
- "6. **File Ingestion Module:**\n",
- " - Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system."
+ "1. **Embedding and Storage Module:**: Utilizes ChromaDB to store text chunks along with their vector embeddings.\n",
+ "2. **Graph Database Module:**: Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n",
+ "3. **Text Processing Module:**: Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n",
+ "4. **Summarization Module:**: Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n",
+ "5. **Retrieval Module:**: Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n",
+ "6. **File Ingestion Module:**: Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system.\n"
  ]
- },
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
  "## How It Works\n",
- "### Ingestion Steps:\n",
- "1. **Start Ingestion:**\n",
- " - Initiates the ingestion process for new data.\n",
- " \n",
- "2. **Split Texts into Chunks:**\n",
- " - Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context.\n",
- " \n",
- "3. **Create Embeddings for Chunks:**\n",
- " - Converts each text chunk into a numerical vector using a dedicated embedding model.\n",
- "4. **Add Documents to ChromaDB:**\n",
- " - Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval.\n",
- "\n",
- "5. **Summarize Each Chunk:**\n",
- " - Generates a concise summary for each chunk using a language model.\n",
- "\n",
- "6. **Lemmatize Summaries:**\n",
- " - Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis.\n",
- "\n",
- "7. **Create Lemma Embeddings:**\n",
- " - Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content.\n",
- "\n",
- "8. **Create Nodes in Neo4j:**\n",
- " - Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color.\n",
  "\n",
- "9. **Create Similarity Edges:**\n",
- " - Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold.\n",
- "\n",
- "10. **End Ingestion:**\n",
- " - Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established.\n",
+ "### Ingestion Steps:\n",
+ "1. **Start Ingestion:**: Initiates the ingestion process for new data\n",
+ "2. **Split Texts into Chunks:**: Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context\n",
+ "3. **Create Embeddings for Chunks:**: Converts each text chunk into a numerical vector using a dedicated embedding model\n",
+ "4. **Add Documents to ChromaDB:**: Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval\n",
+ "5. **Summarize Each Chunk:**: Generates a concise summary for each chunk using a language model\n",
+ "6. **Lemmatize Summaries:**: Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis\n",
+ "7. **Create Lemma Embeddings:**: Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content\n",
+ "8. **Create Nodes in Neo4j:**: Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color\n",
+ "9. **Create Similarity Edges:**: Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold\n",
+ "10. **End Ingestion:**: Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established\n",
  "\n",
  "### Retrieval Steps:\n",
- "\n",
- "1. **Start Retrieval:**\n",
- " - Initiates the process based on a user’s query.\n",
- "\n",
- "2. **Embed User Query:**\n",
- " - Transforms the query into a vector using the same embedding model employed during ingestion.\n",
- "\n",
- "3. **Query ChromaDB for Top_k Chunks:**\n",
- " - Retrieves the top k text chunks that are most similar to the query based on vector similarity.\n",
- "\n",
- "4. **Initialize BFS with Retrieved Chunks:**\n",
- " - Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes.\n",
- "\n",
- "5. **Is Context Enough?:**\n",
- " - Uses a language model to determine if the gathered context is sufficient to answer the query.\n",
- " - If the context is sufficient, the system proceeds to generate the final answer.\n",
- " - If not, BFS expands by exploring neighboring nodes with similarity scores above a threshold.\n",
- "\n",
- "6. **Generate Final Answer:**\n",
- " - Compiles all relevant information into a final context and uses a language model to generate a detailed answer.\n",
- "\n",
- "7. **Return Final Answer:**\n",
- " - Outputs the generated answer to the user.\n",
- "\n",
- "8. **End Retrieval:**\n",
- " - Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached."
+ "1. **Start Retrieval:**: Initiates the process based on a user’s query\n",
+ "2. **Embed User Query:**: Transforms the query into a vector using the same embedding model employed during ingestion\n",
+ "3. **Query ChromaDB for Top_k Chunks:**: Retrieves the top k text chunks that are most similar to the query based on vector similarity\n",
+ "4. **Initialize BFS with Retrieved Chunks:**: Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes\n",
+ "5. **Is Context Enough?:**: Uses a language model to determine if the gathered context is sufficient to answer the query; if the context is sufficient, the system proceeds to generate the final answer; if not, BFS expands by exploring neighboring nodes with similarity scores above a threshold\n",
+ "6. **Generate Final Answer:**: Compiles all relevant information into a final context and uses a language model to generate a detailed answer\n",
+ "7. **Return Final Answer:**: Outputs the generated answer to the user\n",
+ "8. **End Retrieval:**: Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached\n"
  ]
- },
+ },
  {
  "cell_type": "markdown",
  "metadata": {},
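
Note on the ingestion steps described in the changed cell: the chunking step (maximum character limit plus overlap) and the similarity-edge step (cosine similarity above a threshold) could look roughly like the sketch below. It is not taken from the notebook. The function names, default values, and the plain-dict representation of embeddings are assumptions made for illustration, and the sketch uses only the standard library rather than ChromaDB or Neo4j.

```python
import math


def split_into_chunks(text, max_chars=200, overlap=50):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that context
    spanning a boundary is not lost. (Defaults are illustrative only.)
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(embeddings, threshold=0.8):
    """Return (id_a, id_b, score) for every pair whose similarity exceeds threshold.

    `embeddings` is a hypothetical {chunk_id: vector} mapping standing in
    for whatever the embedding model produces.
    """
    ids = list(embeddings)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = cosine_similarity(embeddings[a], embeddings[b])
            if score > threshold:
                edges.append((a, b, score))
    return edges
```

The pairwise loop is quadratic in the number of chunks, which is fine for a small corpus; a larger one would likely need an approximate nearest-neighbour query instead.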

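Note on the retrieval steps described in the changed cell: the BFS expansion from the initially retrieved chunks, gated by a sufficiency check and a maximum depth, might be sketched as follows. This is an illustration under stated assumptions, not the notebook's code. The graph is a plain adjacency dict instead of Neo4j, and `is_enough` is a hypothetical callable standing in for the language-model check on whether the context suffices.

```python
def bfs_expand(graph, seeds, is_enough, threshold=0.8, max_depth=2):
    """Expand a context level by level from the seed chunks.

    graph:     {node: [(neighbor, similarity_score), ...]} (assumed shape)
    seeds:     chunk ids returned by the initial vector search
    is_enough: callable taking the list of gathered chunk ids and
               returning True when the context suffices (placeholder
               for the language-model judgment)
    """
    visited = list(dict.fromkeys(seeds))  # keep order, drop duplicates
    seen = set(visited)
    frontier = list(visited)
    depth = 0
    while frontier and depth < max_depth and not is_enough(visited):
        next_frontier = []
        for node in frontier:
            for neighbor, score in graph.get(node, []):
                # Follow only edges whose similarity exceeds the threshold.
                if score > threshold and neighbor not in seen:
                    seen.add(neighbor)
                    visited.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
        depth += 1
    return visited


# Usage with a toy graph and a size-based stand-in for the sufficiency check.
toy_graph = {
    "a": [("b", 0.9), ("c", 0.5)],
    "b": [("d", 0.95)],
    "d": [("e", 0.99)],
}
context_ids = bfs_expand(toy_graph, ["a"], lambda ctx: len(ctx) >= 3, max_depth=5)
```

Expansion stops as soon as the check passes, the frontier is empty, or the depth limit is reached, matching the end condition in step 8.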