32 | 32 | "metadata": {}, |
33 | 33 | "source": [ |
34 | 34 | "## System Components\n", |
35 | | - "\n", |
36 | | - "1. **Embedding and Storage Module:**\n", |
37 | | - " - Utilizes ChromaDB to store text chunks along with their vector embeddings.\n", |
38 | | - "\n", |
39 | | - "2. **Graph Database Module:**\n", |
40 | | - " - Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n", |
41 | | - "\n", |
42 | | - "3. **Text Processing Module:**\n", |
43 | | - " - Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n", |
44 | | - "\n", |
45 | | - "4. **Summarization Module:**\n", |
46 | | - " - Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n", |
47 | | - "\n", |
48 | | - "5. **Retrieval Module:**\n", |
49 | | - " - Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n", |
50 | | - "\n", |
51 | | - "6. **File Ingestion Module:**\n", |
52 | | - " - Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system." |
| 35 | + "1. **Embedding and Storage Module:** Utilizes ChromaDB to store text chunks along with their vector embeddings.\n", |
| 36 | + "2. **Graph Database Module:** Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n", |
| 37 | + "3. **Text Processing Module:** Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n", |
| 38 | + "4. **Summarization Module:** Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n", |
| 39 | + "5. **Retrieval Module:** Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n", |
| 40 | + "6. **File Ingestion Module:** Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system.\n" |
53 | 41 | ] |
54 | | - }, |
| 42 | + }, |
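The similarity edges mentioned for the Graph Database Module can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the function names `cosine_similarity` and `similar_pairs` and the 0.8 threshold are hypothetical, and embeddings are assumed to be plain lists of floats.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similar_pairs(embeddings, threshold=0.8):
    # Compare every pair of chunk embeddings; each pair above the
    # threshold would become a similarity edge between nodes in Neo4j.
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

In the real system the pair list would be written to Neo4j as relationships between chunk nodes; the all-pairs loop shown here is O(n²) and only suitable for small corpora.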
55 | 43 | { |
56 | 44 | "cell_type": "markdown", |
57 | 45 | "metadata": {}, |
58 | 46 | "source": [ |
59 | 47 | "## How It Works\n", |
60 | | - "### Ingestion Steps:\n", |
61 | | - "1. **Start Ingestion:**\n", |
62 | | - " - Initiates the ingestion process for new data.\n", |
63 | | - " \n", |
64 | | - "2. **Split Texts into Chunks:**\n", |
65 | | - " - Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context.\n", |
66 | | - " \n", |
67 | | - "3. **Create Embeddings for Chunks:**\n", |
68 | | - " - Converts each text chunk into a numerical vector using a dedicated embedding model.\n", |
69 | | - "4. **Add Documents to ChromaDB:**\n", |
70 | | - " - Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval.\n", |
71 | | - "\n", |
72 | | - "5. **Summarize Each Chunk:**\n", |
73 | | - " - Generates a concise summary for each chunk using a language model.\n", |
74 | | - "\n", |
75 | | - "6. **Lemmatize Summaries:**\n", |
76 | | - " - Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis.\n", |
77 | | - "\n", |
78 | | - "7. **Create Lemma Embeddings:**\n", |
79 | | - " - Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content.\n", |
80 | | - "\n", |
81 | | - "8. **Create Nodes in Neo4j:**\n", |
82 | | - " - Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color.\n", |
83 | 48 | "\n", |
84 | | - "9. **Create Similarity Edges:**\n", |
85 | | - " - Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold.\n", |
86 | | - "\n", |
87 | | - "10. **End Ingestion:**\n", |
88 | | - " - Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established.\n", |
| 49 | + "### Ingestion Steps:\n", |
| 50 | + "1. **Start Ingestion:** Initiates the ingestion process for new data.\n", |
| 51 | + "2. **Split Texts into Chunks:** Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context.\n", |
| 52 | + "3. **Create Embeddings for Chunks:** Converts each text chunk into a numerical vector using a dedicated embedding model.\n", |
| 53 | + "4. **Add Documents to ChromaDB:** Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval.\n", |
| 54 | + "5. **Summarize Each Chunk:** Generates a concise summary for each chunk using a language model.\n", |
| 55 | + "6. **Lemmatize Summaries:** Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis.\n", |
| 56 | + "7. **Create Lemma Embeddings:** Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content.\n", |
| 57 | + "8. **Create Nodes in Neo4j:** Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color.\n", |
| 58 | + "9. **Create Similarity Edges:** Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold.\n", |
| 59 | + "10. **End Ingestion:** Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established.\n", |
89 | 60 | "\n", |
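The chunk-splitting step of ingestion can be sketched as follows. This is an illustrative sketch, not the project's implementation: the function name `split_into_chunks` and its default limits are assumptions; the source specifies only a maximum character limit and an overlap.

```python
def split_into_chunks(text, max_chars=1000, overlap=200):
    # Break a document into segments of at most max_chars characters.
    # Each segment starts (max_chars - overlap) after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which is what "preserve context" refers to in step 2.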
90 | 61 | "### Retrieval Steps:\n", |
91 | | - "\n", |
92 | | - "1. **Start Retrieval:**\n", |
93 | | - " - Initiates the process based on a user’s query.\n", |
94 | | - "\n", |
95 | | - "2. **Embed User Query:**\n", |
96 | | - " - Transforms the query into a vector using the same embedding model employed during ingestion.\n", |
97 | | - "\n", |
98 | | - "3. **Query ChromaDB for Top_k Chunks:**\n", |
99 | | - " - Retrieves the top k text chunks that are most similar to the query based on vector similarity.\n", |
100 | | - "\n", |
101 | | - "4. **Initialize BFS with Retrieved Chunks:**\n", |
102 | | - " - Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes.\n", |
103 | | - "\n", |
104 | | - "5. **Is Context Enough?:**\n", |
105 | | - " - Uses a language model to determine if the gathered context is sufficient to answer the query.\n", |
106 | | - " - If the context is sufficient, the system proceeds to generate the final answer.\n", |
107 | | - " - If not, BFS expands by exploring neighboring nodes with similarity scores above a threshold.\n", |
108 | | - "\n", |
109 | | - "6. **Generate Final Answer:**\n", |
110 | | - " - Compiles all relevant information into a final context and uses a language model to generate a detailed answer.\n", |
111 | | - "\n", |
112 | | - "7. **Return Final Answer:**\n", |
113 | | - " - Outputs the generated answer to the user.\n", |
114 | | - "\n", |
115 | | - "8. **End Retrieval:**\n", |
116 | | - " - Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached." |
| 62 | + "1. **Start Retrieval:** Initiates the process based on a user’s query.\n", |
| 63 | + "2. **Embed User Query:** Transforms the query into a vector using the same embedding model employed during ingestion.\n", |
| 64 | + "3. **Query ChromaDB for Top_k Chunks:** Retrieves the top k text chunks that are most similar to the query based on vector similarity.\n", |
| 65 | + "4. **Initialize BFS with Retrieved Chunks:** Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes.\n", |
| 66 | + "5. **Is Context Enough?:** Uses a language model to determine if the gathered context is sufficient to answer the query; if so, the system proceeds to generate the final answer; if not, BFS expands by exploring neighboring nodes with similarity scores above a threshold.\n", |
| 67 | + "6. **Generate Final Answer:** Compiles all relevant information into a final context and uses a language model to generate a detailed answer.\n", |
| 68 | + "7. **Return Final Answer:** Outputs the generated answer to the user.\n", |
| 69 | + "8. **End Retrieval:** Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached.\n" |
117 | 70 | ] |
118 | | - }, |
| 71 | + }, |
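The BFS expansion in the retrieval steps can be sketched in plain Python. This is a minimal sketch under stated assumptions: the graph is represented as an adjacency map rather than live Neo4j queries, and the function name `bfs_expand`, the 0.8 threshold, and the depth limit of 2 are illustrative, not taken from the source.

```python
from collections import deque

def bfs_expand(graph, seeds, threshold=0.8, max_depth=2):
    # graph: adjacency map {node_id: [(neighbor_id, similarity), ...]}
    # seeds: chunk ids returned by the initial ChromaDB query.
    # Expand outward from the seeds, following only similarity edges
    # above the threshold, up to max_depth hops.
    visited = set(seeds)
    queue = deque((s, 0) for s in seeds)
    order = list(seeds)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor, sim in graph.get(node, []):
            if sim >= threshold and neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order
```

In the described system, the "Is Context Enough?" check would run between BFS levels, stopping the expansion early once the language model judges the accumulated chunks sufficient.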
119 | 72 | { |
120 | 73 | "cell_type": "markdown", |
121 | 74 | "metadata": {}, |