32 | 32 | "metadata": {}, |
33 | 33 | "source": [ |
34 | 34 | "## System Components\n", |
35 | | - "\n", |
36 | | - "1. **Embedding and Storage Module:**\n", |
37 | | - " - Utilizes ChromaDB to store text chunks along with their vector embeddings.\n", |
38 | | - "\n", |
39 | | - "2. **Graph Database Module:**\n", |
40 | | - " - Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n", |
41 | | - "\n", |
42 | | - "3. **Text Processing Module:**\n", |
43 | | - " - Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n", |
44 | | - "\n", |
45 | | - "4. **Summarization Module:**\n", |
46 | | - " - Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n", |
47 | | - "\n", |
48 | | - "5. **Retrieval Module:**\n", |
49 | | - " - Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n", |
50 | | - "\n", |
51 | | - "6. **File Ingestion Module:**\n", |
52 | | - " - Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system." |
| 35 | + "1. **Embedding and Storage Module:** Utilizes ChromaDB to store text chunks along with their vector embeddings.\n", |
| 36 | + "2. **Graph Database Module:** Leverages Neo4j to create nodes representing text chunks and to establish similarity edges based on cosine similarity.\n", |
| 37 | + "3. **Text Processing Module:** Employs SpaCy for text normalization and lemmatization, ensuring consistent analysis of the data.\n", |
| 38 | + "4. **Summarization Module:** Uses a language model to generate concise summaries of each text chunk, distilling the core content.\n", |
| 39 | + "5. **Retrieval Module:** Combines initial vector-based retrieval from ChromaDB with a Breadth-First Search (BFS) on the Neo4j graph to compile a rich context for answering queries.\n", |
| 40 | + "6. **File Ingestion Module:** Processes PDF and TXT files (via dedicated scripts) to extract and format text for ingestion into the system.\n" |
53 | 41 | ] |
54 | | - }, |
| 42 | + }, |
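The similarity edges mentioned for the Graph Database Module can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the function names `cosine_similarity` and `similar_pairs` and the 0.8 threshold are hypothetical, and embeddings are assumed to be plain lists of floats.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similar_pairs(embeddings, threshold=0.8):
    # Compare every pair of chunk embeddings; each pair above the
    # threshold would become a similarity edge between nodes in Neo4j.
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

In the real system the pair list would be written to Neo4j as relationships between chunk nodes; the all-pairs loop shown here is O(n²) and only suitable for small corpora.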
55 | 43 | { |
56 | 44 | "cell_type": "markdown", |
57 | 45 | "metadata": {}, |
58 | 46 | "source": [ |
59 | 47 | "## How It Works\n", |
60 | | - "### Ingestion Steps:\n", |
61 | | - "1. **Start Ingestion:**\n", |
62 | | - " - Initiates the ingestion process for new data.\n", |
63 | | - " \n", |
64 | | - "2. **Split Texts into Chunks:**\n", |
65 | | - " - Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context.\n", |
66 | | - " \n", |
67 | | - "3. **Create Embeddings for Chunks:**\n", |
68 | | - " - Converts each text chunk into a numerical vector using a dedicated embedding model.\n", |
69 | | - "4. **Add Documents to ChromaDB:**\n", |
70 | | - " - Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval.\n", |
71 | | - "\n", |
72 | | - "5. **Summarize Each Chunk:**\n", |
73 | | - " - Generates a concise summary for each chunk using a language model.\n", |
74 | | - "\n", |
75 | | - "6. **Lemmatize Summaries:**\n", |
76 | | - " - Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis.\n", |
77 | | - "\n", |
78 | | - "7. **Create Lemma Embeddings:**\n", |
79 | | - " - Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content.\n", |
80 | | - "\n", |
81 | | - "8. **Create Nodes in Neo4j:**\n", |
82 | | - " - Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color.\n", |
83 | 48 | "\n", |
84 | | - "9. **Create Similarity Edges:**\n", |
85 | | - " - Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold.\n", |
86 | | - "\n", |
87 | | - "10. **End Ingestion:**\n", |
88 | | - " - Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established.\n", |
| 49 | + "### Ingestion Steps:\n", |
| 50 | + "1. **Start Ingestion:** Initiates the ingestion process for new data.\n", |
| 51 | + "2. **Split Texts into Chunks:** Breaks large documents into smaller segments based on a maximum character limit and specified overlap to preserve context.\n", |
| 52 | + "3. **Create Embeddings for Chunks:** Converts each text chunk into a numerical vector using a dedicated embedding model.\n", |
| 53 | + "4. **Add Documents to ChromaDB:** Stores the text chunks, their embeddings, and associated metadata in ChromaDB for efficient vector retrieval.\n", |
| 54 | + "5. **Summarize Each Chunk:** Generates a concise summary for each chunk using a language model.\n", |
| 55 | + "6. **Lemmatize Summaries:** Processes the summaries with SpaCy to reduce words to their base forms, ensuring consistency in text analysis.\n", |
| 56 | + "7. **Create Lemma Embeddings:** Converts the lemmatized summaries into embeddings to capture the distilled meaning of the content.\n", |
| 57 | + "8. **Create Nodes in Neo4j:** Inserts the summarized and lemmatized data as nodes in the Neo4j graph database, tagging each with metadata such as a unique corpus label and color.\n", |
| 58 | + "9. **Create Similarity Edges:** Establishes connections between nodes (text chunks) by creating similarity edges when the cosine similarity exceeds a defined threshold.\n", |
| 59 | + "10. **End Ingestion:** Finalizes the ingestion process, ensuring all data is stored and all relationships are properly established.\n", |
89 | 60 | "\n", |
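The chunk-splitting step of ingestion can be sketched as follows. This is an illustrative sketch, not the project's implementation: the function name `split_into_chunks` and its default limits are assumptions; the source specifies only a maximum character limit and an overlap.

```python
def split_into_chunks(text, max_chars=1000, overlap=200):
    # Break a document into segments of at most max_chars characters.
    # Each segment starts (max_chars - overlap) after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which is what "preserve context" refers to in step 2.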
90 | 61 | "### Retrieval Steps:\n", |
91 | | - "\n", |
92 | | - "1. **Start Retrieval:**\n", |
93 | | - " - Initiates the process based on a user’s query.\n", |
94 | | - "\n", |
95 | | - "2. **Embed User Query:**\n", |
96 | | - " - Transforms the query into a vector using the same embedding model employed during ingestion.\n", |
97 | | - "\n", |
98 | | - "3. **Query ChromaDB for Top_k Chunks:**\n", |
99 | | - " - Retrieves the top k text chunks that are most similar to the query based on vector similarity.\n", |
100 | | - "\n", |
101 | | - "4. **Initialize BFS with Retrieved Chunks:**\n", |
102 | | - " - Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes.\n", |
103 | | - "\n", |
104 | | - "5. **Is Context Enough?:**\n", |
105 | | - " - Uses a language model to determine if the gathered context is sufficient to answer the query.\n", |
106 | | - " - If the context is sufficient, the system proceeds to generate the final answer.\n", |
107 | | - " - If not, BFS expands by exploring neighboring nodes with similarity scores above a threshold.\n", |
108 | | - "\n", |
109 | | - "6. **Generate Final Answer:**\n", |
110 | | - " - Compiles all relevant information into a final context and uses a language model to generate a detailed answer.\n", |
111 | | - "\n", |
112 | | - "7. **Return Final Answer:**\n", |
113 | | - " - Outputs the generated answer to the user.\n", |
114 | | - "\n", |
115 | | - "8. **End Retrieval:**\n", |
116 | | - " - Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached." |
| 62 | + "1. **Start Retrieval:** Initiates the process based on a user’s query.\n", |
| 63 | + "2. **Embed User Query:** Transforms the query into a vector using the same embedding model employed during ingestion.\n", |
| 64 | + "3. **Query ChromaDB for Top_k Chunks:** Retrieves the top k text chunks that are most similar to the query based on vector similarity.\n", |
| 65 | + "4. **Initialize BFS with Retrieved Chunks:** Starts a Breadth-First Search (BFS) in the Neo4j graph using the initially retrieved chunks as the starting nodes.\n", |
| 66 | + "5. **Is Context Enough?:** Uses a language model to determine if the gathered context is sufficient to answer the query; if so, the system proceeds to generate the final answer; if not, BFS expands by exploring neighboring nodes with similarity scores above a threshold.\n", |
| 67 | + "6. **Generate Final Answer:** Compiles all relevant information into a final context and uses a language model to generate a detailed answer.\n", |
| 68 | + "7. **Return Final Answer:** Outputs the generated answer to the user.\n", |
| 69 | + "8. **End Retrieval:** Concludes the retrieval process once a complete answer is formulated or the maximum BFS depth is reached.\n" |
117 | 70 | ] |
118 | | - }, |
| 71 | + }, |
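The BFS expansion in the retrieval steps can be sketched in plain Python. This is a minimal sketch under stated assumptions: the graph is represented as an adjacency map rather than live Neo4j queries, and the function name `bfs_expand`, the 0.8 threshold, and the depth limit of 2 are illustrative, not taken from the source.

```python
from collections import deque

def bfs_expand(graph, seeds, threshold=0.8, max_depth=2):
    # graph: adjacency map {node_id: [(neighbor_id, similarity), ...]}
    # seeds: chunk ids returned by the initial ChromaDB query.
    # Expand outward from the seeds, following only similarity edges
    # above the threshold, up to max_depth hops.
    visited = set(seeds)
    queue = deque((s, 0) for s in seeds)
    order = list(seeds)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for neighbor, sim in graph.get(node, []):
            if sim >= threshold and neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order
```

In the described system, the "Is Context Enough?" check would run between BFS levels, stopping the expansion early once the language model judges the accumulated chunks sufficient.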
119 | 72 | { |
120 | 73 | "cell_type": "markdown", |
121 | 74 | "metadata": {}, |