142 changes: 142 additions & 0 deletions docs/user-guide/algorithms/6_vectorisation.md
@@ -0,0 +1,142 @@
# Vectorisation

The [vectors][raphtory.vectors] module allows you to transform a graph into a collection of documents and vectorise those documents using an embedding function. Since the AI space moves quickly, Raphtory allows you to plug in your preferred embedding model either locally or from an API.

Using this, you can perform [semantic search](https://en.wikipedia.org/wiki/Semantic_search) over your graph data and build powerful AI systems with graph-based RAG.

## Vectorise a graph

To vectorise a graph you must create an embedding function that takes a list of strings and returns a matching list of embeddings. This function can use any model or library you prefer. In this example we use the `openai` library and point it at a local, OpenAI-API-compatible Ollama service.

/// tab | :fontawesome-brands-python: Python
```{.python notest}
from openai import OpenAI

def get_embeddings(documents, model="embeddinggemma"):
    # Point an OpenAI-compatible client at a local Ollama service
    client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
    return [client.embeddings.create(input=text, model=model).data[0].embedding for text in documents]

# node_document and edge_document are optional document templates (see below)
v = g.vectorise(get_embeddings, nodes=node_document, edges=edge_document, verbose=True)
```
///

When you call [vectorise()][raphtory.GraphView.vectorise], Raphtory automatically creates a document for each node and edge in your graph. Optionally, you can pass template strings to `vectorise()` to control how those documents are formatted. This is useful when you know which properties are semantically relevant, or when you want to present information in a specific format for a human or machine user at retrieval time. You can also cache the embedded graph to disk to avoid recomputing the vectors when nothing has changed.
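For example, a minimal sketch of enabling the on-disk embedding cache; the cache path below is illustrative:

/// tab | :fontawesome-brands-python: Python
```{.python notest}
# Reuses cached embeddings on subsequent runs so unchanged documents
# are not re-embedded; the path is illustrative
v = g.vectorise(get_embeddings, cache="/tmp/raphtory_embeddings", verbose=True)
```
///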

### Document templates

Entity document templates support a subset of [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) syntax, implemented with [MiniJinja](https://docs.rs/minijinja/latest/minijinja/).

Graph attributes and properties are exposed so that you can use them in template expressions. The nesting of attributes mirrors the Python interface, so you can write chains such as `properties.prop_name` or `src.name`, which follow the same typing as in Python. For `datetime` values, Raphtory converts these to milliseconds since the Unix epoch by default, and provides an optional `datetimeformat` function to render them in a human-readable format.
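As an illustrative sketch, a node template for the network example later in this page might look like the following. The property names and the `latest_time` attribute are assumptions about what your particular graph exposes:

/// tab | :fontawesome-brands-python: Python
```{.python notest}
# Hypothetical node template; the property names are illustrative
node_document = """
{{ name }} is a {{ properties.primary_function }} server running {{ properties.OS_version }},
last updated at {{ datetimeformat(latest_time) }}.
"""

v = g.vectorise(get_embeddings, nodes=node_document, verbose=True)
```
///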

## Retrieve documents

You can retrieve relevant information from the [VectorisedGraph][raphtory.vectors.VectorisedGraph] by making selections.

A [VectorSelection][raphtory.vectors.VectorSelection] is a general object for holding embedded documents. You can create an empty selection, or perform a similarity query against a `VectorisedGraph` to populate a new one.

You can add to a selection by combining existing selections or by adding new documents associated with specific nodes and edges by their IDs. You can also [expand][raphtory.vectors.VectorSelection.expand_entities_by_similarity] a selection by making similarity queries relative to the entities already in it, which uses the graph's relationships to constrain your query.

Once you have a selection containing the information you want you can:

- Get the associated graph entities using [nodes()][raphtory.vectors.VectorSelection.nodes] or [edges()][raphtory.vectors.VectorSelection.edges].
- Get the associated documents using [get_documents()][raphtory.vectors.VectorSelection.get_documents] or [get_documents_with_scores()][raphtory.vectors.VectorSelection.get_documents_with_scores].

Each [Document][raphtory.vectors.Document] holds a reference to a unique entity in the graph, the content of the associated document, and its vector representation. You can pull any of these out to retrieve information about an entity for a RAG system, compose a subgraph to analyse with Raphtory's algorithms, or feed into a more complex pipeline.
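Putting these pieces together, a minimal sketch of the selection workflow might look like the following. The method names come from the links above, but the query strings and limits are illustrative and exact signatures may vary between Raphtory versions:

/// tab | :fontawesome-brands-python: Python
```{.python notest}
# Start from a similarity query, then expand along graph relationships
selection = v.nodes_by_similarity("database servers", limit=3)
selection.expand_entities_by_similarity("backup", limit=2)

# Inspect the selected documents and their similarity scores
for doc, score in selection.get_documents_with_scores():
    print(score, doc.content)

# Or pull out the matching graph entities for further analysis
matched_nodes = selection.nodes()
```
///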

## Asking questions about your network

Using the network example from the [ingestion using dataframes](../ingestion/3_dataframes.md) discussion, you can set up a graph and add some simple AI helpers in order to create a `VectorisedGraph`:

/// tab | :fontawesome-brands-python: Python
```{.python notest}
from raphtory import Graph
import pandas as pd
from openai import OpenAI

server_edges_df = pd.read_csv("./network_traffic_edges.csv")
server_edges_df["timestamp"] = pd.to_datetime(server_edges_df["timestamp"])

server_nodes_df = pd.read_csv("./network_traffic_nodes.csv")
server_nodes_df["timestamp"] = pd.to_datetime(server_nodes_df["timestamp"])

traffic_graph = Graph()
traffic_graph.load_edges_from_pandas(
    df=server_edges_df,
    src="source",
    dst="destination",
    time="timestamp",
    properties=["data_size_MB"],
    layer_col="transaction_type",
    metadata=["is_encrypted"],
    shared_metadata={"datasource": "./network_traffic_edges.csv"},
)
traffic_graph.load_nodes_from_pandas(
    df=server_nodes_df,
    id="server_id",
    time="timestamp",
    properties=["OS_version", "primary_function", "uptime_days"],
    metadata=["server_name", "hardware_type"],
    shared_metadata={"datasource": "./network_traffic_edges.csv"},
)

def get_embeddings(documents, model="embeddinggemma"):
    # Embed each document using a local Ollama service via the OpenAI client
    client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
    return [client.embeddings.create(input=text, model=model).data[0].embedding for text in documents]

def send_query_with_docs(query: str, selection):
    # Format the selected documents into a context block for the LLM
    formatted_docs = "\n".join(doc.content for doc in selection.get_documents())
    client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
    instructions = f"You are a helpful assistant. Answer the user question using the following context:\n{formatted_docs}"

    completion = client.chat.completions.create(
        model="gemma3",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content

v = traffic_graph.vectorise(get_embeddings, verbose=True)
```
///

Using this `VectorisedGraph` you can perform similarity queries and feed the results into an LLM to ground its responses in your data.

/// tab | :fontawesome-brands-python: Python
```{.python notest}
query = "What's the status of my linux boxes?"

node_selection = v.nodes_by_similarity(query, limit=3)

print(send_query_with_docs(query, node_selection))
```
///

Be aware, however, that LLM responses are statistical, so some variation between runs is expected. In production systems you may want to use structured output to enforce a specific response format; a sketch of one option follows the example output below.

The output of the example query should be similar to the following:

!!! Output
    ```output
    Okay, here’s a rundown of the status of your Linux boxes as of today, September 3, 2023:

    * **ServerA (Alpha):**
        * Datasource: ./network_traffic_edges.csv
        * Hardware Type: Blade Server
        * OS Version: Ubuntu 20.04 (Changed Sep 1, 2023 08:00)
        * Primary Function: Database
        * Uptime: 120 days
    * **ServerD (Delta):**
        * Datasource: ./network_traffic_edges.csv
        * Hardware Type: Tower Server
        * OS Version: Ubuntu 20.04 (Changed Sep 1, 2023 08:15)
        * Primary Function: Application Server
        * Uptime: 60 days
    * **ServerE (Echo):**
        * Datasource: ./network_traffic_edges.csv
        * Hardware Type: Rack Server
        * OS Version: Red Hat 8.1 (Changed Sep 1, 2023 08:20)
        * Primary Function: Backup
        * Uptime: 30 days

    Do you need any more details about any of these servers?
    ```
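
If you need machine-readable answers, one option is to request structured output from the model. The following is a hedged sketch, a variant of `send_query_with_docs()` that asks for JSON via the OpenAI client's `response_format` parameter; whether this is honoured depends on the model and backend you are running:

/// tab | :fontawesome-brands-python: Python
```{.python notest}
def send_query_with_json_output(query: str, selection):
    # Variant of send_query_with_docs that asks the model for JSON output;
    # support for response_format depends on the backend and model
    formatted_docs = "\n".join(doc.content for doc in selection.get_documents())
    client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
    completion = client.chat.completions.create(
        model="gemma3",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": f"Answer in JSON using only this context:\n{formatted_docs}"},
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content
```
///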
2 changes: 1 addition & 1 deletion docs/user-guide/views/2_time.md
@@ -38,7 +38,7 @@ While `before()` and `after()` are more useful for continuous time datasets, `at
In the example below we print the degree of `Lome` across the full dataset, before 12:17 on the 13th of June, and after 9:07 on the 30th of June. We also use two time functions here, [start_date_time][raphtory.GraphView.start_date_time] and [end_date_time][raphtory.GraphView.end_date_time], which return information about a view.

!!! note
In this code example we have called the `before()` on the graph and `after()` on the node. This is important, as there are some subtle differences in outcomes that depend on where these functions are called. This is discussed in detail [below](2_time.md#traversing-the-graph-with-views).
In this code example we have called the `before()` on the graph and `after()` on the node. This is important, as there are some subtle differences in outcomes that depend on where these functions are called. This is discussed in detail [below](2_time.md#propagation-of-time-filters).

/// tab | :fontawesome-brands-python: Python
```python
4 changes: 2 additions & 2 deletions docs/user-guide/views/3_layer.md
@@ -6,7 +6,7 @@ Before reading this topic, please ensure you are familiar with:

- [Edge layers](../ingestion/2_direct-updates.md#edge-layers)
- [Exploded Edges](../querying/4_edge-metrics.md#exploded-edges)
- [Traversing graphs](2_time.md#traversing-the-graph-with-views)
- [Traversing graphs](2_time.md#propagation-of-time-filters)

## Creating layers views

@@ -81,7 +81,7 @@ assert str(f"Total weight across Grooming and Resting between {start_day} and {e

## Traversing the graph with layers

Expanding on the example from [the time views](2_time.md#traversing-the-graph-with-views), if you wanted to look at which neighbours LOME has groomed, followed by who those monkeys have rested with, then you could write the following query.
Expanding on the example from [the time views](2_time.md#propagation-of-time-filters), if you wanted to look at which neighbours LOME has groomed, followed by who those monkeys have rested with, then you could write the following query.

!!! note
Similar to the time based filters, if a layer view is applied to the graph then all extracted entities will have this view applied to them. However, if the layer view is applied to a node or edge, it will only last until you have moved to a new node.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -124,6 +124,7 @@ nav:
- user-guide/algorithms/3_node-algorithms.md
- user-guide/algorithms/4_view-algorithms.md
- user-guide/algorithms/5_community_detection.md
- user-guide/algorithms/6_vectorisation.md
- Exporting:
- user-guide/export/0_dummy_index.md
- user-guide/export/1_intro.md
14 changes: 7 additions & 7 deletions python/python/raphtory/__init__.pyi
@@ -743,17 +743,17 @@ class GraphView(object):
verbose: bool = False,
) -> VectorisedGraph:
"""
Create a VectorisedGraph from the current graph
Create a VectorisedGraph from the current graph.

Args:
embedding (Callable[[list], list]): the embedding function to translate documents to embeddings
nodes (bool | str): if nodes have to be embedded or not or the custom template to use if a str is provided. Defaults to True.
edges (bool | str): if edges have to be embedded or not or the custom template to use if a str is provided. Defaults to True.
cache (str, optional): the path to use to store the cache for embeddings.
verbose (bool): whether or not to print logs reporting the progress. Defaults to False.
embedding (Callable[[list], list]): Specify the embedding function used to vectorise documents into embeddings.
nodes (bool | str): Whether nodes should be embedded, or a custom document template to use when a string is provided. Defaults to True.
edges (bool | str): Whether edges should be embedded, or a custom document template to use when a string is provided. Defaults to True.
cache (str, optional): Path used to store the cache of embeddings.
verbose (bool): Enable to print logs reporting progress. Defaults to False.

Returns:
VectorisedGraph: A VectorisedGraph with all the documents/embeddings computed and with an initial empty selection
VectorisedGraph: A VectorisedGraph with all the documents and their embeddings, with an initial empty selection.
"""

def window(self, start: TimeInput, end: TimeInput) -> GraphView: