385 changes: 385 additions & 0 deletions src/oss/python/integrations/vectorstores/teradata.mdx
@@ -0,0 +1,385 @@
---
title: TeradataVectorStore
---

>Teradata Vector Store is designed to store, index, and search high-dimensional vector embeddings efficiently within your enterprise data platform.

This guide shows you how to quickly get up and running with TeradataVectorStore for your semantic search and RAG applications. Whether you're new to Teradata or looking to add AI capabilities to your existing data workflows, this guide will walk you through everything you need to know.

**What makes TeradataVectorStore special?**
- Built on the enterprise-grade Teradata Vantage platform.
- Seamlessly integrates with your existing data warehouse.
- Supports multiple vector search algorithms for different use cases.
- Scales from prototype to production workloads.

## Setup

Before we dive in, you'll need to install the necessary packages. TeradataVectorStore is part of the `langchain-teradata` package, which also includes other Teradata integrations for LangChain.

**New to Teradata?** Refer to:
- [Teradata VantageCloud Lake](https://www.teradata.com/platform/vantagecloud)
- Get started with [VantageCloud Lake](https://docs.teradata.com/r/Lake-Getting-Started-with-VantageCloud-Lake/)

### Installation


```bash pip
pip install langchain-teradata
```

### Credentials

**Connecting to Teradata:** The `create_context()` function establishes your connection to the Teradata Vantage system. This is how teradataml (and by extension, TeradataVectorStore) knows which database to connect to and authenticate with.

**What you'll need:**
- **hostname**: Your Teradata system's address
- **username/password**: Your database credentials
- **base_url**: API endpoint for your Teradata system
- **pat_token**: Personal Access Token for API authentication
- **pem_file**: SSL certificate file for secure connections

**For more information:** Check out the [Teradata Vector Store User Guide](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Vector-Store-User-Guide/Setting-up-Vector-Store/Required-Privileges) for detailed setup instructions.

**For information related to teradataml:** Refer to the [teradataml User Guide](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide/Introduction-to-Teradata-Package-for-Python).

```python
import os
from getpass import getpass
from teradataml import create_context

os.environ['TD_HOST'] = getpass(prompt='hostname: ')
os.environ['TD_USERNAME'] = getpass(prompt='username: ')
os.environ['TD_PASSWORD'] = getpass(prompt='password: ')
os.environ['TD_BASE_URL'] = getpass(prompt='base_url: ')
os.environ['TD_PAT_TOKEN'] = getpass(prompt='pat_token: ')
os.environ['TD_PEM_FILE'] = getpass(prompt='pem_file: ')
create_context()
```
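If some of these values are already set in your environment (for example, in a CI job or a notebook kernel), you may not want to be prompted again. The sketch below is one possible way to handle that using only the standard library. The helper names `ensure_env` and `missing_vars` are hypothetical and are not part of `teradataml` or `langchain-teradata`; the variable names are the ones used in the snippet above.

```python
import os
from getpass import getpass

# Environment variable names taken from the snippet above.
REQUIRED_VARS = [
    "TD_HOST", "TD_USERNAME", "TD_PASSWORD",
    "TD_BASE_URL", "TD_PAT_TOKEN", "TD_PEM_FILE",
]
# Assumption: only these two values need hidden input.
SECRET_VARS = {"TD_PASSWORD", "TD_PAT_TOKEN"}


def ensure_env(name, prompt_fn=None):
    """Return the value of `name`, prompting only if it is not already set."""
    value = os.environ.get(name)
    if value:
        return value
    if prompt_fn is None:
        # Hide input for secrets, echo it for everything else.
        prompt_fn = getpass if name in SECRET_VARS else input
    value = prompt_fn(f"{name}: ")
    os.environ[name] = value
    return value


def missing_vars(names=REQUIRED_VARS):
    """List the required variables that are still unset or empty."""
    return [n for n in names if not os.environ.get(n)]
```

A typical use would be to call `ensure_env(name)` for each entry in `REQUIRED_VARS`, confirm that `missing_vars()` returns an empty list, and then call `create_context()` as shown above.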

---

## Instantiation

**Initialize your embeddings**

**TeradataVectorStore supports three types of embedding objects:**
1. **String identifiers** (e.g., "amazon.titan-embed-text-v1")
2. **TeradataAI objects**
3. **LangChain embedding objects** - LangChain-compatible embedding model objects

```python
# Initialize embeddings
from langchain_aws import BedrockEmbeddings
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-west-2")
```

**Create Your First Vector Store**

Let's start with some sample `Document` objects and create a vector store. The `from_documents()` method is one of the most straightforward ways to get started - just pass in your documents and TeradataVectorStore handles the rest.

**What happens under the hood:**
- Your documents get converted to a teradataml DataFrame and passed to the vector store
- The embeddings are generated and stored for each Document object
- Indexes are automatically created for fast similarity search and chat operations

```python
from langchain_teradata import TeradataVectorStore
from langchain_core.documents import Document
# Sample documents about different topics
docs = [
    Document(page_content="Teradata provides scalable data analytics solutions for enterprises."),
    Document(page_content="Machine learning models require high-quality training data to perform well."),
    Document(page_content="Vector databases enable semantic search capabilities beyond keyword matching."),
    Document(page_content="LangChain simplifies building applications with large language models."),
    Document(page_content="Data warehousing has evolved to support real-time analytics and AI workloads."),
]

# Create the vector store
vs = TeradataVectorStore.from_documents(
    name="my_knowledge_base",
    documents=docs,
    embedding=embeddings,
)

print("Vector store created successfully!")
```

After creating your vector store, it's always good practice to verify that everything was set up correctly. TeradataVectorStore provides helpful methods to monitor your operations and understand what's happening behind the scenes.

**Why check status?**
- **Operation tracking**: See exactly which stage your vector store creation is at.
- **Troubleshooting**: Quickly identify if something went wrong during setup.
- **Progress monitoring**: For large datasets, track embedding generation progress.
- **Validation**: Confirm your vector store is ready for queries.

```python
# Check the status of the store.
vs.status()
```

Want to see what's actually inside your vector store? The `get_details()` method gives you a comprehensive overview of your setup - think of it as your vector store's "dashboard."

**What you'll see:**
- **Object inventory**: Number of tables or documents you have added.
- **Search parameters**: Current algorithm settings (HNSW, K-means, etc.)
- **Configuration details**: Embedding dimensions, distance metrics, and indexing options.
- **Performance settings**: Top-k values, similarity thresholds, and other query parameters.

```python
vs.get_details()
```

---

## Manage vector store

### Add items to vector store

One of the best features of TeradataVectorStore is how easy it is to expand your knowledge base. As your business grows and you have more documents, you can continuously add them without rebuilding everything from scratch.

**Real-world scenarios:**
- Add new product documentation as it's created.
- Include fresh research papers or industry reports.
- Incorporate customer feedback and support documents.
- Update with latest policy or procedure changes.

**Enterprise advantage:** Since everything runs on Teradata, you can easily add data from your existing tables, data warehouses, or real-time feeds without complex data movement.

```python
# Add more documents
additional_docs = [
    Document(page_content="Retrieval-augmented generation combines the power of search with language models."),
    Document(page_content="Teradata's vector capabilities support both structured and unstructured data analysis."),
]

vs.add_documents(documents=additional_docs)
print("Added more knowledge to the vector store!")
```

```python
# Check the status of the store after adding documents.
vs.status()
```

---

## Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

### Query directly

Now let's search for information in our vector store. Unlike traditional keyword search, vector search understands the meaning behind your questions. Ask about "AI applications" and it might return results about "machine learning models" because it understands these concepts are related.

**How similarity search works:**
- Your question gets converted to a vector embedding (just like your documents).
- TeradataVectorStore calculates similarity scores between your question and stored documents.
- The most relevant results are returned, ranked by similarity.

```python
# Ask a question
question = "What are vector databases?"
results = vs.similarity_search(question=question, return_type="json")

print("Found relevant information:")
for result in results.similar_objects:
    print(f" {result}")
```
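To make the ranking step described above more concrete, here is a minimal, self-contained sketch of similarity scoring in plain Python. It assumes cosine similarity and uses made-up three-dimensional vectors; the metric and algorithm that TeradataVectorStore actually applies depend on how your store is configured, and real embeddings have hundreds or thousands of dimensions. The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Score every document against the query and keep the top_k best."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (made-up numbers, for illustration only).
toy_docs = {
    "vector_databases": [0.9, 0.1, 0.0],
    "data_warehousing": [0.2, 0.8, 0.1],
    "langchain": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranked = rank_by_similarity(toy_query, toy_docs)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

The toy query points in nearly the same direction as the `vector_databases` vector, so that document ranks first. A vector store does the same kind of comparison, typically with an index so it does not have to score every stored vector.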

### Query by turning into retriever

You can also transform the vector store into a retriever for easier usage in your chains.

```python
# Create a retriever for your RAG pipeline
retriever = vs.as_retriever(search_type="similarity")

# Test the retriever
retrieved_docs = retriever.invoke("Tell me about Teradata's capabilities")

print("Retrieved documents for RAG:")
for doc in retrieved_docs:
    print(f"- {doc.page_content}")
```

---

## Usage for retrieval-augmented generation

The `ask()` method combines the power of vector search with language model generation. Instead of just returning raw document chunks, you get coherent, contextual answers.

**The two-step process:**
1. **Retrieval**: Find the most relevant documents from your vector store.
2. **Generation**: Use those documents as context to generate a natural language response.

**Why this is powerful:** Your AI responses are grounded in your actual data, which reduces hallucinations and improves accuracy. It's like having a knowledgeable assistant who actually read your company's documents!

```python
# Get a comprehensive answer
response = vs.ask(question="What are the benefits of using vector databases?")
print("AI Response:")
print(response)
```

Retrieval-Augmented Generation (RAG) is the technique that powers most modern AI assistants and chatbots. TeradataVectorStore integrates seamlessly with LangChain to make building RAG applications straightforward.

**What makes a good RAG application:**
- **Relevant retrieval**: Your vector store finds the right information.
- **Contextual generation**: The language model uses that information effectively.
- **Source transparency**: Users can see where answers come from.


**How it works with TeradataVectorStore**:
- You can use your vector store as a retriever to get the most relevant documents, then pass those documents to a RAG chain within LangChain workflows.
- This gives you the flexibility to build custom pipelines while leveraging Teradata's powerful vector search capabilities.

Now let's build a complete RAG pipeline that combines your TeradataVectorStore retriever with a language model. This demonstrates the full power of RAG - retrieving relevant information from your vector store and using it to generate informed responses.

**What's happening in this pipeline:**

- Retrieval: Your vector store finds the most relevant documents for the question.
- Context preparation: Those documents become context for the language model.
- Generation: The LLM generates an answer based on your actual data.
- Output parsing: Clean, formatted response ready for your application.


**Real-world applications:**

- Customer support: Answer questions using your product documentation.
- Research assistance: Query your organization's knowledge repositories.
- Compliance: Ensure responses are based on approved company information.

```python
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.chat_models import init_chat_model

# Example: simple RAG chain
# Initialize the chat model
llm = init_chat_model(
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    model_provider="bedrock_converse",
    region_name="<ENTER REGION>",
    aws_access_key_id="<ENTER AWS ACCESS KEY>",
    aws_secret_access_key="<ENTER AWS SECRET KEY>",
)


# Create a prompt template for the LLM to format its response using retrieved context
prompt = PromptTemplate.from_template(
    "Use the following context to answer the question.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
)

# Build the RAG chain: retrieve context, format prompt, generate answer, and parse output
rag_chain = (
    {
        "context": retriever,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

# Invoke the RAG chain with a sample question and print the response
response = rag_chain.invoke("Benefits of Vector Store")
print(response)
```
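The "context preparation" step is worth a closer look. In the chain above, the retriever's output is handed to the prompt as-is; a common refinement is to join only the text of each retrieved document into one string first. The sketch below illustrates that idea in plain Python. `Doc` is a hypothetical stand-in for LangChain's `Document` (only `page_content` is modeled), and `format_docs` and `build_prompt` are illustrative helper names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


# Same template text as the prompt used in the chain above.
TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def format_docs(docs):
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, question):
    """Fill the template with the formatted context and the user's question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


sample_docs = [
    Doc("Vector databases enable semantic search capabilities beyond keyword matching."),
    Doc("Teradata provides scalable data analytics solutions for enterprises."),
]
prompt_text = build_prompt(sample_docs, "Benefits of Vector Store")
print(prompt_text)
```

In a LangChain chain, a function like `format_docs` can usually be piped after the retriever (for example, `"context": retriever | format_docs`) so the language model sees clean text rather than the string representation of document objects.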

---

## Working with Different Data Types

TeradataVectorStore's flexibility really shines when working with different types of data sources. Depending on what you're starting with, you can choose the most appropriate method.

**Choose your starting point:**
- **Have PDF documents?** Use `from_documents()` with file paths
- **Working with database tables?** Use `from_datasets()` with DataFrames
- **Already have embeddings?** Use `from_embeddings()` to import them directly

### From PDF Files
```python
# File-based vector store from PDFs
pdf_vs = TeradataVectorStore.from_documents(
    name="pdf_knowledge",
    documents="path/to/your/document.pdf",  # or a list of PDF paths
    embedding=embeddings,
)
```

### From Database Tables
```python
# Content-based from existing tables
from teradataml import DataFrame
table_data = DataFrame('your_table_name')

table_vs = TeradataVectorStore.from_datasets(
    name="table_knowledge",
    data=table_data,
    data_columns=["text_column"],
    embedding=embeddings,
)
```

### From Pre-computed Embeddings
```python
# If you already have embeddings
embedding_vs = TeradataVectorStore.from_embeddings(
    name="embedding_store",
    data=your_embedding_data,
    data_columns="embedding_column",
)
```

***Note*** <br />
When working with tables (and embedded tables), the `data_columns` parameter is mandatory. This tells TeradataVectorStore exactly which columns contain the text content you want to convert into embeddings. Think of it as pointing the service to the right information.

For example, if your table has columns like `id`, `title`, `description`, and `category`, you'd specify `data_columns=["description"]` to embed only the description text, or `data_columns=["title", "description"]` to combine both fields.

Below is a small example of loading a sample table with `teradatagenai` and creating a content-based store from it. For `data_columns`, we pass the `rev_text` column, which is used to generate the embeddings.

```python
from teradatagenai import load_data

# Load sample data into Teradata
load_data("byom", "amazon_reviews_25")

# Create a vector store from the Teradata table
td_vs = TeradataVectorStore.from_datasets(
    name="table_store_amazon",
    data="amazon_reviews_25",
    data_columns="rev_text",
    embedding=embeddings,
)
```

```python
# Check the status of the new store
td_vs.status()
```

---

## Next Steps

Congratulations! You've just built your first AI-powered search and RAG system with TeradataVectorStore. You're now ready to scale this up to handle real enterprise workloads.

**Ready to go deeper?**
- **Advanced search algorithms**: Try HNSW or K-means clustering for large-scale deployments
- **Custom embedding models**: Experiment with domain-specific embeddings for your industry
- **Real-time updates**: Set up pipelines to automatically update your vector store as new data arrives

**Production considerations:**
- **Security**: Leverage Teradata's enterprise security features
- **Monitoring**: Use Teradata's built-in performance monitoring

**Learn more:**
- [LangChain RAG Tutorials](https://python.langchain.com/docs/tutorials/rag) - Deep dive into RAG patterns
- [TeradataVectorStore Workflows](https://github.com/Teradata/langchain-teradata) - Complete examples and use cases
- [VantageCloud Lake](https://www.teradata.com/platform/vantagecloud) - Cloud-native analytics platform

---

## API reference

For detailed documentation of all TeradataVectorStore features and configurations, head to the API reference:
[langchain-teradata User Guide](https://docs.teradata.com/search/documents?query=Teradata+Package+for+LangChain&sort=last_update&virtual-field=title_only&content-lang=en-US)