| layout | name | description | authors | pypi | repo | type | report_issue | version | toc | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
integration |
Valyu Search |
Search and content extraction components using Valyu's API for web and proprietary sources |
|
Search & Content Extraction |
Haystack 2.0 |
true |
Haystack components for integrating Valyu's powerful search and content extraction APIs into your Haystack pipelines.
This package provides two main components:
ValyuSearch- Search component that queries the Valyu DeepSearch API and returns documents with content already includedValyuContentFetcher- Content extraction component that fetches and cleans content from URLs
Key Features:
- Search across web and proprietary sources
- Full content included in search results
- AI-powered content extraction and summarization
Use pip to install Valyu Search for Haystack:
pip install valyu-search-haystackOr install from source:
pip install -e .Requirements:
- Python 3.8+
- haystack-ai >= 2.0.0
- valyu >= 2.2.1
Set your Valyu API key as an environment variable:
export VALYU_API_KEY="your-api-key"The ValyuSearch component integrates with the Valyu DeepSearch API. Unlike many search APIs, Valyu returns full content by default, making it ideal for RAG pipelines.
Basic Usage:
from valyu_haystack import ValyuSearch
from haystack import Pipeline
# Create a search component (API key from VALYU_API_KEY env var)
search = ValyuSearch(
top_k=5,
search_type="all", # "web", "proprietary", or "all"
relevance_threshold=0.5
)
# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("search", search)
result = pipeline.run({"search": {"query": "What is Haystack AI?"}})
documents = result["search"]["documents"]
links = result["search"]["links"]Component Parameters:
api_key(Secret): Your Valyu API key. Defaults toVALYU_API_KEYenvironment variabletop_k(int, default=10): Maximum number of results to returnapi_base_url(str): Base URL for the Valyu APIsearch_type(Literal["web", "proprietary", "all"], default="all"): Type of searchrelevance_threshold(float, default=0.5): Minimum relevance score (0.0-1.0)max_price(int, default=100): Maximum price per thousand queries in cents
Output:
documents(List[Document]): Documents with content and rich metadatalinks(List[str]): List of URLs from search results
Metadata included:
title: Page titleurl: Source URLdescription: Page descriptionsource: Data source identifierrelevance_score: Relevance score (0.0-1.0)price: Cost of this resultlength: Content length in charactersdata_type: Type of data ("structured" or "unstructured")image_url: Associated image URL (if any)
The ValyuContentFetcher component extracts clean, readable content from URLs using the Valyu Contents API. It supports batch processing and AI-powered summarization.
Basic Usage:
from valyu_haystack import ValyuContentFetcher
from haystack import Pipeline
# Create a content fetcher component
fetcher = ValyuContentFetcher(
extract_effort="normal", # "normal", "high", or "auto"
response_length="short", # "short", "medium", "large", "max", or int
summary=True # Enable AI summarization
)
# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)
urls = ["https://example.com/article1", "https://example.com/article2"]
result = pipeline.run({"fetcher": {"urls": urls}})
documents = result["fetcher"]["documents"]Component Parameters:
api_key(Secret): Your Valyu API key. Defaults toVALYU_API_KEYenvironment variableapi_base_url(str): Base URL for the Valyu APItimeout(int, default=30): Request timeout in secondsextract_effort(Literal["normal", "high", "auto"], optional): Extraction thoroughnessresponse_length(Union[Literal["short", "medium", "large", "max"], int], optional): Content length per URLsummary(Union[bool, str, Dict], optional): AI summary configFalseorNone: No AI processing (raw content)True: Basic automatic summarizationstr: Custom instructions (max 500 chars)dict: JSON schema for structured extraction
Input:
urls(List[str], optional): List of URLs to fetchdocuments(List[Document], optional): Documents with URLs in metadata
Output:
documents(List[Document]): Documents with extracted content
Metadata included:
url: Source URLtitle: Page titlelength: Content length in characterssource: Data source identifierdata_type: Type of content
RAG Pipeline with Search and Chat:
from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from valyu_haystack import ValyuSearch
# Create components
web_search = ValyuSearch(top_k=3)
prompt_template = [
ChatMessage.from_system("You are a helpful assistant."),
ChatMessage.from_user(
"Given the information below:\n"
"{% for document in documents %}{{ document.content }}{% endfor %}\n"
"Answer question: {{ query }}.\nAnswer:"
)
]
prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables={"query", "documents"})
llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-3.5-turbo")
# Build pipeline
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
# Connect components
pipe.connect("search.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.messages", "llm.messages")
# Run pipeline
query = "What is the most famous landmark in Berlin?"
result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})Indexing Pipeline with Content Fetcher:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from valyu_haystack import ValyuContentFetcher
# Create components
document_store = InMemoryDocumentStore()
fetcher = ValyuContentFetcher()
writer = DocumentWriter(document_store=document_store)
# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=writer, name="writer")
# Connect components
indexing_pipeline.connect("fetcher.documents", "writer.documents")
# Run pipeline
indexing_pipeline.run(data={
"fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]}
})Structured data extraction with Content Fetcher:
from valyu_haystack import ValyuContentFetcher
# Define JSON schema for structured extraction
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"published_date": {"type": "string"},
"summary": {"type": "string"}
}
}
fetcher = ValyuContentFetcher(summary=schema)
result = fetcher.run(urls=["https://example.com/article"])
# Extracted structured data will be in document metadataBoth components use Haystack's Secret class for secure API key management:
- Header:
x-api-key: your-api-key - Environment variable:
VALYU_API_KEY
valyu-search-haystack is distributed under the terms of the Apache-2.0 license.