valyu-network/valyu-search-haystack

layout: integration
name: Valyu Search
description: Search and content extraction components using Valyu's API for web and proprietary sources
authors: Valyu (GitHub: valyu-network)
type: Search & Content Extraction
version: Haystack 2.0
toc: true

Overview

Haystack components that integrate Valyu's search and content extraction APIs into your Haystack pipelines.

This package provides two main components:

  • ValyuSearch - Search component that queries the Valyu DeepSearch API and returns documents with content already included
  • ValyuContentFetcher - Content extraction component that fetches and cleans content from URLs

Key Features:

  • Search across web and proprietary sources
  • Full content included in search results
  • AI-powered content extraction and summarization

Installation

Use pip to install Valyu Search for Haystack:

pip install valyu-search-haystack

Or install from source:

pip install -e .

Requirements:

  • Python 3.8+
  • haystack-ai >= 2.0.0
  • valyu >= 2.2.1

Usage

Set your Valyu API key as an environment variable:

export VALYU_API_KEY="your-api-key"

ValyuSearch

The ValyuSearch component integrates with the Valyu DeepSearch API. Unlike many search APIs, Valyu returns full content by default, so the returned documents can be passed straight to a RAG prompt without a separate fetch step.

Basic Usage:

from valyu_haystack import ValyuSearch
from haystack import Pipeline

# Create a search component (API key from VALYU_API_KEY env var)
search = ValyuSearch(
    top_k=5,
    search_type="all",  # "web", "proprietary", or "all"
    relevance_threshold=0.5
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("search", search)

result = pipeline.run({"search": {"query": "What is Haystack AI?"}})
documents = result["search"]["documents"]
links = result["search"]["links"]

Component Parameters:

  • api_key (Secret): Your Valyu API key. Defaults to VALYU_API_KEY environment variable
  • top_k (int, default=10): Maximum number of results to return
  • api_base_url (str): Base URL for the Valyu API
  • search_type (Literal["web", "proprietary", "all"], default="all"): Type of search
  • relevance_threshold (float, default=0.5): Minimum relevance score (0.0-1.0)
  • max_price (int, default=100): Maximum price per thousand queries in cents
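
To make the interaction between relevance_threshold and top_k concrete, here is a minimal plain-Python sketch. It assumes that results scoring below the threshold are dropped first and the remainder is capped at top_k; the real filtering happens inside the Valyu API and may differ. The select_results helper and the sample scores are hypothetical and not part of this package.

```python
from typing import Dict, List


def select_results(
    results: List[Dict], top_k: int = 10, relevance_threshold: float = 0.5
) -> List[Dict]:
    """Keep results at or above the threshold, best first, capped at top_k.

    Assumption: this mirrors the documented parameter meanings only;
    it is not the Valyu API's actual ranking logic.
    """
    kept = [r for r in results if r["relevance_score"] >= relevance_threshold]
    kept.sort(key=lambda r: r["relevance_score"], reverse=True)
    return kept[:top_k]


# Hypothetical search results with made-up scores.
sample = [
    {"url": "https://example.com/a", "relevance_score": 0.91},
    {"url": "https://example.com/b", "relevance_score": 0.42},
    {"url": "https://example.com/c", "relevance_score": 0.67},
]
selected = select_results(sample, top_k=5, relevance_threshold=0.5)
print([r["url"] for r in selected])
```

Raising relevance_threshold trades recall for precision, while top_k only bounds how many of the surviving results are returned.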

Output:

  • documents (List[Document]): Documents with content and rich metadata
  • links (List[str]): List of URLs from search results

Metadata included:

  • title: Page title
  • url: Source URL
  • description: Page description
  • source: Data source identifier
  • relevance_score: Relevance score (0.0-1.0)
  • price: Cost of this result
  • length: Content length in characters
  • data_type: Type of data ("structured" or "unstructured")
  • image_url: Associated image URL (if any)
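
As an illustration of how these metadata fields might be used downstream, the sketch below totals the price field and counts results per source. It works on plain dictionaries shaped like the list above; the summarize_meta helper and the sample values (including the "web" and "proprietary" source labels) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_meta(metas: List[Dict]) -> Dict:
    """Total the per-result price and count results per source."""
    per_source = defaultdict(int)
    total_price = 0.0
    for meta in metas:
        per_source[meta.get("source", "unknown")] += 1
        total_price += float(meta.get("price", 0.0))
    return {"total_price": total_price, "per_source": dict(per_source)}


# Hypothetical metadata dicts using the field names listed above.
metas = [
    {"title": "A", "url": "https://example.com/a", "source": "web", "price": 0.5, "length": 1200},
    {"title": "B", "url": "https://example.com/b", "source": "web", "price": 0.25, "length": 800},
    {"title": "C", "url": "https://example.com/c", "source": "proprietary", "price": 1.0, "length": 3000},
]
summary = summarize_meta(metas)
print(summary)
```

With real search output, the same helper could be fed `[doc.meta for doc in documents]`, since Haystack documents expose their metadata through the meta attribute.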

ValyuContentFetcher

The ValyuContentFetcher component extracts clean, readable content from URLs using the Valyu Contents API. It supports batch processing and AI-powered summarization.

Basic Usage:

from valyu_haystack import ValyuContentFetcher
from haystack import Pipeline

# Create a content fetcher component
fetcher = ValyuContentFetcher(
    extract_effort="normal",  # "normal", "high", or "auto"
    response_length="short",  # "short", "medium", "large", "max", or int
    summary=True  # Enable AI summarization
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)

urls = ["https://example.com/article1", "https://example.com/article2"]
result = pipeline.run({"fetcher": {"urls": urls}})
documents = result["fetcher"]["documents"]

Component Parameters:

  • api_key (Secret): Your Valyu API key. Defaults to VALYU_API_KEY environment variable
  • api_base_url (str): Base URL for the Valyu API
  • timeout (int, default=30): Request timeout in seconds
  • extract_effort (Literal["normal", "high", "auto"], optional): Extraction thoroughness
  • response_length (Union[Literal["short", "medium", "large", "max"], int], optional): Content length per URL
  • summary (Union[bool, str, Dict], optional): AI summary config
    • False or None: No AI processing (raw content)
    • True: Basic automatic summarization
    • str: Custom instructions (max 500 chars)
    • dict: JSON schema for structured extraction
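
The four accepted shapes of summary can be sketched as a small classifier. This is only an illustration of the cases listed above, written in plain Python; describe_summary_mode and its return labels are hypothetical, and the component's own validation may behave differently.

```python
from typing import Any


def describe_summary_mode(summary: Any) -> str:
    """Classify a `summary` value according to the four documented cases."""
    if summary is None or summary is False:
        return "raw"  # no AI processing
    if summary is True:
        return "auto"  # basic automatic summarization
    if isinstance(summary, str):
        if len(summary) > 500:
            raise ValueError("custom instructions are limited to 500 characters")
        return "instructions"
    if isinstance(summary, dict):
        return "schema"  # JSON schema for structured extraction
    raise TypeError(f"unsupported summary type: {type(summary).__name__}")


print(describe_summary_mode("Summarize the key findings in two sentences."))
```

Checking the 500-character limit locally, before any request is made, is a design choice of this sketch rather than documented behavior of the component.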

Input:

  • urls (List[str], optional): List of URLs to fetch
  • documents (List[Document], optional): Documents with URLs in metadata
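
Because the fetcher accepts either explicit URLs or documents that carry URLs in their metadata, it can help to see how the two inputs might be merged. The sketch below assumes the metadata key is url (the key named in the metadata lists in this README) and that duplicates should be dropped; collect_urls is a hypothetical helper, not the component's actual logic.

```python
from typing import Dict, List, Optional


def collect_urls(
    urls: Optional[List[str]] = None, metas: Optional[List[Dict]] = None
) -> List[str]:
    """Merge explicit URLs with URLs found in metadata dicts.

    Order is preserved and duplicates are dropped. Assumption: the URL
    lives under the "url" key of each metadata dict.
    """
    candidates = list(urls or [])
    candidates += [m["url"] for m in (metas or []) if m.get("url")]
    seen = set()
    unique = []
    for url in candidates:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


merged = collect_urls(
    urls=["https://example.com/article1"],
    metas=[
        {"url": "https://example.com/article2"},
        {"url": "https://example.com/article1"},  # duplicate of the explicit URL
        {"title": "no url here"},  # skipped: no "url" key
    ],
)
print(merged)
```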

Output:

  • documents (List[Document]): Documents with extracted content

Metadata included:

  • url: Source URL
  • title: Page title
  • length: Content length in characters
  • source: Data source identifier
  • data_type: Type of content

Pipeline Examples

RAG Pipeline with Search and Chat:

from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from valyu_haystack import ValyuSearch

# Create components
web_search = ValyuSearch(top_k=3)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given the information below:\n"
        "{% for document in documents %}{{ document.content }}{% endfor %}\n"
        "Answer question: {{ query }}.\nAnswer:"
    )
]

prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables=["query", "documents"])
llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-3.5-turbo")

# Build pipeline
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Connect components
pipe.connect("search.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.messages", "llm.messages")

# Run pipeline
query = "What is the most famous landmark in Berlin?"
result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})

Indexing Pipeline with Content Fetcher:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from valyu_haystack import ValyuContentFetcher

# Create components
document_store = InMemoryDocumentStore()
fetcher = ValyuContentFetcher()
writer = DocumentWriter(document_store=document_store)

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=writer, name="writer")

# Connect components
indexing_pipeline.connect("fetcher.documents", "writer.documents")

# Run pipeline
indexing_pipeline.run(data={
    "fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]}
})

Advanced Configuration

Structured data extraction with Content Fetcher:

from valyu_haystack import ValyuContentFetcher

# Define JSON schema for structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "summary": {"type": "string"}
    }
}

fetcher = ValyuContentFetcher(summary=schema)
result = fetcher.run(urls=["https://example.com/article"])

# Extracted structured data will be in document metadata

API Integration Details

Authentication

Both components take the API key as a Haystack Secret, so the key does not need to be hard-coded:

  • Environment variable: the key is read from VALYU_API_KEY by default
  • Header: the key is sent to the Valyu API as x-api-key: your-api-key
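
For readers calling the API outside these components, the authentication scheme above amounts to reading the key from the environment and placing it in the x-api-key header. The following is a minimal sketch using only the standard library; build_auth_headers is a hypothetical helper, and the components themselves handle this through Secret.

```python
import os
from typing import Dict, Mapping


def build_auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the x-api-key header from the VALYU_API_KEY variable.

    Fails fast with a clear error when the key is missing or empty.
    """
    api_key = env.get("VALYU_API_KEY")
    if not api_key:
        raise RuntimeError("VALYU_API_KEY is not set")
    return {"x-api-key": api_key}


# A plain dict stands in for the environment so the example runs anywhere.
headers = build_auth_headers({"VALYU_API_KEY": "your-api-key"})
print(sorted(headers))
```

Passing the environment in as a mapping keeps the helper easy to test; calling build_auth_headers() with no arguments reads the real process environment.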

License

valyu-search-haystack is distributed under the terms of the Apache-2.0 license.
