---
layout: default
title: "Chapter 5: Result Processing and Ranking"
parent: Perplexica Tutorial
nav_order: 5
---

Chapter 5: Result Processing and Ranking

Welcome to Chapter 5: Result Processing and Ranking. In this part of Perplexica Tutorial: AI-Powered Search Engine, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

How Perplexica transforms raw search results into coherent, useful answers is the heart of what makes it an AI search engine rather than just another search aggregator. This chapter covers the full result processing pipeline -- deduplication, relevance scoring, embedding-based re-ranking, answer synthesis with inline citations, and quality assurance. By the end, you will understand how to tune every stage for your specific use case.

Result Processing Pipeline Overview

After the data collection phase (Chapter 4) delivers raw search results and parsed page content, the result processing pipeline takes over. It transforms noisy, redundant, potentially unreliable data into a single, well-cited answer.

flowchart TD
    subgraph Input["Raw Input"]
        SR[Search Results - 20-30 items]
        PC[Parsed Page Content]
        UF[Uploaded File Content]
    end

    subgraph Dedup["Stage 1: Deduplication"]
        UD[URL Deduplication]
        CD[Content Deduplication]
    end

    subgraph Relevance["Stage 2: Relevance Scoring"]
        SS[Semantic Similarity]
        KW[Keyword Matching]
        FS[Freshness Score]
        AS[Authority Score]
    end

    subgraph Ranking["Stage 3: Re-Ranking"]
        CS[Composite Score Calculation]
        TK[Top-K Selection]
    end

    subgraph Synthesis["Stage 4: Answer Synthesis"]
        CP[Context Preparation]
        LLM[LLM Generation]
        CT[Citation Injection]
    end

    subgraph QA["Stage 5: Quality Assurance"]
        FC[Factual Consistency]
        CV[Citation Verification]
        CM[Completeness Check]
    end

    subgraph Output["Final Output"]
        ANS[Answer with Citations]
        SRC[Source Cards]
        FQ[Follow-up Questions]
    end

    SR & PC & UF --> UD
    UD --> CD
    CD --> SS & KW & FS & AS
    SS & KW & FS & AS --> CS
    CS --> TK
    TK --> CP
    CP --> LLM
    LLM --> CT
    CT --> FC
    FC --> CV
    CV --> CM
    CM --> ANS & SRC & FQ

    classDef input fill:#e1f5fe,stroke:#01579b
    classDef dedup fill:#fff3e0,stroke:#ef6c00
    classDef relevance fill:#e8f5e8,stroke:#1b5e20
    classDef ranking fill:#f3e5f5,stroke:#4a148c
    classDef synthesis fill:#fce4ec,stroke:#b71c1c
    classDef qa fill:#e0f2f1,stroke:#004d40
    classDef output fill:#f1f8e9,stroke:#33691e

    class SR,PC,UF input
    class UD,CD dedup
    class SS,KW,FS,AS relevance
    class CS,TK ranking
    class CP,LLM,CT synthesis
    class FC,CV,CM qa
    class ANS,SRC,FQ output

Stage 1: Deduplication

Search results from multiple engines frequently contain duplicate or near-duplicate entries. Perplexica removes them in two passes.

URL-Based Deduplication

// src/lib/dedup.ts
function deduplicateByUrl(results: SearchResult[]): SearchResult[] {
  const seen = new Map<string, SearchResult>();

  for (const result of results) {
    // Normalize URL: remove trailing slashes, query params, fragments
    const normalized = normalizeUrl(result.url);

    if (!seen.has(normalized)) {
      seen.set(normalized, result);
    } else {
      // Keep the result with the higher search engine score
      const existing = seen.get(normalized)!;
      if (result.score > existing.score) {
        seen.set(normalized, result);
      }
    }
  }

  return Array.from(seen.values());
}

function normalizeUrl(url: string): string {
  try {
    const parsed = new URL(url);
    // Remove common tracking parameters
    const trackingParams = ['utm_source', 'utm_medium', 'utm_campaign', 'ref', 'fbclid'];
    trackingParams.forEach((p) => parsed.searchParams.delete(p));
    // Remove fragment and trailing slash
    parsed.hash = '';
    let path = parsed.pathname.replace(/\/+$/, '') || '/';
    return `${parsed.origin}${path}${parsed.search}`;
  } catch {
    return url;
  }
}
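As a quick sanity check, the normalizer can be exercised standalone (the function body is repeated from above so the snippet runs on its own):

```typescript
// Repeats normalizeUrl from above so the snippet is self-contained.
function normalizeUrl(url: string): string {
  try {
    const parsed = new URL(url);
    const trackingParams = ['utm_source', 'utm_medium', 'utm_campaign', 'ref', 'fbclid'];
    trackingParams.forEach((p) => parsed.searchParams.delete(p));
    parsed.hash = '';
    const path = parsed.pathname.replace(/\/+$/, '') || '/';
    return `${parsed.origin}${path}${parsed.search}`;
  } catch {
    return url;
  }
}

// Tracking params, fragments, and trailing slashes are stripped...
const a = normalizeUrl('https://example.com/docs/?utm_source=tw&fbclid=abc#intro');
// ...while meaningful query parameters survive.
const b = normalizeUrl('https://example.com/search?q=rust&utm_medium=email');
```

Note that two URLs differing only in tracking noise now collapse to the same key in the dedup map.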

Content-Based Deduplication

Even with different URLs, pages can have nearly identical content (mirrors, syndication, etc.).

// src/lib/dedup.ts
function deduplicateByContent(
  results: SearchResult[],
  similarityThreshold: number = 0.92,
): SearchResult[] {
  const unique: SearchResult[] = [];

  for (const result of results) {
    const isDuplicate = unique.some((existing) => {
      const sim = jaccardSimilarity(
        tokenize(existing.content),
        tokenize(result.content),
      );
      return sim > similarityThreshold;
    });

    if (!isDuplicate) {
      unique.push(result);
    }
  }

  return unique;
}

function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  const intersection = new Set([...a].filter((x) => b.has(x)));
  const union = new Set([...a, ...b]);
  return intersection.size / union.size;
}

function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/\W+/)
      .filter((t) => t.length > 2),
  );
}

Stage 2: Relevance Scoring

After deduplication, each result is scored across four dimensions.

Scoring Dimensions

| Dimension | Weight | Method | Range |
| --- | --- | --- | --- |
| Semantic Similarity | 0.40 | Cosine similarity of query and document embeddings | 0.0 - 1.0 |
| Keyword Match | 0.25 | TF-IDF overlap between query terms and content | 0.0 - 1.0 |
| Freshness | 0.15 | Exponential decay based on publish date | 0.0 - 1.0 |
| Authority | 0.20 | Domain reputation + HTTPS + content depth heuristic | 0.0 - 1.0 |
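The keyword-match scorer is referenced later in the re-ranker but not shown in this chapter. A minimal sketch, assuming a simple term-overlap heuristic rather than full TF-IDF (`scoreKeywordMatch` as written here is an illustration, not Perplexica's actual implementation), might look like this:

```typescript
// Hypothetical sketch of scoreKeywordMatch: the fraction of query terms
// that appear anywhere in the document content. A real TF-IDF variant
// would additionally down-weight terms common across the corpus.
function scoreKeywordMatch(query: string, content: string): number {
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 2);
  if (terms.length === 0) return 0;

  const text = content.toLowerCase();
  let matched = 0;
  for (const term of terms) {
    if (text.includes(term)) matched += 1;
  }
  return matched / terms.length; // 0.0 - 1.0, matching the table above
}
```

Even this naive version catches the cases semantic similarity misses: exact product names, version numbers, and rare identifiers that embeddings blur together.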

Semantic Similarity Scoring

// src/lib/scoring.ts
import { Embeddings } from '@langchain/core/embeddings';
import { computeSimilarity } from './computeSimilarity';

async function scoreSemanticSimilarity(
  query: string,
  results: SearchResult[],
  embeddingModel: Embeddings,
  measure: 'cosine' | 'dot' | 'euclidean',
): Promise<Map<string, number>> {
  const queryEmbedding = await embeddingModel.embedQuery(query);

  const contents = results.map((r) => r.content || r.title);
  const docEmbeddings = await embeddingModel.embedDocuments(contents);

  const scores = new Map<string, number>();

  results.forEach((result, i) => {
    const similarity = computeSimilarity(
      queryEmbedding,
      docEmbeddings[i],
      measure,
    );
    scores.set(result.url, similarity);
  });

  return scores;
}

Freshness Scoring

// src/lib/scoring.ts
function scoreFreshness(publishedDate: string | undefined): number {
  if (!publishedDate) return 0.5; // Unknown date gets neutral score

  const published = new Date(publishedDate).getTime();
  const now = Date.now();
  const ageInDays = (now - published) / (1000 * 60 * 60 * 24);

  // Exponential decay: half-life of 90 days
  const halfLifeDays = 90;
  return Math.exp((-Math.log(2) * ageInDays) / halfLifeDays);
}
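To sanity-check the half-life math, the decay term can be isolated from the date handling: at age 0 the score is 1.0, at one half-life (90 days) it is 0.5, and at two half-lives 0.25:

```typescript
// The decay term from scoreFreshness, isolated for inspection.
const halfLifeDays = 90;
const decay = (ageInDays: number): number =>
  Math.exp((-Math.log(2) * ageInDays) / halfLifeDays);

const fresh = decay(0);       // brand new page
const oneHalf = decay(90);    // one half-life old
const twoHalves = decay(180); // two half-lives old
```

Tuning `halfLifeDays` is the main lever here: news-style deployments might drop it to 7-14 days, while evergreen reference material tolerates a much longer half-life.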

Authority Scoring

// src/lib/scoring.ts
const HIGH_AUTHORITY_DOMAINS = new Set([
  'wikipedia.org', 'arxiv.org', 'nature.com', 'science.org',
  'github.com', 'stackoverflow.com', 'docs.python.org',
  'developer.mozilla.org', 'nist.gov', 'nih.gov',
  'reuters.com', 'apnews.com', 'bbc.com',
]);

function scoreAuthority(url: string, content: string): number {
  let score = 0.5; // Base score

  let domain: string;
  try {
    domain = new URL(url).hostname.replace(/^www\./, '');
  } catch {
    return score; // Malformed URL keeps the neutral base score
  }

  // Domain reputation (match the domain itself or any subdomain,
  // e.g. en.wikipedia.org should match wikipedia.org)
  const isHighAuthority =
    HIGH_AUTHORITY_DOMAINS.has(domain) ||
    [...HIGH_AUTHORITY_DOMAINS].some((d) => domain.endsWith(`.${d}`));
  if (isHighAuthority) score += 0.2;
  if (domain.endsWith('.edu')) score += 0.15;
  if (domain.endsWith('.gov')) score += 0.15;

  // HTTPS
  if (url.startsWith('https://')) score += 0.05;

  // Content depth heuristic
  const wordCount = content.split(/\s+/).length;
  if (wordCount > 500) score += 0.05;
  if (wordCount > 1500) score += 0.05;

  return Math.min(score, 1.0);
}

Composite Score Calculation

// src/lib/scoring.ts
interface ScoringWeights {
  semantic: number;
  keyword: number;
  freshness: number;
  authority: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = {
  semantic: 0.40,
  keyword: 0.25,
  freshness: 0.15,
  authority: 0.20,
};

function computeCompositeScore(
  semanticScore: number,
  keywordScore: number,
  freshnessScore: number,
  authorityScore: number,
  weights: ScoringWeights = DEFAULT_WEIGHTS,
): number {
  return (
    weights.semantic * semanticScore +
    weights.keyword * keywordScore +
    weights.freshness * freshnessScore +
    weights.authority * authorityScore
  );
}
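Plugging sample values in shows how the weights trade off; the function and defaults are repeated here so the example runs on its own:

```typescript
// Repeats the composite scoring code from above.
interface ScoringWeights {
  semantic: number;
  keyword: number;
  freshness: number;
  authority: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = {
  semantic: 0.40,
  keyword: 0.25,
  freshness: 0.15,
  authority: 0.20,
};

function computeCompositeScore(
  semanticScore: number,
  keywordScore: number,
  freshnessScore: number,
  authorityScore: number,
  weights: ScoringWeights = DEFAULT_WEIGHTS,
): number {
  return (
    weights.semantic * semanticScore +
    weights.keyword * keywordScore +
    weights.freshness * freshnessScore +
    weights.authority * authorityScore
  );
}

// A highly relevant but somewhat stale page:
// 0.40*0.9 + 0.25*0.6 + 0.15*0.4 + 0.20*0.7 = 0.71
const score = computeCompositeScore(0.9, 0.6, 0.4, 0.7);
```

Because semantic similarity carries the largest weight, a document the embeddings consider relevant stays near the top even when it is old or from an unremarkable domain.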

Stage 3: Re-Ranking and Selection

After scoring, results are re-ranked by composite score and the top-k are selected for answer synthesis.

flowchart LR
    subgraph Before["Before Re-Ranking (15 results)"]
        R1["Result A - 0.82"]
        R2["Result B - 0.45"]
        R3["Result C - 0.91"]
        R4["Result D - 0.67"]
        R5["Result E - 0.73"]
    end

    subgraph Rerank["Re-Ranking"]
        SORT[Sort by Composite Score]
        SELECT[Select Top-K = 5-8]
    end

    subgraph After["After Re-Ranking (Top 5)"]
        T1["1. Result C - 0.91"]
        T2["2. Result A - 0.82"]
        T3["3. Result E - 0.73"]
        T4["4. Result D - 0.67"]
        T5["5. Result B - 0.45"]
    end

    Before --> SORT
    SORT --> SELECT
    SELECT --> After

    classDef before fill:#fff3e0,stroke:#ef6c00
    classDef process fill:#f3e5f5,stroke:#4a148c
    classDef after fill:#e8f5e8,stroke:#1b5e20

    class R1,R2,R3,R4,R5 before
    class SORT,SELECT process
    class T1,T2,T3,T4,T5 after
// src/lib/reranker.ts
interface RankedResult extends SearchResult {
  compositeScore: number;
  semanticScore: number;
  keywordScore: number;
  freshnessScore: number;
  authorityScore: number;
}

async function rerankAndSelect(
  query: string,
  results: SearchResult[],
  embeddingModel: Embeddings,
  topK: number = 6,
): Promise<RankedResult[]> {
  // Compute all score dimensions
  const semanticScores = await scoreSemanticSimilarity(
    query, results, embeddingModel, 'cosine',
  );

  const ranked: RankedResult[] = results.map((result) => {
    const semantic = semanticScores.get(result.url) || 0;
    const keyword = scoreKeywordMatch(query, result.content);
    const freshness = scoreFreshness(result.publishedDate);
    const authority = scoreAuthority(result.url, result.content);

    return {
      ...result,
      semanticScore: semantic,
      keywordScore: keyword,
      freshnessScore: freshness,
      authorityScore: authority,
      compositeScore: computeCompositeScore(
        semantic, keyword, freshness, authority,
      ),
    };
  });

  // Sort descending by composite score and take top-k
  return ranked
    .sort((a, b) => b.compositeScore - a.compositeScore)
    .slice(0, topK);
}

Stage 4: Answer Synthesis

The top-ranked results are assembled into a context window and passed to the LLM for answer generation.

Context Assembly

// src/lib/synthesis.ts
function assembleContext(rankedResults: RankedResult[]): string {
  return rankedResults
    .map((result, index) => {
      return [
        `[${index + 1}] Source: ${result.url}`,
        `Title: ${result.title}`,
        `Content: ${result.content}`,
        `---`,
      ].join('\n');
    })
    .join('\n\n');
}
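With a couple of stub results (using a minimal source shape rather than the full `RankedResult`), the assembled context comes out as numbered blocks separated by `---` -- the `[n]` indices are exactly what the LLM later cites:

```typescript
// Minimal source shape for demonstration; the real pipeline passes RankedResult.
interface Source {
  url: string;
  title: string;
  content: string;
}

// Repeats assembleContext's logic from above.
function assembleContext(rankedResults: Source[]): string {
  return rankedResults
    .map((result, index) =>
      [
        `[${index + 1}] Source: ${result.url}`,
        `Title: ${result.title}`,
        `Content: ${result.content}`,
        `---`,
      ].join('\n'),
    )
    .join('\n\n');
}

const ctx = assembleContext([
  { url: 'https://example.com/a', title: 'A', content: 'alpha' },
  { url: 'https://example.com/b', title: 'B', content: 'beta' },
]);
```

The 1-based numbering matters: it must line up with both the citation rules in the synthesis prompt and the validity check in Stage 5.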

Answer Generation with Streaming

// src/lib/synthesis.ts
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';

const synthesisPrompt = ChatPromptTemplate.fromMessages([
  [
    'system',
    `You are Perplexica, an AI search engine. Given search results,
provide a comprehensive answer to the user's question.

Rules:
1. Use ONLY information from the provided search results.
2. Cite sources inline using [number] notation, e.g., [1], [2].
3. Every factual claim MUST have a citation.
4. If results are insufficient, say so honestly.
5. Structure your answer with markdown headings and lists.
6. Be comprehensive but concise.

Search Results:
{context}`,
  ],
  ['human', '{query}'],
]);

async function* generateStreamingAnswer(
  query: string,
  rankedResults: RankedResult[],
  chatModel: BaseChatModel,
): AsyncGenerator<string> {
  const context = assembleContext(rankedResults);

  const chain = synthesisPrompt.pipe(chatModel);

  const stream = await chain.stream({
    query,
    context,
  });

  for await (const chunk of stream) {
    yield chunk.content as string;
  }
}

Citation Format

Perplexica uses numbered inline citations that map to source cards displayed alongside the answer.

| Citation Style | Example | When Used |
| --- | --- | --- |
| Inline numbered | According to recent studies [1], ... | Default for all focus modes |
| Multi-source | This is supported by [1][3][5] | When multiple sources confirm a fact |
| Quoted | "exact quote from source" [2] | For direct quotations |
| Parenthetical | (Source: [4]) | For supplementary information |
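The citation-injection step itself is not shown in this chapter. One plausible sketch maps the `[n]` markers in a finished answer back to source metadata so the UI can render source cards (`extractCitedSources` and `SourceCard` are hypothetical names for illustration, not Perplexica's actual API):

```typescript
// Hypothetical helper: collect the distinct, valid [n] markers from an
// answer and resolve them against the ordered source list.
interface SourceCard {
  index: number;
  url: string;
  title: string;
}

function extractCitedSources(
  answer: string,
  sources: { url: string; title: string }[],
): SourceCard[] {
  const cited = new Set<number>();
  for (const m of answer.matchAll(/\[(\d+)\]/g)) {
    cited.add(parseInt(m[1], 10));
  }
  return [...cited]
    .filter((n) => n >= 1 && n <= sources.length) // drop hallucinated indices
    .sort((a, b) => a - b)
    .map((n) => ({ index: n, url: sources[n - 1].url, title: sources[n - 1].title }));
}

const cards = extractCitedSources('Foo [2] bar [1][2].', [
  { url: 'https://example.com/1', title: 'First' },
  { url: 'https://example.com/2', title: 'Second' },
]);
```

Filtering out-of-range indices here doubles as a cheap guard against the LLM citing a source number that was never in the context.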

Stage 5: Quality Assurance

Follow-up Question Generation

After generating the answer, Perplexica suggests related follow-up questions.

// src/lib/followUp.ts
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';

const followUpPrompt = ChatPromptTemplate.fromMessages([
  [
    'system',
    `Based on the user's question and the answer provided, suggest
3 follow-up questions that would help the user explore the topic further.
Return them as a JSON array of strings.

Original question: {query}
Answer summary: {answerSummary}`,
  ],
  ['human', 'Generate follow-up questions.'],
]);

async function generateFollowUpQuestions(
  query: string,
  answer: string,
  chatModel: BaseChatModel,
): Promise<string[]> {
  const chain = followUpPrompt.pipe(chatModel).pipe(new StringOutputParser());

  const result = await chain.invoke({
    query,
    answerSummary: answer.substring(0, 1000),
  });

  try {
    return JSON.parse(result);
  } catch {
    return [];
  }
}
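The bare `JSON.parse` above fails whenever the model wraps its output in markdown fences, which is common in practice. A defensive variant (a sketch added here, not part of the original code) strips fences before parsing:

```typescript
// Hypothetical helper: tolerate ```json ... ``` wrapping around the
// model's output before attempting to parse the array.
function parseJsonArray(raw: string): string[] {
  const cleaned = raw
    .replace(/^```(?:json)?\s*/i, '') // leading fence, with or without "json"
    .replace(/```\s*$/, '')           // trailing fence
    .trim();
  try {
    const parsed = JSON.parse(cleaned);
    // Only accept an array of strings; anything else degrades to empty.
    return Array.isArray(parsed)
      ? parsed.filter((x): x is string => typeof x === 'string')
      : [];
  } catch {
    return [];
  }
}
```

Returning `[]` on any failure keeps the follow-up feature strictly optional: a malformed model response never breaks the main answer.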

Factual Consistency Check

// src/lib/quality.ts
interface QualityReport {
  hasCitations: boolean;
  citationCount: number;
  allCitationsValid: boolean;
  estimatedFactualCoverage: number;
  warnings: string[];
}

function assessAnswerQuality(
  answer: string,
  sourceCount: number,
): QualityReport {
  // Extract citation references [n] from answer
  const citationPattern = /\[(\d+)\]/g;
  const citations = [...answer.matchAll(citationPattern)];
  const uniqueCitations = new Set(citations.map((m) => parseInt(m[1])));

  // Check all citations reference valid sources
  const allValid = [...uniqueCitations].every(
    (n) => n >= 1 && n <= sourceCount,
  );

  // Estimate factual coverage
  const sentences = answer.split(/[.!?]+/).filter((s) => s.trim().length > 20);
  const citedSentences = sentences.filter((s) => /\[\d+\]/.test(s));
  const coverage = sentences.length > 0
    ? citedSentences.length / sentences.length
    : 0;

  const warnings: string[] = [];
  if (coverage < 0.3) warnings.push('Low citation coverage');
  if (!allValid) warnings.push('Invalid citation references found');
  if (uniqueCitations.size < 2) warnings.push('Answer relies on single source');

  return {
    hasCitations: uniqueCitations.size > 0,
    citationCount: uniqueCitations.size,
    allCitationsValid: allValid,
    estimatedFactualCoverage: coverage,
    warnings,
  };
}
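Running the quality check on a small example shows how coverage and warnings behave (the function is repeated so the snippet is standalone):

```typescript
// Repeats assessAnswerQuality from above.
interface QualityReport {
  hasCitations: boolean;
  citationCount: number;
  allCitationsValid: boolean;
  estimatedFactualCoverage: number;
  warnings: string[];
}

function assessAnswerQuality(answer: string, sourceCount: number): QualityReport {
  const citations = [...answer.matchAll(/\[(\d+)\]/g)];
  const uniqueCitations = new Set(citations.map((m) => parseInt(m[1], 10)));

  const allValid = [...uniqueCitations].every((n) => n >= 1 && n <= sourceCount);

  const sentences = answer.split(/[.!?]+/).filter((s) => s.trim().length > 20);
  const citedSentences = sentences.filter((s) => /\[\d+\]/.test(s));
  const coverage = sentences.length > 0 ? citedSentences.length / sentences.length : 0;

  const warnings: string[] = [];
  if (coverage < 0.3) warnings.push('Low citation coverage');
  if (!allValid) warnings.push('Invalid citation references found');
  if (uniqueCitations.size < 2) warnings.push('Answer relies on single source');

  return {
    hasCitations: uniqueCitations.size > 0,
    citationCount: uniqueCitations.size,
    allCitationsValid: allValid,
    estimatedFactualCoverage: coverage,
    warnings,
  };
}

// A fully cited two-sentence answer against three sources: full coverage,
// two distinct citations, no warnings.
const report = assessAnswerQuality(
  'Rust prevents data races at compile time [1]. It uses ownership rules [2].',
  3,
);
```

Note the 20-character sentence filter: it keeps fragments and headings from dragging the coverage ratio down artificially.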

Tuning the Pipeline

Different query types benefit from different scoring weights. Here are recommended presets.

| Query Type | Semantic | Keyword | Freshness | Authority | Top-K |
| --- | --- | --- | --- | --- | --- |
| General knowledge | 0.40 | 0.25 | 0.15 | 0.20 | 6 |
| Current events / news | 0.25 | 0.20 | 0.40 | 0.15 | 8 |
| Academic research | 0.35 | 0.20 | 0.10 | 0.35 | 5 |
| Technical / how-to | 0.45 | 0.30 | 0.05 | 0.20 | 5 |
| Opinion / comparative | 0.40 | 0.20 | 0.10 | 0.30 | 8 |

// Dynamic weight selection based on focus mode
function getWeightsForFocusMode(mode: FocusMode): ScoringWeights {
  switch (mode) {
    case 'academicSearch':
      return { semantic: 0.35, keyword: 0.20, freshness: 0.10, authority: 0.35 };
    case 'youtubeSearch':
    case 'redditSearch':
      return { semantic: 0.40, keyword: 0.20, freshness: 0.30, authority: 0.10 };
    case 'writingAssistant':
      return { semantic: 0.50, keyword: 0.25, freshness: 0.05, authority: 0.20 };
    default:
      return DEFAULT_WEIGHTS;
  }
}
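Because the composite score is a weighted sum, each preset should sum to 1.0 so scores stay in the 0.0 - 1.0 range and remain comparable across focus modes. A small guard (an added suggestion, not part of the original code) can enforce this when presets are edited:

```typescript
// Hypothetical guard: verify that a weight preset sums to 1.0
// (within floating-point tolerance) before it is used for ranking.
type ScoringWeights = {
  semantic: number;
  keyword: number;
  freshness: number;
  authority: number;
};

function weightsSumToOne(w: ScoringWeights, eps = 1e-9): boolean {
  const sum = w.semantic + w.keyword + w.freshness + w.authority;
  return Math.abs(sum - 1) < eps;
}

// Every preset in the table above passes this check.
const academic = { semantic: 0.35, keyword: 0.20, freshness: 0.10, authority: 0.35 };
```

Wiring this into a unit test over all presets catches the classic tuning mistake: bumping one weight without rebalancing the others.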

Summary

Result processing is where Perplexica transforms raw search data into intelligent answers. The five-stage pipeline -- deduplication, relevance scoring, re-ranking, answer synthesis, and quality assurance -- ensures that the final output is accurate, well-cited, and comprehensive. Each stage is modular and configurable, allowing fine-tuning for different query types and use cases.

Key Takeaways

  • Multi-dimensional scoring -- semantic similarity, keyword matching, freshness, and authority each contribute to a weighted composite score.
  • Embedding-based re-ranking -- cosine similarity between query and document embeddings is the strongest signal for relevance.
  • Inline citations are mandatory -- every factual claim in the answer is traced back to a numbered source, ensuring verifiability.
  • Quality assurance is automated -- citation coverage and validity are checked programmatically before the answer is delivered.
  • Scoring weights are configurable per focus mode -- news queries emphasize freshness while academic queries emphasize authority.
  • Follow-up questions -- the LLM generates contextual follow-up suggestions to keep the research conversation flowing.

Next Steps

With result processing understood, the next chapter covers how these processed answers are presented to users -- the frontend user interface, component architecture, and real-time streaming display.

Ready for UI development? Continue to Chapter 6: User Interface Development


Built with insights from the Perplexica project.

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between deduplication, scoring, and ranking so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 5: Result Processing and Ranking as an operating subsystem inside Perplexica Tutorial: AI-Powered Search Engine, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around scoring, re-ranking, and synthesis as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 5: Result Processing and Ranking usually follows a repeatable control path:

  1. Context bootstrap: initialize runtime config and prerequisites for the pipeline run.
  2. Input normalization: deduplicate and shape incoming results so each scorer receives stable contracts.
  3. Core execution: run the scoring and ranking branch and propagate intermediate state through to synthesis.
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical answer payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

  • View Repo (github.com) -- why it matters: the authoritative upstream reference for the implementation discussed in this chapter.

Suggested trace strategy:

  • search upstream code for the scoring and re-ranking functions to map concrete implementation paths
  • compare docs claims against actual runtime/config code before reusing patterns in production

Chapter Connections