Vision Text Extractor - Genkit + Ollama + Next.js

A sample web application for extracting text from images using Google Genkit and vision models served by Ollama. This project demonstrates how to integrate these technologies to build AI applications on local models small enough to run on a developer laptop.

Vision Text Extractor Screenshot

Features

  • πŸ–ΌοΈ Multiple Input Methods: Upload images via drag-and-drop, file selection, URL, or paste from clipboard
  • πŸ€– Multiple Vision Models: Support for LLaVA (7B, 13B, 34B) and Gemma 3 (27B) vision models through Ollama
  • 🎨 Modern UI: Compact layout with side-by-side input/output, dark mode support
  • ⚑ Real-time Streaming: See results as they're generated with live updates
  • πŸ”§ Customizable Prompts: Pre-built templates for common use cases and custom prompt support
  • πŸ“Š Multiple Output Formats: View results as text, JSON, or markdown
  • πŸ’Ύ Export Options: Download or copy extracted text with one click
  • πŸ” Compact Image Preview: Zoom and rotate images without excessive scrolling
  • πŸ“Œ Sticky Results Panel: Results stay visible while you adjust settings
  • 🚨 Smart Error Handling: Clear messages when Ollama is not running or models are not installed

Prerequisites

  • Node.js 18.x or later
  • Ollama installed and running
  • At least one vision-capable model installed (e.g., llava:7b)

Installing Vision Models

# Install recommended LLaVA vision model (default)
ollama pull llava:7b       # Fast, good quality, 4.7GB (recommended)

# Alternative models
ollama pull llava:13b      # Better accuracy, 7.3GB
ollama pull llava:34b      # Best accuracy, 20GB
ollama pull gemma3:27b     # Google's large model, 17GB

Note: This sample has been tested with LLaVA and Gemma models. Other vision-capable models available in Ollama should work as well. The app will automatically detect all installed models that support vision capabilities.

Installation

  1. Clone this repository:
git clone <your-repo-url>
cd genkit-vision-nextjs
  2. Install dependencies:
npm install
  3. Start the development server:
npm run dev
  4. Open http://localhost:9002 in your browser (this project's dev server uses port 9002; see the Development section below)

Project Structure

genkit-vision-nextjs/
├── app/
│   ├── api/
│   │   ├── extract-text/    # Genkit flow API endpoint
│   │   └── models/          # Model availability endpoint
│   ├── components/          # React components
│   │   ├── Header.tsx       # App header with theme toggle
│   │   ├── ImageUpload.tsx  # Image input component
│   │   ├── ImagePreview.tsx # Image viewer with controls
│   │   ├── ModelSelector.tsx # Model selection dropdown
│   │   ├── PromptInput.tsx  # Prompt customization
│   │   └── ExtractionResults.tsx # Results display
│   ├── hooks/              # Custom React hooks
│   └── page.tsx           # Main application page
├── lib/
│   ├── genkit/
│   │   ├── config.ts      # Genkit and model configurations
│   │   └── flows.ts       # Text extraction flow
│   └── utils.ts           # Utility functions
└── public/                # Static assets

How It Works

Genkit Integration

This app uses the genkitx-ollama plugin to integrate with local vision models. The core logic is defined in a Genkit flow:

// lib/genkit/flows.ts
import { ai } from './config';
import { ollama } from 'genkitx-ollama';

export const extractTextFromImage = ai.defineFlow(
  {
    name: 'extractTextFromImage',
    // ... schema definitions
  },
  async (input) => {
    const response = await ai.generate({
      model: ollama.model(input.model),
      prompt: [
        { text: input.prompt },
        { media: { contentType: 'image/jpeg', url: `data:image/jpeg;base64,${input.imageBase64}` } },
      ],
    });
    return { extractedText: response.text };
  }
);
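
The // ... schema definitions comment above elides the input/output schemas. As a rough sketch only — inferred from the fields the frontend sends and the flow's return shape, not copied from lib/genkit/flows.ts — they might look like this, using the Zod instance Genkit re-exports:

// Hypothetical sketch of the elided schemas; names are illustrative.
import { z } from 'genkit';

const ExtractTextInput = z.object({
  model: z.string(),       // Ollama model name, e.g. 'llava:7b'
  imageBase64: z.string(), // bare base64 image data (no 'data:' prefix)
  prompt: z.string(),      // extraction instructions for the model
});

const ExtractTextOutput = z.object({
  extractedText: z.string(),
});

Since the frontend consumes this flow with streamFlow, the real flow presumably also declares a streamSchema and pushes partial text through the sendChunk callback Genkit passes to flow handlers.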

API Routes

The Genkit flow is exposed as a Next.js API route using the appRoute helper:

// app/api/extract-text/route.ts
import { appRoute } from '@genkit-ai/next';
import { extractTextFromImage } from '@/lib/genkit/flows';

export const POST = appRoute(extractTextFromImage);

Frontend Integration

The React frontend uses Genkit's client SDK for type-safe API calls with streaming:

import { streamFlow } from '@genkit-ai/next/client';
import type { extractTextFromImage } from '@/lib/genkit/flows';

// Use streamFlow for streaming responses
const { stream, output } = streamFlow<typeof extractTextFromImage>({
  url: '/api/extract-text',
  input: {
    model: selectedModel,
    imageBase64: base64Image,
    prompt: extractionPrompt,
  }
});

// Process streaming chunks
for await (const chunk of stream) {
  console.log(chunk);
}

// Get final result
const result = await output;
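
Note that imageBase64 is the bare base64 payload; the flow prepends the data: URL prefix itself. One way to produce that string from a File in the browser — a minimal sketch, not necessarily how this app's ImageUpload component does it:

// Hypothetical helper: read a File into a bare base64 string.
// readAsDataURL yields 'data:image/png;base64,....', so everything
// up to and including the comma is stripped before sending to the flow.
function fileToBase64(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      const dataUrl = reader.result as string;
      resolve(dataUrl.slice(dataUrl.indexOf(',') + 1));
    };
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

// Usage: const base64Image = await fileToBase64(selectedFile);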

Configuration

Environment Variables

Create a .env.local file:

# Ollama server address (optional, defaults to http://127.0.0.1:11434)
OLLAMA_SERVER_ADDRESS=http://127.0.0.1:11434

Model Configuration

The app uses the genkitx-ollama plugin with dynamic model discovery:

// lib/genkit/config.ts
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

export const ai = genkit({
  plugins: [
    ollama({
      serverAddress: process.env.OLLAMA_SERVER_ADDRESS || 'http://127.0.0.1:11434',
    }),
  ],
});

The app automatically discovers all vision-capable models installed in Ollama. No hardcoded model list is required!
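
One plausible way to implement that discovery against Ollama's HTTP API is sketched below; the app's actual implementation may differ, and the capabilities field of /api/show is only returned by reasonably recent Ollama releases:

// Hypothetical sketch: list installed models and keep the vision-capable ones.
// GET /api/tags lists installed models; POST /api/show returns per-model
// details, including a 'capabilities' array in recent Ollama versions.
const OLLAMA = process.env.OLLAMA_SERVER_ADDRESS || 'http://127.0.0.1:11434';

async function listVisionModels(): Promise<string[]> {
  const tags = await fetch(`${OLLAMA}/api/tags`).then((r) => r.json());
  const names: string[] = tags.models.map((m: { name: string }) => m.name);

  const visionModels: string[] = [];
  for (const name of names) {
    const info = await fetch(`${OLLAMA}/api/show`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: name }),
    }).then((r) => r.json());
    if (info.capabilities?.includes('vision')) visionModels.push(name);
  }
  return visionModels;
}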

Usage

  1. Upload an Image: Drag and drop, select a file, paste from clipboard, or provide a URL
  2. Select a Model: Choose from available vision models (green checkmark indicates installed models)
  3. Customize the Prompt: Use preset prompts or create your own
  4. Extract Text: Click the button to start extraction
  5. View Results: See extracted text with streaming updates
  6. Export: Copy to clipboard or download as text/JSON

Development

For the best development experience, we recommend running the Next.js frontend and the Genkit runtime in separate terminal sessions. This allows you to see logs from both processes independently and ensures the Genkit Developer UI functions correctly.

Running the Development Environment

  1. Start the Genkit Runtime:

    Open a terminal and run the following command to start the Genkit runtime with hot-reloading. This will also launch the Genkit Developer UI.

    npm run genkit:watch

    The Genkit Developer UI will be available at http://localhost:4000.

  2. Start the Frontend Application:

    In a second terminal, run the following command to start the Next.js development server.

    npm run dev

    Your application will be available at http://localhost:9002.
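
For reference, these two commands typically map to package.json scripts along the following lines. The values here are assumptions (entry point, watcher, and port flag) — check this repo's package.json for the actual entries:

{
  "scripts": {
    "dev": "next dev -p 9002",
    "genkit:watch": "genkit start -- npx tsx --watch lib/genkit/flows.ts"
  }
}

The genkit start -- <command> form runs your flow code under the Genkit runtime so the Developer UI on port 4000 can inspect it.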

Note: It is important to run the Genkit runtime (genkit start) separately from the frontend development server (npm run dev). Attempting to run them together with a command like genkit start -- npm run dev can lead to connection issues with the Genkit Developer UI, as the Next.js server runs in its own process and may not expose the necessary hooks for the UI to connect to the runtime.

Creating a Production Build

To build the application for production and serve it locally, run the following commands:

npm run build
npm start

Troubleshooting

Ollama Connection Issues

If the app can't connect to Ollama:

  1. Ensure Ollama is running: ollama serve
  2. Check the API endpoint in your environment variables
  3. Verify models are installed: ollama list

Model Not Available

If a model shows as unavailable:

  1. Install it with Ollama: ollama pull model-name
  2. Refresh the page to update model availability

Image Processing Errors

  • Ensure images are under 10MB
  • Supported formats: PNG, JPG, JPEG, GIF, WebP
  • For URLs, ensure CORS is enabled on the image server
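
A quick client-side pre-flight check that enforces these limits might look like the sketch below (the app's real validation may live in ImageUpload.tsx and differ in detail):

// Hypothetical pre-flight validation matching the limits listed above.
const MAX_BYTES = 10 * 1024 * 1024; // 10MB
const SUPPORTED = ['image/png', 'image/jpeg', 'image/gif', 'image/webp'];

function validateImage(file: File): string | null {
  if (!SUPPORTED.includes(file.type)) {
    return `Unsupported format: ${file.type || 'unknown'}`;
  }
  if (file.size > MAX_BYTES) {
    return `Image is ${(file.size / (1024 * 1024)).toFixed(1)}MB; the limit is 10MB`;
  }
  return null; // null means the image is acceptable
}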

Contributing

This project is provided as-is, and I do not plan to accept pull requests. Please feel free to fork the repository and make any changes you'd like.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
