BAIO (Bioinformatics AI for Open-set detection) is a web-based metagenomic analysis platform that classifies DNA sequences with machine learning. The current deployed inference path uses 6-mer sequence features with saved scikit-learn SVM and RandomForest models to distinguish viral and host DNA, with an optional novelty flag based on low-confidence predictions.
docker compose up
# Frontend: http://localhost:4173
# Backend: http://localhost:8080
# Terminal 1 - Backend
source .venv/bin/activate
uvicorn api.main:app --reload --port 8080
# Terminal 2 - Frontend
cd frontend && npm install && npm run dev
# Frontend: http://localhost:5173
# API: http://localhost:8080
Ensure .env has GOOGLE_API_KEY set.
- Sequence Classification: Classifies DNA sequences into Virus or Host categories
- K-mer Analysis: Uses 6-mer frequency features for sequence representation
- Confidence Visualization: Color-coded confidence bars for each prediction
- GC Content Analysis: Heatmap showing GC content distribution
- Risk Assessment: Color-coded risk level indicators (Low/Moderate/High)
- Sequence Explanations: Per-sequence summaries based on prediction, confidence, GC content, and organism-name heuristics
- Dark Mode: Toggle between light and dark themes
- Export Options: Download results as JSON, CSV, or PDF
- AI Assistant: Gemini-powered chat for sequence analysis questions
- Novelty Flagging: Optional heuristic flag for low-confidence or out-of-distribution-looking sequences
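As a rough illustration of the GC content feature listed above, the sketch below computes a per-sequence GC fraction and a per-window series of the kind a heatmap could be drawn from. The function names and the window size are hypothetical and are not taken from the BAIO code base.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G/C bases among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Ambiguous bases such as N are ignored; an empty sequence yields 0.0.
    return gc / total if total else 0.0


def windowed_gc(sequence: str, window: int = 50) -> list[float]:
    """GC fraction of consecutive, non-overlapping windows (window size is an assumption)."""
    return [gc_content(sequence[i:i + window]) for i in range(0, len(sequence), window)]


print(gc_content("ATGC"))                 # 0.5
print(windowed_gc("GGGGAAAA", window=4))  # [1.0, 0.0]
```

Ignoring ambiguous bases in the denominator is one possible design choice; dividing by the full sequence length is equally common and gives lower values for N-rich reads.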
baio/
│
├── api/                              # FastAPI backend
│   ├── main.py                       # API endpoints (classify, chat, health)
│   ├── llm_client.py                 # Google Gemini AI client
│   └── Dockerfile                    # Docker container config
│
├── frontend/                         # React + Vite frontend
│   ├── src/
│   │   ├── components/
│   │   │   ├── Header.tsx            # App header with AI Assistant & model info
│   │   │   ├── SequenceInput.tsx     # FASTA input form
│   │   │   ├── ConfigPanel.tsx       # Classification settings
│   │   │   ├── ResultsDashboard.tsx  # Results table & visualizations
│   │   │   └── ChatWidget.tsx        # AI chat widget (legacy)
│   │   ├── App.tsx                   # Main application component
│   │   ├── api.ts                    # API client functions
│   │   └── types.ts                  # TypeScript interfaces
│   ├── package.json                  # Node.js dependencies
│   └── tailwind.config.js            # Tailwind configuration
│
├── binary_classifiers/               # ML classification core used by the API
│   ├── predict_class.py              # Loads saved models, predicts labels and probabilities
│   ├── evaluation.py                 # Evaluation helpers and metric generation
│   ├── transformers/                 # K-mer transformer + saved vectorizers
│   └── models/                       # Saved model files (*.pkl)
│
├── metaseq/                          # Experimental/research ML utilities
│   ├── dataio.py                     # FASTA/FASTQ file loaders
│   ├── models.py                     # Alternative pipeline definitions
│   └── train.py                      # Config-driven training entry point
│
├── prompting/                        # LLM prompting utilities
│
├── tests/                            # Unit tests
│
├── data/                             # Sample FASTA data
│
├── .env                              # API keys (GOOGLE_API_KEY)
│
├── requirements.txt                  # Python dependencies
├── environment.yml                   # Conda environment
├── docker-compose.yml                # Docker setup
└── README.md                         # This file
- Validate input DNA: The API checks sequence length, allowed nucleotide characters, GC/AT extremes, and ambiguous-base ratio before classification.
- Convert sequence to 6-mers: Each sequence is split into overlapping substrings of length 6. This turns DNA into a text-like representation the model can vectorize.
- Vectorize and classify: The saved vectorizer converts 6-mers into numeric features, then a saved SVM or RandomForest predicts Virus or Host.
- Estimate confidence: Confidence is derived from model probabilities plus prediction margin and entropy. The frontend shows that confidence directly.
- Apply novelty heuristic: If novelty mode is enabled, BAIO computes ood_score = 1 - confidence. This is a heuristic flag, not a full open-set model such as Mahalanobis distance or energy-based OOD detection.
- Generate a human-readable explanation: The API adds GC content, organism-name pattern matching, and a short explanation string for the UI.
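The first, second, and fifth steps above can be sketched in plain Python. This is a minimal illustration, not the code in binary_classifiers/: the function names, the accepted alphabet, and the thresholds `min_length` and `max_ambiguous` are assumptions made for the example. Only the 6-mer window and the formula ood_score = 1 - confidence come from the description above.

```python
from collections import Counter

VALID_BASES = set("ACGTN")  # assumption: N is the only ambiguity code accepted


def validate_sequence(seq: str, min_length: int = 6, max_ambiguous: float = 0.1) -> list[str]:
    """Return a list of problems; an empty list means the sequence passed.

    min_length and max_ambiguous are hypothetical values, not BAIO's settings.
    """
    seq = seq.upper()
    problems = []
    if len(seq) < min_length:
        problems.append("too short")
    if set(seq) - VALID_BASES:
        problems.append("invalid characters")
    if seq and seq.count("N") / len(seq) > max_ambiguous:
        problems.append("too many ambiguous bases")
    return problems


def to_kmers(seq: str, k: int = 6) -> list[str]:
    """Split a sequence into overlapping k-mers using a stride of 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ood_score(confidence: float) -> float:
    """Novelty heuristic from the text: ood_score = 1 - confidence."""
    return 1.0 - confidence


kmers = to_kmers("ATGCGTAC")
print(kmers)            # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(Counter(kmers))   # per-k-mer counts, the raw material for count-based features
print(ood_score(0.75))  # 0.25
```

A sequence shorter than k produces an empty k-mer list, which is one reason to run a length check before vectorization.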
BAIO supports Evo 2, a state-of-the-art DNA language model from Arc Institute, for higher-accuracy classification.
| Component | Requirement |
|---|---|
| GPU | NVIDIA GPU with 16GB+ VRAM (7B) or 80GB+ (40B) |
| CUDA | 12.0+ |
| Memory | 32GB+ RAM recommended |
| Model Size | VRAM | Use Case |
|---|---|---|
| 7B | 16GB | Single GPU, good accuracy |
| 40B | 80GB | Maximum accuracy, multi-GPU |
To use Evo 2 embeddings instead of k-mer features:
- Install dependencies: pip install transformers torch
- Check requirements: python binary_classifiers/evo2_embedder.py
- Update frontend settings: Select "Evo 2" model type in the configuration panel
- DNA Tokenization: Converts DNA sequences into tokens
- Transformer Processing: Uses StripedHyena 2 architecture
- Contextual Embeddings: Generates 4096-dimensional embeddings
- Classification: Uses embeddings for Virus/Host classification
| Method | Accuracy | Speed | GPU Required |
|---|---|---|---|
| K-mer + RandomForest | ~85% | Fast | No |
| Evo 2 7B + Classifier | ~95% | Medium | Yes (16GB) |
| Evo 2 40B + Classifier | ~98% | Slow | Yes (80GB) |
| Layer | Technology |
|---|---|
| Frontend | React + Vite + TypeScript + Tailwind CSS |
| Backend | FastAPI + Python 3.12 |
| ML | Scikit-learn (SVM, RandomForest) + optional research dependencies for embedding experiments |
| AI | Google Gemini API |
| DevOps | Docker, GitHub Actions, pytest |
This guide explains what each main file does.
| File | What it does |
|---|---|
| api/main.py | Main FastAPI server - handles all API requests like /classify, /chat, /health |
| api/llm_client.py | Connects to Google Gemini AI for the chat assistant |
| File | What it does |
|---|---|
| frontend/src/App.tsx | Main React component - ties everything together |
| frontend/src/components/Header.tsx | Top navigation bar with AI Assistant, dark mode, health status |
| frontend/src/components/SequenceInput.tsx | Input form for pasting DNA sequences or uploading FASTA files |
| frontend/src/components/ConfigPanel.tsx | Settings panel (threshold, model selection, OOD toggle) |
| frontend/src/components/ResultsDashboard.tsx | Shows classification results, tables, charts, export options |
| frontend/src/components/ChatWidget.tsx | AI Assistant floating chat window (legacy - now in Header) |
| frontend/src/api.ts | Functions to call backend API endpoints |
| File | What it does |
|---|---|
| binary_classifiers/predict_class.py | Core classification logic - takes DNA sequence, returns Virus/Host prediction |
| binary_classifiers/transformers/kmers_transformer.py | Converts raw DNA into overlapping 6-mers |
| binary_classifiers/evaluation.py | Loads labeled data and computes classifier metrics |
| retrain_model.py | Retrains saved SVM and RandomForest artifacts from local FASTA files |
| metaseq/train.py | Separate configurable training pipeline for experiments and future consolidation |
| File | What it does |
|---|---|
| retrain_model.py | Standalone script to retrain the ML model |
| predict_class.py | Quick prediction script for testing |
| scripts/evaluate_binary_classifier.py | Evaluates the deployed model on labeled host/virus FASTA or FASTQ data |
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.10+ (3.12 recommended) | Required for backend |
| Node.js | 18+ | Required for frontend |
| Git | Any recent version | For cloning |
| Docker | Optional | For containerized deployment |
git clone https://github.com/oss-slu/baio.git
cd baio
Option A: Conda (Recommended)
conda env create -f environment.yml
conda activate baio
Option B: Python venv
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Create .env in project root:
GOOGLE_API_KEY=your_google_api_key_here
# Build and run both frontend and backend
docker compose up --build
# Run in background
docker compose up -d --build
# Stop
docker compose down
Access:
- Frontend: http://localhost:4173
- Backend API: http://localhost:8080
- API Docs: http://localhost:8080/docs
# Activate environment
conda activate baio
# OR: source .venv/bin/activate
# Start API server
uvicorn api.main:app --reload --port 8080
Backend ready at: http://localhost:8080
cd frontend
# Install dependencies (first time only)
npm install
# Start development server
npm run dev
Frontend ready at: http://localhost:5173
# Check backend health
curl http://localhost:8080/health
# Frontend should load automatically
# Open: http://localhost:5173

- Open Browser: Navigate to http://localhost:5173
- Enter Sequences:
  - Paste DNA sequences directly, OR
  - Upload a FASTA file
- Configure Settings (optional):
  - Confidence threshold (default: 0.75)
  - Enable/disable open-set detection
  - Select model type
- Classify: Click "Classify Sequences"
- View Results:
  - Classification (Virus/Host)
  - Confidence bars
  - Risk indicators (Low/Moderate/High)
  - GC content
- Expand Row: Click any row to see:
  - Prediction explanation
  - Confidence and novelty indicators
  - Sequence preview and organism-name heuristic
- Export: Download as JSON, CSV, or PDF
- AI Assistant: Use the chat widget to ask questions about your sequences
If you need to retrain the model:
python retrain_model.py
This will:
- Load training data from data/
- Extract k-mer features
- Train both RandomForest and SVM classifiers
- Save model and vectorizer artifacts under binary_classifiers/
The default training data in data/covid_reads5.fasta and data/human_reads5.fasta is only a tiny demo dataset. It is useful for development, but not enough for a robust biological classifier.
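To make the data-loading step of retraining more concrete, here is a minimal FASTA reader in plain Python. It is a sketch under stated assumptions, not the loader used by retrain_model.py or metaseq/dataio.py: `read_fasta` and `load_labeled` are hypothetical names, and the "Virus"/"Host" label strings simply mirror the classes named in this README.

```python
from pathlib import Path


def read_fasta(path: str) -> dict[str, str]:
    """Parse a FASTA file into {record_id: sequence}, joining multi-line sequences.

    Records that share an ID overwrite each other; lines before the first
    header are ignored.
    """
    records: dict[str, list[str]] = {}
    current = None
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first whitespace-delimited token as the ID, or a fallback name.
            current = parts[0] if parts else f"record_{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


def load_labeled(virus_path: str, host_path: str) -> tuple[list[str], list[str]]:
    """Return (sequences, labels) with one label per sequence."""
    sequences, labels = [], []
    for path, label in ((virus_path, "Virus"), (host_path, "Host")):
        for seq in read_fasta(path).values():
            sequences.append(seq)
            labels.append(label)
    return sequences, labels
```

With the demo files in place, a call such as `load_labeled("data/covid_reads5.fasta", "data/human_reads5.fasta")` would return the sequences and their labels in matching order.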
To evaluate the current saved models on labeled files:
python scripts/evaluate_binary_classifier.py --model RandomForest
python scripts/evaluate_binary_classifier.py --model SVM --output runs/metrics/svm_eval.json
The evaluation script reports:
- accuracy, precision, recall, F1, and ROC-AUC
- confusion matrix
- per-class report
- misclassified sequence IDs with confidence and virus probability
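For readers who want to see how most of the reported numbers relate to each other, the sketch below derives accuracy, precision, recall, F1, and a confusion matrix from true and predicted labels. It is an illustration in plain Python, not the implementation in scripts/evaluate_binary_classifier.py. The function name is hypothetical, "Virus" is assumed to be the positive class, and ROC-AUC is left out because it needs predicted probabilities rather than labels.

```python
def binary_metrics(y_true: list[str], y_pred: list[str], positive: str = "Virus") -> dict:
    """Compute basic binary classification metrics from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes: [negative, positive].
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


result = binary_metrics(
    ["Virus", "Virus", "Host", "Host"],
    ["Virus", "Host", "Host", "Host"],
)
print(result["accuracy"], result["precision"], result["recall"])  # 0.75 1.0 0.5
print(result["confusion_matrix"])                                 # [[2, 0], [1, 1]]
```

On a dataset as small as the demo files, a single misclassified read moves these metrics by large steps, so they are best read as a smoke test rather than a performance estimate.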
Run Tests:
# Activate environment first
conda activate baio
# Run all tests
pytest tests/
# Run specific test
pytest tests/test_api_classification.py
# With coverage
pytest --cov=. tests/
Code Quality:
# Format code
black .
# Lint
ruff check .
# Type check
mypy .
# All at once
black . && ruff check . && mypy .

| Issue | Cause | Solution |
|---|---|---|
| python command not found | Python not in PATH | Use python3 |
| Port 8080 in use | Another process | lsof -i :8080 then kill -9 <PID> |
| ModuleNotFoundError | Environment not activated | conda activate baio |
| npm install fails | Node version | Upgrade to Node.js 18+ |
| Gemini API error | Invalid API key | Check .env file |
| Model returns "Novel" | Low confidence or heuristic novelty threshold triggered | Inspect confidence_threshold and ood_threshold, then evaluate/retrain with broader data |
| Docker build fails | Cache issue | docker system prune -a |
| Conda env fails | Conflict | Use mamba env create -f environment.yml |
- Small Demo Dataset: The default demo retraining data in data/ is very small: 5 virus reads and 5 host reads.
- Heuristic Novelty Score: The novelty score is heuristic-based, so "Novel" should be treated as "needs further validation," not proof of a new pathogen.
- Limited Scope: Current models distinguish only Virus vs Host. Multi-class classification coming in future versions.
| Service | URL |
|---|---|
| Frontend (Dev) | http://localhost:5173 |
| Frontend (Docker) | http://localhost:4173 |
| API Docs | http://localhost:8080/docs |
| API Health | http://localhost:8080/health |
| GitHub Repository | https://github.com/oss-slu/baio |
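The health URL in the table above can also be probed from a script, for example before running an automated job against the API. The helper below is a small sketch using only the standard library. The name `check_health` and the default timeout are assumptions, and it checks only the HTTP status because the shape of the /health response body is not documented here.

```python
import urllib.request


def check_health(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections are all OSError subclasses.
        return False
```

Calling `check_health()` while the backend is stopped returns False instead of raising, which keeps it convenient for startup loops that wait for the server.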
MIT License - see LICENSE
- Mainuddin - Tech Lead
- Luis Palmejar - Developer
- Kevin Yang - Developer
- Vrinda Thakur - Developer
Note: This is a research prototype. For production use in clinical settings, additional validation and regulatory approval may be required.