Pipeline and API to ingest cyber_act_docs/ PDFs, store embeddings in MongoDB Atlas, and serve a Retrieval-Augmented Generation (RAG) endpoint powered by Azure OpenAI.
- Copy `.env.example` to `.env` and fill in settings:

  ```
  # Required
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
  AZURE_OPENAI_API_KEY=your-key
  AZURE_OPENAI_API_VERSION=2024-02-15-preview
  AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-ada-002
  AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o
  MONGODB_URI=mongodb+srv://...

  # Optional (with defaults)
  MONGODB_DB=cyber_act
  MONGODB_COLLECTION=chunks
  MONGODB_VECTOR_INDEX=vector_index
  ```
- Install dependencies (a virtualenv is recommended):

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Create a MongoDB Atlas Vector Search index named `vector_index` on the `embedding` field (see `cyber_act_rag_architecture.md` for the JSON definition).
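As a rough sketch, an Atlas Vector Search index definition for this setup might look like the following (the 1536 dimensions match `text-embedding-ada-002`; the similarity metric and the `doc_id` filter field are assumptions here — the authoritative definition lives in `cyber_act_rag_architecture.md`):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "doc_id"
    }
  ]
}
```

A `filter` field is needed for any key you want to pass in the query endpoint's `filters` object.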
```bash
python -m ingest.pipeline --env .env

# optional override
python -m ingest.pipeline --docs-dir ./cyber_act_docs
```

This parses PDFs, chunks text, embeds with Azure OpenAI, and upserts vectors into MongoDB.
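The chunking step can be sketched as a sliding token window with overlap (a minimal illustration using the 800-token / 120-overlap defaults; the pipeline's real tokenizer is not shown here, so this sketch simply treats a pre-tokenized list as input):

```python
def chunk_tokens(tokens, size=800, overlap=120):
    """Split a token list into windows of `size` tokens,
    each sharing `overlap` tokens with the previous window."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Stand-in tokenization: whitespace split on a long string.
tokens = ("word " * 2000).split()
chunks = chunk_tokens(tokens)
print(len(chunks))      # 3 windows for 2000 tokens
print(len(chunks[0]))   # 800
```

Consecutive windows start 680 tokens apart, so each chunk repeats the last 120 tokens of the previous one, preserving context across chunk boundaries.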
```bash
python -m service.api  # uses HOST/PORT/RELOAD envs if set
```

Endpoints:
- `GET /health`
- `POST /query` with payload `{"question": "...", "top_k": 5, "filters": {"doc_id": "cyber_act_11_20_2024.pdf"}}`
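A minimal client call might look like this (the host and port assume the default `http://localhost:8000`; the payload shape follows the endpoint description above):

```python
import json

def build_query(question, top_k=5, filters=None):
    """Build the JSON payload for POST /query."""
    payload = {"question": question, "top_k": top_k}
    if filters:
        payload["filters"] = filters
    return payload

payload = build_query(
    "What are the reporting obligations for manufacturers?",
    top_k=5,
    filters={"doc_id": "cyber_act_11_20_2024.pdf"},
)
body = json.dumps(payload)

# To actually send it (requires the API to be running):
# import requests
# resp = requests.post("http://localhost:8000/query", json=payload)
# print(resp.json())
```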
The MCP server exposes the RAG API as a tool for AI assistants.
- Server script: `mcp_servers/cyber_act_rag_server.py`
- Requires `CYBER_ACT_API_URL` (defaults to `http://localhost:8000`)
- Exposed tool: `ask_cyber_act`
  - `question` (string, required)
  - `top_k` (int, default 5)
  - `filters` (object, e.g., `{"doc_id": "cyber_act_11_20_2024.pdf"}`)
Testing with MCP Inspector:
```bash
# Make sure RAG API is running first
python -m service.api &

# Launch inspector with your MCP server
npx @modelcontextprotocol/inspector python mcp_servers/cyber_act_rag_server.py
```

Analyze code files for CRA compliance risks using the MCP server:
```bash
python use_case_1/auto_compliance_check.py path/to/code.py \
  --server-cmd python \
  --server-args mcp_servers/cyber_act_rag_server.py \
  --top-k 6
```

See use_case_2/README.md for GitHub Actions and other CI/CD pipeline examples.
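As a rough sketch, wiring the compliance check into a GitHub Actions job could look like the following (the workflow name, trigger, Python version, and secret name are assumptions; use_case_2/README.md has the maintained examples):

```yaml
name: cra-compliance
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # Run the compliance check against the changed file(s)
      - run: |
          python use_case_1/auto_compliance_check.py path/to/code.py \
            --server-cmd python \
            --server-args mcp_servers/cyber_act_rag_server.py \
            --top-k 6
        env:
          CYBER_ACT_API_URL: ${{ secrets.CYBER_ACT_API_URL }}
```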
- Chunk sizing defaults are token-based (800 tokens, 120 overlap); tune via env vars.
- Ingestion batches embeddings (default 12) with retries/backoff via `tenacity`.
- The same Azure OpenAI endpoint/key is used for both embeddings and chat; ensure both deployments exist.
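The batching-and-retry behavior above can be illustrated without `tenacity` (the batch size mirrors the default of 12; the embed function and backoff values here are stand-ins — the real pipeline delegates retries to `tenacity`):

```python
import time

def batched(items, size=12):
    """Yield consecutive batches of `size` items (the default embedding batch)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_with_retry(embed_fn, batch, attempts=3, base_delay=1.0):
    """Call embed_fn(batch), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return embed_fn(batch)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

chunks = [f"chunk-{i}" for i in range(30)]
sizes = [len(b) for b in batched(chunks)]
print(sizes)  # two full batches of 12, then the remaining 6 -> [12, 12, 6]
```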