Pipeline and API to ingest cyber_act_docs/ PDFs, store embeddings in MongoDB Atlas, and serve a Retrieval-Augmented Generation (RAG) endpoint powered by Azure OpenAI.
- Copy `.env.example` to `.env` and fill in settings:

  ```
  # Required
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
  AZURE_OPENAI_API_KEY=your-key
  AZURE_OPENAI_API_VERSION=2024-02-15-preview
  AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-ada-002
  AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-4o
  MONGODB_URI=mongodb+srv://...

  # Optional (with defaults)
  MONGODB_DB=cyber_act
  MONGODB_COLLECTION=chunks
  MONGODB_VECTOR_INDEX=vector_index
  ```
- Install dependencies (a virtualenv is recommended):

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Create a MongoDB Atlas Vector Search index named `vector_index` on the `embedding` field (see `cyber_act_rag_architecture.md` for the JSON definition).
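As a rough sketch, an Atlas Vector Search index definition for this setup might look like the following (the 1536 dimensions match `text-embedding-ada-002`; the similarity metric and the `doc_id` filter field are assumptions here — the authoritative definition lives in `cyber_act_rag_architecture.md`):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "doc_id"
    }
  ]
}
```

A `filter` field is needed for any key you want to pass in the query endpoint's `filters` object.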
```bash
python -m ingest.pipeline --env .env

# optional override
python -m ingest.pipeline --docs-dir ./cyber_act_docs
```

This parses PDFs, chunks text, embeds with Azure OpenAI, and upserts vectors into MongoDB.
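The chunking step can be sketched as a sliding token window with overlap (a minimal illustration using the 800-token / 120-overlap defaults; the pipeline's real tokenizer is not shown here, so this sketch simply treats a pre-tokenized list as input):

```python
def chunk_tokens(tokens, size=800, overlap=120):
    """Split a token list into windows of `size` tokens,
    each sharing `overlap` tokens with the previous window."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Stand-in tokenization: whitespace split on a long string.
tokens = ("word " * 2000).split()
chunks = chunk_tokens(tokens)
print(len(chunks))      # 3 windows for 2000 tokens
print(len(chunks[0]))   # 800
```

Consecutive windows start 680 tokens apart, so each chunk repeats the last 120 tokens of the previous one, preserving context across chunk boundaries.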
```bash
python -m service.api  # uses HOST/PORT/RELOAD envs if set
```

Endpoints:
- `GET /health`
- `POST /query` with payload `{"question": "...", "top_k": 5, "filters": {"doc_id": "cyber_act_11_20_2024.pdf"}}`
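A minimal client call might look like this (the host and port assume the default `http://localhost:8000`; the payload shape follows the endpoint description above):

```python
import json

def build_query(question, top_k=5, filters=None):
    """Build the JSON payload for POST /query."""
    payload = {"question": question, "top_k": top_k}
    if filters:
        payload["filters"] = filters
    return payload

payload = build_query(
    "What are the reporting obligations for manufacturers?",
    top_k=5,
    filters={"doc_id": "cyber_act_11_20_2024.pdf"},
)
body = json.dumps(payload)

# To actually send it (requires the API to be running):
# import requests
# resp = requests.post("http://localhost:8000/query", json=payload)
# print(resp.json())
```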
The MCP server exposes the RAG API as a tool for AI assistants.
- Server script: `mcp_servers/cyber_act_rag_server.py`
- Requires `CYBER_ACT_API_URL` (defaults to `http://localhost:8000`)
- Exposed tool: `ask_cyber_act`
  - `question` (string, required)
  - `top_k` (int, default 5)
  - `filters` (object, e.g., `{"doc_id": "cyber_act_11_20_2024.pdf"}`)
Testing with MCP Inspector:
```bash
# Make sure RAG API is running first
python -m service.api &

# Launch inspector with your MCP server
npx @modelcontextprotocol/inspector python mcp_servers/cyber_act_rag_server.py
```

Analyze code files for CRA compliance risks using the MCP server:
```bash
python use_case_1/auto_compliance_check.py path/to/code.py \
  --server-cmd python \
  --server-args mcp_servers/cyber_act_rag_server.py \
  --top-k 6
```

See use_case_2/README.md for GitHub Actions and other CI/CD pipeline examples.
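As a rough sketch, wiring the compliance check into a GitHub Actions job could look like the following (the workflow name, trigger, Python version, and secret name are assumptions; use_case_2/README.md has the maintained examples):

```yaml
name: cra-compliance
on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # Run the compliance check against the changed file(s)
      - run: |
          python use_case_1/auto_compliance_check.py path/to/code.py \
            --server-cmd python \
            --server-args mcp_servers/cyber_act_rag_server.py \
            --top-k 6
        env:
          CYBER_ACT_API_URL: ${{ secrets.CYBER_ACT_API_URL }}
```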
- Chunk sizing defaults are token-based (800 tokens, 120 overlap); tune via env vars.
- Ingestion batches embeddings (default 12) with retries/backoff via `tenacity`.
- The same Azure OpenAI endpoint/key is used for both embeddings and chat; ensure both deployments exist.
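The batching-and-retry behavior above can be illustrated without `tenacity` (the batch size mirrors the default of 12; the embed function and backoff values here are stand-ins — the real pipeline delegates retries to `tenacity`):

```python
import time

def batched(items, size=12):
    """Yield consecutive batches of `size` items (the default embedding batch)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_with_retry(embed_fn, batch, attempts=3, base_delay=1.0):
    """Call embed_fn(batch), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return embed_fn(batch)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

chunks = [f"chunk-{i}" for i in range(30)]
sizes = [len(b) for b in batched(chunks)]
print(sizes)  # two full batches of 12, then the remaining 6 -> [12, 12, 6]
```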