This repository contains a production-inspired AI workflow prototype demonstrating how to ingest, enrich, store, and expose structured and unstructured data using modern LLM-centric architecture.
The project reflects real-world AI system design principles while remaining intentionally lightweight and easy to reason about for a technical assessment.
This prototype satisfies the following requirements:
- Ingest data from two distinct sources:
  - One structured data source
  - One unstructured text-based source
- Clean, normalize, and combine both data sources
- Enrich data using a Large Language Model
- Orchestrate the workflow using Python and LangChain
- Persist enriched data in a relational database
- Enable semantic search using a vector database
- Expose results through an API and a basic user interface
- Language: Python 3.10+
- API Framework: FastAPI
- LLM Provider: OpenAI
- LLM Orchestration: LangChain
- Relational Database: SQLite
- Vector Database: Pinecone
- Data Processing: Pandas
- Text Matching: RapidFuzz
- Frontend Framework: Streamlit
A CSV-based dataset representing structured entities such as products, tickets, or records.
Typical fields:
- ID
- Name
- Category
- Price or metadata
Loaded and processed using Pandas.
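For illustration, the loading step amounts to something like the following (the column names here are placeholders, not necessarily the exact CSV schema):

```python
import pandas as pd

# Load the structured CSV and apply light normalization.
df = pd.read_csv("data/structured.csv")
df = df.drop_duplicates(subset=["id"])                    # remove duplicate rows
df = df.dropna(subset=["id", "name"])                     # drop invalid rows
df["category"] = df["category"].str.strip().str.lower()   # normalize categorical values
```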
A text-based dataset such as:
- User reviews
- Support tickets
- Free-form notes or comments
Stored as JSON or text files and cleaned prior to enrichment.
This design explicitly demonstrates handling heterogeneous data sources.
- Load structured CSV data
- Load unstructured text documents
- Normalize text and categorical values
- Remove duplicates and invalid rows
- Apply fuzzy matching where required to associate documents with structured entities
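As a rough sketch of the fuzzy-matching step (the field names and threshold are illustrative, not the exact logic in src/merge.py):

```python
from rapidfuzz import process, fuzz

def match_entity(mention: str, entity_names: list[str], threshold: int = 85) -> str | None:
    """Return the structured entity name that best matches a free-text mention."""
    result = process.extractOne(mention, entity_names, scorer=fuzz.token_sort_ratio)
    if result and result[1] >= threshold:
        return result[0]   # (choice, score, index) -> best-matching name
    return None            # no confident match; leave the document unlinked

# Example: associate a review's product mention with a catalog entry.
print(match_entity("acme widget pro", ["Acme Widget Pro 2000", "Other Gadget"]))
```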
OpenAI models are used via LangChain to perform:
- Sentiment classification
- Topic extraction
- One-sentence summarization
- Metadata normalization
LangChain chains give the enrichment steps a consistent, repeatable structure and make them easy to extend.
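A simplified sketch of one enrichment chain (the model name and prompt wording are illustrative; the actual chains live in src/enrich_llm.py):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Sentiment classification as a small prompt -> model -> parser chain.
prompt = ChatPromptTemplate.from_template(
    "Classify the sentiment of this text as positive, negative, or neutral:\n\n{text}"
)
sentiment_chain = prompt | llm | StrOutputParser()

print(sentiment_chain.invoke({"text": "Arrived late, but the product itself works great."}))
```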
- SQLite
Stores the final enriched dataset, including:
- Structured metadata
- Source text
- LLM-generated sentiment
- Extracted topics
- Generated summaries
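An indicative table layout (the authoritative schema is whatever src/database.py creates; the column names are assumptions, apart from the enriched_records table name used elsewhere in this README):

```python
import sqlite3

conn = sqlite3.connect("enriched_data.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS enriched_records (
        id          TEXT PRIMARY KEY,
        name        TEXT,
        category    TEXT,
        source_text TEXT,   -- original unstructured text
        sentiment   TEXT,   -- LLM-generated sentiment
        topics      TEXT,   -- extracted topics, e.g. a JSON-encoded list
        summary     TEXT    -- one-sentence summary
    )
    """
)
conn.commit()
conn.close()
```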
- Pinecone
Used to:
- Store embeddings for unstructured text
- Enable semantic similarity search
- Support hybrid search workflows when combined with relational filters
Embeddings are generated during ingestion and upserted with metadata.
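A sketch of the embed-and-upsert step, assuming the current OpenAI and Pinecone Python clients (the embedding model and metadata fields are placeholders):

```python
import os
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

def upsert_document(doc_id: str, text: str, metadata: dict) -> None:
    # Embed the raw text and store the vector together with filterable metadata.
    emb = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    index.upsert(vectors=[{"id": doc_id, "values": emb.data[0].embedding, "metadata": metadata}])

upsert_document("rec-1", "Great battery life, disappointing screen.", {"sentiment": "mixed"})
```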
The backend exposes a RESTful API using FastAPI.
- GET /records
- GET /records/{id}
- GET /search?query=
Interactive API documentation is available at /docs.
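The route layout follows this general shape (a simplified skeleton; the real handlers in src/api.py query SQLite and Pinecone):

```python
from fastapi import FastAPI

app = FastAPI(title="AI Workflow Demo")

@app.get("/records")
def list_records():
    # Return all enriched records from SQLite.
    return []

@app.get("/records/{record_id}")
def get_record(record_id: str):
    # Return a single enriched record by ID.
    return {"id": record_id}

@app.get("/search")
def semantic_search(query: str):
    # Embed the query, search Pinecone, and join the hits with relational metadata.
    return {"query": query, "results": []}
```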
A lightweight Streamlit interface provides:
- Browsing of enriched records
- Filtering by sentiment and topic
- Display of LLM-generated summaries
- Semantic search powered by Pinecone
The frontend consumes data exclusively through the FastAPI backend.
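A minimal sketch of that pattern (streamlit_app.py adds filtering and semantic search on top of this; the API base URL is wherever you run the backend):

```python
import requests
import streamlit as st

API_BASE_URL = "http://localhost:8000"  # or the deployed API URL

st.title("Enriched Records")

# All data flows through the FastAPI backend; the UI never reads the database directly.
records = requests.get(f"{API_BASE_URL}/records", timeout=30).json()

sentiment = st.selectbox("Filter by sentiment", ["all", "positive", "neutral", "negative"])
for record in records:
    if sentiment == "all" or record.get("sentiment") == sentiment:
        st.subheader(record.get("name") or record.get("id", "record"))
        st.write(record.get("summary", ""))
```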
ai-workflow-demo/
│
├── data/
│   ├── structured.csv
│   └── unstructured.json
│
├── src/
│   ├── ingest.py
│   ├── clean.py
│   ├── merge.py
│   ├── enrich_llm.py
│   ├── embeddings.py
│   ├── database.py
│   ├── api.py
│   └── run_pipeline.py
│
├── streamlit_app.py
├── requirements.txt
├── README.md
└── .env.example
pip install -r requirements.txt
Create a .env file:
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX_NAME=your_index_name
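These values are read from the environment at runtime; a typical loading pattern (assuming python-dotenv is available, which is an assumption rather than a documented dependency) looks like:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # picks up the .env file in the project root, if present

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_INDEX_NAME = os.environ["PINECONE_INDEX_NAME"]
```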
python src/run_pipeline.py
uvicorn src.api:app --reload
streamlit run streamlit_app.py
If extended to production, this system could scale by:
- Migrating SQLite to PostgreSQL or a data warehouse
- Running LLM enrichment as async batch jobs
- Introducing a queue-based ingestion pipeline
- Caching LLM outputs to control cost
- Expanding Pinecone usage for RAG workflows
- Adding observability and cost monitoring
- Deploying via CI/CD with containerized services
This project demonstrates:
- Multi-source data ingestion
- Structured and unstructured data integration
- LLM-powered enrichment using OpenAI
- Workflow orchestration with LangChain
- Separation of relational and vector storage
- Clean API design
- Lightweight and usable frontend
It reflects production-minded AI engineering principles in a concise, extensible prototype.
This project includes configuration files for deploying to Fly.io's free tier.
1. Install Fly CLI
   - Windows (PowerShell):
     powershell -Command "iwr https://fly.io/install.ps1 -useb | iex"
   - macOS/Linux:
     curl -L https://fly.io/install.sh | sh
2. Log in to Fly.io
   fly auth login
   This opens a browser window for authentication. Complete the login process.
3. Create and Launch the Application
   fly launch
   During the launch process, you'll be prompted with several questions:
   - App name: Press Enter to use `product-review` (or type a different name)
   - Region: Press Enter to use `dfw` (or select a different region)
   - Postgres database: Type `n` and press Enter (we're using SQLite)
   - Redis: Type `n` and press Enter (not needed for this project)
   - Deploy now?: Type `n` and press Enter (we'll set secrets first)
   This command creates the app on Fly.io and prepares it for deployment.
4. Set Environment Variables (Secrets)
   Set your API keys as secrets (these are encrypted and stored securely):
   fly secrets set OPENAI_API_KEY=your_openai_key
   fly secrets set PINECONE_API_KEY=your_pinecone_key
   fly secrets set PINECONE_INDEX_NAME=your_index_name
   Replace `your_openai_key`, `your_pinecone_key`, and `your_index_name` with your actual values.
   Note: You can set all secrets at once:
   fly secrets set OPENAI_API_KEY=your_openai_key PINECONE_API_KEY=your_pinecone_key PINECONE_INDEX_NAME=your_index_name
5. Deploy the Application
   fly deploy
   This command will:
   - Build the Docker image using the `Dockerfile`
   - Push the image to Fly.io
   - Deploy the application to the cloud
   - Start the application
   The deployment process may take a few minutes. You'll see build logs and deployment progress.
6. Verify Deployment
   fly status
   fly logs
7. Open Your Application
   fly open
   Or visit: https://product-review.fly.dev
8. Run the Data Pipeline: After deployment, you MUST run the data ingestion and enrichment pipeline to populate the database. The database file is created automatically, but it starts empty. Run:
   fly ssh console -C "python src/run_pipeline.py"
   Important: The pipeline will:
   - Load and clean the data from `data/structured.csv` and `data/unstructured.json`
   - Enrich it with the LLM (sentiment, topics, summaries)
   - Store it in the SQLite database at `/app/enriched_data.db`
   Note: Make sure your `DB_PATH` environment variable matches where you want the database (default is `/app/enriched_data.db`).
   Debug: If you see 0 records, check the debug endpoint:
   curl https://your-app.fly.dev/debug/database
   This will show you:
   - Database file location and size
   - Table structure
   - Record count
   - Sample records (if any)
   - Instructions if the database is empty
9. View Logs: Monitor your application logs:
   fly logs
10. Scale Resources (if needed): The default configuration uses 256MB RAM (free tier). To scale:
    fly scale memory 512
11. SQLite Persistence: The SQLite database is ephemeral by default. For persistent storage, create a volume:
    fly volumes create data --size 1
    Then mount the volume (add it to `fly.toml` under `[mounts]`, or use `fly volumes attach data`) and set the database path environment variable:
    fly secrets set DB_PATH=/enriched_data.db
    Note: If you're using a volume, make sure the volume is actually mounted. You can check the mount in your `fly.toml` file or by running `fly volumes list`.

Troubleshooting: If you're getting empty results from the API:
1. Check if there's a `DB_PATH` secret that might be overriding fly.toml:
   fly secrets list
   Important: If you see `DB_PATH` in the secrets list with a Windows path (like `C:/...`), remove it:
   fly secrets unset DB_PATH
   Secrets take precedence over environment variables in `fly.toml`, so a bad secret will override your correct setting.
2. Find where your database file actually is:
   fly ssh console -C "find / -name 'enriched_data.db' 2>/dev/null"
3. Check the current working directory and look for the database:
   fly ssh console -C "pwd && ls -la *.db"
   (On Fly.io, this should be `/app`, based on the Dockerfile.)
4. Set the `DB_PATH` correctly:
   Option A: Use fly.toml (recommended for fixed paths)
   - Edit `fly.toml` and set `DB_PATH = '/app/enriched_data.db'` in the `[env]` section
   - This is already done if you followed the setup
   Option B: Use secrets (if you need different paths per environment)
   # If database is in /app (default):
   fly secrets set DB_PATH=/app/enriched_data.db
   # If database is in root:
   fly secrets set DB_PATH=/enriched_data.db
   # If using a volume mounted at /data:
   fly secrets set DB_PATH=/data/enriched_data.db
   Note: Secrets override fly.toml env vars, so if you set a secret, it will be used instead.
5. Check the health endpoint (shows database path, environment info, and record count):
   curl https://your-app.fly.dev/health
   This will show:
   - The exact database path being used
   - The `DB_PATH` environment variable value
   - Whether the file exists
   - The record count
6. Verify the database has data:
   # Replace /app/enriched_data.db with the actual path from step 2
   fly ssh console -C "sqlite3 /app/enriched_data.db 'SELECT COUNT(*) FROM enriched_records;'"
7. Redeploy after making changes:
   fly deploy
- Free Tier Limits:
  - 256MB RAM
  - Shared CPU
  - 3GB storage
  - Auto-stops when idle (auto-starts on request)
- Health Checks: The application includes a `/health` endpoint that Fly.io monitors automatically.
- API Documentation: Once deployed, access interactive API docs at https://your-app.fly.dev/docs
The current deployment only includes the FastAPI backend. To also deploy the Streamlit frontend:
Option 1: Deploy Streamlit as a Separate App (Recommended for free tier)
1. Deploy the FastAPI backend first (using the steps above)
2. Deploy Streamlit as a separate app:
   fly launch --config fly.streamlit.toml --dockerfile Dockerfile.streamlit
   When prompted:
   - App name: Use a different name like `product-review-streamlit`
   - Region: Use the same region as your FastAPI app
   - Postgres/Redis: Type `n` for both
   - Deploy now?: Type `n`
3. Set the API URL (replace with your FastAPI app URL):
   fly secrets set API_BASE_URL=https://product-review-prototype.fly.dev
4. Deploy Streamlit:
   fly deploy --config fly.streamlit.toml --dockerfile Dockerfile.streamlit
Option 2: Access API Only
You can use the FastAPI endpoints directly:
- API: https://your-app.fly.dev
- Interactive docs: https://your-app.fly.dev/docs
- Health check: https://your-app.fly.dev/health
The Streamlit app is designed to work with the API, so you can run it locally and point it to your deployed API, or deploy it separately as shown above.