An intelligent data analysis chatbot built with Streamlit, LangChain, and Azure OpenAI that provides conversational insights and dynamic visualizations for the Titanic dataset (or any CSV data).
- Data Analysis Agent: Performs pandas-based data analysis
- Router Agent: Intelligently routes between chart generation and insights
- Insights Agent: Converts raw analysis into human-readable insights
- Chart Code Generator: Dynamically creates Python visualization code
- Chart Execution Agent: Safely executes generated code to create visualizations
- Smart Chart Selection: Automatically chooses appropriate chart types based on your question
- Custom Code Generation: Creates matplotlib/seaborn code tailored to your specific query
- Multiple Chart Types: Histograms, scatter plots, bar charts, heatmaps, box plots, and more
- Interactive Code Viewing: See the generated Python code behind each visualization
- LangSmith Integration: Full tracing and monitoring of AI agent decisions
- Session Tracking: Conversation grouping and user journey analysis
- Performance Metrics: Token usage, latency, and cost monitoring
- Error Tracking: Comprehensive debugging capabilities
- Chat-Based Interaction: Natural language queries about your data
- Real-Time Responses: Instant analysis and visualization generation
- Sample Questions: Pre-built queries to get you started quickly
- Data Upload: Support for custom CSV files or default Titanic dataset
- Python 3.8+
- Azure OpenAI account with a GPT-4 deployment
- (Optional) LangSmith account for tracing
```bash
git clone <repository-url>
cd titanic-ai-analyst
```

```bash
pip install streamlit pandas matplotlib seaborn python-dotenv
pip install langchain-openai langchain-experimental langgraph
pip install numpy
```

Create a `.env` file in the project root:
```bash
# OpenAI Configuration (Required)
OPENAI_API_KEY=your_openai_api_key

# LangSmith Tracing Configuration (Optional)
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_API_KEY=your_langsmith_api_key
LANGSMITH_PROJECT=your_project_name
```

Then start the app:

```bash
streamlit run app.py
```

The application will open in your browser at `http://localhost:8501`.
- Launch the Application: Run `streamlit run app.py`
- Load Data:
- Click "Load Default Titanic Data" for the classic dataset
- Or upload your own CSV file using the file uploader
- Start Asking Questions: Use natural language to explore your data
- "What is the survival rate?"
- "How many passengers were in each class?"
- "What's the average age of survivors vs non-survivors?"
- "Analyze survival patterns by gender and class"
- "Calculate the fare statistics by passenger class"
- "Show me a histogram of passenger ages"
- "Create a bar chart of survival by class"
- "Plot the correlation heatmap"
- "Show fare distribution by class as a boxplot"
- "Visualize survival rate by gender"
- Click the file uploader in the sidebar
- Select your CSV file
- Click "Load Uploaded File"
- Start analyzing your custom dataset
- View real-time traces of AI agent decisions
- Monitor performance and costs
- Debug issues with detailed logs
- Click "๐ Open LangSmith Dashboard" to access the monitoring interface
```
User Question → Data Agent → Router → [Chart Path OR Insight Path] → Response
                                          ↓
                        Chart Code Agent → Chart Execution Agent
                                          ↓
                        Insight Agent ────────→ Final Response
```
- StateGraph (LangGraph): Orchestrates the multi-agent workflow
- Azure OpenAI: Powers the intelligent agents with GPT-4
- Pandas DataFrame Agent: Enables natural language data analysis
- Dynamic Code Generation: Creates custom visualization code
- Safe Code Execution: Sandboxed environment for running generated code
- Streamlit Interface: Provides the chat-based user experience
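Stripped of the LangGraph and LLM machinery, the routing flow above can be sketched as a plain state-passing pipeline (all function bodies here are stubs for illustration; the real agents call the LLM):

```python
# Minimal sketch of the multi-agent routing flow; agent internals are stubbed.
def data_agent(state):
    state["analysis"] = f"analysis of: {state['question']}"
    return state

def router(state):
    # The real router asks the LLM; this stub keys off simple intent words.
    chart_words = ("plot", "chart", "histogram", "visualize", "heatmap")
    return "chart" if any(w in state["question"].lower() for w in chart_words) else "insight"

def chart_code_agent(state):
    state["chart_code"] = "plt.hist(df['Age'])"  # placeholder generated code
    return state

def chart_execution_agent(state):
    state["chart"] = "rendered figure"
    return state

def insight_agent(state):
    state["response"] = f"insight based on {state['analysis']}"
    return state

def run(question):
    state = data_agent({"question": question})
    if router(state) == "chart":
        state = chart_execution_agent(chart_code_agent(state))
    return insight_agent(state)
```

In the actual app, LangGraph's `StateGraph` wires these steps together as nodes and conditional edges instead of plain function calls.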
- Restricted Execution Environment: Limited built-in functions for safety
- Input Validation: Comprehensive error handling and validation
- Environment Variable Security: API keys stored securely in `.env`
- Code Filtering: Removes potentially dangerous import statements
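A restricted execution environment of this kind can be sketched as follows (a simplified illustration, not the app's exact implementation; note that stripping `import` lines by prefix is a naive filter and not a complete sandbox):

```python
# Sketch of sandboxed execution: whitelist builtins, strip import statements.
SAFE_BUILTINS = {"len": len, "range": range, "min": min, "max": max,
                 "sum": sum, "abs": abs, "round": round, "print": print}

def run_generated_code(code, context):
    # Drop import lines; the context supplies pre-approved objects instead.
    lines = [line for line in code.splitlines()
             if not line.lstrip().startswith(("import ", "from "))]
    namespace = {"__builtins__": SAFE_BUILTINS, **context}
    exec("\n".join(lines), namespace)
    return namespace

ns = run_generated_code("import os\nresult = max(values) - min(values)",
                        {"values": [3, 9, 4]})
```

Because `__builtins__` is replaced by the whitelist, generated code cannot reach `open`, `__import__`, or other dangerous builtins through the normal lookup path.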
LangSmith provides powerful monitoring and debugging capabilities for your AI agents.
- Visit smith.langchain.com and create an account
- Go to Settings → API Keys
- Create a new API key
- Copy the key (starts with `lsv2_pt_...`)
Add to your .env file:
```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here
LANGSMITH_PROJECT=titanic-ai-analyst
```

- Restart the application
- All AI agent interactions will be traced
- View traces in the LangSmith dashboard
- Analyze performance, costs, and user patterns
Modify the `_generate_chart_code` method to include new visualization types:

```python
# Add to the chart types section in the prompt:
# - Custom chart type: for specific use cases
```

Add new agents to the workflow:
```python
# In the _build method
graph.add_node("new_agent", self._new_agent_method)
graph.add_edge("existing_agent", "new_agent")
```

Extend the `load_data` method to support:
- Database connections
- API integrations
- Multiple file formats (Excel, JSON, etc.)
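A multiple-file-format extension could dispatch on the file extension, roughly like this (the `load_data` name follows the README; the rest is an illustrative sketch):

```python
import pandas as pd
from pathlib import Path

def load_data(path):
    """Load a dataset from CSV, Excel, or JSON based on the file extension."""
    suffix = Path(path).suffix.lower()
    readers = {
        ".csv": pd.read_csv,
        ".xlsx": pd.read_excel,  # requires the openpyxl package
        ".json": pd.read_json,
    }
    if suffix not in readers:
        raise ValueError(f"Unsupported file format: {suffix}")
    return readers[suffix](path)
```

Database and API sources would follow the same pattern: each returns a `pandas.DataFrame` that the downstream agents consume unchanged.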
- Check credentials: Verify your Azure OpenAI endpoint, API key, and deployment name
- Verify deployment: Ensure your GPT-4 model is deployed and accessible
- Test connection: Use the Azure Portal to test your OpenAI resource
- Data compatibility: Ensure your dataset has the required columns
- Missing values: The system handles NaN values, but extreme cases may fail
- Code generation: Check the generated code in the expandable section
- File format: Ensure your CSV file is properly formatted
- File size: Very large files may cause memory issues
- Encoding: Try UTF-8 encoding for special characters
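If `pd.read_csv` raises a `UnicodeDecodeError`, a simple fallback over common encodings usually resolves it (an illustrative helper, not part of the app):

```python
import pandas as pd

def read_csv_tolerant(path, encodings=("utf-8", "utf-8-sig", "latin-1")):
    """Try a list of encodings until one successfully decodes the file."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error
```

`latin-1` accepts any byte sequence, so it acts as a last-resort fallback, though accented characters may be misread if the file was actually written in another encoding.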
- API Key: Verify your LangSmith API key is correct
- Project Name: Ensure the project exists in your LangSmith account
- Network: Check firewall settings for `api.smith.langchain.com`
Enable verbose logging by modifying the pandas agent:

```python
from langchain_experimental.agents import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(
    self.llm,
    self.df,
    verbose=True,  # Enable detailed logging
    allow_dangerous_code=True,
)
```

- Sampling: For datasets >100k rows, consider sampling for faster analysis
- Chunking: Process large files in chunks
- Memory Management: Monitor RAM usage with large datasets
- Caching: Implement caching for repeated queries
- Async Processing: Use async for non-blocking operations
- Model Selection: Consider using smaller models for simple queries
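The sampling tip can be sketched as a small guard applied before handing the DataFrame to the agent (illustrative; the threshold and helper name are not part of the app):

```python
import pandas as pd

MAX_ROWS = 100_000

def maybe_sample(df, max_rows=MAX_ROWS, seed=42):
    """Downsample large frames so agent queries stay fast and cheap."""
    if len(df) <= max_rows:
        return df
    # Fixed seed keeps repeated queries consistent within a session.
    return df.sample(n=max_rows, random_state=seed)

big = pd.DataFrame({"x": range(250_000)})
small = maybe_sample(big)
```

For caching, wrapping the data-loading function with Streamlit's `st.cache_data` decorator avoids re-reading the file on every rerun.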
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes and test thoroughly
- Submit a pull request with a clear description
- Follow PEP 8 Python style guidelines
- Add docstrings to all functions
- Include error handling for new features
- Test with both Titanic and custom datasets
- New visualization types: 3D plots, interactive charts
- Data source integrations: Databases, APIs, cloud storage
- Enhanced AI agents: Statistical testing, prediction models
- UI improvements: Better error messages, user onboarding
- Performance optimizations: Caching, async processing
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit: For the amazing web app framework
- LangChain: For the powerful AI agent capabilities
- Azure OpenAI: For providing the GPT-4 model
- LangSmith: For excellent tracing and monitoring
- Titanic Dataset: Classic machine learning dataset from Kaggle
For questions, issues, or contributions:
- GitHub Issues: Open an issue for bugs or feature requests
- Documentation: Check this README and inline code comments
- LangSmith Support: Visit smith.langchain.com for tracing issues
- Azure Support: Check the Azure documentation for OpenAI service issues
- Multi-dataset Support: Compare and analyze multiple datasets simultaneously
- Advanced Statistics: Integration with scipy for statistical tests
- Machine Learning: Built-in ML model training and evaluation
- Real-time Data: Support for streaming data sources
- Collaborative Features: Share analyses and insights with teams
- Export Capabilities: PDF reports, presentation slides
- Voice Interface: Speech-to-text for voice queries
- Mobile Optimization: Responsive design for mobile devices
Built with ❤️ using Streamlit, LangChain, and Azure OpenAI