Deployed at: https://nlpproject-pl6e.onrender.com/
Demo at: Video Demo
This NLP project streamlines the extraction, analysis, and accessibility of critical information from CVE (Common Vulnerabilities and Exposures) entries. It combines LLM-based question answering with interactive visualizations so that users can explore security vulnerabilities and integrate the data into their security workflows.
- Data Extraction: Automated data collection of CVEs using web scraping and APIs.
- Data Cleaning: Preprocessing and cleaning of raw data for consistency and accuracy.
- Database Storage: Processed data is stored in MongoDB Atlas for easy management and retrieval.
- Web Application: Flask-based website with a user-friendly dashboard for CVE analysis.
- Interactive Visualization: Integrated Plotly for interactive visualizations to enhance data insights.
- Search Functionality: Allows users to search specific CVE IDs and access comprehensive details.
- NLP Inference: Uses Groq to run the open-source Llama 3.1 8B model for language understanding and response generation over CVE data.
- Similar Search: Users can find the top 5 CVEs most similar to a given description of a cyber attack.
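The similar-search feature above boils down to ranking stored CVE vectors by their similarity to a query vector. The project delegates this to Pinecone; the sketch below is only an in-memory stand-in that shows the idea with cosine similarity. The function names, the toy index, and the placeholder IDs are assumptions made for illustration and are not taken from the codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, index, k=5):
    """Return the k (cve_id, score) pairs most similar to query_vec, best first."""
    scored = [(cve_id, cosine_similarity(query_vec, vec)) for cve_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors with placeholder IDs (hypothetical data).
# Real description embeddings have hundreds of dimensions.
toy_index = {
    "CVE-A": [1.0, 0.0, 0.0],
    "CVE-B": [0.9, 0.1, 0.0],
    "CVE-C": [0.0, 1.0, 0.0],
}
results = top_k_similar([1.0, 0.0, 0.0], toy_index, k=2)
```

A vector database performs the same ranking with an approximate index, which is what keeps the lookup fast once the collection grows beyond what a linear scan can handle.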
- Data Extraction: Gathered CVE data through scraping and API integrations.
- Data Cleaning: Processed data to ensure consistency, remove duplicates, and handle missing values.
- Storage: Stored the cleaned data in MongoDB Atlas, making it accessible for both display and further analysis.
- Web Application Development: Built a Flask-based web app with a dashboard to display CVE information.
- Visualization: Added Plotly-based charts for interactive data visualization, helping users explore CVE data insights.
- Search and Analysis: Users can search for CVEs by ID, view detailed information, and analyze key metrics.
- NLP Model Integration: Used Groq with the Llama 3.1 8B model to generate natural language responses based on CVE data.
- Similar Search Integration: Applied RAG (retrieval-augmented generation) for similar CVE search.
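The data cleaning step in the workflow above (consistency, duplicate removal, missing values) could look roughly like the following. This is a minimal sketch, not the project's actual cleaning code: the field names `cve_id`, `description`, and `cvss_score`, the fallback text, and the sample records are all hypothetical.

```python
def clean_records(records):
    """Normalize raw CVE records, fill missing fields, and de-duplicate by ID.

    When the same ID appears more than once, the last occurrence wins.
    """
    cleaned = {}
    for rec in records:
        cve_id = (rec.get("cve_id") or "").strip().upper()
        if not cve_id:
            continue  # a record without an ID cannot be searched or stored
        # Collapse runs of whitespace and newlines left over from scraping.
        description = " ".join((rec.get("description") or "").split())
        cleaned[cve_id] = {
            "cve_id": cve_id,
            "description": description or "No description available",
            "cvss_score": rec.get("cvss_score"),  # stays None when missing
        }
    return list(cleaned.values())


# Placeholder records (hypothetical IDs and text, for illustration only).
raw = [
    {"cve_id": " cve-2099-0001 ", "description": "Example   buffer\n overflow", "cvss_score": 7.5},
    {"cve_id": "CVE-2099-0001", "description": "Example buffer overflow", "cvss_score": 7.5},
    {"cve_id": "CVE-2099-0002", "description": None},
    {"description": "entry with no ID"},
]
records = clean_records(raw)
```

Keying the intermediate dictionary on the normalized ID makes de-duplication a side effect of insertion, so no second pass is needed.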
- Backend: Flask
- Database: MongoDB Atlas, Pinecone
- Frontend: HTML, CSS, JavaScript
- Visualization: Plotly
- NLP Model: Groq with the Llama 3.1 8B model, RAG with vector search
- Transformers: BERT tokenizer for tokenization, AutoModel for embeddings
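A BERT-style encoder returns one vector per token, so some pooling step is needed to turn a tokenized description into a single embedding for vector search. The README does not say which pooling the project uses; the sketch below assumes masked mean pooling, a common choice, and uses plain Python lists in place of model tensors. The function name and the toy inputs are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding tokens."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    if count == 0:
        return totals  # nothing to average; return the zero vector
    return [total / count for total in totals]


# Two real tokens followed by one padding token (toy 2-dimensional vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # -> [2.0, 3.0]
```

Ignoring masked positions matters because padded batches would otherwise pull every short description toward the padding vector.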
- Python 3.12
- MongoDB Atlas account
- Pinecone account
- API keys for data sources (contact us for data)
- Clone the repository:
    git clone https://github.com/yourusername/nlp-cve-analysis.git
    cd nlpProject
- Install dependencies:
    pip install -r requirements.txt
- Configure MongoDB Atlas:
  - Update .env with your GROQ_API_KEY.
  - Update .env with your PINECONE_API_KEY.
  - Update .env with your MongoDB connection string. Or reach out to us for data.
- Run the Flask app:
    python main.py
- Access the Application: Open your browser and navigate to http://localhost:5000.
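Because the app depends on three settings from .env, it can help to fail fast at startup when one is missing. The following is a minimal sketch of such a check. `GROQ_API_KEY` and `PINECONE_API_KEY` come from the steps above, while `MONGODB_URI` is an assumed name for the MongoDB connection string, and `load_settings` is a hypothetical helper. The sketch also assumes the .env file has already been loaded into the process environment, for example with a package such as python-dotenv.

```python
import os

# MONGODB_URI is an assumed variable name; use whatever key your .env defines.
REQUIRED_KEYS = ("GROQ_API_KEY", "PINECONE_API_KEY", "MONGODB_URI")


def load_settings(env=None):
    """Return the required settings, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings in .env: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Example with an explicit mapping instead of the real environment.
example = load_settings({
    "GROQ_API_KEY": "placeholder-groq-key",
    "PINECONE_API_KEY": "placeholder-pinecone-key",
    "MONGODB_URI": "placeholder-connection-string",
})
```

Accepting an optional mapping keeps the check easy to test without touching real credentials.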
- Dashboard: View CVE data insights through interactive visualizations.
- Search: Enter CVE IDs to retrieve detailed vulnerability information.
- NLP Responses: Ask questions related to CVE data and get responses generated by the Llama 3.1 8B model.
- Similar Search: Enter a description of a cyber attack to find the top 5 most similar CVEs.
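For the search box, user input can be normalized and validated before it reaches the database. CVE IDs follow the form CVE-YYYY-NNNN, with four or more digits in the sequence part. The sketch below checks that form; the helper name `normalize_cve_id` is hypothetical, and the project may validate input differently.

```python
import re

# "CVE-", a four-digit year, then a sequence number of at least four digits.
CVE_ID_PATTERN = re.compile(r"CVE-[0-9]{4}-[0-9]{4,}")


def normalize_cve_id(raw):
    """Return the upper-cased, trimmed CVE ID, or None if the input is malformed."""
    candidate = raw.strip().upper()
    return candidate if CVE_ID_PATTERN.fullmatch(candidate) else None


valid = normalize_cve_id("  cve-2021-44228 ")   # "CVE-2021-44228"
invalid = normalize_cve_id("CVE-21-1")          # None
```

Returning None for malformed input lets the route show a helpful message instead of running a query that can never match.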
- Expand NLP model capabilities for broader question answering.
- Enhance dashboard with additional metrics and visualizations.
For further inquiries, please contact msa23010@iiitl.ac.in, msd23007@iiitl.ac.in, msd23004@iiitl.ac.in, msa23004@iiitl.ac.in, or msd23024@iiitl.ac.in.
