An intelligent resume matching system powered by semantic embeddings and vector search
- About
- Key Features
- Architecture
- Getting Started
- Dataset
- Model Comparison
- Pipeline Overview
- Roadmap
- Contributing
- License
- Contact
- Acknowledgments
Get Hired is an AI-powered recruitment assistant that revolutionizes the resume screening process. By leveraging state-of-the-art transformer models and vector databases, it creates semantic embeddings of resumes to enable intelligent matching and retrieval based on job requirements.
Traditional keyword-based resume screening systems often miss qualified candidates due to rigid keyword matching, inability to understand context, poor handling of synonyms and related terms, and time-consuming manual processes.
Get Hired addresses these limitations by understanding semantic meaning beyond keywords, finding candidates with transferable skills, providing similarity scores for objective comparison, and scaling to thousands of resumes in seconds.
- Semantic Understanding: Uses transformer-based models (BERT, Sentence-BERT) to capture deep semantic meaning from resume text
- Vector Storage: Efficient storage and retrieval using Weaviate's vector database with sub-second query times
- Intelligent Matching: Find best-matching resumes based on natural language job descriptions
- Multiple Model Support: Compare results across CBOW, Skip-gram, and Transformer models
- Hybrid Search: Combine vector similarity with metadata filtering (experience, skills, category)
- Scalable Architecture: Process and query thousands of resumes efficiently
- Natural language queries using plain English
- Batch processing for large resume datasets with configurable batch sizes
- Automated text cleaning, tokenization, and lemmatization
- Built-in tools to compare embedding quality across different models
- Robust error handling with detailed logging and recovery mechanisms
Data Ingestion Layer
- Resume parsing and validation
- Text extraction and preprocessing
Embedding Generation
- HuggingFace Transformers (all-MiniLM-L6-v2)
- Custom Word2Vec models (CBOW, Skip-gram)
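For the transformer path, a minimal sketch using the `sentence-transformers` library (the sample texts and variable names are illustrative; the Word2Vec path is sketched in the Model Comparison section below):

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained model named above (downloaded on first use)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# resume_texts is an assumed list of preprocessed resume strings
resume_texts = ["experienced python developer ...", "cloud engineer ..."]

# encode() returns one 384-dimensional vector per input text
embeddings = model.encode(resume_texts, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (2, 384)
```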
Vector Storage
- Weaviate vector database
- Schema-based structured storage
- Hybrid search capabilities
Query Interface
- Natural language processing
- Similarity search
- Metadata filtering
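A hedged sketch of the query interface, assuming the Weaviate v3 Python client and a `Resume` class with `content` and `category` properties (class and property names are illustrative):

```python
import weaviate
from sentence_transformers import SentenceTransformer

client = weaviate.Client("http://localhost:8080")  # assumed local Weaviate instance
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed the job description with the same model used at ingestion time
query_vector = model.encode("senior python developer with ML experience")

# Hybrid search: vector similarity combined with a metadata filter
result = (
    client.query
    .get("Resume", ["content", "category"])
    .with_near_vector({"vector": query_vector.tolist()})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "Data Science",
    })
    .with_limit(5)
    .do()
)
print(result["data"]["Get"]["Resume"])
```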
- Python 3.8 or higher
- 4GB+ RAM recommended
- GPU optional (for faster embedding generation)
```bash
# Clone the repository
git clone https://github.com/dhou22/Get-Hired-Project.git
cd Get-Hired-Project

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
venv\Scripts\activate        # Windows
source venv/bin/activate     # Linux/macOS
```
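With the environment active, install the dependencies. A `requirements.txt` is assumed here; adjust the filename to whatever the repository actually ships:

```bash
pip install -r requirements.txt
```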
The notebook contains the following sections:
- Data Loading: Import and explore resume dataset
- Preprocessing: Clean and prepare text data
- Embedding Generation: Create vector representations
- Weaviate Setup: Configure schema and upload data
- Query & Retrieval: Test semantic search functionality
- Benchmarking: Compare model performance
Dataset source on Kaggle: https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset
- Total Resumes: 2,484
- Categories: 25 job categories
- Average Length: ~500 words per resume
- Format: Structured CSV with text fields
| Category | Count | Examples |
|---|---|---|
| Data Science | 245 | Machine Learning Engineer, Data Analyst |
| Web Development | 312 | Full-Stack Developer, Frontend Engineer |
| DevOps | 189 | Cloud Engineer, SRE |
| Mobile Development | 156 | iOS Developer, Android Engineer |
Lowercase Conversion
- Ensures consistency across text
- Example: "Python" → "python"
Remove Punctuation & Special Characters
- Eliminates noise from embeddings
- Removes numbers unless contextually relevant
Tokenization
- Splits text into individual words/tokens
- Uses NLTK's word tokenizer
Stopword Removal
- Removes common words with little semantic value
- Preserves domain-specific terms
Lemmatization
- Converts words to base form
- Example: "running" → "run", "better" → "good"
Text Reconstruction
- Rejoins tokens into clean text
- Ready for embedding models
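The six steps above can be combined into a single function; a minimal NLTK-based sketch (this simplified version drops all digits, whereas the project keeps contextually relevant numbers):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time NLTK resource downloads (recent NLTK versions may also need "punkt_tab")
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. strip punctuation/digits
    tokens = word_tokenize(text)                         # 3. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]   # 5. lemmatize
    return " ".join(tokens)                              # 6. reconstruct

print(preprocess("Running ML pipelines with Python since 2019!"))
```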
1. User Interaction
- HR professional sends query through the interface
- Query captured by the ChatBot Interface (Angular frontend, localhost:4200)
- Communication handled via a FastAPI backend
2. Query Transfer
- Query transmitted from frontend to backend
- Routed to the Orchestration Layer for processing
3. Orchestration Layer - Routing Agent
- Intent Classification: Analyzes query using keyword matching and intent classification
- Route Determination: Decides which specialized layer handles the request
- Model: Mistral-large-latest (128k context window)
- Memory Management: Maintains conversation history via ConversationBufferMemory
4. Query Branching: routes the query to the appropriate specialized layer based on intent (see the sketch below):
Path A: Candidate Search Queries → Retrieval Layer (5A)
Path B: Market Analysis Queries → Web Search Layer (5B)
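As an illustrative sketch of this branching, here is a minimal keyword-based router; the keyword sets and layer names are assumptions (the actual system combines keyword matching with Mistral-large-latest for intent classification):

```python
# Minimal keyword router; keyword sets and return values are illustrative
# assumptions, not the project's actual classifier.
CANDIDATE_KEYWORDS = {"candidate", "resume", "cv", "hire", "developer"}
MARKET_KEYWORDS = {"salary", "market", "trend", "demand", "rate"}

def route(query: str) -> str:
    tokens = set(query.lower().split())
    if tokens & MARKET_KEYWORDS:
        return "web_search_layer"   # Path B: market analysis
    return "retrieval_layer"        # Path A: candidate search (default)

print(route("average salary for a DevOps engineer"))  # web_search_layer
```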
5A. Retrieval Layer - Candidate Search
- Vector Database: Weaviate stores candidate profiles
- Embedding Model: HuggingFace all-MiniLM-L6-v2 (~3k documents indexed)
- Search Method: 384-dim vector cosine similarity matching
- Returns relevant candidate profiles
5B. Web Search Layer - Market Intelligence
- Tool: Tavily for web scraping
- Sources: Deep crawl of 5+ sources
- Gathers real-time market data and trends
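A hedged sketch of the web-search step, assuming the `tavily-python` client (the query string, result count, and key handling are illustrative):

```python
from tavily import TavilyClient

client = TavilyClient(api_key="YOUR_TAVILY_API_KEY")  # assumed API key

# Deep search across several sources for current market data
response = client.search(
    "average salary senior python developer",
    search_depth="advanced",
    max_results=5,
)
for result in response["results"]:
    print(result["title"], result["url"])
```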
6. Scoring Layer
- Scoring Agent: Llama-70B model
- Evaluation: 4-phase weighted scoring system
- Output: Ranked list of top 5 candidates
- Processes retrieved candidates against job requirements
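The scoring pass can be thought of as a weighted aggregation over the four phases; a minimal sketch, with phase names and weights as stand-in assumptions rather than the project's actual rubric:

```python
# Assumed phase weights (sum to 1.0); the real system uses an LLM-driven
# 4-phase evaluation whose exact rubric is not shown here.
WEIGHTS = {"skills": 0.4, "experience": 0.3, "education": 0.2, "similarity": 0.1}

def score(candidate: dict) -> float:
    """Weighted sum of per-phase scores, each assumed to lie in [0, 1]."""
    return sum(w * candidate.get(phase, 0.0) for phase, w in WEIGHTS.items())

def top_candidates(candidates: list[dict], k: int = 5) -> list[dict]:
    """Rank candidates and keep the top k (the pipeline returns the top 5)."""
    return sorted(candidates, key=score, reverse=True)[:k]

print(top_candidates([
    {"name": "A", "skills": 0.9, "experience": 0.7, "education": 0.8, "similarity": 0.85},
    {"name": "B", "skills": 0.6, "experience": 0.9, "education": 0.5, "similarity": 0.70},
]))
```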
7. Analysis Layer
- Analysis Agent: Mistral-7B model
- Data Processing: Extracts and structures market data
- Output Generation: Creates formatted salary tables and market insights reports
8. Output Delivery: two types of output are generated:
Candidate Search Output
- Ranked list of top 5 candidates with scores
- Delivered to user interface
Market Report Output
- Comprehensive salary table and market insights
- Sourced data with analysis
- Delivered to user interface
9. Response Cycle
- Results displayed in ChatBot Interface
- User receives either candidate recommendations or market analysis
- Conversation history stored for context in future queries
10. Completion
- User gets final answer through the HR interface
- System ready for next query with maintained conversation context
| Model | Embedding Dim | Inference Speed | OOV Handling | Quality Score |
|---|---|---|---|---|
| CBOW | 100 | Fast | Poor | Good |
| Skip-gram | 100 | Medium | Poor | Very Good |
| all-MiniLM-L6-v2 | 384 | Slower | Excellent | Excellent |
CBOW
Strengths:
- Fast inference speed (~0.1ms per resume)
- Compact embeddings (100D)
- Excellent for frequent words
- Low memory footprint
Limitations:
- Limited to training vocabulary
- Poor performance on rare words
- Cannot handle out-of-vocabulary (OOV) terms
Best Use Cases:
- High-speed production systems
- Domain-specific vocabularies
- Resource-constrained environments
Skip-gram
Strengths:
- Better semantic relationships
- Works well with rare words
- Captures fine-grained meanings
- Good for analogies and similarities
Limitations:
- Slower than CBOW
- Still limited to training vocabulary
- Requires more training data
Best Use Cases:
- Semantic similarity tasks
- Small to medium datasets
- Custom corpus training
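For custom corpus training, a minimal Gensim sketch; `tokenized_resumes` is an assumed list of token lists produced by the preprocessing pipeline above:

```python
from gensim.models import Word2Vec

# tokenized_resumes: assumed token lists, e.g. output of preprocess().split()
tokenized_resumes = [
    ["experienced", "python", "developer", "machine", "learning"],
    ["cloud", "engineer", "kubernetes", "python"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; both use 100-dim vectors as above
cbow = Word2Vec(tokenized_resumes, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(tokenized_resumes, vector_size=100, window=5, min_count=1, sg=1)

# Nearest neighbours in the learned vector space
print(skipgram.wv.most_similar("python", topn=3))
```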
Sentence Transformer (all-MiniLM-L6-v2)
HuggingFace source model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Strengths:
- Handles any word (no OOV issues)
- Pre-trained on 1B+ sentence pairs
- Higher dimensional embeddings (384D)
- State-of-the-art quality
- Transfer learning capabilities
- Sentence-level understanding
Limitations:
- Slower inference (~10ms per resume)
- Larger model size (~80MB)
- Requires more computational resources
Best Use Cases:
- Production-ready applications
- General-purpose text matching
- When quality is priority
- Handling diverse vocabularies
Query: "experienced python developer machine learning"
Model | Top-3 Accuracy | Avg. Similarity | Query Time
---------------|----------------|-----------------|------------
CBOW | 72% | 0.68 | 0.02s
Skip-gram | 78% | 0.71 | 0.03s
MiniLM-L6-v2 | 94% | 0.85 | 0.08s
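A hedged sketch of how such numbers can be produced: embed the query, rank resumes by cosine similarity, and check whether the expected resume appears in the top 3 (the mini-corpus below is illustrative; the real benchmark runs over the full dataset):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

resumes = [
    "python machine learning engineer with scikit-learn experience",
    "frontend angular developer building responsive web apps",
]
query_vec = model.encode("experienced python developer machine learning")
resume_vecs = model.encode(resumes)

# Rank resumes by similarity to the query, highest first
ranked = sorted(
    zip(resumes, (cosine(query_vec, v) for v in resume_vecs)),
    key=lambda pair: pair[1],
    reverse=True,
)
for text, sim in ranked:
    print(f"{sim:.2f}  {text}")
```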
Use CBOW/Skip-gram when:
- Maximum speed is required
- Working with domain-specific vocabulary
- Training on custom corpus
- Memory/size is constrained
- Simple keyword matching suffices
Use Sentence Transformer when:
- Need robust OOV handling
- Working with sentences/phrases
- Want state-of-the-art quality
- Inference speed is acceptable
- Require transfer learning capabilities
- Production deployment is planned
- Basic resume embedding and storage
- Semantic search functionality
- Multiple model support (CBOW, Skip-gram, Transformers)
- Batch processing
- Model benchmarking
- REST API for integration
- Web-based UI dashboard
- Real-time resume parsing from PDFs
- Fine-tuned models on resume corpus
- Multi-language support
- Explainability features
- Active learning feedback loop
- Candidate ranking algorithms
- Integration with ATS systems
- Bias detection and mitigation
- Skills gap analysis
- Automated interview question generation
See the open issues for proposed features and known issues.
Contributions are welcome. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black .
isort .

# Lint code
flake8 .
pylint src/
```

- Write clear, commented code
- Add unit tests for new features
- Update documentation for API changes
- Follow PEP 8 style guidelines
- Keep pull requests focused and small
- Provide detailed PR descriptions
Distributed under the MIT License. See LICENSE for more information.
Project Maintainer: dhou22
- GitHub: @dhou22
- Project Link: https://github.com/dhou22/Get-Hired-Project
- Issues: Report a Bug
This project was made possible by:
- HuggingFace - State-of-the-art transformer models
- Weaviate - Vector database technology
- Sentence-Transformers - Pre-trained semantic embedding models
- NLTK - Natural language processing tools
- Gensim - Word2Vec implementations
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers