Get Hired - AI Recruitment Assistant

An intelligent resume matching system powered by semantic embeddings and vector search

About The Project

Get Hired is an AI-powered recruitment assistant for resume screening. It uses transformer models and a vector database to create semantic embeddings of resumes, so candidates can be matched and retrieved against job requirements by meaning rather than by exact keywords.

Motivation

Traditional keyword-based resume screening systems often miss qualified candidates due to rigid keyword matching, inability to understand context, poor handling of synonyms and related terms, and time-consuming manual processes.

Get Hired addresses these limitations by understanding semantic meaning beyond keywords, finding candidates with transferable skills, providing similarity scores for objective comparison, and scaling to thousands of resumes in seconds.

Key Features

Core Capabilities

Semantic Understanding: Uses transformer-based models (BERT, Sentence-BERT) to capture deep semantic meaning from resume text
Vector Storage: Efficient storage and retrieval using Weaviate's vector database with sub-second query times
Intelligent Matching: Find best-matching resumes based on natural language job descriptions
Multiple Model Support: Compare results across CBOW, Skip-gram, and Transformer models
Hybrid Search: Combine vector similarity with metadata filtering (experience, skills, category)
Scalable Architecture: Process and query thousands of resumes efficiently
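
Hybrid search can be pictured as a metadata filter followed by a vector-similarity ranking. The sketch below is a dependency-free illustration of that idea, not the project's Weaviate query: the `Candidate` fields, the `hybrid_search` helper, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    # Hypothetical record layout; the real Weaviate schema may differ.
    name: str
    category: str
    years_experience: int
    vector: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(
    query_vector: List[float],
    candidates: List[Candidate],
    category: Optional[str] = None,
    min_years: int = 0,
    limit: int = 5,
) -> List[Tuple[float, Candidate]]:
    """Filter on metadata first, then rank the survivors by similarity."""
    filtered = [
        c for c in candidates
        if (category is None or c.category == category)
        and c.years_experience >= min_years
    ]
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in filtered]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy data for illustration only.
candidates = [
    Candidate("A", "Data Science", 5, [0.9, 0.1, 0.0]),
    Candidate("B", "Data Science", 1, [1.0, 0.0, 0.0]),
    Candidate("C", "DevOps", 7, [0.0, 1.0, 0.0]),
]
results = hybrid_search(
    [1.0, 0.0, 0.0], candidates, category="Data Science", min_years=3
)
for score, candidate in results:
    print(f"{candidate.name}: {score:.3f}")
```

Candidate B has the closest vector but is excluded by the experience filter, which is the point of combining the two criteria. In a real deployment the vector database would normally apply both steps server-side.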

Advanced Features

Natural language queries using plain English
Batch processing for large resume datasets with configurable batch sizes
Automated text cleaning, tokenization, and lemmatization
Built-in tools to compare embedding quality across different models
Robust error handling with detailed logging and recovery mechanisms
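
Batch processing with a configurable batch size, combined with logging and recovery, might look like the sketch below. The helper names (`batched`, `process_in_batches`) and the logger name are hypothetical, and the recovery policy shown (log the failure, skip the batch, continue) is only one possible choice.

```python
import logging
from typing import Callable, Iterable, Iterator, List, Tuple

logger = logging.getLogger("get_hired.batch")  # hypothetical logger name


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def process_in_batches(
    items: Iterable[str],
    handler: Callable[[List[str]], object],
    batch_size: int = 64,
) -> Tuple[int, int]:
    """Run `handler` on each batch; log failures and keep going.

    Returns (number of items processed, number of items in failed batches).
    """
    processed, failed = 0, 0
    for index, batch in enumerate(batched(items, batch_size)):
        try:
            handler(batch)
            processed += len(batch)
        except Exception:
            logger.exception("Batch %d failed; skipping %d items", index, len(batch))
            failed += len(batch)
    return processed, failed


seen: List[List[str]] = []
processed, failed = process_in_batches(
    [f"resume {i}" for i in range(7)], seen.append, batch_size=3
)
print([len(batch) for batch in seen], processed, failed)
```

Here `seen.append` stands in for the real work, such as embedding a batch and uploading it to the vector store.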

Architecture

System Components

Data Ingestion Layer

Resume parsing and validation
Text extraction and preprocessing

Embedding Generation

HuggingFace Transformers (all-MiniLM-L6-v2)
Custom Word2Vec models (CBOW, Skip-gram)
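
As a rough sketch of the transformer path, the snippet below wraps the Sentence-Transformers `encode` call for all-MiniLM-L6-v2. The function names `load_encoder` and `embed_resumes` are hypothetical, and the library is imported lazily so the helper can also be exercised with a stand-in encoder; the project's notebooks may structure this differently.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
Encoder = Callable[[Sequence[str]], List[List[float]]]


def load_encoder(model_name: str = MODEL_NAME) -> Encoder:
    """Return a function mapping texts to embedding vectors.

    The import is deferred so this module loads even when the
    sentence-transformers package is not installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def encode(texts: Sequence[str]) -> List[List[float]]:
        # normalize_embeddings=True returns unit-length vectors, so the dot
        # product of two embeddings equals their cosine similarity.
        return model.encode(list(texts), normalize_embeddings=True).tolist()

    return encode


def embed_resumes(
    texts: Sequence[str], encoder: Optional[Encoder] = None
) -> List[Tuple[str, List[float]]]:
    """Embed resume texts, skipping empty ones; returns (text, vector) pairs."""
    cleaned = [t.strip() for t in texts if t and t.strip()]
    if not cleaned:
        return []
    encoder = encoder or load_encoder()
    return list(zip(cleaned, encoder(cleaned)))
```

Calling `embed_resumes(resume_texts)` with the default encoder downloads the model on first use and yields one 384-dimensional vector per non-empty resume.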

Vector Storage

Weaviate vector database
Schema-based structured storage
Hybrid search capabilities

Query Interface

Natural language processing
Similarity search
Metadata filtering

Getting Started

Prerequisites

Python 3.8 or higher
4GB+ RAM recommended
GPU optional (for faster embedding generation)

Installation

1. Clone the Repository

git clone https://github.com/dhou22/Get-Hired-Project.git
cd Get-Hired-Project

2. Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment (Windows)
venv\Scripts\activate

# Activate virtual environment (macOS/Linux)
source venv/bin/activate

The notebook contains the following sections:

Data Loading: Import and explore resume dataset
Preprocessing: Clean and prepare text data
Embedding Generation: Create vector representations
Weaviate Setup: Configure schema and upload data
Query & Retrieval: Test semantic search functionality
Benchmarking: Compare model performance

Dataset

Dataset source on Kaggle: https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset

Dataset Statistics

Total Resumes: 2,484
Categories: 25 job categories
Average Length: ~500 words per resume
Format: Structured CSV with text fields
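
Statistics like these can be recomputed from the CSV with the standard library alone. The sketch below runs on a three-row inline sample rather than the real file, and the column names `Category` and `Resume` are assumptions about the dataset layout.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded Kaggle CSV. The column names "Category" and
# "Resume" are assumptions, and the rows are made up for illustration.
SAMPLE_CSV = """Category,Resume
Data Science,"Built ML models in Python, deployed with Docker."
DevOps,"Managed Kubernetes clusters and CI/CD pipelines."
Data Science,"Analysed sales data using SQL and pandas."
"""


def summarize(csv_file):
    """Return (row count, per-category counts, mean words per resume)."""
    rows = list(csv.DictReader(csv_file))
    categories = Counter(row["Category"] for row in rows)
    word_counts = [len(row["Resume"].split()) for row in rows]
    mean_words = sum(word_counts) / len(word_counts) if word_counts else 0.0
    return len(rows), categories, mean_words


total, categories, mean_words = summarize(io.StringIO(SAMPLE_CSV))
print(total, dict(categories), round(mean_words, 1))
```

For the real dataset, pass an open file instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`.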

Sample Categories

Category           | Count | Examples
-------------------|-------|----------------------------------------
Data Science       | 245   | Machine Learning Engineer, Data Analyst
Web Development    | 312   | Full-Stack Developer, Frontend Engineer
DevOps             | 189   | Cloud Engineer, SRE
Mobile Development | 156   | iOS Developer, Android Engineer

Data Schema in Weaviate

Resume Processing Pipeline

Pipeline Steps

1. Weaviate DB ETL

Lowercase Conversion

Ensures consistency across text
Example: "Python" → "python"

Remove Punctuation & Special Characters

Eliminates noise from embeddings
Removes numbers unless contextually relevant

Tokenization

Splits text into individual words/tokens
Uses NLTK's word tokenizer

Stopword Removal

Removes common words with little semantic value
Preserves domain-specific terms

Lemmatization

Converts words to base form
Example: "running" → "run", "better" → "good"

Text Reconstruction

Rejoins tokens into clean text
Ready for embedding models
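
The six steps above can be sketched end to end as follows. To stay runnable without downloads, this version substitutes a whitespace split for NLTK's word tokenizer and two tiny inline tables for NLTK's stopword list and lemmatizer; those stand-ins and the `preprocess` name are assumptions for illustration.

```python
import re

# Tiny stand-ins for NLTK's English stopword list and a WordNet-style
# lemmatizer, kept inline so the sketch has no dependencies.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "with", "for", "to", "is"}
LEMMAS = {
    "running": "run",
    "models": "model",
    "better": "good",
    "pipelines": "pipeline",
}


def preprocess(text: str) -> str:
    """Apply the cleaning steps in order and return the cleaned string."""
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. drop punctuation, digits, symbols
    tokens = text.split()                                # 3. tokenize (whitespace split)
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    tokens = [LEMMAS.get(t, t) for t in tokens]          # 5. lemmatize (table lookup)
    return " ".join(tokens)                              # 6. rebuild the text


print(preprocess("Running ML pipelines in Python 3, with better models!"))
# prints: run ml pipeline python good model
```

Note that the regular expression in step 2 removes every digit and symbol, so this sketch does not implement the exceptions mentioned above for contextually relevant numbers or domain-specific terms.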

2. HR Chatbot Workflow Steps

1. User Interaction

HR professional sends query through the interface
Query captured by ChatBot Interface (Angular frontend)
Communication handled via FastAPI backend (localhost:4200)

2. Query Transfer

Query transmitted from frontend to backend
Routed to the Orchestration Layer for processing

3. Orchestration Layer - Routing Agent

Intent Classification: Analyzes the query with keyword matching to classify its intent
Route Determination: Decides which specialized layer handles the request
Model: Mistral-large-latest (128k context window)
Memory Management: Maintains conversation history via ConversationBufferMemory

4. Query Branching

Routes to the appropriate specialized layer based on intent:

Path A: Candidate Search Queries → Retrieval Layer (5A)

Path B: Market Analysis Queries → Web Search Layer (5B)
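
The keyword-matching part of this routing decision could be sketched as below. The keyword lists, the `Route` enum, and the tie-breaking rule are hypothetical; the routing agent described above also relies on a language model, which this sketch leaves out.

```python
from enum import Enum


class Route(Enum):
    CANDIDATE_SEARCH = "retrieval_layer"   # Path A
    MARKET_ANALYSIS = "web_search_layer"   # Path B


# Hypothetical keyword lists; the project's real routing rules are not shown.
MARKET_KEYWORDS = ("salary", "market", "trend", "compensation", "benchmark")
CANDIDATE_KEYWORDS = ("candidate", "resume", "developer", "engineer", "hire")


def route_query(query: str) -> Route:
    """Pick a route by counting keyword hits; ties go to candidate search."""
    text = query.lower()
    market_hits = sum(word in text for word in MARKET_KEYWORDS)
    candidate_hits = sum(word in text for word in CANDIDATE_KEYWORDS)
    if market_hits > candidate_hits:
        return Route.MARKET_ANALYSIS
    return Route.CANDIDATE_SEARCH


print(route_query("Find a senior Python developer with ML experience"))
print(route_query("What is the average salary for DevOps roles this year?"))
```

The first query is routed to Path A and the second to Path B. Substring matching is deliberately crude; a query that mixes both vocabularies is where a model-based classifier would be expected to take over.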

5A. Retrieval Layer - Candidate Search

Vector Database: Weaviate stores candidate profiles
Embedding Model: HuggingFace MiniLM-L6 (3k documents)
Search Method: 384-dim vector cosine similarity matching
Returns relevant candidate profiles

5B. Web Search Layer - Market Intelligence

Tool: Tavily for web scraping
Sources: Deep crawl of 5+ sources
Gathers real-time market data and trends

6. Scoring Layer

Scoring Agent: Llama-70B model
Evaluation: 4-phase weighted scoring system
Output: Ranked list of top 5 candidates
Processes retrieved candidates against job requirements
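
A weighted scoring step of this kind can be illustrated with a plain weighted sum followed by a top-5 cut. The four phase names and their weights below are placeholders, since this README does not list them, and the input scores are made up; in the described system the per-phase scores would come from the scoring agent.

```python
from typing import Dict, List, Tuple

# Hypothetical phases and weights (summing to 1.0). The README does not
# name the four phases, so these are placeholders for illustration.
WEIGHTS: Dict[str, float] = {
    "skills": 0.4,
    "experience": 0.3,
    "education": 0.2,
    "semantic_similarity": 0.1,
}


def weighted_score(phase_scores: Dict[str, float]) -> float:
    """Combine per-phase scores in [0, 1] into one weighted score."""
    return sum(WEIGHTS[phase] * phase_scores.get(phase, 0.0) for phase in WEIGHTS)


def rank_candidates(
    candidates: Dict[str, Dict[str, float]], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_n (name, score) pairs, highest score first."""
    scored = [
        (name, round(weighted_score(scores), 4))
        for name, scores in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up phase scores for two candidates.
example = {
    "cand_a": {"skills": 1.0, "experience": 0.5, "education": 0.5, "semantic_similarity": 1.0},
    "cand_b": {"skills": 0.5, "experience": 1.0, "education": 1.0, "semantic_similarity": 0.0},
}
print(rank_candidates(example))
```

Keeping the weights in one table makes it easy to tune how much each phase contributes without touching the ranking logic.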

7. Analysis Layer

Analysis Agent: Mistral-7B model
Data Processing: Extracts and structures market data
Output Generation: Creates formatted salary tables and market insights reports

8. Output Delivery

Two types of output are generated:

Candidate Search Output

Ranked list of top 5 candidates with scores
Delivered to user interface

Market Report Output

Comprehensive salary table and market insights
Sourced data with analysis
Delivered to user interface

9. Response Cycle

Results displayed in ChatBot Interface
User receives either candidate recommendations or market analysis
Conversation history stored for context in future queries

10. Completion

User gets final answer through the HR interface
System ready for next query with maintained conversation context

3. AI Agents

Model Comparison & Benchmarks

Performance Comparison

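Model            | Embedding Dim | Inference Speed | OOV Handling | Quality Score
-----------------|---------------|-----------------|--------------|--------------
CBOW             | 100           | Fast            | Poor         | Good
Skip-gram        | 100           | Medium          | Poor         | Very Good
all-MiniLM-L6-v2 | 384           | Slower          | Excellent    | Excellent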

Model Details

Word2Vec models

1. CBOW (Continuous Bag of Words)

Strengths:

Fast inference speed (~0.1ms per resume)
Compact embeddings (100D)
Excellent for frequent words
Low memory footprint

Limitations:

Limited to training vocabulary
Poor performance on rare words
Cannot handle out-of-vocabulary (OOV) terms

Best Use Cases:

High-speed production systems
Domain-specific vocabularies
Resource-constrained environments

2. Skip-gram

Strengths:

Better semantic relationships
Works well with rare words
Captures fine-grained meanings
Good for analogies and similarities

Limitations:

Slower than CBOW
Still limited to training vocabulary
Requires more training data

Best Use Cases:

Semantic similarity tasks
Small to medium datasets
Custom corpus training
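
The two Word2Vec variants differ by a single flag in Gensim, and the OOV limitation shows up as soon as word vectors are combined into a resume vector. The sketch below assumes Gensim 4.x parameter names; the `window` and `min_count` values, the helper names, and the choice to average word vectors are assumptions, not necessarily what the project's notebooks do.

```python
from typing import List, Mapping, Optional, Sequence


def train_word2vec(tokenized_resumes: Sequence[Sequence[str]], skip_gram: bool = False):
    """Train a 100-dimensional Word2Vec model with Gensim (imported lazily).

    In Gensim's Word2Vec, sg=0 selects CBOW and sg=1 selects Skip-gram.
    """
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=tokenized_resumes,
        vector_size=100,
        window=5,       # assumed context window
        min_count=1,    # assumed; keeps every token of a small corpus
        sg=1 if skip_gram else 0,
    )
    return model.wv  # KeyedVectors: supports `token in wv` and `wv[token]`


def average_vector(
    tokens: Sequence[str], vectors: Mapping[str, Sequence[float]]
) -> Optional[List[float]]:
    """Mean of the known tokens' vectors; OOV tokens are skipped.

    Returns None when no token is in the vocabulary.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(float(v[i]) for v in known) / len(known) for i in range(dim)]


toy_vectors = {"python": [1.0, 0.0], "docker": [0.0, 1.0]}  # stand-in for model.wv
print(average_vector(["python", "docker", "kubernetes"], toy_vectors))
# prints: [0.5, 0.5]
```

The token "kubernetes" is silently dropped because it is outside the toy vocabulary, which is exactly the OOV weakness listed for both Word2Vec variants.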

Sentence Transformer (all-MiniLM-L6-v2)

Hugging Face model source: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Strengths:

Handles any word (no OOV issues)
Pre-trained on 1B+ sentence pairs
Higher dimensional embeddings (384D)
State-of-the-art quality
Transfer learning capabilities
Sentence-level understanding

Limitations:

Slower inference (~10ms per resume)
Larger model size (~80MB)
Requires more computational resources

Best Use Cases:

Production-ready applications
General-purpose text matching
When quality is priority
Handling diverse vocabularies

Benchmark Results

Query: "experienced python developer machine learning"

Model          | Top-3 Accuracy | Avg. Similarity | Query Time
---------------|----------------|-----------------|------------
CBOW           | 72%           | 0.68            | 0.02s
Skip-gram      | 78%           | 0.71            | 0.03s
MiniLM-L6-v2   | 94%           | 0.85            | 0.08s
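
The README does not define how these three columns were measured. The sketch below shows one plausible way to compute them for a single query, assuming that top-3 accuracy means the share of the first three results whose category matches the expected one. The `fake_search` function and its canned results are placeholders for a real model-backed search.

```python
import time
from typing import Callable, Dict, List, Tuple

Result = Dict[str, object]


def top_k_accuracy(results: List[Result], expected_category: str, k: int = 3) -> float:
    """Share of the first k results whose category matches the expectation."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(r["category"] == expected_category for r in top) / len(top)


def timed(search_fn: Callable[[str], List[Result]], query: str) -> Tuple[List[Result], float]:
    """Run search_fn(query) and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = search_fn(query)
    return results, time.perf_counter() - start


def fake_search(query: str) -> List[Result]:
    # Placeholder for a real search; returns canned, made-up results.
    return [
        {"category": "Data Science", "similarity": 0.9},
        {"category": "Data Science", "similarity": 0.8},
        {"category": "DevOps", "similarity": 0.7},
    ]


results, elapsed = timed(fake_search, "experienced python developer machine learning")
accuracy = top_k_accuracy(results, "Data Science")
avg_similarity = sum(r["similarity"] for r in results) / len(results)
print(f"top-3 accuracy={accuracy:.2f} avg similarity={avg_similarity:.2f} time={elapsed:.4f}s")
```

Running the same harness with each model's search function in place of `fake_search`, over a set of labeled queries, would produce a comparable table.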

Recommendations

Use CBOW/Skip-gram when:

Maximum speed is required
Working with domain-specific vocabulary
Training on custom corpus
Memory/size is constrained
Simple keyword matching suffices

Use Sentence Transformer when:

Need robust OOV handling
Working with sentences/phrases
Want state-of-the-art quality
Inference speed is acceptable
Require transfer learning capabilities
Production deployment is planned

Results

Semantic candidate search in the Weaviate DB and ranking

Web search to assist the HR manager

Roadmap

Current Version (v1.0)

Basic resume embedding and storage
Semantic search functionality
Multiple model support (CBOW, Skip-gram, Transformers)
Batch processing
Model benchmarking

Upcoming Features (v1.1)

REST API for integration
Web-based UI dashboard
Real-time resume parsing from PDFs
Fine-tuned models on resume corpus
Multi-language support
Explainability features

Future Enhancements (v2.0)

Active learning feedback loop
Candidate ranking algorithms
Integration with applicant tracking systems (ATS)
Bias detection and mitigation
Skills gap analysis
Automated interview question generation

See the open issues for proposed features and known issues.

Contributing

Contributions are welcome. Any contributions you make are greatly appreciated.

How to Contribute

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black .
isort .

# Lint code
flake8 .
pylint src/

Contribution Guidelines

Write clear, commented code
Add unit tests for new features
Update documentation for API changes
Follow PEP 8 style guidelines
Keep pull requests focused and small
Provide detailed PR descriptions

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Project Maintainer: dhou22

GitHub: @dhou22
Project Link: https://github.com/dhou22/Get-Hired-Project
Issues: Report a Bug

Acknowledgments

This project was made possible by:

HuggingFace - State-of-the-art transformer models
Weaviate - Vector database technology
Sentence-Transformers - Pre-trained semantic embedding models
NLTK - Natural language processing tools
Gensim - Word2Vec implementations

Research & References

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space
Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers

Additional Resources

Made by dhou22

If you find this project helpful, please consider giving it a star.
