
A machine learning project that leverages HuggingFace transformers and Weaviate vector database to create semantic embeddings of resumes for intelligent matching and retrieval.


dhou22/Get-Hired-Project


Get Hired - AI Recruitment Assistant



An intelligent resume matching system powered by semantic embeddings and vector search





About The Project

Get Hired is an AI-powered recruitment assistant that revolutionizes the resume screening process. By leveraging state-of-the-art transformer models and vector databases, it creates semantic embeddings of resumes to enable intelligent matching and retrieval based on job requirements.

Motivation

Traditional keyword-based resume screening systems often miss qualified candidates due to rigid keyword matching, inability to understand context, poor handling of synonyms and related terms, and time-consuming manual processes.

Get Hired addresses these limitations by understanding semantic meaning beyond keywords, finding candidates with transferable skills, providing similarity scores for objective comparison, and scaling to thousands of resumes in seconds.


Key Features

Core Capabilities

  • Semantic Understanding: Uses transformer-based models (BERT, Sentence-BERT) to capture deep semantic meaning from resume text
  • Vector Storage: Efficient storage and retrieval using Weaviate's vector database with sub-second query times
  • Intelligent Matching: Find best-matching resumes based on natural language job descriptions
  • Multiple Model Support: Compare results across CBOW, Skip-gram, and Transformer models
  • Hybrid Search: Combine vector similarity with metadata filtering (experience, skills, category)
  • Scalable Architecture: Process and query thousands of resumes efficiently

Advanced Features

  • Natural language queries using plain English
  • Batch processing for large resume datasets with configurable batch sizes
  • Automated text cleaning, tokenization, and lemmatization
  • Built-in tools to compare embedding quality across different models
  • Robust error handling with detailed logging and recovery mechanisms
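The batch processing feature can be illustrated with a small generator. This is a minimal sketch, not the project's code: the `batched` helper and its default batch size are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


resumes = [f"resume text {i}" for i in range(7)]
sizes = [len(batch) for batch in batched(resumes, batch_size=3)]
print(sizes)  # [3, 3, 1]
```

Each batch could then be embedded and uploaded in one call, which keeps memory use bounded for large datasets.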

Architecture


System Components

Data Ingestion Layer

  • Resume parsing and validation
  • Text extraction and preprocessing

Embedding Generation

  • HuggingFace Transformers (all-MiniLM-L6-v2)
  • Custom Word2Vec models (CBOW, Skip-gram)

Vector Storage

  • Weaviate vector database
  • Schema-based structured storage
  • Hybrid search capabilities

Query Interface

  • Natural language processing
  • Similarity search
  • Metadata filtering

Getting Started

Prerequisites

  • Python 3.8 or higher
  • 4GB+ RAM recommended
  • GPU optional (for faster embedding generation)

Installation

1. Clone the Repository

git clone https://github.com/dhou22/Get-Hired-Project.git
cd Get-Hired-Project

2. Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment (Windows)
venv\Scripts\activate

# Activate virtual environment (macOS/Linux)
source venv/bin/activate

The notebook contains the following sections:

  1. Data Loading: Import and explore resume dataset
  2. Preprocessing: Clean and prepare text data
  3. Embedding Generation: Create vector representations
  4. Weaviate Setup: Configure schema and upload data
  5. Query & Retrieval: Test semantic search functionality
  6. Benchmarking: Compare model performance

Dataset

Dataset source on Kaggle: https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset

Dataset Overview

Dataset Statistics

  • Total Resumes: 2,484
  • Categories: 25 job categories
  • Average Length: ~500 words per resume
  • Format: Structured CSV with text fields

Sample Categories

Category           | Count | Examples
-------------------|-------|-----------------------------------------
Data Science       | 245   | Machine Learning Engineer, Data Analyst
Web Development    | 312   | Full-Stack Developer, Frontend Engineer
DevOps             | 189   | Cloud Engineer, SRE
Mobile Development | 156   | iOS Developer, Android Engineer

Data Schema in Weaviate

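The schema itself is not reproduced in text form here, so the following is only a sketch of what a Weaviate class definition for resumes might look like. The class name and every property name are hypothetical; the dictionary layout follows Weaviate's v3-style schema format, with `"vectorizer": "none"` because the embeddings are computed outside the database.

```python
# Hypothetical class definition; property names are illustrative assumptions,
# not the project's actual schema.
resume_class = {
    "class": "Resume",
    "description": "A resume with an externally computed embedding",
    "vectorizer": "none",  # vectors are supplied by the client at upload time
    "properties": [
        {"name": "category", "dataType": ["text"]},
        {"name": "resume_text", "dataType": ["text"]},
        {"name": "years_experience", "dataType": ["int"]},
    ],
}

property_names = [prop["name"] for prop in resume_class["properties"]]
print(property_names)
```

Storing fields such as the category as properties is what makes metadata filtering possible alongside vector search.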

Resume Processing Pipeline

Pipeline Steps

1. Weaviate DB ETL

Solution Architecture

Lowercase Conversion

  • Ensures consistency across text
  • Example: "Python" → "python"

Remove Punctuation & Special Characters

  • Eliminates noise from embeddings
  • Removes numbers unless contextually relevant

Tokenization

  • Splits text into individual words/tokens
  • Uses NLTK's word tokenizer

Stopword Removal

  • Removes common words with little semantic value
  • Preserves domain-specific terms

Lemmatization

  • Converts words to base form
  • Example: "running" → "run", "better" → "good"

Text Reconstruction

  • Rejoins tokens into clean text
  • Ready for embedding models
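The preprocessing steps above can be sketched in a few lines. This is a simplified illustration rather than the project's pipeline: the stopword set and lemma dictionary are toy stand-ins for NLTK's stopword list and lemmatizer, whitespace splitting stands in for NLTK's word tokenizer, and all digits are dropped without the "contextually relevant" exception.

```python
import re
from typing import Dict, Set

# Toy stand-ins (assumptions): a real run would load these from NLTK.
STOPWORDS: Set[str] = {"a", "an", "and", "the", "in", "of", "with", "is", "for"}
LEMMAS: Dict[str, str] = {"running": "run", "better": "good", "models": "model"}


def clean_resume(text: str) -> str:
    """Lowercase, strip non-letters, tokenize, drop stopwords, lemmatize, rejoin."""
    text = text.lower()                      # lowercase conversion
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation, digits, symbols
    tokens = text.split()                    # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    tokens = [LEMMAS.get(t, t) for t in tokens]         # dictionary lemmatization
    return " ".join(tokens)                  # text reconstruction


cleaned = clean_resume("Running ML models in Python 3, with better results!")
print(cleaned)  # run ml model python good results
```

Note that a blanket character filter like this one also strips terms such as "C++" or "Python 3", which is one reason to keep an allow-list of domain-specific tokens.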

2. HR Chatbot Workflow Steps


1. User Interaction

  • HR professional sends query through the interface
  • Query captured by ChatBot Interface (Angular frontend)
  • Communication handled via FastAPI backend (localhost:4200)

2. Query Transfer

  • Query transmitted from frontend to backend
  • Routed to the Orchestration Layer for processing

3. Orchestration Layer - Routing Agent

  • Intent Classification: Analyzes the query with keyword matching to classify its intent
  • Route Determination: Decides which specialized layer handles the request
  • Model: Mistral-large-latest (128k context window)
  • Memory Management: Maintains conversation history via ConversationBufferMemory

4. Query Branching

Routes the query to the appropriate specialized layer based on intent:

Path A: Candidate Search Queries → Retrieval Layer (5A)

Path B: Market Analysis Queries → Web Search Layer (5B)
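The keyword-matching part of the routing decision might look like the sketch below. The keyword list and the route labels are hypothetical, and the LLM-based classification performed by the routing agent is not modeled here.

```python
from typing import Tuple

# Hypothetical keyword list; the router's real vocabulary is not documented here.
MARKET_KEYWORDS: Tuple[str, ...] = ("salary", "market", "trend", "compensation")


def route_query(query: str) -> str:
    """Return 'web_search' for market analysis queries, otherwise 'retrieval'."""
    lowered = query.lower()
    if any(keyword in lowered for keyword in MARKET_KEYWORDS):
        return "web_search"
    return "retrieval"


print(route_query("Find senior Python developers with NLP experience"))  # retrieval
print(route_query("What is the average salary for a data engineer?"))    # web_search
```

Defaulting to the retrieval path keeps candidate search as the fallback when no market-related keyword is present.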

5A. Retrieval Layer - Candidate Search

  • Vector Database: Weaviate stores candidate profiles
  • Embedding Model: HuggingFace MiniLM-L6 (3k documents)
  • Search Method: 384-dim vector cosine similarity matching
  • Returns relevant candidate profiles

5B. Web Search Layer - Market Intelligence

  • Tool: Tavily for web scraping
  • Sources: Deep crawl of 5+ sources
  • Gathers real-time market data and trends

6. Scoring Layer

  • Scoring Agent: Llama-70B model
  • Evaluation: 4-phase weighted scoring system
  • Output: Ranked list of top 5 candidates
  • Processes retrieved candidates against job requirements
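A weighted scoring step of this kind could be combined as follows. The four phase names and their weights are hypothetical, since only the number of phases is stated here, and the per-phase scores would come from the scoring agent rather than being hard-coded.

```python
from typing import Dict, List, Tuple

# Hypothetical phases and weights (assumptions for illustration only).
WEIGHTS: Dict[str, float] = {
    "skills": 0.4,
    "experience": 0.3,
    "education": 0.2,
    "fit": 0.1,
}


def weighted_score(phase_scores: Dict[str, float]) -> float:
    """Combine per-phase scores in [0, 1] into one score; missing phases count as 0."""
    return sum(WEIGHTS[phase] * phase_scores.get(phase, 0.0) for phase in WEIGHTS)


def rank_candidates(candidates: Dict[str, Dict[str, float]],
                    top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n candidates sorted by weighted score, highest first."""
    scored = [(name, weighted_score(scores)) for name, scores in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


candidates = {
    "cand_a": {"skills": 0.9, "experience": 0.5, "education": 0.5, "fit": 0.5},
    "cand_b": {"skills": 0.5, "experience": 0.9, "education": 0.5, "fit": 0.5},
}
ranking = rank_candidates(candidates)
print([name for name, _ in ranking])  # ['cand_a', 'cand_b']
```

Because the weights sum to 1, the combined score stays on the same 0 to 1 scale as the individual phases.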

7. Analysis Layer

  • Analysis Agent: Mistral-7B model
  • Data Processing: Extracts and structures market data
  • Output Generation: Creates formatted salary tables and market insights reports

8. Output Delivery

Two types of output are generated:

Candidate Search Output

  • Ranked list of top 5 candidates with scores
  • Delivered to user interface

Market Report Output

  • Comprehensive salary table and market insights
  • Sourced data with analysis
  • Delivered to user interface

9. Response Cycle

  • Results displayed in ChatBot Interface
  • User receives either candidate recommendations or market analysis
  • Conversation history stored for context in future queries

10. Completion

  • User gets final answer through the HR interface
  • System ready for next query with maintained conversation context
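The way conversation history carries context between queries can be sketched with a small buffer class. This is a stdlib stand-in written for illustration, not the ConversationBufferMemory component itself; the class name, method names, and speaker labels are assumptions.

```python
from typing import List, Tuple


class ConversationBuffer:
    """Minimal stand-in for a buffer memory: keeps every turn, in order."""

    def __init__(self) -> None:
        self._turns: List[Tuple[str, str]] = []

    def add_turn(self, user: str, assistant: str) -> None:
        """Record one question/answer exchange."""
        self._turns.append((user, assistant))

    def as_prompt_context(self) -> str:
        """Render the history as text that can be prepended to the next prompt."""
        lines: List[str] = []
        for user, assistant in self._turns:
            lines.append(f"HR: {user}")
            lines.append(f"Assistant: {assistant}")
        return "\n".join(lines)


memory = ConversationBuffer()
memory.add_turn("Find Python developers", "Here are the top 5 candidates.")
memory.add_turn("Only those with 5+ years", "Filtered list below.")
context = memory.as_prompt_context()
print(context)
```

With the earlier turns in the prompt, a follow-up such as "Only those with 5+ years" can be resolved against the previous candidate search.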

3. AI Agents


Model Comparison & Benchmarks

Benchmark Results

Performance Comparison

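Model            | Embedding Dim | Inference Speed | OOV Handling | Quality Score
-----------------|---------------|-----------------|--------------|--------------
CBOW             | 100           | Fast            | Poor         | Good
Skip-gram        | 100           | Medium          | Poor         | Very Good
all-MiniLM-L6-v2 | 384           | Slower          | Excellent    | Excellent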

Model Details


Word2Vec models


1. CBOW (Continuous Bag of Words)

Strengths:

  • Fast inference speed (~0.1ms per resume)
  • Compact embeddings (100D)
  • Excellent for frequent words
  • Low memory footprint

Limitations:

  • Limited to training vocabulary
  • Poor performance on rare words
  • Cannot handle out-of-vocabulary (OOV) terms

Best Use Cases:

  • High-speed production systems
  • Domain-specific vocabularies
  • Resource-constrained environments

2. Skip-gram

Strengths:

  • Better semantic relationships
  • Works well with rare words
  • Captures fine-grained meanings
  • Good for analogies and similarities

Limitations:

  • Slower than CBOW
  • Still limited to training vocabulary
  • Requires more training data

Best Use Cases:

  • Semantic similarity tasks
  • Small to medium datasets
  • Custom corpus training
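The vocabulary limitation shared by CBOW and Skip-gram shows up when word vectors are combined into a resume-level embedding. One common approach, assumed here rather than taken from the project, is to average the vectors of known words and skip out-of-vocabulary tokens. The 2-dimensional toy vectors below stand in for trained 100-dimensional ones.

```python
from typing import Dict, List, Optional

WordVectors = Dict[str, List[float]]


def average_embedding(tokens: List[str], vectors: WordVectors) -> Optional[List[float]]:
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]  # OOV tokens are dropped
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy vocabulary (assumption): "kubernetes" was never seen during training.
toy_vectors: WordVectors = {"python": [1.0, 0.0], "developer": [0.0, 1.0]}

print(average_embedding(["python", "developer", "kubernetes"], toy_vectors))  # [0.5, 0.5]
print(average_embedding(["kubernetes"], toy_vectors))  # None
```

The second call shows the failure mode: a resume made up only of unseen terms has no embedding at all, whereas a subword-based transformer tokenizer can still encode it.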

Sentence Transformer (all-MiniLM-L6-v2)

Hugging Face model source: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Strengths:

  • Handles any word (no OOV issues)
  • Pre-trained on 1B+ sentence pairs
  • Higher dimensional embeddings (384D)
  • State-of-the-art quality
  • Transfer learning capabilities
  • Sentence-level understanding

Limitations:

  • Slower inference (~10ms per resume)
  • Larger model size (~80MB)
  • Requires more computational resources

Best Use Cases:

  • Production-ready applications
  • General-purpose text matching
  • When quality is priority
  • Handling diverse vocabularies

Benchmark Results

Query: "experienced python developer machine learning"

Model          | Top-3 Accuracy | Avg. Similarity | Query Time
---------------|----------------|-----------------|------------
CBOW           | 72%           | 0.68            | 0.02s
Skip-gram      | 78%           | 0.71            | 0.03s
MiniLM-L6-v2   | 94%           | 0.85            | 0.08s
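How Top-3 Accuracy was computed is not specified here. One plausible reading, sketched below as an assumption, is the fraction of test queries for which at least one of the three highest-ranked resumes belongs to the expected category. The category lists in the example are made up to exercise the function and are not benchmark data.

```python
from typing import Sequence


def top_k_accuracy(retrieved: Sequence[Sequence[str]],
                   expected: Sequence[str],
                   k: int = 3) -> float:
    """Share of queries whose expected category appears in the top-k results."""
    if len(retrieved) != len(expected):
        raise ValueError("retrieved and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for cats, target in zip(retrieved, expected) if target in cats[:k])
    return hits / len(expected)


# Categories of the ranked results for two hypothetical queries.
retrieved = [
    ["Data Science", "DevOps", "Web Development"],
    ["DevOps", "DevOps", "Mobile Development", "Data Science"],
]
expected = ["Data Science", "Data Science"]
print(top_k_accuracy(retrieved, expected))  # 0.5
```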

Recommendations

Use CBOW/Skip-gram when:

  • Maximum speed is required
  • Working with domain-specific vocabulary
  • Training on custom corpus
  • Memory/size is constrained
  • Simple keyword matching suffices

Use Sentence Transformer when:

  • Need robust OOV handling
  • Working with sentences/phrases
  • Want state-of-the-art quality
  • Inference speed is acceptable
  • Require transfer learning capabilities
  • Production deployment is planned

Results

Semantic candidate search in the Weaviate DB and ranking


Web search to assist the HR manager


Roadmap

Current Version (v1.0)

  • Basic resume embedding and storage
  • Semantic search functionality
  • Multiple model support (CBOW, Skip-gram, Transformers)
  • Batch processing
  • Model benchmarking

Upcoming Features (v1.1)

  • REST API for integration
  • Web-based UI dashboard
  • Real-time resume parsing from PDFs
  • Fine-tuned models on resume corpus
  • Multi-language support
  • Explainability features

Future Enhancements (v2.0)

  • Active learning feedback loop
  • Candidate ranking algorithms
  • Integration with ATS systems
  • Bias detection and mitigation
  • Skills gap analysis
  • Automated interview question generation

See the open issues for proposed features and known issues.


Contributing

Contributions are welcome. Any contributions you make are greatly appreciated.

How to Contribute

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black .
isort .

# Lint code
flake8 .
pylint src/

Contribution Guidelines

  • Write clear, commented code
  • Add unit tests for new features
  • Update documentation for API changes
  • Follow PEP 8 style guidelines
  • Keep pull requests focused and small
  • Provide detailed PR descriptions

License

Distributed under the MIT License. See LICENSE for more information.


Contact

Project Maintainer: dhou22


Acknowledgments

This project was made possible by:

Research & References

  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  • Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space
  • Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers



Made by dhou22

If you find this project helpful, please consider giving it a star.
