- Overview
- Project Structure and Components
- Data
- Methodology
- Results
- How to Run
- Conclusion & Future Work
 
## Overview

This project aims to develop a robust machine learning model for detecting fraudulent job postings. Online job platforms are a primary target for scammers, and fraudulent postings can mislead job seekers, waste their time, and expose them to financial risk.

Leveraging a dataset of job advertisements, the solution combines natural language processing (NLP) techniques with traditional machine learning algorithms to identify suspicious patterns. The core of the model uses TF-IDF for text feature extraction and a Linear Support Vector Classifier (LinearSVC) for classification. The project demonstrates a complete workflow, from data understanding and preprocessing to model training, evaluation, and persistence for future use.
## Project Structure and Components

The project is structured into several Python classes within the `classes.ipynb` Jupyter notebook, each handling a specific part of the data science pipeline:
- `Data` class:
  - Purpose: Handles the initial loading of the `jobs.csv` dataset and provides basic data exploration methods (e.g., `head()`, `shape()`, `info()`, `describe()`).
  - Key Functionality: Provides fundamental insights into the dataset's structure, dimensions, data types, and summary statistics.
- `DataPreprocessing` class:
  - Purpose: Focuses on cleaning and preparing the raw data for model training.
  - Key Functionality:
    - Identifies and handles missing values (e.g., dropping rows with a missing `description`, filling others with placeholders like 'Not Provided' or statistical measures like the median/mode).
    - Performs feature engineering by creating new informative features from existing text fields (e.g., text lengths, counts of specific characters like '$' or '!', presence of common scam keywords).
    - Saves the cleaned and engineered dataset to a new CSV file (`phase1cleaned.csv`).
- `Graph`, `UnivariateAnalysis`, `BivariateAnalysis` classes:
  - Purpose: Dedicated to exploratory data analysis (EDA) through various visualizations.
  - `Graph`: A base class providing individual plotting methods for different aspects of the data.
  - `UnivariateAnalysis`: Inherits from `Graph` and aggregates methods for plotting distributions of single variables (e.g., `fraudulent` job distribution, `employment_type` distribution, `required_education` levels).
  - `BivariateAnalysis`: Inherits from `Graph` and provides methods for visualizing relationships between two variables (e.g., average salary vs. location, employment type vs. fraudulent status, fraudulent jobs by industry).
- `Model` class:
  - Purpose: Encapsulates the core machine learning pipeline, including feature encoding, data splitting, model training, and evaluation.
  - Key Functionality:
    - Feature Encoding: Combines relevant text fields into a single 'text' column and applies TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text into numerical features. Integrates other numerical features such as `telecommuting`, `has_company_logo`, and `has_questions`.
    - Data Splitting: Divides the processed data into training and testing sets (80/20 split) using stratified sampling to ensure the target variable's class distribution is maintained.
    - Model Training: Trains a `LinearSVC` (Linear Support Vector Classifier) from `sklearn.svm`, known for its effectiveness in text classification.
    - Model Evaluation: Assesses the model's performance using standard metrics: confusion matrix, classification report (precision, recall, F1-score), and accuracy.
    - Model Persistence: Saves the trained `LinearSVC` model and the `TfidfVectorizer` object using Python's `pickle` module for later use (see the sketch after this list).
- `PickleModel` class:
  - Purpose: Provides a convenient way to load the previously saved machine learning model and vectorizer to make predictions on new, unseen data.
  - Key Functionality: Loads the `svm_fraud_model.pkl` file and offers a `predict()` method that prepares new data (combining text, vectorizing, stacking with other features) and generates predictions.
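For illustration, a minimal persistence sketch in the spirit of `Model` and `PickleModel`. It assumes the trained classifier and fitted vectorizer (named `clf` and `tfidf` here as placeholders) are bundled into the single `svm_fraud_model.pkl` file; the notebook's actual layout may differ:

```python
import pickle

# Save the trained LinearSVC and fitted TfidfVectorizer together
# (assumed dict layout; the notebook may pickle them differently).
with open("svm_fraud_model.pkl", "wb") as f:
    pickle.dump({"model": clf, "vectorizer": tfidf}, f)

# Reload both objects later for inference.
with open("svm_fraud_model.pkl", "rb") as f:
    saved = pickle.load(f)
model, vectorizer = saved["model"], saved["vectorizer"]
```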
 
## Data

The project utilizes a dataset of job postings, assumed to be named `jobs.csv`. This dataset contains various features describing job advertisements, including:

- Textual Features: `title`, `description`, `requirements`, `benefits`, `company_profile`.
- Categorical Features: `employment_type`, `required_experience`, `required_education`, `industry`, `function`, `location`, `department`.
- Numerical/Binary Features: `salary_range` (converted to `salary_avg`), `telecommuting`, `has_company_logo`, `has_questions`.
- Target Variable: `fraudulent` (a binary variable indicating whether a job posting is legitimate, `0`, or fraudulent, `1`).
## Methodology

The initial step involves loading the `jobs.csv` dataset into a Pandas DataFrame. The `Data` class is used to perform preliminary checks, such as:

- Viewing the first few rows (`head()`) to understand the data format.
- Checking the dimensions (`shape`) to know the number of rows and columns.
- Getting a concise summary of the DataFrame (`info()`) to inspect data types and non-null counts.
- Generating descriptive statistics (`describe()`, `describe_all()`) for numerical and categorical columns.
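As a minimal sketch, these are the plain-pandas calls the `Data` class presumably wraps (`describe_all()` likely corresponds to `describe(include="all")`, which covers categorical columns as well):

```python
import pandas as pd

# Load the raw dataset and run the preliminary checks.
df = pd.read_csv("jobs.csv")

print(df.head())                    # first rows: data format
print(df.shape)                     # (n_rows, n_columns)
df.info()                           # dtypes and non-null counts
print(df.describe())                # summary stats for numerical columns
print(df.describe(include="all"))   # summary stats including categoricals
```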
The `DataPreprocessing` class implements crucial steps to prepare the data (a pandas sketch follows the list):

- Handling Missing Values:
  - Rows with a missing `description` are dropped, as this is a vital text field for fraud detection.
  - Other missing text fields (`company_profile`, `requirements`, `benefits`) are filled with 'Not Provided'.
  - `salary_range` is processed to extract `salary_avg` (the average of the range if available, otherwise filled with the median salary). The original `salary_range` column is then dropped.
  - Other categorical columns (`location`, `department`, `employment_type`, `required_experience`, `required_education`, `industry`, `function`) are filled with their respective modes.
- Feature Extraction: New numerical features are engineered from the text columns:
  - `desc_length`, `req_length`, `benefits_length`, `title_length`, `profile_length`: the character lengths of the respective text fields.
  - `desc_dollar_count`, `desc_exclaim_count`: the number of dollar signs and exclamation marks in the job description, often indicators of suspicious language.
  - `has_scam_words`: a binary indicator (0 or 1) of whether the description contains common scam-related phrases (e.g., 'money', 'investment', 'fast cash', 'work from home', 'no experience', 'quick earn').
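A minimal pandas sketch of these cleaning and feature-engineering steps, assuming a `salary_range` format like `"40000-60000"` (the notebook's parsing may differ):

```python
import pandas as pd

df = pd.read_csv("jobs.csv")

# Drop rows missing the vital description field; fill other text fields.
df = df.dropna(subset=["description"])
for col in ["company_profile", "requirements", "benefits"]:
    df[col] = df[col].fillna("Not Provided")

# Derive salary_avg from a "low-high" salary_range, then drop the original.
bounds = df["salary_range"].str.extract(r"(\d+)-(\d+)").astype(float)
df["salary_avg"] = bounds.mean(axis=1)
df["salary_avg"] = df["salary_avg"].fillna(df["salary_avg"].median())
df = df.drop(columns=["salary_range"])

# Fill remaining categorical columns with their modes.
for col in ["location", "department", "employment_type", "required_experience",
            "required_education", "industry", "function"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Engineer simple numeric features from the text fields.
df["desc_length"] = df["description"].str.len()
df["desc_dollar_count"] = df["description"].str.count(r"\$")
df["desc_exclaim_count"] = df["description"].str.count("!")
scam_words = ["money", "investment", "fast cash", "work from home",
              "no experience", "quick earn"]
pattern = "|".join(scam_words)
df["has_scam_words"] = df["description"].str.lower().str.contains(pattern).astype(int)

df.to_csv("phase1cleaned.csv", index=False)
```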
 
The `Graph`, `UnivariateAnalysis`, and `BivariateAnalysis` classes are used to visualize the data and uncover patterns (a minimal plotting sketch follows the list):

- Univariate Analysis:
  - Fraudulent Job Distribution: Shows a significant class imbalance, with far more legitimate jobs than fraudulent ones. This is a common challenge in fraud detection.
  - Employment Type Distribution: Visualizes the most common employment types.
  - Required Experience/Education: Displays the distribution of required experience levels and education backgrounds.
  - Telecommuting/Company Logo/Questions: Shows the counts of remote jobs, jobs with company logos, and jobs requiring screening questions (e.g., fraudulent jobs often lack company logos or are remote).
  - Top Industries/Job Functions/Locations: Highlights the most frequent industries, job functions, and geographical locations.
- Bivariate Analysis:
  - Average Salary by Location: Illustrates how average salary varies across top locations.
  - Employment Type vs. Fraudulent: Breaks down fraudulent vs. non-fraudulent counts by employment type, revealing whether certain types are more prone to fraud.
  - Fraud by Industry: Pinpoints which industries have the highest counts of fraudulent job postings.
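As a sketch of the kind of plot these classes produce, here is a hypothetical bar chart of the target distribution built with matplotlib on the cleaned data (column names follow the dataset; the classes' actual styling may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("phase1cleaned.csv")

# Bar chart of the target distribution, highlighting the class imbalance.
counts = df["fraudulent"].value_counts().sort_index()
counts.plot(kind="bar")
plt.xticks([0, 1], ["Legitimate (0)", "Fraudulent (1)"], rotation=0)
plt.ylabel("Number of postings")
plt.title("Fraudulent Job Distribution")
plt.tight_layout()
plt.show()
```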
 
 
The `Model` class orchestrates the machine learning pipeline (a sketch follows the list):

- Combined Text Feature: A new 'text' column is created by concatenating `title`, `description`, `requirements`, `benefits`, and `company_profile`. This comprehensive text field is central to the NLP approach.
- TF-IDF Vectorization: A `TfidfVectorizer` is applied to the combined 'text' column. This technique transforms text into a matrix of numerical TF-IDF features, representing the importance of words in a document relative to the corpus. `max_features` is set to 50,000 to limit the vocabulary size.
- Feature Stacking: The TF-IDF features are horizontally stacked (`hstack`) with the other numerical/binary features (`telecommuting`, `has_company_logo`, `has_questions`) to create the final feature matrix `X`.
- Data Splitting: The dataset is split into training (80%) and testing (20%) sets using `train_test_split`. Crucially, `stratify=self.y` is used to ensure that the proportion of fraudulent jobs is maintained in both sets, addressing the class imbalance. `random_state=42` ensures reproducibility.
- Model Training: A `LinearSVC` (Linear Support Vector Classifier) is trained on the `X_train` and `y_train` data. `LinearSVC` is suitable for large datasets and linear classification tasks, making it a good choice for TF-IDF features. `max_iter` is increased to 10,000 to ensure convergence.
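A minimal end-to-end sketch of this pipeline, assuming the cleaned `phase1cleaned.csv` from the preprocessing step (variable names are illustrative, not the notebook's):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("phase1cleaned.csv")

# Combine the relevant text fields into a single 'text' column.
text_cols = ["title", "description", "requirements", "benefits", "company_profile"]
df["text"] = df[text_cols].fillna("").astype(str).agg(" ".join, axis=1)

# TF-IDF vectorization with a capped vocabulary.
tfidf = TfidfVectorizer(max_features=50000)
X_text = tfidf.fit_transform(df["text"])

# Horizontally stack the sparse TF-IDF matrix with the binary features.
extra = csr_matrix(df[["telecommuting", "has_company_logo", "has_questions"]].values)
X = hstack([X_text, extra]).tocsr()
y = df["fraudulent"]

# Stratified 80/20 split preserves the fraud ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Linear SVM; max_iter raised to 10,000 to ensure convergence.
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)
```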
## Results

After training, the model's performance is evaluated on the unseen `X_test` data:

- Confusion Matrix: Provides a breakdown of correct and incorrect classifications (True Positives, True Negatives, False Positives, False Negatives).
- Classification Report: Offers detailed metrics for each class:
  - Precision: the proportion of positive identifications that were actually correct.
  - Recall (Sensitivity): the proportion of actual positives that were correctly identified.
  - F1-score: the harmonic mean of precision and recall, providing a balanced measure.
- Accuracy Score: the overall proportion of correctly classified instances.
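Continuing the pipeline sketch above, the evaluation in scikit-learn looks like:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict on the held-out test set and report the standard metrics.
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
print("Accuracy:", accuracy_score(y_test, y_pred))
```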
 
Model Performance Summary:

Based on the provided output, the LinearSVC model demonstrates strong performance in detecting fraudulent job postings:

```
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3403
           1       0.99      0.78      0.87       173

    accuracy                           0.99      3576
   macro avg       0.99      0.89      0.93      3576
weighted avg       0.99      0.99      0.99      3576

Accuracy: 0.9890939597315436
```
- Overall Accuracy: Approximately 98.9%. This high accuracy indicates that the model correctly classifies the large majority of job postings. Note, however, that with only 173 fraudulent examples out of 3,576 in the test set, an "always legitimate" baseline would already score about 95%, so the per-class metrics below are more informative.
- Fraudulent Class (Class 1) Performance:
  - Precision (0.99): When the model predicts a job is fraudulent, it is correct 99% of the time. This is excellent for minimizing false alarms.
  - Recall (0.78): The model identifies 78% of all actual fraudulent job postings. While not perfect, this is respectable recall, indicating the model catches a significant portion of fraud.
  - F1-score (0.87): A strong F1-score for the minority class suggests a good balance between precision and recall in identifying fraud.

These results indicate that the model is highly effective at differentiating between legitimate and fraudulent job postings, making it a valuable tool for enhancing job-platform security.
## How to Run

To execute this project and train the fraud detection model:

- Prerequisites: Ensure you have Python 3.x installed, along with the following libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, and `scipy` (`copy` and `pickle` are built in). You can install the external packages via pip:

  ```
  pip install pandas numpy matplotlib scikit-learn scipy
  ```

- Dataset: Make sure the `jobs.csv` file is located in the same directory as the `classes.ipynb` notebook.
- Execute the Notebook:
  - Open `classes.ipynb` using Jupyter Notebook or JupyterLab.
  - Run all cells sequentially. The notebook performs data loading, preprocessing, feature engineering, text vectorization, data splitting, model training, and evaluation.
  - Upon successful execution, the trained model and TF-IDF vectorizer are saved as `svm_fraud_model.pkl` in the project directory.
- Making New Predictions: To use the trained model on new data, load the `PickleModel` class and call its `predict()` method. Ensure your new DataFrame has the required columns (`title`, `description`, `requirements`, `benefits`, `company_profile`, `telecommuting`, `has_company_logo`, `has_questions`). A hypothetical usage sketch follows.
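For illustration, a hypothetical usage sketch run inside the notebook, assuming the `PickleModel` interface described above (the exact constructor and return type may differ):

```python
import pandas as pd

# A single new posting with all required columns; values are made up.
new_jobs = pd.DataFrame([{
    "title": "Work from home - fast cash!!!",
    "description": "No experience needed. Quick earn, $500/day.",
    "requirements": "None",
    "benefits": "Flexible hours",
    "company_profile": "Not Provided",
    "telecommuting": 1,
    "has_company_logo": 0,
    "has_questions": 0,
}])

model = PickleModel("svm_fraud_model.pkl")  # assumed constructor signature
predictions = model.predict(new_jobs)       # 0 = legitimate, 1 = fraudulent
print(predictions)
```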
 
## Conclusion & Future Work

This project developed a robust fraud detection model capable of classifying job postings with high accuracy. The combination of domain-specific feature engineering, TF-IDF text representation, and a LinearSVC classifier proved effective at handling the complexity and imbalanced nature of the dataset. The model's high precision for the fraudulent class ensures that legitimate job postings are rarely misflagged, which is crucial for user experience on job platforms. Future work could focus on raising recall for the fraudulent class above its current 78%, for example by addressing the class imbalance during training.