This Python project contains an NLP machine learning pipeline. It processes and cleans raw data, stores it in a SQL database, retrieves it again, transforms it with a machine learning pipeline (bag-of-words, part-of-speech tagging, TF-IDF, lemmatization), and trains a random forest. The best parameters for the model are found using grid search before the final model is trained, and the resulting model is saved in a classifier.pkl file. The package also includes a web app (run.py) that loads the trained model so that new texts can be classified through the browser.
The package has the following directory structure:
- NLP_Pipline_disaster_response (Main folder)
  - data (sub-folder)
    - disaster_messages.csv
    - disaster_categories.csv
    - process_data.py
    - cleaned_data_sql.db
  - models (sub-folder)
    - train_classifier.py
    - text_length_extractor.py
    - classifier.pkl
  - app (sub-folder)
    - run.py
    - text_length_extractor.py
    - templates (sub-folder)
      - go.html
      - master.html
The data folder contains the following files:
- `disaster_messages.csv`: Contains raw disaster-related text messages.
- `disaster_categories.csv`: Contains the categories for each disaster-related message.
- `process_data.py`: A Python script that reads in the `disaster_messages.csv` and `disaster_categories.csv` files, cleans the data, and saves the result in a SQLite database called `cleaned_data_sql.db` (see the sketch after this list).
- `cleaned_data_sql.db`: A SQLite database containing the cleaned data from the `disaster_messages.csv` and `disaster_categories.csv` files.
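A minimal sketch of what `process_data.py` presumably does, assuming pandas and SQLAlchemy; the `id` merge key, the `messages` table name, and the exact cleaning steps (the real script also has to turn the raw categories column into usable label columns) are assumptions, not the script's confirmed implementation:

```python
import pandas as pd
from sqlalchemy import create_engine

# Load the two raw CSV files
messages = pd.read_csv("disaster_messages.csv")
categories = pd.read_csv("disaster_categories.csv")

# Merge messages with their category labels (shared "id" column is an assumption)
df = messages.merge(categories, on="id")

# Drop duplicate rows so each message appears only once
df = df.drop_duplicates()

# Write the cleaned data to the SQLite database (table name is an assumption)
engine = create_engine("sqlite:///cleaned_data_sql.db")
df.to_sql("messages", engine, index=False, if_exists="replace")
```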
The models folder contains the following files:
- `train_classifier.py`: A Python script that reads the cleaned data from the `cleaned_data_sql.db` file, processes it using a machine learning pipeline (bag-of-words, part-of-speech tagging, TF-IDF, lemmatization), and trains a random forest. The best parameters for the model are found using grid search, and the resulting model is saved in a file called `classifier.pkl.zip`. With this model the following scores are reached: recall = 0.85, F1-score = 0.60. A rough sketch of such a pipeline follows this list.
- `text_length_extractor.py`: A Python script that extracts the length of each text message in characters and words.
- `classifier.pkl.zip`: A file containing the trained machine learning model.
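The following is a rough sketch of the kind of pipeline `train_classifier.py` builds, assuming scikit-learn and NLTK. The tokenizer, the table and column layout, and the parameter grid are illustrative assumptions; the real script additionally uses part-of-speech and text-length features (via `text_length_extractor.py`), which are omitted here for brevity:

```python
import pickle
import re

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

nltk.download("wordnet", quiet=True)  # needed by the lemmatizer

def tokenize(text):
    # Lowercase, split into word tokens, and lemmatize (bag-of-words input)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]

# Load the cleaned data (table name and column layout are assumptions)
engine = create_engine("sqlite:///../data/cleaned_data_sql.db")
df = pd.read_sql_table("messages", engine)
X = df["message"]
y = df.iloc[:, 4:]  # assumed: the label columns start at index 4

pipeline = Pipeline([
    ("bow", CountVectorizer(tokenizer=tokenize)),              # bag-of-words counts
    ("tfidf", TfidfTransformer()),                             # re-weight counts by TF-IDF
    ("clf", MultiOutputClassifier(RandomForestClassifier())),  # one forest per label
])

# Illustrative grid; the real parameter grid is defined in train_classifier.py
params = {"clf__estimator__n_estimators": [50, 100]}
model = GridSearchCV(pipeline, params, cv=3)
model.fit(X, y)

# Persist the fitted model; the repository ships it zipped as classifier.pkl.zip
with open("classifier.pkl", "wb") as f:
    pickle.dump(model, f)
```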
- To download the `classifier.pkl.zip` file, which has been uploaded to a Git LFS server (the `classifier.pkl` file is >10 GB), you need Git LFS installed and configured. After installing Git LFS, activate it by running `git lfs install`. Once Git LFS is installed and configured, you can download `classifier.pkl.zip` by running `git lfs pull`. After downloading, the file is still in its compressed zip format and must be unzipped manually into the same folder as `classifier.pkl.zip`; a short snippet for this follows below.
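The archive can be extracted with any zip tool, or with a short Python snippet like the following (the path assumes the zip was pulled into the models folder):

```python
import zipfile

# Extract classifier.pkl from the downloaded archive into the models folder
with zipfile.ZipFile("models/classifier.pkl.zip") as archive:
    archive.extractall("models")
```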
The app folder contains the following files:
- `run.py`: Starts a Flask web application that serves the machine learning model and shows some graphs about the training data (a minimal sketch follows this list).
- `text_length_extractor.py`: A Python script that extracts the length of each text message in characters and words.
- templates (sub-folder): A sub-folder containing the following files:
  - `go.html`: Template file for the output page of the web app.
  - `master.html`: Template file for building the overall structure and layout of the web app.
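A minimal sketch of how `run.py` might load the model and classify a query. The route names, the pickle path, and the template variables are assumptions based on the files listed above, not the script's confirmed implementation:

```python
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained model (path relative to the app folder is an assumption)
with open("../models/classifier.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def index():
    # Landing page; the real app also passes Plotly graphs of the training data
    return render_template("master.html")

@app.route("/go")
def go():
    # Classify the user's message and render the results page
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    return render_template("go.html", query=query, classification_result=labels)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3001)
```

This is presumably why `text_length_extractor.py` appears in both the models and app folders: unpickling the model requires its custom transformer to be importable wherever the model is loaded.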
To use the package, follow these steps:
- Clone the repository to your local machine.
- Make sure the required libraries are installed. To successfully run this project, the following libraries are required: json, plotly, pandas, nltk, Flask, joblib, sqlalchemy, ssl, pickle, sklearn, text_length_extractor, sys, and re. Please ensure that these libraries are installed on your system before running the code. If any of them are missing, you can install them using pip (a combined install command is given after these steps). For example, to install the pandas library, you can use the following command:

pip install pandas
Note that some of these libraries are imported in more than one of the scripts.
- Navigate to the data folder and run `python process_data.py`. This will read in the `disaster_messages.csv` and `disaster_categories.csv` files, clean the data, and save the result in a SQLite database called `cleaned_data_sql.db`.
- Navigate to the models folder and run `python train_classifier.py`. This will read the cleaned data from the `cleaned_data_sql.db` file, process it using the machine learning pipeline (bag-of-words, part-of-speech tagging, TF-IDF, lemmatization), and train the random forest.
- Navigate to the app folder and run `python run.py`. This will start the web app.
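For convenience, the third-party requirements can be installed in one go. json, ssl, pickle, sys, and re ship with Python, text_length_extractor is part of this repository, and sklearn is installed under the package name scikit-learn, so a single command like the following should cover the rest:

pip install plotly pandas nltk Flask joblib sqlalchemy scikit-learn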

