A machine learning system for predicting water quality based on physicochemical parameters. This project uses various ML algorithms to classify water quality into three categories: Excellent, Good, and Poor.
Water clarity/
├── README.md                 # This file
├── Water-Clarity-DS.csv      # Dataset containing water quality measurements
├── test.ipynb                # Jupyter notebook for model development and training
├── water_quality_model.pkl   # Trained machine learning model (serialized)
├── feature_names.json        # List of feature names used by the model
├── class_labels.json         # Mapping of class indices to quality labels
├── predict.py                # Standalone prediction script
├── api.py                    # API endpoint for model serving
└── __pycache__/              # Python cache files
    └── api.cpython-311.pyc
The model uses the following water quality parameters to make predictions:
- Temperature (°C)
- Turbidity (cm)
- Dissolved Oxygen (mg/L)
- BOD (Biological Oxygen Demand, mg/L)
- pH value
- Ammonia concentration (mg/L)
- Nitrite concentration (mg/L)
The system predicts water quality in three categories:
- 0: Excellent water quality
- 1: Good water quality
- 2: Poor water quality
pip install pandas numpy scikit-learn joblib matplotlib seaborn
- Using the prediction script directly:
from predict import predict_water_quality
# Example prediction
result = predict_water_quality(
    temp=67.45,
    turbidity=10.13,
    do=0.208,
    bod=7.474,
    ph=4.752,
    ammonia=0.286,
    nitrite=4.355
)
print(f"Water Quality: {result['quality']}")
print(f"Confidence: {result['probabilities']}")
- Running the API server:
python api.py
- Training your own model:
  - Open test.ipynb in Jupyter Notebook
  - Run all cells to train and evaluate different models
  - The best model will be automatically saved
The dataset (Water-Clarity-DS.csv) contains water quality measurements with:
- European decimal format (comma as the decimal separator)
- Multiple physicochemical parameters
- Balanced sampling across quality categories
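European decimal format means values are written as `7,25` rather than `7.25`, so they have to be converted before they can be treated as numbers. The following is a minimal, dependency-free sketch of that conversion. The semicolon delimiter, the column names, and the sample rows are assumptions for illustration and are not taken from Water-Clarity-DS.csv.

```python
import csv
import io


def parse_european_float(text):
    """Convert a string such as '7,25' to the float 7.25."""
    return float(text.strip().replace(",", "."))


def read_european_csv(handle, delimiter=";"):
    """Read a delimited file whose numeric cells use a decimal comma.

    The semicolon delimiter is an assumption: a file that uses the comma
    as its decimal mark usually needs a different field separator.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        {column: parse_european_float(value) for column, value in row.items()}
        for row in reader
    ]


# Hypothetical two-row sample, not real measurements from the dataset.
sample = io.StringIO("temp;ph;ammonia\n24,5;7,25;0,031\n19,0;6,80;0,120\n")
rows = read_european_csv(sample)
print(rows[0])
```

With pandas, `pd.read_csv(path, sep=";", decimal=",")` performs the same conversion while loading, again assuming a semicolon delimiter.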
The system evaluates multiple machine learning algorithms:
- Random Forest
- Gradient Boosting
- Support Vector Machine (SVM)
- Logistic Regression
- K-Nearest Neighbors
- Data Preprocessing: Handle European decimal format, balance classes
- Model Evaluation: Cross-validation with stratified sampling
- Hyperparameter Tuning: Grid search for optimal parameters
- Performance Analysis: Classification reports and confusion matrices
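The evaluation and tuning steps above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic (generated with `make_classification`), only two of the five listed algorithms are compared, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset: 7 features, 3 classes.
X, y = make_classification(
    n_samples=150, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Model evaluation: mean cross-validated accuracy per candidate.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

# Hyperparameter tuning: a small, arbitrary grid for the random forest.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X, y)

print(best_name, scores)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into preprocessing.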
- load_and_preprocess_data(): Load and clean the dataset
- balance_classes(): Balance the dataset for fair training
- evaluate_models(): Compare different ML algorithms
- optimize_best_model(): Hyperparameter tuning for best model
- analyze_feature_importance(): Understand feature contributions
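This README does not say which strategy balance_classes() uses. One common option is random undersampling, where every class is cut down to the size of the rarest one. The sketch below assumes that strategy; the function name, the `quality` label key, and the sample rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample_to_smallest_class(rows, label_key="quality", seed=0):
    """Randomly drop rows until every class has as many rows as the rarest one."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Hypothetical imbalanced sample: 6 rows of class 0, 3 of class 1, 2 of class 2.
rows = (
    [{"ph": 7.0, "quality": 0}] * 6
    + [{"ph": 6.5, "quality": 1}] * 3
    + [{"ph": 5.0, "quality": 2}] * 2
)
balanced = undersample_to_smallest_class(rows)
print(Counter(row["quality"] for row in balanced))
```

Undersampling discards data, so on a small dataset oversampling the minority classes or using class weights may be a better fit.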
The system automatically selects the best performing model based on cross-validation scores. Typical performance metrics include:
- Accuracy scores for each class
- F1-scores per quality category
- Confusion matrix analysis
- Feature importance rankings
The API provides a REST endpoint for making predictions:
# Example API call structure
POST /predict
{
    "temp": 67.45,
    "turbidity": 10.13,
    "do": 0.208,
    "bod": 7.474,
    "ph": 4.752,
    "ammonia": 0.286,
    "nitrite": 4.355
}
The system generates several important files:
- water_quality_model.pkl: Serialized trained model
- feature_names.json: Feature names in correct order
- class_labels.json: Quality label mappings
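The two JSON files matter because a scikit-learn model expects feature values in the same column order it was trained on, and it returns class indices rather than labels. The sketch below shows one way to use them. It assumes feature_names.json holds a JSON list and class_labels.json a JSON object keyed by class index; the real files may be laid out differently, and both helper functions are hypothetical.

```python
import json


def build_feature_vector(sample, feature_names):
    """Return the sample's values in the order the model was trained on."""
    missing = [name for name in feature_names if name not in sample]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(sample[name]) for name in feature_names]


def decode_label(class_index, class_labels):
    """Map a predicted class index to its label; JSON object keys are strings."""
    return class_labels[str(class_index)]


# Assumed file contents, inlined here so the example is self-contained.
feature_names = json.loads(
    '["temp", "turbidity", "do", "bod", "ph", "ammonia", "nitrite"]'
)
class_labels = json.loads('{"0": "Excellent", "1": "Good", "2": "Poor"}')

# Keys arrive in arbitrary order; the vector follows feature_names instead.
sample = {"ph": 4.752, "temp": 67.45, "nitrite": 4.355, "do": 0.208,
          "bod": 7.474, "turbidity": 10.13, "ammonia": 0.286}
vector = build_feature_vector(sample, feature_names)
print(vector)
print(decode_label(2, class_labels))
```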
The model analyzes which parameters are most important for water quality prediction. This helps understand:
- Which measurements have the strongest impact on water quality
- How different parameters contribute to the final classification
- Insights for water quality monitoring priorities
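As a rough illustration of how such a ranking can be produced, the sketch below fits a Random Forest (one of the algorithms this project evaluates) and sorts its impurity-based importances. It assumes a tree-based model was selected. The data is synthetic, so the resulting order says nothing about real water chemistry, and the short feature names are borrowed from the prediction script's arguments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["temp", "turbidity", "do", "bod", "ph", "ammonia", "nitrite"]

# Synthetic stand-in data with one column per feature name.
X, y = make_classification(
    n_samples=200, n_features=len(feature_names), n_informative=4,
    n_redundant=0, n_classes=3, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per feature; they sum to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:<10} {importance:.3f}")
```

Models without a `feature_importances_` attribute, such as SVM or K-Nearest Neighbors, can be analyzed with `sklearn.inspection.permutation_importance` instead.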
The Jupyter notebook includes:
- Data distribution plots
- Model performance comparisons
- Feature importance visualizations
- Confusion matrix heatmaps
import pandas as pd
from predict import predict_water_quality
# Load your data
data = pd.read_csv('new_water_samples.csv')
# Make predictions for each row
predictions = []
for _, row in data.iterrows():
    result = predict_water_quality(
        temp=row['temp'],
        turbidity=row['turbidity'],
        do=row['do'],
        bod=row['bod'],
        ph=row['ph'],
        ammonia=row['ammonia'],
        nitrite=row['nitrite']
    )
    predictions.append(result['quality'])
data['predicted_quality'] = predictions
# Integrate with sensor data
def monitor_water_quality(sensor_data):
    result = predict_water_quality(**sensor_data)
    if result['prediction'] == 2:  # Poor quality
        # send_alert is not defined here; supply your own notification function
        send_alert(f"Poor water quality detected: {result['quality']}")
    return result
- Add new features to the dataset
- Update feature_names.json accordingly
- Retrain the model using test.ipynb
- Test with new prediction scripts
- Collect more training data
- Experiment with feature engineering
- Try advanced algorithms (XGBoost, Neural Networks)
- Implement ensemble methods
The system provides comprehensive evaluation:
- Cross-validation scores: Generalization performance
- Test accuracy: Final model performance
- Per-class metrics: Precision, recall, F1-score
- Confusion matrix: Classification details
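To make the relationship between these metrics concrete, the following sketch derives accuracy and per-class precision, recall, and F1 from a confusion matrix in plain Python. The labels are made up for illustration and are not results from this project.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def per_class_metrics(matrix):
    """Return {class_index: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column total
        actual_k = sum(matrix[k])                    # row total
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[k] = (precision, recall, f1)
    return metrics


# Hypothetical labels: 0 = Excellent, 1 = Good, 2 = Poor.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / len(y_true)
metrics = per_class_metrics(matrix)

print(matrix)
print(f"accuracy={accuracy:.2f}")
for k, (precision, recall, f1) in metrics.items():
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice, `sklearn.metrics.confusion_matrix` and `sklearn.metrics.classification_report` compute the same quantities.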
To retrain the model with new data:
- Update Water-Clarity-DS.csv with new samples
- Run the complete pipeline in test.ipynb
- New model artifacts will be automatically saved
- Update prediction scripts if needed
- Data Quality: Ensure measurements are accurate and consistent
- Feature Scaling: The system handles scaling automatically
- Class Balance: Dataset balancing is implemented for fair training
- Validation: Always validate predictions with ground truth when possible
- Fork the repository
- Create your feature branch
- Add tests for new functionality
- Update documentation
- Submit a pull request
This project is open source and available under the MIT License.
For issues or questions:
- Check the Jupyter notebook for detailed examples
- Review the prediction script for usage patterns
- Examine the API code for integration examples
Note: This system is designed for educational and research purposes. For production water quality monitoring, please validate against certified laboratory measurements and follow local regulations.