# AutoML Pipeline for Data Streams (AML4S)
This repository contains a loan data stream generator and a fully automated online machine learning method for data streams. It also contains visualization tools, experiments and examples of the method.
- To install AML4S from GitHub, use:
  `git clone https://github.com/AuthEceSoftEng/automl-data-streams.git`
- To create a loan dataset, use the `create_loandataset` function.
- To convert a dataset from ARFF to CSV, use the `convert_arff_to_csv` function.
- To prepare a dataset for the pipeline (if it is not already a list of dictionaries) from a CSV file or from one of River's real datasets, use the `prepare_data` function.
- To create and use an instance of AML4S, use the `AML4S` class (see the sketch after this list):
  - Create an instance of AML4S with `__init__`.
  - Create a small training data set.
  - Train AML4S for the first time with `init_train`.
  - Predict using AML4S with `predict_one`.
  - Train AML4S with a new instance with `learn_one`.
- To evaluate the created pipelines (one or more), use the `evaluation` function.
- To create plots for the evaluations, use the `create_plots` function.
- To create plots of dataset features, use the `data_plot` function.
- To create interactive diagrams from saved files of the experiments with online methods, run the file `plots_online_exp.py`.
- To create interactive diagrams from saved files of the experiments with OAML, run the file `plots_oaml_exp.py`.
- To create comparison diagrams from saved files of the experiments with online methods, run the file `comparison_with_online.py`.
- To create comparison diagrams from saved files of the experiments with OAML, run the file `comparison_with_online.py`.
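A minimal end-to-end sketch of this workflow is shown below. The class, function, and file names come from this repository; the import paths, the `loan` target name, the dictionary form of the drift points, and the size of the initial training batch are assumptions made for illustration.

```python
from river import metrics

from AML4S_class import AML4S                        # assumed import path (file AML4S_class.py)
from Create_loandataset import create_loandataset    # assumed import path (file Create_loandataset.py)
from Evaluation import evaluation                     # assumed import path (file Evaluation.py)
from Create_Plots import create_plots                 # assumed import path (file Create_Plots.py)

# Generate a stream with a data drift at sample 2000 and a concept drift at sample 4000.
data = create_loandataset(2, 10000, {4000: "crisis"}, {2000: "crisis"}, 42)

# Create the pipeline and train it once on a small initial batch (500 samples is an arbitrary choice).
pipeline = AML4S(target="loan", data_drift_detector=True, consept_drift_detector=True, seed=42)
pipeline.init_train(data[:500])

# Prequential loop: predict first, then learn from the true label.
y_real, y_pred = [], []
for sample in data[500:]:
    x = {key: value for key, value in sample.items() if key != "loan"}
    y = sample["loan"]
    y_pred.append(pipeline.predict_one(x))
    pipeline.learn_one(x, y)
    y_real.append(y)

# Evaluate the single pipeline and plot the metric together with the detected drifts.
results = evaluation([y_real], [y_pred], metrics.Accuracy())
create_plots(results, [pipeline.data_drifts], [pipeline.concept_drifts])
```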
A good example of how to use AML4S is included in the `AML4S_Usage` file.
Some good examples of how to use all the functions are included in the `Experiments` directory.
- File: `AML4S_class.py`
- Description: Contains the functions and the parameters of the AML4S object.

`AML4S(target, data_drift_detector, consept_drift_detector, seed)`
- Description: Creates the AML4S object (constructor).
- Parameters:
  - `target` (str): The target variable for the model to predict.
  - `data_drift_detector` (bool): True if a data drift detector is used, else False.
  - `consept_drift_detector` (bool): True if a concept drift detector is used, else False.
  - `seed` (int | None): Random seed for reproducibility.

`init_train(self, init_train_data)`
- Description: Trains the pipeline for the first time with a provided dataset.
- Parameters:
  - `init_train_data` (list[dict]): List of dictionaries with the training data.

`predict_one(self, x)`
- Description: Predicts the target variable given the features.
- Parameters:
  - `x` (dict): Sample of data with the features.
- Returns:
  - `y` (int): Predicted target value.

`learn_one(self, x, y)`
- Description: Trains the pipeline one sample at a time.
- Parameters:
  - `x` (dict): Sample of data with the features.
  - `y` (int): True target value of the sample.
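A minimal construction sketch under the same assumptions (import path and the `loan` target name are not part of the documented API):

```python
from AML4S_class import AML4S   # assumed import path

# Positional arguments in the documented order; "loan" as the target column is an assumption.
pipeline = AML4S("loan", True, True, 42)
```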
- File: `AML4S_Usage.py`
- Description: Executes the AutoML pipeline on the provided dataset, including data drift and concept drift detection.

`use_AML4S(data, target, data_drift_detector, consept_drift_detector)`
- Parameters:
  - `data` (list): The dataset to be processed by the pipeline.
  - `target` (str): The target variable for the model to predict.
  - `data_drift_detector` (bool): True if a data drift detector is used, else False.
  - `consept_drift_detector` (bool): True if a concept drift detector is used, else False.
- Returns:
  - `y_real` (list): Real target values.
  - `y_pred` (list): Predicted target values.
  - `pipeline.data_drifts` (list): Detected data drifts.
  - `pipeline.concept_drifts` (list): Detected concept drifts.
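A hedged sketch of calling `use_AML4S`, assuming the function is importable from `AML4S_Usage.py`, the target column is named `loan`, and the four documented outputs are returned as a tuple:

```python
from AML4S_Usage import use_AML4S                    # assumed import path (file AML4S_Usage.py)
from Create_loandataset import create_loandataset    # assumed import path

data = create_loandataset(2, 10000, {4000: "crisis"}, {2000: "crisis"}, 42)

# Run the full AutoML pipeline with both drift detectors enabled.
y_real, y_pred, data_drifts, concept_drifts = use_AML4S(data, "loan", True, True)
```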
- File: `Find_best_pipeline.py`
- Description: Finds the best-performing pipeline among various models and configurations, using data and concept drift detection methods.

`find_best_pipeline(x_train, y_train, data_drift_detector_method, concept_drift_detector_method)`
- Parameters:
  - `x_train` (list): Data with feature values for training.
  - `y_train` (list): Data with target values for training.
  - `data_drift_detector_method` (object): Method for detecting data drift.
  - `concept_drift_detector_method` (object): Method for detecting concept drift.
- Returns:
  - `pipeline` (object): The selected best pipeline.
  - `accuracy` (object): The accuracy of the selected pipeline.
  - `data_drift_detectors` (object): Data drift detectors used in the selected pipeline.
  - `concept_drift_detector` (object): The concept drift detector used in the selected pipeline.
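A hedged sketch of calling `find_best_pipeline`; the import paths, the `loan` target name, and passing River's `ADWIN` as the drift-detection method are assumptions:

```python
from river import drift

from Create_loandataset import create_loandataset    # assumed import paths
from Find_best_pipeline import find_best_pipeline
from Split_data import split_data

data = create_loandataset(2, 2000, {1500: "crisis"}, {1000: "crisis"}, 42)

# Split an initial batch into feature dictionaries and target values ("loan" target assumed).
x_train, y_train = [], []
for sample in data[:500]:
    x, y = split_data(sample, "loan")
    x_train.append(x)
    y_train.append(y)

pipeline, accuracy, data_drift_detectors, concept_drift_detector = find_best_pipeline(
    x_train, y_train, drift.ADWIN, drift.ADWIN
)
```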
- File: `Change_pipeline.py`
- Description: Trains and evaluates a new AutoML pipeline, selecting it if it performs better than the current one.

`change_pipeline(pipeline_old, x_train, y_train, data_drift_detectors_old, concept_drift_detector_old, data_drift_detector_method, concept_drift_detector_method, buffer_accuracy)`
- Parameters:
  - `pipeline_old` (object): The existing classifier pipeline.
  - `x_train` (list): Data with feature values for training.
  - `y_train` (list): Data with target values for training.
  - `data_drift_detectors_old` (object): The existing pipeline's data drift detectors.
  - `concept_drift_detector_old` (object): The existing pipeline's concept drift detector.
  - `data_drift_detector_method` (object): The method for detecting data drift.
  - `concept_drift_detector_method` (object): The method for detecting concept drift.
  - `buffer_accuracy` (object): The accuracy of the current model in the buffer.
- Returns:
  - `pipeline` (object): The selected pipeline (either new or old).
  - `accuracy` (object): The accuracy of the selected pipeline.
  - `data_drift_detectors` (object): The data drift detectors used in the selected pipeline.
  - `concept_drift_detector` (object): The concept drift detector used in the selected pipeline.
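A hedged sketch of replacing the current pipeline with `change_pipeline`, reusing the objects returned by `find_best_pipeline`; import paths, the `loan` target name, and River's `ADWIN` as the detector method are assumptions:

```python
from river import drift

from Change_pipeline import change_pipeline           # assumed import paths
from Create_loandataset import create_loandataset
from Find_best_pipeline import find_best_pipeline
from Split_data import split_data

# Build a small buffer of recent samples ("loan" target assumed).
data = create_loandataset(2, 2000, {1500: "crisis"}, {1000: "crisis"}, 42)
x_train = [split_data(sample, "loan")[0] for sample in data[-500:]]
y_train = [split_data(sample, "loan")[1] for sample in data[-500:]]

# Current pipeline and its detectors.
pipeline_old, old_accuracy, dd_old, cd_old = find_best_pipeline(
    x_train, y_train, drift.ADWIN, drift.ADWIN
)

# Train a fresh candidate on the buffer and keep whichever pipeline performs better.
pipeline, accuracy, data_drift_detectors, concept_drift_detector = change_pipeline(
    pipeline_old, x_train, y_train, dd_old, cd_old, drift.ADWIN, drift.ADWIN, old_accuracy
)
```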
- File: `Simple_pipeline_use.py`
- Description: Constructs a simple machine learning pipeline using a model, an optional preprocessor, and an optional feature selector. It then trains and evaluates the pipeline on the provided dataset.

`simple_pipeline(model, preprocessor, feature_selector, data, target)`
- Parameters:
  - `model` (object): The machine learning model to be used in the pipeline.
  - `preprocessor` (object or None): An optional preprocessing object. If None, no preprocessing is applied.
  - `feature_selector` (object or None): An optional feature selector object. If None, no feature selection is applied.
  - `data` (list): The dataset to be used for training and prediction. Each element should be a dictionary of features.
  - `target` (str): The name of the target variable in the dataset.
- Returns:
  - `y_real` (list): The actual target values from the dataset.
  - `y_pred` (list): The predicted target values from the pipeline.
  - `data_drifts` (list): A placeholder list, empty in this implementation.
  - `concept_drifts` (list): A placeholder list, empty in this implementation.
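A hedged sketch of a baseline run with `simple_pipeline`, using River's Hoeffding tree and no preprocessing or feature selection; the import paths and the `loan` target name are assumptions:

```python
from river import tree

from Create_loandataset import create_loandataset    # assumed import paths
from Simple_pipeline_use import simple_pipeline

data = create_loandataset(2, 5000, {2500: "crisis"}, {1500: "crisis"}, 42)

# Baseline: a single Hoeffding tree with no preprocessing and no feature selection.
y_real, y_pred, data_drifts, concept_drifts = simple_pipeline(
    tree.HoeffdingTreeClassifier(), None, None, data, "loan"
)
```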
- File: `Convert_arff_to_csv`
- Description: Converts a file from ARFF to CSV.

`convert_arff_to_csv('arff_name.arff', 'csv_name.csv')`
- Parameters:
  - `arff_file` (str): Path of the ARFF file, e.g. 'arff_name.arff'.
  - `csv_name` (str): Path for the new CSV file.
- Returns:
  - The saved CSV file.
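A hedged usage sketch (the file names are placeholders and the import path is an assumption):

```python
from Convert_arff_to_csv import convert_arff_to_csv   # assumed import path

# Convert an ARFF dataset to CSV (both file names are placeholders).
convert_arff_to_csv("electricity.arff", "electricity.csv")
```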
- File: `Create_loandataset.py`
- Description: Creates a loan dataset with specified drifts.

`create_loandataset(class_num, datalimit, conceptdriftpoints, datadriftpoints, seed)`
- Parameters:
  - `class_num` (2, 3, or 4): Number of classes in the output of the generator.
  - `datalimit` (int): Number of data samples in the dataset (e.g., 30000).
  - `conceptdriftpoints` (list[dict]): Points of concept drifts with function names (e.g., [4000: "crisis", 10000: "normal"]).
  - `datadriftpoints` (list[dict]): Points of data drifts with function names (e.g., [2000: "crisis", 8000: "normal"]).
  - `seed` (int): Seed for dataset reproducibility (e.g., 42).
- Returns:
  - `data` (list[dict]): List of dictionaries containing the created dataset.
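A hedged sketch of generating a drifting stream; the keyword-argument usage and the dictionary form of the drift points are assumptions based on the examples above:

```python
from Create_loandataset import create_loandataset   # assumed import path

# 30000-sample, 2-class stream: a "crisis" concept drift at 4000 that returns to "normal"
# at 10000, and a "crisis" data drift at 2000 that returns to "normal" at 8000.
data = create_loandataset(
    class_num=2,
    datalimit=30000,
    conceptdriftpoints={4000: "crisis", 10000: "normal"},
    datadriftpoints={2000: "crisis", 8000: "normal"},
    seed=42,
)
print(data[0])  # each sample is a dictionary of features plus the target
```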
- File: `Prepare_data.py`
- Description: Prepares the dataset for the pipeline.

`prepare_data(dataset)`
- Parameters:
  - `dataset` (str or River dataset): Path of a CSV file or a River dataset.
- Returns:
  - `data` (list[dict]): List of dictionaries with the dataset's data.
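A hedged sketch covering both input types (the import path is an assumption; `datasets.Phishing()` is one of River's real datasets):

```python
from river import datasets

from Prepare_data import prepare_data   # assumed import path

# From a CSV file on disk (placeholder path)...
data_from_csv = prepare_data("my_dataset.csv")

# ...or from one of River's real datasets.
data_from_river = prepare_data(datasets.Phishing())
```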
- File: `Evaluation.py`
- Description: Evaluates the pipelines created.

`evaluation(y_real, y_predicted, metric_algorithm)`
- Parameters:
  - `y_real` (list[list]): Real target values from each pipeline.
  - `y_predicted` (list[list]): Predicted target values from each pipeline.
  - `metric_algorithm` (object): Instance of the metric for evaluation.
- Returns:
  - `results` (list[list]): Evaluation results for each pipeline.
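A hedged sketch evaluating two pipelines with River's accuracy metric (import path assumed; the predictions are placeholders):

```python
from river import metrics

from Evaluation import evaluation   # assumed import path

# Placeholder outputs from two pipelines: one inner list per pipeline.
y_real = [[0, 1, 1, 0], [0, 1, 1, 0]]
y_predicted = [[0, 1, 0, 0], [0, 0, 1, 0]]

results = evaluation(y_real, y_predicted, metrics.Accuracy())
```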
- File: `Create_Plots.py`
- Description: Creates plots for the evaluation metrics of each pipeline.

`create_plots(evaluates, data_drifts, concept_drifts)`
- Parameters:
  - `evaluates` (list[list]): Evaluation results from the `evaluation` function.
  - `data_drifts` (list[list]): Data drift points from each pipeline.
  - `concept_drifts` (list[list]): Concept drift points from each pipeline.
- Returns:
  - A plot of the metric used in the evaluation for all pipelines.
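A hedged sketch (import paths assumed; the prediction and drift-point values are placeholders):

```python
from river import metrics

from Create_Plots import create_plots   # assumed import paths
from Evaluation import evaluation

# Placeholder predictions and drift points for a single pipeline.
y_real, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]
data_drifts, concept_drifts = [120, 480], [300]

evaluates = evaluation([y_real], [y_pred], metrics.Accuracy())
create_plots(evaluates, [data_drifts], [concept_drifts])
```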
- File: `Comparison_with_OAML_basic_plot.py`
- Description: Creates plots in the same figure to compare the metric results of several methods with OAML-basic.

`compare_with_oaml(results)`
- Parameters:
  - `results` (list[list]): Evaluation results from the `evaluation` function and OAML results with step 1000, starting at 6000.
- Returns:
  - A figure with the metric plot of every method.
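A hedged sketch (the import path and the exact structure of the saved OAML results are assumptions; all metric values are placeholders):

```python
from Comparison_with_OAML_basic_plot import compare_with_oaml   # assumed import path

# One inner list per method: results from the `evaluation` function plus OAML-basic
# results reported every 1000 samples starting at 6000 (all values are placeholders).
aml4s_results = [0.80, 0.82, 0.83, 0.85]
oaml_results = [0.78, 0.79, 0.81, 0.82]

compare_with_oaml([aml4s_results, oaml_results])
```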
- File: `Data_plot.py`
- Description: Creates plots for dataset features.

`data_plot(data, step)`
- Parameters:
  - `data` (list[dict]): List of dictionaries containing the data.
  - `step` (int): Step of the visualization of the dataset.
- Returns:
  - Plots of each feature in the dataset.
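A hedged sketch plotting every feature of a generated stream (import paths and the drift-point format are assumptions):

```python
from Create_loandataset import create_loandataset   # assumed import paths
from Data_plot import data_plot

data = create_loandataset(2, 5000, {2500: "crisis"}, {1500: "crisis"}, 42)

# Plot every feature, drawing one point every 50 samples.
data_plot(data, 50)
```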
- File: `Accuracy_check.py`
- Description: Compares accuracy against a mean accuracy to decide if a model retrain is needed.

`accuracy_check(mean_accuracy, y_true_buffer, y_predicted_buffer)`
- Parameters:
  - `mean_accuracy` (float): The mean accuracy to compare against.
  - `y_true_buffer` (list): True target values of the last samples.
  - `y_predicted_buffer` (list): Predicted target values of the last samples.
- Returns:
  - `need_change` (bool): Indicates whether the accuracy difference exceeds a threshold.
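A hedged sketch (import path assumed; the buffers are placeholders):

```python
from Accuracy_check import accuracy_check   # assumed import path

# Placeholder buffers of the most recent true and predicted labels.
y_true_buffer = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted_buffer = [1, 1, 0, 1, 0, 0, 0, 1]

need_change = accuracy_check(0.90, y_true_buffer, y_predicted_buffer)
if need_change:
    print("Accuracy dropped too far below the mean: retrain the pipeline.")
```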
- File: `Split_data.py`
- Description: Splits the data into features and target.

`split_data(dictionary, target_key)`
- Parameters:
  - `dictionary` (dict): Dictionary containing features and target value.
  - `target_key` (str): Name of the target variable.
- Returns:
  - `features` (dict): Features of the input sample.
  - `target`: Target value of the sample.
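A hedged sketch with a hypothetical sample whose target key is `loan`:

```python
from Split_data import split_data   # assumed import path

sample = {"salary": 2400, "age": 35, "loan": 1}   # hypothetical sample; target key "loan"

features, target = split_data(sample, "loan")
# features -> {"salary": 2400, "age": 35}, target -> 1
```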
The generator can produce datasets with data and concept drifts at specified points.
- File for 2-class output: `loandataset_2_class.py`
- File for 3-class output: `loandataset_3_class.py`
- File for 4-class output: `loandataset_4_class.py`
- Description: Loan dataset generator.

Concept drift functions (loan approval limits):
- crisis: Tighter limits.
- normal: Normal limits.
- growth: Looser limits.

Data drift functions (salary distribution):
- crisis: Smaller salaries.
- normal: Normal salaries.
- growth: Bigger salaries.

To create a loan dataset, use the `create_loandataset` function.
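A hedged sketch that uses all three drift functions (the dictionary form of the drift points is an assumption, as in the examples above):

```python
from Create_loandataset import create_loandataset   # assumed import path

# A 3-class stream that starts with normal limits and salaries, enters a "crisis"
# regime, and then moves to "growth".
data = create_loandataset(3, 30000, {8000: "crisis", 20000: "growth"},
                          {5000: "crisis", 15000: "growth"}, 42)
```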