diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 26ba2c2..5d84c5c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -14,7 +14,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember ## Pull Request process -1. [Fork](https://github.com/riibeirogabriel/learningOrchestra/fork) the repository +1. [Fork](https://github.com/learningOrchestra/docs/fork) the repository 2. [Clone](https://git-scm.com/docs/git-clone) your fork to your local environment 3. Navigate into the project root directory 4. Create a new [branch](https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging), the branch should be named according to what feature/fix you're implementing @@ -26,7 +26,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember 7. Create a Pull Request Remember to describe what feature or fix you're implementing in the Pull Request window. -In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement. +In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement. After the Pull Request the repository owner will review your request.\ Be patient, if they require you to make changes to your request, do so. @@ -37,4 +37,4 @@ Don't be rude, use crude language or harass other users. ## License The repository is currently licenses under GNU General Public License v3.0. -By contributing to the project you agree that your contributions will be licensed under the same license and provisions. \ No newline at end of file +By contributing to the project you agree that your contributions will be licensed under the same license and provisions. diff --git a/README.md b/README.md index 6314f4b..6f0a72f 100644 --- a/README.md +++ b/README.md @@ -2,16 +2,14 @@ # learningOrchestra Docs -To make changes clone the repo: - -`git clone https://github.com/learningOrchestra/learningOrchestra-docs.git` - -`cd learningOrchestra-docs` - -Install mkdocs: - +This repository contains the files to generate the user documentation of [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra). The content of the documentation is created manually and can be found in [the docs folder](https://github.com/learningOrchestra/docs/tree/main/docs). The documentation website is created with [MkDocs](https://www.mkdocs.org/). + +To make changes please read the [contributing guide](https://github.com/learningOrchestra/docs/blob/main/CONTRIBUTING.md) then: +1. Clone the repo +`git clone https://github.com/learningOrchestra/docs.git` +2. Move inside the repo +`cd docs` +3. Install mkdocs `pip install mkdocs` - -Run: - +4. Run MkDocs `mkdocs serve` diff --git a/docs/about.md b/docs/about.md new file mode 100644 index 0000000..0429868 --- /dev/null +++ b/docs/about.md @@ -0,0 +1,13 @@ +# About learningOrchestra + +Nowadays, **data science relies on a wide range of computer science skills**, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia. + +This situation can constitute a barrier to the actual extraction of new knowledge from collected data, +which is why the last two decades have seen more efforts to facilitate and streamline the development of +data mining workflows. 
The tools created can be sorted into two categories: **high-level** tools facilitate +the building of **automatic data processing pipelines** (e.g. [Weka](https://www.cs.waikato.ac.nz/ml/weka/)) +while **low-level** ones support the setup of appropriate physical and virtual infrastructure (e.g. [Spark](https://spark.apache.org/)). + +However, this landscape is still missing a tool that **encompasses all steps and needs of a typical data science project**. This is where learningOrchestra comes in. + +Read our [first research monograph](https://drive.google.com/file/d/1ZDrTR58pBuobpgwB_AOOFTlfmZEY6uQS/view) (under construction) to know more about the research behind the project. diff --git a/docs/database-python.md b/docs/database-python.md new file mode 100644 index 0000000..47147f9 --- /dev/null +++ b/docs/database-python.md @@ -0,0 +1,43 @@ +## Database API + +### read_resume_files + +```python +read_resume_files(pretty_response=True) +``` +* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) +(default `True`, if `False`, return dict) + +### read_file + +```python +read_file(filename, skip=0, limit=10, query={}, pretty_response=True) +``` + +* `filename` : name of file +* `skip`: number of rows to skip in pagination(default: `0`) +* `limit`: number of rows to return in pagination(default: `10`) +(maximum is set at `20` rows per request) +* `query`: query to make in MongoDB(default: `empty query`) +* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) + +### create_file + +```python +create_file(filename, url, pretty_response=True) +``` + +* `filename`: name of file to be created +* `url`: url to CSV file +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### delete_file + +```python +delete_file(filename, pretty_response=True) +``` + +* `filename`: name of the file to be deleted +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/database-api.md b/docs/database-rest.md similarity index 95% rename from docs/database-api.md rename to docs/database-rest.md index 4c3be5d..b6890e8 100644 --- a/docs/database-api.md +++ b/docs/database-rest.md @@ -1,12 +1,12 @@ -# Database API +# Database API -The **Database API** microservice creates a level of abstraction through a REST API. +The **Database API** microservice creates a level of abstraction through a REST API. Using MongoDB, datasets are downloaded in CSV format and parsed into JSON format where the primary key for each document is the filename field contained in the JSON file POST request. ## GUI tool to handle database files -There are GUI tools to handle database files, like [NoSQLBooster](https://nosqlbooster.com) can interact with mongoDB used in database, and makes several tasks which are limited in `learning-orchestra-client` package, as schema visualization and files extraction and download to formats as CSV and JSON. +There are GUI tools to handle database files, like [NoSQLBooster](https://nosqlbooster.com) can interact with mongoDB used in database, and makes several tasks which are limited in `learning-orchestra-client` package, as schema visualization and files extraction and download to formats as CSV and JSON. 
You also can navigate in all inserted files in easy way and visualize each row from determined file, to use this tool connect with the url `cluster\_ip:27017` and use the credentials: @@ -91,7 +91,7 @@ Returns an array of metadata files from the database, where each file contains a * `F1` - F1 Score from model accuracy * `accuracy` - Accuracy from model prediction * `classificator` - Initials from used classificator -* `filename` - Name of the file +* `filename` - Name of the file * `fit_time` - Time taken for the model to be fit during training ## List file content @@ -112,7 +112,7 @@ The first row in the query is always the metadata file. `POST CLUSTER_IP:5000/files` Insert a CSV into the database using the POST method, JSON must be contained in the body of the HTTP request. -The following fields are required: +The following fields are required: ```json { diff --git a/docs/datatype-python.md b/docs/datatype-python.md new file mode 100644 index 0000000..182ffb0 --- /dev/null +++ b/docs/datatype-python.md @@ -0,0 +1,12 @@ +## Data type handler API + +### change_file_type + +```python +change_file_type(filename, fields_dict, pretty_response=True) +``` + +* `filename`: name of file +* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/datatype-api.md b/docs/datatype-rest.md similarity index 100% rename from docs/datatype-api.md rename to docs/datatype-rest.md diff --git a/docs/histogram-python.md b/docs/histogram-python.md new file mode 100644 index 0000000..1dc6e9d --- /dev/null +++ b/docs/histogram-python.md @@ -0,0 +1,15 @@ + +## Histogram API + +### create_histogram + +```python +create_histogram(filename, histogram_filename, fields, + pretty_response=True) +``` + +* `filename`: name of file to make histogram +* `histogram_filename`: name of file used to create histogram +* `fields`: list with fields to make histogram +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/histogram-api.md b/docs/histogram-rest.md similarity index 100% rename from docs/histogram-api.md rename to docs/histogram-rest.md diff --git a/docs/index.md b/docs/index.md index a2ac8b8..825cef7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,18 +1,23 @@ -# learningOrchestra Docs +# learningOrchestra user documentation -**learningOrchestra** is a distributed processing tool that facilitates and streamlines iterative processes in a Data Science project pipeline like: +learningOrchestra aims to facilitate the development of complex data mining workflows by **seamlessly interfacing different data science tools and services**. From a single interoperable Application Programming Interface (API), users can **design their analytical pipelines and deploy them in an environment with the appropriate capabilities**. -* Data Gathering -* Data Cleaning -* Model Building -* Validating the Model -* Presenting the Results +learningOrchestra is designed for data scientists from both engineering and academia backgrounds, so that they can **focus on the discovery of new knowledge** in their data rather than library or maintenance issues. -With learningOrchestra, you can: +learningOrchestra is organised into interoperable microservices. 
They offer access to third-party libraries, frameworks and software to **gather data**, **clean data**, **train machine learning models**, **tune machine learning models**, **evaluate machine learning models** and **visualize data and results**.
-* load a dataset from an URL (in CSV format).
-* accomplish several pre-processing tasks with datasets.
-* create highly customised model predictions against a specific dataset by providing their own pre-processing code.
-* build prediction models with different classifiers simultaneously using a spark cluster transparently.
+The current version of learningOrchestra offers 7 microservices:
+- The **Database API is the central microservice**. It holds all the data, including the analysis results.
+- The **Data type API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **Projection, Histogram, t-SNE and PCA APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **Model builder API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
-And so much more! Check the [Usage](https://learningorchestra.github.io/docs/usage/) section for more.
+The microservices can be called from any computer, including one that is not part of the cluster learningOrchestra is deployed on. learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**.
+
+Use this documentation to [learn more about the learningOrchestra project](about.md), [learn how to install and deploy learningOrchestra on a cluster](install.md), learn how to use the [REST APIs](rest-apis.md) and [Python package](python-package.md) to access learningOrchestra microservices, or [find options to get support](support.md).
+
+You can also visit the repositories of the learningOrchestra project:
+- [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra) for the definition of the microservices and the REST APIs,
+- [learningOrchestra-python-client](https://github.com/learningOrchestra/learningOrchestra-python-client) for the Python package,
+- [docs](https://github.com/learningOrchestra/docs) for the content of the present documentation, and
+- [learningOrchestra.github.io](https://github.com/learningOrchestra/learningOrchestra.github.io) for the code of the present website.
diff --git a/docs/install.md b/docs/install.md
new file mode 100644
index 0000000..96a9559
--- /dev/null
+++ b/docs/install.md
@@ -0,0 +1,70 @@
+# Install and deploy learningOrchestra on a cluster
+
+:bell: This documentation assumes that the users are familiar with a number of advanced computer science concepts. We have tried to link to learning resources to support beginners, as well as introduce some of the concepts in the [last section](#concepts). But if something is still not clear, don't hesitate to [ask for help](support.md).
+
+## Setting up your cluster
+
+learningOrchestra operates from a [cluster](#what-is-a-cluster) of Docker [containers](#what-is-a-container).
+
+All your hosts must operate under Linux distributions and have [Docker Engine](https://docs.docker.com/engine/install/) installed.
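+
+If you want to sanity-check a host before going further, a small script like the sketch below can help. It is not part of learningOrchestra and only assumes that the `docker` CLI is available on the host; `docker info` reports the swarm state as "active" once the host has joined a swarm.
+
+```python
+# Hypothetical helper, not part of learningOrchestra: checks that Docker Engine
+# is installed and whether this host already belongs to a swarm.
+import shutil
+import subprocess
+
+def check_docker_host():
+    if shutil.which("docker") is None:
+        print("Docker Engine is not installed on this host")
+        return
+    version = subprocess.run(["docker", "--version"],
+                             capture_output=True, text=True)
+    print(version.stdout.strip())
+    swarm_state = subprocess.run(
+        ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
+        capture_output=True, text=True)
+    print("Swarm state:", swarm_state.stdout.strip())  # "active" inside a swarm
+
+check_docker_host()
+```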
+
+Configure your cluster in [swarm mode](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/). Install [Docker Compose](https://docs.docker.com/compose/install/) on your manager instance.
+
+You are ready to deploy! :tada:
+
+## Deploy learningOrchestra
+
+Clone the main learningOrchestra repository on your manager instance.
+- Using HTTPS protocol, `git clone https://github.com/learningOrchestra/learningOrchestra.git`
+- Using SSH protocol, `git clone git@github.com:learningOrchestra/learningOrchestra.git`
+- Using GitHub CLI, `gh repo clone learningOrchestra/learningOrchestra`
+
+Move to the root of the repository, `cd learningOrchestra`.
+
+Deploy with `sudo ./run.sh`. The deployment should take around a dozen minutes.
+
+### Interrupt learningOrchestra
+
+Run `docker stack rm microservice`.
+
+### Check cluster status
+
+To check the deployed microservices and machines of your cluster, open `CLUSTER_IP:80` in your browser, where *CLUSTER_IP* is replaced by the external IP of a machine in your cluster.
+
+The same can be done to check the Spark cluster state with `CLUSTER_IP:8080`.
+
+## Install-and-deploy questions
+
+###### My computer runs on Windows/OSX, can I still use learningOrchestra?
+
+You can use the microservices that run on a cluster where learningOrchestra is deployed, but **not deploy learningOrchestra**.
+
+###### I have a single computer, can I still use learningOrchestra?
+
+Theoretically, you can, if your machine has 12 GB of RAM, a quad-core processor and 100 GB of disk. However, your single machine won't be able to cope with the computing demands of a real-life-sized dataset.
+
+###### What happens if learningOrchestra is killed while using a microservice?
+
+If your cluster fails while a microservice is processing data, the task may be lost. Some failures might corrupt the database systems.
+
+If no processing was in progress when your cluster failed, learningOrchestra will automatically re-deploy and reboot the affected microservices.
+
+###### What happens if my instances lose the connection to each other?
+
+If the connection between cluster instances is shut down, learningOrchestra will try to re-deploy the microservices from the lost instances on the remaining active instances of the cluster.
+
+## Concepts
+
+###### What is a container?
+
+Containers are software packages that bundle code together with everything needed to run it, so that the code can be run easily in any environment. They also isolate the code from the rest of the machine. They are [often compared to shipping containers](https://www.ctl.io/developers/blog/post/docker-and-shipping-containers-a-useful-but-imperfect-analogy).
+
+###### What is a cluster?
+
+A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. (From [Wikipedia](https://en.wikipedia.org/wiki/Computer_cluster))
+
+###### What are microservices?
+
+Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of services that are: highly maintainable and testable, loosely coupled, independently deployable, organized around business capabilities, and owned by small teams.
+ +[An overview of microservice architecture](https://medium.com/hashmapinc/the-what-why-and-how-of-a-microservices-architecture-4179579423a9) diff --git a/docs/installation.md b/docs/installation.md deleted file mode 100644 index 635f851..0000000 --- a/docs/installation.md +++ /dev/null @@ -1,38 +0,0 @@ -# Installation - -## Requirements - -* Linux hosts -* [Docker Engine](https://docs.docker.com/engine/install/) must be installed in all instances of your cluster -* Cluster configured in swarm mode, check [creating a swarm](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/) -* [Docker Compose](https://docs.docker.com/compose/install/) must be installed in the manager instance of your cluster - -*Ensure that your cluster environment does not block any traffic such as firewall rules in your network or in your hosts.* - -*If in case, you have firewalls or other traffic-blockers, add learningOrchestra as an exception.* - -Ex: In Google Cloud Platform each of the VMs must allow both http and https traffic. - -## Deployment - -In the manager Docker swarm machine, clone the repo using: - -``` -git clone https://github.com/riibeirogabriel/learningOrchestra.git -``` - -Navigate into the `learningOrchestra` directory and run: - -``` -cd learningOrchestra -sudo ./run.sh -``` - -That's it! learningOrchestra has been deployed in your swarm cluster! - -## Cluster State - -`CLUSTER_IP:80` - To visualize cluster state (deployed microservices and cluster's machines). -`CLUSTER_IP:8080` - To visualize spark cluster state. - -*\** `CLUSTER_IP` *is the external IP of a machine in your cluster.* \ No newline at end of file diff --git a/docs/microservices.md b/docs/microservices.md new file mode 100644 index 0000000..47cef4e --- /dev/null +++ b/docs/microservices.md @@ -0,0 +1,249 @@ +# Description of the microservices + +learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to **gather data**, **clean data**, **train machine learning models**, **tune machine learning models**, **evaluate machine learning models** and **visualize data and results**. + +The current version of learningOrchestra offers 7 microservices: +- The **Database API is the central microservice**. It holds all the data, including the analysis results. +- The **Data type API is a preprocessing microservice** dedicated to changing the type of data fields. +- The **Projection, Histogram, t-SNE and PCA APIs are data exploration microservices**. They transform the map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline. +- The **Model builder API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models. + +The microservices can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on. learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**. 
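+
+As a first taste of those two access routes, the sketch below loads a CSV file into the Database microservice, first with a plain HTTP call and then with the Python package. The JSON body fields (`filename`, `url`) are assumed to mirror the `create_file` signature documented in the Python package pages, the dataset URL is a placeholder, and `CLUSTER_IP` stands for the external IP of one of your cluster instances.
+
+```python
+# Sketch only: field names, dataset URL and the exact Python entry point are
+# assumptions; see the Database API pages for the authoritative description.
+import requests
+
+CLUSTER_IP = "xx.xx.xxx.xxx"
+dataset = {
+    "filename": "titanic_training",
+    "url": "https://some.host/titanic/train.csv",  # placeholder dataset URL
+}
+response = requests.post(f"http://{CLUSTER_IP}:5000/files", json=dataset)
+print(response.status_code, response.text)
+
+# The same operation through the learning-orchestra-client package:
+from learning_orchestra_client import *
+Context(CLUSTER_IP)
+print(create_file(dataset["filename"], dataset["url"]))
+```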
+
+
+
+- [Available microservices](#available-microservices)
+  - [Database microservice](#database-microservice)
+    - [Combine the Database microservice with a GUI](#combine-the-database-microservice-with-a-gui)
+  - [Data type microservice](#data-type-microservice)
+  - [Projection microservice](#projection-microservice)
+  - [Histogram microservice](#histogram-microservice)
+  - [t-SNE microservice](#t-sne-microservice)
+  - [PCA microservice](#pca-microservice)
+  - [Model builder microservice](#model-builder-microservice)
+- [Additional information](#additional-information)
+  - [Spark Microservices](#spark-microservices)
+
+
+
+## Available microservices
+
+### Database microservice
+
+The Database microservice is an abstraction layer of a [MongoDB](https://www.mongodb.com/) database. MongoDB uses [NoSQL, aka non-relational, databases](https://en.wikipedia.org/wiki/NoSQL), so the data is stored as [JSON](https://www.json.org/json-en.html)-like documents.
+
+The Database microservice is organised so each database document corresponds to a CSV file. The key of a file is its filename. The file metadata is saved as its first row.
+
+The microservice provides entry points to add a CSV file to the database, delete a CSV file from the database, retrieve the content of a CSV file in the database and list all files in the database.
+
+The Database microservice serves as a central pivot for the other microservices. They all use the Database microservice as their data source. All but the t-SNE and the PCA microservices send their results to the Database microservice to save.
+
+For additional details, see the [REST API](database-rest.md) and [Python package](database-python.md) documentations.
+
+#### Combine the Database microservice with a GUI
+
+GUI database managers like [NoSQLBooster](https://nosqlbooster.com) can interact directly with MongoDB. Using one will let you perform additional tasks which are not implemented in the Database microservice, such as schema visualization, file extraction or direct CSV or JSON download.
+
+Using a GUI is fully compatible with using the learningOrchestra Database microservice.
+
+You can connect a MongoDB-compatible GUI to your learningOrchestra database with the url `cluster_ip:27017`, where `cluster_ip` is the IP address of an instance of your cluster. You will need to provide the following credentials:
+```
+username = root
+password = owl45#21
+```
+
+### Data type microservice
+
+The Data type microservice revolves around casting the data for a given field (= column for data organised as a table) to a new type. The microservice can cast fields into *strings* or into number types (*float* by default, *int* if appropriate).
+
+
+For additional details, see the [REST API](datatype-rest.md) and [Python package](datatype-python.md) documentations.
+
+### Projection microservice
+
+The Projection microservice is a data manipulation microservice. It provides an entry point to simplify a dataset by selecting only certain fields (= column for data organised as a table).
+
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
+For additional details, see the [REST API](projection-rest.md) and [Python package](projection-python.md) documentations.
+
+### Histogram microservice
+
+The Histogram microservice transforms the data of a given source into an aggregate with observation counts for each value bin.
The aggregated data is saved into the database managed by the Database microservice and can then be used to generate a histogram representation of the source data.
+
+For additional details, see the [REST API](histogram-rest.md) and [Python package](histogram-python.md) documentations.
+
+### t-SNE microservice
+
+The t-SNE microservice transforms the data of a given source using the [T-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) algorithm, generates the t-SNE graphical representations, and manages the generated images.
+
+t-SNE is a machine learning algorithm for visualization of high-dimensional data. It relies on a non-linear dimensionality reduction technique to project high-dimensional data into a low-dimensional space (two or three dimensions). It models each high-dimensional object by a point in a low-dimensional space in such a way that similar objects are represented by nearby points with high probability, and conversely dissimilar objects are represented by distant points with high probability.
+
+The t-SNE microservice provides entry points to create and store a t-SNE graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on a dedicated storage in the Spark cluster rather than the Database microservice.
+
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
+For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentations.
+
+### PCA microservice
+
+The PCA microservice decomposes the data of a given source into a set of orthogonal components that explain a maximum amount of the variance, plots the data in the space defined by those components, and manages the generated images.
+
+The implementation of this microservice relies on the [scikit-learn library](https://scikit-learn.org/stable/modules/decomposition.html#pca).
+
+The PCA microservice provides entry points to create and store a PCA graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on a dedicated storage in the Spark cluster rather than the Database microservice.
+
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
+For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentations.
+
+### Model builder microservice
+
+The Model builder microservice is an all-in-one entry point to train, evaluate and apply classification models. It loads datasets from the Database microservice, preprocesses their content using a user-specified Python script, trains each of the specified classifiers on the training dataset, evaluates the accuracy of the trained model on an evaluation dataset, predicts the labels of the unlabelled testing dataset and saves the accuracy results and predicted labels.
+
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
+For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](modelbuilder-python.md) documentations.
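+
+As an illustration, a Model builder request sent through the Python package could look like the sketch below. The `create_model` signature and the classifier codes come from the Python package pages; the dataset names and the minimal preprocessor body are placeholders only.
+
+```python
+# Sketch of a Model builder call via learning-orchestra-client.
+# Dataset names and the preprocessor body are placeholders.
+from learning_orchestra_client import *
+
+Context("xx.xx.xxx.xxx")  # external IP of one cluster instance
+
+preprocessor_code = """
+training_df = training_df.withColumnRenamed('Survived', 'label')
+# ... build features_training, features_evaluation and features_testing here ...
+"""
+
+result = create_model(
+    training_filename="titanic_training",
+    test_filename="titanic_testing",
+    preprocessor_code=preprocessor_code,
+    model_classificator=["lr", "nb"],  # logistic regression and naive Bayes
+)
+print(result)
+```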
+
+#### Available classifiers
+
+The following classifiers are currently available through the Model builder microservice, in their Pyspark implementation:
+* Logistic regression
+* Decision tree classifier
+* Random forest classifier
+* Gradient-boosted tree classifier
+* Naive Bayes
+
+#### Preprocessing script
+
+The preprocessing script must be written by the user in Python 3 and use the Pyspark library.
+
+:exclamation: The variable names currently used are not the names typically used in machine learning libraries. Please take care to read their descriptions to understand their actual role.
+
+The following environment instances are made available to the script:
+- `training_df`: a Spark Dataframe instance holding the training-and-evaluation dataset loaded from the Database microservice,
+- `testing_df`: a Spark Dataframe instance holding the unlabelled dataset loaded from the Database microservice.
+
+The preprocessing script must rename the label column as "label" in the training-and-evaluation dataset and create a zero-value "label" column in the unlabelled dataset.
+
+The preprocessing script must instantiate the following variables using Pyspark VectorAssembler:
+- `features_training`: Spark Dataframe instance with the preprocessed training dataset, **including** the "label" column,
+- `features_evaluation`: Spark Dataframe instance with the preprocessed testing dataset to measure classification accuracy, **including** the "label" column,
+- `features_testing`: Spark Dataframe instance with the unlabelled dataset on which to apply the model, **including** the zero-value "label" column.
+
+In case you don't want to evaluate the model, `features_evaluation` can be set to `None`.
+
+##### Example of preprocessing script
+
+This example uses the [titanic challenge datasets](https://www.kaggle.com/c/titanic/overview).
+ +```python +from pyspark.ml import Pipeline +from pyspark.sql.functions import ( + mean, col, split, + regexp_extract, when, lit) + +from pyspark.ml.feature import ( + VectorAssembler, + StringIndexer +) + +TRAINING_DF_INDEX = 0 +TESTING_DF_INDEX = 1 + +training_df = training_df.withColumnRenamed('Survived', 'label') +testing_df = testing_df.withColumn('label', lit(0)) +datasets_list = [training_df, testing_df] + +for index, dataset in enumerate(datasets_list): + dataset = dataset.withColumn( + "Initial", + regexp_extract(col("Name"), "([A-Za-z]+)\.", 1)) + datasets_list[index] = dataset + +misspelled_initials = [ + 'Mlle', 'Mme', 'Ms', 'Dr', + 'Major', 'Lady', 'Countess', + 'Jonkheer', 'Col', 'Rev', + 'Capt', 'Sir', 'Don' +] +correct_initials = [ + 'Miss', 'Miss', 'Miss', 'Mr', + 'Mr', 'Mrs', 'Mrs', + 'Other', 'Other', 'Other', + 'Mr', 'Mr', 'Mr' +] +for index, dataset in enumerate(datasets_list): + dataset = dataset.replace(misspelled_initials, correct_initials) + datasets_list[index] = dataset + + +initials_age = {"Miss": 22, + "Other": 46, + "Master": 5, + "Mr": 33, + "Mrs": 36} +for index, dataset in enumerate(datasets_list): + for initial, initial_age in initials_age.items(): + dataset = dataset.withColumn( + "Age", + when((dataset["Initial"] == initial) & + (dataset["Age"].isNull()), initial_age).otherwise( + dataset["Age"])) + datasets_list[index] = dataset + + +for index, dataset in enumerate(datasets_list): + dataset = dataset.na.fill({"Embarked": 'S'}) + datasets_list[index] = dataset + + +for index, dataset in enumerate(datasets_list): + dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch')) + dataset = dataset.withColumn('Alone', lit(0)) + dataset = dataset.withColumn( + "Alone", + when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"])) + datasets_list[index] = dataset + + +text_fields = ["Sex", "Embarked", "Initial"] +for column in text_fields: + for index, dataset in enumerate(datasets_list): + dataset = StringIndexer( + inputCol=column, outputCol=column+"_index").\ + fit(dataset).\ + transform(dataset) + datasets_list[index] = dataset + + +non_required_columns = ["Name", "Embarked", "Sex", "Initial"] +for index, dataset in enumerate(datasets_list): + dataset = dataset.drop(*non_required_columns) + datasets_list[index] = dataset + + +training_df = datasets_list[TRAINING_DF_INDEX] +testing_df = datasets_list[TESTING_DF_INDEX] + +assembler = VectorAssembler( + inputCols=training_df.columns[:], + outputCol="features") +assembler.setHandleInvalid('skip') + +features_training = assembler.transform(training_df) +(features_training, features_evaluation) =\ + features_training.randomSplit([0.8, 0.2], seed=33) +features_testing = assembler.transform(testing_df) +``` + +## Additional information +### Spark Microservices + +The Projection, t-SNE, PCA and Model builder microservices uses the Spark microservice to work. + +By default, this microservice has only one instance. In case youyou require more computing power, you can scale this microservice by running the following command in the manager machine of the swarm cluster on which learningOrchestra is deployed: + +`docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES` + +where `NUMBER_OF_INSTANCES` is the number of Spark microservice instances which you require, chosen according to your cluster resources and your computing power needs. 
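+
+Putting the pieces together, a typical data preparation pass chains several of the microservices described above. The sketch below uses the function signatures listed in the Python package pages; the file names, field names and dataset URL are placeholders, and the exact Python entry points should be checked against those pages.
+
+```python
+# Placeholder pipeline chaining several microservices via learning-orchestra-client.
+from learning_orchestra_client import *
+
+Context("xx.xx.xxx.xxx")  # external IP of one cluster instance
+
+# 1. Gather: load a CSV into the Database microservice
+create_file("titanic_training", "https://some.host/titanic/train.csv")
+
+# 2. Clean: cast fields that were read as strings into numbers
+change_file_type("titanic_training", {"Age": "number", "Fare": "number"})
+
+# 3. Explore: keep a few fields, then build a histogram on one of them
+create_projection("titanic_training", "titanic_projection",
+                  ["Age", "Fare", "Survived"])
+create_histogram("titanic_projection", "titanic_age_histogram", ["Age"])
+```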
diff --git a/docs/modelbuilder-python.md b/docs/modelbuilder-python.md new file mode 100644 index 0000000..024437b --- /dev/null +++ b/docs/modelbuilder-python.md @@ -0,0 +1,56 @@ + +## Model builder API + +### create_model + +```python +create_model(training_filename, test_filename, preprocessor_code, + model_classificator, pretty_response=True) +``` + +* `training_filename`: name of file to be used in training +* `test_filename`: name of file to be used in test +* `preprocessor_code`: Python3 code for pyspark preprocessing model +* `model_classificator`: list of initial classificators to be used in model +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +#### model_classificator + +* `lr`: LogisticRegression +* `dt`: DecisionTreeClassifier +* `rf`: RandomForestClassifier +* `gb`: Gradient-boosted tree classifier +* `nb`: NaiveBayes + +to send a request with LogisticRegression and NaiveBayes Classifiers: + +```python +create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"]) +``` + +#### preprocessor_code environment + +The Python 3 preprocessing code must use the environment instances as below: + +* `training_df` (Instantiated): Spark Dataframe instance training filename +* `testing_df` (Instantiated): Spark Dataframe instance testing filename + +The preprocessing code must instantiate the variables as below, all instances must be transformed by pyspark VectorAssembler: + +* `features_training` (Not Instantiated): Spark Dataframe instance for training the model +* `features_evaluation` (Not Instantiated): Spark Dataframe instance for evaluating trained model +* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model + +In case you don't want to evaluate the model, set `features_evaluation` as `None`. + +##### Handy methods + +```python +self.fields_from_dataframe(dataframe, is_string) +``` + +This method returns `string` or `number` fields as a `string` list from a DataFrame. + +* `dataframe`: DataFrame instance +* `is_string`: Boolean parameter(if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields) diff --git a/docs/modelbuilder-api.md b/docs/modelbuilder-rest.md similarity index 99% rename from docs/modelbuilder-api.md rename to docs/modelbuilder-rest.md index 3898e40..e4a97c1 100644 --- a/docs/modelbuilder-api.md +++ b/docs/modelbuilder-rest.md @@ -1,6 +1,6 @@ # Model Builder API -Model Builder microservice provides a REST API to create several model predictions using your own preprocessing code using a defined set of classifiers. +Model Builder microservice provides a REST API to create several model predictions using your own preprocessing code using a defined set of classifiers. 
## Create prediction model diff --git a/docs/pca-python.md b/docs/pca-python.md new file mode 100644 index 0000000..be0fc63 --- /dev/null +++ b/docs/pca-python.md @@ -0,0 +1,44 @@ +## PCA API + +### create_image_plot + +```python +create_image_plot(tsne_filename, parent_filename, + label_name=None, pretty_response=True) +``` + +* `parent_filename`: name of file to make histogram +* `pca_filename`: filename used to create image plot +* `label_name`: label name to dataset with labeled tuples (default: `None`, to +datasets without labeled tuples) +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot_filenames + +```python +read_image_plot_filenames(pretty_response=True) +``` + +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot + +```python +read_image_plot(pca_filename, pretty_response=True) +``` + +* `pca_filename`: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### delete_image_plot + +```python +delete_image_plot(pca_filename, pretty_response=True) +``` + +* `pca_filename`: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/pca-api.md b/docs/pca-rest.md similarity index 97% rename from docs/pca-api.md rename to docs/pca-rest.md index 7608ba3..bd2a26c 100644 --- a/docs/pca-api.md +++ b/docs/pca-rest.md @@ -1,10 +1,10 @@ # PCA API -PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. +PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In `scikit-learn` (used in this microservice), PCA is implemented as a transformer object that learns components in its fit method, and can be used on new data to project it on these components. -PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter `whiten = True` makes it possible to project the data onto the singular space while scaling each component to unit variance. +PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter `whiten = True` makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm, more information about this algorithm in [scikit-learn PCA docs](https://scikit-learn.org/stable/modules/decomposition.html#pca). @@ -41,7 +41,7 @@ Deletes an image plot from the database. `GET CLUSTER_IP:5006/images` Returns a list with all created image plot filenames. 
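+
+For example, a minimal Python call to this route could look like the following sketch. It assumes the PCA microservice is reachable on port 5006 of your cluster, as in the routes above, and that the route answers with a JSON array of filenames.
+
+```python
+# Sketch: list the PCA image plots currently stored by the microservice.
+import requests
+
+CLUSTER_IP = "xx.xx.xxx.xxx"  # external IP of one cluster instance
+response = requests.get(f"http://{CLUSTER_IP}:5006/images")
+print(response.json())  # assumed to be a JSON list of plot filenames
+```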
- + ## Read an image plot `GET CLUSTER_IP:5006/images/` diff --git a/docs/projection-python.md b/docs/projection-python.md new file mode 100644 index 0000000..c7ed36f --- /dev/null +++ b/docs/projection-python.md @@ -0,0 +1,13 @@ +## Projection API + +### create_projection + +```python +create_projection(filename, projection_filename, fields, pretty_response=True) +``` + +* `filename`: name of the file to make projection +* `projection_filename`: name of file used to create projection +* `fields`: list with fields to make projection +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/projection-api.md b/docs/projection-rest.md similarity index 100% rename from docs/projection-api.md rename to docs/projection-rest.md diff --git a/docs/python-apis.md b/docs/python-apis.md deleted file mode 100644 index 2500d21..0000000 --- a/docs/python-apis.md +++ /dev/null @@ -1,233 +0,0 @@ -# python-client APIs - -## Database API - -### read_resume_files - -```python -read_resume_files(pretty_response=True) -``` -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) -(default `True`, if `False`, return dict) - -### read_file - -```python -read_file(filename, skip=0, limit=10, query={}, pretty_response=True) -``` - -* `filename` : name of file -* `skip`: number of rows to skip in pagination(default: `0`) -* `limit`: number of rows to return in pagination(default: `10`) -(maximum is set at `20` rows per request) -* `query`: query to make in MongoDB(default: `empty query`) -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) - -### create_file - -```python -create_file(filename, url, pretty_response=True) -``` - -* `filename`: name of file to be created -* `url`: url to CSV file -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_file - -```python -delete_file(filename, pretty_response=True) -``` - -* `filename`: name of the file to be deleted -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Projection API - -### create_projection - -```python -create_projection(filename, projection_filename, fields, pretty_response=True) -``` - -* `filename`: name of the file to make projection -* `projection_filename`: name of file used to create projection -* `fields`: list with fields to make projection -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Data type handler API - -### change_file_type - -```python -change_file_type(filename, fields_dict, pretty_response=True) -``` - -* `filename`: name of file -* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Histogram API - -### create_histogram - -```python -create_histogram(filename, histogram_filename, fields, - pretty_response=True) -``` - -* `filename`: name of file to make histogram -* `histogram_filename`: name of file used to create histogram -* `fields`: list with fields to make histogram -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## t-SNE API - -### create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, 
pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `tsne_filename`: name of file used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to -datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(tsne_filename, pretty_response=True) -``` - -* tsne_filename: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(tsne_filename, pretty_response=True) -``` - -* `tsne_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## PCA API - -### create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `pca_filename`: filename used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to -datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Model builder API - -### create_model - -```python -create_model(training_filename, test_filename, preprocessor_code, - model_classificator, pretty_response=True) -``` - -* `training_filename`: name of file to be used in training -* `test_filename`: name of file to be used in test -* `preprocessor_code`: Python3 code for pyspark preprocessing model -* `model_classificator`: list of initial classificators to be used in model -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -#### model_classificator - -* `lr`: LogisticRegression -* `dt`: DecisionTreeClassifier -* `rf`: RandomForestClassifier -* `gb`: Gradient-boosted tree classifier -* `nb`: NaiveBayes - -to send a request with LogisticRegression and NaiveBayes Classifiers: - -```python -create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"]) -``` - -#### preprocessor_code environment - -The Python 3 preprocessing code must use the environment instances as below: - -* `training_df` (Instantiated): Spark Dataframe instance training filename -* `testing_df` (Instantiated): Spark Dataframe instance testing filename - -The preprocessing code must instantiate the 
variables as below, all instances must be transformed by pyspark VectorAssembler:
-
-* `features_training` (Not Instantiated): Spark Dataframe instance for training the model
-* `features_evaluation` (Not Instantiated): Spark Dataframe instance for evaluating trained model
-* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model
-
-In case you don't want to evaluate the model, set `features_evaluation` as `None`.
-
-##### Handy methods
-
-```python
-self.fields_from_dataframe(dataframe, is_string)
-```
-
-This method returns `string` or `number` fields as a `string` list from a DataFrame.
-
-* `dataframe`: DataFrame instance
-* `is_string`: Boolean parameter(if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields)
diff --git a/docs/python-package.md b/docs/python-package.md
new file mode 100644
index 0000000..b64bff5
--- /dev/null
+++ b/docs/python-package.md
@@ -0,0 +1,17 @@
+# learningOrchestra Python package documentation
+
+**learning-orchestra-client** is a Python 3 package available through the Python Package Index. Install it with `pip install learning-orchestra-client`.
+
+All your scripts must import the package and create a link to the cluster by providing the IP address of an instance of your cluster. Preface your scripts with the following code:
+```python
+from learning_orchestra_client import *
+cluster_ip = "xx.xx.xxx.xxx"
+Context(cluster_ip)
+```
+
+
+The current version of learningOrchestra offers 7 microservices, each corresponding to a Python class:
+- The **[Database](database-python.md) API is the central microservice**. It holds all the data, including the analysis results.
+- The **[Data type](datatype-python.md) API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **[Projection](projection-python.md), [Histogram](histogram-python.md), [t-SNE](t-sne-python.md) and [PCA](pca-python.md) APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **[Model builder](modelbuilder-python.md) API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
diff --git a/docs/rest-apis.md b/docs/rest-apis.md
new file mode 100644
index 0000000..8adef29
--- /dev/null
+++ b/docs/rest-apis.md
@@ -0,0 +1,9 @@
+# learningOrchestra REST APIs documentation
+
+The current version of learningOrchestra offers 7 microservices:
+- The **[Database](database-rest.md) API is the central microservice**. It holds all the data, including the analysis results.
+- The **[Data type](datatype-rest.md) API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **[Projection](projection-rest.md), [Histogram](histogram-rest.md), [t-SNE](t-sne-rest.md) and [PCA](pca-rest.md) APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **[Model builder](modelbuilder-rest.md) API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
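+
+For a very first contact with these APIs, a raw HTTP call from Python is enough. The sketch below lists the files held by the Database microservice; it assumes the metadata listing is exposed at `GET CLUSTER_IP:5000/files`, as the Database API page suggests, with `CLUSTER_IP` being the external IP of one of your cluster instances.
+
+```python
+# Sketch: query the Database microservice for the files it currently holds.
+import requests
+
+CLUSTER_IP = "xx.xx.xxx.xxx"
+response = requests.get(f"http://{CLUSTER_IP}:5000/files")
+print(response.status_code)
+print(response.text)  # metadata of the stored files
+```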
+
+To access those microservices through their REST APIs, we recommend using a **GUI REST API** caller like [Postman](https://www.postman.com/product/api-client/) or [Insomnia](https://insomnia.rest/). Of course, regular `curl` commands from the terminal remain a possibility.
diff --git a/docs/support.md b/docs/support.md
new file mode 100644
index 0000000..ecdfe9e
--- /dev/null
+++ b/docs/support.md
@@ -0,0 +1,10 @@
+# Get support as a user
+
+To get support as a learningOrchestra user, please file an issue in the corresponding repository.
+
+When creating an issue for a bug or another undefined/unwanted behaviour, please remember to include the steps required to reproduce the behaviour in your issue.
+
+- You have generic questions about learningOrchestra or need help using the REST APIs: https://github.com/learningOrchestra/learningOrchestra/issues
+- You need help using the Python package: https://github.com/learningOrchestra/learningOrchestra-python-client/issues
+- You have noticed an issue with the content of the documentation: https://github.com/learningOrchestra/docs/issues
+- You have noticed an issue with this website, other than content issues: https://github.com/learningOrchestra/learningOrchestra.github.io/issues
diff --git a/docs/t-sne-python.md b/docs/t-sne-python.md
new file mode 100644
index 0000000..3053d5a
--- /dev/null
+++ b/docs/t-sne-python.md
@@ -0,0 +1,45 @@
+
+## t-SNE API
+
+### create_image_plot
+
+```python
+create_image_plot(tsne_filename, parent_filename,
+                  label_name=None, pretty_response=True)
+```
+
+* `parent_filename`: name of the source file from which the image plot is created
+* `tsne_filename`: name given to the created image plot
+* `label_name`: label name for datasets with labeled tuples (default: `None`, for datasets without labeled tuples)
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot_filenames
+
+```python
+read_image_plot_filenames(pretty_response=True)
+```
+
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot
+
+```python
+read_image_plot(tsne_filename, pretty_response=True)
+```
+
+* `tsne_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### delete_image_plot
+
+```python
+delete_image_plot(tsne_filename, pretty_response=True)
+```
+
+* `tsne_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
diff --git a/docs/t-sne-api.md b/docs/t-sne-rest.md
similarity index 95%
rename from docs/t-sne-api.md
rename to docs/t-sne-rest.md
index b825285..975fac8 100644
--- a/docs/t-sne-api.md
+++ b/docs/t-sne-rest.md
@@ -1,8 +1,8 @@
 # t-SNE API
-The T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization.
+The T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization.
-It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
+It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability, more information about this algorithm in its [Wiki page]( https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). @@ -39,7 +39,7 @@ Deletes an image plot by specifying its file name. `GET CLUSTER_IP:5005/images` Returns a list with all created images plot file name. - + ## Read an image plot `GET CLUSTER_IP:5005/images/` diff --git a/mkdocs.yml b/mkdocs.yml index 5b60c9e..919a54f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,18 +1,13 @@ site_name: learningOrchestra Docs nav: - Home: index.md - - Installation: installation.md - - Usage: usage.md - - API Guide: - - Microservices REST APIs: - - Database API: database-api.md - - Projection API: projection-api.md - - Data Type API: datatype-api.md - - Histogram API: histogram-api.md - - t-SNE API: t-sne-api.md - - PCA API: pca-api.md - - Model builder API: modelbuilder-api.md - - python-client APIs: python-apis.md + - About: about.md + - Install and deploy: install.md + - Use: + - Description of the microservices: microservices.md + - REST APIs: rest-apis.md + - Python package: python-package.md + - Get support: support.md theme: name: readthedocs features: