From ff7c14c05b6098278e740d4b91c4a08c1b0ab8de Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 16:33:54 +0200 Subject: [PATCH 01/27] Update repo path in README --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 6314f4b..7d821c5 100644 --- a/README.md +++ b/README.md @@ -4,9 +4,9 @@ To make changes clone the repo: -`git clone https://github.com/learningOrchestra/learningOrchestra-docs.git` +`git clone https://github.com/learningOrchestra/docs.git` -`cd learningOrchestra-docs` +`cd docs` Install mkdocs: From ba3fd572817cd2fe63d729ff9a7b81e970f7a779 Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 16:36:28 +0200 Subject: [PATCH 02/27] Update repo path in CONTRIBUTING --- CONTRIBUTING.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 26ba2c2..5d84c5c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -14,7 +14,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember ## Pull Request process -1. [Fork](https://github.com/riibeirogabriel/learningOrchestra/fork) the repository +1. [Fork](https://github.com/learningOrchestra/docs/fork) the repository 2. [Clone](https://git-scm.com/docs/git-clone) your fork to your local environment 3. Navigate into the project root directory 4. Create a new [branch](https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging), the branch should be named according to what feature/fix you're implementing @@ -26,7 +26,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember 7. Create a Pull Request Remember to describe what feature or fix you're implementing in the Pull Request window. -In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement. +In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement. After the Pull Request the repository owner will review your request.\ Be patient, if they require you to make changes to your request, do so. @@ -37,4 +37,4 @@ Don't be rude, use crude language or harass other users. ## License The repository is currently licenses under GNU General Public License v3.0. -By contributing to the project you agree that your contributions will be licensed under the same license and provisions. \ No newline at end of file +By contributing to the project you agree that your contributions will be licensed under the same license and provisions. From 28ded9714eb2d15f2e9c5285e03a007d4ccf72ee Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 16:43:48 +0200 Subject: [PATCH 03/27] Add short introduction to README --- README.md | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 7d821c5..6f0a72f 100644 --- a/README.md +++ b/README.md @@ -2,16 +2,14 @@ # learningOrchestra Docs -To make changes clone the repo: +This repository contains the files to generate the user documentation of [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra). The content of the documentation is created manually and can be found in [the docs folder](https://github.com/learningOrchestra/docs/tree/main/docs). The documentation website is created with [MkDocs](https://www.mkdocs.org/). +To make changes please read the [contributing guide](https://github.com/learningOrchestra/docs/blob/main/CONTRIBUTING.md) then: +1. 
Clone the repo

 `git clone https://github.com/learningOrchestra/docs.git`

-
+2. Move inside the repo

 `cd docs`

-Install mkdocs:
-
+3. Install mkdocs

 `pip install mkdocs`

-Run:
-
+4. Run MkDocs

 `mkdocs serve`

From 5d55c3dbedce69e33f91aaf56092b820c21d52e1 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 17:02:53 +0200
Subject: [PATCH 04/27] Update website plan

---
 docs/about.md          |  0
 docs/index.md          | 29 +++++++++++++++++------------
 docs/install.md        |  0
 docs/python-package.md |  0
 docs/rest-apis.md      |  0
 docs/support.md        |  0
 mkdocs.yml             | 11 ++++++++---
 7 files changed, 25 insertions(+), 15 deletions(-)
 create mode 100644 docs/about.md
 create mode 100644 docs/install.md
 create mode 100644 docs/python-package.md
 create mode 100644 docs/rest-apis.md
 create mode 100644 docs/support.md

diff --git a/docs/about.md b/docs/about.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/index.md b/docs/index.md
index a2ac8b8..240cc70 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,18 +1,23 @@
 # learningOrchestra Docs
 
-**learningOrchestra** is a distributed processing tool that facilitates and streamlines iterative processes in a Data Science project pipeline like:
+learningOrchestra aims to facilitate the development of complex data mining workflows by **seamlessly interfacing different data science tools and services**. From a single interoperable Application Programming Interface (API), users can **design their analytical pipelines and deploy them in an environment with the appropriate capabilities**.
 
-* Data Gathering
-* Data Cleaning
-* Model Building
-* Validating the Model
-* Presenting the Results
+learningOrchestra is designed for data scientists from both engineering and academic backgrounds, so that they can **focus on the discovery of new knowledge** in their data rather than library or maintenance issues.
 
-With learningOrchestra, you can:
+learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to **gather data**, **clean data**, **train machine learning models**, **tune machine learning models**, **evaluate machine learning models** and **visualize data and results**.
 
-* load a dataset from an URL (in CSV format).
-* accomplish several pre-processing tasks with datasets.
-* create highly customised model predictions against a specific dataset by providing their own pre-processing code.
-* build prediction models with different classifiers simultaneously using a spark cluster transparently.
+The current version of learningOrchestra offers 7 microservices:
+- The **Database API is the central microservice**. It holds all the data, including the analysis results.
+- The **Data type API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **Projection, Histogram, t-SNE and PCA APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **Model builder API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
 
-And so much more! Check the [Usage](https://learningorchestra.github.io/docs/usage/) section for more.
+The microservices can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on. 
learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**. + +Use this documentation to [learn more about the learningOrchestra project](https://learningorchestra.github.io/docs/about), [learn how to install and deploy learningOrchestra on a cluster](https://learningorchestra.github.io/docs/install), learn how to use the [REST APIs](https://learningorchestra.github.io/docs/rest-apis) and [Python package](https://learningorchestra.github.io/docs/python-package) to access learningOrchestra microservices, or [find options to get support](https://learningorchestra.github.io/docs/support) + +You can also visit the repositories of the learningOrchestra project: +- [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra) for the definition of the microservices and the REST APIs, +- [learningOrchestra-python-client](https://github.com/learningOrchestra/learningOrchestra-python-client) for the Python package, +- [docs](https://github.com/learningOrchestra/docs) for the content of the present documentation, and +- [learningOrchestra.github.io](https://github.com/learningOrchestra/learningOrchestra.github.io) for the code of the present website. diff --git a/docs/install.md b/docs/install.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/python-package.md b/docs/python-package.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/rest-apis.md b/docs/rest-apis.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/support.md b/docs/support.md new file mode 100644 index 0000000..e69de29 diff --git a/mkdocs.yml b/mkdocs.yml index 5b60c9e..6bb99e6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,10 +1,15 @@ site_name: learningOrchestra Docs nav: - Home: index.md - - Installation: installation.md + - About: about.md + - Installation: install.md + - Usage: + - REST APIs: rest-apis.md + - Python package: python-package.md + - Get support: support.md - Usage: usage.md - - API Guide: - - Microservices REST APIs: + - API Guide: + - Microservices REST APIs: - Database API: database-api.md - Projection API: projection-api.md - Data Type API: datatype-api.md From 97b2570b9172ac8cc27e7370e43f853a870c426b Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 17:10:30 +0200 Subject: [PATCH 05/27] About page inspired by main repo readme --- docs/about.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/about.md b/docs/about.md index e69de29..7f8a757 100644 --- a/docs/about.md +++ b/docs/about.md @@ -0,0 +1,11 @@ +Nowadays, **data science relies on a wide range of computer science skills**, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia. + +This situation can constitute a barrier to the actual extraction of new knowledge from collected data, +which is why the last two decades have seen more efforts to facilitate and streamline the development of +data mining workflows. The tools created can be sorted into two categories: **high-level** tools facilitate +the building of **automatic data processing pipelines** (e.g. [Weka](https://www.cs.waikato.ac.nz/ml/weka/)) +while **low-level** ones support the setup of appropriate physical and virtual infrastructure (e.g. [Spark](https://spark.apache.org/)). 
+
+However, this landscape is still missing a tool that **encompasses all steps and needs of a typical data science project**. This is where learningOrchestra comes in.
+
+Read our [first research monograph](https://drive.google.com/file/d/1ZDrTR58pBuobpgwB_AOOFTlfmZEY6uQS/view) (under construction) to learn more about the research behind the project.

From af6a93021be8949ebc1729f0190a4d650c112e0b Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 17:21:09 +0200
Subject: [PATCH 06/27] Add support page redirecting to each repo

---
 docs/support.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/support.md b/docs/support.md
index e69de29..bde0f4f 100644
--- a/docs/support.md
+++ b/docs/support.md
@@ -0,0 +1,8 @@
+To get support as a learningOrchestra user, please file an issue in the corresponding repository.
+
+When creating an issue for a bug or another undefined/unwanted behaviour, please remember to include the steps required to reproduce the behaviour in your issue.
+
+If you have general questions about learningOrchestra or need help using the REST APIs: https://github.com/learningOrchestra/learningOrchestra/issues
+If you need help using the Python package: https://github.com/learningOrchestra/learningOrchestra-python-client/issues
+If you have noticed an issue with the content of the documentation: https://github.com/learningOrchestra/docs/issues
+If you have noticed an issue with this website, other than content issues: https://github.com/learningOrchestra/learningOrchestra.github.io/issues

From debd6add649dbd26ef75d2106bf69bbf3054e0c7 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 17:27:09 +0200
Subject: [PATCH 07/27] Change syntax of internal links

---
 docs/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index 240cc70..4ac65bd 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,7 +14,7 @@ The current version of learningOrchestra offers 7 microservices:
 The microservices can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on. learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**.
-Use this documentation to [learn more about the learningOrchestra project](https://learningorchestra.github.io/docs/about), [learn how to install and deploy learningOrchestra on a cluster](https://learningorchestra.github.io/docs/install), learn how to use the [REST APIs](https://learningorchestra.github.io/docs/rest-apis) and [Python package](https://learningorchestra.github.io/docs/python-package) to access learningOrchestra microservices, or [find options to get support](https://learningorchestra.github.io/docs/support)
+Use this documentation to [learn more about the learningOrchestra project](about.md), [learn how to install and deploy learningOrchestra on a cluster](install.md), learn how to use the [REST APIs](rest-apis.md) and [Python package](python-package.md) to access learningOrchestra microservices, or [find options to get support](support.md).
 
 You can also visit the repositories of the learningOrchestra project:
 - [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra) for the definition of the microservices and the REST APIs,

From 3ae59fd7fafbcc13674dee7a099938ddf68c7ca4 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 17:30:24 +0200
Subject: [PATCH 08/27] Update install page

---
 docs/install.md | 70 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/docs/install.md b/docs/install.md
index e69de29..96a9559 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -0,0 +1,70 @@
+# Install and deploy learningOrchestra on a cluster
+
+:bell: This documentation assumes that users are familiar with a number of advanced computer science concepts. We have tried to link to learning resources to support beginners, as well as introduce some of the concepts in the [last section](#concepts). But if something is still not clear, don't hesitate to [ask for help](support.md).
+
+## Setting up your cluster
+
+learningOrchestra operates from a [cluster](#what-is-a-cluster) of Docker [containers](#what-is-a-container).
+
+All your hosts must operate under Linux distributions and have [Docker Engine](https://docs.docker.com/engine/install/) installed.
+
+Configure your cluster in [swarm mode](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/). Install [Docker Compose](https://docs.docker.com/compose/install/) on your manager instance.
+
+You are ready to deploy! :tada:
+
+## Deploy learningOrchestra
+
+Clone the main learningOrchestra repository on your manager instance.
+- Using the HTTP protocol, `git clone https://github.com/learningOrchestra/learningOrchestra.git`
+- Using the SSH protocol, `git clone git@github.com:learningOrchestra/learningOrchestra.git`
+- Using the GitHub CLI, `gh repo clone learningOrchestra/learningOrchestra`
+
+Move to the root of the directory, `cd learningOrchestra`.
+
+Deploy with `sudo ./run.sh`. The deployment process should take around a dozen minutes.
+
+### Interrupt learningOrchestra
+
+Run `docker stack rm microservice`.
+
+### Check cluster status
+
+To check the deployed microservices and machines of your cluster, open `CLUSTER_IP:80` in a browser, where *CLUSTER_IP* is the external IP of a machine in your cluster.
+
+The same can be done to check the Spark cluster state at `CLUSTER_IP:8080`.
+
+## Install-and-deploy questions
+
+###### My computer runs on Windows/OSX, can I still use learningOrchestra?
+
+You can use the microservices that run on a cluster where learningOrchestra is deployed, but **not deploy learningOrchestra**.
+
+###### I have a single computer, can I still use learningOrchestra?
+
+Theoretically, you can, if your machine has 12 GB of RAM, a quad-core processor and 100 GB of disk space. However, a single machine won't be able to cope with the computing demands of a real-life-sized dataset.
+
+###### What happens if learningOrchestra is killed while using a microservice?
+
+If your cluster fails while a microservice is processing data, the task may be lost. Some failures might corrupt the database systems.
+
+If no processing was in progress when your cluster failed, learningOrchestra will automatically re-deploy and reboot the affected microservices.
+
+###### What happens if my instances lose the connection to each other?
+
+If the connection between cluster instances is shut down, learningOrchestra will try to re-deploy the microservices from the lost instances on the remaining active instances of the cluster.
+
+## Concepts
+
+###### What is a container?
+
+Containers package code and everything needed to run that code together, so that the code can be run easily in any environment. They also isolate the code from the rest of the machine. They are [often compared to shipping containers](https://www.ctl.io/developers/blog/post/docker-and-shipping-containers-a-useful-but-imperfect-analogy).
+
+###### What is a cluster?
+
+A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. (From [Wikipedia](https://en.wikipedia.org/wiki/Computer_cluster))
+
+###### What are microservices?
+
+Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of services that are: highly maintainable and testable, loosely coupled, independently deployable, organized around business capabilities, owned by small teams.
+ +[An overview of microservice architecture](https://medium.com/hashmapinc/the-what-why-and-how-of-a-microservices-architecture-4179579423a9) From 93183423888c8f4ff55808c430acb95e43fa0659 Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 17:42:34 +0200 Subject: [PATCH 09/27] Update structure to add pages to define each microservice --- docs/database-rest.md | 0 docs/{database-api.md => database.md} | 0 docs/datatype-rest.md | 0 docs/{datatype-api.md => datatype.md} | 0 docs/histogram-rest.md | 0 docs/{histogram-api.md => histogram.md} | 0 docs/modelbuilder-rest.md | 0 docs/{modelbuilder-api.md => modelbuilder.md} | 0 docs/pca-rest.md | 0 docs/{pca-api.md => pca.md} | 0 docs/projection-rest.md | 0 docs/{projection-api.md => projection.md} | 0 docs/t-sne-rest.md | 0 docs/{t-sne-api.md => t-sne.md} | 0 mkdocs.yml | 12 ++++++++++-- 15 files changed, 10 insertions(+), 2 deletions(-) create mode 100644 docs/database-rest.md rename docs/{database-api.md => database.md} (100%) create mode 100644 docs/datatype-rest.md rename docs/{datatype-api.md => datatype.md} (100%) create mode 100644 docs/histogram-rest.md rename docs/{histogram-api.md => histogram.md} (100%) create mode 100644 docs/modelbuilder-rest.md rename docs/{modelbuilder-api.md => modelbuilder.md} (100%) create mode 100644 docs/pca-rest.md rename docs/{pca-api.md => pca.md} (100%) create mode 100644 docs/projection-rest.md rename docs/{projection-api.md => projection.md} (100%) create mode 100644 docs/t-sne-rest.md rename docs/{t-sne-api.md => t-sne.md} (100%) diff --git a/docs/database-rest.md b/docs/database-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/database-api.md b/docs/database.md similarity index 100% rename from docs/database-api.md rename to docs/database.md diff --git a/docs/datatype-rest.md b/docs/datatype-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/datatype-api.md b/docs/datatype.md similarity index 100% rename from docs/datatype-api.md rename to docs/datatype.md diff --git a/docs/histogram-rest.md b/docs/histogram-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/histogram-api.md b/docs/histogram.md similarity index 100% rename from docs/histogram-api.md rename to docs/histogram.md diff --git a/docs/modelbuilder-rest.md b/docs/modelbuilder-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/modelbuilder-api.md b/docs/modelbuilder.md similarity index 100% rename from docs/modelbuilder-api.md rename to docs/modelbuilder.md diff --git a/docs/pca-rest.md b/docs/pca-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/pca-api.md b/docs/pca.md similarity index 100% rename from docs/pca-api.md rename to docs/pca.md diff --git a/docs/projection-rest.md b/docs/projection-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/projection-api.md b/docs/projection.md similarity index 100% rename from docs/projection-api.md rename to docs/projection.md diff --git a/docs/t-sne-rest.md b/docs/t-sne-rest.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/t-sne-api.md b/docs/t-sne.md similarity index 100% rename from docs/t-sne-api.md rename to docs/t-sne.md diff --git a/mkdocs.yml b/mkdocs.yml index 6bb99e6..11f7edf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,8 +2,16 @@ site_name: learningOrchestra Docs nav: - Home: index.md - About: about.md - - Installation: install.md - - Usage: + - Install and deploy: install.md + - Use: + - Description of the microservices: + - Database 
microservice: database.md
+      - Projection microservice: projection.md
+      - Data Type microservice: datatype.md
+      - Histogram microservice: histogram.md
+      - t-SNE microservice: t-sne.md
+      - PCA microservice: pca.md
+      - Model builder microservice: modelbuilder.md
     - REST APIs: rest-apis.md
     - Python package: python-package.md
   - Get support: support.md

From 074ea96956001ecce0a17fd7fa19d487cdade4bb Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 17:45:59 +0200
Subject: [PATCH 10/27] Add and update page titles

---
 docs/about.md     | 2 ++
 docs/index.md     | 2 +-
 docs/rest-apis.md | 9 +++++++++
 docs/support.md   | 2 ++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/docs/about.md b/docs/about.md
index 7f8a757..0429868 100644
--- a/docs/about.md
+++ b/docs/about.md
@@ -1,3 +1,5 @@
+# About learningOrchestra
+
 Nowadays, **data science relies on a wide range of computer science skills**, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia.
 
 This situation can constitute a barrier to the actual extraction of new knowledge from collected data,
diff --git a/docs/index.md b/docs/index.md
index 4ac65bd..825cef7 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,4 +1,4 @@
-# learningOrchestra Docs
+# learningOrchestra user documentation
 
 learningOrchestra aims to facilitate the development of complex data mining workflows by **seamlessly interfacing different data science tools and services**. From a single interoperable Application Programming Interface (API), users can **design their analytical pipelines and deploy them in an environment with the appropriate capabilities**.
 
diff --git a/docs/rest-apis.md b/docs/rest-apis.md
index e69de29..8adef29 100644
--- a/docs/rest-apis.md
+++ b/docs/rest-apis.md
@@ -0,0 +1,9 @@
+# learningOrchestra REST APIs documentation
+
+The current version of learningOrchestra offers 7 microservices:
+- The **[Database](database-rest.md) API is the central microservice**. It holds all the data, including the analysis results.
+- The **[Data type](datatype-rest.md) API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **[Projection](projection-rest.md), [Histogram](histogram-rest.md), [t-SNE](t-sne-rest.md) and [PCA](pca-rest.md) APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **[Model builder](modelbuilder-rest.md) API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
+
+To access those microservices through their REST APIs, we recommend using a **GUI REST API** client like [Postman](https://www.postman.com/product/api-client/) or [Insomnia](https://insomnia.rest/). Of course, regular `curl` commands from the terminal remain a possibility.
diff --git a/docs/support.md b/docs/support.md
index bde0f4f..ecdfe9e 100644
--- a/docs/support.md
+++ b/docs/support.md
@@ -1,3 +1,5 @@
+# Get support as a user
+
 To get support as a learningOrchestra user, please file an issue in the corresponding repository.
When creating an issue for a bug or another undefined/unwanted behaviour, please remember to include the steps required to reproduce the behaviour in your issue. From b2f67bd2a7f5f2d9c8da844d08b7840f91daa1a2 Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 18:08:41 +0200 Subject: [PATCH 11/27] Moving python package info, no update of content --- docs/python-apis.md | 233 --------------------------------------- docs/python-package.md | 242 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 242 insertions(+), 233 deletions(-) delete mode 100644 docs/python-apis.md diff --git a/docs/python-apis.md b/docs/python-apis.md deleted file mode 100644 index 2500d21..0000000 --- a/docs/python-apis.md +++ /dev/null @@ -1,233 +0,0 @@ -# python-client APIs - -## Database API - -### read_resume_files - -```python -read_resume_files(pretty_response=True) -``` -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) -(default `True`, if `False`, return dict) - -### read_file - -```python -read_file(filename, skip=0, limit=10, query={}, pretty_response=True) -``` - -* `filename` : name of file -* `skip`: number of rows to skip in pagination(default: `0`) -* `limit`: number of rows to return in pagination(default: `10`) -(maximum is set at `20` rows per request) -* `query`: query to make in MongoDB(default: `empty query`) -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) - -### create_file - -```python -create_file(filename, url, pretty_response=True) -``` - -* `filename`: name of file to be created -* `url`: url to CSV file -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_file - -```python -delete_file(filename, pretty_response=True) -``` - -* `filename`: name of the file to be deleted -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Projection API - -### create_projection - -```python -create_projection(filename, projection_filename, fields, pretty_response=True) -``` - -* `filename`: name of the file to make projection -* `projection_filename`: name of file used to create projection -* `fields`: list with fields to make projection -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Data type handler API - -### change_file_type - -```python -change_file_type(filename, fields_dict, pretty_response=True) -``` - -* `filename`: name of file -* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Histogram API - -### create_histogram - -```python -create_histogram(filename, histogram_filename, fields, - pretty_response=True) -``` - -* `filename`: name of file to make histogram -* `histogram_filename`: name of file used to create histogram -* `fields`: list with fields to make histogram -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## t-SNE API - -### create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `tsne_filename`: name of file used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to 
-datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(tsne_filename, pretty_response=True) -``` - -* tsne_filename: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(tsne_filename, pretty_response=True) -``` - -* `tsne_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## PCA API - -### create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `pca_filename`: filename used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to -datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Model builder API - -### create_model - -```python -create_model(training_filename, test_filename, preprocessor_code, - model_classificator, pretty_response=True) -``` - -* `training_filename`: name of file to be used in training -* `test_filename`: name of file to be used in test -* `preprocessor_code`: Python3 code for pyspark preprocessing model -* `model_classificator`: list of initial classificators to be used in model -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -#### model_classificator - -* `lr`: LogisticRegression -* `dt`: DecisionTreeClassifier -* `rf`: RandomForestClassifier -* `gb`: Gradient-boosted tree classifier -* `nb`: NaiveBayes - -to send a request with LogisticRegression and NaiveBayes Classifiers: - -```python -create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"]) -``` - -#### preprocessor_code environment - -The Python 3 preprocessing code must use the environment instances as below: - -* `training_df` (Instantiated): Spark Dataframe instance training filename -* `testing_df` (Instantiated): Spark Dataframe instance testing filename - -The preprocessing code must instantiate the variables as below, all instances must be transformed by pyspark VectorAssembler: - -* `features_training` (Not Instantiated): Spark Dataframe instance for training the model -* `features_evaluation` (Not Instantiated): Spark 
Dataframe instance for evaluating trained model -* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model - -In case you don't want to evaluate the model, set `features_evaluation` as `None`. - -##### Handy methods - -```python -self.fields_from_dataframe(dataframe, is_string) -``` - -This method returns `string` or `number` fields as a `string` list from a DataFrame. - -* `dataframe`: DataFrame instance -* `is_string`: Boolean parameter(if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields) diff --git a/docs/python-package.md b/docs/python-package.md index e69de29..5c586e4 100644 --- a/docs/python-package.md +++ b/docs/python-package.md @@ -0,0 +1,242 @@ +# learningOrchestra Python package documentation + +**learning-orchestra-client** is a Python 3 package available through the Python Package Index. Install it with `pip install learning-orchestra-client`. + +All your scripts must import the package and create a link to the cluster by providing the IP address to an instance of your cluster. Preface your scripts with the following code: +``` +from learning_orchestra_client import * +cluster_ip = "xx.xx.xxx.xxx" +Context(cluster_ip) +``` + +## Database API + +### read_resume_files + +```python +read_resume_files(pretty_response=True) +``` +* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) +(default `True`, if `False`, return dict) + +### read_file + +```python +read_file(filename, skip=0, limit=10, query={}, pretty_response=True) +``` + +* `filename` : name of file +* `skip`: number of rows to skip in pagination(default: `0`) +* `limit`: number of rows to return in pagination(default: `10`) +(maximum is set at `20` rows per request) +* `query`: query to make in MongoDB(default: `empty query`) +* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) + +### create_file + +```python +create_file(filename, url, pretty_response=True) +``` + +* `filename`: name of file to be created +* `url`: url to CSV file +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### delete_file + +```python +delete_file(filename, pretty_response=True) +``` + +* `filename`: name of the file to be deleted +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## Projection API + +### create_projection + +```python +create_projection(filename, projection_filename, fields, pretty_response=True) +``` + +* `filename`: name of the file to make projection +* `projection_filename`: name of file used to create projection +* `fields`: list with fields to make projection +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## Data type handler API + +### change_file_type + +```python +change_file_type(filename, fields_dict, pretty_response=True) +``` + +* `filename`: name of file +* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## Histogram API + +### create_histogram + +```python +create_histogram(filename, histogram_filename, fields, + pretty_response=True) +``` + +* `filename`: name of file to make histogram +* `histogram_filename`: name of file used to create histogram +* `fields`: list with fields to make 
histogram +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## t-SNE API + +### create_image_plot + +```python +create_image_plot(tsne_filename, parent_filename, + label_name=None, pretty_response=True) +``` + +* `parent_filename`: name of file to make histogram +* `tsne_filename`: name of file used to create image plot +* `label_name`: label name to dataset with labeled tuples (default: `None`, to +datasets without labeled tuples) +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot_filenames + +```python +read_image_plot_filenames(pretty_response=True) +``` + +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot + +```python +read_image_plot(tsne_filename, pretty_response=True) +``` + +* tsne_filename: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### delete_image_plot + +```python +delete_image_plot(tsne_filename, pretty_response=True) +``` + +* `tsne_filename`: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## PCA API + +### create_image_plot + +```python +create_image_plot(tsne_filename, parent_filename, + label_name=None, pretty_response=True) +``` + +* `parent_filename`: name of file to make histogram +* `pca_filename`: filename used to create image plot +* `label_name`: label name to dataset with labeled tuples (default: `None`, to +datasets without labeled tuples) +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot_filenames + +```python +read_image_plot_filenames(pretty_response=True) +``` + +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### read_image_plot + +```python +read_image_plot(pca_filename, pretty_response=True) +``` + +* `pca_filename`: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +### delete_image_plot + +```python +delete_image_plot(pca_filename, pretty_response=True) +``` + +* `pca_filename`: filename of a created image plot +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +## Model builder API + +### create_model + +```python +create_model(training_filename, test_filename, preprocessor_code, + model_classificator, pretty_response=True) +``` + +* `training_filename`: name of file to be used in training +* `test_filename`: name of file to be used in test +* `preprocessor_code`: Python3 code for pyspark preprocessing model +* `model_classificator`: list of initial classificators to be used in model +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) + +#### model_classificator + +* `lr`: LogisticRegression +* `dt`: DecisionTreeClassifier +* `rf`: RandomForestClassifier +* `gb`: Gradient-boosted tree classifier +* `nb`: NaiveBayes + +to send a request with LogisticRegression and NaiveBayes Classifiers: + +```python +create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"]) +``` + +#### preprocessor_code environment + +The Python 3 preprocessing code must 
use the environment instances as below: + +* `training_df` (Instantiated): Spark Dataframe instance training filename +* `testing_df` (Instantiated): Spark Dataframe instance testing filename + +The preprocessing code must instantiate the variables as below, all instances must be transformed by pyspark VectorAssembler: + +* `features_training` (Not Instantiated): Spark Dataframe instance for training the model +* `features_evaluation` (Not Instantiated): Spark Dataframe instance for evaluating trained model +* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model + +In case you don't want to evaluate the model, set `features_evaluation` as `None`. + +##### Handy methods + +```python +self.fields_from_dataframe(dataframe, is_string) +``` + +This method returns `string` or `number` fields as a `string` list from a DataFrame. + +* `dataframe`: DataFrame instance +* `is_string`: Boolean parameter(if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields) From 074ea96956001ecce0a17fd7fa19d487cdade4bb Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 18:09:13 +0200 Subject: [PATCH 12/27] Removing outdated structure --- mkdocs.yml | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 11f7edf..2b3a1d5 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,17 +15,6 @@ nav: - REST APIs: rest-apis.md - Python package: python-package.md - Get support: support.md - - Usage: usage.md - - API Guide: - - Microservices REST APIs: - - Database API: database-api.md - - Projection API: projection-api.md - - Data Type API: datatype-api.md - - Histogram API: histogram-api.md - - t-SNE API: t-sne-api.md - - PCA API: pca-api.md - - Model builder API: modelbuilder-api.md - - python-client APIs: python-apis.md theme: name: readthedocs features: From 0e9178e854ce1cbe8ac31abbf9e83a30f86e0eeb Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 18:16:00 +0200 Subject: [PATCH 13/27] Change structure again --- mkdocs.yml | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 2b3a1d5..919a54f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -4,14 +4,7 @@ nav: - About: about.md - Install and deploy: install.md - Use: - - Description of the microservices: - - Database microservice: database.md - - Projection microservice: projection.md - - Data Type microservice: datatype.md - - Histogram microservice: histogram.md - - t-SNE microservice: t-sne.md - - PCA microservice: pca.md - - Model builder microservice: modelbuilder.md + - Description of the microservices: microservices.md - REST APIs: rest-apis.md - Python package: python-package.md - Get support: support.md From 9e198f4d67fc2bc5af8429fa8f31fac81e90c16f Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 18:16:42 +0200 Subject: [PATCH 14/27] Cleaning --- docs/installation.md | 38 -------------------------------------- 1 file changed, 38 deletions(-) delete mode 100644 docs/installation.md diff --git a/docs/installation.md b/docs/installation.md deleted file mode 100644 index 635f851..0000000 --- a/docs/installation.md +++ /dev/null @@ -1,38 +0,0 @@ -# Installation - -## Requirements - -* Linux hosts -* [Docker Engine](https://docs.docker.com/engine/install/) must be installed in all instances of your cluster -* Cluster configured in swarm mode, check [creating a swarm](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/) -* [Docker 
Compose](https://docs.docker.com/compose/install/) must be installed in the manager instance of your cluster
-
-*Ensure that your cluster environment does not block any traffic such as firewall rules in your network or in your hosts.*
-
-*If in case, you have firewalls or other traffic-blockers, add learningOrchestra as an exception.*
-
-Ex: In Google Cloud Platform each of the VMs must allow both http and https traffic.
-
-## Deployment
-
-In the manager Docker swarm machine, clone the repo using:
-
-```
-git clone https://github.com/riibeirogabriel/learningOrchestra.git
-```
-
-Navigate into the `learningOrchestra` directory and run:
-
-```
-cd learningOrchestra
-sudo ./run.sh
-```
-
-That's it! learningOrchestra has been deployed in your swarm cluster!
-
-## Cluster State
-
-`CLUSTER_IP:80` - To visualize cluster state (deployed microservices and cluster's machines).
-`CLUSTER_IP:8080` - To visualize spark cluster state.
-
-*\** `CLUSTER_IP` *is the external IP of a machine in your cluster.*
\ No newline at end of file

From cbce3155c2522c8ab22ccf7a4d2d4916aab9d2cb Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 18:17:30 +0200
Subject: [PATCH 15/27] Add page for microservice description

---
 docs/microservices.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 docs/microservices.md

diff --git a/docs/microservices.md b/docs/microservices.md
new file mode 100644
index 0000000..e69de29

From d8592a5d15599fc3f035e9ca6a7df8d1a85c4da7 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 18:23:43 +0200
Subject: [PATCH 16/27] First draft of microservices description

---
 docs/microservices.md | 68 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/docs/microservices.md b/docs/microservices.md
index e69de29..3035219 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -0,0 +1,68 @@
+# Description of the microservices
+
+learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to **gather data**, **clean data**, **train machine learning models**, **tune machine learning models**, **evaluate machine learning models** and **visualize data and results**.
+
+The current version of learningOrchestra offers 7 microservices:
+- The **Database API is the central microservice**. It holds all the data, including the analysis results.
+- The **Data type API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **Projection, Histogram, t-SNE and PCA APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **Model builder API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
+
+The microservices can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on. learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**.
+
+## Available microservices
+
+### Database microservice
+
+Download and handle datasets in a database.
+For additional details, see the [REST API](database-rest.md) and [Python package](python-package.md) documentations.
+
+### Data type microservice
+
+Change dataset fields type between number and text.
+For additional details, see the [REST API](datatype-rest.md) and [Python package](python-package.md) documentations.
+
+### Projection microservice
+
+Make projections of stored datasets using Spark cluster.
+For additional details, see the [REST API](projection-rest.md) and [Python package](python-package.md) documentations.
+
+### Histogram microservice
+
+Make histograms of stored datasets.
+For additional details, see the [REST API](histogram-rest.md) and [Python package](python-package.md) documentations.
+
+### t-SNE microservice
+
+Make a t-SNE image plot of stored datasets.
+For additional details, see the [REST API](t-sne-rest.md) and [Python package](python-package.md) documentations.
+
+### PCA microservice
+
+Make a PCA image plot of stored datasets.
+For additional details, see the [REST API](pca-rest.md) and [Python package](python-package.md) documentations.
+
+### Model builder microservice
+
+Create a prediction model from pre-processed datasets using Spark cluster.
+For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](python-package.md) documentations.
+
+## Additional information
+### Spark Microservices
+
+The Projection, t-SNE, PCA and Model builder microservices use the Spark microservice to work.
+
+By default, this microservice has only one instance. In case your data processing requires more computing power, you can scale this microservice.
+
+To do this, with learningOrchestra already deployed, run the following in the manager machine of your Docker swarm cluster:
+
+`docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES`
+
+*\** `NUMBER_OF_INSTANCES` *is the number of Spark microservice instances which you require. Choose it according to your cluster resources and your resource requirements.*
+
+### Database GUI
+
+NoSQLBooster, a MongoDB GUI, performs several database tasks such as file visualization, queries, projections and file extraction to CSV and JSON formats.
+It can be useful for accomplishing some of these tasks with your processed dataset or for getting your prediction results.
+
+Read the [Database API docs](https://learningorchestra.github.io/docs/database-api/) for more info on configuring this tool.
From e8e789feaf490a22a83988ba5efa5674c3607ae1 Mon Sep 17 00:00:00 2001 From: LaChapeliere Date: Fri, 9 Oct 2020 18:30:23 +0200 Subject: [PATCH 17/27] Split structure for python package --- docs/database-python.md | 0 docs/datatype-python.md | 0 docs/histogram-python.md | 0 docs/microservices.md | 14 +++++++------- docs/modelbuilder-python.md | 0 docs/pca-python.md | 0 docs/projection-python.md | 0 docs/python-package.md | 7 +++++++ docs/t-sne-python.md | 0 9 files changed, 14 insertions(+), 7 deletions(-) create mode 100644 docs/database-python.md create mode 100644 docs/datatype-python.md create mode 100644 docs/histogram-python.md create mode 100644 docs/modelbuilder-python.md create mode 100644 docs/pca-python.md create mode 100644 docs/projection-python.md create mode 100644 docs/t-sne-python.md diff --git a/docs/database-python.md b/docs/database-python.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/datatype-python.md b/docs/datatype-python.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/histogram-python.md b/docs/histogram-python.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/microservices.md b/docs/microservices.md index 3035219..76c7ac6 100644 --- a/docs/microservices.md +++ b/docs/microservices.md @@ -15,37 +15,37 @@ The microservices can be called on from any computer, including one that is not ### Database microservice Download and handle datasets in a database. -For additional details, see the [REST API](database-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](database-rest.md) and [Python package](database-python.md) documentations. ### Data type microservice Change dataset fields type between number and text. -For additional details, see the [REST API](datatype-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](datatype-rest.md) and [Python package](datatype-python.md) documentations. ### Projection microservice Make projections of stored datasets using Spark cluster. -For additional details, see the [REST API](projection-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](projection-rest.md) and [Python package](projection-python.md) documentations. ### Histogram microservice Make histograms of stored datasets. -For additional details, see the [REST API](histogram-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](histogram-rest.md) and [Python package](histogram-python.md) documentations. ### t-SNE microservice Make a t-SNE image plot of stored datasets. -For additional details, see the [REST API](t-sne-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentations. ### PCA microservice Make a PCA image plot of stored datasets. -For additional details, see the [REST API](pca-rest.md) and [Python package](python-package.md) documentations. +For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentations. ### Model builder microservice Create a prediction model from pre-processed datasets using Spark cluster. -For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](python-package.md) documentations. 
+For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](modelbuilder-python.md) documentations.
 
 ## Additional information
 ### Spark Microservices
diff --git a/docs/modelbuilder-python.md b/docs/modelbuilder-python.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/pca-python.md b/docs/pca-python.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/projection-python.md b/docs/projection-python.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/python-package.md b/docs/python-package.md
index 5c586e4..9c8b64f 100644
--- a/docs/python-package.md
+++ b/docs/python-package.md
@@ -9,6 +9,13 @@ cluster_ip = "xx.xx.xxx.xxx"
 Context(cluster_ip)
 ```
 
+
+The current version of learningOrchestra offers 7 microservices, each corresponding to a Python class:
+- The **[Database](database-python.md) API is the central microservice**. It holds all the data, including the analysis results.
+- The **[Data type](datatype-python.md) API is a preprocessing microservice** dedicated to changing the type of data fields.
+- The **[Projection](projection-python.md), [Histogram](histogram-python.md), [t-SNE](t-sne-python.md) and [PCA](pca-python.md) APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
+- The **[Model builder](modelbuilder-python.md) API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.
+
 ## Database API
 
 ### read_resume_files
diff --git a/docs/t-sne-python.md b/docs/t-sne-python.md
new file mode 100644
index 0000000..e69de29

From cac9ac6ece3bb57adc19e5efc5833ab48565b2fb Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 9 Oct 2020 18:37:37 +0200
Subject: [PATCH 18/27] Split info and clean

---
 docs/database-python.md     |  43 +++++++
 docs/database-rest.md       | 128 ++++++++++++++++++++
 docs/database.md            | 128 --------------------
 docs/datatype-python.md     |  12 ++
 docs/datatype-rest.md       |  16 +++
 docs/datatype.md            |  16 ---
 docs/histogram-python.md    |  15 +++
 docs/histogram-rest.md      |  16 +++
 docs/histogram.md           |  16 ---
 docs/modelbuilder-python.md |  56 +++++++
 docs/modelbuilder-rest.md   | 164 +++++++++++++++++++++
 docs/modelbuilder.md        | 164 -------------------
 docs/pca-python.md          |  44 +++++++
 docs/pca-rest.md            |  61 ++++++++++
 docs/pca.md                 |  61 ----------
 docs/projection-python.md   |  13 ++
 docs/projection-rest.md     |  14 +++
 docs/projection.md          |  14 ---
 docs/python-package.md      | 232 ------------------------------
 docs/t-sne-python.md        |  45 +++++++
 docs/t-sne-rest.md          |  59 +++++++
 docs/t-sne.md               |  59 ----------
 22 files changed, 686 insertions(+), 690 deletions(-)
 delete mode 100644 docs/database.md
 delete mode 100644 docs/datatype.md
 delete mode 100644 docs/histogram.md
 delete mode 100644 docs/modelbuilder.md
 delete mode 100644 docs/pca.md
 delete mode 100644 docs/projection.md
 delete mode 100644 docs/t-sne.md

diff --git a/docs/database-python.md b/docs/database-python.md
index e69de29..47147f9 100644
--- a/docs/database-python.md
+++ b/docs/database-python.md
@@ -0,0 +1,43 @@
+## Database API
+
+### read_resume_files
+
+```python
+read_resume_files(pretty_response=True)
+```
+* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)
+
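+A minimal usage sketch, assuming the package is installed and imported as shown in the [Python package](python-package.md) introduction (the class name `DatabaseApi` is a placeholder for illustration; check the package itself for the exact name):
+
+```python
+from learning_orchestra_client import *
+
+cluster_ip = "xx.xx.xxx.xxx"  # external IP of an instance of your cluster
+Context(cluster_ip)
+
+database_api = DatabaseApi()  # placeholder class name, for illustration only
+
+# prints an indented string with the metadata of every stored file
+print(database_api.read_resume_files())
+```
+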
+### read_file
+
+```python
+read_file(filename, skip=0, limit=10, query={}, pretty_response=True)
+```
+
+* `filename`: name of the file
+* `skip`: number of rows to skip in pagination (default: `0`)
+* `limit`: number of rows to return in pagination (default: `10`; the maximum is set at `20` rows per request)
+* `query`: query to make in MongoDB (default: `empty query`)
+* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)
+
+### create_file
+
+```python
+create_file(filename, url, pretty_response=True)
+```
+
+* `filename`: name of the file to be created
+* `url`: URL of the CSV file
+* `pretty_response`: returns an indented `string` for visualization
+(default: `True`; returns a `dict` if `False`)
+
+### delete_file
+
+```python
+delete_file(filename, pretty_response=True)
+```
+
+* `filename`: name of the file to be deleted
+* `pretty_response`: returns an indented `string` for visualization
+(default: `True`; returns a `dict` if `False`)
diff --git a/docs/database-rest.md b/docs/database-rest.md
index e69de29..b6890e8 100644
--- a/docs/database-rest.md
+++ b/docs/database-rest.md
@@ -0,0 +1,128 @@
+# Database API
+
+The **Database API** microservice creates a level of abstraction through a REST API.
+
+Using MongoDB, datasets are downloaded in CSV format and parsed into JSON format, where the primary key of each document is the `filename` field contained in the JSON body of the POST request.
+
+## GUI tool to handle database files
+
+GUI tools such as [NoSQLBooster](https://nosqlbooster.com) can interact with the MongoDB instance behind the Database microservice and perform several tasks that the `learning-orchestra-client` package does not cover, such as schema visualization and file extraction to CSV and JSON formats.
+
+You can also easily navigate through all inserted files and visualize each row of a given file. To use such a tool, connect to the URL `cluster_ip:27017` with the credentials:
+
+```
+username = root
+password = owl45#21
+```
+
+## List all inserted files
+
+`GET CLUSTER_IP:5000/files`
+
+Returns an array with the metadata of every file stored in the database.
+
+### Downloaded files metadata
+
+```json
+{
+  "fields": [
+    "PassengerId",
+    "Survived",
+    "Pclass",
+    "Name",
+    "Sex",
+    "Age",
+    "SibSp",
+    "Parch",
+    "Ticket",
+    "Fare",
+    "Cabin",
+    "Embarked"
+  ],
+  "filename": "titanic_training",
+  "finished": true,
+  "time_created": "2020-07-28T22:16:10-00:00",
+  "url": "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo"
+}
+```
+
+* `fields` - Names of the columns in the file
+* `filename` - Name of the file
+* `finished` - Flag used to indicate whether the asynchronous processing of the file downloader is finished
+* `time_created` - Time of creation
+* `url` - URL used to download the file
+
+### Preprocessed files metadata
+
+```json
+{
+  "fields": [
+    "PassengerId",
+    "Survived",
+    "Pclass",
+    "Name",
+    "Sex",
+    "Age",
+    "SibSp",
+    "Parch",
+    "Embarked"
+  ],
+  "filename": "titanic_training_projection",
+  "finished": false,
+  "parent_filename": "titanic_training",
+  "time_created": "2020-07-28T12:01:44-00:00"
+}
+```
+
+* `parent_filename` - The `filename` used in a preprocessing task, from which the current file is derived.
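+
+As an illustration, assuming your cluster is reachable at `1.2.3.4` (replace with your own *CLUSTER_IP*), the metadata listing described above can be fetched from a terminal with a plain `curl` call:
+
+```bash
+curl http://1.2.3.4:5000/files
+```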
+
+### Classifier prediction files metadata
+
+```json
+{
+    "F1": "0.7030995388400528",
+    "accuracy": "0.7034883720930233",
+    "classificator": "nb",
+    "filename": "titanic_testing_new_prediction_nb",
+    "fit_time": 41.870062828063965
+}
+```
+
+* `F1` - F1 score of the model
+* `accuracy` - Accuracy of the model prediction
+* `classificator` - Initials of the classifier used
+* `filename` - Name of the file
+* `fit_time` - Time taken to fit the model during training
+
+## List file content
+
+`GET CLUSTER_IP:5000/files/<filename>?skip=number&limit=number&query={}`
+
+Returns rows of the requested file, with pagination.
+
+* `filename` - Name of the requested file
+* `skip` - Number of rows to skip in the CSV file
+* `limit` - Limit on the query result; the maximum limit is set to 20 rows
+* `query` - Query to find documents; if only pagination is requested, `query` should be empty curly brackets, `query={}`
+
+The first row returned is always the file metadata.
+
+## Post file
+
+`POST CLUSTER_IP:5000/files`
+
+Inserts a CSV file into the database using the POST method. The JSON below must be contained in the body of the HTTP request, and the following fields are required:
+
+```json
+{
+    "filename": "key_to_document_identification",
+    "url": "http://sitetojson.file/path/to/csv"
+}
+```
+
+## Delete an existing file
+
+`DELETE CLUSTER_IP:5000/files/<filename>`
+
+A request of type `DELETE`, passing the `filename` of an existing file in the request parameters, deletes that file from the database.
diff --git a/docs/database.md b/docs/database.md
deleted file mode 100644
index 4c3be5d..0000000
--- a/docs/database.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# Database API
-
-The **Database API** microservice creates a level of abstraction through a REST API.
-
-Using MongoDB, datasets are downloaded in CSV format and parsed into JSON format where the primary key for each document is the filename field contained in the JSON file POST request.
-
-## GUI tool to handle database files
-
-There are GUI tools to handle database files, like [NoSQLBooster](https://nosqlbooster.com) can interact with mongoDB used in database, and makes several tasks which are limited in `learning-orchestra-client` package, as schema visualization and files extraction and download to formats as CSV and JSON.
-
-You also can navigate in all inserted files in easy way and visualize each row from determined file, to use this tool connect with the url `cluster\_ip:27017` and use the credentials:
-
-```
-username = root
-password = owl45#21
-```
-
-## List all inserted files
-
-`GET CLUSTER_IP:5000/files`
-
-Returns an array of metadata files from the database, where each file contains a metadata file.
- -### Downloaded files Metadata - -```json -{ - "fields": [ - "PassengerId", - "Survived", - "Pclass", - "Name", - "Sex", - "Age", - "SibSp", - "Parch", - "Ticket", - "Fare", - "Cabin", - "Embarked" - ], - "filename": "titanic_training", - "finished": true, - "time_created": "2020-07-28T22:16:10-00:00", - "url": "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo" -} -``` - -* `fields` - Names of the columns in the file -* `filename` - Name of the file -* `finished` - Flag used to indicate if asynchronous processing from file downloader is finished -* `time_created` - Time of creation -* `url` - URL used to download the file - -### Preprocessed files metadata - -```json -{ - "fields": [ - "PassengerId", - "Survived", - "Pclass", - "Name", - "Sex", - "Age", - "SibSp", - "Parch", - "Embarked" - ], - "filename": "titanic_training_projection", - "finished": false, - "parent_filename": "titanic_training", - "time_created": "2020-07-28T12:01:44-00:00" -} -``` - -* `parent_filename` - The `filename` used to make a preprocess task, from which the current file is derived. - -### Classifier prediction files metadata - -```json -{ - "F1": "0.7030995388400528", - "accuracy": "0.7034883720930233", - "classificator": "nb", - "filename": "titanic_testing_new_prediction_nb", - "fit_time": 41.870062828063965 -} -``` - -* `F1` - F1 Score from model accuracy -* `accuracy` - Accuracy from model prediction -* `classificator` - Initials from used classificator -* `filename` - Name of the file -* `fit_time` - Time taken for the model to be fit during training - -## List file content - -`GET CLUSTER_IP:5000/files/?skip=number&limit=number&query={}` - -Returns rows of the file requested, with pagination. - -* `filename` - Name of file requests -* `skip` - Amount of lines to skip in the CSV file -* `limit` - Limit the query result, maximum limit set to 20 rows -* `query` - Query to find documents, if only pagination is requested, `query` should be empty curly brackets `query={}` - -The first row in the query is always the metadata file. - -## Post file - -`POST CLUSTER_IP:5000/files` - -Insert a CSV into the database using the POST method, JSON must be contained in the body of the HTTP request. -The following fields are required: - -```json -{ - "filename": "key_to_document_identification", - "url": "http://sitetojson.file/path/to/csv" -} -``` - -## Delete an existing file - -`DELETE CLUSTER_IP:5000/files/` - -Request of type `DELETE`, passing the `filename` field of an existing file in the request parameters, deleting the file in the database. diff --git a/docs/datatype-python.md b/docs/datatype-python.md index e69de29..182ffb0 100644 --- a/docs/datatype-python.md +++ b/docs/datatype-python.md @@ -0,0 +1,12 @@ +## Data type handler API + +### change_file_type + +```python +change_file_type(filename, fields_dict, pretty_response=True) +``` + +* `filename`: name of file +* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys +* `pretty_response`: returns indented `string` for visualization +(default: `True`, returns `dict` if `False`) diff --git a/docs/datatype-rest.md b/docs/datatype-rest.md index e69de29..b282182 100644 --- a/docs/datatype-rest.md +++ b/docs/datatype-rest.md @@ -0,0 +1,16 @@ +# Data Type API + +This microservice changes data type from stored file between `number` and `string`. 
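+
+As a quick, hypothetical illustration of calling this microservice from Python (the endpoint itself is documented in the next section; the file and field names below are purely illustrative):
+
+```python
+import requests  # third-party HTTP client
+
+CLUSTER_IP = "xx.xx.xxx.xxx"  # IP of an instance of your cluster
+
+# Cast the "Age" field of the stored file "titanic_training" to a number
+# and its "Name" field to a string
+response = requests.patch(
+    "http://" + CLUSTER_IP + ":5003/fieldtypes/titanic_training",
+    json={"Age": "number", "Name": "string"})
+print(response.status_code)
+```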
+
+## Change field types of inserted file
+
+`PATCH CLUSTER_IP:5003/fieldtypes/<filename>`
+
+The request uses `filename` as the id in the URL parameters and the fields in the body. The body lists all the fields of the file to be changed, using a `number` or `string` descriptor in each `Key:Value` pair to describe the new type of the altered field.
+
+```json
+{
+    "field1": "number",
+    "field2": "string"
+}
+```
diff --git a/docs/datatype.md b/docs/datatype.md
deleted file mode 100644
index b282182..0000000
--- a/docs/datatype.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Data Type API
-
-This microservice changes data type from stored file between `number` and `string`.
-
-## Change field types of inserted file
-
-`PATCH CLUSTER_IP:5003/fieldtypes/`
-
-The request uses `filename` as the id in the parameters and fields in the body, `fields` is an array with all fields from file to be changed, using `number` or string descriptor in each `Key:Value` to describe the new value of altered field of file.
-
-```json
-{
-    "field1": "number",
-    "field2": "string"
-}
-```
diff --git a/docs/histogram-python.md b/docs/histogram-python.md
index e69de29..1dc6e9d 100644
--- a/docs/histogram-python.md
+++ b/docs/histogram-python.md
@@ -0,0 +1,15 @@
+
+## Histogram API
+
+### create_histogram
+
+```python
+create_histogram(filename, histogram_filename, fields,
+                 pretty_response=True)
+```
+
+* `filename`: name of the source file from which to make the histogram
+* `histogram_filename`: name of the file in which to save the histogram
+* `fields`: list of fields to include in the histogram
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
diff --git a/docs/histogram-rest.md b/docs/histogram-rest.md
index e69de29..1830def 100644
--- a/docs/histogram-rest.md
+++ b/docs/histogram-rest.md
@@ -0,0 +1,16 @@
+# Histogram API
+
+Microservice used to make a histogram from a stored file, storing the resulting histogram in a new file in MongoDB.
+
+## Create a Histogram from posted file
+
+`POST CLUSTER_IP:5004/histograms/<filename>`
+
+The request is sent in the body: `histogram_filename` is the name of the file in which the histogram result is saved and `fields` is an array with all the fields needed to make the histogram.
+
+```json
+{
+    "histogram_filename": "filename_to_save_the_histogram",
+    "fields": ["fields", "from", "filename"]
+}
+```
diff --git a/docs/histogram.md b/docs/histogram.md
deleted file mode 100644
index 1830def..0000000
--- a/docs/histogram.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Histogram API
-
-Microservice used to make a histogram from a stored file, storing the resulting histogram in a new file in MongoDB.
-
-## Create a Histogram from posted file
-
-`POST CLUSTER_IP:5004/histograms/`
-
-The request is sent in the body, `histogram_filename` is the name of the file in which the histogram result is saved to and `fields` is an array with all the fields necessary to make the histogram.
-
-```json
-{
-    "histogram_filename": "filename_to_save_the_histogram",
-    "fields": ["fields", "from", "filename"]
-}
-```
diff --git a/docs/modelbuilder-python.md b/docs/modelbuilder-python.md
index e69de29..024437b 100644
--- a/docs/modelbuilder-python.md
+++ b/docs/modelbuilder-python.md
@@ -0,0 +1,56 @@
+
+## Model builder API
+
+### create_model
+
+```python
+create_model(training_filename, test_filename, preprocessor_code,
+             model_classificator, pretty_response=True)
+```
+
+* `training_filename`: name of the file to be used in training
+* `test_filename`: name of the file to be used in testing
+* `preprocessor_code`: Python 3 preprocessing code, using the PySpark library
+* `model_classificator`: list of the initials of the classifiers to be used in the model (see below)
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+#### model_classificator
+
+* `lr`: LogisticRegression
+* `dt`: DecisionTreeClassifier
+* `rf`: RandomForestClassifier
+* `gb`: Gradient-boosted tree classifier
+* `nb`: NaiveBayes
+
+To send a request with the LogisticRegression and NaiveBayes classifiers:
+
+```python
+create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"])
+```
+
+#### preprocessor_code environment
+
+The Python 3 preprocessing code must use the environment instances described below:
+
+* `training_df` (instantiated): Spark DataFrame instance holding the training dataset
+* `testing_df` (instantiated): Spark DataFrame instance holding the testing dataset
+
+The preprocessing code must instantiate the variables listed below; all instances must be transformed by the PySpark VectorAssembler:
+
+* `features_training` (not instantiated): Spark DataFrame instance used to train the model
+* `features_evaluation` (not instantiated): Spark DataFrame instance used to evaluate the trained model
+* `features_testing` (not instantiated): Spark DataFrame instance used to test the model
+
+In case you don't want to evaluate the model, set `features_evaluation` to `None`.
+
+##### Handy methods
+
+```python
+self.fields_from_dataframe(dataframe, is_string)
+```
+
+This method returns `string` or `number` fields as a `string` list from a DataFrame.
+
+* `dataframe`: DataFrame instance
+* `is_string`: Boolean parameter (if `True`, the method returns the string DataFrame fields; otherwise, it returns the number DataFrame fields)
diff --git a/docs/modelbuilder-rest.md b/docs/modelbuilder-rest.md
index e69de29..e4a97c1 100644
--- a/docs/modelbuilder-rest.md
+++ b/docs/modelbuilder-rest.md
@@ -0,0 +1,164 @@
+# Model Builder API
+
+The Model Builder microservice provides a REST API to create several model predictions from your own preprocessing code, using a defined set of classifiers.
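+
+As a quick, hypothetical illustration of calling this microservice from Python (the request body schema is specified in the next section; the filenames are illustrative, and `preprocessor_code` must hold a preprocessing script as a string):
+
+```python
+import requests  # third-party HTTP client
+
+CLUSTER_IP = "xx.xx.xxx.xxx"  # IP of an instance of your cluster
+
+payload = {
+    "training_filename": "titanic_training",
+    "test_filename": "titanic_testing",
+    "preprocessor_code": "...",  # Python 3 PySpark preprocessing code, see below
+    "classificators_list": ["lr", "nb"],
+}
+response = requests.post("http://" + CLUSTER_IP + ":5002/models", json=payload)
+print(response.status_code)
+```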
+
+## Create prediction model
+
+`POST CLUSTER_IP:5002/models`
+
+```json
+{
+    "training_filename": "training filename",
+    "test_filename": "test filename",
+    "preprocessor_code": "Python 3 preprocessing code, using the PySpark library",
+    "classificators_list": "string list of classifiers to be used"
+}
+```
+
+### List of Classifiers
+
+* `lr`: LogisticRegression
+* `dt`: DecisionTreeClassifier
+* `rf`: RandomForestClassifier
+* `gb`: Gradient-boosted tree classifier
+* `nb`: NaiveBayes
+
+To send a request with the LogisticRegression and NaiveBayes classifiers:
+
+```json
+{
+    "training_filename": "training filename",
+    "test_filename": "test filename",
+    "preprocessor_code": "Python 3 preprocessing code, using the PySpark library",
+    "classificators_list": ["lr", "nb"]
+}
+```
+
+### preprocessor_code environment
+
+The Python 3 preprocessing code must use the environment instances described below:
+
+* `training_df` (instantiated): Spark DataFrame instance holding the training dataset
+* `testing_df` (instantiated): Spark DataFrame instance holding the testing dataset
+
+The preprocessing code must instantiate the variables listed below; all instances must be transformed by the PySpark VectorAssembler:
+
+* `features_training` (not instantiated): Spark DataFrame instance used to train the model
+* `features_evaluation` (not instantiated): Spark DataFrame instance used to evaluate the accuracy of the trained model
+* `features_testing` (not instantiated): Spark DataFrame instance used to test the model
+
+In case you don't want to evaluate the model, set `features_evaluation` to `None`.
+
+#### Handy methods
+
+```python
+self.fields_from_dataframe(dataframe, is_string)
+```
+
+This method returns `string` or `number` fields as a `string` list from a DataFrame.
+
+* `dataframe`: DataFrame instance
+* `is_string`: Boolean parameter; if `True`, the method returns the string DataFrame fields, otherwise it returns the number DataFrame fields.
+
+#### preprocessor_code Example
+
+This example uses the [Titanic challenge datasets](https://www.kaggle.com/c/titanic/overview).
+ +```python +from pyspark.ml import Pipeline +from pyspark.sql.functions import ( + mean, col, split, + regexp_extract, when, lit) + +from pyspark.ml.feature import ( + VectorAssembler, + StringIndexer +) + +TRAINING_DF_INDEX = 0 +TESTING_DF_INDEX = 1 + +training_df = training_df.withColumnRenamed('Survived', 'label') +testing_df = testing_df.withColumn('label', lit(0)) +datasets_list = [training_df, testing_df] + +for index, dataset in enumerate(datasets_list): + dataset = dataset.withColumn( + "Initial", + regexp_extract(col("Name"), "([A-Za-z]+)\.", 1)) + datasets_list[index] = dataset + +misspelled_initials = [ + 'Mlle', 'Mme', 'Ms', 'Dr', + 'Major', 'Lady', 'Countess', + 'Jonkheer', 'Col', 'Rev', + 'Capt', 'Sir', 'Don' +] +correct_initials = [ + 'Miss', 'Miss', 'Miss', 'Mr', + 'Mr', 'Mrs', 'Mrs', + 'Other', 'Other', 'Other', + 'Mr', 'Mr', 'Mr' +] +for index, dataset in enumerate(datasets_list): + dataset = dataset.replace(misspelled_initials, correct_initials) + datasets_list[index] = dataset + + +initials_age = {"Miss": 22, + "Other": 46, + "Master": 5, + "Mr": 33, + "Mrs": 36} +for index, dataset in enumerate(datasets_list): + for initial, initial_age in initials_age.items(): + dataset = dataset.withColumn( + "Age", + when((dataset["Initial"] == initial) & + (dataset["Age"].isNull()), initial_age).otherwise( + dataset["Age"])) + datasets_list[index] = dataset + + +for index, dataset in enumerate(datasets_list): + dataset = dataset.na.fill({"Embarked": 'S'}) + datasets_list[index] = dataset + + +for index, dataset in enumerate(datasets_list): + dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch')) + dataset = dataset.withColumn('Alone', lit(0)) + dataset = dataset.withColumn( + "Alone", + when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"])) + datasets_list[index] = dataset + + +text_fields = ["Sex", "Embarked", "Initial"] +for column in text_fields: + for index, dataset in enumerate(datasets_list): + dataset = StringIndexer( + inputCol=column, outputCol=column+"_index").\ + fit(dataset).\ + transform(dataset) + datasets_list[index] = dataset + + +non_required_columns = ["Name", "Embarked", "Sex", "Initial"] +for index, dataset in enumerate(datasets_list): + dataset = dataset.drop(*non_required_columns) + datasets_list[index] = dataset + + +training_df = datasets_list[TRAINING_DF_INDEX] +testing_df = datasets_list[TESTING_DF_INDEX] + +assembler = VectorAssembler( + inputCols=training_df.columns[:], + outputCol="features") +assembler.setHandleInvalid('skip') + +features_training = assembler.transform(training_df) +(features_training, features_evaluation) =\ + features_training.randomSplit([0.8, 0.2], seed=33) +features_testing = assembler.transform(testing_df) +``` diff --git a/docs/modelbuilder.md b/docs/modelbuilder.md deleted file mode 100644 index 3898e40..0000000 --- a/docs/modelbuilder.md +++ /dev/null @@ -1,164 +0,0 @@ -# Model Builder API - -Model Builder microservice provides a REST API to create several model predictions using your own preprocessing code using a defined set of classifiers. 
- -## Create prediction model - -`POST CLUSTER_IP:5002/models` - -```json -{ - "training_filename": "training filename", - "test_filename": "test filename", - "preprocessor_code": "Python3 code to preprocessing, using Pyspark library", - "classificators_list": "String list of classificators to be used" -} -``` - -### List of Classifiers - -* `lr`: LogisticRegression -* `dt`: DecisionTreeClassifier -* `rf`: RandomForestClassifier -* `gb`: Gradient-boosted tree classifier -* `nb`: NaiveBayes - -To send a request with LogisticRegression and NaiveBayes Classifiers: - -```json -{ - "training_filename": "training filename", - "test_filename": "test filename", - "preprocessor_code": "Python3 code to preprocessing, using Pyspark library", - "classificators_list": ["lr", "nb"] -} -``` - -### preprocessor_code environment - -The python 3 preprocessing code must use the environment instances in bellow: - -* `training_df` (Instantiated): Spark Dataframe instance training filename -* `testing_df` (Instantiated): Spark Dataframe instance testing filename - -The preprocessing code must instantiate the variables in below, all instances must be transformed by pyspark VectorAssembler: - -* `features_training` (Not Instantiated): Spark Dataframe instance for train the model -* `features_evaluation` (Not Instantiated): Spark Dataframe instance for evaluating trained model accuracy -* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model - -In case you don't want to evaluate the model, set `features_evaluation` as `None`. - -#### Handy methods - -```python -self.fields_from_dataframe(self, dataframe, is_string) -``` -This method returns string or number fields as a string list from a DataFrame. - -* `dataframe`: DataFrame instance -* `is_string`: Boolean parameter, if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields. - -#### preprocessor_code Example - -This example uses the [titanic challengue datasets](https://www.kaggle.com/c/titanic/overview). 
-
-```python
-from pyspark.ml import Pipeline
-from pyspark.sql.functions import (
-    mean, col, split,
-    regexp_extract, when, lit)
-
-from pyspark.ml.feature import (
-    VectorAssembler,
-    StringIndexer
-)
-
-TRAINING_DF_INDEX = 0
-TESTING_DF_INDEX = 1
-
-training_df = training_df.withColumnRenamed('Survived', 'label')
-testing_df = testing_df.withColumn('label', lit(0))
-datasets_list = [training_df, testing_df]
-
-for index, dataset in enumerate(datasets_list):
-    dataset = dataset.withColumn(
-        "Initial",
-        regexp_extract(col("Name"), "([A-Za-z]+)\.", 1))
-    datasets_list[index] = dataset
-
-misspelled_initials = [
-    'Mlle', 'Mme', 'Ms', 'Dr',
-    'Major', 'Lady', 'Countess',
-    'Jonkheer', 'Col', 'Rev',
-    'Capt', 'Sir', 'Don'
-]
-correct_initials = [
-    'Miss', 'Miss', 'Miss', 'Mr',
-    'Mr', 'Mrs', 'Mrs',
-    'Other', 'Other', 'Other',
-    'Mr', 'Mr', 'Mr'
-]
-for index, dataset in enumerate(datasets_list):
-    dataset = dataset.replace(misspelled_initials, correct_initials)
-    datasets_list[index] = dataset
-
-
-initials_age = {"Miss": 22,
-                "Other": 46,
-                "Master": 5,
-                "Mr": 33,
-                "Mrs": 36}
-for index, dataset in enumerate(datasets_list):
-    for initial, initial_age in initials_age.items():
-        dataset = dataset.withColumn(
-            "Age",
-            when((dataset["Initial"] == initial) &
-                 (dataset["Age"].isNull()), initial_age).otherwise(
-                    dataset["Age"]))
-        datasets_list[index] = dataset
-
-
-for index, dataset in enumerate(datasets_list):
-    dataset = dataset.na.fill({"Embarked": 'S'})
-    datasets_list[index] = dataset
-
-
-for index, dataset in enumerate(datasets_list):
-    dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch'))
-    dataset = dataset.withColumn('Alone', lit(0))
-    dataset = dataset.withColumn(
-        "Alone",
-        when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
-    datasets_list[index] = dataset
-
-
-text_fields = ["Sex", "Embarked", "Initial"]
-for column in text_fields:
-    for index, dataset in enumerate(datasets_list):
-        dataset = StringIndexer(
-            inputCol=column, outputCol=column+"_index").\
-            fit(dataset).\
-            transform(dataset)
-        datasets_list[index] = dataset
-
-
-non_required_columns = ["Name", "Embarked", "Sex", "Initial"]
-for index, dataset in enumerate(datasets_list):
-    dataset = dataset.drop(*non_required_columns)
-    datasets_list[index] = dataset
-
-
-training_df = datasets_list[TRAINING_DF_INDEX]
-testing_df = datasets_list[TESTING_DF_INDEX]
-
-assembler = VectorAssembler(
-    inputCols=training_df.columns[:],
-    outputCol="features")
-assembler.setHandleInvalid('skip')
-
-features_training = assembler.transform(training_df)
-(features_training, features_evaluation) =\
-    features_training.randomSplit([0.8, 0.2], seed=33)
-features_testing = assembler.transform(testing_df)
-```
diff --git a/docs/pca-python.md b/docs/pca-python.md
index e69de29..be0fc63 100644
--- a/docs/pca-python.md
+++ b/docs/pca-python.md
@@ -0,0 +1,44 @@
+## PCA API
+
+### create_image_plot
+
+```python
+create_image_plot(pca_filename, parent_filename,
+                  label_name=None, pretty_response=True)
+```
+
+* `pca_filename`: name of the file used to save the image plot
+* `parent_filename`: name of the dataset file to plot
+* `label_name`: label name for datasets with labeled tuples (default: `None`, for
+datasets without labeled tuples)
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot_filenames
+
+```python
+read_image_plot_filenames(pretty_response=True)
+```
+
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot
+
+```python
+read_image_plot(pca_filename, pretty_response=True)
+```
+
+* `pca_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### delete_image_plot
+
+```python
+delete_image_plot(pca_filename, pretty_response=True)
+```
+
+* `pca_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
diff --git a/docs/pca-rest.md b/docs/pca-rest.md
index e69de29..bd2a26c 100644
--- a/docs/pca-rest.md
+++ b/docs/pca-rest.md
@@ -0,0 +1,61 @@
+# PCA API
+
+PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance.
+
+In `scikit-learn` (used in this microservice), PCA is implemented as a transformer object that learns components in its fit method, and can be used on new data to project it onto these components.
+
+PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter `whiten = True` makes it possible to project the data onto the singular space while scaling each component to unit variance.
+
+This is often useful if the models downstream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm. More information about this algorithm can be found in the [scikit-learn PCA docs](https://scikit-learn.org/stable/modules/decomposition.html#pca).
+
+## Create an image plot
+
+`POST CLUSTER_IP:5006/images/<parent_filename>`
+
+The request uses a `parent_filename` as the dataset filename in the URL parameters; the body contains the JSON fields:
+
+```json
+{
+    "pca_filename": "image_plot_filename",
+    "label_name": "dataset_label_column"
+}
+```
+
+The `label_name` is the name of the label column for machine learning datasets which have labeled tuples. In case the dataset used doesn't contain labeled tuples, define the value as null type in JSON:
+
+```json
+{
+    "pca_filename": "image_plot_filename",
+    "label_name": null
+}
+```
+
+## Delete an image plot
+
+`DELETE CLUSTER_IP:5006/images/<filename>`
+
+Deletes an image plot from the database.
+
+## Read the filenames of the created images
+
+`GET CLUSTER_IP:5006/images`
+
+Returns a list with all created image plot filenames.
+
+## Read an image plot
+
+`GET CLUSTER_IP:5006/images/<filename>`
+
+Returns the image plot with the specified `filename`.
+
+### Image plot examples
+
+These examples use the [Titanic challenge datasets](https://www.kaggle.com/c/titanic/overview).
+
+#### Titanic Train dataset
+
+![](./pca_titanic_train.png)
+
+#### Titanic Test dataset
+
+![](./pca_titanic_test.png)
diff --git a/docs/pca.md b/docs/pca.md
deleted file mode 100644
index 7608ba3..0000000
--- a/docs/pca.md
+++ /dev/null
@@ -1,61 +0,0 @@
-# PCA API
-
-PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance.
-
-In `scikit-learn` (used in this microservice), PCA is implemented as a transformer object that learns components in its fit method, and can be used on new data to project it on these components.
-
-PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter `whiten = True` makes it possible to project the data onto the singular space while scaling each component to unit variance.
-
-This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm, more information about this algorithm in [scikit-learn PCA docs](https://scikit-learn.org/stable/modules/decomposition.html#pca).
-
-## Create an image plot
-
-`POST CLUSTER_IP:5006/images/`
-
-The request uses a `parent_filename` as a dataset filename, the body contains the json fields:
-
-```json
-{
-    "pca_filename": "image_plot_filename",
-    "label_name": "dataset_label_column"
-}
-```
-
-The `label_name` is the label name column for machine learning datasets which has labeled tuples. In the case that the dataset used doesn't contain labeled tuples, define the value as null type in JSON:
-
-```json
-{
-    "pca_filename": "image_plot_filename",
-    "label_name": null
-}
-```
-
-## Delete an image plot
-
-`DELETE CLUSTER_IP:5006/images/`
-
-Deletes an image plot from the database.
-
-## Read the filenames of the created images
-
-`GET CLUSTER_IP:5006/images`
-
-Returns a list with all created image plot filenames.
-
-## Read an image plot
-
-`GET CLUSTER_IP:5006/images/`
-
-Returns the image plot of `filename` specified.
-
-### Images plot examples
-
-This examples use the [titanic challengue datasets](https://www.kaggle.com/c/titanic/overview).
-
-#### Titanic Train dataset
-
-![](./pca_titanic_train.png)
-
-#### Titanic Test dataset
-
-![](./pca_titanic_test.png)
diff --git a/docs/projection-python.md b/docs/projection-python.md
index e69de29..c7ed36f 100644
--- a/docs/projection-python.md
+++ b/docs/projection-python.md
@@ -0,0 +1,13 @@
+## Projection API
+
+### create_projection
+
+```python
+create_projection(filename, projection_filename, fields, pretty_response=True)
+```
+
+* `filename`: name of the source file from which to make the projection
+* `projection_filename`: name of the file in which to save the projection
+* `fields`: list of fields to include in the projection
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
diff --git a/docs/projection-rest.md b/docs/projection-rest.md
index e69de29..52334a7 100644
--- a/docs/projection-rest.md
+++ b/docs/projection-rest.md
@@ -0,0 +1,14 @@
+# Projection API
+
+The `Projection API` microservice makes a projection of a file inserted through the database service, generating a new file and storing it in the database.
+
+## Create projection from an inserted file
+
+`POST CLUSTER_IP:5001/projections/<filename>`
+POST request where `filename` is the name of the file to create a projection for.
+```json
+{
+    "projection_filename" : "filename_to_save_projection",
+    "fields" : ["list", "of", "fields"]
+}
+```
diff --git a/docs/projection.md b/docs/projection.md
deleted file mode 100644
index 52334a7..0000000
--- a/docs/projection.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# Projection API
-
-`Projection API` microservice provides an API to make a projection from file inserted in database service, generating a new file and storing in database.
-
-## Create projection from an inserted file
-
-`POST CLUSTER_IP:5001/projections/`
-Post request where `filename` is the name of the file to create a projection for.
-```json -{ - "projection_filename" : "filename_to_save_projection", - "fields" : ["list", "of", "fields"] -} -``` diff --git a/docs/python-package.md b/docs/python-package.md index 9c8b64f..b64bff5 100644 --- a/docs/python-package.md +++ b/docs/python-package.md @@ -15,235 +15,3 @@ The current version of learningOrchestra offers 7 microservices, each correspond - The **[Data type](#datatype-python.md) API is a preprocessing microservice** dedicated to changing the type of data fields. - The **[Projection](#projection-python.md), [Histogram](histogram-python.md), [t-SNE](t-sne-python.md) and [PCA](pca-python.md) APIs are data exploration microservices**. They transform the map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline. - The **[Model builder](modelbuilder-python.md) API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models. - -## Database API - -### read_resume_files - -```python -read_resume_files(pretty_response=True) -``` -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) -(default `True`, if `False`, return dict) - -### read_file - -```python -read_file(filename, skip=0, limit=10, query={}, pretty_response=True) -``` - -* `filename` : name of file -* `skip`: number of rows to skip in pagination(default: `0`) -* `limit`: number of rows to return in pagination(default: `10`) -(maximum is set at `20` rows per request) -* `query`: query to make in MongoDB(default: `empty query`) -* `pretty_response`: returns indented `string` for visualization(default: `True`, returns `dict` if `False`) - -### create_file - -```python -create_file(filename, url, pretty_response=True) -``` - -* `filename`: name of file to be created -* `url`: url to CSV file -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_file - -```python -delete_file(filename, pretty_response=True) -``` - -* `filename`: name of the file to be deleted -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Projection API - -### create_projection - -```python -create_projection(filename, projection_filename, fields, pretty_response=True) -``` - -* `filename`: name of the file to make projection -* `projection_filename`: name of file used to create projection -* `fields`: list with fields to make projection -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Data type handler API - -### change_file_type - -```python -change_file_type(filename, fields_dict, pretty_response=True) -``` - -* `filename`: name of file -* `fields_dict`: dictionary with `field`:`number` or `field`:`string` keys -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Histogram API - -### create_histogram - -```python -create_histogram(filename, histogram_filename, fields, - pretty_response=True) -``` - -* `filename`: name of file to make histogram -* `histogram_filename`: name of file used to create histogram -* `fields`: list with fields to make histogram -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## t-SNE API - -### 
create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `tsne_filename`: name of file used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to -datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(tsne_filename, pretty_response=True) -``` - -* tsne_filename: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(tsne_filename, pretty_response=True) -``` - -* `tsne_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## PCA API - -### create_image_plot - -```python -create_image_plot(tsne_filename, parent_filename, - label_name=None, pretty_response=True) -``` - -* `parent_filename`: name of file to make histogram -* `pca_filename`: filename used to create image plot -* `label_name`: label name to dataset with labeled tuples (default: `None`, to -datasets without labeled tuples) -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot_filenames - -```python -read_image_plot_filenames(pretty_response=True) -``` - -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### read_image_plot - -```python -read_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -### delete_image_plot - -```python -delete_image_plot(pca_filename, pretty_response=True) -``` - -* `pca_filename`: filename of a created image plot -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -## Model builder API - -### create_model - -```python -create_model(training_filename, test_filename, preprocessor_code, - model_classificator, pretty_response=True) -``` - -* `training_filename`: name of file to be used in training -* `test_filename`: name of file to be used in test -* `preprocessor_code`: Python3 code for pyspark preprocessing model -* `model_classificator`: list of initial classificators to be used in model -* `pretty_response`: returns indented `string` for visualization -(default: `True`, returns `dict` if `False`) - -#### model_classificator - -* `lr`: LogisticRegression -* `dt`: DecisionTreeClassifier -* `rf`: RandomForestClassifier -* `gb`: Gradient-boosted tree classifier -* `nb`: NaiveBayes - -to send a request with LogisticRegression and NaiveBayes Classifiers: - -```python -create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"]) -``` - -#### preprocessor_code environment - -The Python 3 preprocessing code must use the environment instances as below: - -* `training_df` (Instantiated): Spark Dataframe instance training filename -* `testing_df` 
(Instantiated): Spark Dataframe instance testing filename
-
-The preprocessing code must instantiate the variables as below, all instances must be transformed by pyspark VectorAssembler:
-
-* `features_training` (Not Instantiated): Spark Dataframe instance for training the model
-* `features_evaluation` (Not Instantiated): Spark Dataframe instance for evaluating trained model
-* `features_testing` (Not Instantiated): Spark Dataframe instance for testing the model
-
-In case you don't want to evaluate the model, set `features_evaluation` as `None`.
-
-##### Handy methods
-
-```python
-self.fields_from_dataframe(dataframe, is_string)
-```
-
-This method returns `string` or `number` fields as a `string` list from a DataFrame.
-
-* `dataframe`: DataFrame instance
-* `is_string`: Boolean parameter(if `True`, the method returns the string DataFrame fields, otherwise, returns the numbers DataFrame fields)
diff --git a/docs/t-sne-python.md b/docs/t-sne-python.md
index e69de29..3053d5a 100644
--- a/docs/t-sne-python.md
+++ b/docs/t-sne-python.md
@@ -0,0 +1,45 @@
+
+## t-SNE API
+
+### create_image_plot
+
+```python
+create_image_plot(tsne_filename, parent_filename,
+                  label_name=None, pretty_response=True)
+```
+
+* `tsne_filename`: name of the file used to save the image plot
+* `parent_filename`: name of the dataset file to plot
+* `label_name`: label name for datasets with labeled tuples (default: `None`, for
+datasets without labeled tuples)
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot_filenames
+
+```python
+read_image_plot_filenames(pretty_response=True)
+```
+
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### read_image_plot
+
+```python
+read_image_plot(tsne_filename, pretty_response=True)
+```
+
+* `tsne_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
+
+### delete_image_plot
+
+```python
+delete_image_plot(tsne_filename, pretty_response=True)
+```
+
+* `tsne_filename`: filename of a created image plot
+* `pretty_response`: returns indented `string` for visualization
+(default: `True`, returns `dict` if `False`)
diff --git a/docs/t-sne-rest.md b/docs/t-sne-rest.md
index e69de29..975fac8 100644
--- a/docs/t-sne-rest.md
+++ b/docs/t-sne-rest.md
@@ -0,0 +1,59 @@
+# t-SNE API
+
+The T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization.
+
+It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
+
+Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. More information about this algorithm can be found on its [Wikipedia page](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).
+
+## Create an image plot
+
+`POST CLUSTER_IP:5005/images/<parent_filename>`
+
+The request uses a `parent_filename` as the filename of an inserted dataset in the URL parameters; the body contains the JSON fields:
+
+```json
+{
+    "tsne_filename": "image_plot_filename",
+    "label_name": "dataset_label_column"
+}
+```
+
+The `label_name` is the name of the label column for machine learning datasets which have labeled tuples.
In the case that the dataset used doesn't contain labeled tuples, define the value as null type in JSON:
+
+```json
+{
+    "tsne_filename": "image_plot_filename",
+    "label_name": null
+}
+```
+
+## Delete an image plot
+
+`DELETE CLUSTER_IP:5005/images/<filename>`
+
+Deletes an image plot by specifying its file name.
+
+## Read the filenames of the created images
+
+`GET CLUSTER_IP:5005/images`
+
+Returns a list with all created image plot filenames.
+
+## Read an image plot
+
+`GET CLUSTER_IP:5005/images/<filename>`
+
+Returns the image plot with the specified filename.
+
+### Image plot examples
+
+These examples use the [Titanic challenge datasets](https://www.kaggle.com/c/titanic/overview).
+
+#### Titanic Train dataset
+
+![](./tsne_titanic_train.png)
+
+#### Titanic Test dataset
+
+![](./tsne_titanic_test.png)
diff --git a/docs/t-sne.md b/docs/t-sne.md
deleted file mode 100644
index b825285..0000000
--- a/docs/t-sne.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# t-SNE API
-
-The T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization.
-
-It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
-
-Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability, more information about this algorithm in its [Wiki page]( https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).
-
-## Create an image plot
-
-`POST CLUSTER_IP:5005/images/`
-
-The request uses a `parent_filename` as a dataset inserted filename, the body contains the JSON fields:
-
-```json
-{
-    "tsne_filename": "image_plot_filename",
-    "label_name": "dataset_label_column"
-}
-```
-
-The `label_name` is the label name of the column for machine learning datasets which has labeled tuples. In the case that the dataset used doesn't contain labeled tuples, define the value as null type in json:
-
-```json
-{
-    "tsne_filename": "image_plot_filename",
-    "label_name": null
-}
-```
-
-## Delete an image plot
-
-`DELETE CLUSTER_IP:5005/images/`
-
-Deletes an image plot by specifying its file name.
-
-## Read the filenames of the created images
-
-`GET CLUSTER_IP:5005/images`
-
-Returns a list with all created images plot file name.
-
-## Read an image plot
-
-`GET CLUSTER_IP:5005/images/`
-
-Returns the image plot of the specified file name.
-
-### Image plot examples
-
-These examples use the [titanic challengue datasets](https://www.kaggle.com/c/titanic/overview).
-
-#### Titanic Train dataset
-
-![](./tsne_titanic_train.png)
-
-#### Titanic Test dataset
-
-![](./tsne_titanic_test.png)
From d7b57560da8d938413e27e899ab2345a81ad0a00 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Wed, 14 Oct 2020 20:12:47 +0200
Subject: [PATCH 19/27] Draft for description of Database microservice

---
 docs/microservices.md | 44 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index 76c7ac6..b2b2305 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -10,13 +10,48 @@ The current version of learningOrchestra offers 7 microservices:

 The microservices can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on.

 learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**.
+
+
+- [Available microservices](#available-microservices)
+  - [Database microservice](#database-microservice)
+    - [Combine the Database microservice with a GUI](#combine-the-database-microservice-with-a-gui)
+  - [Data type microservice](#data-type-microservice)
+  - [Projection microservice](#projection-microservice)
+  - [Histogram microservice](#histogram-microservice)
+  - [t-SNE microservice](#t-sne-microservice)
+  - [PCA microservice](#pca-microservice)
+  - [Model builder microservice](#model-builder-microservice)
+- [Additional information](#additional-information)
+  - [Spark Microservices](#spark-microservices)
+
+
 ## Available microservices

 ### Database microservice

-Download and handle datasets in a database.
+The Database microservice is an abstraction layer over a [MongoDB](https://www.mongodb.com/) database. MongoDB uses [NoSQL, aka non-relational, databases](https://en.wikipedia.org/wiki/NoSQL), so the data is stored as [JSON](https://www.json.org/json-en.html)-like documents.
+
+The Database microservice is organised so that each database document corresponds to a CSV file. The key of a file is its filename. The file metadata is saved as its first row.
+
+The microservice provides entry points to add a CSV file to the database, delete a CSV file from the database, retrieve the content of a CSV file in the database and list all files in the database.
+
+The Database microservice serves as a central pivot for the other microservices. They all use the Database microservice as their data source. All but the t-SNE and the PCA microservices send their results to the Database microservice to be saved.
+
 For additional details, see the [REST API](database-rest.md) and [Python package](database-python.md) documentations.

+#### Combine the Database microservice with a GUI
+
+GUI database managers like [NoSQLBooster](https://nosqlbooster.com) can interact directly with MongoDB. Using one will let you perform additional tasks which are not implemented in the Database microservice, such as schema visualization, file extraction or direct CSV or JSON download.
+
+Using a GUI is fully compatible with using the learningOrchestra Database microservice.
+
+You can connect a MongoDB-compatible GUI to your learningOrchestra database with the URL `cluster_ip:27017`, where `cluster_ip` is the IP address of an instance of your cluster. You will need to provide the following credentials:
+```
+username = root
+password = owl45#21
+```
+
 ### Data type microservice

 Change dataset fields type between number and text.
 For additional details, see the [REST API](datatype-rest.md) and [Python package](datatype-python.md) documentations.

 ### Projection microservice

 Make projections of stored datasets using Spark cluster.
 For additional details, see the [REST API](projection-rest.md) and [Python package](projection-python.md) documentations.

 ### Histogram microservice

 Make histograms of stored datasets.
 For additional details, see the [REST API](histogram-rest.md) and [Python package](histogram-python.md) documentations.

 ### t-SNE microservice

 Make a t-SNE image plot of stored datasets.
 For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentations.

 ### PCA microservice

 Make a PCA image plot of stored datasets.
 For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentations.

 ### Model builder microservice

 Create a prediction model from pre-processed datasets using Spark cluster.
 For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](modelbuilder-python.md) documentations.

 ## Additional information

 ### Spark Microservices

 The Projection, t-SNE, PCA and Model builder microservices use the Spark microservice to work.

 By default, this microservice has only one instance. In case your data processing requires more computing power, you can scale this microservice.

 To do this, with learningOrchestra already deployed, run the following in the manager machine of your Docker swarm cluster:

 `docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES`

 *\** `NUMBER_OF_INSTANCES` *is the number of Spark microservice instances which you require. Choose it according to your cluster resources and your resource requirements.*

-### Database GUI
-
-NoSQLBooster- MongoDB GUI performs several database tasks such as file visualization, queries, projections and file extraction to CSV and JSON formats.
-It can be util to accomplish some these tasks with your processed dataset or get your prediction results.
-
-Read the [Database API docs](https://learningorchestra.github.io/docs/database-api/) for more info on configuring this tool.
From 324c9e50af92076001c4133c1e0ab46e0f76b7f0 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 15:05:05 +0200
Subject: [PATCH 20/27] Draft for Data type microservice description

---
 docs/microservices.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index b2b2305..a8db8d0 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -54,7 +54,9 @@ password = owl45#21

 ### Data type microservice

-Change dataset fields type between number and text.
+The Data type microservice casts the data of a given field (= column for data organised as a table) to a new type. The microservice can cast fields into *strings* or into number types (*float* by default, *int* if appropriate).
+
+
 For additional details, see the [REST API](datatype-rest.md) and [Python package](datatype-python.md) documentations.

 ### Projection microservice
From 67e0def2af32641ad92932750bf32a615064c34e Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 15:26:29 +0200
Subject: [PATCH 21/27] Draft for Histogram microservice description

---
 docs/microservices.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index a8db8d0..bed0f35 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -66,7 +66,8 @@ For additional details, see the [REST API](projection-rest.md) and [Python packa

 ### Histogram microservice

-Make histograms of stored datasets.
+The Histogram microservice transforms the data of a given source into an aggregate with observation counts for each value bin. The aggregate data is saved into the database via the Database microservice and can then be used to generate a histogram representation of the source data.
+
 For additional details, see the [REST API](histogram-rest.md) and [Python package](histogram-python.md) documentations.

 ### t-SNE microservice
From 6c20fab26510d4dc48de67441498fe064a5cb89e Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 15:35:44 +0200
Subject: [PATCH 22/27] Draft for t-SNE microservice description

---
 docs/microservices.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index bed0f35..a991961 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -72,7 +72,12 @@ For additional details, see the [REST API](histogram-rest.md) and [Python packag

 ### t-SNE microservice

-Make a t-SNE image plot of stored datasets.
+The t-SNE microservice transforms the data of a given source using the [T-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) algorithm, generates the t-SNE graphical representations and manages the generated images.
+
+t-SNE is a machine learning algorithm for visualization of high-dimensional data. It relies on a non-linear dimensionality reduction technique to project high-dimensional data into a low-dimensional space (two or three dimensions). It models each high-dimensional object by a point in a low-dimensional space in such a way that similar objects are represented by nearby points with high probability, and conversely dissimilar objects are represented by distant points with high probability.
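+
+As a standalone illustration of the idea (a hedged sketch using scikit-learn directly, with random and purely illustrative data — this is not the microservice's own code):
+
+```python
+# Minimal t-SNE sketch: embed 100 random 50-dimensional objects into 2-D
+import numpy as np
+from sklearn.manifold import TSNE
+
+high_dim = np.random.rand(100, 50)  # 100 objects, 50 dimensions
+low_dim = TSNE(n_components=2, random_state=0).fit_transform(high_dim)
+print(low_dim.shape)                # (100, 2): one 2-D point per object
+```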
+
+The t-SNE microservice provides entry points to create and store a t-SNE graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images.
+
 For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentations.

 ### PCA microservice
From 787ff54305e453b2cd273e3fd0606acb5a73b222 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 20:04:01 +0200
Subject: [PATCH 23/27] Draft for Projection microservice description

---
 docs/microservices.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index a991961..e1f90b4 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -61,7 +61,8 @@ For additional details, see the [REST API](datatype-rest.md) and [Python package

 ### Projection microservice

-Make projections of stored datasets using Spark cluster.
+The Projection microservice is a data manipulation microservice. It provides an entry point to simplify a dataset by selecting only certain fields (= column for data organised as a table).
+
 For additional details, see the [REST API](projection-rest.md) and [Python package](projection-python.md) documentations.

 ### Histogram microservice
From ac24586c68a5f7dee5ad68594cf78c07d5dc835c Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 20:12:42 +0200
Subject: [PATCH 24/27] Draft for PCA microservice description

---
 docs/microservices.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index e1f90b4..82900d6 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -73,7 +73,7 @@

 ### t-SNE microservice

-The t-SNE microservice transforms the data of a given source using the [T-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) algorithm, generates the t-SNE graphical representations and manages the generated images.
+The t-SNE microservice transforms the data of a given source using the [T-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) algorithm, generates the t-SNE graphical representations, and manages the generated images.
@@ -83,7 +83,12 @@

 ### PCA microservice

-Make a PCA image plot of stored datasets.
+The PCA microservice decomposes the data of a given source into a set of orthogonal components that explain a maximum amount of the variance, plots the data in the space defined by those components, and manages the generated images.
+
+The implementation of this microservice relies on the [scikit-learn library](https://scikit-learn.org/stable/modules/decomposition.html#pca).
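+
+As a minimal, hedged sketch of the underlying idea (plain scikit-learn with random, purely illustrative data — not the microservice's own code):
+
+```python
+# Decompose 100 random 10-feature observations into 2 principal components
+import numpy as np
+from sklearn.decomposition import PCA
+
+data = np.random.rand(100, 10)
+pca = PCA(n_components=2)
+projected = pca.fit_transform(data)   # the data in the component space
+print(pca.explained_variance_ratio_)  # variance explained per component
+```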
+
+The PCA microservice provides entry points to create and store a PCA graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images.
+
 For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentations.

 ### Model builder microservice
From d8888a66ac75d72dd30869baad2e824b3a7cc0e0 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Fri, 16 Oct 2020 20:15:26 +0200
Subject: [PATCH 25/27] Add storage info for t-SNE and PCA

---
 docs/microservices.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index 82900d6..af9fb81 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -77,7 +77,7 @@ The t-SNE microservice transforms the data of a given source using the [T-distri

 t-SNE is a machine learning algorithm for visualization of high-dimensional data. It relies on a non-linear dimensionality reduction technique to project high-dimensional data into a low-dimensional space (two or three dimensions). It models each high-dimensional object by a point in a low-dimensional space in such a way that similar objects are represented by nearby points with high probability, and conversely dissimilar objects are represented by distant points with high probability.

-The t-SNE microservice provides entry points to create and store a t-SNE graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images.
+The t-SNE microservice provides entry points to create and store a t-SNE graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on dedicated storage in the Spark cluster rather than on the Database microservice.

 For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentations.
@@ -87,7 +87,7 @@ The PCA microservice decomposes the data of a given source into a set of orthogon

 The implementation of this microservice relies on the [scikit-learn library](https://scikit-learn.org/stable/modules/decomposition.html#pca).

-The PCA microservice provides entry points to create and store a PCA graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images.
+The PCA microservice provides entry points to create and store a PCA graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on dedicated storage in the Spark cluster rather than on the Database microservice.

 For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentations.
From f1f9f2bdbcbafcd52729981dde6b3b4ceef37ae3 Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Sat, 17 Oct 2020 17:20:52 +0200
Subject: [PATCH 26/27] Draft for Model builder microservice description

---
 docs/microservices.md | 136 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 135 insertions(+), 1 deletion(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index af9fb81..9dc444a 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -93,9 +93,143 @@ For additional details, see the [REST API](pca-rest.md) and [Python package](pca

 ### Model builder microservice

-Create a prediction model from pre-processed datasets using Spark cluster.
+The Model builder microservice is an all-in-one entry point to train, evaluate and apply classification models. It loads datasets from the Database microservice, preprocesses their content using a user-specified Python script, trains each of the specified classifiers on the training dataset, evaluates the accuracy of the trained model on an evaluation dataset, predicts the labels of the unlabelled testing dataset and saves the accuracy results and predicted labels.
+
 For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](modelbuilder-python.md) documentations.
+
+#### Available classifiers
+
+The following classifiers are currently available through the Model builder microservice, in their PySpark implementation:
+* Logistic regression
+* Decision tree classifier
+* Random forest classifier
+* Gradient-boosted tree classifier
+* Naive Bayes
+
+#### Preprocessing script
+
+The preprocessing script must be written by the user in Python 3 and include the PySpark library.
+
+:exclamation: The variable names currently used are not the names typically used in machine learning libraries. Please take care to read their descriptions to understand their actual role.
+
+The following environment instances are made available to it:
+- `training_df`: a Spark Dataframe instance holding the training-and-evaluation dataset loaded from the Database microservice,
+- `testing_df`: a Spark Dataframe instance holding the unlabelled dataset loaded from the Database microservice.
+
+The preprocessing script must rename the label column to "label" in the training-and-evaluation dataset and create a zero-valued "label" column in the unlabelled dataset.
+
+The preprocessing script must instantiate the following variables using the PySpark VectorAssembler:
+- `features_training`: Spark Dataframe instance with the preprocessed training dataset, **including** the "label" column,
+- `features_evaluation`: Spark Dataframe instance with the preprocessed evaluation dataset used to measure classification accuracy, **including** the "label" column,
+- `features_testing`: Spark Dataframe instance with the unlabelled dataset on which to apply the model, **including** the zero-valued "label" column.
+
+In case you don't want to evaluate the model, `features_evaluation` can be set to `None`.
+
+##### Example of preprocessing script
+
+This example uses the [Titanic challenge datasets](https://www.kaggle.com/c/titanic/overview).
+
+```python
+from pyspark.sql.functions import col, lit, regexp_extract, when
+from pyspark.ml.feature import StringIndexer, VectorAssembler
+
+TRAINING_DF_INDEX = 0
+TESTING_DF_INDEX = 1
+
+# Rename the label column of the labelled dataset to "label" and add a
+# zero-valued "label" column to the unlabelled dataset, as required.
+training_df = training_df.withColumnRenamed('Survived', 'label')
+testing_df = testing_df.withColumn('label', lit(0))
+datasets_list = [training_df, testing_df]
+
+# Extract each passenger's title (Mr, Miss, Master...) from the Name field.
+for index, dataset in enumerate(datasets_list):
+    dataset = dataset.withColumn(
+        "Initial",
+        regexp_extract(col("Name"), r"([A-Za-z]+)\.", 1))
+    datasets_list[index] = dataset
+
+# Normalise rare or misspelled titles to a small set of canonical ones.
+misspelled_initials = [
+    'Mlle', 'Mme', 'Ms', 'Dr',
+    'Major', 'Lady', 'Countess',
+    'Jonkheer', 'Col', 'Rev',
+    'Capt', 'Sir', 'Don'
+]
+correct_initials = [
+    'Miss', 'Miss', 'Miss', 'Mr',
+    'Mr', 'Mrs', 'Mrs',
+    'Other', 'Other', 'Other',
+    'Mr', 'Mr', 'Mr'
+]
+for index, dataset in enumerate(datasets_list):
+    dataset = dataset.replace(misspelled_initials, correct_initials)
+    datasets_list[index] = dataset
+
+# Impute missing ages with a typical age for the passenger's title group.
+initials_age = {"Miss": 22,
+                "Other": 46,
+                "Master": 5,
+                "Mr": 33,
+                "Mrs": 36}
+for index, dataset in enumerate(datasets_list):
+    for initial, initial_age in initials_age.items():
+        dataset = dataset.withColumn(
+            "Age",
+            when((dataset["Initial"] == initial) &
+                 (dataset["Age"].isNull()), initial_age).otherwise(
+                dataset["Age"]))
+    datasets_list[index] = dataset
+
+# Fill in the missing embarkation ports with the most common port.
+for index, dataset in enumerate(datasets_list):
+    dataset = dataset.na.fill({"Embarked": 'S'})
+    datasets_list[index] = dataset
+
+# Derive the family size and whether the passenger travelled alone.
+for index, dataset in enumerate(datasets_list):
+    dataset = dataset.withColumn("Family_Size", col('SibSp') + col('Parch'))
+    dataset = dataset.withColumn('Alone', lit(0))
+    dataset = dataset.withColumn(
+        "Alone",
+        when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
+    datasets_list[index] = dataset
+
+# Encode the categorical text fields as numeric indexes.
+text_fields = ["Sex", "Embarked", "Initial"]
+for column in text_fields:
+    for index, dataset in enumerate(datasets_list):
+        dataset = StringIndexer(
+            inputCol=column, outputCol=column + "_index").\
+            fit(dataset).\
+            transform(dataset)
+        datasets_list[index] = dataset
+
+# Drop the raw fields that were replaced by engineered columns.
+non_required_columns = ["Name", "Embarked", "Sex", "Initial"]
+for index, dataset in enumerate(datasets_list):
+    dataset = dataset.drop(*non_required_columns)
+    datasets_list[index] = dataset
+
+training_df = datasets_list[TRAINING_DF_INDEX]
+testing_df = datasets_list[TESTING_DF_INDEX]
+
+# Assemble the remaining columns into a single "features" vector,
+# skipping rows that contain invalid values.
+assembler = VectorAssembler(
+    inputCols=training_df.columns[:],
+    outputCol="features")
+assembler.setHandleInvalid('skip')
+
+# Split the labelled data roughly 80/20 into training and evaluation sets.
+features_training = assembler.transform(training_df)
+(features_training, features_evaluation) = \
+    features_training.randomSplit([0.8, 0.2], seed=33)
+features_testing = assembler.transform(testing_df)
+```
+
+
 ## Additional information
 
 ### Spark Microservices
 
From c816126820c328b1e23ef379ba782db901fd64be Mon Sep 17 00:00:00 2001
From: LaChapeliere
Date: Sat, 17 Oct 2020 17:26:50 +0200
Subject: [PATCH 27/27] Draft for spark additional info

---
 docs/microservices.md | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/docs/microservices.md b/docs/microservices.md
index 9dc444a..47cef4e 100644
--- a/docs/microservices.md
+++ b/docs/microservices.md
@@ -63,6 +63,8 @@ For additional details, see the [REST API](datatype-rest.md) and [Python package
 
 The Projection microservice is a data manipulation microservice. It provides an entry point to simplify a dataset by selecting only certain fields (i.e. columns, for data organised as a table).
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
 For additional details, see the [REST API](projection-rest.md) and [Python package](projection-python.md) documentation.
 
 ### Histogram microservice
@@ -79,6 +81,8 @@ t-SNE is a machine learning algorithm for visualization of high-dimensional data
 
 The t-SNE microservice provides entry points to create and store a t-SNE graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on dedicated storage in the Spark cluster rather than on the Database microservice.
 
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
 For additional details, see the [REST API](t-sne-rest.md) and [Python package](t-sne-python.md) documentation.
 
 ### PCA microservice
@@ -89,12 +93,16 @@ The PCA microservice decomposes the data of a given source into a set of orthogon
 
 The implementation of this microservice relies on the [scikit-learn library](https://scikit-learn.org/stable/modules/decomposition.html#pca).
 
 The PCA microservice provides entry points to create and store a PCA graphical representation of a dataset, to list all the images previously stored by the microservice, to download one of these images, and to delete one of these images. It relies on dedicated storage in the Spark cluster rather than on the Database microservice.
 
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
 For additional details, see the [REST API](pca-rest.md) and [Python package](pca-python.md) documentation.
 
 ### Model builder microservice
 
 The Model builder microservice is an all-in-one entry point to train, evaluate and apply classification models. It loads datasets from the Database microservice, preprocesses their content with a user-supplied Python script, trains each of the specified classifiers on the training dataset, evaluates the accuracy of each trained model on an evaluation dataset, predicts the labels of the unlabelled testing dataset, and saves the accuracy results and predicted labels.
 
+It runs as a Spark microservice and can be spread over [several instances](#spark-microservices).
+
 For additional details, see the [REST API](modelbuilder-rest.md) and [Python package](modelbuilder-python.md) documentation.
 
 #### Available classifiers
@@ -229,16 +237,13 @@ features_training = assembler.transform(training_df)
 features_testing = assembler.transform(testing_df)
 ```
 
-
 ## Additional information
 
 ### Spark Microservices
 
 The Projection, t-SNE, PCA and Model builder microservices use the Spark microservice to work.
 
-By default, this microservice has only one instance. In case your data processing requires more computing power, you can scale this microservice.
-
-To do this, with learningOrchestra already deployed, run the following in the manager machine of your Docker swarm cluster:
+By default, this microservice has only one instance. In case you require more computing power, you can scale this microservice by running the following command on the manager machine of the swarm cluster on which learningOrchestra is deployed:
 
 `docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES`
 
-*\** `NUMBER_OF_INSTANCES` *is the number of Spark microservice instances which you require. Choose it according to your cluster resources and your resource requirements.*
+where `NUMBER_OF_INSTANCES` is the number of Spark microservice instances that you require, chosen according to your cluster resources and your computing power needs.
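+
+For example, `docker service scale microservice_sparkworker=3` runs three Spark worker instances (an illustrative value); you can check the resulting replica count with `docker service ls`.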