27 commits
ff7c14c Update repo path in README (LaChapeliere, Oct 9, 2020)
ba3fd57 Update repo path in CONTRIBUTING (LaChapeliere, Oct 9, 2020)
28ded97 Add short introduction to README (LaChapeliere, Oct 9, 2020)
5d55c3d Update website plan (LaChapeliere, Oct 9, 2020)
97b2570 About page inspired by main repo readme (LaChapeliere, Oct 9, 2020)
af6a930 Add support page redirecting to each repo (LaChapeliere, Oct 9, 2020)
debd6ad Change syntax of internal links (LaChapeliere, Oct 9, 2020)
3ae59fd Update install page (LaChapeliere, Oct 9, 2020)
9318342 Update structure to add pages to define each microservice (LaChapeliere, Oct 9, 2020)
28acfd4 Add and update page titles (LaChapeliere, Oct 9, 2020)
b2f67bd Moving python package info, no update of content (LaChapeliere, Oct 9, 2020)
074ea96 Removing outdated structure (LaChapeliere, Oct 9, 2020)
0e9178e Change structure again (LaChapeliere, Oct 9, 2020)
9e198f4 Cleaning (LaChapeliere, Oct 9, 2020)
cbce315 Add page for microservice description (LaChapeliere, Oct 9, 2020)
d8592a5 First draft of microservices description (LaChapeliere, Oct 9, 2020)
e8e789f Split structure for python package (LaChapeliere, Oct 9, 2020)
cac9ac6 Split info and clean (LaChapeliere, Oct 9, 2020)
d7b5756 Draft for description of Database microservice (LaChapeliere, Oct 14, 2020)
324c9e5 Draft for Data type microservice description (LaChapeliere, Oct 16, 2020)
67e0def Draft for Histogram microservice description (LaChapeliere, Oct 16, 2020)
6c20fab Draft for t-SNE microservice description (LaChapeliere, Oct 16, 2020)
787ff54 Draft for Projection microservice description (LaChapeliere, Oct 16, 2020)
ac24586 Draft for PCA microservice description (LaChapeliere, Oct 16, 2020)
d8888a6 Add storage info for t-SNE and PCA (LaChapeliere, Oct 16, 2020)
f1f9f2b Draft for Model builder microservice description (LaChapeliere, Oct 17, 2020)
c816126 Draft for spark additional info (LaChapeliere, Oct 17, 2020)
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -14,7 +14,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember

## Pull Request process

1. [Fork](https://github.com/riibeirogabriel/learningOrchestra/fork) the repository
1. [Fork](https://github.com/learningOrchestra/docs/fork) the repository
2. [Clone](https://git-scm.com/docs/git-clone) your fork to your local environment
3. Navigate into the project root directory
4. Create a new [branch](https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging), the branch should be named according to what feature/fix you're implementing
@@ -26,7 +26,7 @@ When creating an issue for a bug or other undefined/unwanted behaviour remember
7. Create a Pull Request

Remember to describe what feature or fix you're implementing in the Pull Request window.
In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement.
In the Pull Request window remember to include a quick summary of what the committed code does and how it is an improvement.

After the Pull Request the repository owner will review your request.\
Be patient, if they require you to make changes to your request, do so.
@@ -37,4 +37,4 @@ Don't be rude, use crude language, or harass other users.

## License
The repository is currently licensed under GNU General Public License v3.0.
By contributing to the project you agree that your contributions will be licensed under the same license and provisions.
By contributing to the project you agree that your contributions will be licensed under the same license and provisions.
20 changes: 9 additions & 11 deletions README.md
@@ -2,16 +2,14 @@

# learningOrchestra Docs

To make changes clone the repo:

`git clone https://github.com/learningOrchestra/learningOrchestra-docs.git`

`cd learningOrchestra-docs`

Install mkdocs:

This repository contains the files to generate the user documentation of [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra). The content of the documentation is created manually and can be found in [the docs folder](https://github.com/learningOrchestra/docs/tree/main/docs). The documentation website is created with [MkDocs](https://www.mkdocs.org/).

To make changes please read the [contributing guide](https://github.com/learningOrchestra/docs/blob/main/CONTRIBUTING.md) then:
1. Clone the repo
`git clone https://github.com/learningOrchestra/docs.git`
2. Move inside the repo
`cd docs`
3. Install mkdocs
`pip install mkdocs`

Run:

4. Run MkDocs
`mkdocs serve`
13 changes: 13 additions & 0 deletions docs/about.md
@@ -0,0 +1,13 @@
# About learningOrchestra

Nowadays, **data science relies on a wide range of computer science skills**, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia.

This situation can constitute a barrier to the actual extraction of new knowledge from collected data,
which is why the last two decades have seen more efforts to facilitate and streamline the development of
data mining workflows. The tools created can be sorted into two categories: **high-level** tools facilitate
the building of **automatic data processing pipelines** (e.g. [Weka](https://www.cs.waikato.ac.nz/ml/weka/))
while **low-level** ones support the setup of appropriate physical and virtual infrastructure (e.g. [Spark](https://spark.apache.org/)).

However, this landscape is still missing a tool that **encompasses all steps and needs of a typical data science project**. This is where learningOrchestra comes in.

Read our [first research monograph](https://drive.google.com/file/d/1ZDrTR58pBuobpgwB_AOOFTlfmZEY6uQS/view) (under construction) to know more about the research behind the project.
43 changes: 43 additions & 0 deletions docs/database-python.md
@@ -0,0 +1,43 @@
## Database API

### read_resume_files

```python
read_resume_files(pretty_response=True)
```
* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)

### read_file

```python
read_file(filename, skip=0, limit=10, query={}, pretty_response=True)
```

* `filename`: name of the file
* `skip`: number of rows to skip in pagination (default: `0`)
* `limit`: number of rows to return in pagination (default: `10`, maximum: `20` rows per request)
* `query`: MongoDB query to apply (default: empty query)
* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)
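
A minimal usage sketch (the client object `database_api` is assumed here to stand for an already-initialised `learning-orchestra-client` object, and the filename and query values are purely illustrative):

```python
# Sketch only: `database_api` stands for an already-initialised client object from
# the learning-orchestra-client package; filename and query values are illustrative.
result = database_api.read_file(
    "titanic_training",        # a file previously inserted in the database
    skip=0,                    # start at the first row
    limit=10,                  # at most 20 rows per request
    query={"Survived": "1"},   # MongoDB-style filter; field and value are examples
    pretty_response=False,     # return a dict instead of an indented string
)
print(result)
```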

### create_file

```python
create_file(filename, url, pretty_response=True)
```

* `filename`: name of the file to be created
* `url`: URL of the CSV file
* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)

### delete_file

```python
delete_file(filename, pretty_response=True)
```

* `filename`: name of the file to be deleted
* `pretty_response`: returns an indented `string` for visualization (default: `True`; returns a `dict` if `False`)
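
Putting these calls together, a hedged end-to-end sketch (again assuming an already-initialised client object `database_api`; the URL and filename are examples, not part of the documented API):

```python
# Sketch only: `database_api` is an assumed, already-initialised client object from
# the learning-orchestra-client package; the URL and filename below are examples.
csv_url = "https://example.com/titanic_training.csv"   # hypothetical CSV location

database_api.create_file("titanic_training", csv_url)  # download and insert the dataset
print(database_api.read_resume_files())                # list stored files (indented string)
database_api.delete_file("titanic_training")           # remove the dataset again
```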
10 changes: 5 additions & 5 deletions docs/database-api.md → docs/database-rest.md
@@ -1,12 +1,12 @@
# Database API
# Database API

The **Database API** microservice creates a level of abstraction through a REST API.
The **Database API** microservice creates a level of abstraction through a REST API.

Using MongoDB, datasets are downloaded in CSV format and parsed into JSON documents, where the primary key for each document is the `filename` field contained in the JSON body of the POST request.

## GUI tool to handle database files

There are GUI tools for handling database files. For example, [NoSQLBooster](https://nosqlbooster.com) can interact with the MongoDB instance used by the database and perform several tasks that the `learning-orchestra-client` package does not cover, such as schema visualization and extracting and downloading files to formats such as CSV and JSON.

You can also easily browse all inserted files and visualize each row of a given file. To use this tool, connect to the URL `cluster_ip:27017` with the credentials:

@@ -91,7 +91,7 @@ Returns an array of metadata files from the database, where each file contains a
* `F1` - F1 score of the model
* `accuracy` - Accuracy of the model prediction
* `classificator` - Initials of the classifier used
* `filename` - Name of the file
* `filename` - Name of the file
* `fit_time` - Time taken for the model to be fit during training

## List file content
@@ -112,7 +112,7 @@ The first row in the query is always the metadata file.
`POST CLUSTER_IP:5000/files`

Insert a CSV into the database using the POST method; the JSON must be contained in the body of the HTTP request.
The following fields are required:
The following fields are required:

```json
{
12 changes: 12 additions & 0 deletions docs/datatype-python.md
@@ -0,0 +1,12 @@
## Data type handler API

### change_file_type

```python
change_file_type(filename, fields_dict, pretty_response=True)
```

* `filename`: name of the file
* `fields_dict`: dictionary with `field`:`number` or `field`:`string` pairs, setting the new type of each field
* `pretty_response`: returns an indented `string` for visualization
(default: `True`; returns a `dict` if `False`)
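
A hedged usage sketch (the client object `data_type_api` is assumed to stand for an already-initialised `learning-orchestra-client` object; the filename and field names are illustrative):

```python
# Sketch only: `data_type_api` is an assumed, already-initialised client object from
# the learning-orchestra-client package; the filename and field names are examples.
data_type_api.change_file_type(
    "titanic_training",                     # a file previously inserted in the database
    {"Age": "number", "Name": "string"},    # fields_dict: field name -> "number" or "string"
    pretty_response=False,                  # return a dict instead of an indented string
)
```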
File renamed without changes.
15 changes: 15 additions & 0 deletions docs/histogram-python.md
@@ -0,0 +1,15 @@

## Histogram API

### create_histogram

```python
create_histogram(filename, histogram_filename, fields,
pretty_response=True)
```

* `filename`: name of the file to compute the histogram from
* `histogram_filename`: name of the file in which the resulting histogram is stored
* `fields`: list of fields to include in the histogram
* `pretty_response`: returns an indented `string` for visualization
(default: `True`; returns a `dict` if `False`)
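
A hedged usage sketch (the client object `histogram_api` is assumed to stand for an already-initialised `learning-orchestra-client` object; the filenames and field are illustrative):

```python
# Sketch only: `histogram_api` is an assumed, already-initialised client object from
# the learning-orchestra-client package; the filenames and field below are examples.
histogram_api.create_histogram(
    "titanic_training",         # source file to compute the histogram from
    "titanic_histogram_age",    # name under which the resulting histogram is stored
    ["Age"],                    # fields to include in the histogram
    pretty_response=True,       # indented string output (default)
)
```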
File renamed without changes.
31 changes: 18 additions & 13 deletions docs/index.md
@@ -1,18 +1,23 @@
# learningOrchestra Docs
# learningOrchestra user documentation

**learningOrchestra** is a distributed processing tool that facilitates and streamlines iterative processes in a Data Science project pipeline like:
learningOrchestra aims to facilitate the development of complex data mining workflows by **seamlessly interfacing different data science tools and services**. From a single interoperable Application Programming Interface (API), users can **design their analytical pipelines and deploy them in an environment with the appropriate capabilities**.

* Data Gathering
* Data Cleaning
* Model Building
* Validating the Model
* Presenting the Results
learningOrchestra is designed for data scientists from both engineering and academic backgrounds, so that they can **focus on the discovery of new knowledge** in their data rather than on library or maintenance issues.

With learningOrchestra, you can:
learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to **gather data**, **clean data**, **train machine learning models**, **tune machine learning models**, **evaluate machine learning models** and **visualize data and results**.

* load a dataset from an URL (in CSV format).
* accomplish several pre-processing tasks with datasets.
* create highly customised model predictions against a specific dataset by providing their own pre-processing code.
* build prediction models with different classifiers simultaneously using a spark cluster transparently.
The current version of learningOrchestra offers 7 microservices:
- The **Database API is the central microservice**. It holds all the data, including the analysis results.
- The **Data type API is a preprocessing microservice** dedicated to changing the type of data fields.
- The **Projection, Histogram, t-SNE and PCA APIs are data exploration microservices**. They map the data into new representation spaces so it can be visualized. They can be used on the raw data as well as on the intermediate and final results of the analysis pipeline.
- The **Model builder API is the main analysis microservice**. It includes some preprocessing features and machine learning features to train models, evaluate models and predict information using trained models.

And so much more! Check the [Usage](https://learningorchestra.github.io/docs/usage/) section for more.
The microservices can be called from any computer, including one that is not part of the cluster on which learningOrchestra is deployed. learningOrchestra provides two options to access its features: a **microservice REST API** and a **Python package**.

Use this documentation to [learn more about the learningOrchestra project](about.md), [learn how to install and deploy learningOrchestra on a cluster](install.md), learn how to use the [REST APIs](rest-apis.md) and [Python package](python-package.md) to access learningOrchestra microservices, or [find options to get support](support.md).

You can also visit the repositories of the learningOrchestra project:
- [learningOrchestra](https://github.com/learningOrchestra/learningOrchestra) for the definition of the microservices and the REST APIs,
- [learningOrchestra-python-client](https://github.com/learningOrchestra/learningOrchestra-python-client) for the Python package,
- [docs](https://github.com/learningOrchestra/docs) for the content of the present documentation, and
- [learningOrchestra.github.io](https://github.com/learningOrchestra/learningOrchestra.github.io) for the code of the present website.
70 changes: 70 additions & 0 deletions docs/install.md
@@ -0,0 +1,70 @@
# Install and deploy learningOrchestra on a cluster

:bell: This documentation assumes that the users are familiar with a number of advanced computer science concepts. We have tried to link to learning resources to support beginners, as well as introduce some of the concepts in the [last section](#concepts). But if something is still not clear, don't hesitate to [ask for help](support.md).

## Setting up your cluster

learningOrchestra operates from a [cluster](#what-is-a-cluster) of Docker [containers](#what-is-a-container).

All your hosts must operate under Linux distributions and have [Docker Engine](https://docs.docker.com/engine/install/) installed.

Configure your cluster in [swarm mode](https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/). Install [Docker Compose](https://docs.docker.com/compose/install/) on your manager instance.

You are ready to deploy! :tada:

## Deploy learningOrchestra

Clone the main learningOrchestra repository on your manager instance.
- Using HTTP protocol, `git clone https://github.com/learningOrchestra/learningOrchestra.git`
- Using SSH protocol, `git clone git@github.com:learningOrchestra/learningOrchestra.git`
- Using GitHub CLI, `gh repo clone learningOrchestra/learningOrchestra`

Move to the root of the directory, `cd learningOrchestra`.

Deploy with `sudo ./run.sh`. The deployment should take around a dozen minutes.

### Interrupt learningOrchestra

Run `docker stack rm microservice`.

### Check cluster status

To check the deployed microservices and machines of your cluster, open `CLUSTER_IP:80` in a browser, where *CLUSTER_IP* is the external IP of a machine in your cluster.

The same can be done to check the Spark cluster state at `CLUSTER_IP:8080`.

## Install-and-deploy questions

###### My computer runs on Windows/OSX, can I still use learningOrchestra?

You can use the microservices that run on a cluster where learningOrchestra is already deployed, but you **cannot deploy learningOrchestra** itself.

###### I have a single computer, can I still use learningOrchestra?

Theoretically, you can, if your machine has 12 GB of RAM, a quad-core processor and 100 GB of disk space. However, a single machine won't be able to cope with the computing demands of a real-life-sized dataset.

###### What happens if learningOrchestra is killed while using a microservice?

If your cluster fails while a microservice is processing data, the task may be lost. Some failures might corrupt the database.

If no processing was in progress when your cluster failed, learningOrchestra will automatically re-deploy and reboot the affected microservices.

###### What happens if my instances lose their connection to each other?

If the connection between cluster instances is lost, learningOrchestra will try to re-deploy the microservices from the lost instances on the remaining active instances of the cluster.

## Concepts

###### What is a container?

Containers are software units that package code together with everything needed to run it, so that the code can run easily in any environment. They also isolate the code from the rest of the machine. They are [often compared to shipping containers](https://www.ctl.io/developers/blog/post/docker-and-shipping-containers-a-useful-but-imperfect-analogy).

###### What is a cluster?

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. (From [Wikipedia](https://en.wikipedia.org/wiki/Computer_cluster))

###### What are microservices?

Microservices - also known as the microservice architecture - are an architectural style that structures an application as a collection of services that are highly maintainable and testable, loosely coupled, independently deployable, organized around business capabilities, and owned by a small team.

[An overview of microservice architecture](https://medium.com/hashmapinc/the-what-why-and-how-of-a-microservices-architecture-4179579423a9)
38 changes: 0 additions & 38 deletions docs/installation.md

This file was deleted.
