From c9517d7cbf1d518eef0447e1507d4c1d6bb444ae Mon Sep 17 00:00:00 2001
From: Hyperdraw <41306289+Hyperdraw@users.noreply.github.com>
Date: Sun, 24 Jul 2022 15:49:28 -0500
Subject: [PATCH 1/8] Copy design-checklist.md to design/README.md so all
 design documents can be kept in a directory

---
 design/README.md | 6 ++++++
 1 file changed, 6 insertions(+)
 create mode 100644 design/README.md

diff --git a/design/README.md b/design/README.md
new file mode 100644
index 0000000..f30bc9f
--- /dev/null
+++ b/design/README.md
@@ -0,0 +1,6 @@
+# Design Documents Checklist
+
+* [ ] Simple generic RL model (TensorFlow)
+* [ ] RL iteration process (using the model and a Gym environment)
+* [ ] Flappy Bird Gym environment (or use `flappy-bird-gym`?)
+* [ ] Other Gym environment (maybe a somewhat more complex game)

From 594f5e041bc768017abcc2ce73b194007668bcdf Mon Sep 17 00:00:00 2001
From: Hyperdraw <41306289+Hyperdraw@users.noreply.github.com>
Date: Sun, 24 Jul 2022 15:58:54 -0500
Subject: [PATCH 2/8] Delete design-checklist.md since it was moved to
 `design/README.md`

---
 design-checklist.md | 6 ------
 1 file changed, 6 deletions(-)
 delete mode 100644 design-checklist.md

diff --git a/design-checklist.md b/design-checklist.md
deleted file mode 100644
index 026fa7f..0000000
--- a/design-checklist.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# Design Checklist
-
-* [ ] Simple generic RL model (TensorFlow)
-* [ ] RL iteration process (using the model and a Gym environment)
-* [ ] Flappy Bird Gym environment (or use `flappy-bird-gym`?)
-* [ ] Other Gym environment (maybe a somewhat more complex game)

From 87abd4322ea2f6318e6c5cb4271dd2130fe3ba02 Mon Sep 17 00:00:00 2001
From: Hyperdraw <41306289+Hyperdraw@users.noreply.github.com>
Date: Sun, 24 Jul 2022 16:24:59 -0500
Subject: [PATCH 3/8] Add design document for reinforcement loop

---
 design/reinforcement-loop.md | 43 ++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 design/reinforcement-loop.md

diff --git a/design/reinforcement-loop.md b/design/reinforcement-loop.md
new file mode 100644
index 0000000..0499517
--- /dev/null
+++ b/design/reinforcement-loop.md
@@ -0,0 +1,43 @@
+# The Reinforcement Iteration Loop
+
+This document describes the main learning loop for the project.
+
+## Prerequisites
+
+* A Gym environment with a discrete action space
+* An objective function that can provide rewards and punishments (e.g. how long the agent survives after an action)
+* A TensorFlow model suited for reinforcement learning (see the design document checklist)
+
+## Concept
+
+The reinforcement loop should run multiple games, each one gathering information about which actions are successful in various states.
+Between game runs, the model will be retrained to make use of the newly gathered data.
+Because the model will start untrained and inaccurate, it would likely get stuck performing the same action every timestep.
+Therefore, a randomness factor, `epsilon`, gives each timestep a chance of a random action being chosen rather than using the model.
+As the model gains new data, it should become more accurate, so `epsilon` can be slowly decreased.
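The concept above maps directly to code. Below is a minimal sketch in plain Python; `env`, `model`, and `retrain` are hypothetical stand-ins for the Gym environment, the TensorFlow model, and its training routine (for brevity, `env.step` is assumed to return just `(state, reward, done)`, whereas a real Gym `step` also returns an `info` dict):

```python
import random

# Minimal sketch of the reinforcement loop described above. `env`, `model`,
# and `retrain` are hypothetical stand-ins: `env.step(action)` is assumed to
# return (state, reward, done), and `model(state)` a list of per-action scores.

def choose_action(model, state, n_actions, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the model's best."""
    if random.random() <= epsilon:
        return random.randrange(n_actions)
    scores = model(state)
    return max(range(n_actions), key=lambda a: scores[a])

def run_games(env, model, retrain, n_games, n_actions, epsilon_decrease=0.05):
    epsilon = 1.0
    state_history, value_history = [], []
    for _ in range(n_games):
        states, actions, values = [], [], []
        state = env.reset()
        done = False
        while not done:
            action = choose_action(model, state, n_actions, epsilon)
            states.append(state)                 # the state the action was taken in
            actions.append(action)
            values.append([0.0] * n_actions)     # one score slot per action
            state, reward, done = env.step(action)
            if reward != 0:
                # Propagate the reward/punishment to every recorded step
                for entry, a in zip(values, actions):
                    entry[a] += reward
        state_history.extend(states)
        value_history.extend(values)
        retrain(state_history, value_history)    # retrain on the full history
        epsilon = max(0.0, epsilon - epsilon_decrease)
    return state_history, value_history
```

Whether `epsilon` decreases linearly (as here) or exponentially is a tuning choice left open by this design.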
+
+## Pseudocode
+
+* Let `epsilon` start at `1`
+* Let `epsilon_decrease` be a small positive number less than 1 (a hyperparameter)
+* Let `state_history` be an empty list
+* Let `value_history` be an empty list
+* Let the model begin untrained
+* For every game run:
+    * Let `states` be an empty list
+    * Let `actions` be an empty list
+    * Let `values` be an empty list
+    * For every timestep:
+        * Pick a random number between 0 and 1. If it is less than or equal to `epsilon`:
+            * Perform a random action from the action space
+        * Otherwise, if it is greater than `epsilon`:
+            * Give the current game state to the model to predict the scores of each action. Perform the action with the highest score.
+        * Append the performed action to `actions`
+        * Append the game state to `states`
+        * Append an entry to `values` where every action's score is 0
+        * If there is a reward/punishment:
+            * Change each entry in `values` so that the score of the action performed for that entry (based on the record in `actions`) is changed by the reward/punishment value (positive for reward, negative for punishment)
+    * Append all entries in `states` to `state_history`
+    * Append all entries in `values` to `value_history`
+    * Retrain the model with `state_history` as the inputs and `value_history` as the outputs
+    * Decrease `epsilon` by `epsilon_decrease`

From da43157db7a8d697245deb13a23d7e097701677f Mon Sep 17 00:00:00 2001
From: Hyperdraw <41306289+Hyperdraw@users.noreply.github.com>
Date: Sun, 24 Jul 2022 16:27:05 -0500
Subject: [PATCH 4/8] Check off RL iteration process design doc

---
 design/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/design/README.md b/design/README.md
index f30bc9f..28e9384 100644
--- a/design/README.md
+++ b/design/README.md
@@ -1,6 +1,6 @@
 # Design Documents Checklist
 
 * [ ] Simple generic RL model (TensorFlow)
-* [ ] RL iteration process (using the model and a Gym environment)
+* [X] [RL iteration process (using the model and a Gym environment)](reinforcement-loop.md)
 * [ ] Flappy Bird Gym environment (or use `flappy-bird-gym`?)
 * [ ] Other Gym environment (maybe a somewhat more complex game)

From e3728eab4621ce7f5839e2b255f7529de0033b53 Mon Sep 17 00:00:00 2001
From: Hyperdraw <41306289+Hyperdraw@users.noreply.github.com>
Date: Mon, 25 Jul 2022 11:09:15 -0500
Subject: [PATCH 5/8] Create model-explanation.txt

---
 design/model-explanation.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 design/model-explanation.txt

diff --git a/design/model-explanation.txt b/design/model-explanation.txt
new file mode 100644
index 0000000..eb2935d
--- /dev/null
+++ b/design/model-explanation.txt
@@ -0,0 +1,17 @@
+      frame0              frame1
+        |                   |
+        \/                  \/
+initial state -> Cell -> state 1 -> Cell -> state 2
+                  |                   |
+                  \/                  \/
+              discarded            output
+
+
+Conv2D -> Conv2D -> Conv2D -> LSTM
+
+Input shape: (frames, width, height, channels)
+Output shape: (actions,)
+
+Conv2D(width, height, channels) -> (width, height, filters)
+
+(frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,)

From 588eedd24d690b70e713b90f1ff333e210e741f3 Mon Sep 17 00:00:00 2001
From: bestgradient <68491459+bestgradient@users.noreply.github.com>
Date: Mon, 25 Jul 2022 11:52:07 -0500
Subject: [PATCH 6/8] Create model.md

---
 design/model.md | 7 +++++++
 1 file changed, 7 insertions(+)
 create mode 100644 design/model.md

diff --git a/design/model.md b/design/model.md
new file mode 100644
index 0000000..9f17b27
--- /dev/null
+++ b/design/model.md
@@ -0,0 +1,7 @@
+This is the design doc for the model of this project.
+
+# Input and output
+Atari games and other games like Flappy Bird return an RGB image for the current frame. **The input shape is going to be (frames, width, height, channels).**
+`frames` is a hyperparameter that determines how many frames are used for training. The frames come from the last few timesteps of a single episode.
+Width and height depend on the environment the model trains on. Channels is set to 3 for RGB images and 1 for grayscale images.
+**The output shape is going to be (actions,)**, where `actions` is the number of actions in the discrete action space.

From b20d2ba4628b11e68402eb9c359e5c0b18618a4f Mon Sep 17 00:00:00 2001
From: bestgradient <68491459+bestgradient@users.noreply.github.com>
Date: Mon, 25 Jul 2022 13:56:01 -0500
Subject: [PATCH 7/8] Update model.md

---
 design/model.md | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/design/model.md b/design/model.md
index 9f17b27..ea61af6 100644
--- a/design/model.md
+++ b/design/model.md
@@ -1,7 +1,30 @@
 This is the design doc for the model of this project.
 
 # Input and output
-Atari games and other games like Flappy Bird return an RGB image for the current frame. **The input shape is going to be (frames, width, height, channels).**
+Atari games and other games like Flappy Bird return an RGB image for the current frame.
+**The input shape is going to be (frames, width, height, channels).**
 `frames` is a hyperparameter that determines how many frames are used for training. The frames come from the last few timesteps of a single episode.
 Width and height depend on the environment the model trains on. Channels is set to 3 for RGB images and 1 for grayscale images.
 **The output shape is going to be (actions,)**, where `actions` is the number of actions in the discrete action space.
+
+# Layers
+- Layers are going to be a combination of **CNN** and **LSTM**. Because a single frame provides just one still picture of the game, the model would not be able to recognize the motion of the game from it alone. Thus, by sending the input (frames, width, height, channels) into convolution layers, the model is able to detect the state of the game across frames, e.g. whether Flappy Bird was flapping up or down.
+
+# Diagram
+Using the Keras library, Conv2D is going to be used for the convolutional layers. A simple diagram would look like: Conv2D(width, height, channels) -> (width, height, filters).
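One caveat worth noting for the mapping above (an observation about Keras, not something this design specifies): Keras's `Conv2D` defaults to `padding='valid'`, which shrinks each spatial dimension by `kernel_size - 1`; the shape-preserving `(width, height, channels) -> (width, height, filters)` mapping holds with `padding='same'`. An illustrative helper, not project code:

```python
# Illustrative helper (not project code): the output shape of one Conv2D
# layer with stride 1. The diagram's shape-preserving mapping assumes
# padding='same'; Keras's default padding='valid' shrinks width and height
# by kernel_size - 1. The channel count always becomes `filters`.

def conv2d_output_shape(width, height, filters, kernel_size=3, padding="same"):
    if padding == "same":
        return (width, height, filters)
    return (width - kernel_size + 1, height - kernel_size + 1, filters)

print(conv2d_output_shape(84, 84, filters=32))                   # (84, 84, 32)
print(conv2d_output_shape(84, 84, filters=32, padding="valid"))  # (82, 82, 32)
```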
+
+The diagram for the model would look like: (frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,)
+
+After the input goes through the convolutional layers, its output is going to be in the shape (frames, width, height, filters), where filters is a hyperparameter. Because LSTM only accepts **2-D arrays as its input**, the output of Conv2D has to be reshaped into **(frames, width * height * filters)**. But LSTM also expects **a 1-D array for each frame** when it runs through its cells internally. For example, if the input is (2, X), then frame 0 is the 1-D slice of shape (X,) at index 0, and frame 1 is the slice at index 1.
+
+- Diagram for LSTM
+```
+      frame0              frame1
+        |                   |
+        \/                  \/
+initial state -> Cell -> state 1 -> Cell -> state 2
+                  |                   |
+                  \/                  \/
+              discarded            output
+```

From 653e9d0935c49562caed224c5be5ca65e8890756 Mon Sep 17 00:00:00 2001
From: bestgradient <68491459+bestgradient@users.noreply.github.com>
Date: Mon, 25 Jul 2022 14:09:23 -0500
Subject: [PATCH 8/8] Update model.md

---
 design/model.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/design/model.md b/design/model.md
index ea61af6..7fc7816 100644
--- a/design/model.md
+++ b/design/model.md
@@ -28,3 +28,8 @@ initial state -> Cell -> state 1 -> Cell -> state 2
                   \/                  \/
               discarded            output
 ```
+# Hyperparameters
+- number of Conv2Ds
+- number of Conv2D filters
+- Conv2D kernel size
+- number of LSTM units
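The LSTM stage in the diagram above — each frame updates the cell, intermediate outputs are discarded, only the final output is kept — can be illustrated with a standalone NumPy version of one LSTM cell. The weights here are random placeholders, purely illustrative; in Keras this behavior corresponds to `LSTM(units)` with the default `return_sequences=False`:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W, U, b):
    """One cell step from the diagram: consume frame x, update state (h, c)."""
    units = h.shape[0]
    z = W @ x + U @ h + b                 # all four gate pre-activations at once
    i = sigmoid(z[:units])                # input gate
    f = sigmoid(z[units:2 * units])       # forget gate
    o = sigmoid(z[2 * units:3 * units])   # output gate
    g = np.tanh(z[3 * units:])            # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Random placeholder weights -- illustrative only, not a trained model.
rng = np.random.default_rng(0)
frames, features, units = 2, 4, 3         # features stands in for width * height * filters
W = rng.normal(size=(4 * units, features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)                       # initial state (output part)
c = np.zeros(units)                       # initial state (cell part)
for x in rng.normal(size=(frames, features)):  # frame0, frame1, ...
    h, c = lstm_cell(x, h, c, W, U, b)         # earlier outputs are discarded

final_output = h                          # shape (units,)
```

For the pipeline to end at (actions,), `units` (one of the hyperparameters listed above) would be set to the number of actions, or a small Dense head would map the final output to it; the design leaves that choice open.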