Unless I'm mistaken, there is something odd about the main training loop (Listing 8.13) for the Super Mario game in Chapter 8. The way that the current x-position is checked against the min_progress parameter makes no sense to me.
More precisely: in line 23 of the main training loop, the environment step is taken (6 times) and last_x_pos is set to the current x-position:
state2, e_reward_, done, info = env.step(action)
last_x_pos = info['x_pos']
In the lines that follow, neither last_x_pos nor info['x_pos'] is changed. Then, in line 33, the two are compared with each other:
if episode_length > params['max_episode_len']:
    if (info['x_pos'] - last_x_pos) < params['min_progress']:
        done = True
    else:
        last_x_pos = info['x_pos']
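Isn't info['x_pos'] - last_x_pos always going to be zero here? Assuming min_progress is positive, the comparison would then always be true, so the environment would be reset as soon as episode_length > params['max_episode_len'], no matter how far Mario has actually moved.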
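To illustrate what I mean, here is a minimal sketch that replays the check as I read it from the listing. It does not use the real environment: the list of x-positions, the parameter values, and the helper name run_as_written are all made up for illustration, and info is just a plain dict standing in for what env.step returns.

```python
# Hypothetical parameter values, chosen only for this illustration.
params = {'max_episode_len': 3, 'min_progress': 15}


def run_as_written(x_positions, params):
    """Replay the progress check against a scripted list of x-positions.

    Returns the step at which done becomes True (or the total number of
    steps if it never does) and the differences that were compared.
    """
    episode_length = 0
    diffs = []
    for x in x_positions:
        info = {'x_pos': x}         # stand-in for the info dict from env.step
        last_x_pos = info['x_pos']  # assignment right after the step, as in line 23
        episode_length += 1
        done = False
        if episode_length > params['max_episode_len']:
            diffs.append(info['x_pos'] - last_x_pos)
            if (info['x_pos'] - last_x_pos) < params['min_progress']:
                done = True
            else:
                last_x_pos = info['x_pos']
        if done:
            break
    return episode_length, diffs


# Mario moves 40 units per step here, yet the check still fires.
print(run_as_written([40, 80, 120, 160, 200], params))
```

Even though the scripted agent advances steadily, the compared difference is 0 and done is set on the first step past max_episode_len.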
What is the min_progress parameter meant to be, intuitively? The progress from the beginning to the end of one episode? The progress from time 0 until max_episode_len? Or the progress relative to a certain checkpoint within a certain amount of time? If so, how are these checkpoints chosen?
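For reference, this is how I would sketch the checkpoint interpretation. It is only my guess at the intent, not the book's method: the function name steps_until_reset, the start_x argument, and the decision to restart the step counter whenever a checkpoint is passed are my own assumptions.

```python
def steps_until_reset(x_positions, max_episode_len, min_progress, start_x=0):
    """Return the 1-based step at which the episode would be reset, or None.

    Assumption: the checkpoint is only updated when the check passes, and
    the step counter restarts at that moment, so Mario must gain at least
    min_progress x-units in every window of max_episode_len + 1 steps.
    """
    checkpoint = start_x  # x-position at the last successful check
    window = 0            # steps taken since the last successful check
    for step, x in enumerate(x_positions, start=1):
        window += 1
        if window > max_episode_len:
            if x - checkpoint < min_progress:
                return step   # this is where done would be set to True
            checkpoint = x    # enough progress: move the checkpoint forward
            window = 0        # assumed: start a new window
    return None


# Mario advances for four steps and then gets stuck.
print(steps_until_reset([10, 20, 30, 40, 41, 42, 43, 44], 3, 15))
```

In this version the first check passes (40 units gained since the start) and the second fails (only 4 units gained since the checkpoint at 40), so the reset happens on step 8.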
This has not become clear to me yet, neither from the book nor from the code.