intro
- here, i will explain and implement GRPO (group relative policy optimization) in an intuitive way
- prerequisites:
  - you should be familiar with neural networks and gradient descent
  - you should be familiar with pytorch
  - it's okay if you are not familiar with deep RL
terms to remember
a little vocabulary goes a long way toward understanding this topic.
- policy: in deep RL, the policy is the neural network that looks at the observations from the environment and predicts the action to be taken. this is what we are training.
- environment: the world that the policy interacts with, which could be anything from a video game to a robotic simulation to a real-world setting. the environment provides observations and rewards.
- reward: a numerical value that tells the agent (the policy acting in the environment) how good or bad its last action or sequence of actions was.
- trajectory: the sequence of states, actions, and rewards recorded over a single episode, from the first observation until the episode ends.
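to make these four terms concrete, here is a minimal sketch of how they fit together in one loop. everything in it is a hypothetical stand-in made up for illustration: `ToyEnvironment` is a tiny 1-D world (not cartpole), `random_policy` takes the place of the neural network, and the `reset`/`step` method shapes are an assumption, not the API of any particular library.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """one entry of a trajectory: what was seen, what was done, what was earned."""
    observation: int
    action: int
    reward: float


class ToyEnvironment:
    """hypothetical 1-D world: start at position 0, try to reach the goal."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # start a new episode and return the first observation
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # action 1 moves right, any other action moves left
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reached = self.position == self.goal
        reward = 1.0 if reached else 0.0  # reward only for reaching the goal
        done = reached or self.steps >= self.max_steps
        return self.position, reward, done


def random_policy(observation, rng):
    # stand-in for the neural network: ignores the observation, picks at random
    return rng.choice([0, 1])


def collect_trajectory(env, policy, rng):
    # run one episode and record every (observation, action, reward) step
    trajectory = []
    observation = env.reset()
    done = False
    while not done:
        action = policy(observation, rng)
        next_observation, reward, done = env.step(action)
        trajectory.append(Step(observation, action, reward))
        observation = next_observation
    return trajectory


rng = random.Random(0)
trajectory = collect_trajectory(ToyEnvironment(), random_policy, rng)
total_reward = sum(step.reward for step in trajectory)
print(f"steps: {len(trajectory)}, total reward: {total_reward}")
```

training replaces `random_policy` with a network whose parameters are adjusted so that trajectories with higher total reward become more likely.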
environment
cartpole environment
cartpole is a simple environment that we'll use for easy explanation: