intro

here, i will explain and implement GRPO (group relative policy optimization) in an intuitive way.
prerequisites:
- you should be familiar with neural networks and gradient descent
- you should be familiar with pytorch
it's okay if you are not familiar with deep RL

terms to remember

a little vocabulary goes a long way toward understanding this topic.

policy: in deep RL, the policy is the neural network that looks at the observations from the environment and predicts the action to be taken. this is what we are training.
environment: the world that the policy interacts with, which could be anything from a video game to a robotic simulation to a real-world setting. the environment provides observations and rewards.
reward: a numerical value that tells the agent (the policy acting in the environment) how good or bad its last action or sequence of actions was.
trajectory: the sequence of states, actions, and rewards collected during a single episode, i.e. one run of the environment from start to finish.
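
to make these four terms concrete, here is a minimal sketch in plain python. everything in it is an assumption made up for illustration: `ToyEnv` is a hypothetical 1-D world, and `policy` is a hand-written rule standing in for the neural network we would actually train. it only shows how the pieces fit together in one episode, not how GRPO works.

```python
import random


class ToyEnv:
    """hypothetical environment: start at position 0, the episode ends at +3 or -3."""

    def reset(self):
        self.pos = 0
        return self.pos  # the first observation

    def step(self, action):
        # action 1 moves right, action 0 moves left
        self.pos += 1 if action == 1 else -1
        done = abs(self.pos) >= 3
        reward = 1.0 if self.pos >= 3 else 0.0  # reward only for reaching +3
        return self.pos, reward, done


def policy(observation, rng):
    # stand-in for the neural network: ignores the observation
    # and moves right 80% of the time
    return 1 if rng.random() < 0.8 else 0


def collect_trajectory(env, rng, max_steps=50):
    trajectory = []  # list of (observation, action, reward) tuples
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs, rng)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = collect_trajectory(ToyEnv(), rng)
total_reward = sum(reward for _, _, reward in trajectory)
print(f"steps: {len(trajectory)}, total reward: {total_reward}")
```

the loop inside `collect_trajectory` is the part worth remembering: observe, act, receive a reward, repeat until the episode ends. training is about changing the policy so that the total reward of the trajectories it produces goes up.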

cartpole is a simple environment that we'll use for easy explanation: