Why build a custom environment?
Most RL tutorials stop at classic control tasks. I wanted to push further and simulate something tangible: a Franka Panda wrist flicking a pan. Building my own environment forced me to think about reward design, state representation, and reproducibility at a much deeper level.
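To make "building my own environment" concrete, here is a minimal sketch of the interface I'm describing, written against the Gymnasium API. The class name `PanFlipEnv`, the observation layout, and the placeholder dynamics are illustrative assumptions, not the actual simulator code from my project.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class PanFlipEnv(gym.Env):
    """Toy sketch of a pan-flip environment.

    The physics below is a numpy placeholder standing in for the real
    simulator; only the interface (spaces, seeding, reset/step) mirrors
    what the actual environment needs to provide.
    """

    def __init__(self, max_steps: int = 200):
        self.max_steps = max_steps
        # Observation: wrist angle/velocity plus pan orientation/angular velocity.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        # Action: a single wrist torque command, normalised to [-1, 1].
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self._state = np.zeros(4, dtype=np.float32)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random, so resets are reproducible
        self._t = 0
        # Small randomised initial pose so the policy cannot memorise one trajectory.
        self._state = self.np_random.uniform(-0.05, 0.05, size=4).astype(np.float32)
        return self._state.copy(), {}

    def step(self, action):
        torque = float(np.clip(action[0], -1.0, 1.0))
        dt = 0.02
        # Placeholder dynamics: integrate the wrist and couple it into the pan.
        self._state[1] += torque * dt            # wrist angular velocity
        self._state[0] += self._state[1] * dt    # wrist angle
        self._state[3] += 0.5 * torque * dt      # pan angular velocity
        self._state[2] += self._state[3] * dt    # pan orientation

        self._t += 1
        reward = -abs(self._state[2] - np.pi)    # crude distance-to-flipped term
        terminated = abs(self._state[2] - np.pi) < 0.1
        truncated = self._t >= self.max_steps
        return self._state.copy(), reward, terminated, truncated, {}
```

Keeping the placeholder dynamics this small is deliberate: it lets you unit-test reward and reset logic before wiring in the real simulator.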
Reward shaping without falling into traps
The biggest challenge was balancing dense rewards with a sparse success signal. I started with a basic waypoint reward, layered in velocity constraints, and eventually introduced a shaped term for pan orientation. Every additional signal was validated with TensorBoard runs to ensure I wasn't encouraging reward-hacking behaviours.
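The sketch below shows the shape of that composite reward: a dense waypoint term, a velocity penalty, a pan-orientation term, and a sparse success bonus. The specific weights and term definitions are illustrative assumptions; the values I actually settled on came out of the TensorBoard validation described above.

```python
import numpy as np


def pan_flip_reward(wrist_pos, waypoint, pan_quat, target_quat, wrist_vel, flipped,
                    w_way=1.0, w_orient=0.5, w_vel=0.05, success_bonus=10.0):
    """Composite reward sketch: dense shaping terms plus a sparse success bonus.

    Weights are illustrative; in practice each term was added one at a time
    and its effect checked in TensorBoard for reward-hacking behaviour.
    """
    # Dense waypoint term: negative distance from the wrist to the current waypoint.
    r_waypoint = -np.linalg.norm(wrist_pos - waypoint)

    # Orientation term: alignment of the pan with the target orientation,
    # 1 when aligned and 0 when opposite (unit quaternions assumed).
    r_orient = np.abs(np.dot(pan_quat, target_quat))

    # Velocity penalty: discourage violent wrist motions that exploit the shaping.
    r_vel = -np.sum(np.square(wrist_vel))

    # Sparse success signal: only fires when the flip actually lands.
    r_success = success_bonus if flipped else 0.0

    return w_way * r_waypoint + w_orient * r_orient + w_vel * r_vel + r_success
```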
Takeaways
- Simulator fidelity matters less than consistent resets.
- Logging physical metrics (pan angular velocity, wrist torque) prevented me from chasing ghosts; see the logging sketch after this list.
- PPO remained the most stable baseline, but SAC won once I dialled in entropy regularisation (sketched below).
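For the metric logging, here is a minimal sketch using Stable-Baselines3's callback API. It assumes the environment reports the quantities in its `info` dict under the keys shown; those key names are hypothetical and should be adapted to whatever your env actually exposes.

```python
from stable_baselines3.common.callbacks import BaseCallback


class PanMetricsCallback(BaseCallback):
    """Log pan angular velocity and wrist torque to TensorBoard each step.

    Assumes the env puts these values in its info dict; the keys below are
    placeholders, not part of any standard API.
    """

    def _on_step(self) -> bool:
        for info in self.locals.get("infos", []):
            if "pan_ang_vel" in info:
                self.logger.record("metrics/pan_ang_vel", info["pan_ang_vel"])
            if "wrist_torque" in info:
                self.logger.record("metrics/wrist_torque", info["wrist_torque"])
        return True  # returning False would abort training early
```

And for the entropy regularisation point, a sketch of SAC with automatic entropy-coefficient tuning in Stable-Baselines3. The hyperparameters are illustrative, not the values I converged on.

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    PanFlipEnv(),              # the toy env sketched earlier, or your real one
    ent_coef="auto",           # let SB3 tune the entropy temperature alpha
    target_entropy="auto",     # default heuristic: -dim(action_space)
    learning_rate=3e-4,
    tensorboard_log="./runs",  # same TensorBoard runs used to sanity-check rewards
    verbose=1,
)
model.learn(total_timesteps=200_000, callback=PanMetricsCallback())
```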
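Letting the entropy coefficient be tuned automatically was the single change that mattered most in my runs: with a fixed coefficient, SAC either collapsed to deterministic wrist motions too early or never committed to the flip.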