Why build a custom environment?
Most RL tutorials stop at classic control tasks. I wanted to push further and simulate something tangible: a Franka Panda wrist flicking a pan. Building my own environment forced me to think about reward design, state representation, and reproducibility at a much deeper level.
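To make "building my own environment" concrete, here is a minimal sketch of the interface I'm describing, written against the Gymnasium API. The class name `PanFlipEnv`, the observation layout, and the placeholder dynamics are illustrative assumptions, not the actual simulator code from my project.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class PanFlipEnv(gym.Env):
    """Toy sketch of a pan-flip environment.

    The physics below is a numpy placeholder standing in for the real
    simulator; only the interface (spaces, seeding, reset/step) mirrors
    what the actual environment needs to provide.
    """

    def __init__(self, max_steps: int = 200):
        self.max_steps = max_steps
        # Observation: wrist angle/velocity plus pan orientation/angular velocity.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        # Action: a single wrist torque command, normalised to [-1, 1].
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self._state = np.zeros(4, dtype=np.float32)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random, so resets are reproducible
        self._t = 0
        # Small randomised initial pose so the policy cannot memorise one trajectory.
        self._state = self.np_random.uniform(-0.05, 0.05, size=4).astype(np.float32)
        return self._state.copy(), {}

    def step(self, action):
        torque = float(np.clip(action[0], -1.0, 1.0))
        dt = 0.02
        # Placeholder dynamics: integrate the wrist and couple it into the pan.
        self._state[1] += torque * dt            # wrist angular velocity
        self._state[0] += self._state[1] * dt    # wrist angle
        self._state[3] += 0.5 * torque * dt      # pan angular velocity
        self._state[2] += self._state[3] * dt    # pan orientation

        self._t += 1
        reward = -abs(self._state[2] - np.pi)    # crude distance-to-flipped term
        terminated = abs(self._state[2] - np.pi) < 0.1
        truncated = self._t >= self.max_steps
        return self._state.copy(), reward, terminated, truncated, {}
```

Keeping the placeholder dynamics this small is deliberate: it lets you unit-test reward and reset logic before wiring in the real simulator.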
Reward shaping without falling into traps
The biggest challenge was balancing dense rewards with a sparse success signal. I started with a basic waypoint reward, layered in velocity constraints, and eventually introduced a shaped term for pan orientation. Every additional signal was validated with TensorBoard runs to ensure I wasn't encouraging reward-hacking behaviours.
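The sketch below shows the shape of that composite reward: a dense waypoint term, a velocity penalty, a pan-orientation term, and a sparse success bonus. The specific weights and term definitions are illustrative assumptions; the values I actually settled on came out of the TensorBoard validation described above.

```python
import numpy as np


def pan_flip_reward(wrist_pos, waypoint, pan_quat, target_quat, wrist_vel, flipped,
                    w_way=1.0, w_orient=0.5, w_vel=0.05, success_bonus=10.0):
    """Composite reward sketch: dense shaping terms plus a sparse success bonus.

    Weights are illustrative; in practice each term was added one at a time
    and its effect checked in TensorBoard for reward-hacking behaviour.
    """
    # Dense waypoint term: negative distance from the wrist to the current waypoint.
    r_waypoint = -np.linalg.norm(wrist_pos - waypoint)

    # Orientation term: alignment of the pan with the target orientation,
    # 1 when aligned and 0 when opposite (unit quaternions assumed).
    r_orient = np.abs(np.dot(pan_quat, target_quat))

    # Velocity penalty: discourage violent wrist motions that exploit the shaping.
    r_vel = -np.sum(np.square(wrist_vel))

    # Sparse success signal: only fires when the flip actually lands.
    r_success = success_bonus if flipped else 0.0

    return w_way * r_waypoint + w_orient * r_orient + w_vel * r_vel + r_success
```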
Takeaways
- Simulator fidelity matters less than consistent resets.
- Logging physical metrics (pan angular velocity, wrist torque) prevented me from chasing ghosts; see the logging sketch after this list.
- PPO remained the most stable baseline, but SAC won once I dialled in entropy regularisation (sketched below).
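For the metric logging, here is a minimal sketch using Stable-Baselines3's callback API. It assumes the environment reports the quantities in its `info` dict under the keys shown; those key names are hypothetical and should be adapted to whatever your env actually exposes.

```python
from stable_baselines3.common.callbacks import BaseCallback


class PanMetricsCallback(BaseCallback):
    """Log pan angular velocity and wrist torque to TensorBoard each step.

    Assumes the env puts these values in its info dict; the keys below are
    placeholders, not part of any standard API.
    """

    def _on_step(self) -> bool:
        for info in self.locals.get("infos", []):
            if "pan_ang_vel" in info:
                self.logger.record("metrics/pan_ang_vel", info["pan_ang_vel"])
            if "wrist_torque" in info:
                self.logger.record("metrics/wrist_torque", info["wrist_torque"])
        return True  # returning False would abort training early
```

And for the entropy regularisation point, a sketch of SAC with automatic entropy-coefficient tuning in Stable-Baselines3. The hyperparameters are illustrative, not the values I converged on.

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    PanFlipEnv(),              # the toy env sketched earlier, or your real one
    ent_coef="auto",           # let SB3 tune the entropy temperature alpha
    target_entropy="auto",     # default heuristic: -dim(action_space)
    learning_rate=3e-4,
    tensorboard_log="./runs",  # same TensorBoard runs used to sanity-check rewards
    verbose=1,
)
model.learn(total_timesteps=200_000, callback=PanMetricsCallback())
```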
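Letting the entropy coefficient be tuned automatically was the single change that mattered most in my runs: with a fixed coefficient, SAC either collapsed to deterministic wrist motions too early or never committed to the flip.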