Overview
This project explores Reinforcement Learning (RL) techniques for autonomous agent navigation in a GridWorld environment. The implementation covers the classical tabular algorithms Q-Learning and SARSA, demonstrating how an agent learns an optimal navigation policy through trial and error.
Key Features
- GridWorld Environment: Custom grid-based environment with obstacles, rewards, and terminal states
- Q-Learning Algorithm: Model-free, off-policy temporal difference learning
- SARSA Algorithm: On-policy temporal difference control method
- Performance Visualization: Real-time visualization of learning progress and policy evolution
- Comparative Analysis: Detailed comparison between different learning algorithms
Technical Implementation
Algorithms
Q-Learning
- Off-policy temporal difference learning
- Learns the optimal action-value function Q*(s,a)
- Uses an epsilon-greedy exploration strategy (see the sketch after this list)
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
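As a concrete illustration of the exploration strategy above, here is a minimal epsilon-greedy selection sketch over a NumPy Q-table; the names Q, state, epsilon, and n_actions are placeholders rather than the project's actual identifiers.

# Illustrative epsilon-greedy action selection for a tabular Q-function
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    # Explore with probability epsilon, otherwise exploit the current greedy action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))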
SARSA (State-Action-Reward-State-Action)
- On-policy temporal difference learning
- Learns the action-value function of the policy it actually follows (see the episode sketch after this list)
- Tends toward more conservative behaviour near penalised states, since exploratory actions feed directly into its own update targets
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]
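To make the on-policy distinction concrete, here is a minimal sketch of one SARSA episode, assuming a Gym-style step() interface and the epsilon_greedy helper sketched above; env, Q, alpha, gamma, epsilon, and n_actions are assumed placeholders, not the project's actual API.

# Illustrative SARSA episode loop (env, Q, alpha, gamma, epsilon, n_actions are assumed placeholders)
state = env.reset()
action = epsilon_greedy(Q, state, epsilon, n_actions)
done = False
while not done:
    next_state, reward, done, _ = env.step(action)
    # On-policy: choose the next action with the behaviour policy *before* updating
    next_action = epsilon_greedy(Q, next_state, epsilon, n_actions)
    Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])
    state, action = next_state, next_action

The bootstrapped value is Q[next_state, next_action] for the action the agent will actually take next, which is exactly where SARSA diverges from Q-Learning's max over next actions.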
Environment Design
The GridWorld environment features the following (a minimal skeleton is sketched after the list):
- Grid Size: Configurable NxN grid layout
- Obstacles: Static barriers the agent must navigate around
- Rewards: Positive rewards for goal states, negative for obstacles
- Terminal States: Goal positions that end episodes
- Action Space: {Up, Down, Left, Right}
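The sketch below shows one way such an environment can be structured with a Gym-style reset/step interface; the class layout, reward values, and defaults are illustrative assumptions, not the project's exact implementation.

# Illustrative GridWorld skeleton (class layout, rewards, and defaults are assumptions)
class GridWorld:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right

    def __init__(self, size=5, obstacles=(), goal=(4, 4)):
        self.size, self.obstacles, self.goal = size, set(obstacles), goal
        self.n_states, self.n_actions = size * size, len(self.ACTIONS)

    def _index(self, pos):
        # Flatten (row, col) into one integer id so states can index a NumPy Q-table
        return pos[0] * self.size + pos[1]

    def reset(self):
        self.pos = (0, 0)
        return self._index(self.pos)

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        if (row, col) in self.obstacles:
            return self._index(self.pos), -1.0, False, {}  # blocked: stay put, small penalty
        self.pos = (row, col)
        if self.pos == self.goal:
            return self._index(self.pos), 10.0, True, {}   # goal reached: terminal state
        return self._index(self.pos), -0.1, False, {}      # per-step cost favours short paths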
Technologies Used
- Python 3.x: Core implementation language
- NumPy: Numerical computations and Q-table management
- Matplotlib: Visualization of learning curves and policies
- OpenAI Gym (optional): Standard RL environment interface
Results & Analysis
The project demonstrates:
- Convergence Analysis: How both algorithms converge to optimal policies
- Sample Efficiency: Comparison of learning speed between Q-Learning and SARSA
- Exploration vs Exploitation: Impact of epsilon decay strategies (an example schedule is sketched below)
- Performance Metrics: Episode rewards, steps to goal, convergence time
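As an example of the epsilon decay mentioned above, a common exponential schedule with a floor is sketched below; the constants and the n_episodes name are illustrative, not the values used in this project.

# Illustrative exponential epsilon decay with a minimum floor (constants are assumptions)
epsilon, epsilon_min, decay_rate = 1.0, 0.05, 0.995
n_episodes = 500
for episode in range(n_episodes):
    # ... run one training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)  # decay once per episode

Early episodes explore almost uniformly at random, while later episodes increasingly exploit the learned Q-values.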
Key Findings
- Q-Learning tends to find optimal policies faster but can be more unstable during training
- SARSA shows more stable learning curves with conservative policy updates
- Both algorithms successfully learn to navigate complex grid layouts with proper hyperparameter tuning
Implementation Highlights
import numpy as np

# Simplified Q-Learning update (off-policy: bootstraps from the greedy next action)
def q_learning_update(state, action, reward, next_state):
    max_next_q = np.max(Q[next_state])  # best value over all actions in the next state
    Q[state, action] += alpha * (reward + gamma * max_next_q - Q[state, action])

# Simplified SARSA update (on-policy: bootstraps from the action actually taken next)
def sarsa_update(state, action, reward, next_state, next_action):
    next_q = Q[next_state, next_action]
    Q[state, action] += alpha * (reward + gamma * next_q - Q[state, action])
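For context, the sketch below shows how such update functions might be driven from a training loop, reusing the epsilon_greedy helper and GridWorld skeleton sketched earlier; all names and hyperparameter values are assumed placeholders rather than the project's actual code.

# Illustrative Q-Learning training loop (all names and values are assumed placeholders)
env = GridWorld(size=5, obstacles={(1, 1), (2, 3)}, goal=(4, 4))
Q = np.zeros((env.n_states, env.n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
for episode in range(500):
    state, done = env.reset(), False
    while not done:
        action = epsilon_greedy(Q, state, epsilon, env.n_actions)
        next_state, reward, done, _ = env.step(action)
        q_learning_update(state, action, reward, next_state)  # off-policy TD update
        state = next_state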
Learning Outcomes
This project provided hands-on experience with:
- Fundamental reinforcement learning concepts and algorithms
- Balancing exploration and exploitation in RL systems
- Implementing and comparing different temporal difference methods
- Hyperparameter tuning for optimal learning performance
- Visualization and analysis of learning dynamics
Future Enhancements
- Implement Deep Q-Networks (DQN) for larger state spaces
- Add more complex RL algorithms (A3C, PPO, SAC)
- Create dynamic environments with moving obstacles
- Multi-agent reinforcement learning scenarios
- Transfer learning between different grid configurations
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction
- Watkins, C. J., & Dayan, P. (1992). Q-learning
- Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems
Project Status: Completed
Date: November 2024
Course: Reinforcement Learning