Overview
This project explores Reinforcement Learning (RL) techniques for autonomous agent navigation in a GridWorld environment. The implementation covers the classical tabular algorithms Q-Learning and SARSA, demonstrating how an agent learns an optimal navigation policy through trial and error.
Key Features
- GridWorld Environment: Custom grid-based environment with obstacles, rewards, and terminal states
- Q-Learning Algorithm: Model-free, off-policy temporal difference learning
- SARSA Algorithm: On-policy temporal difference control method
- Performance Visualization: Real-time visualization of learning progress and policy evolution
- Comparative Analysis: Detailed comparison between different learning algorithms
Technical Implementation
Algorithms
Q-Learning
- Off-policy temporal difference learning
- Learns the optimal action-value function Q*(s,a)
- Uses an epsilon-greedy exploration strategy (see the sketch after this list)
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
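As a concrete illustration of the exploration strategy above, here is a minimal epsilon-greedy selection sketch over a NumPy Q-table; the names Q, state, epsilon, and n_actions are placeholders rather than the project's actual identifiers.

# Illustrative epsilon-greedy action selection for a tabular Q-function
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    # Explore with probability epsilon, otherwise exploit the current greedy action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))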
SARSA (State-Action-Reward-State-Action)
- On-policy temporal difference learning
- Learns the action-value function of the policy it actually follows (see the episode sketch after this list)
- Tends toward more conservative behaviour near penalised states, since exploratory actions feed directly into its own update targets
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]
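To make the on-policy distinction concrete, here is a minimal sketch of one SARSA episode, assuming a Gym-style step() interface and the epsilon_greedy helper sketched above; env, Q, alpha, gamma, epsilon, and n_actions are assumed placeholders, not the project's actual API.

# Illustrative SARSA episode loop (env, Q, alpha, gamma, epsilon, n_actions are assumed placeholders)
state = env.reset()
action = epsilon_greedy(Q, state, epsilon, n_actions)
done = False
while not done:
    next_state, reward, done, _ = env.step(action)
    # On-policy: choose the next action with the behaviour policy *before* updating
    next_action = epsilon_greedy(Q, next_state, epsilon, n_actions)
    Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])
    state, action = next_state, next_action

The bootstrapped value is Q[next_state, next_action] for the action the agent will actually take next, which is exactly where SARSA diverges from Q-Learning's max over next actions.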
Environment Design
The GridWorld environment features the following (a minimal skeleton is sketched after the list):
- Grid Size: Configurable NxN grid layout
- Obstacles: Static barriers the agent must navigate around
- Rewards: Positive rewards for goal states, negative for obstacles
- Terminal States: Goal positions that end episodes
- Action Space: {Up, Down, Left, Right}
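The sketch below shows one way such an environment can be structured with a Gym-style reset/step interface; the class layout, reward values, and defaults are illustrative assumptions, not the project's exact implementation.

# Illustrative GridWorld skeleton (class layout, rewards, and defaults are assumptions)
class GridWorld:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right

    def __init__(self, size=5, obstacles=(), goal=(4, 4)):
        self.size, self.obstacles, self.goal = size, set(obstacles), goal
        self.n_states, self.n_actions = size * size, len(self.ACTIONS)

    def _index(self, pos):
        # Flatten (row, col) into one integer id so states can index a NumPy Q-table
        return pos[0] * self.size + pos[1]

    def reset(self):
        self.pos = (0, 0)
        return self._index(self.pos)

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        if (row, col) in self.obstacles:
            return self._index(self.pos), -1.0, False, {}  # blocked: stay put, small penalty
        self.pos = (row, col)
        if self.pos == self.goal:
            return self._index(self.pos), 10.0, True, {}   # goal reached: terminal state
        return self._index(self.pos), -0.1, False, {}      # per-step cost favours short paths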
Technologies Used
- Python 3.x: Core implementation language
- NumPy: Numerical computations and Q-table management
- Matplotlib: Visualization of learning curves and policies
- OpenAI Gym (optional): Standard RL environment interface
Results & Analysis
The project demonstrates:
- Convergence Analysis: How both algorithms converge to optimal policies
- Sample Efficiency: Comparison of learning speed between Q-Learning and SARSA
- Exploration vs Exploitation: Impact of epsilon decay strategies (an example schedule is sketched below)
- Performance Metrics: Episode rewards, steps to goal, convergence time
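As an example of the epsilon decay mentioned above, a common exponential schedule with a floor is sketched below; the constants and the n_episodes name are illustrative, not the values used in this project.

# Illustrative exponential epsilon decay with a minimum floor (constants are assumptions)
epsilon, epsilon_min, decay_rate = 1.0, 0.05, 0.995
n_episodes = 500
for episode in range(n_episodes):
    # ... run one training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)  # decay once per episode

Early episodes explore almost uniformly at random, while later episodes increasingly exploit the learned Q-values.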
Key Findings
- Q-Learning tends to find optimal policies faster but can be more unstable during training
- SARSA shows more stable learning curves with conservative policy updates
- Both algorithms successfully learn to navigate complex grid layouts with proper hyperparameter tuning
Implementation Highlights
import numpy as np

# Simplified Q-Learning update (off-policy: bootstraps from the greedy next action)
def q_learning_update(state, action, reward, next_state):
    max_next_q = np.max(Q[next_state])  # best value over all actions in the next state
    Q[state, action] += alpha * (reward + gamma * max_next_q - Q[state, action])

# Simplified SARSA update (on-policy: bootstraps from the action actually taken next)
def sarsa_update(state, action, reward, next_state, next_action):
    next_q = Q[next_state, next_action]
    Q[state, action] += alpha * (reward + gamma * next_q - Q[state, action])
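For context, the sketch below shows how such update functions might be driven from a training loop, reusing the epsilon_greedy helper and GridWorld skeleton sketched earlier; all names and hyperparameter values are assumed placeholders rather than the project's actual code.

# Illustrative Q-Learning training loop (all names and values are assumed placeholders)
env = GridWorld(size=5, obstacles={(1, 1), (2, 3)}, goal=(4, 4))
Q = np.zeros((env.n_states, env.n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
for episode in range(500):
    state, done = env.reset(), False
    while not done:
        action = epsilon_greedy(Q, state, epsilon, env.n_actions)
        next_state, reward, done, _ = env.step(action)
        q_learning_update(state, action, reward, next_state)  # off-policy TD update
        state = next_state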
Learning Outcomes
This project provided hands-on experience with:
- Fundamental reinforcement learning concepts and algorithms
- Balancing exploration and exploitation in RL systems
- Implementing and comparing different temporal difference methods
- Hyperparameter tuning for optimal learning performance
- Visualization and analysis of learning dynamics
Future Enhancements
- Implement Deep Q-Networks (DQN) for larger state spaces
- Add more complex RL algorithms (A3C, PPO, SAC)
- Create dynamic environments with moving obstacles
- Multi-agent reinforcement learning scenarios
- Transfer learning between different grid configurations
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction
- Watkins, C. J., & Dayan, P. (1992). Q-learning
- Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems
Project Status: Completed
Date: November 2024
Course: Reinforcement Learning