- The agent is trained with the Q-learning algorithm using a deep Q-network (DQN). This is the basic DQN, without extensions such as prioritized experience replay or Double DQN.
- The deep Q-network is a value-based method that learns to approximate the optimal action-value function for a given state.
- The agent follows an epsilon-greedy policy, where epsilon decays as training progresses (a minimal sketch follows the hyperparameter list below).
- BUFFER_SIZE = int(1e5) # replay buffer size
- BATCH_SIZE = 64 # minibatch size
- GAMMA = 0.99 # discount factor
- TAU = 1e-3 # for soft update of target parameters
- LR = 5e-4 # learning rate
- UPDATE_EVERY = 4 # how often to update the network
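A rough sketch of the epsilon-greedy action selection and the TAU-driven soft update of the target network is shown below. The `act` and `soft_update` function names, the decay schedule values, and the standalone-function layout are illustrative assumptions, not code taken from this repository:

```python
import random
import numpy as np
import torch

def act(qnetwork_local, state, eps, device="cpu"):
    """Epsilon-greedy action selection: random action with probability eps,
    otherwise the greedy action from the local Q-network."""
    state_t = torch.from_numpy(np.asarray(state, dtype=np.float32)).unsqueeze(0).to(device)
    qnetwork_local.eval()
    with torch.no_grad():
        action_values = qnetwork_local(state_t)
    qnetwork_local.train()
    if random.random() > eps:
        return int(action_values.argmax(dim=1).item())
    return random.randrange(action_values.shape[1])

def soft_update(local_model, target_model, tau=1e-3):
    """Soft update of target parameters: theta_target = tau*theta_local + (1-tau)*theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

# Illustrative epsilon decay (values assumed, not from the original code):
# eps = max(eps_end, eps_decay * eps), e.g. eps_start=1.0, eps_end=0.01, eps_decay=0.995
```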
The Q-network is a four-layer MLP:
nn.Linear(state_size, 160) -> nn.Linear(160, 80) -> nn.Linear(80, 80) -> nn.Linear(80, action_size)
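A minimal PyTorch sketch of this four-layer MLP, assuming ReLU activations between the hidden layers (the activation choice is an assumption, not stated above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Four-layer MLP mapping a state to one Q-value per action."""

    def __init__(self, state_size, action_size, seed=0):
        super().__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, 160)
        self.fc2 = nn.Linear(160, 80)
        self.fc3 = nn.Linear(80, 80)
        self.fc4 = nn.Linear(80, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)  # raw Q-values, no activation on the output layer
```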
The following techniques could be applied to improve the agent:
- Double DQN - reduces overestimation of action values (see the sketch after this list).
- Prioritized Experience Replay - samples more important transitions with higher probability.
- Dueling DQN - this architecture leads to better policy evaluation in the presence of many similar-valued actions.
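As a hedged sketch of the Double DQN idea (not code from this repository): the local network selects the greedy next action, while the target network evaluates it, which reduces the overestimation bias of the plain max-based target.

```python
import torch

def double_dqn_targets(qnetwork_local, qnetwork_target, rewards, next_states, dones, gamma=0.99):
    """Compute Double DQN TD targets: action selection by the local network,
    action evaluation by the target network."""
    with torch.no_grad():
        # Local network picks the best next action for each sampled transition.
        best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)
        # Target network evaluates those chosen actions.
        q_next = qnetwork_target(next_states).gather(1, best_actions)
    # Standard TD target; (1 - dones) zeroes out the bootstrap term at episode end.
    return rewards + gamma * q_next * (1 - dones)
```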