The DQL algorithm is a distributed version of the popular Q-Learning algorithm.
The original update of the action-value function of an agent, according to the Q-learning algorithm, is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The distributed version instead has the following update for each agent $i$ (reconstructed here under the assumption, consistent with the description below, that each agent bootstraps on the Q-table of its successor node):

$$Q_i(s, a) \leftarrow Q_i(s, a) + \alpha \left[ r + \gamma \max_{a'} Q_j(s', a') - Q_i(s, a) \right]$$

where $j$ is the agent of the successor node, $s'$ is the state observed there, $\alpha$ is the learning rate and $\gamma$ is the discount factor.
The algorithm is tested on Flatland, in which each junction cell is modeled as a node in a graph. Each node is an independent RL agent that exchanges information with its successor node, i.e., the node the train is routed to from the current one.
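As an illustration of this exchange, here is a minimal, hypothetical sketch (not the repository's implementation; the class and method names are invented for illustration) of how a junction-node agent could update its own Q-table by bootstrapping on the Q-table of the successor node the train is routed to:

```python
import numpy as np

class NodeAgent:
    """A junction-node agent holding its own tabular Q-function (illustrative)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor

    def best_value(self, state):
        # Value this node reports to its predecessor for bootstrapping.
        return self.q[state].max()

    def update(self, state, action, reward, successor, next_state):
        # Distributed Q-learning step: the target bootstraps on the Q-table
        # of the successor node the train is routed to.
        target = reward + self.gamma * successor.best_value(next_state)
        self.q[state, action] += self.alpha * (target - self.q[state, action])
```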
📂 distributed_q_learning
├── 📂 flatland_tools
│   └── ...
├── 📂 training_utils
│   └── ...
├── 📂 plot
│   └── ...
└── train.py
The folder flatland_tools contains the wrapper around the Flatland environment, the folder training_utils contains auxiliary functions for the training phase, and the folder plot contains a Python script that can be used to generate plots of the training curves. The Python script train.py can be used to train the DQL algorithm.
Create a virtual environment, activate it and install all the requirements.
python -m venv venv_dql
source venv_dql/bin/activate
pip install -r requirements.txt
The main dependencies are flatland-rl, numpy, scikit-learn and scipy (with Python >= 3.6).
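As a quick sanity check (optional and purely illustrative), you can verify that the main dependencies are importable after installation:

```python
# Illustrative check that the main dependencies installed correctly.
import flatland.envs.rail_env  # provided by the flatland-rl package
import numpy
import sklearn
import scipy

print("flatland-rl, numpy, scikit-learn and scipy are all importable.")
```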
Inside the file train.py edit the following two parts.
- Inside the function `generate_env`, set the following return options:
  - `env_width`: the width of the Flatland grid
  - `env_height`: the height of the Flatland grid
  - `env_n_cities`: the number of cities in the Flatland grid
  - `env_n_trains`: the number of trains in the Flatland grid
  - `seed`: the seed used to generate the environment
- Before the start of the main, edit the following variables:
  - `malf_configs`: the rate and the [min, max] interval defining the malfunctions in Flatland
  - `hp_configs`: the hyperparameter configurations of the DQL algorithm, i.e., 'epsilon', 'epsilon decay', 'alpha', 'alpha decay', 'n_episodes'
  - `out_dir`: the path of the output directory
  - `n_workers`: the number of parallel workers used to execute the code
  - `master_seed`: the master seed that generates all the randomness
  - `log_every`: frequency of the logging
  - `save_every`: frequency of DQL model saving
  - `n_evals_calc`: number of evaluations
  - `eval_batch_size`: batch size used to compute evaluations
The output is generated in the `out_dir` path specified in the train.py script. It contains:
- one folder for each hyperparameter configuration, containing the saved models, the configuration parameters of the experiments, the cumulative rewards obtained during the training phase, the log file and the seeds.
- a global log file containing the computation time and the configurations of each experiment.
Inside the function `generate_env`, set the following return argument:
return ModifiedEnv(
    env_width=40,
    env_height=40,
    env_n_cities=7,
    env_n_trains=5,
    seed=13,
    destination_bonus=200,
    deadlock_penalty=-200,
    delay_threshold=0.2,
    malfunction_rate=malf_rate,
    malfunction_min_duration=malf_min,
    malfunction_max_duration=malf_max,
    malfunction_seed=malf_seed
)
Before the main, edit the variables as follows:
# Malfunction configurations
malf_configs = pd.DataFrame([
    [0., 0, 0],
], columns=['rate', 'min', 'max'])

# Hyperparameters
hp_configs = pd.DataFrame([
    [1.0, 0.99997, 0.1, 0.99999, int(4e5)],
    [1.0, 0.999965, 0.1, 0.99999, int(4e5)],
    [1.0, 0.99997, 0.01, 1., int(4e5)],
    [1.0, 0.999965, 0.01, 1.00000, int(4e5)],
    [1.0, 0.999975, 0.1, 0.99999, int(4e5)],
    [1.0, 0.999975, 0.01, 1, int(4e5)]
], columns=['epsilon', 'epsilon decay', 'alpha', 'alpha decay', 'n_episodes'])

# Other parameters
out_dir = 'experiments/reproduce_deterministic'
n_workers = multiprocessing.cpu_count()
master_seed = 666
log_every = 10_000
save_every = 10_000
n_evals_calc = lambda _: 1
eval_batch_size = 10_000
Use the script in the plot folder to obtain the training curves.
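If you only need a quick look at a curve, the following minimal sketch (purely illustrative; it is not the repository's plot script and assumes the per-episode cumulative rewards have already been loaded into a NumPy array) smooths and plots a training curve with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: replace with the cumulative rewards saved by train.py.
rewards = np.random.default_rng(0).normal(loc=0.01, size=400_000).cumsum()

# Moving average to make the long-run trend readable.
window = 1_000
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel('Cumulative reward (moving average)')
plt.title('DQL training curve')
plt.savefig('training_curve.png', dpi=150)
```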
Tested on Ubuntu 18.04.6 LTS | RAM 8GB | Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
Running time: ~8h on 12 cores