Intelligent Traffic Signal Control

From RL for transport research
Revision as of 08:22, 9 October 2022 by Zhiyun Deng

Contributors: Zhiyun Deng and Qi Luo.

This page is mainly based on the survey by Noaeen et al. [1] and the article by Chu et al. [2].

Problem Statement

The increasing transportation demand caused by population growth and urbanization has put significant pressure on existing urban traffic infrastructure, resulting in increasingly severe congestion at intersections. However, the traffic signal controllers currently in use are far from perfect: they follow a repeating pattern that runs through its cycles regardless of the live traffic situation in the area. Although the actuated control method operates traffic signals based on real-time data from loop detectors, it is not designed to fully address fluctuating traffic demands, which makes it suboptimal, especially in highly saturated traffic environments. By contrast, an adaptive signal is a more efficient solution because it has the built-in capacity to adapt to traffic changes without the restrictions that limit the actuated method.

Therefore, the concept of Adaptive Traffic Signal Control (ATSC) has been proposed to improve operational efficiency and mitigate potential congestion in saturated road networks by adjusting the signal timing according to real-time traffic dynamics. In a nutshell, the main idea of classical signal timing strategies is to solve optimization problems to find efficient coordination and control policies. So far, several successful early-stage ATSC products have been deployed in hundreds of cities worldwide, e.g., SCOOT [3], SCATS [4], OPAC [5], and PRODYN [6]. For example, the working principle of SCOOT is presented in Fig. 1, where detectors placed upstream sense the occupancy and transmit the information to the central computer. The traffic models and optimizers then use this information to calculate signal timings that achieve the best overall compromise for coordination among all links in the controlled area. The main aim of this traffic signal control system is to react to changes in observed average traffic demand by making frequent but minor adjustments to the signal cycle time, green allocation, and offset of every controlled intersection.

Figure 1: The functioning principle of SCOOT [7].

Connection to Reinforcement Learning (RL) Methods

Motivation of RL in ATSC Problems

Reinforcement Learning (RL), formulated under the Markov Decision Process (MDP) framework, is a promising alternative for learning ATSC policies from real-world traffic measurements [8]. While traditional model-driven approaches rely on several heuristic assumptions and equations, RL directly fits a parametric model to learn the optimal control from its experience of interacting with complex traffic systems. Moreover, recent advances in combining traditional RL with Deep Neural Networks (DNNs) enhance RL's learning capacity on complex tasks, making it possible to deploy RL-based signal timing policies in real life [9].

As reported in [1], an advantage of RL over conventional approaches, e.g., traffic-theory-based and heuristic methods, is that RL learns through trial-and-error interaction with the environment and chooses actions based on the feedback it receives, rather than relying on the pre-defined rules that conventional methods often use. ATSC is a complex sequential decision-making problem, for which conventional methods may not be well suited to computing optimal solutions. By contrast, ATSC fits naturally into the framework of Markov Decision Processes (MDPs) and RL, which motivates researchers to investigate the benefits of applying RL to ATSC problems.

Nevertheless, RL has its own issues and challenges in solving ATSC problems, especially at the network level. Even though DNNs have improved the scalability of RL, training a centralized RL agent is still infeasible for large-scale ATSC problems. For example, if a single agent controls all the traffic signals within a certain range, the topological information of the traffic network is lost. Moreover, the joint state and action space of the agent grows exponentially in the number of signalized intersections, resulting in high latency and a high failure rate in practice. Hence, it is more efficient and natural to formulate ATSC as a Multi-Agent Reinforcement Learning (MARL) problem [10], where each intersection is controlled by a local RL agent using local observations and limited communication.

MDP Formulation of ATSC Problems

In ATSC, traffic signals are the most common agents. As shown in Fig. 2, some ATSC systems have a single agent in the RL environment, but it is common to have multiple agents that work either cooperatively or competitively, a setting called Multi-Agent Reinforcement Learning (MARL). The benefit is that the agents together cover a large environment while each retains the precision of a single agent, or close to it. The agents interact with a simulated traffic environment in different situations to learn how best to act in a real-world setup. RL relies on a reward signal that promotes long-term goals in an environment. The learning process is a feedback cycle of state, action, and reward, in which RL learns how an agent should map states to actions to maximize the reward.

Figure 2: Illustration of the single-agent and multi-agent settings of ATSC problems [11].
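The feedback cycle of state, action, and reward can be sketched with tabular Q-learning on a toy single intersection. Everything in this sketch is an assumption made for illustration and is not taken from the cited works: the two-approach queue model (`ToyIntersection`), the arrival probabilities, the discharge rate of two vehicles per step, and the hyperparameters.

```python
import random
from collections import defaultdict


class ToyIntersection:
    """Hypothetical two-approach intersection; the state is the pair of capped queue lengths."""

    MAX_QUEUE = 5

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.queues = [0, 0]  # [north-south, east-west]

    def state(self):
        return tuple(self.queues)

    def step(self, action):
        # Action 0 gives green to north-south, action 1 to east-west.
        # The approach with green discharges up to 2 vehicles (assumed rate).
        self.queues[action] = max(0, self.queues[action] - 2)
        # Assumed per-step arrival probabilities for the two approaches.
        for i, p in enumerate((0.6, 0.3)):
            if self.rng.random() < p:
                self.queues[i] = min(self.MAX_QUEUE, self.queues[i] + 1)
        reward = -sum(self.queues)  # penalise the total queue length
        return self.state(), reward


def train(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning and return the learned Q-table."""
    env = ToyIntersection(seed)
    rng = random.Random(seed + 1)
    q = defaultdict(lambda: [0.0, 0.0])  # state -> [Q(s, 0), Q(s, 1)]
    s = env.state()
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)                      # explore
        else:
            a = max((0, 1), key=lambda x: q[s][x])    # exploit
        s_next, r = env.step(a)
        # One-step Q-learning update toward r + gamma * max_a' Q(s', a').
        q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
        s = s_next
    return q


q_table = train()
```

Each iteration observes a state, picks a phase, receives a reward, and updates the estimate of long-term return, which is the loop described above. A multi-agent version would run one such learner per intersection.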

Action Space

Recall that in the ATSC problem [1], a phase is defined as a period during which a set of non-conflicting traffic movements receives a green signal. A cycle is composed of several phases, and the cycle time is the time required to complete a full sequence of the phases. The proportion of the cycle that is green is called the split. Moreover, in a coordinated system, the offset is defined as the time by which the green phase at an intersection begins after the beginning of green time of the reference signal. The main goal of ATSC is to improve the environment or network performance (e.g., delay time, travel time, queue length, and speed) by controlling the actions of the agents.
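The timing terms above can be made concrete with a small worked example. The sketch below assumes that each phase consists of green, yellow, and all-red intervals and uses made-up durations; the `Phase` class and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    green: float          # seconds of green
    yellow: float = 3.0   # assumed clearance interval
    all_red: float = 1.0  # assumed all-red interval

    @property
    def duration(self):
        return self.green + self.yellow + self.all_red


def cycle_time(phases):
    """Time needed to serve every phase once."""
    return sum(p.duration for p in phases)


def splits(phases):
    """Share of the cycle that each phase spends in green."""
    c = cycle_time(phases)
    return {p.name: p.green / c for p in phases}


def green_start(reference_green_start, offset, cycle):
    """Start of green at a coordinated intersection, wrapped into one cycle."""
    return (reference_green_start + offset) % cycle


plan = [Phase("NS", 26.0), Phase("EW", 16.0), Phase("NS-left", 6.0)]
print(cycle_time(plan))               # 60.0
print(splits(plan))                   # green share of each phase
print(green_start(0.0, 75.0, 60.0))   # 15.0
```

With these numbers the cycle time is 30 + 20 + 10 = 60 seconds, and an offset of 75 seconds is equivalent to starting green 15 seconds after the reference signal in every cycle.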

The standard action definitions for traffic signal control include phase switch [12], phase duration [13], and the phase itself [14]. The last definition is the simplest, and it enables flexible and direct ATSC by RL agents. Hence, it is widely used in existing research, which defines each local action of a single agent as a possible phase (e.g., red, green, and yellow light) or a red-green combination of the traffic lights at that intersection. In other words, at each decision step the RL agent selects one of the feasible phases for each intersection, to be implemented for a period of time. In addition, to balance the benefits of a flexible action setting against its impact on tractability, Li et al. [15] assume that cycle lengths are not fixed but can only be adjusted between minimum and maximum limits to ensure practicality (as shown in Fig. 3). The agent therefore follows an acyclic timing scheme with a variable cycle length and a variable phasing sequence. More specifically, the control action is either to extend the current phase (i.e., choose the number of units of green extension) or to switch to any other phase; consequently, some unnecessary phases may be skipped depending on fluctuations in traffic. Each chosen phase is also assumed to have a length limit, with predetermined minimum green, yellow, and red times, as in engineering practice.

Figure 3: Diagram of Candidate Actions in an Acyclic ATSC [15].
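A minimal sketch of such an extend-or-switch action set is given below. It is not the implementation of [15]: the function name, the 5-second extension unit, the candidate extension counts, and the way the minimum and maximum green limits are enforced are all assumptions chosen for illustration.

```python
def feasible_actions(current_phase, elapsed_green, phases,
                     min_green=5, max_green=40,
                     extension_units=(1, 2, 3), unit=5):
    """Return candidate actions as ("extend", seconds) or ("switch", phase)."""
    if elapsed_green < min_green:
        # The current phase must be served until the minimum green is reached.
        return [("extend", min_green - elapsed_green)]
    actions = []
    for k in extension_units:
        # An extension is allowed only if it respects the maximum green limit.
        if elapsed_green + k * unit <= max_green:
            actions.append(("extend", k * unit))
    for p in phases:
        # Switching to any other phase is allowed, so phases can be skipped.
        if p != current_phase:
            actions.append(("switch", p))
    return actions


phases = ["NS", "EW", "NS-left"]
print(feasible_actions("NS", 2, phases))
print(feasible_actions("NS", 35, phases))
print(feasible_actions("NS", 40, phases))
```

Because the agent may switch to any other phase rather than the next one in a fixed order, the resulting sequence is acyclic, while the green limits keep every chosen phase within practical bounds.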

State Space

Various elements have been proposed to describe the environment state, such as queue length, waiting time, speed, and phase. These elements can be defined at the lane level or the road-segment level and then concatenated into a vector. For example, in much of the existing research [13], [14], the local state is defined as the combination of the cumulative delay of the first vehicle and the total number of approaching vehicles along each incoming lane within a certain range of the intersection. It is noteworthy that both kinds of information can be obtained from Induction-Loop Detectors (ILD) located near the intersection, which provides the basis for applying ATSC. In addition, state information can be easily collected in simulation platforms; see, e.g., the laneAreaDetector in SUMO [16].
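As an illustration of how lane-level measurements are concatenated into a state vector, the sketch below builds a local state from the number of approaching vehicles and the delay of the first vehicle on each incoming lane. The dictionary layout, key names, lane identifiers, and values are hypothetical; in practice the measurements would come from detectors or from a simulator.

```python
def local_state(lane_measurements, lane_order):
    """Concatenate per-lane features into one flat vector in a fixed lane order."""
    approaching = [float(lane_measurements[lane]["approaching"]) for lane in lane_order]
    first_delay = [float(lane_measurements[lane]["first_vehicle_delay"]) for lane in lane_order]
    return approaching + first_delay


# Made-up measurements for the four incoming lanes of one intersection.
measurements = {
    "N_in": {"approaching": 4, "first_vehicle_delay": 12.5},
    "E_in": {"approaching": 2, "first_vehicle_delay": 0.0},
    "S_in": {"approaching": 7, "first_vehicle_delay": 30.0},
    "W_in": {"approaching": 0, "first_vehicle_delay": 0.0},
}
order = ["N_in", "E_in", "S_in", "W_in"]
state = local_state(measurements, order)
print(state)  # [4.0, 2.0, 7.0, 0.0, 12.5, 0.0, 30.0, 0.0]
```

Keeping the lane order fixed matters, since the learning algorithm treats each position of the vector as the same feature at every step.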

Reward Function

Researchers have selected various objectives as the optimization goal of RL algorithms; most of them relate to safety, operational efficiency, and energy consumption, and they define the learning objective of the RL agents. However, an appropriate reward function for MARL algorithms should be spatially decomposable and frequently measurable, because the agents are distributed. For example, references [12] and [13] define the reward function as cumulative delay and wave, respectively. By contrast, references [14] and [2] define the reward function as a combination of the measured queue length and the total number of approaching vehicles along each incoming lane, which is directly correlated with state and action and emphasizes both traffic congestion and trip delay.
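A spatially decomposable reward of this kind can be sketched as a negative weighted sum over the incoming lanes of one intersection, with the network-level reward obtained by summing the local ones. The weight of 0.2 and the function names are assumptions for illustration, not values from the cited works.

```python
def local_reward(queues, approaching, weight=0.2):
    """Negative sum over incoming lanes of queue length plus weighted approaching vehicles."""
    return -sum(q + weight * n for q, n in zip(queues, approaching))


def global_reward(local_rewards):
    """The network-level reward decomposes into a sum of local rewards."""
    return sum(local_rewards)


r_a = local_reward(queues=[3, 0, 5], approaching=[5, 0, 10])
r_b = local_reward(queues=[1, 1], approaching=[0, 5])
print(r_a, r_b, global_reward([r_a, r_b]))
```

Each term depends only on quantities that a local agent can measure at every decision step, which is what makes the reward usable in a distributed setting.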

Table 1: Typical works on RL-based ATSC methods.
Ref. | Year | Scale | Model | State | Action | Reward
[17] | 2003 | Single | Q-learning | Queue length | Binary phase | Fixed penalty
[18] | 2010 | Multiple | Q-learning | Arriving vehicle, Queue length, Traffic delay | Green phases | Traffic delay
[19] | 2014 | Multiple | Q-learning | Arriving vehicle, Queue length, Cumulative delay | Traffic phases | Stage costs
[20] | 2016 | Single | DQN | Queue length, Waiting time, Average speed, Phase duration | Right of ways | Traffic delay, Queue length
[21] | 2018 | Multiple | DQN | Queue length, Waiting time, Vehicle number | Traffic phases | Queue length, Waiting time, Vehicle number
[22] | 2020 | Multiple | DQN | Vehicle speed | Traffic phase | Travel delay
[11] | 2022 | Multiple | A2C | Vehicle queue | Traffic phase | Waiting time

Network Architecture of RL Algorithms

Recall that the states of traffic flows are complex spatial-temporal data, so the learning process of the agents may be non-stationary if only the current states are fed into the A2C networks. However, if all historical data are fed into A2C, the exponentially growing state space will significantly dilute the attention of A2C, even under normal traffic demand, resulting in long training times. Therefore, an alternative solution is to combine ANN and LSTM networks [23], where the LSTM is applied as the last hidden layer to extract representations from the different state types. As shown in Fig. 4, all the input data are first processed by separate fully connected layers, and then the hidden units are combined and fed into an LSTM layer.

Figure 4: DNN structure of MA2C in ATSC [2].
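The data flow described above (separate fully connected layers per state type, followed by an LSTM as the last hidden layer) can be sketched as a forward pass in plain Python. This is not the network of [2]: the layer sizes, the two state types named `wave` and `wait`, the random untrained weights, and all function names are assumptions, and the actor and critic output layers as well as training are omitted.

```python
import math
import random


def dense(x, w, b):
    """Fully connected layer with ReLU; w holds one weight row per output unit."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, params):
    """One LSTM cell step; params maps a gate name to (weights, bias) over [x; h]."""
    xh = x + h  # concatenate the input with the previous hidden state

    def gate(name, act):
        w, b = params[name]
        return [act(sum(wi * vi for wi, vi in zip(row, xh)) + bi)
                for row, bi in zip(w, b)]

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def init(rows, cols, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return w, [0.0] * rows


def encode(sequence, n_lanes=4, fc_units=8, lstm_units=6, seed=0):
    """sequence: list of (wave, wait) pairs, each a list of n_lanes floats."""
    rng = random.Random(seed)
    fc_wave = init(fc_units, n_lanes, rng)
    fc_wait = init(fc_units, n_lanes, rng)
    lstm = {name: init(lstm_units, 2 * fc_units + lstm_units, rng)
            for name in ("input", "forget", "output", "candidate")}
    h, c = [0.0] * lstm_units, [0.0] * lstm_units
    for wave, wait in sequence:
        # Separate fully connected layers per state type, then combine.
        merged = dense(wave, *fc_wave) + dense(wait, *fc_wait)
        h, c = lstm_step(merged, h, c, lstm)
    return h  # representation that would feed the actor and critic outputs


history = [([0.2, 0.0, 0.5, 0.1], [0.1, 0.0, 0.9, 0.3]),
           ([0.3, 0.1, 0.4, 0.0], [0.0, 0.2, 1.0, 0.1])]
h = encode(history)
print(len(h))  # 6
```

The LSTM state carries information from earlier time steps, so only the current observation has to be supplied at each step instead of the full history.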

The structure of the deep neural network has a high impact on learning in DRL, and several network structures are available in the literature. The multi-layer perceptron (MLP), i.e., the standard fully connected neural network model, is a useful tool for classic data classification. The convolutional neural network (CNN) extends the multi-layer perceptron with kernel filters and provides high performance in mapping images to outputs. Standard DQN uses a CNN that takes consecutive raw pixel frames as the state definition. Residual networks (ResNet) are used to deal with the overfitting problem in CNN-based deep network structures [24]. Another convolution-based structure, designed for operations on graphs, is the graph convolutional network (GCN). Recurrent neural networks (RNN), e.g., Long Short-Term Memory (LSTM), are designed to work with sequential data [25], [26]. Another type of neural network model is the autoencoder, which learns an encoding of high-dimensional input data in a lower-dimensional subspace. The encoded input can be decoded to reconstruct the input, which is commonly used to remove noise from input data [27].

Main Results

To evaluate the performance of an RL-based ATSC policy before it is deployed in the real world, most existing studies adopt simulation-based experiments (e.g., in SUMO [16]). The simulation scenarios can be either randomly generated with the built-in traffic simulator or built from real-world traffic data. For example, a road network with 25 intersections is illustrated in Fig. 5, where the action space of each intersection contains five possible phases (i.e., E-W straight phase, E-W left-turn phase, and three straight and left-turn phases for E, W, and N-S). Clearly, centralized RL (i.e., single-agent RL) is infeasible for large-scale ATSC because of the extremely high dimension of the joint action space, which has size 5^25.

Figure 5: (a) A traffic grid of 25 intersections. (b) Average queue length. (c) Average intersection delay [2].
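The size of the joint action space for this grid can be checked with a few lines of arithmetic. The comparison with the number of per-agent action outputs is added here for illustration; the variable names are arbitrary.

```python
import math

n_intersections = 25
n_phases = 5

# A centralized agent must choose one phase for every intersection at once.
joint_actions = math.prod([n_phases] * n_intersections)

# With one local agent per intersection, each agent only ranks its own phases.
per_agent_outputs = n_phases * n_intersections

print(joint_actions)       # 298023223876953125
print(per_agent_outputs)   # 125
```

A centralized policy would therefore need to distinguish roughly 3 x 10^17 joint actions, whereas 25 local agents need only five outputs each.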

Another scenario, based on real-world city data, is presented in Fig. 6 and contains a variety of road and intersection types. There are 30 signalized intersections in total: 11 are two-phase, 4 are three-phase, 10 are four-phase, 1 is five-phase, and the remaining 4 are six-phase. The comparison of average queue length and average intersection delay over simulation time under different ATSC policies for the two scenarios is presented in Fig. 5 and Fig. 6. It can be seen that both IA2C and Greedy are able to reduce the queue lengths after the peak values, while IA2C achieves a better recovery rate. However, both of them fail to maintain sustainable intersection delays. By contrast, MA2C achieves lower and more sustainable intersection delays via coordination with shared neighborhood fingerprints.

Figure 6: (a) Monaco traffic network. (b) Average queue length. (c) Average intersection delay [2].


  1. Noaeen, M., Naik, A., Goodman, L., Crebo, J., Abrar, T., Abad, Z. S. H., ... & Far, B. (2022). Reinforcement learning in urban network traffic signal control: A systematic literature review. Expert Systems with Applications, 116830.
  2. Chu, T., Wang, J., Codecà, L., & Li, Z. (2019). Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3), 1086-1095.
  3. Hunt, P. B., Robertson, D. I., Bretherton, R. D., & Royle, M. C. (1982). The SCOOT on-line traffic signal optimisation technique. Traffic Engineering & Control, 23(4).
  4. Luk, J. Y. K. (1984). Two traffic-responsive area traffic control methods: SCAT and SCOOT. Traffic engineering & control, 25(1).
  5. Gartner, N. H. (1982). Demand-Responsive Decentralized Urban Traffic Control. Part I: Single-Intersection Policies (No. DOT/RSPA/DPB-50/81/24).
  6. Henry, J. J., Farges, J. L., & Tuffal, J. (1984). The PRODYN real time traffic algorithm. In Control in Transportation Systems (pp. 305-310). Pergamon.
  8. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  9. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  10. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., & Whiteson, S. (2017, July). Stabilising experience replay for deep multi-agent reinforcement learning. In International conference on machine learning (pp. 1146-1155). PMLR.
  11. Wu, Q., Wu, J., Shen, J., Du, B., Telikani, A., Fahmideh, M., & Liang, C. (2022). Distributed agent-based deep reinforcement learning for large scale traffic signal control. Knowledge-Based Systems, 241, 108304.
  12. El-Tantawy, S., Abdulhai, B., & Abdelgawad, H. (2013). Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3), 1140-1150.
  13. Aslani, M., Mesgari, M. S., & Wiering, M. (2017). Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events. Transportation Research Part C: Emerging Technologies, 85, 732-752.
  14. Prashanth, L. A., & Bhatnagar, S. (2010). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412-421.
  15. Li, Z., Yu, H., Zhang, G., Dong, S., & Xu, C. Z. (2021). Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transportation Research Part C: Emerging Technologies, 125, 103059.
  16. Krajzewicz, D., Erdmann, J., Behrisch, M., & Bieker, L. (2012). Recent development and applications of SUMO-Simulation of Urban MObility. International Journal on Advances in Systems and Measurements, 5(3&4).
  17. Abdulhai, B., Pringle, R., & Karakoulas, G. J. (2003). Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering, 129(3).
  18. El-Tantawy, S., & Abdulhai, B. (2010, September). An agent-based learning towards decentralized and coordinated traffic signal control. In 13th International IEEE Conference on Intelligent Transportation Systems (pp. 665-670). IEEE.
  19. Prabuchandran, K. J., AN, H. K., & Bhatnagar, S. (2014, October). Multi-agent reinforcement learning for traffic signal control. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC) (pp. 2529-2534). IEEE.
  20. Li, L., Lv, Y., & Wang, F. Y. (2016). Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3), 247-254.
  21. Wei, H., Zheng, G., Yao, H., & Li, Z. (2018, July). Intellilight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2496-2505).
  22. Rasheed, F., Yau, K. L. A., & Low, Y. C. (2020). Deep reinforcement learning for traffic signal control under disturbances: A case study on Sunway city, Malaysia. Future Generation Computer Systems, 109, 431-445.
  23. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
  24. Liu, M., Deng, J., Xu, M., Zhang, X., & Wang, W. (2017, August). Cooperative deep reinforcement learning for traffic signal control. In 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Halifax (Vol. 2017).
  25. Shi, S., & Chen, F. (2018). Deep recurrent Q-learning method for area traffic coordination control. Journal of Advances in Mathematics and Computer Science, 1-11.
  26. Choe, C. J., Baek, S., Woon, B., & Kong, S. H. (2018, November). Deep q learning with LSTM for traffic light control. In 2018 24th Asia-Pacific Conference on Communications (APCC) (pp. 331-336). IEEE.
  27. Li, L., Lv, Y., & Wang, F. Y. (2016). Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3), 247-254.