Routing in Ridesharing

From RL for transport research
Jump to navigation Jump to search

Contributor:Zixuan Xu, Jingbang Chen, Qi Luo.
This page mainly based on the Survey by Qin, Zhiwei, Hongtu Zhu, and Jieping Ye[1].

Problem statement[edit]

The routing module provides turn-by-turn guidance on the road network to drivers/vehicles either in service of a passenger request or performing a reposition. It could be route planning or dynamic routing depending on the decision points. In the first type of set-up, each vehicle on the road network selects a route for a given OD pair from a set of feasible routes. The decision is only reviewed and revised after a trip is completed. Hence, it is called route planning or route choice. When the routes for all the vehicles are planned together, it is equivalent to assigning the vehicles to each link in the network, and hence, the problem is called traffic assignment problem (TAP), which is typically for transportation planning purposes. In the second type of set-up, the routing decision is made at each intersection to select the next outbound road (link) to enter. These are real-time adaptive navigation decisions for vehicles to react to the changing traffic state of the road network. The problem corresponding to this set-up is called dynamic routing, dynamic route choice, or route guidance. Routing in this paper refers to low-level navigation decisions on a road network, typically with output of matching and repositioning algorithms as input.

Connection to RL methods[edit]

Routing on a road network is a typical multi-agent problem, where the decisions made by one agent has influence on the other agents' performance. For example, the congestion level of a link depends directly on the number of vehicles passing through that link at a given time and has direct impact on the travel time for all the vehicles on that link within the same time interval. The literature for route planning and TAP often consider the equilibrium property of the algorithms when a population of vehicles adopt them. TAP is typically from a traffic manager's (i.e., system's) perspective. Its goal is to reach system equilibrium (SE, or also often referred to as the system optimum). Some works focus on route planning or TAP from an individual driver's perspective and maximize individual reward. These algorithms try to reach user equilibrium (UE) or Nash equilibrium, under which no agent has the incentive to change its policy because doing so will not achieve higher utility. This is the best that selfish agents can achieve but may not be optimal for the system.

Main results[edit]

Route Planning and TAP[edit]

Value-based RL is by far the most common approach for route planning and TAP aiming to reach UE (see Table 1). In the MDP formulation is listed below. This MDP is stateless, so strictly speaking, it is a multi-arm bandits or contextual bandits problem [2] if considering time as a contextual feature.

  • Agent: a vehicle (or equivalently, a task) with a given OD pair.
  • Objective: minimize the total travel time for an individual vehicle (task) [3][4][5], i.e., the agent is selfish.
  • Reward: total travel time of a trip for an individual and a particular run.
  • Action: action to take at decision time is to select a route from the set of feasible routes for the associated OD pair (Ramos et al. 2018, Zhou, Song, Zhao & Liu 2020, Bazzan & Chira 2015) [4][5][6].
  • Value: Due to the multi-agent nature, the environment w.r.t. each agent is non-stationary in that the reward function is changing with the policy updates from the other agents.
Paper Type Equilibrium Agent State Action Reward Algorithm
Mainali et al. (2008) [3] TAP UE vehicle current node (intersection) an adjacent link travel time on the link Q-iteration. The route is constructed by following the decision at each intersection.
Bazzan & Chira (2015)[6] TAP SE vehicle - route from feasible

routes for the OD pair

total travel time on the chosen route Q-iteration. The route is constructed by following the decision at each intersection.
Bazzan & Chira (2015) [6] TAP UE (empirical


vehicle - route from feasible

routes for the OD pair

total travel time on the chosen route Q-learning with action-regret updates to minimize driver's total regret
Zhou, Song, Zhao & Liu (2020)[5] TAP UE (theoretical



vehicle - route from feasible

routes for the OD pair

total travel time on the chosen route B-M RL scheme, similar to a MARL algo with individual reward.

Empirical convergence to UE is demonstrated by Ramos et al.(2018). Zhou, Song, Zhao & Liu (2020) further develop a Bush-Mosteller RL scheme for MARL and formally establishes its UE convergence property. We also highlight some unique features of the papers. Ramos et al. (2018) consider a different objective from the common and minimizes the driver's regret. To do that, the Q-learning updates are modified using the estimated action regret, which can be computed by local observations and global travel time information communicated by an app. Bazzan & Chira (2015) propose a hybrid method, with Q-learning for individual agents and Genetic Algorithm for reaching system equilibrium, minimizing the average travel time over different trips in the network. This method is thus able to achieve SE. Mainali et al. (2008) adopt Q-iterations with a model set-up similar to that of dynamic routing to be discussed next.

Dynamic Routing[edit]

Most applications of RL to routing concern with the dynamic routing (DR) problem (see Table 5).

  • Agent: vehicle agent
  • State: the traffic state of the current node (i.e., intersection). Some works consider state features of the neighboring nodes (Kim et al. 2005, Mao & Shen 2018) so that the agent has a broader view of the environment.
  • Action space: comprises the set of outbound links (i.e., roads) or adjacent nodes from the current node, so the policy provides a turn-by-turn navigation guidance until the destination is reached.
  • Reward: common to use travel time on a link as reward function.
Paper Type Equilibrium Agent State Action Reward Algorithm Road Network
Kim et al. (2005) [7] DR UE vehicle node, time, binary congestion status vector for all links an adjacent node cost accrued by traversing the link parameters of MDP estimated from data, MDP solved by value iterations Southeast Michigan network and traffic data
Grunitzki et al. (2014) [8] DR SE vehicle current node an adjacent link individual reward: negative travel time experienced by the agent on a link difference reward: as in Tumer et al. 2008[9] Applied to a more sophisticated network than aamas08 paper. Results show that DQ learning outforms IQ-learning. abstract network topology
Bazzan & Grunitzki(2016)[10] DR UE trip (OD pair) current node an adjacent node negative travel time by the agent on a link Independent Q learning compared with successive average method OW network, Sioux Falls network
Wen et al. (2019) [11] DR UE vehicle next approaching node, destination node an adjacent link negative travel time by the agent on a link tabular Q-learning, global road network clustered into subnetworks by differential evolution-based clustering. SUMO simulator with various networks in Japan and US
Shou & Di (2020b)[12] DR SE (avg travel time), UE (travel distance) vehicle current node time an adjacent link negative travel time by the agent on a link Bilevel optimization: Lower level - mean field MARL; Upper level - Bayesian optimization SUMO with Manhattan network

Tumer et al. (2008)[9], Grunitzki et al. (2014) [8]defined a new form called difference reward, which is the difference in average travel time on a link with and without the agent in the system. This applied to only a reward function dependent on the number of agents using the traversed link. In particular, travel distance cannot be used to define a difference reward. Whether solving a specific formulation achieves UE or SE depends on the reward function used. If all the agents in the system learn by global reward [9][8][13], then the system is expected to achieve SE. Otherwise, the agents learn by their local rewards, and we will have UE or Nash equilibrium[7][14][15][16][11].


  1. Qin, Z., Zhu, H., & Ye, J. (2021). Reinforcement Learning for Ridesharing: An Extended Survey. arXiv e-prints, arXiv-2105.
  2. Li, L., Chu, W., Langford, J. & Schapire, R. E. (2010), A contextual-bandit approach to personalized news article recommendation, in 'Proceedings of the 19th international conference on World wide web', pp. 661–670.
  3. 3.0 3.1 Mainali, M. K., Shimada, K., Mabu, S. & Hirasawa, K. (2008), 'Optimal route based on dynamic programming for road networks', Journal of Advanced Computational Intelligence and Intelligent Informatics 12(6), 546–553.
  4. 4.0 4.1 Ramos, G. d. O., Bazzan, A. L. & da Silva, B. C. (2018), 'Analysing the impact of travel information for minimising the regret of route choice', Transportation Research Part C: Emerging Technologies 88, 257–271.
  5. 5.0 5.1 5.2 Zhou, B., Song, Q., Zhao, Z. & Liu, T. (2020), 'A reinforcement learning scheme for the equilibrium of the in-vehicle route choice problem based on congestion game', Applied Mathematics and Computation 371, 124895.
  6. 6.0 6.1 6.2 Bazzan, A. L. & Chira, C. (2015), A hybrid evolutionary and multiagent reinforcement learning approach to accelerate the computation of traffic assignment, in 'AAMAS', pp. 1723–1724.
  7. 7.0 7.1 Kim, S., Lewis, M. E. & White, C. C. (2005), 'Optimal vehicle routing with real-time traffic information', IEEE Transactions on Intelligent Transportation Systems 6(2), 178–188.
  8. 8.0 8.1 8.2 Grunitzki, R., de Oliveira Ramos, G. & Bazzan, A. L. C. (2014), Individual versus difference rewards on reinforcement learning for route choice, in '2014 Brazilian Conference on Intelligent Systems', IEEE, pp. 253–258
  9. 9.0 9.1 9.2 Tumer, K., Welch, Z. T. & Agogino, A. (2008), Aligning social welfare and agent preferences to alleviate traffic congestion, in 'Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 2', Citeseer, pp. 655–662.
  10. Bazzan, A. L. & Grunitzki, R. (2016), A multiagent reinforcement learning approach to en-route trip building, in '2016 International Joint Conference on Neural Networks (IJCNN)', IEEE, pp. 5288– 5295.
  11. 11.0 11.1 Wen, F., Wang, X. & Xu, X. (2019), 'Hierarchical sarsa learning based route guidance algorithm', Journal of Advanced Transportation 2019.
  12. Shou, Z. & Di, X. (2020b),'Reward design for driver repositioning using multi-agent reinforcement learning', Transportation research part C: emerging technologies 119, 102738.
  13. Shou, Z. & Di, X. (2020a),'Multi-agent reinforcement learning for dynamic routing games: A unified paradigm', arXiv preprint arXiv:2011.10915 .
  14. Yu, S., Zhou, J., Li, B., Mabu, S. & Hirasawa, K. (2012), Q value-based dynamic programming with sarsa learning for real time route guidance in large scale road networks, in 'The 2012 International Joint Conference on Neural Networks (IJCNN)', IEEE, pp. 1–7.
  15. Mao, C. & Shen, Z. (2018), 'A reinforcement learning framework for the adaptive routing problem in stochastic time-dependent network', Transportation Research Part C: Emerging Technologies 93, 179–197.
  16. Bazzan, A. L. & Grunitzki, R. (2016), A multiagent reinforcement learning approach to en-route trip building, in '2016 International Joint Conference on Neural Networks (IJCNN)', IEEE, pp. 5288–5295.