Matching in ridesharing

From RL for transport research

Contributor: Zixuan Xu, Jingbang Chen, Qi Luo.
This page is mainly based on the survey by Zhiwei Qin, Hongtu Zhu, and Jieping Ye[1].

Problem statement

The matching module attempts to assign each request to an available driver. Depending on driver pool availability, the request may have to wait in the system until a successful match. Pre-match cancellation may happen during this time. The assigned driver then travels to pick up the passenger, during which time post-match cancellation may also occur. The pick-up location is usually where the passenger makes the request or a location that he/she specifies. After the driver successfully transports the passenger to the destination, the driver receives the trip fare and becomes available again.

The ridesharing matching problem[2][3][4] may appear under different names in the literature, e.g., order dispatching [4], trip-vehicle assignment[5], and on-demand taxi dispatching[6]. It is an online bipartite matching problem where both supply and demand are dynamic, with the uncertainty coming from demand arrivals, travel times, and the entrance-exit behavior of the drivers. Matching can be done continuously in a streaming manner or at fixed review windows (i.e., batching). Sophisticated matching algorithms often leverage demand prediction in some form beyond the actual requests, e.g., the value function in RL.
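As a toy illustration of the batching mode, the sketch below accumulates the requests of one review window and pairs each with the nearest idle driver. All names, positions, and the greedy pairing rule are illustrative assumptions, not from the survey:

```python
# Hypothetical sketch of review-window batching: requests collected during a
# window are matched together against the currently idle drivers.

def match_batch(requests, drivers):
    """Greedy batch matching: pair each request with the nearest idle driver."""
    matches = []
    idle = list(drivers)
    for req in requests:
        if not idle:
            break  # unmatched requests carry over to the next window
        nearest = min(idle, key=lambda d: abs(d - req))  # 1-D toy 'distance'
        idle.remove(nearest)
        matches.append((req, nearest))
    return matches

requests = [2.0, 5.0, 9.0]   # request origins on a line (toy example)
drivers = [1.0, 6.0]         # idle driver positions
print(match_batch(requests, drivers))  # [(2.0, 1.0), (5.0, 6.0)]
```

In the streaming mode, `match_batch` would instead be invoked per arriving request with a batch of size one; sophisticated policies replace the greedy distance rule with value estimates.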

Online request matching is not entirely unique to ridesharing; its distinctive feature is its spatiotemporal nature. Trip requests generally take different amounts of time to finish, and they change the spatial states of the drivers, thus affecting the supply distribution for future matching. Drivers and passengers generally exhibit asymmetric exit behaviors: drivers usually stay in the system for an extended period of time, whereas passenger requests are generally lost after a much shorter waiting period.

Connection to RL methods

The RL literature for rideshare matching typically aims to optimize the total driver income and the service quality over an extended period of time. Service quality can be quantified by response rate and fulfillment rate. Response rate is the ratio of the matched requests to all trip requests. Fulfillment rate is the ratio of completed requests to all requests and is no higher than the response rate.
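A minimal sketch of the two service-quality metrics; the request status labels are our own assumptions, not a real platform schema:

```python
# Response rate: fraction of requests ever assigned a driver.
# Fulfillment rate: fraction of requests actually completed.

def response_rate(statuses):
    # a request counts as responded if it was matched, even if cancelled later
    responded = sum(s in ("completed", "cancelled_post_match") for s in statuses)
    return responded / len(statuses)

def fulfillment_rate(statuses):
    return sum(s == "completed" for s in statuses) / len(statuses)

statuses = ["completed", "completed", "cancelled_post_match",
            "cancelled_pre_match", "unmatched"]
print(response_rate(statuses))     # 0.6: three of five were matched
print(fulfillment_rate(statuses))  # 0.4: never exceeds the response rate
```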


FIGURE 1: The order matching process from a single request’s perspective[1].

Common RL Approach to Rideshare Platform

In terms of the MDP formulation, the driver agent is a convenient modeling choice for its straightforward definition of state, action, and reward, in contrast to system-level modeling, where the action space is exponential. In this case, the rideshare platform is naturally a multi-agent system with a global objective. A common approach is to crowdsource all drivers' experience trajectories to train a single agent and apply it to all the drivers to generate their matching policies [7][8][9][10]. This type of single-agent approach avoids dealing explicitly with the multi-agent aspect of the problem and the interaction among the agents during training. Besides simplicity, this strategy has the additional advantage of easily handling a dynamic set of agents (and hence, a changing action space) [11].
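The crowdsourcing idea can be sketched as a single shared replay buffer fed by every driver's transitions, from which one set of parameters is trained and deployed to all drivers. Names and the data layout here are illustrative:

```python
import random
from collections import deque

# One shared buffer: the learner is identity-agnostic across drivers.
shared_buffer = deque(maxlen=100_000)

def collect(driver_id, trajectory):
    """Pool a driver's (state, action, reward, next_state) transitions.
    driver_id is unused on purpose: the shared learner ignores identity."""
    shared_buffer.extend(trajectory)

def sample_minibatch(batch_size=32):
    """The single agent samples across all drivers' pooled experience."""
    k = min(batch_size, len(shared_buffer))
    return random.sample(list(shared_buffer), k)

collect("driver_1", [((0, 0), "accept", 5.0, (1, 0))])
collect("driver_2", [((2, 3), "idle", 0.0, (2, 3))])
print(len(shared_buffer))  # 2: both drivers feed the same learner
```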

RL Approach to Multiple Objectives Problem

In practical settings, the online matching policy often has to balance multiple objectives [12], e.g., financial metrics and customer experience metrics. The rationale is that persistent negative customer experience will eventually impact long-term financial metrics as users churn or switch to competitors. There are two potential ways to leverage RL for this problem. The explicit approach is to directly learn a policy that dynamically adjusts the weights that combine the multiple objectives into a single reward function. With the abundance of historical experience data, inverse RL can be used to learn the relative importance of multiple objectives under a given unknown policy[13]. The implicit approach is to capture the necessary state signals that characterize the impact of the metrics not explicitly in the reward function, so that the learned value function correctly reflects the long-term effect of the multiple metrics.
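A minimal sketch of the explicit approach, scalarizing several per-step objective signals with adjustable weights; the metric names and weight values are made up for illustration:

```python
# Weighted scalarization: a higher-level policy could adjust these weights
# over time, while the base RL agent optimizes the resulting scalar reward.

def scalarize(metrics, weights):
    """Weighted sum of per-step objective signals into a single reward."""
    return sum(weights[k] * v for k, v in metrics.items())

step_metrics = {"income": 12.0, "pickup_time": -4.0, "cancellation": 0.0}
weights = {"income": 1.0, "pickup_time": 0.5, "cancellation": 2.0}
print(scalarize(step_metrics, weights))  # 12.0 + 0.5 * (-4.0) + 0.0 = 10.0
```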

The Matching Window and The Matching Radius

Besides the driver-passenger pairing decisions, there are other important levers that can be optimized within the matching module, namely the matching window and the matching radius [14]. The matching window determines when to match a request (or a batch of requests). A larger window increases pre-match waiting time but may decrease pick-up time for matched requests because more drivers become available. The matching radius defines how far an idle driver can be from the origin of a given request to be considered in the matching; it can be defined in travel distance or time. A larger matching radius may increase the average pick-up distance, but requests are more likely to be matched within a batch window, whereas a smaller radius reduces effective driver availability but may decrease the average pick-up distance.

Both the matching window and radius are trade-offs between pre-match and post-match waiting times (and hence, cancellation). So far, little effort through RL has been devoted to matching radius optimization. The joint optimization of the matching window and radius is certainly another interesting line of research.
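The matching-radius lever can be sketched as a simple eligibility filter on estimated pick-up time; all numbers are illustrative:

```python
# Only drivers within the pick-up-time 'radius' are eligible for a request.
# A larger radius admits more drivers (higher match likelihood) at the cost
# of longer average pick-up times.

def eligible_drivers(drivers_eta, radius_minutes):
    """Filter idle drivers by estimated pick-up time to the request origin."""
    return [d for d, eta in drivers_eta.items() if eta <= radius_minutes]

etas = {"d1": 3.0, "d2": 7.5, "d3": 12.0}  # minutes to the request origin
print(eligible_drivers(etas, 5))   # ['d1']: tight radius, short pick-ups
print(eligible_drivers(etas, 10))  # ['d1', 'd2']: wider radius, more matches
```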

Relevant RL methods in Other Domains

Because of its generalizability, matching for ridesharing is closely related to a number of online matching problems in other domains, the RL methods for which are also relevant and can inform research in rideshare matching. Some examples are training a truck agent using DQN with pooled experience to dispatch trucks for mining tasks [15], learning a decentralized value function using PPO with a shaped reward function for cooperation (in a similar spirit as [14]) to dispatch couriers for pick-up services [16], and designing a self-attention, pointer network-based policy network for a system agent to assign participants to tasks in mobile crowdsourcing [17].

Main results

Single-agent approach

Since the system reward is the sum of the drivers' rewards, the system value function decomposes into the individual drivers' value functions, each of which can be computed from that driver's own trajectories. The approximation lies in using a single value function learned from all drivers' data.
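Under the assumption that per-driver rewards are additive and each driver's value depends on its own state, this decomposition can be written as (the notation is ours, not the survey's):

$$V^{\pi}(s) \;=\; \mathbb{E}\Big[\sum_{t} \gamma^{t} \sum_{i=1}^{N} r_{t}^{i}\Big] \;=\; \sum_{i=1}^{N} V_{i}^{\pi}(s_{i}) \;\approx\; \sum_{i=1}^{N} \hat{V}(s_{i}),$$

where $r_{t}^{i}$ is driver $i$'s reward, $s_{i}$ its local state, and $\hat{V}$ the single shared value function learned from the pooled trajectories of all drivers.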

Table 1: Summary of single-agent approaches for matching.
Xu et al. (2018): learn a tabular driver value function using TD(0).
Wang et al. (2018), Tang et al. (2019), Holler et al. (2019): apply DQN-type training to learn a value network.
Tang et al. (2019): design a spatiotemporal state-value network using hierarchical coarse coding and cerebellar embedding memories for better state representation and training stability.
Holler et al. (2019): develop an action-value network that leverages global supply-demand information, which is embedded into a global context by attention.

Furthermore, order matching requires strong system-level coordination in that a feasible solution has to satisfy the one-to-one matching constraints. The following methods address this issue.

Table 2: Methods for satisfying the matching constraints.
Xu et al. (2018), Tang et al. (2019): use the learned state values to populate the edge weights of a bipartite assignment problem, generating a collective-greedy policy with respect to the state values.
Holler et al. (2019): assume a setting where drivers are matched or repositioned sequentially so that the policy output always satisfies the matching constraints.
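The value-weighted bipartite assignment step can be sketched with a brute-force solver on a tiny weight matrix. The weights and the exhaustive solver are illustrative; production systems use efficient assignment algorithms such as the Hungarian method:

```python
from itertools import permutations

# weights[i][j] would be reward(driver i, request j) plus gamma times the
# learned value of driver i's post-trip state; numbers here are made up.

def best_assignment(weights):
    """Exact assignment for a small square weight matrix (drivers x requests)."""
    n = len(weights)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm

weights = [[4.0, 1.0, 0.5],
           [2.0, 5.0, 1.0],
           [1.5, 2.0, 3.0]]
print(best_assignment(weights))  # (12.0, (0, 1, 2)): each driver keeps its best edge
```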

Optimize the Multiagent System

When optimizing the multiagent system, one significant challenge is scalability: any realistic ridesharing setting can easily involve thousands of agents, precluding the possibility of working with an exact joint action space. Mean-field MARL, independent Q-learning, and other methods have been developed to solve this problem.

Table 3: Methods for optimizing the multiagent system.
Li et al. (2019): apply mean-field MARL to make the interaction among agents tractable, taking the 'average' action of the neighboring agents to approximate the joint actions.
Zhou et al. (2019): propose independent Q-learning with centralized KL divergence (of the supply and demand distributions) regularization, which follows the centralized-training decentralized-execution paradigm.
Jin et al. (2019): treat each spatial grid cell as a worker agent and a region of grid cells as a manager agent, adopting hierarchical RL to jointly optimize order matching and vehicle repositioning.
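The mean-field approximation can be sketched as follows; the one-hot encoding over a small discrete action set is an assumption for illustration:

```python
# Each agent conditions on the average action of its neighbors instead of
# the full joint action: Q(s, a, mean_action) replaces Q(s, a, a_1, ..., a_N).

def mean_field_action(neighbor_actions, num_actions):
    """Average one-hot neighbor actions into a single mean-field vector."""
    mean = [0.0] * num_actions
    for a in neighbor_actions:
        mean[a] += 1.0 / len(neighbor_actions)
    return mean

# Four neighbors choosing among three discrete actions.
print(mean_field_action([0, 0, 1, 2], num_actions=3))  # [0.5, 0.25, 0.25]
```

The key payoff is that the conditioning input has fixed size `num_actions` regardless of how many thousands of agents are present.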

The Matching Window Optimization

There have been several RL works on the matching window optimization, which can be done from the perspective of a request itself[14] or the system [18] [19].

Table 4: RL methods for matching window optimization.
Ke, Yang, Ye et al. (2020): train an agent network centrally using pooled experience from all agents to decide whether or not to delay the matching of a request to the next review window; all agents share the same policy. To encourage cooperation among the agents, a specially shaped reward function accounts for both local and global reward feedback. They also modify the RL training framework to address the delayed-reward issue by sampling complete trajectories at the end of training epochs to update the network parameters.
Wang et al. (2019): take a system's view and propose a Restricted Q-learning algorithm to determine the length of the current review window (or batch size). They provide theoretical performance guarantees in terms of the competitive ratio for dynamic bipartite graph matching with adaptive windows.
Qin, Luo, Yin, Sun & Ye (2021): use actor-critic with experience replay (ACER), combining on-policy updates (through a queuing-based simulator) with off-policy updates.
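The per-request delay decision can be caricatured as a one-step value comparison; the values and penalty below are illustrative, not from the cited papers:

```python
# Delay matching only when the expected gain from a better future match
# outweighs the cost of extra pre-match waiting (and cancellation risk).

def should_delay(match_now_value, expected_future_value, delay_penalty):
    """Return True if deferring to the next review window looks better."""
    return expected_future_value - delay_penalty > match_now_value

# Few idle drivers now; a better match is likely next window.
print(should_delay(match_now_value=3.0, expected_future_value=5.0,
                   delay_penalty=1.0))  # True
# A good match is available now; waiting is not worth the risk.
print(should_delay(match_now_value=4.5, expected_future_value=5.0,
                   delay_penalty=1.0))  # False
```

In the RL works above, `expected_future_value` is not a hand-set number but a learned value estimate, and the decision is taken either per request or for the whole batch window.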


  1. Qin, Z., Zhu, H., & Ye, J. (2021). Reinforcement Learning for Ridesharing: An Extended Survey. arXiv e-prints, arXiv-2105.
  2. Yan, C., Zhu, H., Korolko, N., & Woodard, D. (2020). Dynamic pricing and matching in ride‐hailing platforms. Naval Research Logistics (NRL), 67(8), 705-724.
  3. Özkan, E., & Ward, A. R. (2020). Dynamic matching for real-time ride sharing. Stochastic Systems, 10(1), 29-70.
  4. Qin, Z., Tang, X., Jiao, Y., Zhang, F., Xu, Z., Zhu, H. & Ye, J. (2020), ‘Ride-hailing order dispatching at didi via reinforcement learning’, INFORMS Journal on Applied Analytics 50(5), 272–286.
  5. Bei, X. & Zhang, S. (2018), Algorithms for trip-vehicle assignment in ride-sharing, in ‘Thirty-second AAAI conference on artificial intelligence’.
  6. Tong, Y., Zhou, Z., Zeng, Y., Chen, L. & Shahabi, C. (2020), ‘Spatial crowdsourcing: a survey’, The VLDB Journal 29(1), 217–250.
  7. Xu, Z., Li, Z., Guan, Q., Zhang, D., Li, Q., Nan, J., Liu, C., Bian, W. & Ye, J. (2018), Largescale order dispatch in on-demand ride-hailing platforms: A learning and planning approach, in ‘Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining’, ACM, pp. 905–913.
  8. Özkan, E., & Ward, A. R. (2020). Dynamic matching for real-time ride sharing. Stochastic Systems, 10(1), 29-70.
  9. Wang, Z., Qin, Z., Tang, X., Ye, J. & Zhu, H. (2018), Deep reinforcement learning with knowledge transfer for online rides order dispatching, in ‘International Conference on Data Mining’, IEEE.
  10. Tang, X., Qin, Z., Zhang, F., Wang, Z., Xu, Z., Ma, Y., Zhu, H. & Ye, J. (2019), A deep value-network based approach for multi-driver order dispatching, in ‘Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining’, pp. 1780–1790.
  11. Ke, J., Yang, H., Ye, J. et al. (2020), ‘Learning to delay in ride-sourcing systems: a multi-agent deep reinforcement learning framework’, IEEE Transactions on Knowledge and Data Engineering .
  12. Lyu, G., Cheung, W. C., Teo, C.-P. & Wang, H. (2019), ‘Multi-objective online ride-matching’, Available at SSRN 3356823 .
  13. Zhou, F., Lu, C., Tang, X., Zhang, F., Qin, Z., Ye, J. & Zhu, H. (2021), Multi-objective distributional reinforcement learning for large-scale order dispatching, in ‘2021 IEEE International Conference on Data Mining (ICDM)’, IEEE, pp. 1541–1546.
  14. Yang, H., Qin, X., Ke, J. & Ye, J. (2020), ‘Optimizing matching time interval and matching radius in on-demand ride-sourcing markets’, Transportation Research Part B: Methodological 131, 84–105.
  15. Zhang, W., Wang, Q., Li, J. & Xu, C. (2020), ‘Dynamic fleet management with rewriting deep reinforcement learning’, IEEE Access 8, 143333–143341.
  16. Chen, Y., Qian, Y., Yao, Y., Wu, Z., Li, R., Zhou, Y., Hu, H. & Xu, Y. (2019), ‘Can sophisticated dispatching strategy acquired by reinforcement learning?-a case study in dynamic courier dispatching system’, arXiv preprint arXiv:1903.02716.
  17. Shen, W., He, X., Zhang, C., Ni, Q., Dou, W. & Wang, Y. (2020), Auxiliary-task based deep reinforcement learning for participant selection problem in mobile crowdsourcing, in ‘Proceedings of the 29th ACM International Conference on Information & Knowledge Management’, pp. 1355–1364.
  18. Wang, Y., Tong, Y., Long, C., Xu, P., Xu, K. & Lv, W. (2019), Adaptive dynamic bipartite graph matching: A reinforcement learning approach, in ‘2019 IEEE 35th International Conference on Data Engineering (ICDE)’, IEEE, pp. 1478–1489.
  19. Qin, G., Luo, Q., Yin, Y., Sun, J. & Ye, J. (2021), ‘Optimizing matching time intervals for ride-hailing services using reinforcement learning’, Transportation Research Part C: Emerging Technologies 129, 103239.