Contributors: Zhicheng Jin and Qi Luo.
Public transportation networks such as bus transit networks or bike sharing system are an integral part of our cities, which relies heavily on optimal planning of routes. The quality of the routes directly inﬂuences the quality of service provided to passengers, in terms of coverage, directness, and in-vehicle travel time. In addition, it affects the proﬁtability of the transportation system, since the network structure directly inﬂuences the operational costs.
According to Ceder , the design phase of bus networks can be divided into ﬁve main stages: network design, frequencies setting, timetable development, bus scheduling and driver scheduling. The stages are highly dependable on each other, and the output of the last stage can be the starting point of another iteration of the planning process. Each stage can be thought of as a combinatorial problem, with a computational complexity lying in the NP-Hard domain . This makes it almost impossible to optimise the whole pipeline with our current computational resources using one single algorithm.
As for bike sharing system, Guofu Liargues that DRL has its special strengths in optimizing the resource allocation and schedule in large transportation system, and is tested on the rebalancing problem in bike sharing system (BSS) . Researchers suggest that there are two broad ways to categorize the rebalancing problem in a bike sharing system :
1.According to the type of the entity who conducts the reposition behavior: a. Operator-based rebalancing is conducted by the organization that runs the service; b. User-based rebalancing is conducted by the end user, encouraged by some incentive to return the bike to a requested near-by position;
2. According to the time of repositioning: a. Static rebalancing is conducted when the system is relatively static, e.g., in the midnight; b. Dynamic rebalancing is conducted during the day when the business is still running and the distribution of the bike is constantly changing.
Connection to RL method
All the papers we surveyed involved pre-determined heuristics which are used to traverse the search tree of solutions until a good enough solution was found. Based on some insights from the survey by Bengio et al., Darwish  decided to experiment with the possibility of learning those heuristics using connectionist approaches. The survey did not discourage the use of this approach for highly-dimensional problems like the Transit Network Design and Frequency Setting Problem (TNDFSP).
Several papers have been found trying to solve different graph combinatorial problems, such as the Travelling Salesman Problem (TSP), the Vehicle Routing Problem (VRP), and the Orienteering Problem (OP), all of which are, more or less, very speciﬁc instances of our problem.
Bello et al.  have utilized a Pointer Network with an attention layer as an agent, deciding which node to visit in the TSP problem. The Actor-Critic approach was used to train the agent. Dai et al.  implemented a self feeding network which takes as an input the embedding of a whole graph, while masking out the nodes the agent is not allowed to pick, and keeps inputting the output back into the network until the solution is complete. Nazari et al. proposed removing the LSTM encoder of pointer network from Bello’s work, since they claim there is no temporal value in the order the nodes are input to the network. In addition, they also propose inputting a ”dynamic element” to the attention mechanism, such that the network is more robust against dynamic properties of the Vehicle Routing Problem (VRP), and they utilize the A3C training method.
Darwish explain how an instance of our problem is transformed to a combinatorial optimization problem that can be solved by the model. 1) Encoder Input: To provide a better context to the decoder, the input graph is augmented with the demand and travel time information, which are better represented as edges, since the nodes have no useful intrinsic properties. In such case, the demand and the travel matrices can be thought of as adjacency matrices representing edges of the road network. The nodes of the graph are, consequently, redundant, and will not be input to the model, but will be used for the mask function, and calculating the reward. The model, as a result of this change, will be outputting a sequence of consecutive edges from the graph. 2) Decoder Context: For our problem, additional values are added to the context vector, most of which were mentioned in the Assumptions and Hyperparameters section. 3) The Masking Function: The main purpose of the masking function is to make sure the constraints mentioned in Assumptions and Hyperparameters are met to make sure that the decoded output is valid. In addition, we need to ensure that the chosen edges are consecutive, and actually exist in the original road network. This is done by keeping track of all visited nodes and edges, as well as a count of ﬁnished routes, and number of nodes accumulated so far in the current route. 4) Output Pre-processing and Reward Calculation: After the decoding process is ﬁnished and the solutions are formed, the costs of the decoded solutions, which denote the scalar values U and O in the problem formulation, must be calculated. 5) Reward Calculation. As shown in Figure 1, there are three main ways in which deep RL (DRL) solves combinatorial optimization problems: (1) using the problem as input to give a solution or heuristic directly; (2) integrating or expanding traditional operations research algorithms (OR) to increase the value of valuable information. (3) traditional operations research algorithms iteratively querying the same deep RL model to make decisions. The DRL model takes the current state of the algorithm as input, which may include problems.
Figure. 1. Deep RL (DRL) solves combinatorial optimization problems .
One of the most important factors that would make our deep learning approach more sustainable and quicker to train on different architectures and reward functions is the concurrency of the reward function. However, due to the difﬁculty making speciﬁc parts of our reward function parallel, like Dijkstra’s function and the Graph Modiﬁer, training was slowed down by those sequential bottlenecks. This prevented us from trying out enough variations of the weights, or trying to have the model train on outputting a diverse number of routes. What also necessitates the parallelism of our reward function is the need for optimizing bigger networks, that have a number of stations in the order of the hundreds
Darwish using Deep Reinforcement Learning to solve one of the oldest well-established problems in public transportation, the TNDFSP. This approach was never used before in literature, and it obviates the need for heuristics of any kind for reaching a good solution. The proposed system showed great potential of ﬁnding optimized solutions that satisfy both the passengers and operators’ objectives. The networks produced by our model were very competitive in comparison with state-of-the-art algorithms, when applied to Mandl’s benchmark network.
- Darwish A, Khalil M, Badawi K. Optimising Public Bus Transit Networks Using Deep Reinforcement Learning[C]//2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2020: 1-7.
- Li G, Cao N, Zhu P, et al. Towards Smart Transportation System: A Case Study on the Rebalancing Problem of Bike Sharing System Based on Reinforcement Learning[J]. Journal of Organizational and End User Computing (JOEUC), 2021, 33(3): 35-49.
- Ceder, A., & Wilson, N. H. M. (1986). BUS NETWORK DESIGN. Transportation Research Part B-Methodological, 20(4), 331–344.
- Magnanti, T. L., & Wong, R. T. (1984). Network Design and Transportation Planning: Models and Algorithms. Transportation Science, 18(1), 1–55.
- DeMaio, P. (2009). Bike-sharing: History, impacts, models of provision, and future. Journal of Public Transportation, 12(4), 3.
- Shaheen, S., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia: Past, present, and future. Transportation Research Record: Journal of the Transportation Research Board, (2143), 159–167.
- Contardo, C., Morency, C., & Rousseau, L. M. (2012). Balancing a dynamic public bike-sharing system (Vol. 4). Cirrelt.
- Bengio, Y., Lodi, A., & Prouvost, A. (2018). Machine Learning for Combinatorial Optimization: a Methodological Tour d’Horizon. ArXiv Preprint ArXiv:1811.06128.
- Bello, I., Pham, H., Le, Q. V., Norouzi, M., & Bengio, S. (2017). Neural Combinatorial Optimization with Reinforcement Learning. In ICLR (Workshop).
- Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., & Song, L. (2017). Learning combinatorial optimization algorithms over graphs. In NIPS’17 Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6351–6361).
- Nazari, M., Oroojlooy, A., Snyder, L., & Takac, M. (2018). Reinforcement Learning for Solving the Vehicle Routing Problem. In NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems (pp. 9861–9871).
- Wang Q, Tang C. Deep reinforcement learning for transportation network combinatorial optimization: A survey[J]. Knowledge-Based Systems, 2021, 233: 107526.