From RL for transport research

This page is mainly based on the surveys by Zong et al.[1] and Yan et al.[2].

Problem statement[edit]

Urban logistics is a broad topic, as different logistics companies and organizations may face their own distinct problem statements. Nevertheless, in terms of problem structure, many applications in urban logistics can be associated with a set of routing problem variants[2].

Generally, these routing problems can be derived from the classical Traveling Salesman Problem (TSP) or Vehicle Routing Problem (VRP). For example, an express van may be assigned more than a hundred delivery demands within its current service loop, and the driving route of a single truck or an entire fleet must be pre-planned to ensure efficiency and optimize the cost of cargo transportation. This page introduces RL studies that are most relevant to logistics, mainly variants of the TSP and VRP.

Connection to RL methods[edit]

Routing is perhaps the most popular combinatorial optimization problem in logistics and freight transportation, and it is also a popular testbed for assessing the performance of RL approaches and for developing highly efficient RL algorithms for NP-hard problems. Generally, a solution to such a combinatorial problem can be described as a sequence, and reinforcement learning can transform the process of generating this sequence into a multi-stage decision problem. Compared with traditional optimization methods, learning-based methods have shorter solution times and are expected to provide a more general solution framework.

End-to-end framework[edit]

One class of methods is called end-to-end: the input is the problem parameters, and the output is the solution sequence. Taking the VRP as an example, a common approach is to extend a partial sequence step by step until a complete solution is obtained. In this MDP formulation, a state transition adds a new unvisited node to the current solution, which naturally defines the action of the sequence-generation agent. In each decision step, the action space is the set of all unvisited nodes, and the agent selects one of them as the next node to visit.
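A minimal sketch of this decoding loop is shown below. A distance-based heuristic stands in for the learned policy network, and the names `construct_route` and `nearest_score` are illustrative, not from the cited papers.

```python
import math

def construct_route(coords, policy_score, start=0):
    """Greedy end-to-end decoding: in each step, score all unvisited
    nodes with the (stand-in) policy and visit the best one."""
    unvisited = set(range(len(coords))) - {start}
    route = [start]
    while unvisited:
        current = route[-1]
        # action space = all unvisited nodes; pick the highest-scoring one
        nxt = max(unvisited, key=lambda j: policy_score(coords, current, j))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

# Stand-in "policy": prefer nearer nodes (a trained network would replace this)
def nearest_score(coords, i, j):
    xi, yi = coords[i]
    xj, yj = coords[j]
    return -math.hypot(xi - xj, yi - yj)

coords = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(construct_route(coords, nearest_score))  # -> [0, 1, 2, 3]
```

In the cited works, the scoring function is a neural network conditioned on the whole partial sequence rather than only on the current node.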

Research in this line focuses on the training algorithm and network structure, such as policy gradient with an RNN[3], policy gradient with a Transformer[4], and DQN with a GNN[5], etc.
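The policy-gradient variants above all optimize essentially the same REINFORCE-style surrogate loss. A toy sketch follows; the function name and scalar baseline are illustrative, and in the cited works the log-probabilities come from the RNN/Transformer policy network.

```python
def reinforce_loss(log_probs, tour_cost, baseline):
    """REINFORCE surrogate: (cost - baseline) * sum of chosen log-probs.
    When the tour cost is below the baseline, the advantage is negative,
    so minimising this loss increases the chosen actions' log-probs."""
    advantage = tour_cost - baseline
    return advantage * sum(log_probs)

# A tour cheaper than the baseline (cost 3 < baseline 4) is reinforced:
loss = reinforce_loss([-0.2, -0.5, -0.1], tour_cost=3.0, baseline=4.0)
print(round(loss, 6))  # -> 0.8
```

The baseline (e.g. a greedy rollout or a learned critic in the cited papers) reduces the variance of the gradient estimate without biasing it.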

FIGURE 1 The illustration of sequence generation methods for generating VRP solutions via RL. In each decision step, the agent selects the next provider/target location to visit[1].

Learning-to-improve heuristic framework[edit]

Besides generating partial solutions until completeness, researchers have also explored other MDP formulations for solving VRPs. The intuition originates from the continuous modification of a current solution in operations research, which is the core idea of many practical heuristics for the VRP. Following this idea, researchers have attempted to parameterize such a modification process to continuously improve solution quality[1].

These algorithms usually prefabricate a variety of heuristic operators manually for the problem, then improve the current solution by applying the operator chosen by a selection strategy that has been learned via DRL[6][7][8].
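A minimal sketch of this improve-by-operator loop is given below, with a single hand-coded 2-opt operator and random move selection standing in for the learned DRL selection strategy (all names illustrative).

```python
import random

def two_opt(route, i, k):
    """Classic 2-opt operator: reverse the segment route[i..k]."""
    return route[:i] + route[i:k + 1][::-1] + route[k + 1:]

def route_cost(route, dist):
    return sum(dist[route[t]][route[t + 1]] for t in range(len(route) - 1))

def improve(route, dist, steps=100, seed=0):
    """Improvement loop: an operator (here a random 2-opt move standing in
    for a DRL-selected one) repeatedly modifies the current solution,
    and the best solution found so far is kept."""
    rng = random.Random(seed)
    best = route
    for _ in range(steps):
        i, k = sorted(rng.sample(range(1, len(route) - 1), 2))
        candidate = two_opt(best, i, k)
        if route_cost(candidate, dist) < route_cost(best, dist):
            best = candidate
    return best

# Customers on a line; the initial route [0, 2, 1, 3] crosses itself
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
print(improve([0, 2, 1, 3], dist))  # -> [0, 1, 2, 3]
```

In the cited works, the learned policy replaces the random choice of `(i, k)` and may also choose among several operator types.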

FIGURE 2 The illustration of learning heuristic methods for generating VRP solutions via RL. An initial solution is pre-established. In each decision step, the agent applies one of the pre-defined rules to modify the current solution, thus improving the overall solution quality[1].


Main results[edit]

Static model of routing problems[edit]

The static model is to design a routing strategy with minimum cost for a fleet of vehicles, given all parameters of a set of customers in advance. The focus of this line of research is to apply DRL to classical NP-hard combinatorial optimization problems.

As for the models, the studies in this category do not have a specific real-world background and usually address the basic model and its simple variants. In terms of solutions, most frameworks add deep neural networks to the reinforcement learning framework to cope with the complexity of NP-hard problems.
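Concretely, for the basic TSP instance of this static setting, the task is to find a permutation π of the n customers minimizing the total tour length, where d denotes the travel cost between two customers:

```latex
\min_{\pi} \; \left( \sum_{i=1}^{n-1} d\big(\pi_i, \pi_{i+1}\big) + d\big(\pi_n, \pi_1\big) \right)
```

The VRP variants in the table below add further constraints (vehicle capacities, multiple routes) to this same sequence-optimization objective.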

Table 1
Paper                 | Method             | Training        | Network
Bello et al. (2016)   | end-to-end         | Policy gradient | RNN
Nazari et al. (2018)  | end-to-end         | Policy gradient | Transformer
Drori et al. (2020)   | end-to-end         | DQN             | GNN
Chen et al. (2019)    | learning heuristic | Actor-Critic    | RNN
Wu et al. (2021)      | learning heuristic | DQN             | RNN
Yao et al. (2021)     | learning heuristic | DQN             | GNN

Dynamic model of routing problems[edit]

Many difficult combinatorial optimization problems have been modeled as static problems. In practice, however, many problems are dynamic: some decisions must be made before all the input data are known. For example, in the Dynamic Vehicle Routing Problem (DVRP), new customer orders appear over time, and routes must be reconfigured while the current solution is being executed[9]. The dynamic model focuses on the randomness of the environment: instead of a predetermined route, the platform needs a set of policies that guide the fleet on where to go after reaching each destination[10][11][12].

Table 2
Paper                     | Contribution
Zhang et al. (2020)       | A Multi-Agent Attention Model, a novel reinforcement learning algorithm that, after lengthy offline training, can instantly solve the multi-vehicle routing problem with soft time windows.
Joe et al. (2020)         | DRLSA, which combines deep reinforcement learning (neural-network-based temporal-difference learning with experience replay) to approximate the value function with a Simulated Annealing routing heuristic, for a dynamic vehicle routing problem with time windows and both known and stochastic customers.
Bogyrbayeva et al. (2021) | A reinforcement learning approach for nightly offline rebalancing operations in free-floating electric vehicle sharing systems.
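The key mechanic these dynamic models share, deciding the next stop from only the orders known so far while new orders keep arriving, can be sketched in a few lines. The simulation below is a toy illustration: `simulate_dynamic_vrp` and the nearest-order rule are illustrative stand-ins for the learned dispatching policies in the cited papers.

```python
import random

def simulate_dynamic_vrp(policy, initial_orders, horizon=10, arrival_rate=0.5, seed=0):
    """Toy DVRP loop: orders arrive stochastically over time, and after
    each stop the policy picks the next destination from known orders."""
    rng = random.Random(seed)
    pending = list(initial_orders)   # orders are (x, y) locations
    position = (0.0, 0.0)            # start at the depot
    served = []
    for _ in range(horizon):
        if rng.random() < arrival_rate:               # a new order appears
            pending.append((rng.uniform(0, 10), rng.uniform(0, 10)))
        if pending:
            nxt = policy(position, pending)           # decide using known orders only
            pending.remove(nxt)
            served.append(nxt)
            position = nxt
    return served, pending

# Stand-in policy: serve the nearest known order (a DRL policy would replace this)
def nearest(position, pending):
    return min(pending, key=lambda p: (p[0] - position[0]) ** 2 + (p[1] - position[1]) ** 2)

served, left = simulate_dynamic_vrp(nearest, [(1, 1), (5, 5)])
print(served[:2])  # -> [(1, 1), (5, 5)]
```

Unlike the static setting, the quality of a policy here must be judged over the whole stochastic horizon, which is why the cited works train value functions or attention policies offline on many simulated instances.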


  1. Zong Z, Feng T, Xia T, et al. Deep Reinforcement Learning for Demand Driven Services in Logistics and Transportation Systems: A Survey[J]. arXiv preprint arXiv:2108.04462, 2021.
  2. Yan Y, Chow A H F, Ho C P, et al. Reinforcement learning for logistics and supply chain management: Methodologies, state of the art, and future opportunities[J]. Transportation Research Part E: Logistics and Transportation Review, 2022, 162: 102712.
  3. Bello I, Pham H, Le Q V, et al. Neural combinatorial optimization with reinforcement learning[J]. arXiv preprint arXiv:1611.09940, 2016.
  4. Nazari M, Oroojlooy A, Snyder L, et al. Reinforcement learning for solving the vehicle routing problem[J]. Advances in neural information processing systems, 2018, 31.
  5. Drori I, Kharkar A, Sickinger W R, et al. Learning to solve combinatorial optimization problems on real-world graphs in linear time[C]//2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2020: 19-24.
  6. Chen X, Tian Y. Learning to perform local rewriting for combinatorial optimization [J]. Advances in neural information processing systems, 2019, 32.
  7. Wu Y, Song W, Cao Z, et al. Learning improvement heuristics for solving routing problems [J]. IEEE transactions on neural networks and learning systems, 2021.
  8. Yao F, Cai R, Wang H. Reversible action design for combinatorial optimization with reinforcement learning [J]. arXiv preprint arXiv:2102.07210, 2021.
  9. Hanshar F T, Ombuki-Berman B M. Dynamic vehicle routing using genetic algorithms[J]. Applied Intelligence, 2007, 27(1): 89-99.
  10. Zhang K, He F, Zhang Z, et al. Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach [J]. Transportation Research Part C: Emerging Technologies, 2020, 121: 102861.
  11. Joe W, Lau H C. Deep reinforcement learning approach to solve dynamic vehicle routing problem with stochastic customers [C]. Proceedings of the International Conference on Automated Planning and Scheduling. 2020: 394-402.
  12. Bogyrbayeva A, Jang S, Shah A, et al. A reinforcement learning approach for rebalancing electric vehicle sharing systems [J]. IEEE Transactions on Intelligent Transportation Systems, 2021.