Point-to-Point Navigation of a Fish-Like Swimmer in a Vortical Flow With Deep Reinforcement Learning

Zhu, Yi; Pang, Jian-Hua; Tian, Fang-Bao

doi:10.3389/fphy.2022.870273

ORIGINAL RESEARCH article

Front. Phys., 09 May 2022

Sec. Biophysics

Volume 10 - 2022 | https://doi.org/10.3389/fphy.2022.870273

Yi Zhu¹

Jian-Hua Pang¹,²*

Fang-Bao Tian³*

¹Ocean Intelligence Technology Center, Shenzhen Institute of Guangdong Ocean University, Shenzhen, China
²College of Ocean Engineering, Guangdong Ocean University, Zhanjiang, China
³School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia

Efficient navigation in complex flows is of crucial importance for robotic applications. This work presents a numerical study of the point-to-point navigation of a fish-like swimmer in a time-varying vortical flow with a hybrid method of deep reinforcement learning (DRL) and immersed boundary–lattice Boltzmann method (IB-LBM). The vortical flow is generated by placing four stationary cylinders in a uniform flow. The swimmer is trained to discover effective navigation strategies that help it reach a given destination point in the flow field, utilizing only the time-sequential information of position, orientation, velocity and angular velocity. After training, the fish can reach its destination from random positions and orientations, demonstrating the effectiveness and robustness of the method. A detailed analysis shows that the fish utilizes highly subtle tail flapping to control its swimming orientation and take advantage of the reduced streamwise flow area to reach its destination, while at the same time avoiding the high flow velocity area.

1 Introduction

Finding the time-optimal path between two given points in a complex flow is known as Zermelo’s navigation problem [1]. This problem is a key issue for many robotic and engineering applications, including micro-swimmers [2,3], fish-like underwater vehicles [4], unmanned drones [5], and weather balloons [6]. In realistic environments, different structures interact with disturbances like wind, waves and currents, generating abundant vortices that could significantly affect the operation of these robots [7] and render predefined control algorithms ineffective. In this work, we tackle Zermelo’s problem for the point-to-point navigation of a fish-like swimmer in a vortical flow environment. Typical application scenarios include oceanic supervision [8], fishery conservation and intervention on offshore structures [9].

Naive control strategies are usually ineffective or inefficient in vortical environments [10], since the vortices can easily deflect the vehicles from their desired path [11]. Numerous methods have been proposed to design a customized optimal path for a given environment, ranging from classical optimal control theory [12] to modern optimization approaches [13,14]. An important feature of these methods is that they require knowledge of the dynamics of the background flow [15]. However, in real-world applications, it is impractical to measure the entire flow environment in advance, as ocean and air currents are too variable to be fully measured [15]. In addition, the vehicles themselves can significantly alter the surrounding flow fields, making them even more unpredictable [15].

Reinforcement learning (RL) offers a promising alternative for solving Zermelo’s navigation problem in complex time-varying environments. Compared to the classical methods, RL possesses two main advantages. The first advantage is that it does not require any prior knowledge of the environment [16]. Instead, it automatically develops an understanding of the dynamics of the environment through trial and error. The other advantage is that the influence of the historical states can be easily taken into consideration [17]. Therefore, the correlation between an action and its effect can be accurately captured even when there is a delay between them and when historical actions have measurable impacts. Colabrese et al. [3] first demonstrated that reinforcement learning is an efficient way to address Zermelo’s problem. They adopted this method to train a point-like swimmer in an Arnold-Beltrami-Childress (ABC) flow to navigate vertically as quickly as possible. The swimmer was assumed to swim with constant speed, and its direction was decided by the combined effect of a shear-induced viscous torque and a torque applied by the swimmer to orient itself in a desired direction. The applied torque was chosen by measuring the instantaneous swimming direction and the local flow vorticity. The authors found that the smart swimmer can take advantage of upwelling flows to accelerate upward navigation and avoid being trapped in the vortices. This work motivated a series of studies investigating point-to-point navigation in different flows and with different actions [7,10,15,18–26].

The above studies demonstrated the potential of reinforcement learning in solving navigation problems in complex flows. However, several simplifications were used for a better comparison with the traditional control methods. Firstly, most of these studies adopted simplified flow models to avoid the actual complexity and unpredictability of a time-varying fluid flow. Secondly, idealized models of the swimmers and their actions were utilized. In most studies, the swimmer is considered to be an infinitely small point, which has negligible influence on the background flow. Moreover, the propulsion mechanisms of those swimmers are not modeled. Instead, it is assumed that the swimmers have full control of their own velocities. Those assumptions neglect the complex interaction between the swimmers and the environmental flows, such as time delays between sensing, actions and rewards. In this work, we investigate the point-to-point navigation of a fish-like swimmer in a vortical flow with a hybrid method of deep reinforcement learning (DRL) and immersed boundary–lattice Boltzmann method (IB-LBM). Compared with previous works, the present work utilizes a full model of both the flow and the swimmer. Specifically, the vortical flow is numerically generated with IB-LBM by placing four cylinders in a uniform flow, and the fish-like swimmer propels itself by periodically undulating its body to push the surrounding fluid backwards. This setup retains the complex nonlinear interaction between the swimmer and the flow.

The rest of the paper is organized as follows. The numerical methods are briefly introduced in Section 2. The results of the simulation are discussed in Section 3. The conclusions are provided in Section 4.

2 Methodology

The methodology used here is almost the same as that in our previous work [27]. We briefly describe it here for completeness. More details of the method and its validations can be found in our previous work.

2.1 Kinematic Model of the Fish

The half thickness of the body is mathematically approximated by

\frac{d}{L} = 0.2610 \sqrt{\frac{l}{L}} - 0.3112 (\frac{l}{L}) + 0.1371 {(\frac{l}{L})}^{2} - 0.0791 {(\frac{l}{L})}^{3} - 0.0078 {(\frac{l}{L})}^{4}, (1)

where l is the arc length along the mid-line of the body, and L is the body length which is a constant during the swimming [28].
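As an illustration, Eq. 1 can be evaluated directly. The following is a minimal Python sketch; the function name `half_thickness` and the sampling of eleven points are illustrative choices, not part of the original solver.

```python
import math


def half_thickness(l_over_L: float) -> float:
    """Return d/L from Eq. 1 for a normalized arc length l/L in [0, 1]."""
    s = l_over_L
    return (0.2610 * math.sqrt(s)
            - 0.3112 * s
            + 0.1371 * s**2
            - 0.0791 * s**3
            - 0.0078 * s**4)


# Sample the profile from head (l/L = 0) to tail tip (l/L = 1).
profile = [(i / 10, half_thickness(i / 10)) for i in range(11)]
for s, d in profile:
    print(f"l/L = {s:.1f}  d/L = {d:.4f}")
```

The five coefficients sum to zero, so the profile closes at the tail tip (l/L = 1) as well as at the head.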

The motion of the fish body is composed of the translation of the mass center, the body rotation around the mass center and the body undulation in the local coordinate system (Figure 1). The translational and rotational motions of the fish are determined by the fluid-structure interaction (FSI) in the global coordinate system according to Newton’s laws of motion. The FSI equations are solved by an explicit FSI coupling method as in Refs. [27,30]. The undulatory motion is controlled by the fish itself and can be taken as the superposition of different waves propagating from head to tail. A polynomial-based waveform is adopted for each wave, and the kinematics of the newly generated wave can be changed every half cycle. In the nth half cycle, the mid-line lateral displacement is determined by

θ_{l}(l, t) = \frac{l}{L} h\left[\frac{λ_{n}}{T_{n}} (t - t_{0n}) - \frac{l}{L}\right], \quad h_{l}(l, t) = \int_{0}^{l} \sin(θ_{l}) \, dl, (2)

where θ_l is the deflection angle of the mid-line with respect to axis x_l as shown in Figure 1, λ_n is the wavelength, T_n is the period, t is the time, t_0n = 0 for n = 1 and $\sum_{i = 1}^{n - 1} T_{i}$ for n > 1, and h is the waveform function described by

h (ζ) = c_{0} + c_{1} ζ + c_{2} ζ^{2} + c_{3} ζ^{3} + c_{4} ζ^{4} + c_{5} ζ^{5}, (3)

where c₀₋₅ can be determined by $h(0) = (θ_{lmax})_{n-1}$, $h(λ_{n}/2) = (θ_{lmax})_{n}$, $h'(0) = h'(λ_{n}/2) = 0$, $h''(0) = -h(0)(2π/λ_{n-1})^{2}$, and $h''(λ_{n}/2) = -h(λ_{n}/2)(2π/λ_{n})^{2}$. Here $(θ_{lmax})_{n}$ is the maximum deflection angle at the tail tip of the nth half wave.
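The six conditions above form a linear system for the six coefficients of Eq. 3. The following Python sketch shows one way to solve it. The helper names (`solve_linear`, `waveform_coefficients`, `h`, `dh`) and the use of plain Gaussian elimination are assumptions made for illustration, and the example amplitudes are arbitrary.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def waveform_coefficients(theta_prev, theta_next, lam_prev, lam_next):
    """Return [c0, ..., c5] of Eq. 3 from the six boundary conditions."""
    z = lam_next / 2.0  # end of the half wave in the argument of h

    def h_row(s):    # coefficients of c_k in h(s)
        return [s**k for k in range(6)]

    def dh_row(s):   # coefficients of c_k in h'(s)
        return [k * s**(k - 1) if k >= 1 else 0.0 for k in range(6)]

    def d2h_row(s):  # coefficients of c_k in h''(s)
        return [k * (k - 1) * s**(k - 2) if k >= 2 else 0.0 for k in range(6)]

    A = [h_row(0.0), dh_row(0.0), d2h_row(0.0),
         h_row(z), dh_row(z), d2h_row(z)]
    b = [theta_prev, 0.0, -theta_prev * (2 * math.pi / lam_prev) ** 2,
         theta_next, 0.0, -theta_next * (2 * math.pi / lam_next) ** 2]
    return solve_linear(A, b)


def h(c, s):
    """Evaluate the waveform polynomial of Eq. 3."""
    return sum(ck * s**k for k, ck in enumerate(c))


def dh(c, s):
    """Evaluate the first derivative of the waveform polynomial."""
    return sum(k * ck * s**(k - 1) for k, ck in enumerate(c) if k >= 1)


# Example: previous amplitude -30 deg, next amplitude +50 deg, wavelength L.
coeffs = waveform_coefficients(math.radians(-30), math.radians(50), 1.0, 1.0)
print([round(ck, 4) for ck in coeffs])
```

Because the conditions at ζ = 0 reuse the amplitude and wavelength of the previous half wave, consecutive half waves join with continuous value, slope and curvature.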

FIGURE 1. A schematic illustration of the motion of the fish (Adapted from Ref. [27,29]).

2.2 Immersed Boundary–Lattice Boltzmann Method

The lattice Boltzmann method (LBM) is used to simulate the fluid dynamics [31,32]. Instead of solving the Navier-Stokes equations, the LBM solves the discrete lattice Boltzmann equation which governs the kinematics of the mesoscopic particles,

f_{i} (r + c_{i} Δ t, t + Δ t) - f_{i} (r, t) = Ω_{i} (r, t) + Δ t G_{i} (r, t), i = 0, \dots, 8 (4)

where f is the particle density distribution function, r = (x, y) is the space coordinate, c_i is the discrete lattice velocity, Δt is the time step, Ω_i is the collision operator, and G_i is the source term representing the body force. A detailed description of this equation can be found in Ref. [33]. f in the whole flow field can be acquired from well-defined boundary conditions, such as the no-slip velocity condition on the boundary of the swimmer model. Once f is known, macroscopic physical quantities such as the fluid density, pressure and velocity can be computed from

ρ = \sum f_{i}, p = ρ c_{s}^{2}, u = \frac{1}{ρ} (\sum f_{i} c_{i} + \frac{Δ t g}{2}), (5)

where c_s is the lattice speed of sound in the fluid, and g is the body force. The force and torque on the swimmer model can then be computed from those macroscopic physical quantities.
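The moments in Eq. 5 can be sketched for a single lattice node as follows. The nine-velocity index range in Eq. 4 corresponds to a D2Q9 lattice; the velocity ordering, the weights and the value c_s² = 1/3 (lattice units) used below are the standard D2Q9 choices and are assumptions here, since the text does not list them.

```python
# Standard D2Q9 lattice velocities and weights (assumed ordering).
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4
CS2 = 1.0 / 3.0  # lattice speed of sound squared (assumed, lattice units)


def macroscopic(f, g=(0.0, 0.0), dt=1.0):
    """Return (rho, p, (ux, uy)) at one node from its nine distributions f_i.

    g is the body force at the node and dt the time step, as in Eq. 5.
    """
    rho = sum(f)
    p = rho * CS2
    ux = (sum(fi * c[0] for fi, c in zip(f, C)) + 0.5 * dt * g[0]) / rho
    uy = (sum(fi * c[1] for fi, c in zip(f, C)) + 0.5 * dt * g[1]) / rho
    return rho, p, (ux, uy)


# A fluid at rest with unit density: f_i equal to the lattice weights.
rho, p, u = macroscopic(W)
print(rho, p, u)
```

The half-step body-force term in the velocity is what couples the immersed boundary forcing to the macroscopic flow field.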

In addition, a diffusion immersed boundary method (IBM) [32,34–36] is utilized to handle the boundary condition at the fluid-structure interface. In this method, the influence of the boundary on the fluid is represented by a distribution of body force on the background Eulerian mesh nodes. Compared to body-conformal methods [37–39], grid generation in IBM is much easier for complicated shapes [32,40,41]. A multi-block geometry-adaptive Cartesian grid is also coupled with the IB–LBM to accelerate the computation. A detailed description of this numerical scheme and its validation can be found in Refs. [27,31,34,42–44]. The current method is first-order in accuracy.

2.3 Deep Reinforcement Learning

DRL is a machine learning method combining reinforcement learning with an artificial neural network. DRL has gained extensive attention due to its success in complex real-world problems [45]. In this study, a specific DRL method called the deep recurrent Q-network (DRQN) [46] is adopted, in which a long short-term memory recurrent neural network (LSTM-RNN) is used to process time-sequential data. The method includes two basic elements: a learning agent and its environment [3]. The agent interacts with the environment in a trial-and-error fashion to collect observations of the environment state (denoted by s), control actions (denoted by a), and rewards (denoted by rd) [47]. The goal of the agent is to learn a control policy (denoted by π(s, a)) that enables it to collect the highest rewards in a single try.

The interaction procedure between the environment (IB-LBM) and the agent (DRL) is shown in Figure 2. The interaction is divided into a sequence of discrete steps n = 0, 1, 2, 3, …. At step n, the agent detects state s_n and selects action a_n based on policy π(s, a). The environment then changes under the influence of the action. At step n + 1, in response to the change of the environment, the agent receives reward rd_{n+1} and finds itself in a new state s_{n+1}. A detailed explanation of the procedure can be found in Refs. [27,48]. Validations of the current solver for the hybrid method of DRL and IB-LBM can be found in Ref. [27].
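The step-by-step interaction described above can be sketched as a simple loop. In the following Python sketch, `ToyEnvironment` is a hypothetical one-dimensional stand-in for the IB-LBM solver and `always_max` is a placeholder policy. Neither is part of the authors' code, and the drift and reward rules are invented purely to make the loop runnable.

```python
class ToyEnvironment:
    """Hypothetical 1D stand-in for the IB-LBM solver (illustration only)."""

    def __init__(self, start=5.0, target=1.0):
        self.x = start        # swimmer position, in body lengths
        self.target = target  # destination, in body lengths

    def observe(self):
        return (self.x,)

    def step(self, action):
        # Toy dynamics: the flow pushes the swimmer downstream by 0.2 per
        # step, while action index a (0..8) moves it upstream by 0.1 * a.
        self.x += 0.2 - 0.1 * action
        reward = -abs(self.x - self.target)
        done = abs(self.x - self.target) < 0.3
        return self.observe(), reward, done


def run_episode(env, policy, max_steps=200):
    """Run one episode and return (total reward, number of steps)."""
    state = env.observe()                       # s_0
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state)                  # a_n selected from s_n
        state, reward, done = env.step(action)  # s_{n+1} and rd_{n+1}
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_max(state):
    """Placeholder policy: always pick the largest action index."""
    return 8


total_reward, steps = run_episode(ToyEnvironment(), always_max)
print(total_reward, steps)
```

In the actual method, the policy would be the trained DRQN and each call to `step` would advance the flow solver by one half swimming cycle.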

FIGURE 2. The interaction procedure between IB-LBM and DRL (Adapted from Ref. [29]).

3 Results and Discussion

3.1 The Hydrodynamics of a Uniform Flow Over Four Stationary Cylinders

A uniform flow over four stationary cylinders is simulated to produce a large-scale vortical flow environment that serves as the initial flow for the fish to swim in. The diameter of the cylinders is D = 0.8L, which is slightly smaller than the body length of the fish. The centers of the cylinders are placed at (−3L, 0.7L), (−3L, −2.1L), (0L, −0.7L) and (0L, 2.1L), respectively, as shown in Figure 3. This arrangement is used to generate a complex vortical flow via the interaction of the vortices shed from the two leading cylinders with the trailing cylinders.

FIGURE 3. The confined domain of the swimming.

The simulation is performed for a Reynolds number of Re = ρUL/μ = 400, or Re_cylinder = ρUD/μ = 320, where ρ is the density of the fluid, U is the incoming fluid velocity, and μ is the dynamic viscosity of the fluid. This Reynolds number is used because it generates sufficiently complex flows at reasonably low computational cost. The computational domain of 50L × 50L is divided into seven blocks with 98,373 grids. The minimum nondimensional grid spacing is Δx/L = Δy/L = 0.01 near the inner boundaries, and the nondimensional time step size is ΔtU/L = 0.0004. Validation has been performed to ensure the numerical results are independent of mesh size, domain size and time step size.
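The two Reynolds numbers and the resolution implied by the finest grid spacing follow from simple arithmetic, sketched below in Python. The variable names are illustrative; the numbers are taken from the text.

```python
RE_BODY = 400.0        # Re = rho * U * L / mu, based on the body length L
D_OVER_L = 0.8         # cylinder diameter in body lengths
MIN_DX_OVER_L = 0.01   # finest grid spacing in body lengths

# Re_cylinder uses D instead of L as the length scale.
re_cylinder = RE_BODY * D_OVER_L

# Number of finest-level grid cells spanning one body length and one diameter.
cells_per_body_length = round(1.0 / MIN_DX_OVER_L)
cells_per_diameter = round(D_OVER_L / MIN_DX_OVER_L)

print(re_cylinder, cells_per_body_length, cells_per_diameter)
```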

Figure 4 shows the vorticity contours and flow velocity distribution behind the cylinders at four different instants (an animation of the vortex motion can be found in the Supplementary Materials). Abundant vortices are generated in the wake of the cylinders, and they differ in strength and convection velocity. These vortices interact with each other and with the trailing cylinders, forming a highly dynamic and unpredictable flow field. Two basic types of vortices are identified: clockwise vortices (blue) and counter-clockwise vortices (red). A clockwise vortex accelerates the flow above it and decelerates the flow below it, and induces upward flow on its left side and downward flow on its right side. Conversely, a counter-clockwise vortex accelerates the flow below it and decelerates the flow above it, and induces upward flow on its right side and downward flow on its left side. As a result, the flow velocity in the field is vastly altered. In the next section, the flow field at tU/L = 50 is used as the initial flow field for the swimming training.

FIGURE 4. Vorticity contour and flow velocity distribution behind the cylinders at four different instants: (A) tU/L = 22.7, (B) tU/L = 24.7, (C) tU/L = 26.7, and (D) tU/L = 28.7.

3.2 Learning to Navigate in the Vortical Flow

In this section, a fish is trained to navigate in the flow field described in the previous section. The cases are run on four computational cores of a workstation with an Intel Xeon E5-2678 CPU using OpenMP. The computational domain of 50L × 50L is divided into seven blocks with about 120,000 grids. The simulation requires about 21.0 s of CPU time per swimming period (t/T = 1.0). For simplicity, the fish is restricted to swim in a rectangular area of 12L × 6L, as shown in Figure 3. The goal of the fish is to swim towards a given destination at (1L, 0.7L) from different initial positions. This goal is expressed by defining the reward as

rd = - \sqrt{(x_{tip}/L - 1)^{2} + (y_{tip}/L - 0.7)^{2}}, (6)

where x_tip and y_tip are the space coordinates of the head tip of the fish. In addition, if the fish swims out of the boundary of the confined area, it is given a strong penalty of rd = −100.
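The reward of Eq. 6, together with the out-of-bounds penalty, can be written as a short function. In this Python sketch the coordinates are already normalized by L, and the boolean argument `inside_area` is a hypothetical input: the exact bounds of the confined area are given only in Figure 3, so the bounds check is left to the caller.

```python
import math

DESTINATION = (1.0, 0.7)        # destination in units of L, from Eq. 6
OUT_OF_BOUNDS_PENALTY = -100.0  # penalty for leaving the confined area


def reward(x_tip, y_tip, inside_area=True):
    """Return rd for a head-tip position (x_tip, y_tip) given in units of L."""
    if not inside_area:
        return OUT_OF_BOUNDS_PENALTY
    return -math.hypot(x_tip - DESTINATION[0], y_tip - DESTINATION[1])


print(reward(6.0, -1.5))                    # far from the destination
print(reward(1.0, 0.7))                     # at the destination
print(reward(0.0, 0.0, inside_area=False))  # outside the confined area
```

The reward is the negative distance to the destination, so it increases towards zero as the head tip approaches the target.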

The swimmer propels itself by generating a travelling wave propagating from head to tail, as defined by Eq. 2. In order to achieve high maneuverability, the swimmer can change the wave amplitude every half swimming cycle. Each selected set of parameters is considered as an action. In this case, the period is fixed at TU/L = 0.4; the amplitude action base is defined as θ_lmax = 0°, 10°, 20°, 30°, 40°, 50°, 60°, 70° and 80°; and the wavelength is fixed at λ = L. This parameter set forms an action base of nine components.
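The action base can be enumerated explicitly. The following Python sketch lists the nine actions and derives the number of solver time steps per half swimming cycle from the nondimensional period and time step quoted in the text. The dictionary layout and names are illustrative assumptions.

```python
AMPLITUDES_DEG = list(range(0, 90, 10))  # 0, 10, ..., 80 degrees
PERIOD = 0.4       # T U / L, fixed
WAVELENGTH = 1.0   # lambda / L, fixed
DT = 0.0004        # dt U / L, time step of the flow solver

# One action per admissible tail-tip amplitude; period and wavelength are fixed.
ACTIONS = [
    {"theta_lmax_deg": amp, "period": PERIOD, "wavelength": WAVELENGTH}
    for amp in AMPLITUDES_DEG
]

# A new action is selected every half cycle, i.e. every T / 2.
steps_per_half_cycle = round(0.5 * PERIOD / DT)

print(len(ACTIONS), steps_per_half_cycle)
```

With these values, the agent makes one decision for every 500 solver time steps.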

A comprehensive representation of the environment state is very important for accurate motion control. Specifically, the historical evolution of the sensory information should be considered thoroughly. Zhu et al. [27] conducted tests with different environment information and found that considering only the actions and body kinematics in the last four periods provides environmental information with enough accuracy for motion control. Therefore, a similar approach is adopted here, in which the state is defined by the tuple

s_{n} = \begin{bmatrix} (x)_{n}, & (y)_{n}, & (θ)_{n}, & (\bar{u}_{x})_{n}, & (\bar{u}_{y})_{n}, & \bar{ω}_{n}, & \\ (x)_{n-1}, & (y)_{n-1}, & (θ)_{n-1}, & (\bar{u}_{x})_{n-1}, & (\bar{u}_{y})_{n-1}, & \bar{ω}_{n-1}, & a_{n-1}, \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ (x)_{n-8}, & (y)_{n-8}, & (θ)_{n-8}, & (\bar{u}_{x})_{n-8}, & (\bar{u}_{y})_{n-8}, & \bar{ω}_{n-8}, & a_{n-8} \end{bmatrix}, (7)

where x, y and θ are respectively the space coordinates and orientation angle of the fish, and ${\bar{u}}_{x}$, ${\bar{u}}_{y}$ and $\bar{ω}$ are respectively the average swimming speeds in the x- and y-directions and the average angular speed in each half period.
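The state of Eq. 7 can be assembled from a rolling history of the last eight half periods. The Python sketch below uses placeholder kinematic values; the function name `build_state` and the flattened list layout are assumptions made for illustration, not the authors' data structure.

```python
from collections import deque

HISTORY = 8  # number of past half periods kept in Eq. 7


def build_state(current, past):
    """Flatten Eq. 7 into one list.

    current: (x, y, theta, ux, uy, omega) at step n, with no action yet.
    past:    HISTORY tuples (x, y, theta, ux, uy, omega, a), most recent first.
    """
    state = list(current)
    for row in past:
        state.extend(row)
    return state


history = deque(maxlen=HISTORY)  # oldest entries drop off automatically
state = None
for n in range(12):
    kinematics = (float(n),) * 6      # placeholder sensor values for step n
    if len(history) == HISTORY:
        state = build_state(kinematics, history)
    action = n % 9                    # placeholder action index for step n
    history.appendleft(kinematics + (action,))

print(len(state))
```

The current row has six entries and each of the eight past rows has seven, so the flattened state has 6 + 8 × 7 = 62 entries.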

The learning process is divided into a series of episodes. In each episode, the initial x coordinate x₀ is randomly chosen between 3L and 7L, the initial y coordinate y₀ is randomly chosen between −1.5L and 1.5L, and the initial orientation angle θ₀ is randomly chosen between −30° and 30°. The subsequent positions and orientations of the swimmer are then determined by the FSI under the selected actions. Once the swimmer leaves the confined area or reaches a small circular area of radius 0.3L around the destination, the episode ends and another starts. The fish is trained for 3,000 episodes and 126,893 periods. Figure 5 shows the traces of the head tip during different learning stages. In episode 99, the fish is not able to remain in the vortical flow area for a prolonged time and quickly swims out of the confined area. After a period of trial-and-error exploration (episode 565), it learns to hold its position in the area for a longer time instead of being washed away. It then learns to swim directly towards its destination. After learning for 990 episodes, it finds a path leading close to the destination, but ends up colliding with one of the cylinders. It then struggles and learns to reach the destination without hitting the cylinders (episode 1,604). Finally, after learning for about 3,000 episodes, it can accurately reach the destination.
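The episode initialization and the success criterion can be sketched as follows. The text states only that the initial values are randomly chosen within the given ranges, so the uniform distribution used here is an assumption, as are the function names. Lengths are in units of L and angles in degrees.

```python
import math
import random


def sample_initial_condition(rng):
    """Draw (x0, y0, theta0) within the ranges used for training."""
    x0 = rng.uniform(3.0, 7.0)         # assumed uniform, in units of L
    y0 = rng.uniform(-1.5, 1.5)        # assumed uniform, in units of L
    theta0 = rng.uniform(-30.0, 30.0)  # assumed uniform, in degrees
    return x0, y0, theta0


def reached_destination(x_tip, y_tip, dest=(1.0, 0.7), radius=0.3):
    """True if the head tip lies within the circle around the destination."""
    return math.hypot(x_tip - dest[0], y_tip - dest[1]) <= radius


rng = random.Random(0)  # fixed seed for a reproducible illustration
for episode in range(3):
    print(sample_initial_condition(rng))
```

An episode would also end when the swimmer leaves the confined area; that check is omitted here because the bounds of the area are given only in Figure 3.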

FIGURE 5. The traces of the head during different learning stages.

In order to test the robustness of the control strategy, we investigated 100 cases with different initial positions and orientation angles using the same control strategy obtained after learning for 3,000 episodes. In 9 of the 100 tests, the fish loses its balance and eventually swims out of the confined area. In those cases, the angle of the fish relative to the incoming flow grows so large that the fish cannot restore its orientation in time. In 15 of the 100 tests, the fish ends up colliding with the cylinders. In those cases, the fish cannot resist the strong suction force behind the cylinders. In the other 76 cases, the fish successfully reaches the destination. Figure 6 presents the traces when the fish swims to its destination from different initial positions. Five cases are studied, in which the initial orientation angle is fixed at 0° while the initial position of the head tip $(r_{tip})_{0}$ takes on the values (6L, −1.5L), (6L, 1.5L), (6L, 0L), (3L, −1.5L) and (4.5L, −1.5L). Figure 7 presents the traces when the fish swims to its destination with different initial orientation angles. Five cases are studied, in which the initial position is fixed at (6L, −1.5L) while the initial orientation angle θ₀ takes on the values 0°, 30°, 15°, −15° and −30°. In all of these cases, the fish reaches its destination successfully, but the path varies considerably. Nevertheless, two main paths can be identified: one approaches the destination from above and the other approaches it from below.

FIGURE 6. The traces of the head for different initial positions.

FIGURE 7. The traces of the head for different initial orientation angles.

In order to understand the hydrodynamics underlying these behaviors, we investigate a typical case in detail, in which the initial orientation angle is 0° and the initial position is (6L, −1.5L). The time history of the lateral tail tip movement is shown in Figure 8. The vorticity contours and flow velocity distribution at several typical instants are shown in Figure 9 (an animation of the fish swimming can be found in the Supplementary Materials). Note that the fish is forced to hold still in the flow field for 50 periods until the vortex street is fully developed. It is then allowed to swim freely in the flow. Its goal is to swim upstream and reach its destination (green circle in Figure 9). Figure 9A shows the body posture of the fish and the ambient flow field at instant t/T = 50. An area of reduced streamwise flow (denoted as RSF in the figure) is formed on the right side of the fish. Moving upstream will be easier if the fish can take advantage of this area. However, the surrounding flow tends to push the fish leftwards into the high flow velocity area. Without active control, the fish will be washed downstream quickly. Therefore, the fish adopts a large-amplitude right flapping to turn right towards the reduced flow area (Figure 9B). At instants t/T = 53 and t/T = 54 (Figures 9C,D), the fish is oriented towards the reduced streamwise flow area. Meanwhile, the clockwise flow induced by Vortex 1 (denoted by V1 in the figure) tends to turn it right (rotating clockwise) and draw it backwards to the downstream area, and a large-amplitude right flapping would accelerate this process. Therefore, the fish adopts a large-amplitude left flapping to resist this tendency and restore its swimming orientation. In the following several periods, the swimmer adopts a similar strategy to take advantage of the reduced streamwise flow area and keep its balance (see details in the Supplementary Video S6).

From instant t/T = 61.5 to t/T = 65.0 (Figures 9E–H), a strong counter-clockwise vortex (V2) is on the right side of the fish, inducing a strong rightward flow and a reduced streamwise flow on the right side of the fish. Therefore, the fish adopts two large-amplitude right flapping motions to swim rightwards and three compensating left flapping motions to maintain stability. Those motions are of crucial importance for the fish to make the most of the flow to swim upstream while keeping its balance. From instant t/T = 72.5 to t/T = 75.9 (Figures 9I–L), the fish is very close to the destination and located in a strong streamwise flow that could wash it away from the destination. Therefore, the fish adopts a sequence of high-amplitude right flapping motions to reach the destination quickly. Note that the fish chooses to approach the destination from the counterflow direction instead of the downstream direction, since the high flow velocity makes it extremely hard to swim upstream.

FIGURE 8. The time change of the lateral tail tip movement in the local coordinate system.

FIGURE 9. Vorticity contour and flow velocity distribution at 12 different instants: (A) t/T = 50, (B) t/T = 51.5, (C) t/T = 53, (D) t/T = 54, (E) t/T = 61.5, (F) t/T = 62.5, (G) t/T = 63, (H) t/T = 65, (I) t/T = 72.5, (J) t/T = 73.5, (K) t/T = 74.5 and (L) t/T = 75.9.

4 Conclusion

The point-to-point navigation of a fish-like swimmer in a vortical flow is numerically studied with a hybrid method of deep reinforcement learning and immersed boundary–lattice Boltzmann method. The goal of the swimmer is to swim upstream through the vortical area to its destination. The vortical area is generated by placing four stationary cylinders in a uniform flow. The effect of the vortices is twofold. They not only induce reduced streamwise flow, which makes swimming upstream easier, but also induce strong streamwise and lateral flow, which deflects the swimmer from its desired path. The swimmer utilizes only the time-sequential information of position, orientation, velocity and angular velocity to learn to navigate to its destination. By considering the time-sequential information, the swimmer learns to reach its destination from different initial positions and orientations, demonstrating the effectiveness and robustness of the method. A detailed analysis shows that the fish utilizes highly subtle tail flapping motions to control its swimming orientation and take advantage of the reduced streamwise flow area to reach its destination, while at the same time avoiding the high flow velocity area.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

Author Contributions

YZ has made contributions to methodology, software development, data analysis and interpretation, and writing of the work. F-BT has made contributions to the conception of the work, methodology, and revising of the work. J-HP has made contributions to the conception and revising of the work.

Funding

This work was partially supported by the Australian Research Council (project number DE160101098).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

YZ acknowledges Shenzhen Institute of Guangdong Ocean University and Dalian Maritime University during the pursuit of this study.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphy.2022.870273/full#supplementary-material

References

1. Zermelo E. Über das Navigationsproblem bei ruhender oder veränderlicher Windverteilung. Z Angew Math Mech (1931) 11:114–24. doi:10.1002/zamm.19310110205