- 1Key Laboratory of Industrial Internet of Things and Networked Control, Ministry of Education, Chongqing University of Posts and Telecommunications, Chongqing, China
- 2School of Electrical and Electronic Engineering, Chongqing University of Technology, Chongqing, China
Background: As an important mathematical model, the finite state machine (FSM) has been used in many fields, such as manufacturing systems and health care. This paper analyzes the current development status of FSMs. It is pointed out that traditional methods are often inconvenient for analysis and design, or suffer from high computational complexity, when studying FSMs.
Method: The deep Q-network (DQN) technique, which is a model-free optimization method, is introduced to solve the stabilization problem of probabilistic finite state machines (PFSMs). In order to better understand the technique, some preliminaries, including Markov decision process, ϵ-greedy strategy, DQN, and so on, are recalled.
Results: First, a necessary and sufficient stabilizability condition for PFSMs is derived. Next, the feedback stabilization problem of PFSMs is transformed into an optimization problem. Finally, by using the stabilizability condition and deep Q-network, an algorithm for solving the optimization problem (equivalently, computing a state feedback stabilizer) is provided.
Discussion: Compared with traditional Q-learning, DQN avoids the limited-capacity problem of Q-tables. Therefore, our method can deal with high-dimensional complex systems efficiently. The effectiveness of our method is further demonstrated through an illustrative example.
1 Introduction
The finite state machine (FSM), also known as the finite automaton (Yan et al., 2015b), is an important mathematical model that has been used in many different fields, such as manufacturing systems (Wang et al., 2017; Piccinini et al., 2018), health care (Shah et al., 2017; Zhang, 2018; Fadhil et al., 2019), and so on. The deterministic finite state machine (DFSM) is known for its deterministic behavior, in which each subsequent state is uniquely determined by its input event and preceding state (Vayadande et al., 2022). However, DFSMs may not be effective in dealing with random behaviors (Ratsaby, 2019), for example, the randomness caused by component failures in sequential circuits (El-Maleh and Al-Qahtani, 2014). To address this challenge, the probabilistic finite state machine (PFSM) was proposed in the study by Vidal et al. (2005), which provides a more flexible framework for systems that exhibit random behaviors. In particular, it gives an effective solution to practical issues, such as the reliability assessment of sequential circuits (Li and Tan, 2019). Therefore, the PFSM offers a new perspective for the theoretical research of FSMs.
On the other hand, the stabilization of systems is an important and fundamental research topic, and there have been many excellent research results in various fields, for example, Boolean control networks (Tian et al., 2017; Tian and Hou, 2019), time-delay systems (Tian and Wang, 2020), neural networks (Ding et al., 2019), and so on. The stabilization research of FSMs is no exception and has also attracted the attention of many scholars. The concepts of stability and stabilization of discrete event systems described by FSMs were given in the study by Özveren et al. (1991), where a polynomial-time solution for stability detection and a method for constructing stabilizers were presented. Passino et al. (1994) utilized the Lyapunov method to study the stability and stabilization of FSMs. Tarraf et al. (2008) proposed some new concepts, including gain stability, incremental stability, and external stability, and then established a research framework for the robust stability of FSMs. Kobayashi et al. developed a linear state equation representation for modeling DFSMs in the studies by Kobayashi (2006) and Kobayashi and Imura (2007) and derived a necessary and sufficient condition for a DFSM to be stabilizable at a target equilibrium node in the study by Kobayashi et al. (2011).
However, as we know, the FSM is most often non-linear. Moreover, none of the above methods are convenient when analyzing and designing various FSMs. In the last decade, scholars have applied the semi-tensor product (STP) of matrices to FSMs and derived many excellent results. First, with the help of the STP, an algebraic form of DFSMs was given in the study by Xu et al. (2013). This algebraic form is a discrete-time bilinear equation, so classic control theory can be used to investigate FSMs. In particular, under the algebraic form, necessary and sufficient conditions for the stabilizability of DFSMs were derived in the study by Xu et al. (2013), and a state feedback controller was obtained by solving a corresponding matrix inequality. Moreover, Yan et al. (2015a) provided a necessary and sufficient condition to check whether a set of states can be stabilized. Han and Chen (2018) considered the set stabilization of DFSMs and provided an optimal design approach for stabilizing controllers. Later, Zhang et al. used the STP method to investigate PFSMs and non-deterministic FSMs. Specifically, a necessary and sufficient condition for stabilization with probability one and a design method for an optimal state feedback controller were provided in the study by Zhang et al. (2020a). Moreover, a systematic procedure was designed to obtain a static output feedback stabilizer for non-deterministic FSMs in the study by Zhang et al. (2020b). Although the STP method is very useful in analyzing discrete event systems, including various FSMs, it suffers from high computational complexity and can only handle small-scale or even micro-scale discrete event systems. To solve this problem, this study draws on the techniques developed by Acernese et al. (2020) to solve the stabilization problem of high-dimensional PFSMs and then provides a reinforcement learning algorithm to compute a state feedback stabilizer for PFSMs. The algorithm is especially advantageous in dealing with high-dimensional systems.
The rest of this study is arranged as follows: Section 2 introduces some preliminary knowledge, including the PFSM, the Markov decision process (MDP), the deep Q-network (DQN), and the ϵ-greedy strategy. In Section 3, a stabilizability condition is derived and an algorithm based on DQN is provided. An illustrative example is employed in Section 4 to show the effectiveness of our results, which is followed by a brief conclusion in Section 5.
2 Methods
For the convenience of statement, some symbol explanations are provided first.
Notation: ℝ expresses the set of all real numbers. ℤ+ stands for the set of all positive integers. [a, b] denotes the set {a, a+1, ⋯, b}, where a, b ∈ ℤ+, a ≤ b. |A| is the cardinality of set A.
2.1 Probabilistic finite state machine
A PFSM is a five-tuple
$$(X, x_0, E, P, f), \tag{1}$$
where the set X = {x_1, x_2, ⋯, x_n} represents a finite set of states, and x_0 ∈ X is the initial state. E denotes a finite set of events. P: X × E × X → [0, 1] is a transition probability function, and P(x_i, e, x_j) expresses the probability of PFSM (1) transiting from state x_i to state x_j under the input event e ∈ E, satisfying
$$\sum_{x_j \in X} P(x_i, e, x_j) = 1$$
or
$$\sum_{x_j \in X} P(x_i, e, x_j) = 0.$$
The state transition function f: X × E → 2^X describes that PFSM (1) may reach different states from one state under the same input event, where 2^X is the power set of X.
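To make the definition above concrete, the following minimal Python sketch stores the transition probability function P of a small hypothetical PFSM as a nested dictionary, checks the normalization condition, and samples one transition. The states, events, and probabilities are illustrative only and are not taken from this paper.

```python
import random

# Hypothetical PFSM: for each defined (state, event) pair the outgoing probabilities
# sum to 1; pairs that are absent correspond to the "sum equals 0" case above.
P = {
    ("x1", "e1"): {"x1": 0.3, "x2": 0.7},
    ("x2", "e1"): {"x2": 1.0},
    ("x2", "e2"): {"x1": 0.5, "x2": 0.5},
}

def check_normalization(P):
    """Verify that every defined (state, event) pair has probabilities summing to 1."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())

def step(P, x, e):
    """Sample the next state from P(x, e, .); return None if the event is undefined at x."""
    dist = P.get((x, e))
    if dist is None:
        return None
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(check_normalization(P))   # True
print(step(P, "x1", "e1"))      # 'x1' or 'x2'
```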
2.2 Markov decision process and optimization methods
A Markov decision process (MDP) is characterized by a quintuple
$$(S, A, P, R, \gamma), \tag{2}$$
where S is a set of states, A is a set of actions, P is a state transition probability function, R is a reward function, and γ ∈ [0, 1] is a discount factor that determines the trade-off between short-term and long-term gains.
MDP (2) may reach state st+1 from state st ∈ S under the chosen action at ∈ A, and its probability is determined by the function P(st+1 ∣ st, at). The expected one-step reward from state st to state st+1 via action at is as follows:
$$R(s_t, a_t, s_{t+1}) = \mathbb{E}\left[r_{t+1}(s_t, a_t, s_{t+1})\right], \tag{3}$$
where rt+1 = rt+1(st, at, st+1) represents the immediate return after adopting action at at time t, and 𝔼[·] denotes the expected value of [·].
The objective of MDP (2) is to determine an optimal policy π. This policy can maximize the expected return 𝔼π[Gt] under policy π, where
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}.$$
For a given policy π, the value function of a state st, denoted by vπ(st), is the expected return of MDP (2) taking an action according to the policy π at time step t:
$$v_\pi(s_t) = \mathbb{E}_\pi\left[G_t \mid s_t\right]. \tag{4}$$
The optimal policy is as follows:
$$\pi^* = \arg\max_{\pi \in \Pi} v_\pi(s_t), \tag{5}$$
where Π is the set of all admissible policies.
From (4), it is easy to see that Gt = rt+1 + γGt+1. Since vπ(·) satisfies the Bellman equation, we have
$$v_\pi(s_t) = \mathbb{E}_\pi\left[r_{t+1} + \gamma v_\pi(s_{t+1}) \mid s_t\right].$$
Similarly, the action-value function qπ(st, at) describes the cumulative return from the state-action pair (st, at) under policy π:
$$q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[G_t \mid s_t, a_t\right]. \tag{6}$$
By substituting (3) into (6), we can obtain
$$q_\pi(s_t, a_t) = \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\left[R(s_t, a_t, s_{t+1}) + \gamma v_\pi(s_{t+1})\right], \tag{7}$$
which represents the expected return of action at adopted by MDP (2) at state st, following policy π. The action-value function under the optimal policy π* is called the optimal action-value function, i.e., q*(st, at) = qπ*(st, at). Since vπ*(st+1) = max_{at+1 ∈ A} q*(st+1, at+1), from (5), we can get
$$q^*(s_t, a_t) = \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\left[R(s_t, a_t, s_{t+1}) + \gamma \max_{a_{t+1} \in A} q^*(s_{t+1}, a_{t+1})\right].$$
Therefore, if MDP (2) admits an optimal deterministic policy, it can be expressed as follows:
$$\pi^*(s_t) = \arg\max_{a_t \in A} q^*(s_t, a_t).$$
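As a quick illustration of Equations (4)-(7) and of reading off an optimal deterministic policy, the following sketch runs Q-value iteration on a small hypothetical MDP; the three states, two actions, rewards, and discount factor are invented for illustration.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (not from the paper): P[a][s][s'] and R[a][s][s'].
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],   # action 1
])
R = np.zeros((2, 3, 3))
R[:, :, 2] = 1.0          # reward 1 for arriving at state 2
gamma = 0.9

# Q-value iteration: q(s, a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma * max_a' q(s', a')]
q = np.zeros((3, 2))
for _ in range(500):
    v = q.max(axis=1)                                   # v(s') = max_a' q(s', a')
    q = np.einsum("asn,asn->sa", P, R + gamma * v)      # expectation over next states
pi_star = q.argmax(axis=1)                              # greedy (optimal deterministic) policy
print(q.round(3), pi_star)
```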
DQN is a technique that combines Q-learning with artificial neural networks (ANNs), providing an effective approach to decision-making problems in dynamic and uncertain environments. It uses ANNs to construct parametric models and estimate action-value functions online. Compared with Q-learning, the main advantages of DQN are as follows: (1) DQN uses ANNs to approximate Q functions, overcoming the issue of limited capacity in Q tables and enabling the algorithm to handle high-dimensional state spaces. (2) DQN makes full use of empirical knowledge.
Q-learning updates the value function according to the following temporal difference (TD) formula:
$$q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a \in A} q(s_{t+1}, a) - q(s_t, a_t)\right],$$
where rt+1 + γ max_a q(st+1, a) is the TD target, δ = rt+1 + γ max_a q(st+1, a) − q(st, at) is the TD error, and 0 < α ≤ 1 is a constant that determines how quickly the past experiences are forgotten.
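The TD update above translates directly into code. The sketch below applies one tabular Q-learning step to a hypothetical Q-table, computing the TD target and TD error exactly as in the formula; the state and action indices are arbitrary.

```python
import numpy as np

def td_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step; q is a (num_states, num_actions) array."""
    td_target = r + gamma * np.max(q[s_next])   # TD target
    td_error = td_target - q[s, a]              # TD error delta
    q[s, a] += alpha * td_error                 # move the estimate toward the target
    return q

q_table = np.zeros((5, 2))                      # hypothetical 5 states, 2 actions
q_table = td_update(q_table, s=0, a=1, r=1.0, s_next=3)
print(q_table[0, 1])                            # 0.1
```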
When dealing with high-dimensional complex systems, the action-value function q(s, a), as described in Equation (7), is approximated by an ANN with parameter θt, i.e., q(s, a) ≈ q(s, a; θt), to reduce computational complexity. This can be achieved by minimizing the following loss function:
$$L(\theta_t) = \frac{1}{2}\,\mathbb{E}\left[\left(r_{t+1} + \gamma \max_{a \in A} q(s_{t+1}, a; \theta_t^-) - q(s_t, a_t; \theta_t)\right)^2\right], \tag{8}$$
where the parameter θt− is a periodic copy of the current network parameter θt.
By differentiating Equation (8), we have
$$\nabla_{\theta_t} L(\theta_t) = -\,\mathbb{E}\left[\left(r_{t+1} + \gamma \max_{a \in A} q(s_{t+1}, a; \theta_t^-) - q(s_t, a_t; \theta_t)\right)\nabla_{\theta_t} q(s_t, a_t; \theta_t)\right], \tag{9}$$
where ∇θt q(st, at; θt) represents the gradient of q(st, at; θt) with respect to the parameter θt.
We choose the gradient descent method as the optimization strategy:
$$\theta_{t+1} = \theta_t - \alpha\,\nabla_{\theta_t} L(\theta_t). \tag{10}$$
By substituting Equation (9) into (10), we obtain an update formula for the parameter θt:
$$\theta_{t+1} = \theta_t + \alpha\left(r_{t+1} + \gamma \max_{a \in A} q(s_{t+1}, a; \theta_t^-) - q(s_t, a_t; \theta_t)\right)\nabla_{\theta_t} q(s_t, a_t; \theta_t),$$
where the expectation in Equation (9) is approximated by sampled transitions.
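The loss in Equation (8) and the gradient step in Equation (10) are what a DQN implementation evaluates numerically on mini-batches. The following TensorFlow/Keras sketch of a single training step is an illustration only: the network architecture, the one-hot state encoding, and the optimizer are assumptions, not the configuration used in this paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

NUM_STATES, NUM_ACTIONS, GAMMA = 20, 4, 0.99    # hypothetical sizes

def build_q_net():
    return keras.Sequential([
        keras.Input(shape=(NUM_STATES,)),        # one-hot encoded state
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(NUM_ACTIONS),         # one Q value per action
    ])

q_net, target_net = build_q_net(), build_q_net()
target_net.set_weights(q_net.get_weights())      # periodic copy, plays the role of theta^-
optimizer = keras.optimizers.Adam(1e-3)

def train_step(states, actions, rewards, next_states, dones):
    rewards = tf.cast(rewards, tf.float32)
    dones = tf.cast(dones, tf.float32)
    # TD target built from the frozen target network, as in Equation (8)
    next_q = tf.reduce_max(target_net(next_states), axis=1)
    targets = rewards + GAMMA * (1.0 - dones) * next_q
    with tf.GradientTape() as tape:
        q_values = q_net(states)
        q_sa = tf.reduce_sum(q_values * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_sa))        # squared TD error
    grads = tape.gradient(loss, q_net.trainable_variables)      # gradient as in Equation (9)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))  # gradient descent step
    return loss

# Dummy mini-batch just to show the call signature.
batch = 32
loss = train_step(
    states=tf.one_hot(np.random.randint(NUM_STATES, size=batch), NUM_STATES),
    actions=np.random.randint(NUM_ACTIONS, size=batch),
    rewards=np.zeros(batch),
    next_states=tf.one_hot(np.random.randint(NUM_STATES, size=batch), NUM_STATES),
    dones=np.zeros(batch),
)
```

Keeping θt− frozen between periodic copies, as in the sketch, is what keeps the TD target stable while θt is updated.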
Finally, the ϵ-greedy strategy is used for action selection. Specifically, an action is chosen randomly with probability ϵ (0 < ϵ ≤ 1), and the best estimated action is chosen with probability 1−ϵ. As learning progresses, ϵ gradually decreases, and the policy shifts from exploring the action space to exploiting the learned Q values. The policy π(a∣s) is as follows:
$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A|}, & a = a^*, \\ \dfrac{\epsilon}{|A|}, & a \neq a^*, \end{cases}$$
where π(a∣s) is the probability of MDP (2) selecting action a at state s, and a* = arg max_{a'} q(s, a') stands for the action with the highest estimated Q value for state s.
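A possible implementation of the ϵ-greedy selection and of a linear decay schedule is sketched below; the 0.05 to 1.0 range matches the setting reported later in Section 4, while the function names and the decay shape are our own choices.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_row)))   # explore
    return int(np.argmax(q_row))                    # exploit

def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly decay epsilon from eps_start to eps_end over training."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```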
3 Results
We first give a definition.
Definition 1: Assume that x_e ∈ X is an equilibrium state of PFSM (1). The PFSM is said to be feedback stabilizable to x_e with probability one if, for any initial state x_0 ∈ X, there exists a control sequence U such that the state of PFSM (1) reaches x_e within a finite number of steps and stays at x_e thereafter with probability one.
We define an attraction domain for the equilibrium state x_e as
$$D_k(x_e) = \{x \in X \mid x \text{ can be driven to } x_e \text{ in } k \text{ steps with probability one}\}, \tag{11}$$
that is, the set of states that can reach x_e in k steps.
Next, we give an important result.
Theorem 1: Assume that x_e ∈ X is an equilibrium state of PFSM (1). The PFSM is feedback stabilizable to x_e with probability one if and only if there exists an integer ρ ≤ n−1 such that
$$D_\rho(x_e) = X. \tag{12}$$
Proof (Necessity): Assume that PFSM (1) is feedback stabilizable to the equilibrium state x_e with probability one. Then, according to Definition 1, for any initial state x_0 ∈ X, there exists a control sequence U such that x_0 is driven to x_e within a finite number of steps with probability one, namely x_0 ∈ D_k(x_e) for some k ∈ ℤ+. Due to the fact that the state space X is a finite set, there must be an integer ρ such that Equation (12) holds.
(Sufficiency): Assume that Equation (12) holds. For any initial state x_0 ∈ X, we have x_0 ∈ D_ρ(x_e). From Equation (11), there exists a positive integer ρ and a control sequence U such that x_0 can be driven to x_e by U in ρ steps with probability one. According to Definition 1, PFSM (1) is feedback stabilizable to x_e with probability one. ■
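Theorem 1 is used in a model-free way in the sequel, but when the transition structure happens to be known, its condition can be checked directly. The sketch below, which reuses the dictionary format of the PFSM snippet in Section 2.1 and a hypothetical three-state system, grows the attraction domains D_k(x_e) step by step and tests whether they cover the whole state space within n−1 steps.

```python
def attraction_domains(X, E, P, x_e, max_steps=None):
    """Return the list D_0, D_1, ... of states driven to x_e within k steps with
    probability one, assuming the full transition structure P(x, e, x') is known."""
    n = len(X)
    max_steps = max_steps or n - 1
    D = {x_e}
    history = [set(D)]
    for _ in range(max_steps):
        new = set(D)
        for x in X:
            for e in E:
                dist = P.get((x, e))
                # x joins the domain if some event sends it into D with probability one
                if dist and all(x_next in D for x_next in dist):
                    new.add(x)
                    break
        D = new
        history.append(set(D))
        if D == set(X):
            break
    return history

# Toy example (hypothetical): x3 is the equilibrium state.
X, E, x_e = ["x1", "x2", "x3"], ["e1", "e2"], "x3"
P = {
    ("x1", "e1"): {"x2": 0.5, "x3": 0.5},
    ("x1", "e2"): {"x3": 1.0},
    ("x2", "e1"): {"x3": 1.0},
    ("x3", "e1"): {"x3": 1.0},
}
domains = attraction_domains(X, E, P, x_e)
print(domains[-1] == set(X))   # True: the condition of Theorem 1 holds for this toy system
```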
We cast the feedback control problem of PFSM (1) into a model-free reinforcement learning framework. The main aim is to find a state feedback controller which can guarantee the finite-time stabilization of PFSM (1). This means that all states can be controlled and brought to an equilibrium state within finitely many steps. Therefore, PFSM (1) is rewritten as a model in which the transition probability function P is unknown. The stabilization problem of PFSM (1) is formulated as follows:
$$\max_{U \in E} q^*(x, U) \tag{13}$$
subject to (1),
where the reward function is chosen such that reaching the equilibrium state x_e yields the highest return.
The objective of Equation (13) is to find an action U that maximizes the action-value function q* among all possible actions in E. Therefore, for any state x ∈ X and external condition, the optimal state feedback control law of PFSM (1) is as follows:
$$U^*(x) = \arg\max_{U \in E} q^*(x, U).$$
Based on the above discussion, we are ready to introduce an algorithm to design an optimal feedback controller (see Algorithm 1). It should be noted that in this algorithm, DQN uses two ANNs. The structure diagram of DQN is shown in Figure 1.
Algorithm 1. State feedback stabilization of PFSM (1) based on deep Q-network.
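Algorithm 1 itself is not reproduced here. As a rough illustration of how such a routine can be organized, the following skeleton combines the training step, the ϵ-greedy selection, the replay memory, and the periodic target-network synchronization sketched in Section 2.2. The environment function env_step, the reward design, and all episode parameters are assumptions made for illustration and should not be read as the authors' algorithm.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

# Reuses q_net, target_net, train_step, epsilon_at, and epsilon_greedy from the
# sketches in Section 2.2. env_step is a hypothetical stand-in for PFSM (1).
NUM_STATES, NUM_ACTIONS, TARGET_STATE = 20, 4, 0

def env_step(state, action):
    # Placeholder dynamics: in practice the next state is sampled from the unknown P.
    next_state = np.random.randint(NUM_STATES)
    reward = 1.0 if next_state == TARGET_STATE else 0.0    # assumed reward design
    return next_state, reward, next_state == TARGET_STATE

buffer = deque(maxlen=10_000)                    # replay memory B
BATCH, EPISODES, MAX_STEPS, SYNC_EVERY = 128, 500, 50, 20
one_hot = lambda s: tf.one_hot([s], NUM_STATES)

for episode in range(EPISODES):
    state = np.random.randint(NUM_STATES)
    eps = epsilon_at(episode, EPISODES)
    for t in range(MAX_STEPS):
        q_row = q_net(one_hot(state)).numpy()[0]
        action = epsilon_greedy(q_row, eps)
        next_state, reward, done = env_step(state, action)
        buffer.append((state, action, reward, next_state, float(done)))
        if len(buffer) >= BATCH:
            s, a, r, s2, d = map(np.array, zip(*random.sample(list(buffer), BATCH)))
            train_step(tf.one_hot(s, NUM_STATES), a.astype(np.int32),
                       r.astype(np.float32), tf.one_hot(s2, NUM_STATES),
                       d.astype(np.float32))
        state = next_state
        if done:
            break
    if episode % SYNC_EVERY == 0:
        target_net.set_weights(q_net.get_weights())   # refresh the target parameter theta^-
```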
Remark 1: This algorithm is mainly used to solve the stabilization problem of high-dimensional PFSMs. For small- or micro-scale PFSMs, it is somewhat more complex than necessary; in this case, we can choose the STP method instead. Therefore, Algorithm 1 and the STP method complement each other.
According to the results calculated by Algorithm 1, a state feedback controller can be given. Specifically, the output of Algorithm 1 is an optimal policy. Assume that π* is the calculated result. Then, we get a state feedback controller U(x) = π*(x), x ∈ X.
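Concretely, once training has converged, such a controller can be read off the trained network by taking, for every state, the event with the largest estimated Q value. A minimal sketch, assuming the one-hot encoding and the q_net of the earlier snippets, is as follows:

```python
import numpy as np
import tensorflow as tf

def extract_controller(q_net, num_states):
    """Map every state index to the event index with the largest estimated Q value."""
    states = tf.one_hot(np.arange(num_states), num_states)
    q_values = q_net(states).numpy()
    return {s: int(a) for s, a in enumerate(np.argmax(q_values, axis=1))}

controller = extract_controller(q_net, NUM_STATES)   # state index -> event index
```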
4 Discussion
Example 1: Consider a PFSM
where x(t) = x_i represents that PFSM (14) is at its i-th state at time step t. It is easy to observe that PFSM (14) has an equilibrium state, denoted by x_e.
We now use Algorithm 1 to compute a state feedback controller that stabilizes PFSM (14) to x_e. The computation is performed on a computer with an Intel i5-11300H processor (2.6 GHz) and 16 GB RAM, running Python 3.7. We adopt the Keras API of TensorFlow to train the DQN model, where the discount factor γ is 0.99, the range for ϵ in the ϵ-greedy policy is from 0.05 to 1.0, and the sizes of the memory buffer B and mini-batch M are 10,000 and 128, respectively.
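For reference, the stated settings can be collected as in the sketch below; the network architecture, the number of candidate events, and the optimizer are assumptions, since only the hyperparameters are reported in the text.

```python
from tensorflow import keras

# Settings stated in the text: gamma = 0.99, epsilon in [0.05, 1.0], buffer 10,000, mini-batch 128.
GAMMA, EPS_MIN, EPS_MAX = 0.99, 0.05, 1.0
BUFFER_SIZE, MINI_BATCH = 10_000, 128

# The network itself is not specified in the text; the layer sizes, the number of
# candidate events NUM_EVENTS, and the optimizer below are illustrative assumptions.
NUM_STATES, NUM_EVENTS = 20, 4
q_model = keras.Sequential([
    keras.Input(shape=(NUM_STATES,)),          # one-hot encoded state of PFSM (14)
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_EVENTS),            # one Q value per candidate input event
])
q_model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
```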
Through calculation, we obtain a state feedback controller (15), which is shown in Table 1.
Table 1. A state feedback controller of PFSM (14).
Model (14) is a PFSM with 20 states, which is not a simple system. Here, we utilize average rewards to track the performance during training (see Figure 2). It is easy to observe that as training time goes on, the performance increases and tends to be stable. We put the state feedback controller (15), as shown in Table 1, into PFSM (14) and obtain a closed-loop system (16).
Figure 2. Performance of Algorithm 1 in Example 1.
The state transition trajectory of the closed-loop system (16) starting from any initial state is shown in Figure 3. It can be observed from Figure 3 that all states reach x_e after a finite number of steps and then stay at x_e forever with probability one. This demonstrates the effectiveness of our controller. The number of steps required to reach x_e from each state is shown in Figure 4. From these results, we can observe that, based on DQN, Algorithm 1 can solve the stabilization problem of non-small-scale PFSMs.
Figure 3. Evolution of the closed-loop system (16).
Figure 4. The number of steps required to stabilize PFSM (14) to x_e.
5 Conclusion
This article studied the state feedback stabilization of PFSMs using the DQN method. The feedback stabilization problem of PFSMs was first transformed into an optimization problem. A DQN was built, whose two key parts, the TD target and the Q function, are approximated by neural networks. Then, based on the DQN and a stabilizability condition derived in this paper, an algorithm was developed. The algorithm can be used to solve the optimization problem mentioned above and thereby solve the feedback stabilization problem of PFSMs. Since DQN avoids the limited-capacity problem of Q-learning, our algorithm can handle high-dimensional complex systems. Finally, an illustrative example was provided to show the effectiveness of our method.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
HT: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Writing—review & editing. XS: Conceptualization, Formal analysis, Investigation, Methodology, Validation, Writing—original draft. YH: Formal analysis, Investigation, Writing—review & editing.
Funding
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded by the National Key R&D Program of China (2021YFB3203202) and Chongqing Nature Science Foundation (cstc2020jcyj-msxmX0708).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Acernese, A., Yerudkar, A., Glielmo, L., and Vecchio, C. D. (2020). Double deep-Q learning-based output tracking of probabilistic Boolean control networks. IEEE Access 8, 199254–199265. doi: 10.1109/ACCESS.2020.3035152
Ding, S., Wang, Z., and Zhang, H. (2019). Quasi-synchronization of delayed memristive neural networks via region-partitioning-dependent intermittent control. IEEE Trans. Cybern. 49, 4066–4077. doi: 10.1109/TCYB.2018.2856907
El-Maleh, A., and Al-Qahtani, A. (2014). A finite state machine based fault tolerance technique for sequential circuits. Microelectron. Reliab. 54, 654–661. doi: 10.1016/j.microrel.2013.10.022
Fadhil, A., Wang, Y., and Reiterer, H. (2019). Assistive conversational agent for health coaching: a validation study. Methods Inf. Med. 58, 9–23. doi: 10.1055/s-0039-1688757
Han, X., and Chen, Z. (2018). A matrix-based approach to verifying stability and synthesizing optimal stabilizing controllers for finite-state automata. J. Franklin Inst. 355, 8642–8663. doi: 10.1016/j.jfranklin.2018.09.009
Kobayashi, K. (2006). “Modeling of discrete dynamics for computational time reduction of model predictive control,” in Proceedings of the 17th International Symposium on Mathematical Theory of Networks and Systems (Tokyo), 628–633.
Kobayashi, K., and Imura, J. (2007). “Minimality of finite automata representation in hybrid systems control,” in Hybrid Systems: Computation and Control (Berlin, Heidelberg: Springer), 343–356. doi: 10.1007/978-3-540-71493-4_28
Kobayashi, K., Imura, J., and Hiraishi, K. (2011). Stabilization of finite automata with application to hybrid systems control. Discret. Event Dyn. Syst. 21, 519–545. doi: 10.1007/s10626-011-0110-2
Li, J., and Tan, Y. (2019). A probabilistic finite state machine based strategy for multi-target search using swarm robotics. Appl. Soft Comput. 77, 467–483. doi: 10.1016/j.asoc.2019.01.023
Özveren, C., Willsky, A., and Antsaklis, P. (1991). Stability and stabilizability of discrete event dynamic systems. J. ACM 38, 729–751. doi: 10.1145/116825.116855
Passino, K., Michel, A., and Antsaklis, P. (1994). Lyapunov stability of a class of discrete event systems. IEEE Trans. Automat. Contr. 39, 269–279. doi: 10.1109/9.272323
Piccinini, A., Previdi, F., Cimini, C., Pinto, R., and Pirola, F. (2018). Discrete event simulation for the reconfiguration of a flexible manufacturing plant. IFAC-PapersOnLine 51, 465–470. doi: 10.1016/j.ifacol.2018.08.362
Ratsaby, J. (2019). On deterministic finite state machines in random environments. Probab. Eng. Inf. Sci. 33, 528–563. doi: 10.1017/S0269964818000451
Shah, S., Velardo, C., Farmer, A., and Tarassenko, L. (2017). Exacerbations in chronic obstructive pulmonary disease: identification and prediction using a digital health system. J. Med. Internet Res. 19:e69. doi: 10.2196/jmir.7207
Tarraf, D., Megretski, A., and Dahleh, M. (2008). A framework for robust stability of systems over finite alphabets. IEEE Trans. Automat. Contr. 53, 1133–1146. doi: 10.1109/TAC.2008.923658
Tian, H., and Hou, Y. (2019). State feedback design for set stabilization of probabilistic boolean control networks. J. Franklin Inst. 356, 4358–4377. doi: 10.1016/j.jfranklin.2018.12.027
Tian, H., Zhang, H., Wang, Z., and Hou, Y. (2017). Stabilization of k-valued logical control networks by open-loop control via the reverse-transfer method. Automatica 83, 387–390. doi: 10.1016/j.automatica.2016.12.040
Tian, Y., and Wang, Z. (2020). A new multiple integral inequality and its application to stability analysis of time-delay systems. Appl. Math. Lett. 105:106325. doi: 10.1016/j.aml.2020.106325
Vayadande, K., Sheth, P., Shelke, A., Patil, V., Shevate, S., Sawakare, C., et al. (2022). Simulation and testing of deterministic finite automata machine. Int. J. Comput. Sci. Eng. 10, 13–17. doi: 10.26438/ijcse/v10i1.1317
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F., and Carrasco, R. (2005). Probabilistic finite-state machines - part I. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1013–1025. doi: 10.1109/TPAMI.2005.147
Wang, L., Zhu, B., Wang, Q., and Zhang, Y. (2017). Modeling of hot stamping process procedure based on finite state machine (FSM). Int. J. Adv. Manuf. Technol. 89, 857–868. doi: 10.1007/s00170-016-9097-z
Xu, X., Zhang, Y., and Hong, Y. (2013). “Matrix approach to stabilizability of deterministic finite automata,” in 2013 American Control Conference (Washington, DC), 3242–3247.
Yan, Y., Chen, Z., and Liu, Z. (2015a). Semi-tensor product approach to controllability and stabilizability of finite automata. J. Syst. Eng. Electron. 26, 134–141. doi: 10.1109/JSEE.2015.00018
Yan, Y., Chen, Z., and Yue, J. (2015b). STP approach to controllability of finite state machines. IFAC-PapersOnLine 48, 138–143. doi: 10.1016/j.ifacol.2015.12.114
Zhang, X. (2018). Application of discrete event simulation in health care: a systematic review. BMC Health Serv. Res. 18, 1–11. doi: 10.1186/s12913-018-3456-4
Zhang, Z., Chen, Z., Han, X., and Liu, Z. (2020a). Stabilization of probabilistic finite automata based on semi-tensor product of matrices. J. Franklin Inst. 357, 5173–5186. doi: 10.1016/j.jfranklin.2020.02.028
Keywords: probabilistic finite state machine (PFSM), deep Q-network (DQN), feedback stabilization, artificial neural network (ANN), controller
Citation: Tian H, Su X and Hou Y (2024) Feedback stabilization of probabilistic finite state machines based on deep Q-network. Front. Comput. Neurosci. 18:1385047. doi: 10.3389/fncom.2024.1385047
Received: 11 February 2024; Accepted: 08 April 2024;
Published: 02 May 2024.
Edited by:
Yang Cui, University of Science and Technology Liaoning, China
Copyright © 2024 Tian, Su and Hou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Hui Tian, tianhui@cqupt.edu.cn