A Wasserstein-based distributionally robust neural network for non-intrusive load monitoring

Zhang, Qing; Yan, Yi; Kong, Fannie; Chen, Shifei; Yang, Linfeng

doi:10.3389/fenrg.2023.1171437

ORIGINAL RESEARCH article

Front. Energy Res., 05 April 2023

Sec. Smart Grids

Volume 11 - 2023 | https://doi.org/10.3389/fenrg.2023.1171437

This article is part of the Research Topic: Key Technology of Smart Energy System Optimization

Qing Zhang¹,²

Yi Yan¹*

Fannie Kong³

Shifei Chen³

Linfeng Yang¹,²

¹School of Computer Electronics and Information, Guangxi University, Nanning, China
²Guangxi Key Laboratory of Multimedia Communication and Network Technology, Guangxi University, Nanning, China
³School of Electrical Engineering, Guangxi University, Nanning, China

Non-intrusive load monitoring (NILM) is a technique that uses electrical data analysis to disaggregate the total energy consumption of a building or home into the energy consumption of individual appliances. To address the data uncertainty problem in non-intrusive load monitoring, this paper constructs an ambiguity set to improve the robustness of the model based on the distributionally robust optimization (DRO) framework using the Wasserstein metric. Also, for the hard-to-solve semi-infinite programming problem, a novel and computationally efficient upper-layer approximation is used to transform it into an easily solvable regularization problem. Two different data feature extraction methods are used on two open-source datasets, and the experimental results show that the proposed model has good robustness and performs better in identifying devices with large fluctuations. The improvement is about 6% compared to that of the convolutional neural network model without the addition of distributionally robust optimization. The proposed method supports transfer learning and can be added to the neural network in the form of a single-layer net, avoiding unnecessary training times, while ensuring accuracy.

1 Introduction

Most countries around the world are witnessing rapid growth in building energy use; commercial and residential buildings account for more than one-third of global energy consumption and more than 40% of global carbon dioxide emissions (Yoon S H et al., 2018). To improve energy efficiency, it is necessary to adopt more suitable energy management techniques (Zhang D et al., 2021) or to use smart devices to collect more detailed equipment data (Xie H et al., 2023). Since the 1990s, non-intrusive load monitoring (NILM) (Hart G W, 1992) has become one of the dominant frameworks in the field of energy consumption detection (Azizi E et al., 2021; Gillis J M et al., 2017). NILM is a technology that uses electrical data analysis to disaggregate the total energy consumption of a building or home into the energy consumption of individual appliances. This is achieved without additional hardware sensors: the aggregate electrical data are analyzed to identify each appliance's energy consumption, frequency of use, and energy peaks. Compared with traditional energy monitoring techniques, NILM has the advantages of being non-intrusive, less costly, and scalable, and of providing more accurate energy consumption data. This technology increases the interaction between electricity suppliers and consumers. For suppliers, NILM can help them understand the power models of various appliances more accurately; consumers can target specific appliances for more rational use (Liu Y et al., 2018).

In general, there are two approaches to NILM: the event-based approach and the event-less approach. The former usually performs device identification through the state transitions of a single appliance in the total measurement data. The latter often does the matching by separating the sample data of one or more appliances from the aggregated data. In this paper, we adopt an event-based method, which has the following steps: event detection, feature extraction, and load identification. The event detection phase records the changes in the aggregated data caused when one or more appliances are activated or when a state transition occurs. Features are then extracted from the data in this phase; the extracted features should maximize the differences between different appliances and minimize the differences between instances of the same appliance (Zheng Z et al., 2018). Selecting an effective set of appliance features is still challenging, and an appropriate feature representation can greatly affect the accuracy of appliance recognition (Liu Y et al., 2018). The features are then used in load recognition to identify different appliance classes.

Load recognition is one of the important tasks of NILM, which uses machine learning techniques to extract the electrical feature vectors of each appliance from the aggregate measurements and match them to their respective classes at the output (Azizi E et al., 2021). The electrical appliance feature data are extracted at different sampling rates (high or low frequency), and which data are used depends on the appliance features required by the adopted algorithm. Low-frequency data usually record appliance data over long periods with long intervals between samples, usually seconds or minutes. High-frequency data provide more detailed features, which allow us to consider the steady-state, transient, and other characteristics of appliances and to extract the relationship between voltage and current. Several related studies have demonstrated the feasibility of identification techniques for high-frequency features (Du L et al., 2015; De Baets L et al., 2018; De Baets L et al., 2018; Wang A L et al., 2018; Abd El-Ghany H A et al., 2021; Chea R et al., 2022; Lu J et al., 2023). Wang A L et al. (2018) developed a classification method for household appliances based on the shape features of V-I trajectories. Du L et al. (2015) used binary mapping of voltage and current trajectories to obtain features for appliance classification and to compare and analyze the different features; the binary images were directly input into the classifier, which achieved good accuracy on the PLAID (Gao J et al., 2014). De Baets L et al. (2018) proposed interpreting the V-I trajectories as weighted pixelated images and trained and tested on the WHITED dataset and the PLAID; the experiments showed that it was also feasible to use the processed V-I pixel maps directly as input to the neural network.

For the practical application of NILM, there are two main common challenges: 1) the accuracy of the extracted feature-vector data directly affects the final accuracy of the model, and it is crucial to resolve the instability of the data. 2) Training a recognition model from scratch for different brands of appliances can be time-consuming and expensive in terms of computational resources, and even with an extensive coverage database, maintaining the database would be a challenge as the number of appliances increases. Transfer learning allows different tasks to use the same learning framework, which reduces modeling and computational costs and is one of the solutions to problem 2. For problem 1, currently, the common methods used to solve this problem mathematically are stochastic programming (SP) and robust optimization (RO). SP assumes that the uncertainty of the problem follows an assumed probability distribution; then, it is feasible to transform it into a deterministic problem, but the intractable problem is to find the appropriate assumed distribution (Asensio M et al., 2015). RO neglects to extract probabilistic information about the uncertainty and instead, gives rather conservative solutions, i.e., always looking for the best solution in the worst case (Wei W et al., 2014).

To combine the characteristics of SP and RO, researchers have proposed a new approach called distributionally robust optimization (DRO) (Delage E et al., 2010; Rahimian H et al., 2019; Cheramin M et al., 2022). Instead of assuming a single probability distribution as in SP, DRO describes the uncertain distribution by an ambiguity set and minimizes the expected cost under the worst-case distribution in that set. There are two main approaches for constructing the ambiguity set: one based on moments and the other on distances. Since we only have part of the historical data and do not know the true probability distribution, the constructed ambiguity set should contain the true data distribution as far as possible in order to get better results. It should be noted that as the amount of historical data increases, the ambiguity set becomes progressively smaller and closer to the true distribution than the ambiguity set built from less data.

Our contribution has three main aspects: 1) we show that the DRO approach can be used in NILM and that it supports transfer learning. The optimization module of DRO can be used as part of an end-to-end deep learning network and can be incorporated into the pipeline in the form of a single-layer network structure. This approach allows for easier modification of the network, thus improving the recognition of appliance features. 2) In addition to using V-I trajectory maps to represent appliance features, we apply the Euclidean distance matrix as a preprocessing step, which improves the uniqueness of appliance features. 3) We evaluate this method on two open-source datasets. Unlike traditional methods, we use aggregated data from the entire house for training and testing, instead of data from the submeters of individual appliances, which is more realistic.

2 A proposed distributionally robust method

2.1 Classical learning model

The goal of supervised learning is to derive an unknown objective function $f : X \to Y$ from the available historical data. We assume that the training samples are independent of each other and follow an unknown distribution $P$ on $X \times Y$ ; the objective function $f$ then maps any input $x \in X$ to an output $y \in Y$ (e.g., for a binary classification problem, the label is $- 1$ or $1$ ). The space of all mapping functions from $X$ to $Y$ is very large, and learning the target function from a finite number of samples is very difficult. Therefore, it is convenient to constrain the search space to a structured family of candidate functions, $H \subseteq R^{X}$ , e.g., the space of all linear functions or of all neural networks with a fixed number of layers. We refer to a candidate function $h \in H$ as a hypothesis and to $H$ as the hypothesis space.

Considering a convolutional neural network with $M$ layers, we can obtain the following:

H = \left\{h (\cdot) \;\middle|\; \begin{matrix} \exists ϕ_{m} (\cdot), m \in [M], \\ h (x) = σ_{M} (W_{M} (\dots σ_{1} (W_{1} x))) \end{matrix}\right\} (1)

$σ$ is the activation function for each layer. Empirical risk minimization (ERM) allows the trained machine learning model to achieve excellent performance on data sampled from the distribution followed by its training dataset. Then, the target problem can be written as follows:

\inf_{h \in H} \{\frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i})\} = \inf_{h \in H} \{E_{{\hat{P}}_{N}} [l (h (x), y)]\} (2)

where ${\hat{P}}_{N}$ is a simple unbiased estimator of the unknown true distribution $P^{*}$ based on the empirical data $(x, y)$ , and ${\hat{P}}_{N}$ can be obtained by the Dirac measure, shown as follows:

{\hat{P}}_{N} = \frac{1}{N} \sum_{i = 1}^{N} δ_{(x^{i}, y^{i})} (3)

Intuitively, as $N \to \infty$ , ${\hat{P}}_{N}$ should tend to the true distribution $P^{*}$ of $(x, y)$ . For the sake of description, we note that $ξ^{i} = (x^{i}, y^{i})$ and $L (ξ, h) = l (h (x), y)$ .

Non-intrusive load monitoring usually requires only the aggregated signals of the whole building to be collected, and by analyzing the aggregated signals, the working status of each sub-appliance in the building can be derived. Non-intrusive load monitoring can be divided into two aspects, load decomposition and load identification, and the experiments in this paper are based on load identification. Since the data used are high-frequency data, the transient characteristics are equivalent to the electrical characteristics that cause the events, and the transient characteristics include both voltage and current. Therefore, the input of the model in the experimental part of this paper is the transient electrical characteristics and the output is the equipment that matches such electrical characteristics, so as to achieve the purpose of load identification.

For the load identification problem, assume that there are $K$ classes of appliances and that we have a sample $x \in X$ . Our goal is to predict its label, represented by a $K$ -dimensional one-hot vector $y \in \{0,1\}^{K}$ , where $y = (y_{1}, y_{2}, \dots, y_{K})$ , $\sum_{i = 1}^{K} y_{i} = 1$ , and $y_{i} = 1$ if and only if $x$ belongs to class $i$ . For a given input $x$ , the conditional distribution of $y$ can be written as follows:

p (y | x) = \prod_{i = 1}^{K} p_{i}^{y_{i}} (4)

where $p_{i} = e^{w^{i} x} / \sum_{k = 1}^{K} e^{w^{k} x}$ , $i \in [K]$ , is the probability assigned to class $i$ and $w^{i}$ is the weight vector of class $i$ ; the log-likelihood can be written as follows:

\log p (y | x) = \sum_{i = 1}^{K} y_{i} \log (e^{w^{i} x} / \sum_{k = 1}^{K} e^{w^{k} x}) = y^{\top} W x - \log (1^{\top} e^{W x})

where $W ≜ [w^{1}; w^{2}; \dots; w^{K}]$ stacks the class weight vectors as rows and $1$ is the all-ones vector. The loss function is defined as $L (ξ, h) = \log (1^{\top} e^{W x}) - y^{\top} W x$ . Thus, our target is as follows:

\inf_{h \in H} \{E_{{\hat{P}}_{N}} [L (ξ, h)]\} (5)
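The log-loss above can be evaluated directly for a single sample. The following is a minimal sketch, assuming a linear score $W x$ with one weight row per class; the function names and the toy numbers are illustrative and not part of the original experiments.

```python
import math

def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s))) over a list of scores."""
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))

def softmax_log_loss(W, x, y):
    """Loss log(1^T e^{Wx}) - y^T W x for a one-hot label y.

    W: list of K weight rows (one per class), x: feature list.
    """
    scores = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    true_score = sum(y_k * s_k for y_k, s_k in zip(y, scores))
    return log_sum_exp(scores) - true_score

# Hypothetical toy problem: K = 3 classes, 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]
y = [1, 0, 0]
loss = softmax_log_loss(W, x, y)  # equals -log of the softmax probability of class 0
```

The loss is the negative log of the softmax probability of the true class, so it is zero only when the model puts all probability mass on that class.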

2.2 An approximation based on the Wasserstein metric

In fact, if only the empirical risk is minimized as in (2), many hypotheses are compatible with the existing training data, i.e., they accurately predict the output values from the input values in the existing dataset (Defourny B et al., 2010). Minimizing only the empirical loss therefore tends to overfit the sample: such hypotheses can produce predictions that do not match expectations on data other than the training data. This means that even if good results are obtained on $E_{{\hat{P}}_{N}} [L (ξ, h)]$ , the out-of-sample error $E_{P^{*}} [L (ξ, h)]$ may be large for the unknown true distribution $P^{*}$ .

Regularization is an effective method to combat overfitting, so it is better to approximate the solution of a regularized problem as opposed to solving the problem in (5). A common regularization is mostly seen in the following form:

\inf_{h \in H} \{E_{{\hat{P}}_{N}} [L (ξ, h)] + λ Ω (∙)\} (6)

where $Ω (∙)$ is a penalty term, $λ$ is the regularization weight of the regularization function, and the function minimizes the sum of the average loss and penalty terms. Usually, $Ω (∙) = {‖∙‖}_{p}$ , and the value of $p$ is $1, 2$ or $\infty$ . Even though there are many ideal theoretical models for the interpretation of regularization, there is a consensus that regularization methods that have been successfully validated in practice are heuristic methods (Wan L et al., 2013). Most popular interpretations of regularization methods rely on a priori probability distribution assumptions, which remain arbitrary in some perspectives. Equation 6, which consists of in-sample error and overfitting penalty, can be seen as an in-sample estimate of the out-of-sample error; however, this problem remains difficult to prove.

Based on the Wasserstein metric, we can consider getting the expected loss under distribution $Q$ close to the empirical distribution ${\hat{P}}_{N}$ , i.e., distribution $Q$ is able to produce training data outside of ${\hat{P}}_{N}$ with high confidence. In this way, we can achieve the goal of obtaining out-of-sample data. The distance measure between the two distributions $P$ and $Q$ can be expressed as follows:

\begin{array}{rl} W (P, Q) & = \inf_{π \in m (Ξ \times Ξ)} \{E_{π} [d (ξ^{P}, ξ^{Q})] : ξ^{P} \sim P, ξ^{Q} \sim Q\} \\ & = \inf_{π \in m (Ξ \times Ξ)} \left\{\begin{matrix} \int_{Ξ \times Ξ} d (ξ^{P}, ξ^{Q}) π (d ξ^{P}, d ξ^{Q}) : \\ π (d ξ^{P}, Ξ) = P (d ξ^{P}), \\ π (Ξ, d ξ^{Q}) = Q (d ξ^{Q}), \\ d (ξ^{P}, ξ^{Q}) = {‖x^{P} - x^{Q}‖}_{p} + κ 1_{\{y^{P} \neq y^{Q}\}} \end{matrix}\right\} \end{array} (7)

where $m (∙)$ is the set of probability measures on a measurable space and $π$ is the joint distribution of $ξ^{P}$ and $ξ^{Q}$ , which follow distributions $P$ and $Q$ , respectively. $d (∙)$ is the metric between two points of $Ξ$ , and $κ$ is a positive constant. Put differently, $W (P, Q)$ is the optimal value of a transportation problem: it is the minimum overall cost of moving distribution $P$ to distribution $Q$ , where $d$ is the cost of moving a unit of mass from $ξ^{P}$ to $ξ^{Q}$ .
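For two small empirical distributions, the transportation problem in (7) can be solved exactly by enumeration. The sketch below assumes $p = 1$ and two uniform samples of equal size, in which case an optimal coupling can be taken to be a one-to-one matching; the function names and toy data are hypothetical, and brute force over permutations is only practical for a handful of points.

```python
from itertools import permutations

def transport_cost(xi_p, xi_q, kappa):
    """d(xi^P, xi^Q) = ||x^P - x^Q||_1 + kappa * 1{y^P != y^Q} (p = 1 assumed)."""
    (x_p, y_p), (x_q, y_q) = xi_p, xi_q
    feature_gap = sum(abs(a - b) for a, b in zip(x_p, x_q))
    return feature_gap + (kappa if y_p != y_q else 0.0)

def wasserstein_equal_size(samples_p, samples_q, kappa):
    """Exact W(P, Q) for two uniform empirical distributions with N points each.

    With equal uniform weights the optimum is attained by a permutation,
    so all N! matchings are enumerated (tiny N only).
    """
    n = len(samples_p)
    if n != len(samples_q):
        raise ValueError("both samples must have the same size")
    best = min(
        sum(transport_cost(samples_p[i], samples_q[j], kappa)
            for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n

# Hypothetical toy data: (feature vector, label) pairs.
P = [((0.0,), 0), ((1.0,), 1)]
Q = [((0.1,), 0), ((1.2,), 1)]
w = wasserstein_equal_size(P, Q, kappa=5.0)
```

A large $κ$ makes it expensive to transport mass between samples with different labels, so the cheapest plan here matches points that share a label.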

2.2.1 Ambiguity set

The ambiguity set is the Wasserstein ball (Zhao C et al., 2018) of radius $ϵ$ centered at the empirical distribution ${\hat{P}}_{N}$ :

B_{ϵ} ({\hat{P}}_{N}) = \{P | W (P, {\hat{P}}_{N}) \leq ϵ\} (8)

A suitable and sufficiently large ball will contain the unknown true input–output distribution $P^{*}$ ; Duan C et al. (2018) give a possible choice for the radius. When the ball contains $P^{*}$ , the worst-case expectation $\sup_{P \in B_{ϵ} ({\hat{P}}_{N})} E_{P} [L (ξ, h)]$ is an upper bound on the out-of-sample error $E_{P^{*}} [L (ξ, h)]$ . This allows us to replace (6) with a new formulation that minimizes the worst-case expectation, shown as follows:

\inf_{h \in H} \{\sup_{P \in B_{ϵ} ({\hat{P}}_{N})} E_{P} [L (ξ, h)]\} (9)

2.2.2 Support set

The purpose of the data-driven support set is to capture a priori information about the range of inputs and outputs. We adopt upper and lower bounds on each dimension to specify the support set of uncertainty, given as follows:

Ξ = \{ξ | Π^{-} \leq ξ \leq Π^{+}\}, (10)

where the upper and lower bound can be determined based on ${\{ξ^{i}\}}_{i = 1}^{N}$ , and (10) can be reformulated as follows:

Ξ = \{ξ | A ξ \leq b\} (11)

where $A = [I; - I]$ and $b = [Π^{+}; - Π^{-}]$ , so that the rows of $A ξ \leq b$ read $ξ \leq Π^{+}$ and $- ξ \leq - Π^{-}$ .
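As an illustration of the data-driven support set, the following sketch takes the bounds as the per-dimension minimum and maximum of the observed samples (one possible choice; the text only says the bounds are determined from the data) and builds the polyhedral form $A ξ \leq b$ . The lower bounds enter $b$ with a minus sign so that the inequality reproduces the box in (10). All names are illustrative.

```python
def box_support(samples):
    """Per-dimension lower and upper bounds (Pi^-, Pi^+) from observed samples."""
    dims = range(len(samples[0]))
    lower = [min(s[d] for s in samples) for d in dims]
    upper = [max(s[d] for s in samples) for d in dims]
    return lower, upper

def polyhedral_form(lower, upper):
    """Return (A, b) with A = [I; -I] and b = [Pi^+; -Pi^-]."""
    n = len(lower)
    identity = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    neg_identity = [[-v for v in row] for row in identity]
    return identity + neg_identity, list(upper) + [-v for v in lower]

def in_support(xi, A, b):
    """Check A xi <= b row by row."""
    return all(sum(a * v for a, v in zip(row, xi)) <= b_k
               for row, b_k in zip(A, b))

# Hypothetical two-dimensional samples.
samples = [[0.2, 1.0], [0.8, 3.0], [0.5, 2.0]]
lower, upper = box_support(samples)
A, b = polyhedral_form(lower, upper)
```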

3 The proposed solution methodology

3.1 Reformulation of the proposed model

Problem (9), obtained previously, is hard to solve directly because the decision variable of the inner worst-case expectation problem is a probability distribution ranging over the ambiguity set. The strong duality result in the study by Defourny B et al. (2010) helps us reformulate the worst-case expectation. The inner sub-problem of Eq. 9 can be rewritten in the following form:

\begin{array}{cl} \sup_{π \in m (Ξ \times Ξ)} & \int_{Ξ} L (ξ, h) π (d ξ, \hat{Ξ}) \\ \text{s.t.} & \left\{\begin{matrix} \int_{Ξ \times \hat{Ξ}} d (ξ, \hat{ξ}) π (d ξ, d \hat{ξ}) \leq ε \\ π (Ξ, d \hat{ξ}) = {\hat{P}}_{N} (d \hat{ξ}) \\ P (d ξ) = π (d ξ, \hat{Ξ}) \end{matrix}\right. \end{array} (12)

Since $P$ and ${\hat{P}}_{N}$ are discrete, i.e., $\hat{Ξ} = {\{ξ^{i}\}}_{i = 1}^{N}$ , we can obtain the following:

P (d ξ) = π (d ξ, \hat{Ξ}) = \frac{1}{N} \sum_{i = 1}^{N} P^{i} (d ξ) = \frac{1}{N} \sum_{i = 1}^{N} π (d ξ | \hat{ξ} = ξ^{i})

and

π (d ξ, \hat{ξ} = ξ^{i}) = π (d ξ | \hat{ξ} = ξ^{i}) ∙ {\hat{P}}_{N} (ξ^{i}) = \frac{1}{N} P^{i} (d ξ)

According to these two equations, it is possible to equivalently rewrite (12) to obtain the following:

\begin{array}{cl} \sup_{P^{i} \geq 0} & \frac{1}{N} \sum_{i = 1}^{N} \int_{Ξ} L (ξ, h) P^{i} (d ξ) \\ \text{s.t.} & \left\{\begin{matrix} \frac{1}{N} \sum_{i = 1}^{N} \int_{Ξ} d (ξ, ξ^{i}) P^{i} (d ξ) \leq ε \\ \int_{Ξ} P^{i} (d ξ) = 1, \forall i \in [N] \end{matrix}\right. \end{array} (13)

Considering the Lagrangian dual function of (13), it is not difficult to obtain the following:

\begin{array}{cl} \min_{λ_{i}, β \geq 0} & \sum_{i = 1}^{N} λ_{i} + ε β \\ \text{s.t.} & L (ξ, h) - N λ_{i} - β d (ξ, ξ^{i}) \leq 0, \forall ξ \in Ξ, \forall i \in [N] \end{array} (14)

where $β \geq 0$ and $λ_{i}$ are dual variables of the constraints.
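The step from (13) to (14) can be made explicit with the Lagrangian. The following is a sketch of the standard argument, with $β \geq 0$ attached to the transport budget and $λ_{i}$ to the normalization of $P^{i}$ ; the symbol $\mathcal{L}$ for the Lagrangian is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}(P^{1},\dots,P^{N},\beta,\lambda)
 &= \frac{1}{N}\sum_{i=1}^{N}\int_{\Xi} L(\xi,h)\,P^{i}(d\xi)
  + \beta\Big(\varepsilon - \frac{1}{N}\sum_{i=1}^{N}\int_{\Xi} d(\xi,\xi^{i})\,P^{i}(d\xi)\Big)
  + \sum_{i=1}^{N}\lambda_{i}\Big(1-\int_{\Xi}P^{i}(d\xi)\Big) \\
 &= \varepsilon\beta + \sum_{i=1}^{N}\lambda_{i}
  + \frac{1}{N}\sum_{i=1}^{N}\int_{\Xi}\big[L(\xi,h)-\beta\,d(\xi,\xi^{i})-N\lambda_{i}\big]\,P^{i}(d\xi).
\end{aligned}
```

The supremum over the non-negative measures $P^{i}$ is finite only if the bracket is non-positive for every $ξ \in Ξ$ , which is exactly the constraint in (14); in that case the supremum equals $ε β + \sum_{i} λ_{i}$ , the objective of (14).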

3.2 Upper approximation of the proposed model

Equation 14, obtained previously, is a large-scale semi-infinite programming problem that is still intractable. In this section, we obtain a conservative upper bound through a sequence of upper approximations (Everett III H, 1963). First, we extend the definition of the Lipschitz constant: for a function $f : X \to Y$ and a set $S \subseteq X$ equipped with a norm, we define the Lipschitz modulus of $f$ as follows:

\operatorname{lip} (f) := \sup_{z, z^{'} \in S} \{\frac{‖f (z) - f (z^{'})‖}{‖z - z^{'}‖} : z \neq z^{'}\}

Then, an approximate upper bound for (14) can be obtained as follows, where $ξ = (x, y)$ and $λ$ takes the role of the multiplier $β$ in (14) after the variables $λ_{i}$ have been eliminated:

\begin{array}{rl} (14) & = \inf_{λ \geq 0} ε λ + \frac{1}{N} \sum_{i = 1}^{N} \sup_{ξ \in Ξ} [L (y h (x)) - λ (‖x - x^{i}‖ + κ 1_{\{y \neq y^{i}\}})] \\ & \leq \inf_{λ \geq 0} ε λ + \frac{1}{N} \sum_{i = 1}^{N} \sup_{ξ \in Ξ} [L (y^{i} h (x^{i})) + \operatorname{lip} (L) |y h (x) - y^{i} h (x^{i})| - λ (‖x - x^{i}‖ + κ 1_{\{y \neq y^{i}\}})] \\ & \leq \inf_{λ \geq 0} ε λ + \frac{1}{N} \sum_{i = 1}^{N} \sup_{ξ \in Ξ} [L (y^{i} h (x^{i})) + \operatorname{lip} (L) \operatorname{lip} (h) ‖x - x^{i}‖ 1_{\{y = y^{i}\}} + \operatorname{lip} (L) |h (x) + h (x^{i})| 1_{\{y \neq y^{i}\}} - λ (‖x - x^{i}‖ + κ 1_{\{y \neq y^{i}\}})] \\ & \leq \inf_{λ \geq 0} ε λ + \frac{1}{N} \sum_{i = 1}^{N} \sup_{ξ \in Ξ} [L (y^{i} h (x^{i})) - λ (‖x - x^{i}‖ + κ 1_{\{y \neq y^{i}\}}) + \operatorname{lip} (L) \max \{2 \frac{c}{κ}, \operatorname{lip} (h)\} (‖x - x^{i}‖ + κ 1_{\{y \neq y^{i}\}})], \end{array}

where $c = \sup_{h \in H, x \in X} |h (x)|$ . The first inequality holds because of the Lipschitz continuity of $L$ ; the second holds because of the Lipschitz continuity of $h$ when $y = y^{i}$ and because $|y h (x) - y^{i} h (x^{i})| = |h (x) + h (x^{i})|$ when $y \neq y^{i}$ ; the third uses $|h (x) + h (x^{i})| \leq 2 c$ . Choosing $λ = \operatorname{lip} (L) \max \{\operatorname{lip} (h), 2 c / κ, 1 / κ\}$ , the supremum is attained at $ξ = ξ^{i}$ , and we obtain the upper bound on the worst-case expectation:

\frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \operatorname{lip} (L) \max \{\operatorname{lip} (h), \frac{\max \{1, 2 \sup_{h \in H, x \in X} |h (x)|\}}{κ}\} (15)

If $κ \to \infty$ , we have a further upper approximation of the DRO model, given as follows:

\inf_{h \in H} \{\frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \operatorname{lip} (L) \operatorname{lip} (h)\} (16)

3.3 Solving the reformulated model

Gouk H et al. (2021) provide a comprehensive analysis of the application of the Lipschitz function to neural networks. The composite property allows us to extend the single Lipschitz constant to the entire neural network; using the property $\operatorname{lip} (h) \leq \prod_{m = 1}^{M} \operatorname{lip} (σ_{m}) ‖W_{m}‖$ , we can obtain an upper bound for (16):

(16) \leq \inf_{h \in H} \{\frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \operatorname{lip} (L) \prod_{m = 1}^{M} \operatorname{lip} (σ_{m}) ‖W_{m}‖\} . (17)

$σ_{m}$ is the activation function of the $m$ th layer of the neural network with $M$ layers; $‖W_{m}‖$ is the operator norm induced by the norms on the spaces $R^{n_{m}}$ and $R^{n_{m + 1}}$ . As $κ \to \infty$ and with $\tilde{σ} = \prod_{m = 1}^{M} \operatorname{lip} (σ_{m})$ , the objective of (17) satisfies the following, where the last step is the inequality of arithmetic and geometric means:

\begin{array}{rl} & \frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \operatorname{lip} (L) \prod_{m = 1}^{M} \operatorname{lip} (σ_{m}) ‖W_{m}‖ \\ = & \frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \tilde{σ} \operatorname{lip} (L) \prod_{m = 1}^{M} ‖W_{m}‖ \\ \leq & \frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + ε \tilde{σ} \operatorname{lip} (L) {(\sum_{m = 1}^{M} \frac{‖W_{m}‖}{M})}^{M} \end{array} (18)
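The two penalty terms in (17) and (18) are easy to compare numerically. The sketch below assumes the operator norm induced by the infinity-norm (the largest absolute row sum), which is one admissible choice among the induced norms mentioned in the text, and a common Lipschitz constant for all activations (1 for ReLU). The function names and the toy weights are hypothetical.

```python
def operator_norm_inf(W):
    """Operator norm induced by the infinity-norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in W)

def lipschitz_penalties(weights, act_lipschitz=1.0):
    """Return (product term of (17), relaxed term of (18)) without eps * lip(L).

    act_lipschitz is lip(sigma_m), assumed identical for every layer.
    """
    norms = [operator_norm_inf(W) for W in weights]
    n_layers = len(norms)
    sigma_tilde = act_lipschitz ** n_layers
    product = 1.0
    for norm in norms:
        product *= norm
    relaxed = (sum(norms) / n_layers) ** n_layers
    return sigma_tilde * product, sigma_tilde * relaxed

# Hypothetical two-layer network: a 2x2 layer followed by a 1x2 layer.
weights = [[[1.0, -2.0], [0.5, 0.5]], [[0.5, 0.25]]]
product_term, relaxed_term = lipschitz_penalties(weights)
```

The relaxed term is never smaller than the product term, and the two coincide only when all layer norms are equal; the relaxation replaces a product by a sum, which is what leads to the additive penalty in the next step.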

Everett III H (1963) proved that when (17) has an optimal solution $h^{*}$ (since each hypothesis $h$ has its unique weight matrices, $W_{[M]} = (W_{M}, \dots, W_{2}, W_{1})$ , the optimal weights at this point can be written as $W_{[M]}^{*}$ ), then $h^{*}$ is also an optimal solution to the following constrained problem:

\begin{array}{cl} \inf_{h \in H} & \frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) \\ \text{s.t.} & {(\sum_{m = 1}^{M} \frac{‖W_{m}‖}{M})}^{M} \leq {(\frac{θ}{M})}^{M} \end{array}

for $θ = \sum_{m = 1}^{M} ‖W_{m}^{*}‖$ . Therefore, there exists a Lagrange multiplier $\tilde{λ},$ such that $W_{[M]}^{*}$ is the solution to the minimization of the following penalized problem:

\inf_{h \in H} \frac{1}{N} \sum_{i = 1}^{N} l (h (x^{i}), y^{i}) + \tilde{λ} \sum_{m = 1}^{M} ‖W_{m}‖ . (19)

This means that when $κ \to \infty$ , there exists $\tilde{λ} > 0$ , such that the upper bound of the distributionally robust optimization model (9) is (19), which is a minimization problem with regularization terms.

4 Experiment design

4.1 Datasets

In our experiments, we use aggregated data from the whole building rather than measurements from submeters. We use two open-source high-frequency datasets, PLAID (Gao J et al., 2014) and LILACD (Kahl M et al., 2019). The aggregated data in PLAID are measured at 30 kHz and contain 1478 different states, such as on or off, for 12 different devices from 11 different appliance types in more than 55 households in Pittsburgh, Pennsylvania, United States of America. The aggregated data in LILACD contain 16 different types of appliances sampled at 50 kHz. Both datasets come with labels for on and off events, which simplifies locating the voltage and current details during an event. PLAID is a residential dataset in which all appliances are single-phase, whereas LILACD is an industrial dataset with an assortment of industrial and household electrical equipment. In addition, the appliances in LILACD operate in both three-phase and single-phase modes, which makes the situation more intricate than in PLAID. In the following device labeling, the prefix “3p” indicates that the device works in the three-phase mode.

4.2 Evaluation metrics

As recommended by Makonin S et al. (2015), we use the F1-score and Matthews correlation coefficient (MCC) to evaluate the classification performance by using the following equation:

\begin{array}{c} F_{\mathrm{score}} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \\ \mathrm{Precision} = \frac{TP}{TP + FP} \\ \mathrm{Recall} = \frac{TP}{TP + FN} \\ F_{\mathrm{macro}} = 100 \times \frac{1}{K} \sum_{i = 1}^{K} F_{\mathrm{score}}^{i} \end{array}

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative, $K$ is the number of appliances, $F_{\mathrm{score}}$ is the harmonic mean of precision and recall, and $F_{\mathrm{macro}}$ is the average of the $F_{\mathrm{score}}$ of all devices, also known as the macro average. For a given confusion matrix $C$ with $K$ classes, the MCC can be defined as follows:

\mathrm{MCC} = \frac{c \times s - \sum_{i = 1}^{K} p_{i} \times t_{i}}{\sqrt{(s^{2} - \sum_{i = 1}^{K} p_{i}^{2}) \times (s^{2} - \sum_{i = 1}^{K} t_{i}^{2})}}

where $t_{i} = \sum_{k = 1}^{K} C_{k i}$ , $p_{i} = \sum_{k = 1}^{K} C_{i k}$ , $c = \sum_{k = 1}^{K} C_{k k}$ , and $s = \sum_{i = 1}^{K} \sum_{j = 1}^{K} C_{i j}$ .
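Both metrics can be computed from a confusion matrix alone. The following is a minimal sketch; it assumes that entry `C[i][j]` counts samples of true class `i` predicted as class `j`, although both quantities are unchanged if the matrix is transposed. The function names and the example matrix are illustrative.

```python
import math

def macro_f1(C):
    """F_macro (in percent) from a K x K confusion matrix."""
    K = len(C)
    scores = []
    for i in range(K):
        tp = C[i][i]
        fn = sum(C[i]) - tp
        fp = sum(C[k][i] for k in range(K)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return 100.0 * sum(scores) / K

def mcc(C):
    """Multi-class Matthews correlation coefficient from a confusion matrix."""
    K = len(C)
    t = [sum(C[k][i] for k in range(K)) for i in range(K)]  # column sums
    p = [sum(C[i]) for i in range(K)]                       # row sums
    c = sum(C[k][k] for k in range(K))                      # correct predictions
    s = sum(p)                                              # total samples
    numerator = c * s - sum(p_i * t_i for p_i, t_i in zip(p, t))
    denominator = math.sqrt((s * s - sum(v * v for v in p))
                            * (s * s - sum(v * v for v in t)))
    return numerator / denominator if denominator else 0.0

# Hypothetical 2-class confusion matrix.
C = [[4, 1], [2, 3]]
f_macro, coefficient = macro_f1(C), mcc(C)
```

For two classes the multi-class formula reduces to the familiar binary MCC, which gives a quick sanity check of an implementation.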

4.3 Experiment setting

In our experiments, we used two methods of extracting features. The first is the commonly used V-I trajectory map, which extracts the steady-state current and voltage trajectories over one current cycle of the high-frequency data and thus captures the relationship between voltage and current in one cycle. Figure 1A shows the aggregated current data obtained over a period of time. Starting from a current of 0, multiple cycles are selected and averaged to obtain the current curve shown in Figure 1B. It should be noted that because the training data are selected from partial points in one measurement, the current in the original data does not always start from 0; therefore, some alignment of the data is required. Figure 1C shows the voltage data of the target appliance on the same horizontal axis as the current data; the voltage data are also averaged over multiple cycles. Pairing the voltage and current values recorded at the same instant, with the voltage as the horizontal coordinate and the current as the vertical coordinate, we compress the data to obtain a V-I trajectory map of the specified size; Figure 1D is a pixelated V-I map with a width of 50. The second method uses the Euclidean distance matrix proposed in the study by Dokmanic I et al. (2015), which represents the relationship between each pair of elements of the time-series signal in order to measure the correlation between different point locations. For example, for a sequence $\{t_{1}, t_{2}, \dots, t_{T}\}$ of length $T$ , we can obtain the Euclidean distance matrix $E_{T \times T}$ , shown as follows:

E_{T \times T} = \left[\begin{array}{ccc} d_{t_{1}, t_{1}} & \dots & d_{t_{1}, t_{T}} \\ ⋮ & ⋱ & ⋮ \\ d_{t_{T}, t_{1}} & \dots & d_{t_{T}, t_{T}} \end{array}\right]

where $d_{i, j}$ denotes the difference between point $i$ and point $j$ , $d_{i, j} = {‖t_{i} - t_{j}‖}_{p}$ . The $p$ value we considered in the experiment is $1$ . When the value of $d$ between two points exceeds threshold $ϵ$ , the value of the corresponding position in the map is $1$ . Subplot (E) in Figure 1 shows the Euclidean distance matrix representation of the appliance.
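The distance matrix described above can be sketched in a few lines for a scalar time series with $p = 1$ . The text states that entries exceeding the threshold are set to 1; how the remaining entries are treated is not specified, so leaving them unchanged is an assumption of this sketch, as are the function name and the toy signal.

```python
def distance_matrix(signal, threshold=None):
    """T x T matrix of pairwise distances |t_i - t_j| for a scalar sequence.

    If a threshold is given, entries above it are set to 1; entries at or
    below the threshold are kept as they are (an assumption of this sketch).
    """
    T = len(signal)
    E = [[abs(signal[i] - signal[j]) for j in range(T)] for i in range(T)]
    if threshold is not None:
        E = [[1.0 if v > threshold else v for v in row] for row in E]
    return E

# Hypothetical three-sample signal.
signal = [0.0, 0.5, 2.0]
raw = distance_matrix(signal)
clipped = distance_matrix(signal, threshold=1.0)
```

The matrix is symmetric with a zero diagonal, so it can be fed to the network as a single-channel image of size $T \times T$ .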

FIGURE 1. Extraction of current and voltage signals from the aggregated measurements. (A) Aggregate current. (B) Current waveform when CFL is turned on. (C) Voltage waveform when CFL is turned on. (D) V-I trajectory of CFL. (E) EDM of CFL.
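The pixelation of a V-I trajectory, as in panel (D), can be sketched as follows. This assumes one already aligned and averaged steady-state cycle of voltage and current samples and a binary image in which a pixel is set if any sample falls into it; the weighting of pixels and the exact scaling used in the experiments are not specified in the text, and the function name is hypothetical.

```python
import math

def vi_image(voltage, current, width=50):
    """Binary width x width image of one V-I cycle.

    Voltage is mapped to the horizontal axis and current to the vertical
    axis; each sample marks the pixel it falls into.
    """
    def to_bins(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for flat signals
        return [min(int((v - lo) / span * width), width - 1) for v in values]

    cols, rows = to_bins(voltage), to_bins(current)
    image = [[0] * width for _ in range(width)]
    for r, c in zip(rows, cols):
        image[width - 1 - r][c] = 1  # flip rows so larger current is drawn higher
    return image

# Hypothetical resistive load: current proportional to voltage over one cycle.
n = 200
voltage = [math.sin(2 * math.pi * k / n) for k in range(n)]
current = [0.5 * v for v in voltage]
image = vi_image(voltage, current, width=50)
```

For a purely resistive load the trajectory collapses to a straight line from the bottom-left to the top-right corner; reactive or non-linear loads open it into loops, which is what makes the image discriminative.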

In our experiments, we used a convolutional neural network with the structure shown in Figure 2 and trained it on the V-I trajectories obtained from the PLAID, which yields a pre-trained convolutional neural network model capable of classifying V-I trajectory maps. The network contains five layers, including three convolutional layers and two fully connected layers, and ReLU is used as the activation function for each convolutional and fully connected layer. The first convolutional layer filters the input image with 16 kernels of size 5 with a stride of two pixels. The second convolutional layer takes the output of the first convolutional layer as the input and filters it with 32 kernels. The third convolutional layer filters it with 64 kernels of size 5. The role of the neural network is to perform feature extraction on the input data. The convolutional and fully connected layers of the pre-trained model can be considered as a cascade of feature extractors, and we treat the last fully connected layer separately in order to fuse the DRO model, as follows:

1) All the layers except the last fully connected layer are extracted from the pre-trained model.

2) A fully connected layer combining the DRO model is linked to it.

3) Using V-I trajectory maps and the Euclidean distance matrix obtained from both datasets as the input, the last layer of the new network is separately trained until the stopping criterion is satisfied.

FIGURE 2. Structure of the network and addition of DRO modules.

It should be noted that, to prevent the number of training epochs from influencing the accuracy, we also train the pre-trained model for the same number of epochs as the DRO model, so that the model without DRO is obtained after the same amount of training. We also performed several cross-validations of the data based on stratified sampling, and the mean over these runs is reported as the final value.
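Stratified sampling keeps the class proportions roughly equal across folds, which matters when some appliance types have few events. The sketch below shows one simple way to build such folds; the number of folds, the seed, and the function name are assumptions, since the text does not state how the splits were generated.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into n_folds folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Hypothetical labels: six events of one appliance type, four of another.
labels = ["fan"] * 6 + ["cfl"] * 4
folds = stratified_folds(labels, n_folds=2)
```

Each fold then serves once as the test set while the remaining folds are used for training, and the metric is averaged over the folds.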

As we derived previously, the empirical cross-entropy with the regularization term $‖W_{m}‖$ is an upper bound for the worst case of all distributions in the Wasserstein ball, $m \in [M]$ , and since the empirical loss is still non-convex, it is suitable to use a local optimization method for the solution; we use a stochastic approximate gradient descent algorithm to adjust $W_{m}$ , updated as follows:

W_{m}^{k + 1} = \operatorname{prox}_{η_{k} \tilde{λ} ‖W_{m}‖} \left(W_{m}^{k} - η_{k} \nabla_{W_{m}} l (h (x^{i_{k}}, y^{i_{k}}))\right),

where $η_{k}$ is the step size and $i_{k}$ is selected uniformly at random from the index set $[N]$ . Following Nitanda A (2014), the proximal operator of a convex function $φ$ is defined as follows:

\operatorname{prox}_{φ} (W_{m}) := \underset{W_{m}^{'}}{\operatorname{argmin}} \; φ (W_{m}^{'}) + \frac{1}{2} {‖W_{m}^{'} - W_{m}‖}_{F}^{2},

where ${‖∙‖}_{F}$ stands for the Frobenius norm.
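To make the update concrete, suppose (as an assumption, since the norm in the regularization term is not restated here) that $‖W_{m}‖$ is the Frobenius norm. For $φ = c‖∙‖_{F}$ with $c = η_{k} \tilde{λ}$, the proximal operator then has the well-known closed form $\max(0, 1 - c/‖W_{m}‖_{F}) \, W_{m}$, i.e., the matrix is shrunk toward zero as a block. The Python sketch below implements one such step on a matrix stored as a list of rows; the function names and the example gradient are illustrative only.

```python
import math


def frobenius_norm(W):
    return math.sqrt(sum(w * w for row in W for w in row))


def prox_frobenius(W, c):
    """prox of c * ||.||_F at W: shrink the norm of W by c, or return zero if ||W||_F <= c."""
    norm = frobenius_norm(W)
    if norm <= c:
        return [[0.0 for _ in row] for row in W]
    scale = 1.0 - c / norm
    return [[scale * w for w in row] for row in W]


def proximal_sgd_step(W, grad, eta, lam):
    """One update W <- prox_{eta*lam*||.||_F}(W - eta * grad) for a single sample's gradient."""
    V = [[w - eta * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, grad)]
    return prox_frobenius(V, eta * lam)


# Toy usage: a 1x2 weight matrix and a made-up stochastic gradient.
W = [[3.0, 4.0]]
grad = [[0.0, 0.0]]
print(proximal_sgd_step(W, grad, eta=1.0, lam=1.0))
```

With $\tilde{λ} = 0$ the step reduces to plain stochastic gradient descent, and a larger $\tilde{λ}$ shrinks the weights more strongly at every step.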

4.4 Results on the PLAID

On the PLAID, we first experimented with different values of the penalty coefficient $\tilde{λ}$ of the regularization term. As shown in Figure 3, $\tilde{λ}$ takes values in $\{0,0.1,0.2,0.4, . . ., 1.0\}$ ; 0 corresponds to the neural network without the DRO model, and the non-zero values are the coefficients of the penalty term. Larger values of $\tilde{λ}$ lead to an excessively large upper bound in the approximation of the model, which weakens the differences between appliances and thus lowers the classification performance. The classification accuracy of the model gradually decreases when the parameter is greater than $1$ ; when the parameter is around $0.6$ , the model appears to reach stable performance. The conclusions are broadly similar under both preprocessing methods.


FIGURE 3. Effect of different penalty coefficients on the DRO model with PLAID data.

With the help of DRO, the F1-score on the PLAID improved from $0.9166 \pm 0.023$ to $0.9261 \pm 0.019$ and from $0.9065 \pm 0.018$ to $0.9278 \pm 0.018$ under the two preprocessing methods, respectively. In addition, we used the CNN model from the literature (De Baets L et al., 2018) and the KNN method from the study by Gurbuz F B et al. (2021) to classify the appliance features for comparison, as shown in Table 1. These results indicate that the distributionally robust optimization method effectively improves the classification ability of the network. For a more detailed analysis, we use confusion matrices to visualize the classification results of the proposed model. As shown in Figure 4, each row of the matrix corresponds to a predicted label and each column to a true label; the diagonal entries give the number of samples of each category that were correctly identified, i.e., the degree of matching between the predicted and true values, and the off-diagonal entries give the incorrect predictions.
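For readers who wish to reproduce this kind of evaluation, the sketch below builds a confusion matrix with rows indexed by predicted labels, as in Figure 4, and derives per-class and averaged F1-scores from it. Whether the reported F1-scores are macro-averaged is an assumption made here, and the function names and appliance labels are purely illustrative.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows index predicted labels; the other axis indexes true labels."""
    pos = {c: i for i, c in enumerate(classes)}
    M = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        M[pos[p]][pos[t]] += 1
    return M


def per_class_f1(M):
    scores = []
    for i in range(len(M)):
        tp = M[i][i]                        # diagonal: correctly identified samples
        predicted = sum(M[i])               # row sum: everything predicted as class i
        actual = sum(row[i] for row in M)   # column sum: everything truly of class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def macro_f1(M):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    f1 = per_class_f1(M)
    return sum(f1) / len(f1)


# Toy usage with made-up labels.
classes = ["kettle", "fridge"]
y_true = ["kettle", "kettle", "fridge", "fridge"]
y_pred = ["kettle", "fridge", "fridge", "fridge"]
M = confusion_matrix(y_true, y_pred, classes)
print(M, per_class_f1(M), macro_f1(M))
```

Off-diagonal entries in a row show which true appliances were mistaken for the predicted one, which is how misclassifications between similar loads can be read off the matrices.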


TABLE 1. Summary of the results of the two preprocessing methods combined with the DRO model under the PLAID.


FIGURE 4. (A) V-I trajectory map without DRO addition. (B) V-I trajectory map with DRO addition. (C) Euclidean distance matrix without DRO addition. (D) Euclidean distance matrix with DRO addition.

4.5 Results on the LILACD

The LILACD contains the energy usage of some industrial and household electrical appliances. We follow the same steps as before, again with cross-validation, and the DRO model is added by replacing the last layer of the pre-trained model. With the V-I trajectory map as the input, the network without DRO obtained an F1-score of $0.85 \pm 0.061$ , while the model with DRO achieved an F1-score of $0.9058 \pm 0.077$ ; SVM (Hernandez A S et al., 2021) achieved an F1-score of $0.787 \pm 0.091$ , and KNN obtained an F1-score of $0.77 \pm 0.07$ . With the Euclidean distance matrix representing the appliance features, the model with DRO achieved an F1-score of $0.87 \pm 0.025$ , which is about $5\%$ higher than the model without DRO and $15\%$ higher than the traditional machine learning methods, SVM and KNN. To evaluate the classification performance for each appliance, Figure 5 and Figure 6 show the detailed F1-score for each appliance. The DRO model yields the most pronounced gains for coffee machines, hair dryers, kettles, and raclettes, mainly because the irregular, fluctuating waveforms of these appliances significantly lowered the accuracy without DRO. These devices are similar in that they all work by generating heat, which results in variable device states, while their different gear settings directly affect the final power output, making them difficult to identify for methods with low robustness.


FIGURE 5. F1-score of appliance loads for the LILACD with the V-I trajectory.


FIGURE 6. F1-score of appliance loads for the LILACD with EDM.

Overall, these results indicate that the DRO model significantly improves robustness performance, allowing the network to cope well with irregular fluctuations in equipment and to achieve high accuracy in both residential and industrial equipment classifications.

5 Conclusion and future work

In this paper, we propose a Wasserstein metric-based distributionally robust optimization framework for the non-intrusive load monitoring problem. By reformulating the min–max problem as a regularized empirical loss minimization problem through multiple upper approximations, we establish a relationship between robustness and regularization in multiple variables. Two appliance feature extraction methods for high-frequency load data are used in the experiments to investigate the effect of the DRO method on the performance of the convolutional neural network under different input data. Moreover, the proposed DRO module can be added to the single-layer neural network in the form of constraints to improve the network performance. Experiments show that, compared to the network without DRO, the proposed method is more robust for devices with large fluctuations and identifies device features effectively. Since no method can directly solve the proposed DRO model, more accurate solution methods will be the focus of future research.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

QZ: conceptualization, methodology, formal analysis, writing—original draft, writing—review and editing, and visualization. YY: resources and writing—review and editing. FK: supervision and project administration. SC: methodology and validation. LY: supervision, writing—review and editing, and funding acquisition.

Funding

This work was supported by the Natural Science Foundation of Guangxi (2020GXNSFAA297173 and 2020GXNSFDA238017), the Natural Science Foundation of China (51767003), the Thousands of Young and Middle-Aged Backbone Teachers Training Program for Guangxi Higher Education [Education Department of Guangxi (2017)], and the Innovation Project of Guangxi Graduate Education (YCSW2022050).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor DZ declared a shared affiliation with the authors at the time of the review.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abd El-Ghany, H. A., Elgebaly, A. E., and Taha, I. B. M. (2021). A new monitoring technique for fault detection and classification in PV systems based on rate of change of voltage-current trajectory. Int. J. Electr. Power and Energy Syst. 133, 107248. doi:10.1016/j.ijepes.2021.107248


Asensio, M., and Contreras, J. (2015). Stochastic unit commitment in isolated systems with renewable penetration under CVaR assessment. IEEE Trans. Smart Grid 7 (3), 1356–1367. doi:10.1109/tsg.2015.2469134


Azizi, E., Beheshti, M. T. H., and Bolouki, S. (2021). Event matching classification method for non-intrusive load monitoring. Sustainability 13 (2), 693. doi:10.3390/su13020693


Chea, R., Thourn, K., and Chhorn, S. “Improving VI trajectory load signature in NILM approach,” in Proceedings of the 2022 International Electrical Engineering Congress (iEECON), Khon Kaen, Thailand, March 2022, 1–4.


Cheramin, M., Cheng, J., Jiang, R., and Pan, K. (2022). Computationally efficient approximations for distributionally robust optimization under moment and Wasserstein ambiguity. INFORMS J. Comput. 34 (3), 1768–1794. doi:10.1287/ijoc.2021.1123


De Baets, L., Dhaene, T., Deschrijver, D., Develder, C., Berges, M., et al. (2018b). “VI-based appliance classification using aggregated power consumption data,” in Proceedings of the 2018 IEEE international conference on smart computing (SMARTCOMP), Taormina, Italy, June 2018 (IEEE), 179–186. doi:10.1109/SMARTCOMP.2018.00089


De Baets, L., Ruyssinck, J., Develder, C., Dhaene, T., and Deschrijver, D. (2018a). Appliance classification using VI trajectories and convolutional neural networks. Energy Build. 158, 32–36. doi:10.1016/j.enbuild.2017.09.087


Defourny, B. (2010). “Machine learning solution methods for multistage stochastic programming,” PhD diss. (Belgium, Europe: University of Liege). https://www.lehigh.edu/defourny/PhDthesis_B_Defourny.pdf.


Delage, E., and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Res. 58 (3), 595–612. doi:10.1287/opre.1090.0741


Dokmanic, I., Parhizkar, R., Ranieri, J., and Vetterli, M. (2015). Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Mag. 32 (6), 12–30. doi:10.1109/msp.2015.2398954


Du, L., He, D., Harley, R. G., and Habetler, T. G. (2015). Electric load classification by binary voltage–current trajectory mapping. IEEE Trans. Smart Grid 7 (1), 358–365. doi:10.1109/tsg.2015.2442225


Duan, C., Fang, W., Jiang, L., Yao, L., and Liu, J. (2018). Distributionally robust chance-constrained approximate AC-OPF with Wasserstein metric. IEEE Trans. Power Syst. 33 (5), 4924–4936. doi:10.1109/tpwrs.2018.2807623


Everett, H. (1963). Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Operations Res. 11 (3), 399–417. doi:10.1287/opre.11.3.399


Gao, J., Giri, S., Kara, E. C., and Bergés, M. “Plaid: A public dataset of high-resoultion electrical appliance measurements for load identification research: Demo abstract,” in Proceedings of the 1st ACM Conference on Embedded Systems for Energy-Efficient Buildings, Memphis, TN, USA, November 2014, 198–199.


Gillis, J. M., Chung, J. A., and Morsi, W. G. (2017). Designing new orthogonal high-order wavelets for nonintrusive load monitoring. IEEE Trans. Industrial Electron. 65 (3), 2578–2589. doi:10.1109/tie.2017.2739701


Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. (2021). Regularisation of neural networks by enforcing Lipschitz continuity. Mach. Learn. 110, 393–416. doi:10.1007/s10994-020-05929-w


Gurbuz, F. B., Bayindir, R., and Vadi, S. “Comprehensive non-intrusive load monitoring process: Device event detection, device feature extraction and device identification using KNN, random forest and decision tree,” in Proceedings of the 2021 10th International Conference on Renewable Energy Research and Application (ICRERA), Istanbul, Turkey, September 2021, 447–452.


Hart, G. W. (1992). Nonintrusive appliance load monitoring. Proc. IEEE 80 (12), 1870–1891. doi:10.1109/5.192069


Hernandez, A. S., Ballado, A. H., and Heredia, A. P. D. “Development of a non-intrusive load monitoring (nilm) with unknown loads using support vector machine,” in Proceedings of the 2021 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Shah Alam, Malaysia, June 2021, 203–207.


Kahl, M., Krause, V., Hackenberg, R., Ul Haq, A., Horn, A., Jacobsen, H. A., et al. (2019). Measurement system and dataset for in-depth analysis of appliance energy consumption in industrial environment. Tm-Technisches Mess. 86 (1), 1–13. doi:10.1515/teme-2018-0038


Liu, Y., Wang, X., and You, W. (2018b). Non-intrusive load monitoring by voltage–current trajectory enabled transfer learning. IEEE Trans. Smart Grid 10 (5), 5609–5619. doi:10.1109/tsg.2018.2888581


Liu, Y., Wang, X., Zhao, L., and Liu, Y. (2018a). Admittance-based load signature construction for non-intrusive appliance load monitoring. Energy Build. 171, 209–219. doi:10.1016/j.enbuild.2018.04.049


Lu, J., Zhao, R., Liu, B., Yu, Z., Zhang, J., and Xu, Z. (2023). An overview of non-intrusive load monitoring based on V-I trajectory signature. Energies 16 (2), 939. doi:10.3390/en16020939


Makonin, S., and Popowich, F. (2015). Nonintrusive load monitoring (NILM) performance evaluation: A unified approach for accuracy reporting. Energy Effic. 8, 809–814. doi:10.1007/s12053-014-9306-2


Nitanda, A. (2014). Stochastic proximal gradient descent with acceleration techniques. Adv. Neural Inf. Process. Syst. 27.


Rahimian, H., and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659 https://arxiv.org/abs/1908.05659.


Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. “Regularization of neural networks using dropconnect,” in Proceedings of the International conference on machine learning, Atlanta, Georgia, USA, June 2013 (PMLR), 1058–1066.


Wang, A. L., Chen, B. X., Wang, C. G., and Hua, D. (2018). Non-intrusive load monitoring algorithm based on features of V–I trajectory. Electr. Power Syst. Res. 157, 134–144. doi:10.1016/j.epsr.2017.12.012


Wei, W., Liu, F., Mei, S., and Hou, Y. (2014). Robust energy and reserve dispatch under variable renewable generation. IEEE Trans. Smart Grid 6 (1), 369–380. doi:10.1109/tsg.2014.2317744


Xie, H., Jiang, M., Zhang, D., Goh, H. H., Ahmad, T., Liu, H., et al. (2023). IntelliSense technology in the new power systems. Renew. Sustain. Energy Rev. 177, 113229. doi:10.1016/j.rser.2023.113229


Yoon, S. H., Kim, S. Y., Park, G. H., Kim, Y. K., Cho, C. H., and Park, B. H. (2018). Multiple power-based building energy management system for efficient management of building energy. Sustain. Cities Soc. 42, 462–470. doi:10.1016/j.scs.2018.08.008


Zhang, D., Zhu, H., Zhang, H., Goh, H. H., Liu, H., and Wu, T. (2021). Multi-objective optimization for smart integrated energy system considering demand responses and dynamic prices. IEEE Trans. Smart Grid 13 (2), 1100–1112. doi:10.1109/tsg.2021.3128547


Zhao, C., and Guan, Y. (2018). Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Res. Lett. 46 (2), 262–267. doi:10.1016/j.orl.2018.01.011


Zheng, Z., Chen, H., and Luo, X. (2018). A supervised event-based non-intrusive load monitoring for non-linear appliances. Sustainability 10 (4), 1001. doi:10.3390/su10041001


Keywords: non-intrusive load monitoring, distributionally robust optimization, Wasserstein metric, convolutional neural network, transfer learning

Citation: Zhang Q, Yan Y, Kong F, Chen S and Yang L (2023) A Wasserstein-based distributionally robust neural network for non-intrusive load monitoring. Front. Energy Res. 11:1171437. doi: 10.3389/fenrg.2023.1171437

Received: 22 February 2023; Accepted: 21 March 2023;
Published: 05 April 2023.

Edited by:

Dongdong Zhang, Guangxi University, China

Reviewed by:

Shanshan Pan, Guangxi University of Science and Technology, China
Chen Zhang, University of Shanghai for Science and Technology, China

Copyright © 2023 Zhang, Yan, Kong, Chen and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yi Yan, ccie@gxu.edu.cn
