Application of data analysis techniques for characterization and estimation in electrical substations

Bustos-Brinez, Oscar A.; Zambrano-Pinto, Alvaro; Rosero Garcia, Javier

doi:10.3389/fenrg.2024.1372347

ORIGINAL RESEARCH article

Front. Energy Res., 27 August 2024

Sec. Smart Grids

Volume 12 - 2024 | https://doi.org/10.3389/fenrg.2024.1372347

This article is part of the Research Topic: Demand Side Management in Microgrids

Oscar A. Bustos-Brinez¹,²

Alvaro Zambrano-Pinto¹

Javier Rosero Garcia¹*

¹EM&D Research Group, Electrical and Electronics Engineering Department, Faculty of Engineering, Universidad Nacional de Colombia, Bogotá, Colombia
²MindLab Research Group, Systems and Industrial Engineering Department, Faculty of Engineering, Universidad Nacional de Colombia, Bogotá, Colombia

With the continued growth of smart grids in electrical systems around the world, large amounts of data are continuously being generated and new opportunities are emerging to use this data in a wide variety of applications. In particular, the analysis of data from distribution systems (such as electrical substations) can lead to improvements in real-time monitoring and load forecasting. This paper presents a methodology for substation data analysis based on the application of a series of data analysis methods aimed at three main objectives: the characterization of demand by identifying different types of consumption, the statistical analysis of the distribution of consumption, and the identification of anomalous behavior. The methodology is tested on a data set of hourly measurements from substations located in various geographical regions of Colombia. The results of this methodology show that the analysis of substations data can effectively detect several common consumption patterns and also isolate anomalous ones, with approximately 4% of the substations being identified as outliers. Therefore, the proposed methodology could be a useful tool for decision-making processes of electricity distributors.

Introduction

The incorporation of Smart Grids into electrical networks allows a wide variety of innovations in their management, both in terms of grid infrastructure and information processing, with the primary goal of ensuring a more reliable and efficient supply of electricity to end users while lowering costs and potential risks to operators (Dileep, 2020). The infrastructure that supports Smart Grids, known as Advanced Metering Infrastructure (AMI), includes on-site metering devices (located at transmission lines, distribution nodes, and end users), communication networks to connect such devices, and servers that store the data that is being continuously generated. While the analysis of these amounts of data poses significant challenges in terms of computing power and economic investment, the insights obtained from such a process can be used in multiple ways to improve the overall operation of the network (Chakraborty and Sharma, 2016). Various applications have been developed based on data generated by Smart Grids (Bhattarai et al., 2019), including, among many others, real-time optimization of power distribution from generators (Paul et al., 2023), reduction of prices through tariff schemes adapted to consumption (Aurangzeb et al., 2021), assessments of the integration of renewable energy sources given their inherent variability (Mostafa et al., 2022; Paul, 2022), and “demand response”, a mechanism designed to increase the stability of the network by changing consumption at peak times through strategies such as user incentives or automated monitoring (Siano, 2014; Siddiquee et al., 2021).

Integrating Smart Grids into existing power networks is a complex and expensive process that faces significant and varied challenges in both developed and developing countries. In particular, in the case of Colombia, the growth of Smart Grids has been accompanied by a notable increase in the country’s overall electricity demand and a boost to the diversification of Colombian energy sources, mainly hydroelectric and wind power (Colmenares Quintero et al., 2022). As a result, intelligent management of energy demand and distribution has become a priority for utilities and government agencies responsible for overseeing nationwide and regional distribution and regulating the Colombian energy market (Giral Ramírez et al., 2017; Téllez Gutiérrez et al., 2018). The adoption of AMI systems in the Colombian power grid has been gradually reaching different levels of the network, including end users and power distribution substations that serve limited areas such as small towns or neighborhoods of a large city (Garcia-Guarin et al., 2019). However, the challenging geographical conditions of Colombia, a highly mountainous country with a wide variety of climates, have limited the development of reliable communication networks and the integration of small local grids (Echeverri Martínez et al., 2020), which constrains the expansion of Smart Grids throughout the country (Molina C et al., 2019).

In scenarios such as the Colombian power network, where Smart Grids are still expanding and have relatively low capabilities, grid operators and other stakeholders are looking for fast and undemanding ways to process the data generated by the network and obtain meaningful information. Therefore, this paper proposes a methodology focused on the analysis of data from electrical substations, so that its results are centered on geographic areas rather than individual users and are therefore more relevant to regional electrical distribution. The methodology comprises three stages of data processing: dimensional reduction, consumption characterization through clustering, and statistical analysis through density estimation. The results of these three processes (each involving two different methods) include the segmentation of different substation consumption behaviors and the identification of the most common and rarest consumption profiles, that is, the detection of rare or anomalous behaviors. Our proposal is tested by using a series of three data sets provided by three Colombian grid operators, which contain hourly active power measurements made by AMI devices located at 394 electrical substations, covering a period between 2019 and mid-2022. Our methodology is a lightweight, easy-to-implement alternative, suitable for small grid operators; we show that it can quickly identify the most frequent behaviors in daily electrical consumption at substations, and also isolate unexpected or infrequent patterns. The main contributions of this work can be described as:

I. The formulation of a comprehensive methodology for the analysis of electricity consumption measurements in substations. This methodology is composed of data preprocessing, dimensional reduction analysis, segmentation analysis and density estimation analysis. For each of these analyses, two different methods are applied in order to guarantee the robustness of the results.

II. The application of the proposed methodology to three data sets made up of consumption measurements in electrical substations in different regions of Colombia, which shows that it is capable of finding common and anomalous behaviors in multiple ways. Since the methodology is composed of different data analysis methods, the results of each are presented in the form of plots and compared using performance metrics.

III. A comparison of the main results obtained for the data sets, highlighting differences and similarities between the three scenarios, and establishing the main advantages of the proposed methodology, together with some possible areas for improvement.

The structure of this paper is as follows: Section 2 provides an overview of related work on worldwide cases of Smart Grids and AMI implementation, as well as a literature review of the most commonly used techniques for analyzing data generated by Smart Grids. Section 3 gives a view of the characteristics of the data sets and presents the framework in which the selected data analysis methods are applied, establishing the order in which they are applied. The results of this process on the data from the three grid operators are presented in Section 4, and the conclusions of the work are presented in Section 5.

Background and related work

Smart Grids overview

Classic power grids, originally designed to distribute power from a few generating hubs to a large number of end users, are currently in dire need of change. The pressure to improve the power grid system can be traced, among other issues, to its inefficiency and environmental footprint, a notable increase in electricity demand in recent years, and the growing importance of less reliable energy sources like renewables (Muench et al., 2014). Increasingly sudden fluctuations in energy supply and demand require efficient and rapid control of power distribution to maintain acceptable levels of quality and reliability. Smart Grids promise to address these challenges, enabling precise and efficient control of large areas of the grid (Berger and Iniewski, 2012), addressing peak demand and other load issues (Bhattarai et al., 2019), allowing a precise management of renewable energy sources (Paul, 2022; Li et al., 2020; Saxena et al., 2021), and giving greater flexibility to address the rising demand of electric mobility, such as electric vehicles and ships (Ismail et al., 2023; Kumar and Panda, 2023).

Smart Grids and AMI have been implemented over the last decade in different regions of the world with varying degrees of success. An interesting example of Smart Grids development was presented as part of the implementation of a smart cities scheme in Sydney, Australia between 2009 and 2014. This process was relatively successful, but was also held back by high costs, regulatory issues and poor government leadership (Lovell, 2020). At the national level, although there have been serious investments in smart metering and renewable sources, other issues have emerged, including low levels of grid integration and communication problems in remote areas (Haidar et al., 2015). A more optimistic case is China, where the government’s push for energy efficiency has allowed an accelerated development of smart grid implementation in large areas, albeit with poorly defined horizons and an outdated, fossil-fuel based network that is not well suited to the requirements of Smart Grids (Yu et al., 2012). In the case of Europe, the regulatory frameworks of the European Union have promoted a series of programs that seek standardization among operators in different countries. The geographic and economic particularities of each region make it difficult to draw general conclusions (Fotis et al., 2022), but the most successful projects have been developed following the smart cities paradigm, integrating Smart Grids with transport and water management in large and mid-sized cities across Europe (Farmanbar et al., 2019).

Regarding the implementation of Smart Grids in developing countries, two paradigmatic cases are those of India and Brazil. In the first case, the obsolescence of the country’s electricity grid and the reluctance of consumers to bear the high costs of AMI meters have been progressively addressed through the development of a clear regulatory framework and a strong collaboration between the Indian government and industry organizations (Kappagantu and Daniel, 2018; Asaad et al., 2021). In the second case, Brazil has an electricity grid based on renewable sources, and regulators are the main drivers for the implementation of smart grids in the country to manage the grid efficiently and detect energy losses and illegal connections. The vast and challenging geography, the lack of strong investment in modernization and the technological lag are cited as the main challenges (Di Santo et al., 2015).

Among the challenges that are often common in these cases, it is important to recall those related to leveraging the data obtained as a result of Smart Grid deployment. Although these data have the potential to provide valuable insights for network operators, their exploitation on a large scale is generally difficult and presents several important issues (Mohamed et al., 2019). Data is generated continuously and in large volumes, quickly overwhelming the capabilities of the information systems of the operators and preventing effective analysis; in addition, it is often difficult to integrate data from different operators and from multiple local grids, which hinders the formulation of nationwide conclusions (Bhattarai et al., 2019; Tu et al., 2017). This represents a long-term loss of value, both for companies that could better understand the consumption patterns of their users, and for government agencies interested in formulating more efficient energy distribution policies (Moreno Escobar et al., 2021).

Data analysis methods on Smart Grids

With the development and growth of Smart Grids, processing the data they generate has become one of the main sources of information for electric grid managers. The results of data analysis can be applied to problems such as demand response, identification of profiles or prediction of consumption or long-term costs, among others (Bustos-Brinez et al., 2023). The data, however, are generated in large volumes and are increasingly complex, so analyses usually start with a pre-processing stage that includes data downsizing (Kotsiopoulos et al., 2021). In general, dimensional reduction makes it possible to obtain results with greater efficiency and improve visualization, at the cost of a small loss of information. One of the most commonly used techniques for this purpose is Principal Component Analysis (PCA), a method that constructs linear combinations of the existing features that minimize the loss of information, measured by variance (Salem and Hussein, 2019). In the electricity sector, this technique and its variations have been used as part of analysis schemes aimed at managing demand response (Kafash Farkhad and Akbari Foroud, 2023) or detecting IT security breaches in the data generated by Smart Grids (Acosta et al., 2020).

Once data reduction has been performed, there are a large number of applications in which different combinations of methods are used for various purposes. Some of these applications focus, for example, on the identification of load profiles. In this area, the preferred methods are clustering techniques, which aim to segment the data into a series of groups (called “clusters”) such that the data in each group are similar to each other and very different from those in other groups (Si et al., 2021). The most well-known clustering algorithm is K-Means, a distance-based method that constructs a previously defined number of clusters in a way that minimizes their within-cluster variances by centering each cluster around a central point known as the “centroid”. The predefined number of clusters (denoted as $k$) is the basic parameter of the method. An extensive list of applications of this method within Smart Grids is presented in (Miraftabzadeh et al., 2023), highlighting its uses to identify multiple load profiles.
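For reference, the variance-minimizing criterion described above is usually written as the following objective. The symbols $x_i$ (a data point), $C_j$ (the $j$-th cluster) and $\mu_j$ (its centroid) are notation introduced here for illustration and do not come from the original text:

$$
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
$$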

Another commonly used clustering method is Density-based Spatial Clustering of Applications with Noise (DBSCAN), which allows the construction of clusters of highly variable sizes and flags rare or anomalous values that do not belong to any group. The method relies on the definition of dense areas through the inspection of the neighborhoods of data points; this depends on two parameters, the size of the considered neighborhood (determined by a parameter called eps) and the minimum number of points in a dense area (denoted as min_samples). Data points in dense areas tend to belong to the same cluster as their neighbors, and data points outside of them are regarded as noise or outliers. Some representative examples of the use of DBSCAN in Smart Grids are shown in (Yang et al., 2018), where a wide variety of consumption profiles are identified for price prediction purposes, and in (Ravinder and Kulkarni, 2023), where the method is used to detect possible intrusions in the network that connects radio sensors.

There are many other types of data analysis methods that are used in different applications of Smart Grids. In the area of load forecasting, dimensional reduction can be accompanied by regression models (Mukherjee et al., 2021) or classification models such as Support Vector Machines (Ayub et al., 2020). The analysis of the best physical location of devices storing Smart Grid data can be performed with optimization models on graphs (Gallardo et al., 2021). Detection of cybersecurity weaknesses or data injection attacks can be addressed by mechanisms such as neural networks and deep learning (Vimalkumar and Radhika, 2017; Mukherjee et al., 2022), and other unwanted network intrusions, such as power theft, can be addressed through the combination of clustering methods like DBSCAN with density estimation methods like Gaussian Mixtures (Zheng et al., 2017). The latter method is based on the assumption that the data come from a series of normal distributions that may or may not be correlated, and whose parameters are found by the model. The main parameter to be fixed in advance is the number of Gaussians that make up the distribution. A similar Gaussian Mixture model is also used in the area of electric mobility for the identification of load profiles and flexibility analysis, making an analogy between the different Gaussians and the groups obtained by clustering models (Märtz et al., 2022).
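As a brief illustration of the Gaussian Mixture assumption, the modeled density is commonly expressed as a weighted sum of normal densities. The notation below ($m$ components with weights $\pi_j$, means $\mu_j$ and covariance matrices $\Sigma_j$) is introduced here for illustration only:

$$
p(x) = \sum_{j=1}^{m} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),
\qquad
\pi_j \ge 0, \quad \sum_{j=1}^{m} \pi_j = 1 .
$$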

Finally, some models are used in the area of renewable energies, including the identification of energy generation profiles and their contrast with consumption profiles (Miguel et al., 2016) or the analysis of the distribution of solar energy generation in different geographical areas using density estimation (Bouhorma et al., 2023). In the latter case, where the density has multiple modes that are difficult to treat analytically, the algorithm chosen is Kernel Density Estimation (KDE), which constructs a non-parametric distribution from the sum of the contributions of each data point, measured through a transformation function called a kernel. The distributions obtained with this method, although they do not have a closed analytical form, are capable of modeling a wide variety of complex scenarios (Hu et al., 2021).
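The sum-of-contributions idea behind KDE can be written compactly. The following is the standard form of the estimator with a Gaussian kernel; $n$ (number of data points), $d$ (dimension), $h$ (bandwidth) and $x_i$ (data points) are symbols introduced here for illustration:

$$
\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i),
\qquad
K_h(u) = \frac{1}{(2\pi h^2)^{d/2}} \exp\!\left( -\frac{\lVert u \rVert^2}{2h^2} \right).
$$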

Methodology

A graphical summary of the stages of the proposed methodology and the models included in each stage is presented in Figure 1. By sequentially applying these analysis methods, a series of approximations to the cluster segmentation and probabilistic distribution of the data are constructed. These results are combined to create a robust model of substation consumption that takes into account the different types of behaviors that can occur and separates them into different groups, and also captures the general distribution of the data to point out the most common and most anomalous behaviors. Next, we present a detailed description of the steps performed at each stage.

Figure 1. Structure of the proposed methodology. It consists of four stages: data pre-processing, dimensional reduction, clustering analysis and density estimation. The main results obtained through its application to electrical substation data are also presented.

Data preprocessing

The expected input to the methodology is a set of AMI measurement records containing at a minimum information on the substation where the measurement is taken, the date and time of the measurement, and the value of the measurement. Measurements should be taken every hour continuously, so that substations have records associated with each of the 24 h of the day. Under these conditions, a substation is discarded for further analysis if it has missing or null measurements. For substations with complete measurements, their associated records are preprocessed according to the scheme proposed in (Bustos-Brinez et al., 2023), with the aim of summarizing the consumption of the substations in average load profiles. For each substation considered, all its associated records are isolated and then divided into 24 groups, each one corresponding to the hour of the day (from 0 to 23) in which the measurement was taken. The average values of these 24 groups are obtained and then collected in a load profile corresponding to a vector of dimension 24, where the first value corresponds to the average of the measurements of hour 0, the second value to the average of the measurements of hour 1, and so on until hour 23. In this way, each substation ends up being represented by a load curve made from the averages of its records for each hour of the day.
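The averaging step described above can be sketched with pandas as follows. This is a minimal sketch, not the authors' implementation: the column names `substation`, `hour` and `value`, and the function name `build_load_profiles`, are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def build_load_profiles(records: pd.DataFrame) -> pd.DataFrame:
    """Return one 24-value average load profile per substation.

    Assumes hypothetical columns: 'substation', 'hour' (0-23) and 'value'.
    """
    # Discard every substation that has at least one null measurement.
    bad = records.loc[records["value"].isna(), "substation"].unique()
    complete = records[~records["substation"].isin(bad)]
    # Average the measurements taken at each hour of the day, per substation.
    profiles = (
        complete.groupby(["substation", "hour"])["value"].mean().unstack("hour")
    )
    # Keep only substations that have a value for each of the 24 hours.
    return profiles.reindex(columns=range(24)).dropna()
```

Each row of the returned frame is the 24-dimensional load curve of one substation, with column 0 holding the hour-0 average and column 23 the hour-23 average.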

Dimensional reduction

Once the average load curves have been constructed, each substation is represented by 24 values that depict its average consumption behavior throughout the day. However, not all of these values carry the same amount of information, and some of them can be redundant. Therefore, in order to maximize the efficiency of subsequent analyses (both in terms of processing time and use of computational resources), it is important to establish how many values are sufficient to analyze the consumption behavior with a small loss of information. Two approaches are chosen for this purpose, considering the examples given in (Duarte et al., 2022) (where dimensional reduction is also presented as a powerful tool for graphical representation of high-dimensional data). The first approach involves a MinMax scaling, which transforms the values in the profiles to the range $[0, 1]$, followed by the application of a principal component analysis (PCA) that reduces the dimension of each profile from 24 to just two. The scaling is intended to remove information about the magnitude of consumption, allowing two substations with similar consumption patterns but with different magnitudes to have similar representations. The second approach also reduces the profiles from 24 dimensions to two, by using two summary statistics, the mean and the standard deviation of the 24 values; this discards information about rising or falling patterns along the day to focus on the consumption magnitude and the general variation it presents.
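Both reductions can be sketched in a few lines of NumPy and Scikit-Learn. The text does not state whether the MinMax scaling is applied per profile or per hour; the sketch below assumes per-profile (row-wise) scaling, since that matches the stated goal of removing magnitude information. The function names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def scale_profiles(profiles: np.ndarray) -> np.ndarray:
    """MinMax-scale each profile (row) to [0, 1]; flat profiles map to 0."""
    lo = profiles.min(axis=1, keepdims=True)
    span = profiles.max(axis=1, keepdims=True) - lo
    # Avoid division by zero for profiles with constant consumption.
    return (profiles - lo) / np.where(span == 0, 1.0, span)


def reduce_pca(profiles: np.ndarray) -> np.ndarray:
    """First approach: row-wise scaling, then PCA down to two components."""
    return PCA(n_components=2).fit_transform(scale_profiles(profiles))


def reduce_mean_std(profiles: np.ndarray) -> np.ndarray:
    """Second approach: (mean, standard deviation) of the 24 hourly values."""
    return np.column_stack([profiles.mean(axis=1), profiles.std(axis=1)])
```

With row-wise scaling, a profile and any positive multiple of it map to the same scaled curve, which is the property the first approach relies on.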

Profile characterization

Since two dimensional reduction approaches are applied, each generating an alternative representation for every substation, a separate analysis is carried out for each of them. The two-dimensional representations are used to identify and isolate different electricity consumption behaviors, in a similar fashion to market segmentation. In particular, it is desired to find behaviors that can be associated with different types of end users, distinguishing between Residential, Commercial and Industrial load profiles. In (Di Santo et al., 2015), these are identified as follows: Residential users tend to show low consumption in the early morning and peaks in the afternoon or evening, Commercial users have high consumption in the afternoon and lower consumption in the morning and evening, and Industrial users show a more uniform consumption throughout the day. Figure 2 shows some of the expected patterns for each user type, representing profiles as 24-h plots.

Figure 2. 24-h plots of some behaviors associated with different types of end users. In general, Residential users show low consumption in the morning and spikes in the afternoon or evening, Commercial users show more consumption in the afternoon and less in the night, and Industrial users tend to have stable consumption throughout the day. The plots have been generated using data from (Bustos-Brinez et al., 2023).

In this stage, two different methods are selected to perform the segmentation of profiles into clusters: DBSCAN and K-Means. These methods depend on a set of hyperparameters that strongly influence the quality of the results. Most of these parameters are set to default values (suggested by the Scikit-Learn Python implementation), leaving only some to be optimized by a grid search process. For DBSCAN, the selected hyperparameters are eps (searched between 0.10 and 0.25 with steps of 0.01) and min_samples (searched from 2 to 5). In general, small values of eps lead to the formation of a larger number of smaller clusters. For K-Means, the main hyperparameter is k, the number of clusters, searched from 3 to 10.
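The grid search over these ranges could look like the sketch below. The text does not name the criterion used to rank parameter combinations, so the silhouette score is used here purely as an assumption; the function names `search_dbscan` and `search_kmeans` are also hypothetical. `points` is expected to be an array of shape `(n_substations, 2)`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score


def search_dbscan(points):
    """Grid search over eps and min_samples; returns (score, eps, min_samples)."""
    best = None
    for eps in np.arange(0.10, 0.2501, 0.01):
        for min_samples in range(2, 6):
            labels = DBSCAN(eps=float(eps), min_samples=min_samples).fit_predict(points)
            kept = labels != -1  # points labeled -1 are noise/outliers
            n_clusters = len(set(labels[kept]))
            if n_clusters < 2 or kept.sum() <= n_clusters:
                continue  # the silhouette score is undefined for this labeling
            score = silhouette_score(points[kept], labels[kept])
            if best is None or score > best[0]:
                best = (score, round(float(eps), 2), min_samples)
    return best


def search_kmeans(points):
    """Grid search over k from 3 to 10; returns (score, k)."""
    best = None
    for k in range(3, 11):
        if k >= len(points):
            break  # the silhouette score needs fewer clusters than points
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        score = silhouette_score(points, labels)
        if best is None or score > best[0]:
            best = (score, k)
    return best
```

Scoring DBSCAN only on non-noise points is a design choice of this sketch; other criteria (for example, penalizing the number of outliers) would be equally defensible.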

Consumption distribution

In this last stage, the goal is to build a statistical model of the data that helps to identify the most common behaviors exhibited by the substations and supports density estimation and other statistical tests. This statistical model is set up to emulate a density function for the data points, that is, to have higher values in regions where data points appear densely packed and lower values in regions where data points are scarce. Since data points are represented as points in a plane, the density model can also be represented in a plane as a contour plot. The construction of this density model is done twice, using two different methods that are commonly used for this task: Gaussian Mixture and KDE. Although other, more powerful methods can be used, we select these two methods because of their ease of implementation (both are available as part of the Scikit-Learn Python library) and their interpretability (for Gaussian Mixture, high-density regions are associated with a series of bivariate Gaussian distributions, and for KDE, the density of an area is made up of the weighted contributions of all nearby data points, resulting in higher densities where points lie in higher numbers). Similar to the previous stage, the two models are run separately, and a few hyperparameters are optimized by grid search. For Gaussian Mixture, the selected hyperparameter is the number of components (that is, the different Gaussian distributions that compose the overall model), searched between 3 and 8. For KDE, with a fixed Gaussian kernel function, the selected hyperparameter is the bandwidth, a value that controls how large an area the contribution of a data point influences; the value of the bandwidth was searched between 0.10 and 0.50 with steps of 0.05 for PCA-based points, and between 0.10 and 0.30 with steps of 0.02 for mean-variance points.
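A minimal sketch of the two density models with Scikit-Learn is shown below. The selection criteria are assumptions, since the text does not specify them: the Gaussian Mixture is chosen by lowest BIC and the KDE bandwidth by cross-validated log-likelihood. The function names are hypothetical, and the default bandwidth grid corresponds to the PCA-based points.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def fit_gaussian_mixture(points, n_range=range(3, 9)):
    """Fit mixtures with 3 to 8 components; keep the lowest-BIC model (assumed criterion)."""
    models = [
        GaussianMixture(n_components=n, random_state=0).fit(points) for n in n_range
    ]
    return min(models, key=lambda m: m.bic(points))


def fit_kde(points, bandwidths=np.arange(0.10, 0.5001, 0.05)):
    """Fit a Gaussian-kernel KDE; pick the bandwidth by 5-fold log-likelihood (assumed criterion)."""
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"), {"bandwidth": bandwidths}, cv=5
    )
    return search.fit(points).best_estimator_
```

Both fitted models expose `score_samples(points)`, which returns the log-density at each point and can be evaluated on a regular grid to draw the contour plots.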

One application of these models that is explored is the identification of the most infrequent data points (anomalies), under the assumption that these appear in low-density regions, and the rarer a data point is, the lower its density value is. These anomalies, due to their rarity, could indicate failures in energy distribution, errors/vulnerabilities in data collection or fraudulent consumption. To identify which points are anomalous and which are not, it is necessary to identify a boundary value, from which a separation between regions of high density and regions of low density can be established. This value usually depends on the number of anomalies assumed to be present in the data, or on a pre-specified percentage of anomalies; in this case, we look for thresholds that leave out a number of points similar to that identified by the segmentation methods. The values taken by the selected threshold in each scenario depend on the values of the contour lines in the density functions built by each model.
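One simple way to turn a target number of anomalies into a density threshold is sketched below. It assumes the log-density of each point is already available (for example, from a fitted density model) and that the expected number of anomalies is supplied by the caller; the function name `flag_anomalies` is hypothetical.

```python
import numpy as np


def flag_anomalies(log_density, n_anomalies):
    """Flag the n_anomalies points with the lowest estimated density.

    Returns a boolean mask and the threshold used. Ties at the threshold
    are all flagged, so the mask may contain more than n_anomalies points.
    """
    log_density = np.asarray(log_density, dtype=float)
    if n_anomalies <= 0:
        return np.zeros(log_density.shape, dtype=bool), -np.inf
    # The threshold is the n-th smallest log-density value.
    threshold = np.sort(log_density)[n_anomalies - 1]
    return log_density <= threshold, threshold
```

The returned threshold corresponds to the level curve that separates the low-density region from the rest when the density is drawn as a contour plot.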

Results and analysis

Datasets

The proposed methodology has been tested against a group of four data sets provided by three operators of the Colombian power grid, located in different regions of the country. In total, the four data sets contain active energy measurements for 394 substations, and the number of substations in each data set can be seen in Table 1. In this work, only the records corresponding to the year 2021 will be taken into account, since each data set covers a different time period. All records in each data set share a common structure, containing an alphanumeric identifier of the substation assigned by the respective operator, the date of the measurement separated into year, month and day, the time of the measurement (since only one measurement is taken per hour) and the value of the respective measurement, which can be an integer or a float value depending on the operator.

Table 1. Number of substations whose measurements are contained on each dataset.

The proposed methodology was implemented separately for each of the network operators; in this way it is possible to observe how the results change depending on whether there is a large or small amount of data. This analysis is possible because there is much more information available for one of the operators than for the other two. Since two different methods are applied at each stage, the outputs of each are shown for comparison.

Operator A

This grid operator delivered data from 16 grid substations, and its substations are located in the central region of Colombia.

MinMax Scaling and PCA

The first mechanism of dimensional reduction consists of a MinMax scaling followed by the application of PCA. Figure 3 presents a summary of the results of the different methods applied to the data of Operator A, when starting with this method in the dimensional reduction phase. From these data points, the characterization stage is performed, using the two chosen clustering techniques. For DBSCAN, the selected parameters are eps = 0.2 and min_samples = 2. The results of the method are in Figure 3, second level from top to bottom. The blue dot labeled “-1” could not be attached to any cluster, so it is separated as an outlier. The curves obtained by averaging all the points within each cluster are also presented. From the cluster graphs, it is possible to clearly separate the consumption behaviors in each cluster: the red cluster shows Residential behavior, the green cluster shows a more Commercial behavior, and the yellow cluster shows a uniform, more Industrial behavior. The second clustering technique used for data analysis is K-Means. After a test with several values of $k$, it is decided to use the value $k = 4$. The result of the method is presented in Figure 3, third level from top to bottom. The clusters obtained with K-Means roughly match those obtained with DBSCAN: the green and yellow clusters are retained, while the larger cluster is split into two halves of similar size. The point that DBSCAN could not join to a cluster is again isolated, this time in its own cluster.

Figure 3. Results of the application of the methodology (with dimensional reduction by Scaling and PCA) on the data of Operator A.

In the final stage of the methodology, two different models for density estimation are applied to the data, which allow the identification of anomalous points. The first model is Gaussian Mixture; given the previous results of the clustering methods, it is decided to use three Gaussians. The last level in Figure 3 on the left shows the contour lines of the distribution constructed by the method, where warmer colors represent higher density. The three Gaussian distributions can be distinguished, although two of them overlap. The second density estimation model applied is KDE. The last level in Figure 3 on the right shows the distribution contour lines. The obtained approximation is mostly dominated by the Residential points. The red lines in the two plots represent the level curve corresponding to the separation threshold. In the case of Gaussian Mixture, only one anomalous point is left out, precisely the point that the clustering methods isolated. As for KDE, the separation threshold leaves out three points: the point isolated by the clustering methods, a Commercial type point and an Industrial type point.

Mean and Variance

The second dimensional reduction mechanism uses two main trend measures: the mean and the variance, which are more correlated with each other than the components obtained by PCA. Figure 4 shows the data points in a two-dimensional space where the mean and the standard deviation (used instead of the variance to keep the units the same) are the X and Y axes, respectively. With this new representation of the data, the characterization stage is carried out using both DBSCAN (with the same parameters $eps = 0.2$ and $min\_samples = 2$) and K-Means (which looks for $k = 4$ clusters). The results of both methods are presented in the second and third levels of Figure 4. In this case, DBSCAN set three points aside as outliers and formed three clusters whose main difference is their magnitude. With K-Means, the cluster curves change slightly, since they include the points separated by DBSCAN; the clusters are still distinguished by consumption (the pink cluster for the lowest consumptions and the purple cluster for the highest), but the intermediate consumptions are separated into two groups, one with low mean and high variance (in turquoise) and the other with high mean and low variance (in orange). Although it is somewhat difficult to see in the lower consumption curves, all the clusters have a similar Residential-type load profile.

Figure 4. Results of the application of the methodology (with dimensional reduction by Mean and Variance) on the data of Operator A.
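The mean and standard deviation representation described above amounts to a per-substation reduction of each load series to two numbers. A minimal NumPy sketch, assuming a matrix with one row per substation and one column per hourly measurement (the function name `mean_std_features` and the toy values are hypothetical):

```python
import numpy as np


def mean_std_features(load_matrix):
    """Map each substation's load series to (mean, standard deviation)."""
    load_matrix = np.asarray(load_matrix, dtype=float)
    # Population standard deviation (ddof=0); it keeps the same units
    # as the mean, unlike the variance.
    return np.column_stack([load_matrix.mean(axis=1), load_matrix.std(axis=1)])


# Hypothetical hourly loads (rows: substations, columns: hours).
loads = np.array([
    [2.0, 2.0, 2.0, 2.0],     # flat profile: no spread
    [1.0, 3.0, 1.0, 3.0],     # same mean, larger spread
    [10.0, 10.0, 12.0, 8.0],  # higher magnitude
])

features = mean_std_features(loads)
print(features)
```

The resulting two-column array can be passed to the same clustering and density estimation methods as the PCA output. Whether the features are rescaled before clustering is not stated in this section and is left out of the sketch.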

Finally, the density estimation models are applied to this alternative representation of the data. The results are shown in the last level of Figure 4. For the Gaussian Mixture model, three Gaussians were again used, which are clearly distinguishable and roughly correspond to the three clusters found by DBSCAN. Only one data point falls outside the separation threshold, and it is one of the three previously set aside by DBSCAN. As for KDE, the approximation obtained effectively separates data with high mean and variance values from data with lower means and variances. The only point it detects as anomalous is the same one found with Gaussian Mixture. This point has a high mean and a very low variance, which could indicate that it corresponds to an Industrial-type point, with high and constant energy consumption.

Results Comparison

For this operator’s data, the first dimensional reduction alternative (PCA) favors the distinction between the different types of consumption. Residential, Commercial and Industrial behaviors are represented by well-defined clusters, with Residential forming the majority group. The second dimensional reduction alternative (Mean and Variance) yields a characterization much more focused on the magnitude of consumption, in which the clustering methods agree in separating the clusters into high, medium and low consumption. A relationship can be established between the two results, as presented in Figure 5, where the clusters obtained in the first analysis (with DBSCAN) are plotted on the points obtained in the second analysis. The more Industrial and Commercial substations show less variance for their mean (they lie further to the right within their magnitude clusters), while the Residential ones show more variance (further to the left within their magnitude clusters). This relationship suggests a strong correlation between the trend measures of a substation and its behavior, so it would be sufficient for the operator to obtain the mean and variance of a substation to approximately categorize its behavior.

Figure 5. Comparison of results between the two analyses carried out on the data of Operator A. The yellow dots correspond to the Residential cluster, the green ones to the Commercial cluster and the blue ones to the Industrial cluster. The outlier point is shown in purple.
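The comparison shown in Figure 5 is graphical, but the same relationship between the two characterizations can also be summarized numerically by counting how many substations fall into each pair of labels. The sketch below is an illustrative addition rather than part of the paper's methodology: the function `contingency_table` and the example labels are hypothetical.

```python
import numpy as np


def contingency_table(labels_a, labels_b):
    """Count how often each label of one analysis meets each label of another."""
    a_values, a_index = np.unique(np.asarray(labels_a), return_inverse=True)
    b_values, b_index = np.unique(np.asarray(labels_b), return_inverse=True)
    table = np.zeros((a_values.size, b_values.size), dtype=int)
    # Accumulate one count per substation at its (row, column) position.
    np.add.at(table, (a_index, b_index), 1)
    return a_values, b_values, table


# Hypothetical labels for six substations: behavior from the first
# analysis and magnitude group from the second.
behavior = ["Residential", "Residential", "Commercial",
            "Industrial", "Residential", "Commercial"]
magnitude = ["low", "medium", "medium", "high", "low", "high"]

rows, cols, table = contingency_table(behavior, magnitude)
print(rows)   # row labels, sorted alphabetically
print(cols)   # column labels, sorted alphabetically
print(table)
```

A table concentrated in a few cells would indicate that the two characterizations largely agree, while a table spread evenly across cells would indicate that they capture different aspects of consumption.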

The density estimation models also show some similarities. The different types of consumption can be approximately modeled by intermixed Gaussian distributions, since both models propose relatively similar distributions in which the contour lines present shapes similar to ellipses. However, the anomalies detected in each case correspond to different substations. When reducing by PCA, the outlier found is characterized by the high number of peaks in its load curve. When reducing by mean and variance, the outlier is detected due to its remarkably low variance for its mean, i.e., a very flat consumption curve. Both of these anomalous substations could be of potential interest to the network operator, as they could indicate unstable service performance or unexpected consumption variations.