Application of Machine Learning Algorithms for Geogenic Radon Potential Mapping in Danyang-Gun, South Korea

Rezaie, Fatemeh; Kim, Sung Won; Alizadeh, Mohsen; Panahi, Mahdi; Kim, Hyesu; Kim, Seonhong; Lee, Jongchun; Lee, Jungsub; Yoo, Juhee; Lee, Saro

doi:10.3389/fenvs.2021.753028

ORIGINAL RESEARCH article

Front. Environ. Sci., 22 September 2021

Sec. Environmental Informatics and Remote Sensing

Volume 9 - 2021 | https://doi.org/10.3389/fenvs.2021.753028

Fatemeh Rezaie¹,²*

Sung Won Kim³

Mohsen Alizadeh⁴

Mahdi Panahi⁵

Hyesu Kim¹,⁶

Seonhong Kim⁷

Jongchun Lee⁷

Jungsub Lee⁷

Juhee Yoo⁷

Saro Lee¹,²*

¹Geoscience Platform Research Division, Korea Institute of Geoscience and Mineral Resources (KIGAM), Daejeon, South Korea
²Department of Geophysical Exploration, Korea University of Science and Technology, Daejeon, South Korea
³Geology Division, Korea Institute of Geoscience and Mineral Resources (KIGAM), Daejeon, South Korea
⁴Faculty of Built Environment and Surveying, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
⁵Division of Smart Regional Innovation, Kangwon National University, Chuncheon-si, South Korea
⁶Department of Astronomy, Space Science and Geology, Chungnam National University, Daejeon, South Korea
⁷Indoor Environment and Noise Research Division, Environmental Infrastructure Research Department, National Institute of Environmental Research, Incheon, South Korea

Continuous generation of radon gas by soil and rocks rich in components of the uranium chain, along with prolonged inhalation of radon progeny in enclosed spaces, can lead to severe respiratory diseases. Detection of radon-prone areas and acquisition of detailed knowledge regarding relationships between indoor radon variations and geogenic factors can facilitate the implementation of more appropriate radon mitigation strategies in high-risk residential zones. In the present study, 10 factors (i.e., lithology; fault density; mean soil calcium oxide [CaO], copper [Cu], lead [Pb], and ferric oxide [Fe₂O₃] concentrations; elevation; slope; valley depth; and the topographic wetness index [TWI]) were selected to map radon potential areas based on measurements of indoor radon levels in 1,452 dwellings. Mapping was performed using three machine learning methods: long short-term memory (LSTM), extreme learning machine (ELM), and random vector functional link (RVFL). The results were validated in terms of the area under the receiver operating characteristic curve (AUROC), root mean square error (RMSE), and standard deviation (StD). The prediction abilities of all models were satisfactory; however, the ELM model had the best performance, with AUROC, RMSE, and StD values of 0.824, 0.209, and 0.207, respectively. Moreover, approximately 40% of the study area was covered by very high and high-risk radon potential zones that mainly included populated areas in Danyang-gun, South Korea. Therefore, the map can be used to establish more appropriate construction regulations in radon-priority areas, and identify more cost-effective remedial actions for existing buildings, thus reducing indoor radon levels and, by extension, radon exposure-associated effects on human health.

1 Introduction

Radon is a radioactive inert gas, and the only gaseous element produced during the radioactive decay of uranium and thorium. Because the earth’s crust is rich in rocks and soil, which contain uranium and thorium, radon of natural origin exists everywhere and can be transferred from underlying soil and rocks to building environments through cracks or holes in foundations. Although people are frequently exposed to naturally occurring radon, continuous inhalation of radon and its daughter species destroys lung tissues through the emission of alpha particles, thus increasing the risk of life-threatening diseases. The International Agency for Research on Cancer and the World Health Organization (WHO) report that radon (and its products) is the second leading cause of lung cancer after tobacco products (WHO, 2009; Cogliano et al., 2011; Yoon et al., 2016). To reduce the preventable risks associated with radon exposure, the recommended radon level in confined spaces has been set at less than 148 Bqm⁻³; each 100-Bqm⁻³ increase is associated with an approximately 16% increase of lung cancer-related mortality (Kim et al., 2018; WHO, 2021). Thus, there is a growing need to reduce radon levels in enclosed spaces, especially in residential areas (Lee et al., 2015).

Investigations into indoor radon are underway in many countries worldwide; various radon guidelines have been published to raise awareness of its dangers (Dubois, 2005). In 2007, the Korean Ministry of the Environment organized a comprehensive plan for measurement of indoor radon levels. Since 2009, indoor radon measurements have been performed to determine the indoor radon concentration (IRC), with the goal of developing methods for mitigating radon exposure. The data led to the establishment of a national radon map (Djamil, 2016). However, the map was based on mean values for individual administrative districts, where obtaining detailed location-based information proved difficult. Furthermore, the mean indoor radon value is higher in South Korea than in European countries; the number of lung cancer-related deaths attributed to indoor radon accumulation is also remarkably higher (Kim et al., 2018). Consequently, there is a need to develop a detailed radon distribution map to identify radon-priority areas and implement effective methods to reduce the risk of radon exposure.

Local geology, meteorological parameters, soil characteristics, residence type, and building materials substantially contribute to the variability in radon levels. Many studies have assessed the relationships of radon levels with geogenic and anthropogenic factors. Martínez et al. (2014) analyzed the spatial distribution of radon with respect to meteorological and geological variables, including atmospheric pressure, temperature, relative humidity, and distance to fault. Relative humidity and temperature were found to have the greatest impact on IRC values. Pásztor et al. (2016) investigated spatial variations in radon levels with respect to various meteorological variables (e.g., mean annual precipitation, temperature, and evaporation), topographical factors (e.g., elevation, aspect, slope, general curvature, topographical position index, and the topographic wetness index [TWI]), geology, land use/land cover, and physical soil properties. Ciotoli et al. (2017) developed a geogenic radon potential map for the Lazio region in Italy. Their analysis revealed relationships of indoor radon levels with rock permeability, local geology, fault density, and elevation. Park et al. (2018) described the influence of environmental variables (i.e., groundwater usage, season, building materials, residence type, number of residential floors, and construction year) on changes in radon accumulation in residential areas. Ivanova et al. (2019) analyzed the spatial variability of radon levels according to geological parameters including geotectonic units, rock type, and distance to fault. They found that igneous and volcanogenic-sedimentary rocks had high radon emanation. The results provided insight into the combined impact of housing and geology on IRC. Park et al. (2019) generated a geogenic radon potential map of South Korea by considering the effects of geology, fault density, subsoil gravel content, and surface soil radium level on IRC values. 
They found that these factors were responsible for 36% of the variability of radon levels in South Korea. Phong Thu et al. (2020) evaluated the effects of soil particle size, moisture content, temperature, and pH on radon emanation. Notably, radon increased with increasing soil moisture content and decreasing soil particle size. Kellenbenz and Shakya (2021) investigated seasonal and annual variations of IRC according to various factors (i.e., house type, floor level, and weather conditions) in Pennsylvania, United States. Their findings showed that geology influenced radon levels. In summary, indoor radon exposure can be explained by interactions among diverse variables; thus, the development of an ideal strategy to identify radon-prone areas is a complex problem. Direct and precise measurements of indoor radon levels must be collected and interpreted by experts; precisely calibrated equipment is also needed. Furthermore, continuous long-term radon monitoring for individual dwellings is not feasible in some instances, and long-term field surveys are needed for close sampling intervals. In the context of insufficient numbers of high-quality indoor radon measurements, mathematical models can be applied to predict high-risk areas.

Geographical information systems, integrated with knowledge- or data-driven methods, are currently regarded as a cost-effective alternative for mapping radon levels. Knowledge-driven methods typically rely on expert judgment to determine the relative importance of the independent variables. Fuzzy logic (Cerqueiro-Pequeño et al., 2020) and multi-criteria decision analysis (Ciotoli et al., 2020; Giustini et al., 2021) are knowledge-driven methods widely used to map radon-prone areas. In contrast, data-driven methods employ mathematical expressions to investigate the associations of an event with various factors using small numbers of samples. These methods can be classified into two main types: statistical and machine learning. The frequency ratio (FR) is the most commonly used bivariate statistical model, and can evaluate probabilistic relationships between variables (Cho et al., 2015; Hwang et al., 2017). Although they have the advantage of simplicity, bivariate and multivariate statistical methods both have limited accuracy because of their inability to extract and model nonlinear relationships among variables (Li et al., 2016). Support vector machines (Petermann et al., 2021), random forest algorithms (Vienneau et al., 2021), multivariate adaptive regression splines (Bossew et al., 2020), bagged neural networks (Timkova et al., 2017), extreme gradient boosting (Rafique et al., 2020), weighted k-nearest neighbor algorithms (Pegoretti and Verdi, 2009), and artificial neural networks (Torkar et al., 2010; Duong et al., 2021) are the most commonly used machine learning methods for predicting radon anomalies. Importantly, geographical information systems allow data from various sources, with different scales, to be combined. Machine learning is a promising alternative to statistical methods; it can be applied to analyze complex data with nonlinear correlations and explore latent interactions among all factors, without any statistical assumptions. 
Moreover, these algorithms can robustly manage noisy and missing data (Al-Fugara et al., 2020). However, the inadequate accuracy of some machine learning methods, for example due to overfitting or convergence to local minima (Liu et al., 2021), has led to the use of deep learning-based algorithms, which may enable more accurate prediction of radon levels in enclosed spaces. Deep learning algorithms can extract the main features from the input and identify complex relationships among interdependent variables when processing large unstructured datasets. Given the complex nonlinear relationships of indoor radon levels with various factors, as well as the strengths and weaknesses of each above-mentioned data-driven approach, selection of an appropriate algorithm with acceptable accuracy can greatly influence the likelihood of detecting high-radon areas.

The main objective of this study was to map radon-prone areas more accurately with the aid of machine learning methods (i.e., long short-term memory [LSTM], extreme learning machine [ELM], and random vector functional link [RVFL]). To our knowledge, this is the first such study conducted in Danyang-gun, South Korea. Additionally, this study aimed to analyze associations of radon risk areas with various geological, topographical, and geochemical factors and pinpoint the most effective variables.

The architecture and hyperparameter values of a machine learning algorithm significantly affect the prediction ability of a model and need to be fine-tuned during modeling to obtain more accurate results. Robustness, a fast training rate, a minimal need to adjust parameters during training, acceptable generalization ability, and satisfactory universal approximation capability are the most prominent advantages of the LSTM, ELM, and RVFL algorithms over conventional machine learning techniques (Ding et al., 2015; Zhang and Suganthan, 2016; Diego et al., 2021). The main novel feature of the present study is its comparison of the ability of these three methods to identify locations with high radon concentrations, even though the available data are limited and the relationships among the geogenic drivers of IRC spatial variability are complex. The results could help protect the public against the potentially lethal effects of protracted exposure to radon.

2 Materials and Methods

2.1 Study Area

Danyang-gun is a county in the northeast region of Chungcheongbuk-do Province, South Korea, with a population of approximately 29,970. It is located in the range of 128°13′ to 128°39′E and 36°47′ to 37°09′N, and has an area of 780.67 km² (Figure 1). It is well-known for its scenic surroundings, including the Sobaek Mountain range and Namhan River. Sobaek Mountain is the second highest mountain in South Korea (elevation = 1,439 m) and the Namhan River flows for 23.7 km from northeast to southwest along the Sobaek Mountains. Only 11.2% of the county is cultivable, and 83.7% is mountainous. Because of this rugged terrain, both settlements and urban areas are developing in the hills and valleys. The annual mean precipitation is 1,113 mm and the annual mean temperature is 11.5°C; the highest and lowest temperatures are 17.5 and 6.6°C, respectively (KMA, 2021).

FIGURE 1. Map of the study area showing radon monitoring sites.

Danyang-gun is composed of various lithological units and strata, as well as complex and diverse geological structures. It consists of Precambrian base rock, Paleozoic sedimentary rock, Mesozoic sedimentary rock, and igneous rock (Figure 2). The Precambrian rock is located in the eastern study area and coincides with Sobaek Mountain. This rock has undergone granitization after regional metamorphism; it is divided into granitic and migmatitic gneisses (Won and Lee, 1967). The sedimentary rock includes quartzite of unknown age, Paleozoic clastic sedimentary rock, and carbonate rock. The quartzite unconformably overlies carbonate rock on the northwest side and is located at the western end of the study area. However, the sequence of formation is unclear because there is no direct contact with other formations (Won and Lee, 1967). The clastic sedimentary rock is composed of Cambrian quartzite and slate; it generally shows a strike of N30°E or N45°E. The carbonate rock is Cambrian–Ordovician and trends in the NE and NW directions (Aum et al., 2019). The NE carbonate rock consists of limestone, dolomitic limestone, dolomite, and banded limestone. The Mesozoic sedimentary rock covers this carbonate rock with a clinounconformity. This formation is mostly composed of clastic sedimentary rock such as shale, sandstone, and conglomerate; layers containing anthracite have also been identified.

FIGURE 2. Geological map of Danyang-gun (modified from Chwae et al., 1995).

A fault exists in the northern part of the most recent Mesozoic formation, and carbonate rock from the NW direction is distributed to the north of the study area according to this fault. Most carbonate rock from the NW direction is composed of limestone and dolomite; several types of clastic sedimentary rock of unknown age are also present. Mesozoic rock is divided into sedimentary and igneous rock. The sedimentary rock is distributed in the NE direction, as described above. The igneous rock intruded in the Cretaceous period; it includes biotite granite, quartz porphyry, and granite porphyry. The biotite granite, which is widely distributed in the south, is in contact with sedimentary rock; this forms a contact metamorphic zone. There are faults in the NE and NW directions in the study area. The faults in the NE direction cross the center of the study area, and the geology on both sides is clearly distinguished by these faults. The faults in the NW direction cut the sedimentary formations with an NE strike in an almost perpendicular direction (Won and Lee, 1967).

2.2 Theoretical Background of the Methods

2.2.1 Long Short-Term Memory

LSTM is a deep learning algorithm with an architecture analogous to that of an artificial recurrent neural network. The LSTM is designed to capture long-term dependencies between variables; it has been developed to resolve the exploding and vanishing gradient problem of recurrent neural networks via its memory cell structure (Vu et al., 2021). A memory cell comprises a forget gate ( $f_{t}$ ), an input gate ( $i_{t}$ ), and an output gate ( $o_{t}$ ); it regulates the flow of information entering and exiting the cell. Gates are employed to remove, maintain, or add information to the cell. The forget gate is the first filter determining whether information passes to the next time step or is discarded from the cell; it examines the current input ( $x_{t}$ ) and previous hidden state ( $h_{t - 1}$ ). Subsequently, the input gate decides on the input that should be employed to update the memory; ${\tilde{C}}_{t}$ contains the new information. Finally, the output gate determines the information that should be regarded as output (Fang et al., 2021). This process can be expressed mathematically, as follows (Shi et al., 2021):

\text{Forget gate:} \quad f_{t} = σ (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) (1)

\text{Input gate:} \quad \left\{ \begin{matrix} i_{t} = σ (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) \\ {\tilde{C}}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c}) \end{matrix} \right. (2)

\text{Output gate:} \quad o_{t} = σ (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) (3)

where $σ$ , $W$ , and $b$ are the sigmoid function, weight matrix, and corresponding bias vector of each gate, respectively. The new memory cell is updated as follows:

C_{t} = f_{t} \times C_{t - 1} + i_{t} \times {\tilde{C}}_{t} (4)

where $\times$ denotes the element-wise multiplication of two vectors, and $C_{t - 1}$ and $C_{t}$ are the previous and new memory cell states, respectively (Chen et al., 2020). The hidden state $h_{t}$ is a vector that is passed to the next time step; it is defined as follows:

h_{t} = o_{t} \times \tanh (C_{t}) (5)

Finally, the output of the LSTM network at time $t$ is calculated as follows (Zhang et al., 2020):

y_{t} = σ (W_{y} h_{t} + b_{y}) (6)
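
As an illustration of Eqs. 1–6, the following is a minimal NumPy sketch of a single LSTM time step. It is not the MATLAB implementation used in this study; the layer sizes, the random weight initialization, and the helper names (`lstm_step`, `init_params`) are assumptions made only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. 1-6 (illustrative sketch)."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # Eq. 1: forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # Eq. 2: input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # Eq. 2: candidate memory
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # Eq. 3: output gate
    c_t = f_t * c_prev + i_t * c_tilde           # Eq. 4: new cell state
    h_t = o_t * np.tanh(c_t)                     # Eq. 5: new hidden state
    y_t = sigmoid(p["W_y"] @ h_t + p["b_y"])     # Eq. 6: network output
    return y_t, h_t, c_t

def init_params(n_in, n_hidden, n_out, seed=0):
    """Hypothetical initialization: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fico":
        p[f"W_{gate}"] = rng.normal(0.0, 0.1, (n_hidden, n_hidden + n_in))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    p["W_y"] = rng.normal(0.0, 0.1, (n_out, n_hidden))
    p["b_y"] = np.zeros(n_out)
    return p

# Ten inputs (one per radon factor), four hidden units, one output.
params = init_params(n_in=10, n_hidden=4, n_out=1)
x = np.random.default_rng(1).random(10)
y, h, c = lstm_step(x, np.zeros(4), np.zeros(4), params)
```

Because the output passes through a sigmoid, `y` lies in (0, 1) and can be read as a score for the high-radon class.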

2.2.2 Extreme Learning Machine

The ELM, a type of feed-forward neural network, has been widely used to solve regression, clustering, image processing and classification problems. Recently, the ELM has attracted considerable attention from researchers because of its high generalization performance and remarkably fast learning rate compared with traditional methods. The minimal requirement for human intervention is another advantage of the ELM approach, where most parameters can be randomly generated (Yahia et al., 2021). In particular, the ELM can adaptively determine the number of nodes in the hidden layer, randomly assign the input weights and hidden layer biases using an activation function, and obtain output layer weights through the least squares method; these abilities appreciably enhance the learning speed and generalization ability (Ding et al., 2015). For a given training dataset composed of $N$ samples $(x_{i}, t_{i}) \in R^{n} \times R^{m} (i = 1,2, \dots, N),$ the ELM model is defined mathematically as follows (Ding et al., 2015):

\begin{matrix} \sum_{i = 1}^{\tilde{N}} β_{i} f_{i} (x_{j}) = \sum_{i = 1}^{\tilde{N}} β_{i} f (a_{i} ⊚ x_{j} + b_{i}) = t_{j} & j = 1,2, \dots, N \end{matrix} (7)

where $\tilde{N}$ represents the number of hidden nodes; good generalization performance will be obtained if $\tilde{N} ≪ N$ . $⊚$ indicates the inner product of vectors, $f (x)$ is the non-linear activation function, and $b_{i}$ denotes the $i$ -th hidden node bias. Finally, $a_{i}$ and $β_{i}$ are the weight vectors, such that $a_{i}$ connects the input nodes to the $i$ -th hidden node and $β_{i}$ connects the $i$ -th hidden node to the output nodes. Equation 7 can be simply expressed as follows:

H β = T (8)

where $β = {[\begin{matrix} β_{1}^{T} \\ β_{2}^{T} \\ ⋮ \\ β_{\tilde{N}}^{T} \end{matrix}]}_{\tilde{N} \times m}$ and $T = {[\begin{matrix} t_{1}^{T} \\ t_{2}^{T} \\ ⋮ \\ t_{N}^{T} \end{matrix}]}_{N \times m}$ . $H$ , the hidden layer output matrix, is represented as follows:

H (a_{1}, a_{2}, \dots, a_{\tilde{N}}, b_{1}, b_{2}, \dots, b_{\tilde{N}}, x_{1}, x_{2}, \dots, x_{N}) = {[\begin{matrix} f (a_{1} ⊚ x_{1} + b_{1}) & \dots & f (a_{\tilde{N}} ⊚ x_{1} + b_{\tilde{N}}) \\ ⋮ & ⋱ & ⋮ \\ f (a_{1} ⊚ x_{N} + b_{1}) & \dots & f (a_{\tilde{N}} ⊚ x_{N} + b_{\tilde{N}}) \end{matrix}]}_{N \times \tilde{N}} (9)

In summary, the ELM stages can be described as follows:

After defining $f (x)$ and $\tilde{N}$ , training is initiated, and $a_{i}$ and $b_{i}$ are randomly assigned ( $i = 1,2, \dots, \tilde{N})$ . Thereafter, $H$ is calculated according to Eq. 9. Finally, the output weight $β$ is calculated as follows:

\hat{β} = H^{†} T (10)

where $H^{†}$ denotes the Moore–Penrose generalized inverse of $H$ , which can be computed using various methods (e.g., singular value decomposition, orthogonal projection, and iterative and orthogonalization methods) (Rao and Mitra, 1973). However, the singular value decomposition method is mostly used in ELM implementations because of the limitations of the other approaches (Liang et al., 2006).
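
The ELM stages can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the MATLAB model used in this study: the sigmoid activation, the uniform range of the random weights, the number of hidden nodes, and the synthetic data are chosen only for illustration. `np.linalg.pinv` computes the generalized inverse through singular value decomposition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, seed=0):
    """Train an ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # input weights a_i
    b = rng.uniform(-1.0, 1.0, n_hidden)                # hidden biases b_i
    H = sigmoid(X @ A + b)                              # Eq. 9
    beta = np.linalg.pinv(H) @ T                        # Eq. 10
    return A, b, beta

def elm_predict(X, A, b, beta):
    return sigmoid(X @ A + b) @ beta                    # Eq. 8: H beta

# Synthetic stand-in data: 200 samples, 10 factors, binary target.
rng = np.random.default_rng(1)
X_demo = rng.random((200, 10))
T_demo = (X_demo[:, 0] > 0.5).astype(float).reshape(-1, 1)

A, b, beta = elm_fit(X_demo, T_demo, n_hidden=20)
pred = elm_predict(X_demo, A, b, beta)
```

Only `beta` is learned; `A` and `b` stay at their random values, which is why training reduces to a single linear solve.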

2.2.3 Random Vector Functional Link Networks

RVFL networks represent another type of single hidden layer feed-forward neural network; these have received considerable attention because of their ability to non-iteratively adjust network weights, fast convergence, and simple network architectures. Moreover, unlike ELM networks, RVFL networks have direct connections between input and output nodes, thus preventing overfitting problems (Zhang et al., 2019). In RVFL networks, hidden-to-output and input-to-output node weights can be determined using the Moore–Penrose pseudo-inverse or ridge regression method during the training stage; other parameters (e.g., weights between the input-to-hidden node and biases) are randomly selected in the interval $[- 1,1]$ without iterative tuning (Cao et al., 2018; Abd Elaziz et al., 2021). An RVFL network with $l$ hidden nodes can be formulated as follows (Zhang et al., 2019):

\begin{matrix} y_{i} = \sum_{j = 1}^{l} β_{j} h_{j} (x_{i}) + \sum_{j = l + 1}^{l + d} β_{j} x_{i j} & i = 1,2, \dots, N \end{matrix} (11)

where $(x_{i}, y_{i}) \in R^{d} \times R^{c} (i = 1,2, \dots, N)$ represents the training samples, among which $x_{i}$ and $y_{i}$ are $d$ - and $c$ -dimensional input and target vectors, respectively. $h_{j} (x_{i})$ represents the activation value for the $j$ -th hidden node, $x_{i j}$ denotes the $j$ -th attribute in the $i$ -th instance, and $β \in R^{(l + d) \times c}$ indicates the output weight matrix for the hidden nodes and the direct input-to-output links; these weights can be calculated through the least squares method, as follows (Zhang et al., 2019):

β = {(H^{T} H)}^{- 1} H^{T} Y (12)

where $H = {[\begin{matrix} h_{1} (x_{1}) & \dots & h_{l} (x_{1}) \\ ⋮ & ⋱ & ⋮ \\ h_{1} (x_{N}) & \dots & h_{l} (x_{N}) \end{matrix} \begin{matrix} x_{11} & \dots & x_{1 d} \\ ⋮ & ⋱ & ⋮ \\ x_{N 1} & \dots & x_{N d} \end{matrix}]}_{N \times (l + d)}$ .
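
A minimal NumPy sketch of Eqs. 11 and 12 follows. It is illustrative only: the sigmoid activation, the number of hidden nodes, the synthetic data, and the small ridge term `lam` (added to keep the linear solve well conditioned, in line with the ridge regression option mentioned above) are assumptions rather than the settings used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rvfl_fit(X, Y, n_hidden, lam=1e-3, seed=0):
    """Train an RVFL network with direct input-to-output links."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # random, never tuned
    b = rng.uniform(-1.0, 1.0, n_hidden)
    H = np.hstack([sigmoid(X @ W + b), X])    # N x (l + d): hidden + direct
    gram = H.T @ H + lam * np.eye(H.shape[1])
    beta = np.linalg.solve(gram, H.T @ Y)     # Eq. 12 with a ridge term
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    H = np.hstack([sigmoid(X @ W + b), X])
    return H @ beta                           # Eq. 11

# Synthetic stand-in data: 200 samples, 10 factors, binary target.
rng = np.random.default_rng(2)
X_demo = rng.random((200, 10))
Y_demo = (X_demo[:, 0] > 0.5).astype(float).reshape(-1, 1)

W, b, beta = rvfl_fit(X_demo, Y_demo, n_hidden=15)
pred = rvfl_predict(X_demo, W, b, beta)
```

Setting `lam=0` recovers Eq. 12 exactly, provided that $H^{T} H$ is invertible.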

2.3 Factor Selection

Various geological, geochemical, and topographical factors are associated with IRC values. Following a literature review and assessment of the available data, as well as application of the FR method, 10 effective factors were identified for IRC modeling (Table 1). These factors included lithology; fault density; mean soil calcium oxide (CaO), copper (Cu), lead (Pb), and ferric oxide (Fe₂O₃) concentrations; elevation; slope; valley depth; and TWI. Importantly, the FR values reflect probabilistic spatial relationships of dependent variables (IRC values, obtained from field measurements) with the various classes of each independent variable (“radon factors”). The FR values can be calculated as follows:

\mathrm{FR} = \frac{N_{r} / T_{r}}{N_{p} / T_{p}} (13)

where $N_{r}$ is the number of training samples in each subclass of an IRC effective factor, $T_{r}$ denotes the total number of training samples, $N_{p}$ is the number of pixels in each subclass of the effective factor, and $T_{p}$ indicates the total number of pixels in the study area. An FR value >1 indicates a high correlation between radon level and a particular factor, an FR value <1 indicates a low correlation, and an FR value of 1 indicates a moderate correlation (Al-Abadi et al., 2016).
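
To make Eq. 13 concrete, the short Python sketch below computes FR values for one factor. The class names, sample counts, and pixel counts are hypothetical and are not taken from Table 1.

```python
def frequency_ratio(n_r, t_r, n_p, t_p):
    """Eq. 13: share of training samples in a class over its share of pixels."""
    return (n_r / t_r) / (n_p / t_p)

# Hypothetical classes of one factor: (name, training samples, pixels).
classes = [
    ("class A", 300, 2_000_000),
    ("class B", 150, 3_000_000),
    ("class C", 58, 2_800_000),
]
t_r = sum(n for _, n, _ in classes)   # total training samples
t_p = sum(p for _, _, p in classes)   # total pixels in the study area

fr = {name: frequency_ratio(n, t_r, p, t_p) for name, n, p in classes}
```

In this made-up example, class A holds a larger share of the training samples than of the pixels, so its FR exceeds 1; the pixel-share-weighted mean of the FR values is always 1.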

TABLE 1. Factors considered to map indoor radon levels.

To identify relationships among the included effective factors, multicollinearity analysis was performed based on the variance inflation factor (VIF) and tolerance (TOL) (Arabameri et al., 2021c). Importantly, some factors were found to exert a negative influence on the predictive capacity of the model. Such variables were removed from the model to increase its prediction accuracy (Miraki et al., 2019). The relative importance and predictive abilities of the various factors were determined using the information gain ratio (IGR). This is an entropy-based method that only considers variables associated with occurrence of an event (Bui et al., 2018). A higher IGR value indicates that a factor has greater impact on the model predictions (Panahi et al., 2021).
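
The VIF of a factor is commonly computed as $1 / (1 - R^{2})$ , where $R^{2}$ is obtained by regressing that factor on all other factors, and TOL is its reciprocal. The NumPy sketch below follows this standard definition; it is an assumption that the study used the same formulation, and the synthetic columns are for illustration only.

```python
import numpy as np

def vif_and_tol(X):
    """Return one (VIF, TOL) pair per column of the factor matrix X."""
    n, k = X.shape
    results = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # regress factor j on the rest
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vif = 1.0 / (1.0 - r2)
        results.append((vif, 1.0 / vif))
    return results

# Synthetic factors: x3 is almost a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.01 * rng.normal(size=500)
demo = vif_and_tol(np.column_stack([x1, x2, x3]))
```

Here the near-duplicate columns receive very large VIF values, whereas the independent column stays close to 1.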

2.4 Geospatial Database of Radon Factors

Radon levels and their controlling factors vary spatially, and the selection of appropriate predictive variables is important for radon mapping accuracy. As shown in Table 1, 10 geogenic effective factors were used to model the indoor radon level. Local lithology and fault density are crucial factors affecting radon production and distribution, even in adjacent areas (Buttafuoco et al., 2010). Radon is released naturally via uranium-bearing mineral decay, such that fractures and faults provide an important route for radon migration from bedrock to the surface (Ciotoli et al., 2017). For the current study, geological and geochemical maps from the Korea Institute of Geoscience and Mineral Resources (https://www.kigam.re.kr/) were used (Figure 3). In addition to site geological characteristics, the concentrations of some chemical elements (i.e., CaO, Cu, Pb, and Fe₂O₃) remaining in minerals and soil after erosion can affect IRCs. Soil geochemistry can serve as a predictor of radon level (Ball et al., 1991; Schumann and Gundersen, 1997; Drolet et al., 2014). The effects of bedrock geochemistry on IRC are reportedly greater than those of topsoil properties, because a large portion of the topsoil tends to be removed during construction; thus, only a few centimeters remain (Appleton, 2013).

FIGURE 3. Indoor radon maps: (A) elevation, (B) slope, (C) TWI, (D) valley depth, (E) mean soil CaO concentration, (F) mean soil Cu concentration, (G) mean soil Fe₂O₃ concentration, (H) mean soil Pb concentration, (I) lithology, and (J) fault density.

In addition to geological variables, topographical factors were considered for our indoor radon potential mapping. The data were derived from a digital elevation model with a resolution of 10 m, provided by the National Geographic Information Institute (http://www.ngii.go.kr). The data were processed by SAGA software (http://www.saga-gis.org/en/index.html) to produce slope, valley depth, and TWI maps (Figure 3). In the present study, the TWI was used as a proxy of the spatial distribution of soil moisture, and was calculated as follows:

\mathrm{TWI} = \ln \left( \frac{β}{\tan α} \right) (14)

where $β$ and $α$ are the cumulative catchment area in m² and slope angle in radians, respectively (Arabameri et al., 2021b). The TWI can reflect the water transmissivity and infiltration rate at a given location. Areas with low slope angles have high TWI values, while steeper areas have low TWI values (Mattivi et al., 2019). Notably, pores saturated with water trap radon in the soil and slow its transport through soil into the atmosphere (Kellenbenz and Shakya, 2021; Shahrokhi and Kovacs, 2021). However, soil moisture content can influence radon escape from mineral matter if fewer than 30% of the soil pores are filled with water; higher soil moisture content leads to a considerable reduction in radon emanation because of decreased gas permeability (Je et al., 1999). Furthermore, large valley depth values indicate areas with low elevation and gentle slopes (Figure 3). In such areas, the infiltration rate is high; the high soil wetness and fine texture lead to low permeability, in turn causing convective radon flow and slow soil gas exhalation (Wiegand, 2001).
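
Eq. 14 can be evaluated per pixel as in the sketch below. In this study the TWI was produced with SAGA; the function here is only an illustration, and the `min_slope` guard against division by zero on flat cells is an assumption, not part of Eq. 14.

```python
import math

def twi(catchment_area_m2, slope_rad, min_slope=1e-6):
    """Eq. 14: ln(catchment area / tan(slope)), with a flat-cell guard."""
    slope = max(slope_rad, min_slope)   # assumed guard, avoids tan(0) = 0
    return math.log(catchment_area_m2 / math.tan(slope))

# Same catchment area, gentle versus steep slope.
gentle = twi(500.0, math.radians(3.0))
steep = twi(500.0, math.radians(30.0))
```

For a fixed catchment area, the gentler slope gives the higher TWI, consistent with the behavior described above.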

2.5 Model Development

The generation of a radon inventory map is important for developing a machine learning-based model. In the current study, with the aim of obtaining representative samples of indoor radon levels, 1,452 dwellings were selected at random throughout the study area. Since 2008, passive IRC measurements have been conducted by the National Institute of Environmental Research (NIER) using alpha-track detectors (Raduet; Radosys Ltd., Budapest, Hungary). The detectors were typically positioned in the living room, where residents spent most of their time. Each measurement period (all in winter) was 90 days in duration; the collected data were returned to NIER for analysis, and showed that the IRC value exceeded the recommended level of 148 Bqm⁻³ in 726 samples. To develop the model, samples were classified in a binary manner in terms of their IRC values. Samples with IRC >148 Bqm⁻³ were coded as 1, indicating locations with high radon levels. All remaining samples were coded as 0, indicating locations with low radon levels. The two classes of data (high and low radon levels), with equal numbers of samples (726 each), were randomly divided into training and testing subsets at a ratio of 70:30 (Kadirhodjaev et al., 2020; Panahi et al., 2021; Roy et al., 2021). The distribution of the training and testing samples is illustrated in Figure 1. To build the model, the training dataset was constructed by combining 508 samples from each of the high and low radon level classes. Similarly, to validate the predictive accuracy of the model, the testing dataset was constructed from 218 samples from each class. The training and testing datasets were then superimposed with all of the radon factors to extract their attribute characteristics. Finally, the data were transferred into MATLAB software (https://www.mathworks.com) to construct the LSTM, ELM, and RVFL models.
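
The 70:30 division can be reproduced with a class-wise random split, as in the following sketch. The labels are placeholders with the class sizes reported above; the random seed and the function name `stratified_split` are assumptions for illustration.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=42):
    """Split sample indices 70:30 within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_frac * len(idx))
        train += idx[:n_train]
        test += idx[n_train:]
    return train, test

# 726 high-radon (1) and 726 low-radon (0) samples.
labels = [1] * 726 + [0] * 726
train_idx, test_idx = stratified_split(labels)
```

With 726 samples per class, this yields 508 training and 218 testing samples in each class.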

2.6 Model Validation

Model validation is critical to confirm the reliability of machine learning algorithms. Various statistical analysis methods are used to evaluate modeling accuracy. The area under the receiver operating characteristic curve (AUROC) is a useful quantitative parameter; the curve plots accurately detected events on the y-axis (i.e., sensitivity) against false detections on the x-axis (i.e., 1 − specificity). The curve can be constructed from both the training and testing datasets, to yield success and prediction rates, respectively. The success rate curve represents model accuracy at the locations of the training samples; the prediction rate curve indicates the predictive power or generalizability of the model (Golkarian and Rahmati, 2018). The AUROC takes a value between 0 and 1, where values closer to 1 reflect better predictive ability (Park et al., 2017). The root mean square error (RMSE) and standard deviation (StD) are other statistical measures used to assess the prediction accuracy of a model over $n$ samples, as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(X_{\mathrm{predicted}} - X_{\mathrm{actual}}\right)^{2}}$$ (15)

$$\mathrm{StD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(X_{\mathrm{predicted}} - \bar{X}_{\mathrm{predicted}}\right)^{2}}$$ (16)

where $\bar{X}_{\mathrm{predicted}}$ is the mean value of the predicted dataset, and $X_{\mathrm{predicted}}$ and $X_{\mathrm{actual}}$ indicate the predicted and actual values of the variable, respectively. An overview of the methods used for indoor radon potential mapping is provided in Figure 4.
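
As an illustration, the three validation measures can be computed as in the sketch below. It follows Eqs 15 and 16 as written and uses the rank-based (pairwise comparison) form of the AUROC; the function names and the four-sample toy data are assumptions made for this example, not values from the study.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean square error, Eq. 15."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


def std_of_predictions(predicted):
    """Standard deviation of the predicted values about their mean, Eq. 16."""
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - predicted.mean()) ** 2)))


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


# Toy example: binary targets (1 = high radon) and model outputs in [0, 1].
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(auroc(y_score, y_true), rmse(y_score, y_true), std_of_predictions(y_score))
```

In the toy data, three of the four positive/negative pairs are ranked correctly, so the AUROC is 0.75.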

FIGURE 4. Flowchart of the method used to map indoor radon potential.

3 Results

3.1 Multicollinearity and IGR Analysis

Collinearity among effective radon factors was determined by calculating the $VIF$ and $TOL$, where $VIF > 10$ and $TOL < 0.1$ indicate collinearity among predictors (Arabameri et al., 2021a). As shown in Table 2, the $VIF$ and $TOL$ values of the selected factors were lower than the critical values; thus, there was no collinearity among inputs. Notably, elevation had the lowest $TOL$ (0.316) and highest $VIF$ (3.160).
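
A minimal sketch of the collinearity check is given below, assuming the usual definitions $VIF_j = 1/(1 - R_j^2)$ and $TOL_j = 1/VIF_j$, where $R_j^2$ is obtained by regressing factor $j$ on all other factors. The function name `vif_and_tol` and the synthetic three-column factor matrix are illustrative assumptions.

```python
import numpy as np


def vif_and_tol(X):
    """Return (VIF, TOL) arrays, one entry per column of the factor matrix X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_factors = X.shape
    vif = np.empty(n_factors)
    for j in range(n_factors):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of factor j on an intercept plus the other factors.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        vif[j] = 1.0 / (1.0 - r_squared)
    return vif, 1.0 / vif


# Synthetic factors: columns a and b are independent, c is nearly a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vif, tol = vif_and_tol(np.column_stack([a, b, c]))
print(np.round(vif, 2), np.round(tol, 3))
```

In this synthetic case, columns a and c exceed the $VIF > 10$ threshold while b does not. Because $TOL$ is the reciprocal of $VIF$, a $VIF$ of 3.160 corresponds to a $TOL$ of about 0.316.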

TABLE 2. Multicollinearity analysis using VIF, TOL, and IGR.

The $IGR$ method was applied to rank the predictive capabilities of the variables; the results indicated that elevation had the strongest effect on radon-prone area mapping ($IGR$ = 0.61), followed by lithology (0.32), valley depth (0.30), and mean soil Cu concentration (0.29). In fact, all factors with $IGR > 0$ had predictive power (Table 2).
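
For orientation, the sketch below computes an information gain ratio for one classified factor using the common gain-ratio definition (information gain divided by split information). The exact formulation used in the study may differ in detail, and the function names and eight-sample toy data are assumptions.

```python
import numpy as np


def entropy(values):
    """Shannon entropy (base 2) of a discrete array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain_ratio(factor_classes, labels):
    """Gain ratio of a classified factor with respect to binary radon labels."""
    factor_classes = np.asarray(factor_classes)
    labels = np.asarray(labels)
    # Expected label entropy after splitting the samples by factor class.
    conditional = 0.0
    for value in np.unique(factor_classes):
        mask = factor_classes == value
        conditional += mask.mean() * entropy(labels[mask])
    gain = entropy(labels) - conditional
    split_info = entropy(factor_classes)
    return gain / split_info if split_info > 0 else 0.0


# Toy data: one factor separates the labels perfectly, the other not at all.
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
informative = np.array(["a", "a", "a", "b", "b", "b", "a", "b"])
uninformative = np.array(["a", "b", "a", "b", "a", "b", "b", "a"])
igr_informative = information_gain_ratio(informative, y)
igr_uninformative = information_gain_ratio(uninformative, y)
print(igr_informative, igr_uninformative)
```

A factor whose classes carry no information about the labels has a gain ratio of 0, which is why a value above 0 is read as evidence of predictive power.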

3.2 Assessment of the Contributions of Each Factor to Model Performance

Various geogenic factors can affect radon levels; this can be quantified through statistical modeling, such as the FR method. Stronger correlations are indicated by higher FR values, while FR < 1 indicates a weak relationship between a given predictor and the IRC value. As shown in Table 3, 10 variables were used to predict areas with potentially dangerous radon levels. The results implied that an elevation of 120–242 m (FR = 3.29) had the greatest influence on the IRC. Importantly, approximately 80% of Danyang-gun is mountainous, with a shallow soil profile that mostly contains coarse fragments. This promotes soil permeability and movement of radon gas within the soil (Hauri et al., 2012). In contrast, soil gas accumulation in lowlands is high; thus, it can easily infiltrate indoor environments via the soil through openings and cracks in basement foundations. Analysis of the FR values for the slope factor showed that the highest value (2.41) was associated with the class of 0–13.5°. This finding implies that the IRC decreases in sloped areas because the released radon is rapidly diluted in outdoor air (Appleton, 2013). In terms of valley depth and TWI, the highest FR values were found in the sixth (161–370 m) and fifth (5.59–7.32) classes, respectively. These factors reflect the effects of hydrological variables (e.g., rainfall-runoff and infiltration rates) and soil moisture content on soil gas exhalation capacity; this capacity is generally diminished when soil wetness is increased (Sasaki et al., 2004; Raduła et al., 2018). Analysis of the relationship between the radon emanation rate and presence of specific uranium- and radium-containing minerals (i.e., in the host rock and remaining soils after weathering) showed that the highest values of FR were 1.47, 1.59, 1.76, and 1.48 for mean soil CaO, Cu, Pb, and Fe₂O₃ concentrations, respectively. For all of these factors, the highest FR values were >1, indicating strong correlations with radon levels at the monitoring sites.
Furthermore, radon levels were high in areas where the fault density varied between 0.58 and 1.2 (FR = 2.32). Notably, fault systems located in fracture zones provide a route for radon to migrate upward from deeper sources (Han et al., 2006). Finally, in terms of lithology, the FR analysis yielded high values of 200.65, 32.74, and 13.71 for the Cretaceous acidic dike (Kad), Cretaceous quartz porphyry (Kqp), and Cambrian quartzite and slate (CEdy) units, respectively. Generally, sedimentary, igneous, and metamorphic rocks contain variable amounts of uranium and radium, depending on the rock formation processes (Pasculli et al., 2014).
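
The FR values discussed above can be illustrated with a short sketch. It assumes the common definition of the frequency ratio for a factor class, namely the share of high-radon samples falling in that class divided by the share of the study area covered by that class. The function name, class labels, and areas are hypothetical and are not taken from Table 3.

```python
import numpy as np


def frequency_ratio(class_of_sample, is_high, class_area):
    """FR per class = (share of high-radon samples) / (share of area)."""
    class_of_sample = np.asarray(class_of_sample)
    is_high = np.asarray(is_high, dtype=bool)
    total_high = is_high.sum()
    total_area = sum(class_area.values())
    fr = {}
    for cls, area in class_area.items():
        high_share = (is_high & (class_of_sample == cls)).sum() / total_high
        area_share = area / total_area
        fr[cls] = float(high_share / area_share)
    return fr


# Hypothetical factor with three classes covering 50, 30, and 20 area units.
areas = {"A": 50.0, "B": 30.0, "C": 20.0}
sample_class = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
sample_high = [1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
fr = frequency_ratio(sample_class, sample_high, areas)
print(fr)
```

Here class C holds half of the high-radon samples but only 20% of the area, giving FR = 2.5, whereas class A has FR < 1.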

TABLE 3. Spatial relationships of predictor variables with the IRC values, determined through FR analysis.

3.3 Radon Potential Mapping

The maps generated using the LSTM, ELM, and RVFL algorithms are shown in Figure 5. The maps included five classes of radon-prone areas (very low, low, moderate, high, and very high), based on the quantile method (Khosravi et al., 2018). The percentage area of each class on each map is shown in Figure 6. The ELM model was the most accurate; it categorized 19.62, 20.64, 19.84, 20.01, and 19.88% of the study area into the very low, low, moderate, high, and very high classes, respectively. As depicted in Figure 5, high radon levels were observed in central and southwestern parts of the study area due to the distribution of sedimentary rock and unconsolidated deposits such as carbonate, shale, sandstone, conglomerate, limestone, and dolomite, all of which are rich in uranium and organic materials. These findings were consistent with the results of previous studies (Cho et al., 2015; Hwang et al., 2017; Kim and Ha, 2018; Park et al., 2019).
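
The quantile classification can be sketched as below, assuming that each map cell carries a continuous radon potential score and that class breaks are placed at the 20th, 40th, 60th, and 80th percentiles. The function name and the random scores are illustrative assumptions.

```python
import numpy as np

CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]


def quantile_classes(potential, n_classes=5):
    """Assign each cell a class code 0..n_classes-1 using quantile breaks."""
    potential = np.asarray(potential, dtype=float)
    # Interior break points, e.g. the 20/40/60/80th percentiles for 5 classes.
    inner_breaks = np.quantile(potential, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(potential, inner_breaks, right=True)


# Random stand-in scores for 10,000 map cells.
rng = np.random.default_rng(1)
scores = rng.random(10_000)
codes = quantile_classes(scores)
shares = np.bincount(codes, minlength=5) / codes.size * 100
for name, share in zip(CLASS_NAMES, shares):
    print(f"{name}: {share:.2f}%")
```

Quantile breaks give each class roughly the same number of cells, which is consistent with the reported class areas all being close to 20%.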

FIGURE 5. Radon potential maps derived from the (A) LSTM, (B) ELM, and (C) RVFL models.

FIGURE 6. Percentage areas of the different radon potential classes for the (a) LSTM, (b) ELM, and (c) RVFL models.

The reliability of the results was checked using the FR method, which revealed that most of the samples with high radon levels were from the very high and high radon potential areas. Thus, the models exhibited satisfactory performance in terms of study area classification. The AUROC values were calculated to quantitatively evaluate the predictive accuracy of each model. The AUROC values for the success rate curve analysis of the LSTM, ELM, and RVFL models were 0.81, 0.83, and 0.82, respectively. The AUROC value for the prediction rate curve analysis was 0.82 for the ELM model; the LSTM and RVFL models had lower values of 0.80 and 0.78, respectively (Figure 7). The RMSE values exhibited a similar pattern. As shown in Figure 8, analysis based on training data showed that the RMSE was lowest for the ELM model (0.152); the LSTM and RVFL models exhibited higher RMSEs of 0.163 and 0.182, respectively. Further analysis based on the testing data showed that the RMSEs of the ELM, LSTM, and RVFL models were 0.209, 0.232, and 0.0286, respectively. The standard deviation (StD) values for the ELM model (0.152 and 0.207) were lower than those for the LSTM and RVFL models, during both the training and validation phases. In summary, comparison of the AUROC, RMSE, and StD values calculated using the training and testing datasets showed that all of the evaluated models had acceptable performance in terms of classifying radon-prone areas; however, the ELM model was slightly superior to the other two models.

FIGURE 7. (A) Success rate curve and (B) prediction rate curve AUROC results.

FIGURE 8. Assessment of model performance: (A) LSTM, (B) ELM, and (C) RVFL. (a) Targets and outputs for the training dataset; (b) targets and outputs for the testing dataset; (c) MSE and RMSE for the training dataset; (d) frequency of errors for the training dataset; (e) MSE and RMSE for the testing dataset; (f) frequency of errors for the testing dataset.

4 Discussion

As a subclass of data-driven methods, machine learning algorithms have attracted attention in geospatial studies because of their robust performance in modelling nonlinear problems. The present study was conducted to determine the effects of geogenic factors on radon levels in residential environments, and to identify areas of high radon risk using machine learning methods. To fulfill these aims, IRCs were measured during field surveys of 1,452 dwellings. Notably, IRCs exceeded the threshold value (148 Bqm⁻³) in 726 locations; they varied from 148.7 to 1,775.1 Bqm⁻³, with a mean value of 346.9 Bqm⁻³. This study demonstrated that the geological and topographical properties of a given site are the fundamental drivers of IRC spatial variability. Higher IRC values were observed in the central and southwestern parts of the study area (Figure 5), where the dominant lithology is limestone; the higher fault density in that region facilitates radon migration from bedrock to the surface. These results were consistent with the findings of Park et al. (2019), who reported that the mean IRCs were higher in Danyang than in other counties in South Korea; the high values in that study were attributable to coal-bearing formations in the Daedong system and limestone intercalation in the Pyeongan system. Additionally, more than 200 limestone caves are present in Danyang; radon gas can easily accumulate in the holes within limestone areas and move to the surface through faults and fractures. Therefore, lithology can be considered a key predictor in defining geogenic radon-prone areas, in line with previous studies including Przylibski et al. (2011) and Cho et al. (2015), who revealed the relationship between radon levels and variability of lithological units in the study area. In addition, Kim et al. (2011) pointed out that high IRC values were correlated with the concentration of radionuclides in the surface soil and the distribution of granitic rocks in South Korea.

Furthermore, elevation had a greater effect on the IRC values in the present study than lithology, according to the $IGR$ analysis. In highland areas with steep slopes, soil has coarser fractions; consequently, it also has high permeability, such that radon gas emitted from rocks and surficial soil can easily migrate to the atmosphere and rapidly disperse in open air. Conversely, in areas of low elevation with gentle slopes, where most of the residential areas are located, indoor radon levels are high because there are no mitigation activities (Cinelli et al., 2015). Oliver and Khayrat (2001) showed an inverse relationship between radon concentrations and elevation. This is consistent with the findings of Siaway et al. (2010), Mose et al. (2010), and Cho et al. (2015), who concluded that in highlands with steep slopes, indoor radon levels may be reduced because of high soil permeability. The presence of coarser soil with limited moisture leads to less accumulation of radon in the soil beneath buildings, because radon emanating from host rocks is diluted more rapidly in the outdoor air.

Accurate determination of the geographical distribution of IRCs and prediction of radon priority areas can inform construction regulations and promote more cost-effective radon policies. We used three machine learning algorithms (i.e., LSTM, ELM, and RVFL) to map areas of high radon risk. The AUROC, RMSE, and StD values indicated that the ELM was superior to the LSTM and RVFL, in terms of predictive accuracy, during both the training and validation phases. The main advantage of the ELM method is that only the weights connecting the hidden layer to the output layer require adjustment, whereas the input weights and biases are assigned randomly; therefore, it has better generalizability and is less computationally complex, especially for large-scale samples (Liu et al., 2012; Fernández et al., 2019). The present study supports the findings of Lian et al. (2014), Huang et al. (2017), Yadav et al. (2017), and Anupam and Pani (2020), who reported the efficiency and applicability of the ELM algorithm for generating more accurate predictive models in various fields of study such as landslide displacement prediction, landslide susceptibility mapping, groundwater level prediction, and flood forecasting, respectively. However, the suitability of the ELM model for identifying radon-affected areas has not previously been reported in the literature.
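
To make the training shortcut concrete, the sketch below implements a basic ELM in its usual textbook form: input weights and biases are drawn at random and left fixed, and only the output weights are obtained in a single least-squares step. It is not the MATLAB implementation used in this study; the class name `SimpleELM`, the sigmoid activation, the number of hidden nodes, and the synthetic two-feature data are all assumptions.

```python
import numpy as np


class SimpleELM:
    """Single-hidden-layer ELM with random, fixed input weights."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the randomly weighted hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # Input weights and biases are sampled once and never updated.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(H) @ np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta


# Synthetic two-factor problem (not the radon data): label 1 if x0 + x1 > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
model = SimpleELM(n_hidden=50).fit(X[:400], y[:400])
pred = (model.predict(X[400:]) > 0.5).astype(float)
accuracy = float((pred == y[400:]).mean())
print(f"test accuracy: {accuracy:.3f}")
```

Because fitting reduces to one pseudo-inverse, no iterative backpropagation is needed, which is the source of the low computational cost noted above.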

5 Conclusion

IRCs were measured in 1,452 randomly selected dwellings in Danyang-gun, South Korea, to facilitate indoor radon potential mapping using LSTM, ELM, and RVFL machine learning algorithms. The results showed that the ELM method had the best prediction performance; approximately 40% of the study area was located within very high and high-risk radon potential zones. Elevation was the strongest predictor of radon-prone areas, followed by lithology and valley depth.

Uranium and thorium in soil and rocks are the main sources of variability in IRC values, and more than 80% of the ionizing radiation to which humans are exposed is of natural origin (Pantelić et al., 2019). However, in this study the distribution of radon in indoor environments could not be reliably estimated solely on the basis of geogenic factors. In addition to the characteristics of the underlying soils and rocks, building materials, ventilation systems and resident lifestyles can substantially affect indoor radon levels. Nevertheless, the results of the present study should facilitate identification of high radon areas, and thus allow the negative effects of natural radon on human health to be reduced (through regular monitoring of existing houses and the imposition of restrictions on the construction of new structures in affected areas). An accurate indoor radon map is important for more efficient future surveys.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author Contributions

FR: Conceptualization, writing–original draft, writing-review and editing, software, formal analysis, data curation, visualization; SK: Writing–original draft; MA: Writing–original draft; MP: Methodology, validation, writing-review and editing, visualization; HK: Writing–original draft; SK: Resources, review and editing; JL: Resources, review and editing; JL: Resources, review and editing; JY: Resources, review and editing; and SL: Supervision, Funding acquisition, project administration.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

This research was supported by the Basic Research Project of the Korea Institute of Geoscience and Mineral Resources (KIGAM) and Project of Environmental Business Big Data Platform and Center Construction funded by the Ministry of Science and ICT. Furthermore, this work was supported by a grant from the National Institute of Environmental Research (NIER), funded by the Ministry of Environment (MOE) of the Republic of Korea (NIER-2017-03-01-017).

References

Abd Elaziz, M., Senthilraja, S., Zayed, M. E., Elsheikh, A. H., Mostafa, R. R., and Lu, S. (2021). A New Random Vector Functional Link Integrated with Mayfly Optimization Algorithm for Performance Prediction of Solar Photovoltaic thermal Collector Combined with Electrolytic Hydrogen Production System. Appl. Therm. Eng. 193, 117055. doi:10.1016/j.applthermaleng.2021.117055