AUTHOR=Rahmat Fariq , Zulkafli Zed , Juraiza Ishak Asnor , Mohd Noor Samsul Bahari , Yahaya Hazlina , Masrani Afiqah TITLE=Exploratory Data Analysis and Artificial Neural Network for Prediction of Leptospirosis Occurrence in Seremban, Malaysia Based on Meteorological Data JOURNAL=Frontiers in Earth Science VOLUME=8 YEAR=2020 URL=https://www.frontiersin.org/journals/earth-science/articles/10.3389/feart.2020.00377 DOI=10.3389/feart.2020.00377 ISSN=2296-6463 ABSTRACT=

Leptospirosis outbreaks in various parts of the world have been linked to changes in the weather. Moreover, these effects have been shown to occur at different lags of up to 10 months, which affects the performance of simulation models that predict leptospirosis occurrence. In Malaysia, the link between leptospirosis occurrence and different weather parameters at different time lags has yet to be established, despite an increasing number of cases in recent years. In this study, a combination of data mining and machine learning is used to analyze, capture, and predict the relation between leptospirosis occurrence and temperature, rainfall, and relative humidity, using the Seremban district in Malaysia as a case study. First, the optimal time lags for rainfall were determined using graphical exploratory data analysis (EDA), while non-graphical EDA was used for temperature. Then, an artificial neural network (ANN) model was developed to classify the combinations of selected features into disease occurrence and non-occurrence using back-propagation training, with the number of hidden layers and hidden nodes optimized. Model performance was measured using the accuracy, sensitivity, and specificity of each model. The EDA showed that leptospirosis occurrence in Seremban is highly correlated with weekly average temperature at a lag of 16 weeks and with weekly rainfall amount at lags of 12–20 weeks. Using these selected features, the ANN model achieved its highest accuracy, sensitivity, and specificity at 84.00%, 86.44%, and 79.33%, respectively. Overall, the EDA approach increased the accuracy of the predictive model by 13.30–31.26% relative to the baseline models.
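
To make the two quantitative steps in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that lag selection can be approximated by scoring each candidate lag with a Pearson correlation between a weekly weather series and a weekly case series (the abstract does not name the statistic used), and it computes accuracy, sensitivity, and specificity from binary occurrence labels. The function names (`lagged_correlation`, `best_lag`, `classification_metrics`) are hypothetical, and the ANN itself is omitted.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(vx * vy)


def lagged_correlation(weather, cases, lag):
    """Correlate weather at week t - lag with cases at week t.

    Assumes both series are weekly, aligned, and of equal length.
    """
    if len(weather) != len(cases):
        raise ValueError("series must have equal length")
    if lag < 0 or lag >= len(weather) - 1:
        raise ValueError("lag out of range for this series length")
    if lag == 0:
        return pearson(weather, cases)
    return pearson(weather[:-lag], cases[lag:])


def best_lag(weather, cases, max_lag):
    """Return (lag with the largest absolute correlation, all scores)."""
    scores = {
        lag: lagged_correlation(weather, cases, lag)
        for lag in range(max_lag + 1)
    }
    return max(scores, key=lambda k: abs(scores[k])), scores


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels.

    1 = disease occurrence, 0 = non-occurrence. Assumes both classes
    are present in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }
```

In a workflow of this kind, the lag chosen for each weather variable would determine which shifted series are passed to the classifier as input features, and the three metrics would then be computed on held-out predictions.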