AUTHOR=Liu Siwei , Wang Jingjing , Li Ming , Cui Yanmei , Guo Juan , Shi Yurong , Luo Bingxian , Liu Siqing TITLE=A selective up-sampling method applied upon unbalanced data for flare prediction: potential to improve model performance JOURNAL=Frontiers in Astronomy and Space Sciences VOLUME=10 YEAR=2023 URL=https://www.frontiersin.org/journals/astronomy-and-space-sciences/articles/10.3389/fspas.2023.1082694 DOI=10.3389/fspas.2023.1082694 ISSN=2296-987X ABSTRACT=
The Spaceweather HMI Active Region Patch (SHARP) parameters have been widely used to develop flare prediction models. Because strong-flare events are relatively rare, the resulting datasets are unbalanced, and prediction models can be sensitive to this imbalance, which may lead to bias and limited performance. In this study, we adopted the logistic regression algorithm to develop a model that predicts flares in the next 48 h based on the SHARP parameters. The model was trained with five different inputs. The first input was the original unbalanced dataset; the second and third inputs were obtained by applying two widely used sampling methods to the original dataset; and the fourth input was the original dataset paired with a weighted classifier. Based on the distribution properties of strong-flare occurrences with respect to the SHARP parameters, we established a new selective up-sampling method and applied it to the mixed-up region (the confusing distribution areas that contain both strong-flare and non-strong-flare events). The method picks the flare-related samples in this region and adds small random values to them, creating a large number of flare-related samples that are very close to the ground truth. This yielded the fifth, balanced dataset, which aims to 1) improve the forecast capability in the mixed-up region and 2) increase the robustness of the model. We compared the model performance across the five inputs and found that the selective up-sampling method has the potential to improve strong-flare prediction, with its F1 score reaching 0.5501 ± 0.1200, approximately 22% to 33% higher than that of the other imbalance mitigation schemes.
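
The selective up-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the mixed-up region is delimited or how the random values are drawn, so the per-feature bounds `low` and `high`, the Gaussian noise, the `noise_scale` value, and the function name `selective_upsample` are all assumptions made for illustration, and the toy feature values are not SHARP data.

```python
import random


def selective_upsample(X, y, low, high, noise_scale=0.01, seed=0):
    """Balance a binary dataset by jittering positives in a chosen region.

    X: list of feature vectors (lists of floats); y: list of 0/1 labels.
    low, high: per-feature bounds of the mixed-up region (hypothetical;
    the abstract does not state how the region is defined).
    """
    rng = random.Random(seed)

    # Positive (strong-flare) samples that fall inside the mixed-up region.
    candidates = [
        x for x, label in zip(X, y)
        if label == 1 and all(lo <= v <= hi for v, lo, hi in zip(x, low, high))
    ]

    n_pos = sum(y)
    n_neg = len(y) - n_pos
    n_needed = n_neg - n_pos  # synthetic positives required to balance classes

    X_new = [list(x) for x in X]
    y_new = list(y)
    if not candidates or n_needed <= 0:
        return X_new, y_new

    for _ in range(n_needed):
        base = rng.choice(candidates)
        # Assumed noise model: small zero-mean Gaussian perturbation per feature.
        X_new.append([v + rng.gauss(0.0, noise_scale) for v in base])
        y_new.append(1)
    return X_new, y_new


# Toy one-feature dataset: five negatives, two positives.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.45], [0.9]]
y = [0, 0, 0, 0, 0, 1, 1]

# Assumed mixed-up region: feature values between 0.0 and 0.6.
X_bal, y_bal = selective_upsample(X, y, low=[0.0], high=[0.6])
print(len(X_bal), sum(y_bal))
```

In this toy case only the positive sample at 0.45 lies inside the assumed region, so every synthetic sample is a slightly perturbed copy of it, while the positive at 0.9 is kept but not duplicated. The balanced set could then be passed to any logistic regression implementation.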