AUTHOR=Wang Kaiyi, Han Yanyun, Zhang Yuqing, Zhang Yong, Wang Shufeng, Yang Feng, Liu Chunqing, Zhang Dongfeng, Lu Tiangang, Zhang Like, Liu Zhongqiang
TITLE=Maize yield prediction with trait-missing data via bipartite graph neural network
JOURNAL=Frontiers in Plant Science
VOLUME=15
YEAR=2024
URL=https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2024.1433552
DOI=10.3389/fpls.2024.1433552
ISSN=1664-462X
ABSTRACT=
The timely and accurate prediction of maize (Zea mays L.) yields prior to harvest is critical for food security and agricultural policy development. Currently, many researchers are using machine learning and deep learning to predict maize yields in specific regions with high accuracy. However, existing methods typically have two limitations. First, they ignore the extensive correlations in maize planting data, such as the association of maize yields between adjacent planting locations and the combined effect of meteorological features and maize traits on maize yields. Second, their performance may degrade significantly when some data in maize planting records are missing or the samples are unbalanced. Therefore, this paper proposes an end-to-end bipartite graph neural network-based model for trait data imputation and yield prediction. The maize planting data are first converted to a bipartite graph data structure. Then, a yield prediction model based on a bipartite graph neural network is developed to impute missing trait data and predict maize yield. This model can mine correlations between different data samples, correlations between meteorological features and traits, and correlations between different traits. Finally, to address the unbalanced sample sizes across planting locations, we propose a loss function based on the gradient balancing mechanism that effectively reduces the impact of data imbalance on the prediction model. Compared with other data imputation and prediction models, our method achieves the best yield prediction results even when missing data are not pre-processed.
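
The conversion of planting records to a bipartite graph, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one common construction in which samples form one node set, features (traits and meteorological variables) form the other, and an edge carries each observed value, so that missing entries are simply absent edges to be imputed later. The function name `table_to_bipartite`, the column names, and all numeric values are hypothetical.

```python
import math


def table_to_bipartite(rows, feature_names):
    """Turn a table with missing entries into a bipartite edge list.

    Left nodes are samples (row indices); right nodes are features
    (column indices). An edge (i, j, value) exists only where feature j
    was actually observed for sample i. This is an assumed construction,
    not necessarily the one used in the paper.
    """
    observed = []  # edges carrying known values
    missing = []   # (sample, feature) pairs left for imputation
    for i, row in enumerate(rows):
        if len(row) != len(feature_names):
            raise ValueError(
                f"row {i} has {len(row)} values, expected {len(feature_names)}"
            )
        for j, value in enumerate(row):
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_nan:
                missing.append((i, j))
            else:
                observed.append((i, j, float(value)))
    return observed, missing


# Hypothetical records: column names and values are made up for illustration.
features = ["plant_height_cm", "ear_length_cm", "mean_temp_c"]
records = [
    [250.0, 18.5, 22.1],
    [None, 17.0, 23.4],          # missing trait encoded as None
    [245.0, float("nan"), 21.8],  # missing trait encoded as NaN
]

edges, gaps = table_to_bipartite(records, features)
```

Under this assumed construction, imputing a missing trait amounts to predicting the value on an absent edge, while yield prediction uses the learned sample-node representations; no row has to be dropped or pre-filled before training.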