Classification models for Tobacco Mosaic Virus and Potato Virus Y using hyperspectral and machine learning techniques

Chen, Haitao; Han, Yujing; Liu, Yongchang; Liu, Dongyang; Jiang, Lianqiang; Huang, Kun; Wang, Hongtao; Guo, Leifeng; Wang, Xinwei; Wang, Jie; Xue, Wenxin

doi:10.3389/fpls.2023.1211617

ORIGINAL RESEARCH article

Front. Plant Sci., 16 October 2023

Sec. Sustainable and Intelligent Phytoprotection

Volume 14 - 2023 | https://doi.org/10.3389/fpls.2023.1211617


Haitao Chen¹†

Yujing Han²†

Yongchang Liu²

Dongyang Liu³

Lianqiang Jiang³

Kun Huang⁴

Hongtao Wang²

Leifeng Guo⁵

Xinwei Wang²

Jie Wang²*

Wenxin Xue²*

¹Tobacco Research Institute of Chongqing Company, Chongqing, China
²Tobacco Research Institute, Chinese Academy of Agricultural Sciences, Qingdao, China
³Science and Technology Department of Sichuan Liangshan Company, Liangshan Yi Autonomous Prefecture, Xichang, China
⁴Science and Technology Department of Yunnan Honghe Company, Honghe Hani and Yi Autonomous Prefecture, Mile, China
⁵Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China

Tobacco Mosaic Virus (TMV) and Potato Virus Y (PVY) pose significant threats to crop production. Non-destructive and accurate surveillance is crucial to effective disease control. In this study, we propose the adoption of hyperspectral and machine learning technologies to discern the type and severity of PVY and TMV infection in tobacco leaves. Initially, we applied three preprocessing methods, Multivariate Scattering Correction (MSC), Standard Normal Variate (SNV), and Savitzky-Golay smoothing filter (SavGol), to correct the full-length leaf spectral data (350-2500nm). Subsequently, we employed two classifiers, support vector machine (SVM) and random forest (RF), to establish supervised classification models, including binary classification models (healthy/diseased leaves or PVY/TMV infected leaves) and six-class classification models (healthy and various severity levels of diseased leaves). Based on the core evaluation index, our models achieved accuracies in the range of 91–100% in the binary classification. In general, SVM demonstrated superior performance compared to RF in distinguishing leaves infected with PVY and TMV. Different combinations of preprocessing methods and classifiers have distinct capabilities in the six-class classification. Notably, SavGol combined with SVM gave an excellent performance in the identification of different PVY severity levels with 98.1% average precision, and also achieved a high recognition rate (96.2%) in the classification of different TMV severity levels. The results further highlighted that the effective wavelengths captured by SVM, around 700nm and 1800nm, would be valuable for estimating disease severity levels. Our study underscores the efficacy of integrating hyperspectral technology and machine learning, showcasing their potential for accurate and non-destructive monitoring of plant viral diseases.

1 Introduction

Tobacco Mosaic Virus (TMV) and Potato Virus Y (PVY) cause widespread viral diseases in the field and massive economic losses to crops (Scholthof et al., 2011; Khateri et al., 2014; Korbecka-Glinka et al., 2021). Both PVY and TMV infect a wide range of plants, especially tobacco and other members of the family Solanaceae, causing symptoms such as leaf mosaic, vein clearing, and deformation (McDonald and Singh, 1996; Yang and Klessig, 1996; Quenouille et al., 2013). The key to effective viral disease control is to directly monitor the occurrence and prevalence of diseases. However, the traditional methods of visual or molecular identification are time-consuming, inefficient, and destructive. At the same time, TMV and PVY infections are difficult to tell apart and can develop severe symptoms rapidly under favorable conditions, so the best control period is easily missed (Figure 1B). Therefore, timely automatic identification of disease occurrence and severity in the field would greatly benefit precise prevention by guiding chemical application where and when needed at an appropriate dose, thereby controlling the spread of TMV and PVY in time and avoiding great production loss. Furthermore, automated disease identification can be integrated into innovative disease-resistant plant breeding, expediting phenotyping and saving time compared to visual assessment by human raters.

FIGURE 1

Figure 1 (A) The ASD Field Spec4 equipment (top) and the strategy of spectral data collection in this study (bottom); (B) the PVY/TMV diseased leaves with different severity levels (Wang et al., 2020).

Modern agriculture has benefited greatly from high-tech solutions such as artificial intelligence and machine learning. For example, spectral technology has often been used in precision agriculture to fill gaps in continuous human monitoring. Spectral reflectance captures crop biomass, disease information, and crop quality (Gnyp et al., 2014; Martínez-Martínez et al., 2018). The principle of spectroscopic technology is to identify the content and composition of substances from their different light absorption and reflection characteristics (Fang et al., 2015). In particular, at the onset of PVY and TMV infection, leaf structural characteristics and chlorophyll levels begin to change. These changes in turn alter the reflectance spectrum. Hence, by taking advantage of these fluctuations in the reflectance spectrum, disease detection and monitoring can be carried out in time, avoiding the irretrievable yield loss caused by missing the best control time. Spectral monitoring of crop diseases is fast, non-destructive, and applicable over wide areas. Previous studies have verified that leaf spectral information can effectively monitor and distinguish leaf diseases (Huang et al., 2012; Hu et al., 2016; Kong et al., 2018; Martínez-Martínez et al., 2018; Long et al., 2021; Cai et al., 2022; Fernández et al., 2022). However, few studies have explored its value for the detection of virus diseases.

Furthermore, searching for sensitive wavelength bands is the focus of spectroscopic technology applications. The optimal wavelength of the leaf spectrum can be quickly and accurately located by machine learning (Steddom et al., 2005; Li et al., 2022; Lv et al., 2022). Machine learning algorithms can process large data sets with irregular surfaces, find the potential probability distribution of the data, and make predictions, which can be used for the diagnosis and prediction of crop diseases. Machine learning is mainly divided into supervised and unsupervised learning. Supervised algorithms include support vector machine (SVM), random forest (RF), decision tree, k-nearest neighbors (KNN), linear regression, and other methods often used in classification and regression analysis. Unsupervised algorithms are mainly used for cluster analysis. Most applications have been implemented using supervised rather than unsupervised algorithms. Among the above algorithms, RF trains quickly and has strong generalization ability. The SVM algorithm is suitable for analyzing finite samples, overcoming the shortcomings of some other binary support vector machines, and improving multi-classification accuracy (Uddin et al., 2019).

Machine learning holds immense potential for enhancing accuracy. Hence, the fusion of spectral data and machine learning techniques has been harnessed for leaf disease diagnosis (Lamba et al., 2021). Nevertheless, spectral data often carry inherent noise, making the careful selection of appropriate algorithms and models paramount in the accurate identification of tobacco diseases. By analyzing the full-length spectra (350-2500nm) of healthy and diseased tobacco leaves, this study aims to establish robust and effective classification models for TMV and PVY diseases with two classifiers: support vector machine (SVM) and random forest (RF) (Figure 1). The outcomes of this research will help facilitate early-stage assessment of disease type and severity level, and support scientifically informed strategies for preventing and controlling leaf diseases.

2 Materials and methods

2.1 Test materials and the test equipment

The study was performed in the JiMo experimental area of the Tobacco Research Institute, Chinese Academy of Agricultural Sciences, Qingdao City, China (36.45°N, 120.58°E). The TMV and PVY^N pathogens and the K326 tobacco seedlings were provided by the Plant Protection Institute of the Chinese Academy of Agricultural Sciences.

The test equipment was an ASD Field Spec4 portable handheld ground object spectrometer, which is equipped with a VNIR (350-1000nm) 512-pixel silicon array detector in the visible region, a SWIR1 (1001-1800nm) graded index InGaAs detector, and a SWIR2 (1801-2500nm) graded index InGaAs detector. The acquisition wavelength range is 350-2500nm, the wavelength reproducibility is 0.1nm, and the wavelength accuracy is 0.5nm.

2.2 Test methods

2.2.1 Disease inoculation

The tobacco variety K326 was cultivated to the 7-8 leaf stage under greenhouse conditions (25 ± 1°C, 65% ± 5% relative humidity, and 14:10h light: dark photoperiod). After the leaves were dusted, the healthy leaves were inoculated with TMV or PVY by the mechanical inoculation method according to Piche et al. (2004).

2.2.2 Spectral data acquisition

Parameters of the FieldSpec4 equipment were adjusted according to the usage specifications (Figure 1A). The optical fiber probe had a 5° angle of view and was held 10 cm above the leaf surface for measurement. The lens was first aimed at the whiteboard to optimize the instrument and then moved to the tested leaf to record the leaf reflectance spectrum. The whiteboard optimization was repeated after every ten tobacco plants measured.

Tobacco plants at 7-8 leaf stages were selected for measurement. Six measurement points were picked on each tobacco leaf at the base of the leaf, the middle of the leaf, and the top of the leaf using the leaf vein as the axis of symmetry (Figure 1A). The average value of these six points was taken as the spectral reflectance of the leaf.
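The per-leaf averaging described above can be sketched as follows. This is a minimal illustration with synthetic readings; the array names and the assumption of 2151 wavelength points (350-2500nm at 1nm steps) are ours, not taken from the instrument export.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands = 2151  # assumed: 350-2500 nm sampled at 1 nm steps

# Six synthetic point readings for one leaf (rows = measurement points).
point_readings = 0.4 + rng.normal(scale=0.02, size=(6, n_bands))

# The leaf spectrum is the band-wise mean of the six readings.
leaf_spectrum = point_readings.mean(axis=0)
```

Averaging along the first axis keeps one value per wavelength, so each leaf contributes a single spectrum to the data set.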

2.2.3 Disease data collection

Six disease severity grades were applied in this study according to the disease grading standard GB/T23222-2008, namely, healthy leaf (no symptoms on the whole plant), grade 1 (zero to one-quarter of the leaf is mosaic), grade 3 (one-quarter to one-third of the leaf is mosaic), grade 5 (one-third to a half of the leaf is mosaic, slight deformation or slightly darkened veins), grade 7 (a half to two-thirds of the leaf is mosaic, deformation or vein necrosis), and grade 9 (two-thirds to the whole leaf is mosaic, severe deformation or severe vein necrosis) (Figure 1B).

The spectral reflectance of healthy leaves and of TMV- or PVY-diseased leaves at the five disease grades, grade 1 (TMV1 or PVY1), grade 3 (TMV3 or PVY3), grade 5 (TMV5 or PVY5), grade 7 (TMV7 or PVY7), and grade 9 (TMV9 or PVY9), was collected using the method described in 2.2.2. In total, 893 samples were obtained (Table 1): 286 from TMV-diseased leaves, 456 from PVY-diseased leaves, and 151 from healthy leaves.

TABLE 1

Table 1 Spectrum data collection of tobacco diseased leaf and healthy leaf.

2.3 Data processing

2.3.1 Data preprocessing

The collected spectral data include over 2150 wavelength points. During actual spectral data acquisition, the environmental conditions, sampling time, sampling points, and so on affect the result by inducing scattering and noise. Therefore, before the model was built, the original spectral data were pretreated with three data preprocessing methods: Multivariate Scattering Correction (MSC), Standard Normal Variate (SNV), and Savitzky-Golay smoothing filter (SavGol).

MSC is one of the common methods of hyperspectral data preprocessing, which can effectively eliminate the spectral differences caused by different scattering levels and correct the baseline shift and offset of the spectral data (Windig et al., 2008). The formulas are as follows:

a. The average of all spectral data is taken as the “ideal spectrum”, where Data_{ij} is the reflectance of sample i at wavelength point j and n is the number of samples:

\overline{Data}_{j} = \frac{\sum_{i = 1}^{n} Data_{ij}}{n} \qquad (1)

b. The baseline translation and offset of each sample are obtained by least-squares simple linear regression of the spectrum of each sample on the average spectrum:

Data_{i} = k_{i} \overline{Data} + b_{i} \qquad (2)

c. The spectrum of each sample is corrected by subtracting the fitted intercept b_{i} and dividing by the fitted slope k_{i}:

Data_{i(MSC)} = \frac{Data_{i} - b_{i}}{k_{i}} \qquad (3)
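Steps a to c can be sketched in a few lines of NumPy. This is an illustrative implementation under our own assumptions (the function name `msc`, the array layout, and the toy data are hypothetical), not the authors' code.

```python
import numpy as np

def msc(spectra):
    """Scatter-correct spectra of shape (n_samples, n_wavelengths)."""
    spectra = np.asarray(spectra, dtype=float)
    # Step a: mean spectrum over all samples (the "ideal spectrum").
    mean_spectrum = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Step b: least-squares fit row ~ k_i * mean_spectrum + b_i.
        k_i, b_i = np.polyfit(mean_spectrum, row, deg=1)
        # Step c: subtract the intercept and divide by the slope.
        corrected[i] = (row - b_i) / k_i
    return corrected

# Toy data: three scaled and shifted copies of one synthetic curve.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
toy = np.vstack([1.0 * base, 1.5 * base + 0.3, 0.7 * base - 0.2])
toy_msc = msc(toy)
```

Because every toy spectrum is an exact linear function of the same curve, the correction maps all three rows onto the mean spectrum, which is the behavior the three steps are designed to produce for purely multiplicative and additive scatter effects.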

SNV was used to eliminate the effects of solid particle size, surface scattering, and optical path variation on the near-infrared band (NIR) diffuse reflectance spectra (Dhanoa et al., 1995). The formula is:

X_{i,SNV} = \frac{x_{i,k} - \bar{x}_{i}}{\sqrt{\frac{\sum_{k = 1}^{m} (x_{i,k} - \bar{x}_{i})^{2}}{m - 1}}} \qquad (4)

where x_{i,k} is the reflectance of sample i at wavelength point k; \bar{x}_{i} is the mean of the spectrum of sample i; k = 1, 2, …, m, where m is the number of wavelength points; i = 1, 2, …, n, where n is the number of samples to be corrected; and X_{i,SNV} is the transformed spectrum.
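As a rough sketch, SNV amounts to centering each spectrum on its own mean and scaling by its own sample standard deviation (m − 1 in the denominator). The function name `snv` and the toy values below are illustrative assumptions.

```python
import numpy as np

def snv(spectra):
    """Row-wise standard normal variate for (n_samples, n_wavelengths)."""
    spectra = np.asarray(spectra, dtype=float)
    row_mean = spectra.mean(axis=1, keepdims=True)
    # ddof=1 gives the (m - 1) denominator of the sample standard deviation.
    row_std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - row_mean) / row_std

# Second row is 2 * first row + 0.1, i.e. the same shape with
# a different scale and offset.
toy = np.array([[0.10, 0.20, 0.40, 0.30],
                [0.30, 0.50, 0.90, 0.70]])
toy_snv = snv(toy)
```

Unlike MSC, this transform needs no reference spectrum: each sample is normalized independently, so the two toy rows become identical after the transform.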

SavGol could improve the smoothness of the spectrum, reduce the interference of noise, and ensure that the shape and width of the signal remain unchanged while filtering out the noise (Savitzky and Golay, 1964; Mishra et al., 2019). SavGol smoothing uses polynomial functions to smooth signals. It involves selecting a window, fitting a polynomial to the data within it, and replacing the central point with the polynomial’s value. The window size and polynomial choice are typically manual, based on visual inspection. In our work, we used a second-order polynomial and a 5-point window for smoothing.
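The smoothing settings stated above (second-order polynomial, 5-point window) map directly onto SciPy's `savgol_filter`. The sketch below applies them to a synthetic noisy curve; the 1nm sampling grid and the signal itself are assumptions made for illustration only.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501)  # assumed 1 nm steps, 2151 points
clean = 0.4 + 0.1 * np.sin(wavelengths / 300.0)
noisy = clean + rng.normal(scale=0.01, size=wavelengths.size)

# 5-point window, second-order polynomial, as stated in the text.
smoothed = savgol_filter(noisy, window_length=5, polyorder=2)
```

With such a short window the filter removes only part of the high-frequency noise while leaving broad spectral features essentially untouched, which is consistent with the small visual change reported for SavGol-treated spectra.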

2.3.2 Model establishment

Since spectral data may exhibit multicollinearity, the SVM and RF algorithms were chosen to limit over-fitting. The whole data set was divided into two parts: 80% as the training set and the remaining 20% as the test set, which was used to evaluate the performance of the trained algorithms.
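A random 80/20 split can be sketched as below. The helper `split_indices` is hypothetical, and rounding the test-set size up is our assumption about the convention; with 893 samples it reproduces the 714/179 split reported in the Results.

```python
import math
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a random split."""
    # Assumed convention: round the test-set size up.
    n_test = math.ceil(test_fraction * n_samples)
    order = np.random.default_rng(seed).permutation(n_samples)
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(893)
```

The seed only fixes which samples land in each set; the set sizes depend solely on the sample count and the test fraction.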

The SVM algorithm separates classes with an optimal hyperplane constructed in feature space. Maximizing the classification margin is the core idea of the SVM method. SVM has several parameters, such as the kernel function, the gamma value, and the penalty factor C (Boser et al., 1992; Cortes and Vapnik, 1995). In this study, a linear kernel was used. The optimal model parameters were determined by GridSearch in the training set based on 50% fold cross-validation. The penalty factor C was 1 for SVM + MSC, 40 for SVM + SNV, and 1 for SVM + SavGol.
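A grid search over the penalty factor C for a linear-kernel SVM might look like the following scikit-learn sketch. The candidate values of C, the 5-fold setting, and the synthetic data are all assumptions for illustration; the text does not specify the search grid, and its fold description cannot be mapped unambiguously to a fold count.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for preprocessed spectra: 60 samples, 20 bands.
X = rng.normal(size=(60, 20))
y = np.repeat([0, 1], 30)
X[y == 1, :3] += 3.0  # make the first three bands informative

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [1, 10, 40, 100]},  # hypothetical candidate values
    cv=5,                                # fold count is an assumption
    scoring="accuracy",
)
search.fit(X, y)
best_C = search.best_params_["C"]
```

After fitting, `search.best_estimator_` holds a model refit on the full training data with the selected C, which is the model one would then evaluate on the held-out test set.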

RF is an ensemble algorithm that integrates a large number of decision trees. Each decision tree classifies the input vector, and the final prediction is obtained by aggregating the trees' outputs, with the class determined by the vote of each tree. Therefore, the number of decision trees is the most important parameter affecting RF, and the optimal number of decision trees is determined by GridSearch based on 50% fold cross-validation in the training set (Breiman, 2001; Liaw and Wiener, 2002). The numbers of decision trees were 53 for RF + MSC, 39 for RF + SNV, and 25 for RF + SavGol.
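The tree-count search can be sketched in the same way with scikit-learn. The candidate values below simply reuse the tree counts reported above; the actual search grid, the fold count, and the synthetic data are assumptions. Note also that scikit-learn's forest averages the trees' class probabilities rather than counting hard votes, a small difference from the textbook description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
# Synthetic three-class stand-in: 90 samples, 20 bands.
X = rng.normal(size=(90, 20))
y = np.repeat([0, 1, 2], 30)
X[y == 1, 0] += 3.0  # band 0 separates class 1
X[y == 2, 1] += 3.0  # band 1 separates class 2

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 39, 53]},  # hypothetical grid
    cv=5,                                       # fold count is an assumption
)
search.fit(X, y)
best_n_trees = search.best_params_["n_estimators"]
forest = search.best_estimator_
predictions = forest.predict(X[:5])
```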

The binary classification tasks are the identification of healthy versus diseased tobacco leaves, and of PVY-diseased versus TMV-diseased leaves. The six-class classification tasks distinguish the TMV grades TMV1, TMV3, TMV5, TMV7, and TMV9 plus healthy leaves, and the PVY grades PVY1, PVY3, PVY5, PVY7, and PVY9 plus healthy leaves. The sample dataset was randomly divided into a training set and a test set at a ratio of 8:2 (details are listed in Table 1). Each algorithm was trained with its best parameters on the training set, and the performance of the trained model was then evaluated on the test set (Figure 2).

FIGURE 2

Figure 2 Workflow of this study.

2.3.3 Model performance evaluation

Evaluation indicators of each classifier: “precision” is the ratio of the number of correctly classified samples of a certain category to the number of samples predicted as this category; “recall” is the ratio of the number of correctly classified samples of a certain category to the real number of samples in the category; “F1 score” is the comprehensive evaluation of precision and recall; “accuracy” is the proportion of all samples that are correctly predicted; “support” is the number of samples (Lamba et al., 2021).

Precision = \frac{TP}{TP + FP} \qquad (5)

Recall = \frac{TP}{TP + FN} \qquad (6)

F1_{score} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP} \qquad (7)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)

TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; FN is the number of false negatives.
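Equations (5) to (8) can be checked with a small worked example. The confusion counts below are hypothetical and are not taken from the paper's tables; the function name is ours.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the four scores of Equations (5)-(8) from confusion counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts: 90 TP, 80 TN, 10 FP, 20 FN.
scores = binary_metrics(tp=90, tn=80, fp=10, fn=20)
# precision = 90/100 = 0.90, recall = 90/110 ≈ 0.818,
# f1 = 180/210 ≈ 0.857, accuracy = 170/200 = 0.85
```

In this example precision exceeds recall because false negatives outnumber false positives, which illustrates why a precision-focused evaluation and an accuracy-focused one can rank models differently.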

As agricultural disease diagnosis places most weight on diagnostic accuracy, the precision score is the main evaluation index in this study. The higher the precision score, the better the performance of the model.

2.4 Data analysis

Three data preprocessing methods and two classifiers were combined to create six algorithmic models (Table 2). All data processing was performed with Python 3.9 and R 4.0.1.
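The six combinations are simply the Cartesian product of the preprocessing methods and the classifiers, as the following short sketch shows (the label format is our own choice):

```python
from itertools import product

preprocessors = ["MSC", "SNV", "SavGol"]
classifiers = ["SVM", "RF"]

# Every preprocessing method paired with every classifier.
combinations = [f"{p}+{c}" for p, c in product(preprocessors, classifiers)]
```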

TABLE 2

Table 2 The average precision (%) of each algorithm combination in different classification models.

3 Results

3.1 Spectral data preprocessing

Mean spectral reflectance curves under different conditions are shown in Figure 3. The spectral reflectance of the various kinds of leaves follows the same overall trend (Figures 3A, E). In general, there are three significant peaks around 780nm, 1250nm, and 1600nm and one valley in the near-infrared band in healthy, PVY-infected, and TMV-infected leaves. At these three peaks, the curves of healthy leaves and of leaves at the various disease levels overlap to different degrees.

FIGURE 3

Figure 3 The leaf reflectance of tobacco leaves. (A) PVY: spectra without any correction; (B) PVY: MSC corrected spectra; (C) PVY: SavGol corrected spectra; (D) PVY: SNV corrected spectra; (E) TMV: spectra without any correction; (F) TMV: MSC corrected spectra; (G) TMV: SavGol corrected spectra; (H) TMV: SNV corrected spectra.

After the different pretreatments, the spectral reflectance changed in different ways. For TMV-infected leaves, SNV (Figure 3C) and MSC (Figure 3D) modified the spectral reflectance significantly in the near-infrared band (NIR), while producing overlaps of spectral reflectance around the 780nm band. Similarly, for PVY-diseased leaves, SNV (Figure 3G) and MSC (Figure 3H) increased the separation in reflectance between different severity levels but reduced the spectral resolution around the 780nm band.

Overall, the MSC and SNV preprocessing methods showed a relatively outstanding ability to improve the spectral resolution in the NIR band. The spectral reflectance after SavGol treatment did not change significantly. In addition, the resolution of PVY-diseased leaves is better than that of TMV-diseased leaves (Figure 3).

3.2 Tobacco leaf disease binary classification

3.2.1 The binary classification of healthy leaf and diseased leaf

The 893 samples were randomly divided into a training set (80%, 714 samples) and a test set (20%, 179 samples) (Table 1). As shown in Tables 2, 3 and Figure 4, in the binary classification of healthy and diseased leaves, all combinations of preprocessing methods and classifiers recognized healthy leaves with precision and accuracy over 93%. The misclassifications of the RF classifier mainly came from healthy leaves mistaken for diseased ones (Figure 4). The recognition precision of the SavGol+RF, SavGol+SVM, and SNV+SVM combinations reached up to 98% (Table 2), so these models could potentially be adopted for the accurate identification of diseased and healthy leaves.

TABLE 3

Table 3 Evaluation index scores of each algorithm model in the binary classification of healthy leaf and diseased leaf.

FIGURE 4

Figure 4 The confusion matrix of each algorithm model in the binary classification of healthy leaves and diseased leaves. A–F: Number of correctly and misclassified samples for six classification models: (A) SavGol+SVM, (B) MSC+SVM, (C) SNV+ SVM, (D) SavGol+RF, (E) MSC+RF, and (F) SNV+RF.

3.2.2 The binary classification of PVY diseased leaf and TMV diseased leaf

A total of 742 samples were used for the classification of PVY-diseased and TMV-diseased leaves, with 590 samples in the training set and 152 samples in the test set. Tables 2, 4 and Figure 5 show that, in the identification of PVY-diseased leaves, the average precision after SavGol pretreatment was the highest at 100%. The recognition precision of TMV-diseased leaves by the MSC+RF combination was lower than 86% (Table 4), and most of the errors were TMV-diseased leaves misjudged as PVY-diseased leaves (Figure 5). The overall classification result was better with the SVM classifier. The SavGol+RF, SavGol+SVM, and MSC+SVM models could greatly help achieve accurate identification of TMV and PVY diseases.

TABLE 4

Table 4 Evaluation index score of each algorithm model in the binary classification of PVY-infected leaf and TMV-infected leaf.

FIGURE 5

Figure 5 The confusion matrix of each algorithm model in the binary classification of PVY diseased leaf and TMV diseased leaf. A–F: Number of correctly and misclassified samples for six classification models: (A) SavGol+SVM, (B) MSC+SVM, (C) SNV+ SVM, (D) SavGol+RF, (E) MSC+RF, and (F) SNV+RF, respectively.

3.3 TMV six-class classification

There are 437 samples for the TMV six-class classification: 349 samples were randomly assigned to the training set and 88 samples to the test set (Table 1). Table 5 and Figure 6 show that most of the models performed excellently in recognizing healthy, TMV1, TMV5, TMV7, and TMV9 leaves. Unlike the mediocre performance of PVY3 identification, the recognition precisions of the TMV3 leaf were relatively poor in all models (Table 5).

TABLE 5

Table 5 Evaluation index score of the six-class classification of TMV diseased leaf.

FIGURE 6

Figure 6 The confusion matrix of each algorithm model in the six-class classification of TMV diseased leaf. (A–F): Number of correctly and misclassified samples for six classification models: (A) MSC+RF, (B) SavGol+SVM, (C) SNV+RF, (D) MSC+SVM, (E) SavGol+RF, and (F) SNV+ SVM.

Both classifiers recognized healthy leaves and TMV5 well, and the misjudgments were mostly concentrated in TMV1 and TMV3. For example, the errors of models using SVM as a classifier were mostly misjudgments of TMV1 to TMV3. The misjudgments of combinations including RF as a classifier were mainly at the TMV1 level (Figure 6). The classification precision of SavGol+RF for different TMV disease grades was the highest, with a rate of 98% (Table 2).

The full spectral analysis revealed that the information contributed by individual bands fluctuated across the spectrum. Figure 7 shows that the effective bands captured by MSC+SVM, SavGol+SVM, and SNV+SVM were concentrated around 1801nm and 1802nm, whereas the effective bands captured by the combinations with the RF classifier were widely dispersed. This suggests that differences in the important bands captured by the two classifiers may be one of the main reasons for their different recognition accuracy.
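The text does not state how the per-wavelength contributions were computed. One common approach, shown here purely as an assumption, is to read the absolute weights of a linear-kernel SVM (`coef_`) and the impurity-based importances of a random forest (`feature_importances_`). The coarse wavelength grid and the synthetic data, in which only the 1800nm band carries class information, are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wavelengths = np.arange(350, 2501, 50)  # coarse hypothetical grid, 44 bands
X = rng.normal(size=(80, wavelengths.size))
y = np.repeat([0, 1], 40)
informative = int(np.where(wavelengths == 1800)[0][0])
X[y == 1, informative] += 5.0  # only this band separates the classes

# Linear SVM: weight magnitude per band (summed over class pairs).
svm = SVC(kernel="linear", C=1).fit(X, y)
svm_weight = np.abs(svm.coef_).sum(axis=0)

# Random forest: impurity-based importance per band.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_weight = rf.feature_importances_

top_svm = int(wavelengths[np.argmax(svm_weight)])
top_rf = int(wavelengths[np.argmax(rf_weight)])
```

On real spectra, neighboring bands are strongly correlated, so a forest can spread importance across many interchangeable bands while a linear model concentrates weight on a few; this is one plausible reading of the dispersed RF bands, not a finding of the sketch itself.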

FIGURE 7

Figure 7 Characteristic wavelength maps of TMV diseased leaf. The spectral signatures of each wavelength were calculated by two algorithms (SVM and RF) with three pretreatment methods (MSC, SNV and SavGol): (A) MSC+SVM, (B) SavGol+SVM, (C) SNV+SVM, (D) MSC+RF, (E) SNV+RF, and (F) SavGol+RF. The height of each peak indicates the contribution of that wavelength to the predictive power of the model.

3.4 PVY six-class classification

A total of 607 samples were collected for the PVY six-class classification analysis, of which 485 samples were randomly assigned to the training set and 122 samples to the test set.

In the results of the PVY six-class classification, the performance of the same preprocessing method and classifier varied greatly among the severity grades. In the recognition of severity levels PVY1 and PVY3, the precision of the models with the SVM classifier was 100%, while RF had poor estimation ability with low precision rates. For the recognition of PVY5, only the models with SavGol pretreatment reached 90%. For PVY7 and PVY9 identification, all models were generally excellent, with precision rates between 96% and 100% (Table 6).

TABLE 6

Table 6 Evaluation index score of each algorithm model in the PVY diseased leaf six-class classification.

Overall, the results showed that the precision rates of the models combined with the SVM classifier were high (Table 6). Their errors centered on misidentifying PVY3 as healthy leaves. The RF classifier mainly misjudged PVY3 and PVY5 (Figure 8B). SavGol+SVM was the best model among the six combinations on our data set, with an average precision of 98%, and could be used to identify different disease grades of PVY (Table 2).

FIGURE 8

Figure 8 The confusion matrix of each algorithm model in the six-class classification of PVY diseased leaf. (A–F): Number of correctly and misclassified samples for six classification models: (A) MSC+SVM, (B) SavGol+SVM, (C) SNV+ SVM, (D) MSC+RF, (E) SavGol+RF, and (F) SNV+RF.

The pattern of captured characteristic bands is similar to that in the TMV experiment. The effective bands of PVY-diseased leaves captured by different treatment combinations were different (Figure 9). The three combinations with better recognition performance were SavGol+SVM, MSC+SVM, and SNV+SVM. For these, the feature bands contributing more information to model building were relatively centralized at 698nm, 699nm, 700nm, and other near-infrared bands. However, the important bands captured by combinations using RF as a classifier are highly dispersed. For SNV+RF, they are far infrared bands such as 2338nm and 962nm, while those captured by SavGol+RF are near-infrared bands such as 826nm, 835nm, and 867nm. For MSC+RF, they are far infrared bands around 2331nm, 1803nm, and 2307nm.

FIGURE 9

Figure 9 Characteristic wavelength maps of PVY diseased leaf. The spectral signatures of each wavelength were calculated by two algorithms (SVM and RF) with three pretreatment methods (MSC, SNV and SavGol): (A) MSC+SVM, (B) SNV+SVM, (C) SavGol+SVM, (D) MSC+RF, (E) SNV+RF, and (F) SavGol+RF. The height of each peak indicates the contribution of that wavelength to the predictive power of the model.

4 Discussion

Spectroscopy has been used in precision agriculture for fast and non-destructive determination of crop disease epidemic situations (Martínez-Martínez et al., 2018; San-Blas et al., 2020; Cai et al., 2022; Fernández et al., 2022; Li et al., 2022; Lv et al., 2022). Machine learning in leaf disease recognition is superior to traditional methods such as regression and clustering analyses (Ingale and Baru, 2019). Combined with automatic extraction of spectral characteristics to improve model precision, machine learning could be better applied in precision agriculture to guide the diagnosis of agricultural diseases. Here, these two technologies (hyperspectral and machine learning) were utilized to diagnose the type and severity of the virus diseases PVY and TMV on tobacco. The optimal model for plant virus disease diagnosis was explored by setting up six combinations of three spectral data preprocessing methods (MSC, SNV, and SavGol) and two machine learning classifiers (SVM and RF).

Overall, all the models have excellent capabilities in the identification of the type and infection severity of TMV/PVY diseased leaves. In the binary classification model, the SVM classifier performed better than RF, with over 97% precision (Table 2). Using an SVM classifier, Fernández et al. (2022) also achieved good separation of yellow powder disease four days after inoculation of cucumber leaves, with an overall accuracy above 95%. For the six-class classification, the SVM classifier also performed well, and the best combination is SavGol+SVM (Table 2). However, our results illustrate that different classifiers have distinct diagnostic precisions for different diseases, which is similar to the diagnosis result of rice diseases using SVM (Sethy et al., 2020). The errors of different classifiers were not completely consistent either (Figures 6, 8), which might be caused by slight deviations from the original manual definition of severity grades.

Lastly, machine learning was also used to automatically extract spectral features (Barradas et al., 2021). The results illustrated that the effective wavelengths for identifying diseases are located in the visible and near-infrared bands. In detail, the characteristic wavelengths for identifying PVY diseases were concentrated in the vicinity of 700nm, which was similar to the sensitive wavelength of other leaf diseases (Long et al., 2021). A previous study discovered a strong correlation between the reflectance of the band near 700nm and chlorophyll content, carotene, and other leaf pigments (Chen et al., 2012). Besides, when the leaf is infected by disease pathogens, the visible region shows higher reflectance (Shi et al., 2009; Sankaran and Ehsani, 2013; Wang et al., 2014). The characteristic wavelength of TMV-diseased leaves captured by SVM is near 1800nm. The near-infrared band is related to the internal structure and dry matter of leaves. Therefore, changes in the internal structure of leaves infected by the virus will change the spectral reflectance. However, most studies focused on the spectral wavelength range of 380–1023nm and found only a similar effective wavelength of around 700nm (Xie et al., 2015; Zhu et al., 2017; Sethy et al., 2020). Here, the full-length spectra study revealed potential candidate effective bands in the NIR region. Based on the captured effective bands, it is clear that even though the spectral reflectance resolution became higher in the NIR after MSC and SNV treatment, the identification abilities of the models with MSC or SNV were still lower than those of models with SavGol.

In other words, different diseases can produce different spectral features, while different classifiers may capture different feature bands for the same disease, leading to inconsistent classification accuracies (Kong et al., 2018). In our results, all the combinations using SVM as a classifier captured the same effective bands for the same disease, while the combinations including the RF classifier varied greatly in the effective wavelengths captured for leaves with the same disease. This may be one reason that the RF classifier underperformed in this scenario.

5 Conclusions

In precision plant disease management, early detection plays a pivotal role in guiding timely interventions and preventing potential production losses. Identifying the disease type is invaluable for selecting proper control strategies and expediting the breeding process. Moreover, the rapid development of tobacco viral diseases in the field makes it challenging to keep their impact minimal over time. Thus, assessing infection severity becomes crucial in choosing the appropriate intensity of control efforts. This study provides a comprehensive investigation of rapid, non-invasive diagnostic models for the type and severity grade of two important viral diseases, TMV and PVY, and further explores the optimal classification models. These classifiers categorize the severity of diseased tobacco leaves by capturing feature wavelengths, paving the way for future large-scale promotion and application. Take, for instance, unmanned aerial vehicle (UAV)-based hyperspectral platforms, which have gained significant prominence for detecting plant diseases owing to their light weight, flexibility, and ease of operation (Vanegas et al., 2018). Nevertheless, there are still certain limitations to consider. Specifically, the acquisition cost of hyperspectral data remains relatively high in terms of both financial and labor expenses. External factors, such as measurement timing, light intensity, and solar altitude angle, can also influence image quality during data collection.

To sum up, all the classification models examined in this study demonstrated commendable performance in distinguishing PVY and TMV diseases in tobacco. Moreover, the SVM classifier outperformed RF in the binary and six-class classification of PVY- and TMV-diseased tobacco leaves. Additionally, the combination of the SavGol preprocessing method and the SVM classifier yielded precision rates exceeding 96% across all classification tasks. The feature wavelengths captured by SVM, specifically the 700nm band for PVY-diseased leaves and the 1800nm band for TMV-diseased leaves, hold significant promise for developing PVY and TMV classification models for future large-scale monitoring, such as UAV-based spectral detection. In short, the integration of hyperspectral technology and machine learning offers a promising avenue for the early detection of PVY- and TMV-diseased leaves and thus for effective crop management.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

Author contributions

XW, HC, YH, JW, and WX designed the experiments. HC, YH, YL, DL, LJ, KH, and HW performed the experiments. HC, YH, XW, JW, and WX analyzed the data. YH and HC wrote the manuscript with input from LG, XW, JW, and WX. All authors contributed to the article and approved the submitted version.

Funding

This research was funded by the Science and Technology Project of China (110202101026 (LS-01) and 110202201051 (SJ-01)), the Science and Technology Project of Chongqing (B2020NY1336), the Science and Technology Project of Sichuan (SCYC202215), the Science and Technology Project of Yunnan Province (2021530000241020), and the Science and Technology Planning Project of Inner Mongolia Autonomous Region (2021GG0341). This research was partially supported by the Key Laboratory of Tobacco Pest Monitoring Controlling & Integrated Management (KLTPMIMT2022-18).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Barradas, A., Correia, P. M. P., Silva, S., Mariano, P., Pires, M. C., Matos, A. R., et al. (2021). Comparing machine learning methods for classifying plant drought stress from leaf reflectance spectra in Arabidopsis thaliana. Appl. Sci. 11, 1–15. doi: 10.3390/app11146392

Boser, B. E., Guyon, I. M., Vapnik, V. N. (1992). Training algorithm for optimal margin classifiers. Proc. Fifth Annu. ACM Work. Comput. Learn. Theory 144–152. doi: 10.1145/130385.130401

Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

Cai, W., Zhang, Y., Liu, H., Zheng, H., Cheng, T., Tian, T., et al. (2022). Early detection on wheat canopy powdery mildew with hyperspectral imaging. China Acad. J. 55, 1110–1126. doi: 10.3864/j.issn.0578-1752.2022.06.005

Chen, B., Li, S., Wang, K., Zhou, G., Bai, J. (2012). Evaluating the severity level of cotton Verticillium using spectral signature analysis. Int. J. Remote Sens. 33, 2706–2724. doi: 10.1080/01431161.2011.619586

Cortes, C., Vapnik, V. (1995). Support-vector networks. Mach. Learn. 20, 273–297. doi: 10.1007/BF00994018

Dhanoa, M. S., Lister, S. J., Barnes, R. J. (1995). On the scales associated with near-infrared reflectance difference spectra. Appl. Spectrosc. 49, 765–772. doi: 10.1366/00037029539646

Fang, X., Gao, J., Xie, C., Zhu, F., Huang, L., He, Y. (2015). Review of crop canopy spectral information detection technology and methods. Spectrosc. Spectr. Anal. 35, 1949–1955. doi: 10.3964/j.issn.1000-0593(2015)07-1949-07

Fernández, C. I., Leblon, B., Wang, J., Haddadi, A., Wang, K. (2022). Cucumber powdery mildew detection using hyperspectral data. Can. J. Plant Sci. 102, 20–32. doi: 10.1139/cjps-2021-0148

Gnyp, M. L., Miao, Y., Yuan, F., Ustin, S. L., Yu, K., Yao, Y., et al. (2014). Hyperspectral canopy sensing of paddy rice aboveground biomass at different growth stages. Field Crops Res. 155, 42–55. doi: 10.1016/j.fcr.2013.09.023

Hu, Y., Ping, X., Xu, M., Shan, W., He, Y. (2016). Study on diagnosis of potato leaf late blight by hyperspectral technique. Spectrosc. Spectr. Anal. 36, 515–519.

Huang, L. S., Zhao, J. L., Zhang, D. Y., Yuan, L., Dong, Y. Y., Zhang, J. C. (2012). Identifying and mapping stripe rust in winter wheat using multi-temporal airborne hyperspectral images. Int. J. Agric. Biol. 14, 697–704. doi: 10.1109/ICCCNT45670.2019.8944556

Ingale, S., Baru, V. B. (2019). Plant leaf disease detection recognition using machine learning. Int. J. Eng. Res. Technol. 8, 1179–1182. doi: 10.1109/ICCCNT45670.2019.8944556

Khateri, H., Winter, S., Dizadji, A. (2014). Generation of transgenic tobacco plants with immunity against a broad spectrum of Potato virus Y strains. Iran. J. Plant Prot. Sci. 45, 229–239.

Kong, W., Zhang, C., Cao, F., Liu, F., Luo, S., Tang, Y., et al. (2018). Detection of Sclerotinia stem rot on oilseed rape (Brassica napus L.) leaves using hyperspectral imaging. Sensors (Switzerland) 18, 1–15. doi: 10.3390/s18061764

Korbecka-Glinka, G., Przybyś, M., Feledyn-Szewczyk, B. (2021). A survey of five plant viruses in weeds and tobacco in Poland. Agronomy 11, 1–8. doi: 10.3390/agronomy11081667

Lamba, M., Gigras, Y., Dhull, A. (2021). Classification of plant diseases using machine and deep learning. Open Comput. Sci. 11, 491–508. doi: 10.1515/comp-2020-0122

Li, Y., Ma, Y., Liu, M., Sun, Z., Fu, C., Li, Z. (2022). Combination of near-infrared spectroscopy and partial least squares discriminant analysis in detecting the quality of Panax notoginseng. J. Food Saf. Qual. 13, 3923–3929. doi: 10.19812/j.cnki.jfsq11-5956/ts.2022.12.043

Liaw, A., Wiener, M. (2002). Classification and regression by randomForest. R News 2, 18–22.

Long, T., Li, J., Long, Y., Yan, X., Zhao, J. (2021). Spectral response and intelligent classification of wheat leaves under powdery mildew stress. J. South China Agric. Univ. 42, 86–93. doi: 10.7671/j.issn.1001-411X.202009001

Lv, Y., Lv, W., Han, K., Tao, W., Zheng, L., Weng, S., et al. (2022). Determination of wheat kernels damaged by Fusarium head blight using monochromatic images of effective wavelengths from hyperspectral imaging coupled with an architecture self-search deep network. Food Control 135, 108819. doi: 10.1016/j.foodcont.2022.108819

Martínez-Martínez, V., Gomez-Gil, J., Machado, M. L., Pinto, F. A. C. (2018). Leaf and canopy reflectance spectrometry applied to the estimation of angular leaf spot disease severity of common bean crops. PLoS One 13, 1–18. doi: 10.1371/journal.pone.0196072

McDonald, J. G., Singh, R. P. (1996). Host range, symptomology, and serology of isolates of potato virus Y (PVY) that share properties with both the PVYN and PVYO strain groups. Am. Potato J. 73, 309–315. doi: 10.1007/BF02855210

Mishra, P., Karami, A., Nordon, A., Rutledge, D. N., Roger, J. M. (2019). Automatic de-noising of close-range hyperspectral images with a wavelength-specific shearlet-based image noise reduction method. Sensors Actuators B Chem. 281, 1034–1044. doi: 10.1016/j.snb.2018.11.034

Piche, L. M., Singh, R. P., Nie, X., Gudmestad, N. C. (2004). Diversity among Potato virus Y isolates obtained from potatoes grown in the United States. Phytopathology 94, 1368–1375. doi: 10.1094/PHYTO.2004.94.12.1368

Quenouille, J., Vassilakos, N., Moury, B. (2013). Potato virus Y: A major crop pathogen that has provided major insights into the evolution of viral pathogenicity. Mol. Plant Pathol. 14, 439–452. doi: 10.1111/mpp.12024

San-Blas, E., Paba, G., Cubillán, N., Portillo, E., Casassa-Padrón, A. M., González-González, C., et al. (2020). The use of infrared spectroscopy and machine learning tools for detection of Meloidogyne infestations. Plant Pathol. 69, 1589–1600. doi: 10.1111/ppa.13246

Sankaran, S., Ehsani, R. (2013). Comparison of visible-near infrared and mid-infrared spectroscopy for classification of Huanglongbing and citrus canker infected leaves. Agric. Eng. Int. CIGR J. 15, 75–79.

Savitzky, A., Golay, M. J. E. (1964). Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1639. doi: 10.1021/ac60214a047

Scholthof, K. B. G., Adkins, S., Czosnek, H., Palukaitis, P., Jacquot, E., Hohn, T., et al. (2011). Top 10 plant viruses in molecular plant pathology. Mol. Plant Pathol. 12, 938–954. doi: 10.1111/j.1364-3703.2011.00752.x

Sethy, P. K., Barpanda, N. K., Rath, A. K., Behera, S. K. (2020). Deep feature based rice leaf disease identification using support vector machine. Comput. Electron. Agric. 175, 105527. doi: 10.1016/j.compag.2020.105527

Shi, J., Liu, Z., Zhang, L., Zhou, W., Huang, J. (2009). Hyperspectral recognition of rice damaged by rice leaf roller based on Support Vector Machine. China J. Rice Sci. 23, 331–334.

Steddom, K., Bredehoeft, M. W., Khan, M., Rush, C. M. (2005). Comparison of visual and multispectral radiometric disease evaluations of Cercospora leaf spot of sugar beet. Plant Dis. 89, 153–158. doi: 10.1094/PD-89-0153

Uddin, S., Khan, A., Hossain, M. E., Moni, M. A. (2019). Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19, 1–17. doi: 10.1186/s12911-019-1004-8

Vanegas, F., Bratanov, D., Powell, K., Weiss, J., Gonzalez, F. (2018). A novel methodology for improving plant pest surveillance in vineyards and crops using UAV-based hyperspectral and spatial data. Sensors (Switzerland) 18, 1–21. doi: 10.3390/s18010260

Wang, J., Lili, S., Xiufang, W., Fenglong, W., Ren, Guangwei, Jinguang, Y., et al. (2020). Classification Aty Levels of Pests and Diseases on Tobacco. (Beijing: China Agricultural Science and Technology Press).

Wang, X., Ran, L., Peng, P., Cui, Z. (2014). Analysis of the hyperspectral characteristics of tea leaves under anthracnose disease stress. Plant Prot. 40, 13–17.

Windig, W., Shaver, J., Bro, R. (2008). Loopy MSC: A simple way to improve multiplicative scatter correction. Appl. Spectrosc. 62, 1153–1159. doi: 10.1366/000370208786049097

Xie, C., Shao, Y., Li, X., He, Y. (2015). Detection of early blight and late blight diseases on tomato leaves using hyperspectral imaging. Sci. Rep. 5, 1–11. doi: 10.1038/srep16564

Yang, Y., Klessig, D. F. (1996). Isolation and characterization of a tobacco mosaic virus-inducible myb oncogene homolog from tobacco. Proc. Natl. Acad. Sci. U. S. A. 93, 14972–14977. doi: 10.1073/pnas.93.25.14972

Zhu, H., Chu, B., Zhang, C., Liu, F., Jiang, L., He, Y. (2017). Hyperspectral imaging for presymptomatic detection of tobacco disease with successive projections algorithm and machine-learning classifiers. Sci. Rep. 7, 1–12. doi: 10.1038/s41598-017-04501-2

Keywords: precision agriculture, virus diseases, machine learning, hyperspectral, nondestructive

Citation: Chen H, Han Y, Liu Y, Liu D, Jiang L, Huang K, Wang H, Guo L, Wang X, Wang J and Xue W (2023) Classification models for Tobacco Mosaic Virus and Potato Virus Y using hyperspectral and machine learning techniques. Front. Plant Sci. 14:1211617. doi: 10.3389/fpls.2023.1211617

Received: 27 April 2023; Accepted: 03 October 2023;
Published: 16 October 2023.

Edited by:

Kioumars Ghamkhar, AgResearch Ltd, New Zealand

Reviewed by:

Jingli Lu, AgResearch Ltd, New Zealand
Michael Gomez Selvaraj, Consultative Group on International Agricultural Research (CGIAR), United States

Copyright © 2023 Chen, Han, Liu, Liu, Jiang, Huang, Wang, Guo, Wang, Wang and Xue. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jie Wang, d2FuZ2ppZUBjYWFzLmNu; Wenxin Xue, eHVld2VueGluQGNhYXMuY24=

†These authors have contributed equally to this work and share first authorship.
