Cross-modal retrieval based on multi-dimensional feature fusion hashing

Ren, Dongxiao; Xu, Weihua

doi:10.3389/fphy.2024.1379873

ORIGINAL RESEARCH article

Front. Phys., 19 June 2024

Sec. Social Physics

Volume 12 - 2024 | https://doi.org/10.3389/fphy.2024.1379873

This article is part of the Research TopicSecurity, Governance, and Challenges of the New Generation of Cyber-Physical-Social SystemsView all 16 articles

Cross-modal retrieval based on multi-dimensional feature fusion hashing

Dongxiao Ren¹*

Weihua Xu²

¹Department of Data Science, School of Sciene, Zhejiang University of Science and Technology, Hangzhou, China
²Department of Digital Finance, Quanzhou Branch of Industrial and Commercial Bank of China, Quanzhou, China

Along with the continuous breakthrough and popularization of information network technology, multi-modal data, including texts, images, videos, and audio, is growing rapidly. We can retrieve different modal data to meet our needs, so cross-modal retrieval has important theoretical significance and application value. In addition, because the data of different modalities can be mutually retrieved by mapping them to a unified Hamming space, hash codes have been extensively used in the cross-modal retrieval field. However, existing cross-modal hashing models generate hash codes based on single-dimension data features, ignoring the semantic correlation between data features in different dimensions. Therefore, an innovative cross-modal retrieval method using Multi-Dimensional Feature Fusion Hashing (MDFFH) is proposed. To better get the image’s multi-dimensional semantic features, a convolutional neural network, and Vision Transformer are combined to construct an image multi-dimensional fusion module. Similarly, we apply the multi-dimensional text fusion module to the text modality to obtain the text’s multi-dimensional semantic features. These two modules can effectively integrate the semantic features of data in different dimensions through feature fusion, making the generated hash code more representative and semantic. Extensive experiments and corresponding analysis results on two datasets indicate that MDFFH’s performance outdoes other baseline models.

1 Introduction

The swift growth of multimedia data has brought a lot of demand for cross-modal retrieval. With the growing scale of data on the Internet, data types are becoming more and more diversified, including text, images, videos, audio, etc. The data modality that users are interested in is no longer single, and the user retrieval shows a development trend from single modality to cross modalities. Data has different modalities and these expression forms are different, while the semantics behind them may be related to each other and good use of different modal data can facilitate our lives to a certain extent. For instance, when you visit the Great Wall of China, you can retrieve the corresponding text and video introduction through the photos of the Great Wall. The information supplement helps you to quickly familiarize yourself with scenic spots for the first time. Besides the field of daily life, cross-modal retrieval has important applications in many domains such as medicine [1], finance [2], and information security [3]. Therefore, it is an interesting and challenging problem to construct an effective cross-modal retrieval system.

Since the data distributions and feature representations of different modal data are different, they cannot be compared directly. Representation learning can effectively deal with this problem. In such methods, the aim is to learn a function that can transform different modalities into a common feature space [4, 5], where we can compare them directly. Due to the quick expansion of the data scale and the decline of data retrieval efficiency, the hashing codes are applied to cross-modal retrieval tasks [6–8]. This type of method maps high-dimensional features to the Hamming space by transforming data into hash binary codes and uses XOR of hash binary codes to calculate the Hamming distance. Hash binary codes with small Hamming distance have similar original data, and vice versa.

Through many scholars’ research and efforts, cross-modal hashing retrieval has achieved many successes. Specifically, based on artificial features representing the original data, many models [9–14] are proposed, known as traditional cross-modal hashing models. Due to the limitations of handmade features, the retrieval efficiency of such models is hard to further breakthrough. Because of the good performance in feature learning, deep learning has been applied in cross-modal hashing retrieval. For example, deep neural networks can automatically capture the data features and hash functions in Refs. [15–20].

However, existing deep cross-modal hash models usually only pay attention to the single-dimensional semantic features of data and do not fully consider the information complementation between specific features presented by data in different dimensions. Besides, the multi-dimensional fusion of semantic information is more conducive to capturing the semantic correlation of different modal data, thus helping to narrow the semantic gap. So, effective fusing of multi-dimensional semantic features of different modal data is very important in improving cross-modal retrieval. Because of Transformer’s excellent performance in the computer vision field in recent years, we try to use it to better learn the images’ semantic features in different dimensions. Similarly, we construct a text multi-dimensional fusion module in the text network, which learns the text multi-dimensional semantic features. Based on these, we propose a novel method for cross-modal retrieval, which is called Multi-Dimensional Feature Fusion Hashing (MDFFH). Our method has these three characteristics.

• MDFFH constructs multi-dimensional fusion modules in image networks and text networks to learn multi-dimensional semantic features of data, which can effectively complement the semantic features of data in specific dimensions. It is better in semantic relevance, obtained hash codes are more semantic as well.

• Vision Transformer is integrated with a convolutional neural network to form an image multi-dimensional fusion module in MDFFH so the image’s local and global information can be well fused.

• Feature extraction and hash function generation are well integrated into a deep learning framework in MDFFH. Comparative experiments and corresponding analyses on two datasets show that MDFFH is superior to other baseline models.

This paper mainly includes five sections. The related work is introduced in Section 2, MDFFH is given in Section 3, and the experiments and comparative analysis are demonstrated in Section 4. Finally, the conclusion is in Section 5.

2 Related work

Representative cross-modal hashing models: There are two categories in Cross-modal hashing models. If supervised information (such as data tags) needs to be used during model training, this type of model is called an unsupervised model; the other type needs to use supervision information during model training, which is called a supervision model. According to the way they learn features, cross-modal hashing retrieval models are divided into two categories, namely, hand-crafted models and deep network models. Data labels are not used to guide hash codes’ learning in Unsupervised models during model training. For instance, the subspace shared by different modal data is learned and then the correlation between similar different modal data is maximized in Canonical Correlation Analysis (CCA) [21]. Implicit factors of different modal data are learned and unified hash codes are generated based on matrix decomposition in Collective Matrix Factorization Hashing (CMFH) [22]. In latent semantic sparse hashing (LSSH), sparse coding and matrix decomposition are used to capture important structures in images and potential semantics in texts, respectively [23]. Semantic topics and semantic concepts for images and texts are learned and discrete characteristics of different modal data are maintained in Semantic topic multi-modal hashing (STMH) [25]. Cross-Modal Self-Taught Hashing (CMSTH) [24] applies semantic information to detect multimodal topics, and then uses robust matrix decomposition to convert these different modal data into hash codes that are suitable for quantization. Spectral Multimodal Hashing (SMH) [26] uses spectrum analysis of correlation matrices of multi-modal data, learning parameters from the distribution of multi-modal data to get hash codes. On the contrary, supervised models use available data labels to learn more accurate hash features, which is better than unsupervised models in performance. Semantic correlation maximization (SCM) [27] applies nonnegative matrix decomposition and the nearest neighbor preservation algorithm to preserve semantic consistency within modalities and between modalities. Semantic Preserving Hashing (SePH) [28] transforms the semantic matrix into a probability distribution, makes it as close as possible by minimizing the Kullback-Leibler (KL) divergence, and then applies logical regression to learn the hash function of each modal data [29]. Hash functions and binary codes can be learned simultaneously by the data’s similarity matrix with discrete constraints in Enhanced Discrete Multi-modal Hashing (EDMH) [30].

However, the above unsupervised and supervised models all belong to hand-crafted models, which are unable to get the feature relevance between different modal data very well. With the continuous improvement of feature learning, deep neural networks are extensively applied in the cross-modal retrieval field. A deep neural network is introduced into feature learning in Deep Cross-Modal Hashing (DCMH) [31], so the unified model includes feature learning and the generation of hash codes. In Pairwise Relationship Deep Hashing (PRDH) [32], the similarity degree between different modal data is preserved in hash codes while taking into account the similarity between the same modal data. A high-level semantic similarity matrix of continuous values is constructed to guide the learning of hash codes in Deep Multi-level Semantic Hashing (DMSH) [33], which captures the degree of similarity between different modal data. To generate more representative image features, Mask Cross-Modal Hashing (MCMH) [34] effectively combines convolution features with mask features extracted by the Mask R-CNN. Self-supervised adversarial Hashing (SSAH) [35] introduces adversarial loss through the construction of a label network to shorten the distance between image and text distribution, which brings a better retrieval effect. Using cosine distance and Euclidean distance, the same measurement index can accurately reflect the similarity between different modal data in Deep Semantic Cross-Modal Hashing Based on Graph Similarity of Modal-Specific (DCMHGMS) [36]. The distance between similar data can be reduced by constructing ranking alignment loss to unearth the semantic structure between different modal data in Deep Rank Cross-modal Hashing (DRCH) [37, 38]. Semantic weight factors are constructed to guide the optimization of the loss function and obtain better retrieval performance in Multiple Deep neural networks with Multiple labels for Cross-modal Hashing (MDMCH) [39]. A label network is constructed to jointly guide the feature learning of different modal data and innovates discrete optimization strategies to learn hash codes in Deep Discrete Cross-modal Hashing (DDCH) [40]. To increase the correlation between hash codes, Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning (DCHUC) [41] has constructed a new unified joint hash code framework. To improve the accuracy of hash codes in comparative learning, Unsupervised Contrastive Cross-Modal Hashing (UCCH) [42] proposes a momentum optimizer to make the generated hash codes more accurate.

Transformer: The excellent performance of the Transformer is attributed to the exertion of the attention mechanism, and it is widely used in the field of Natural Language Processing (NLP) [43]. It can assign attention weight according to the input data, to determine which part of the data needs attention. On this basis, limited information processing resources are allocated to important parts and so the performance of the model is improved. Google Deep Mind [44] applied it to the computer vision field for the first time and achieved good performance by combining it with Recurrent Neural Network (RNN). Bahdanau et al. [45] prove the effectiveness of attention mechanisms in the NLP. In [46], Google has successfully constructed the Transformer network structure based on the attention mechanism. Due to the limited feature subspace, it is hard to enhance the performance of this ordinary attention mechanism. The multi-head attention mechanism is more likely to capture features from multiple dimensions by dividing attention operations. Inspired by this important achievement, many researchers tried to introduce Transformer structure into computer vision tasks and achieved good results. In 2020, the Vision Transformer (ViT) proposed by Dosovitskiy et al. [47] performed well in many image classification tasks, because it can capture contextual dependencies at different positions in an image. It is simple and effective, with strong scalability. The larger the amount of data, the better the performance of the ViT. When there is enough data for pre-training, the performance of the ViT is even better than that of the convolutional neural network model, which fully proves that ViT can extract excellent features from images.

3 Proposed method

The innovative networks of this paper will be introduced in this section, and the structural framework of MDFFH is shown in Figure 1. To facilitate comparison with other models, images and texts are selected in our model. Our model can be extended to other modalities easily.

Figure 1

Figure 1. The structural framework of MDFFH.

3.1 Notations and problem definitions

Throughout this paper, vectors are denoted by lowercase bold letters (e.g., $z$ ), matrices are represented by uppercase bold letters (e.g., $Z$ ), and the transposition of the matrix $Z$ is expressed as $Z^{T}$ . For the matrix $Z$ , the ith row, the jth column, the element located in ith row and jth column and the Frobenius norm are denoted by $Z_{i *}$ , $Z_{* j}$ , $Z_{i j}$ and ${‖Z‖}_{F}$ , respetively. The sign function represented by $s i g n (x)$ is that the value is −1 when x is less than 0, otherwise, the value is 1.

Assume that $O = {\{O_{n}\}}_{n = 1}^{N}$ denotes the image-text pair dataset, each sample $o_{n} = (x_{n}, y_{n}, l_{n})$ includes three parts: one part $x_{n} \in R^{D_{x}}$ represents an image feature vector, another part $y \in R^{D_{y}}$ denotes a text feature vector, and the last part $l_{n} \in R^{C}$ denotes the corresponding category labels, where $D_{x}$ , $D_{y}$ and $C$ are the dimensions of these two modal data’s feature and the number of the category labels respectively. $S \in {\{0,1\}}^{N \times N}$ is the matrix to measure the similarity degree between different modalities, called the similarity matrix. $S_{i j} = 0$ means that $x_{i}$ and $y_{j}$ are not similar to each other and $S_{i j} = 1$ denotes that these two data have at least one same category label. The input data is transformed into the corresponding hash codes and the similarity degree between different hash codes is obtained by calculating their Hamming distance in our model. The more similar the hash codes, the smaller the Hamming distance; the greater the difference between hash codes, the greater the Hamming distance. The formula for calculating Hamming distance is

d (c_{i}, c_{j}) = \frac{1}{2} (k - ⟨c_{i}, c_{j}⟩), (1)

In Eq. 1, $c_{i}$ and $c_{j}$ are the hash codes for the vector $x_{i}$ and $y_{j}$ , $⟨c_{i}, c_{j}⟩$ represents their inner product and k is the length of hash codes.

MDFFH aims to obtain two hash functions through training, one is $f (x_{i}; θ_{x})$ for images, and the other is $g (y_{j}; θ_{y})$ for texts while maintaining the similarity degree of the original data. Here, $θ_{x}$ and $θ_{x}$ denote parameters in the different networks. These hash functions can convert the data into hash codes with unified dimensions for comparison.

3.2 Network architecture

The specific details of the networks in our model are as follows.

Image network: Image network is mainly composed of an image multi-dimensional fusion module and a fully connected neural network. Specifically, the multi-dimensional image fusion module includes a Vision Transformer network and a convolutional neural network. In the Vision Transformer network, the ViT-B/16 model is chosen as the basic framework and fine-tuned on this basis. We replace the last MLP Head used for the image classification in the ViT-B/16 model with a single-layer completely connected network with 4,096 neurons where the size of each image patch is 16 × 16. The transformer Encoder has 12 Encoder Blocks, which are shown in Figure 2. At the same time, the first six layers of CNN-F [48] are selected as the model of a convolution neural network. In addition, these two networks are pre-rained on ImageNet [49] to obtain initialization parameters. Finally, the output results of these two networks are fused into the multi-dimensional semantic features learned by the image fusion module by vector concatenation. The fully connected neural network has three layers, in which the number of neurons is 8,192, 4,096, and the hash code length in turn.

Figure 2

Figure 2. Detailed introduction of Vision Transformer.

Text network: Bag-of-Words (BoW) is usually used to convert text into vectors, but the sparsity of vectors makes it impossible to fully capture the text’s semantic information. Inspired by [28], we adopt a text multi-dimensional fusion module to solve this problem. The text multi-dimensional fusion module extracts the text semantic features in different dimensions through five average pool layers (the scales are 1a, 2a, 3a, 6a, and 10a, where “a” represents the parameter), and uses 1 × 1 convolution layer to integrate multiple features. At the end of this network, there is a three-layer completely connected network to extract the text’s hash codes and the numbers of neurons in every layer are 4,096, 4,096, and the hash code length.

3.3 Hash code learning

The performance of the cross-modal hashing model depends on whether generated hash codes can effectively reflect the similarity degree between different modalities. Generally speaking, the Hamming distance of hash codes generated by similar original data should be small, and vice versa. To ensure that MDFFH can achieve excellent retrieval performance, we have established an objective function composed of two terms: semantic similarity loss and hashing code quantization loss. We apply $P_{* i} = f (x_{i}; θ_{x})$ to denote the learned feature from the image network, where $θ_{x}$ presents the network parameters. Let $Q_{* i} = g (y_{i}; θ_{y})$ denote the learned feature from the text network, where $θ_{y}$ refers to the network parameters.

To minimize the semantic gap, we transform different modal data to the same common semantic space to measure similarity. Here, the formula of the likelihood function can be written as follows:

p (S_{i j} | P_{* i}, Q_{* j}) = \{\begin{cases} σ (Φ_{i j}), & S_{i j} = 1 \\ 1 - σ (Φ_{i j}), & S_{i j} = 0 \end{cases} (2)

In Eq. 2, $Φ_{i j} = \frac{1}{2} P_{* i}^{T} Q_{* j}$ and $σ (Φ_{i j}) = \frac{1}{1 + e^{- Φ_{i j}}}$ . When $S_{i j} = 1$ , the inner product of $P_{* i}$ and $Q_{* j}$ will be bigger, which is equivalent to that the two data are more similar. On the contrary, the more dissimilar the two data are when $S_{i j} = 0$ .

The maximization of the likelihood function is equal to the maximization of the negative log-likelihood function. To facilitate the training of MDFFH, the above formula can be converted into the following formula:

J_{similarity} = - \sum_{i, j = 1}^{N} (S_{i j} Φ_{i j} - \log (1 + e^{Φ_{i j}})), (3)

where $Φ_{i j} = \frac{1}{2} P_{* i}^{T} Q_{* j}$ .

Since the output of the continuous variables from the network is converted into hash binary codes through symbolic functions, there is a certain quantization loss. Therefore, we set the quantization loss term of hash binary codes to reduce this error:

J_{quantization} = {‖H^{x} - P‖}_{F}^{2} + {‖H^{y} - Q‖}_{F}^{2}, (4)

where $H^{x} = s i g n (P)$ and $H^{y} = s i g n (Q)$ .

From Equations 3, 4, we can get the objective function for optimizing MDFFH as follows:

\begin{aligned} \min_{H, θ_{x}, θ_{y}} J & = J_{similarity} + η J_{quantization} \\ = - \sum_{i, j = 1}^{N} (S_{i j} Φ_{i j} - \log (1 + e^{Φ_{i j}})) \\ + η ({‖H^{x} - P‖}_{F}^{2} + {‖H^{y} - V‖}_{F}^{2}), \end{aligned} (5)

In Eq. 5, $η$ denotes the hyper-parameter of the hash code quantization loss. Inspired by Jiang et al. [31], we set $H = H^{x} = H^{y}$ during model training.

3.4 Optimization

Given the discreteness of hash codes, we apply an alternating learning strategy to optimize MDFFH: at one time, only one parameter is optimized while the rest of the parameters are unchanged. In the optimization process, the model parameters are updated by the back-propagation with stochastic gradient descent (SGD). The optimization steps are shown in Algorithm 1. Generally, it includes three steps:

1. Optimize $θ_{x}$ with $θ_{y}$ and $H$ fixed.

Select any image data $x_{i}$ , and obtain the partial derivative of our objective function as following in Eq. 6:

\frac{\partial J}{\partial P_{* i}} = \frac{1}{2} \sum_{j = 1}^{N} (σ (Φ_{i j}) Q_{* j} - S_{i j} Q_{* j}) + 2 η (P_{* i} - H_{* i}) . (6)

Then through the chain derivation rule, we can get $\frac{\partial J}{\partial θ_{x}}$ from $\frac{\partial J}{\partial P_{* i}}$ and optimize $θ_{x}$ according to BP.

2. Optimize $θ_{y}$ with $θ_{x}$ and $H$ fixed.

Select any data $y_{i}$ , and obtain the derivative of the objective function as following in Eq. 7:

\frac{\partial J}{\partial Q_{* j}} = \frac{1}{2} \sum_{i = 1}^{N} (σ (Φ_{i j}) P_{* i} - S_{i j} P_{* i}) + 2 η (Q_{* j} - H_{* j}) . (7)

Then through the chain derivation rule, we can get $\frac{\partial J}{\partial θ_{y}}$ from $\frac{\partial J}{\partial Q_{* j}}$ and optimize $θ_{y}$ according to BP.

3. Optimize hash codes $H$ .

The objective function can be converted into the formula as follows:

\begin{matrix} \max_{H} t r (H^{T} (η (P + Q))) = t r (H^{T} R) = \sum_{i, j} H_{i j} R_{i j}, \\ s . t . H \in {\{- 1, + 1\}}^{k \times N} \end{matrix} (8)

In Eq. 8 $R = η (P + Q)$ . At last, the hash code matrix H is updated according to the feature matrixes of images and text as following in Eq. 9:

H = s i g n (η (P + Q)) . (9)

3.5 Out-of-sample extension

The hash codes of the data not used for training are generated by the hash functions learned by MDFFH. For example, given the query image $x_{q}$ , we can get its hash codes by the hash function as following in Eq. 10:

h_{q}^{x} = s i g n (f (x_{q}; θ_{x})) (10)

Similarly for text data $y_{q}$ , we can get its hash codes by the hash function as following in Eq. 11:

h_{q}^{y} = s i g n (g (y_{q}; θ_{y})) (11)

4 Experiments

Based on two commonly used data sets, namely, MIRFLICKR-25K [50] and NUS-WIDE [51], we conduct a large number of experiments comparing the results with some representative baselines to verify the validity of our model. It is noted that our model can be easily applied to other similar datasets.

4.1 Datasets

MIRFLICKR-25K [50]: There are 25,000 images from the Flickr website in this dataset, and every image has text descriptions and labels, thus forming data pairs. During the experiment, we only retain 20,015 data pairs, because there are too few text descriptions for some data pairs. For each text description, the Bag-of-Word model is applied to convert it into 1386-dimensional vector form, and the corresponding label is transformed into 24-dimensional vector form. 2000 data pairs are randomly selected for querying and the rest for retrieval. For model training, we select 10,000 data pairs from retrieval.

NUS-WIDE [51]: There are 269,648 data pairs in this dataset, and each includes images, text descriptions, and data labels. There is a total of 81 categories of original data labels in this dataset. We selected 21 of the most common data labels as the experimental dataset and finally retained 195,834 data pairs after processing. Text descriptions and data labels in each data pair are converted into 1,000 and 21-dimensional vector forms through the Bag-of-Word model. The partition of different sets for model training in this dataset is consistent with the MIRFLICKR-25 dataset.

4.2 Evaluation and baselines

Evaluation: For cross-modal retrieval, researchers usually study two typical tasks: retrieving text with images and retrieving images with text.

To evaluate MDFFH’s performance, we select the two most commonly used evaluation criteria, namely, the Precision-Recall (PR) Curve and Mean Average Precision (MAP) [52]. The average accuracy (AP) of any query data is calculated as follows:

A P = \frac{1}{K} \sum_{s = 1}^{M} U (s) V (s), (12)

where $K$ and $M$ are the numbers of retrieved relevant data and the retrieval set, $U (s)$ denotes the proportion of the first $s$ retrieved data related to the query data, and $V (s)$ shows whether the retrieved sth data is related to the query data, which can be judged by the category label. If two data are related, $V (s) = 1$ , otherwise, $V (s) = 0$ . The MAP value can be calculated by averaging the APs of all query data and is positively correlated with model performance.

In addition, the PR curve is another indicator for evaluating the model performance. The performance can be directly judged by drawing a PR curve of this model: if the area under this curve is larger, the model performance is better. Moreover, the corresponding recall and precision can be obtained by altering the Hamming radius and drawing the PR curve.

Baselines: We compare our MDFFH with nine representative models, which are CCA, CMFH, SCM, STMH, SePH, DCMH, DDCH, DCHUC, and UCCH. The first four models belong to hand-crafted models and the rest are deep network models.

4.3 Implementation details

We use PyTorch, which is a deep-learning framework based on dynamic tensors, to implement our MDFFH on the NVIDIA RTX 3090 server and the iteration number is set to 300. In the iteration, the learning rate gradually decreases from 0.03 initialized to 10^–6. The hyper-parameter $η$ is set to 1, and the detailed parameter analysis is in the section Parameter Analysis. For each model result, experiments have been run five times and the average value is obtained as a representative.

4.4 Performance

The MAP scores of MDFFH and nine baseline models based on two general datasets are shown in Table 1, where “I → T” represents from image to retrieve text and “T → I” represents from text to retrieve images. We can find that for hash codes with different lengths, our model is superior to baseline models. For example, when we select the MIRFLICKR-25K dataset, compared with DCMH which is the most representative deep cross-modal hashing model, MDFFH on “I → T” tasks increased by 3.05% on average, and its MAP score on text retrieval image tasks increased by 1.04% on average. On the NUS-WIDE dataset, compared with DCMH, MDFFH’s MAP score on image retrieval text tasks increased by 5.52% on average, and its MAP score on text retrieval image tasks increased by 0.95% on average. In particular, compared with these five hand-crafted baseline models, MDFFH has been greatly improved. This proves that better performance can be achieved by integrating feature learning and the generation of hash codes into a unified end-to-end network. At the same time, MDFFH has a better performance compared with DCMH and DDCH. The reason is that DCMH and DDCH generate hash codes only using single-dimensional semantic features, ignoring the information complementation between multi-dimensional semantic features, which has certain limitations. On the contrary, MDFFH applies the image multi-dimensional fusion module and the text multi-dimensional fusion module to get the multi-dimensional semantic features of different modal data, which can mine richer semantic associations and establish more accurate modal relationships, thus helping to narrow the modal gap to greatly improve the retrieval accuracy.

Table 1

Table 1. MAP scores of different models.

When the hash code length is set to 16 bits, the PR curves of MDFFH and baseline models under MIRFLICKR-25K and NUS-WIDE datasets are demonstrated in Figure 3. For PR curves of different models, which curve has a larger area represents better performance. From this figure, it is clear that the performance of MDFFH outperforms other baselines, which is consistent with the application of MAP as a performance evaluation index.

Figure 3

Figure 3. The PR curves with code length 16. (A) MIRFLICKR-25K. (B) MIRFLICKR-25K. (C) NUS-WIDE. (D) NUS-WIDE.

4.5 Parameter analysis

The influence of hyper-parameter values in the model based on the MIRFLICKR-25K dataset is studied in this section. The hash code length is uniformly 16 bits and the experimental results are shown in Figure 4. The MAP scores of two cross-modal retrieval tasks change with the hyper-parameter. During the manual adjustment of the hyper-parameter, the range of values is 0.01, 0.1, 1, and 2. The experimental results demonstrate the MDFFH performance can reach the best under the setting of $γ = 1$ . The initial values of other network parameters are randomly generated and then determined through network learning.

Figure 4

Figure 4. The sensitivity analysis of the hyper-parameter.

4.6 Ablation study

We have designed one variant and carried out experiments to verify whether the innovative module in MDFFH improves the overall performance. MDFFH-1 is a variant of MOFFH without a Vision Transformer. The variant aims to check the important influence of the innovative image multi-dimensional fusion module on our model’s retrieval performance. Table 2 shows the comparative results. From this table, it is clear that MDFFH’s performance is better than MDFFH-1’s performance on the MIRFLICKR-25K dataset because of the effective role of the image multi-dimensional fusion module. The image multi-dimensional fusion module effectively combines the global image information concerned by the Vision Transformer with the local image information concerned by the convolutional neural network to generate more representative multi-dimensional semantic features. This can more effectively get the semantic similarity between different data to learn more accurate hash mapping functions, and so improve our model performance.

Table 2

Table 2. The MAP scores of MDFFH and its variant.

4.7 Convergence analysis

For analyzing MDFFH’s convergence, experiments are conducted on MIRFLICKR-25K and NUS-WIDE datasets. During the experiment, the hash code length is 16 bits and the relative loss is used as an evaluation criterion. The relative loss of the ith iteration is the ratio of the loss function value of the ith iteration divided by the loss function value of the first iteration and the experimental results are shown in Figure 5. With the number of iterations increasing, the relative loss value decreases rapidly and becomes stable, which means our optimization algorithm is effective.

Figure 5

Figure 5. The convergence curve of MDFFH. (A) MIRFLICKR-25K. (B) NUS-WIDE.

5 Conclusion

A new cross-modal hashing model named MDFFH is proposed from the perspective of multi-dimensional semantic features. The image multi-dimensional fusion module constructed effectively combines the convolutional neural network and Vision Transformer and can generate multi-dimensional semantic features of images with richer semantic information. Similarly, we apply the text multi-dimensional fusion module to generate more representative text multi-dimensional semantic features, which provides a basis for mining richer semantic associations and building more accurate modal relationships, thus making the generated hash code more semantic. Experimental analysis of two general datasets can verify that our MDFFH model improves the performance of cross-modal retrieval. In future work, we will attempt to investigate its applications in the field of multimodal generation, multimodal question answering, and health and medical big data retrieval.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

DR: Conceptualization, Data curation, Formal Analysis, Investigation, Resources, Supervision, Writing–review and editing. WX: Writing–original draft and Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

Author WX was employed by the company Industrial and Commercial Bank of China.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Li Q, Li L, Li Y. Developing ChatGPT for biology and medicine: a complete review of biomedical question answering. Biophys Rep (2024) 9:1–20. doi:10.52601/bpr.2024.240004