Dataset of suspicious phishing URL detection

Tamal, Maruf Ahmed; Islam, Md Kabirul; Bhuiyan, Touhid; Sattar, Abdus

doi:10.3389/fcomp.2024.1308634

DATA REPORT article

Front. Comput. Sci., 06 March 2024

Sec. Computer Security

Volume 6 - 2024 | https://doi.org/10.3389/fcomp.2024.1308634

This article is part of the Research Topic Digital Transformation and Cybersecurity Challenges

Maruf Ahmed Tamal¹*

Md Kabirul Islam²

Touhid Bhuiyan¹

Abdus Sattar¹

¹Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
²Faculty of Graduate Studies, Daffodil International University, Dhaka, Bangladesh

1 Introduction

The contemporary world is witnessing a transformative shift driven by technological advancement. As of October 2023, there were 5.3 billion Internet users globally, comprising 65.7 percent of the world's population (Internet and Social Media Users in the World 2023 | Statista, 2023). This rapid growth of the Internet has brought about significant transformations in traditional systems and people's daily lives (Hoehe and Thibaut, 2020). However, alongside this progress, suspicious online activities have increased alarmingly, and phishing in particular has become a serious threat. Phishing is a form of cyber-enabled crime that uses social engineering and technical subterfuge to deceive individuals into divulging confidential information (Ejaz et al., 2023). Unlike other cybercrimes with consistent victim profiles and known attacker motives, phishing attacks are characterized by their diverse targets, motivations, and goals.

To combat phishing attacks, two types of approaches are commonly adopted: (1) preventive approaches (Daengsi et al., 2021; Quinkert et al., 2021; Alahmari et al., 2022), and (2) detective approaches (Chiew et al., 2015; Rao and Pais, 2017; Aljofey et al., 2022). Preventive approaches focus on educating individuals to raise awareness of phishing attacks, whereas detective approaches leverage technical measures such as list-based, rule-based, similarity-based, and machine learning (ML)-based methods. Among these, ML-based approaches have been the most extensively utilized by scholars and security experts globally. Treating phishing detection as a binary classification problem, both supervised (Nagaraj et al., 2018; Sahingoz et al., 2019; Zamir et al., 2020) and deep learning algorithms (Dhanavanthini and Chakkravarthy, 2023) have been employed to differentiate phishing sites from legitimate ones. However, none of these approaches is a “silver bullet” against phishing (Gupta et al., 2016). The dynamic and sophisticated nature of phishing attacks has made phishing detection a pressing challenge for both end-users and security experts. Phishers continuously evolve their tactics, seeking new and creative ways to bypass existing anti-phishing tools. Consequently, phishing has become one of the most organized and challenging cybercrimes of the 21st century. As reported by the Anti-Phishing Working Group (APWG), 1,270,883 unique phishing attacks took place in the 3rd quarter of 2022, the worst quarter APWG had ever recorded (APWG | Phishing Activity Trends Reports, 2022). This rising trend underscores the limitations of current anti-phishing methods, particularly their inability to detect zero-hour attacks and their lack of robustness. Existing resources and countermeasures remain inadequate for detecting and preventing these attacks.

One of the most significant challenges hindering the development of robust and effective ML-based phishing detection systems is the lack of a comprehensive and up-to-date labeled training dataset (Catal et al., 2022; Salloum et al., 2022; Zieni et al., 2023). Because ML models rely heavily on labeled data to learn the distinguishing characteristics of phishing attacks, this scarcity significantly hinders the development of data-driven approaches for designing effective anti-phishing tools.

To address this gap, this article introduces a new, large-scale labeled dataset specifically designed for URL-based phishing detection. This dataset comprises 247,950 instances, categorized into 128,541 phishing URLs and 119,409 legitimate URLs (see full specification in Table 1). Instead of content-based aspects such as text, messages, DOM, CSS, and logos, this dataset focuses solely on intra-URL features. This choice leverages the fact that many phishing red flags are readily apparent within the URL itself, including typosquatting, unusual extensions, subdomains mimicking legitimate brands, and excessive parameters. URLs can therefore reveal patterns and anomalies indicative of phishing attempts. To extract the most discriminatory features from URLs, we employed the Optimal Feature Vectorization Algorithm (OFVA). This approach yielded 42 optimal intra-URL features. These features demonstrate high efficacy in classifying phishing URLs, contributing to the advancement of data-driven anti-phishing techniques. The availability of this extensive dataset is expected to assist security experts, practitioners, and researchers in developing more sophisticated, resilient, and effective solutions for combating phishing attacks.

Table 1. Data specification table.

2 Value of the data

• The scarcity of large labeled datasets has been a significant challenge in developing robust and effective anti-phishing tools. This dataset addresses that gap by providing a large number of labeled instances, consisting of both phishing and legitimate URLs.

• The dataset can be used for phishing URL detection using supervised machine learning and deep learning algorithms.

• This dataset can benefit various stakeholders, especially security experts, practitioners, and researchers in the cybersecurity domain by enabling them to stay up-to-date on evolving phishing attacks, advance anti-phishing research, and design sophisticated data-driven anti-phishing solutions for combating phishing attacks.

• The dataset can be utilized to gain insights and develop experiments in phishing detection, including training machine learning models, analyzing intra-URL feature significance and relevance, improving classification performance, developing tailored feature engineering techniques, and exploring model generalization to new phishing attack patterns.

3 Experimental design, materials, and methods

The phishing detection dataset was prepared in three key phases, depicted in Figure 1.

Figure 1. Step-by-step process of data preparation.

3.1 Dataset acquisition

In the first phase, raw unstructured phishing and legitimate URLs were acquired from reputable, publicly available sources and merged, following strategies similar to those of previous studies. Of the 274,446 URLs gathered (before preprocessing), 48,009 legitimate URLs and 48,009 phishing URLs were obtained from Aalto University's research data (Marchal, 2014), 86,491 phishing URLs were collected from OpenPhish (OpenPhish, n.d.), and 91,937 legitimate URLs were collected from DomCop (Top 10 million Websites Based on Open Data from Common Crawl and Common Search, n.d.). These URLs were in their original form (e.g., https://www.facebook.com/), without the structure or organization needed for analysis. All data were collected between 01/03/2022 and 31/05/2023.

3.2 Dataset preprocessing

3.2.1 Feature generation

In the second phase, unstructured raw URLs (strings) were first transformed into semi-structured components (scheme, network location, path, etc.) using the “urllib.parse” Python module (urllib.parse - Parse URLs into components, n.d.). Subsequently, a list of 41 features was extracted to generate a feature vector (x = F₁, F₂, F₃, …, F₄₁) for each URL, creating a labeled dataset using a self-developed Optimal Feature Vectorization Algorithm (OFVA) (see Figure 2). The key purpose of the OFVA was to extract the optimal intra-URL features from a given semi-structured URL list (see Phase 2 of Figure 1). Table 2 lists the extracted features with detailed explanations. Among the 41 features, 31 features (F₁−F₂, F₄−F₂₁, F₂₅−F₂₆, F₃₀−F₃₃, F₃₅−F₃₉) were extracted based on the findings of prior studies (Jeeva and Rajsingh, 2016; Singh, 2020; Vrbančič et al., 2020; Mourtaji et al., 2021). These features capture known red flags related to the URL, host, domain, sub-domain, path, query, and network location components. While adopting these features, we performed a few feature removals, modifications, and adjustments to optimize their relevance and improve the overall performance of our feature set. These modifications were informed by an analysis of current phishing trends and emerging threat vectors. Additionally, recognizing the evolving nature of phishing tactics, we introduced 10 novel features (F₃, F₂₂−F₂₄, F₂₇−F₂₉, F₃₄, F₄₀−F₄₁) (for details, see Table 2). These features capture nuanced aspects that are not traditionally considered in feature sets, providing a unique contribution to the anti-phishing tool landscape.
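The parsing and vectorization step can be illustrated with a minimal sketch. It is not the OFVA itself: the function name `url_features` and the feature names are hypothetical stand-ins, and only a handful of simple counts and Boolean flags are computed rather than the features defined in Table 2. The subdomain count assumes the registered domain is the last two host labels, which does not hold for suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Illustrative intra-URL features (hypothetical names, not Table 2's)."""
    parts = urlparse(url)          # scheme, netloc, path, query, fragment
    host = parts.hostname or ""    # network location without port/userinfo
    labels = host.split(".") if host else []
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "domain_length": len(host),
        # Naive assumption: the last two labels form the registered domain.
        "subdomain_count": max(len(labels) - 2, 0),
        "has_path": parts.path not in ("", "/"),
        "has_query": bool(parts.query),
        "has_fragment": bool(parts.fragment),
    }


features = url_features("https://www.facebook.com/")
suspicious = url_features("http://login-paypal.example.com/a?b=1#c")
```

Each URL maps to one dictionary, so a list of such dictionaries can be written out directly as the rows of a feature table.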

Figure 2. Optimal feature vectorization algorithm (OFVA).

Table 2. Feature description.

3.2.2 Data cleansing and curation

After feature generation, data cleansing and curation were performed. Because the data were obtained from multiple sources, duplicate URLs were possible. The data cleansing phase therefore removed a total of 9,725 duplicate URLs. Moreover, to maintain the robustness of the dataset, outlier detection techniques were employed, focusing particularly on the interquartile range (IQR) (Mohr et al., 2022) and box plot analysis (McGill et al., 1978). The aim was to identify and address outliers, with specific attention given to URL length as the key variable. Using the IQR method, data points that fell outside the acceptable range were flagged as outliers. A total of 16,771 such outliers were identified and subsequently removed from the dataset. This process is illustrated in Figures 3–6.
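A minimal sketch of the two cleansing steps follows, under stated assumptions. The text does not give the fence multiplier or the quartile method, so the conventional 1.5 × IQR fences and the default method of `statistics.quantiles` are assumed here; the function names are hypothetical.

```python
import statistics


def drop_duplicates(urls):
    """Remove repeated URLs while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR (k=1.5 is assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_length_outliers(urls, k=1.5):
    """Keep only URLs whose length lies within the IQR fences."""
    low, high = iqr_bounds([len(u) for u in urls], k)
    return [u for u in urls if low <= len(u) <= high]


# Toy data: ten short URLs, one duplicate, and one extremely long URL.
raw = ["http://a.com/" + "x" * i for i in range(10)]
raw += [raw[0], "http://a.com/" + "x" * 500]
cleaned = remove_length_outliers(drop_duplicates(raw))
```

In the toy data the duplicate and the 513-character URL are dropped, leaving the ten short URLs. Applying the filter separately to the phishing and legitimate subsets would match the per-class distributions shown in Figures 3–6.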

Figure 3. Distribution of phishing URLs (with outliers).

Figure 4. Distribution of phishing URLs (without outliers).

Figure 5. Distribution of legitimate URLs (with outliers).

Figure 6. Distribution of legitimate URLs (without outliers).

3.3 Final dataset

After data cleansing and outlier removal, the final dataset was uploaded to Mendeley Data [66] and made publicly accessible. The final dataset comprises 247,950 records (phishing URLs = 128,541, legitimate URLs = 119,409).
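The record counts reported in Sections 3.1–3.3 can be cross-checked with simple arithmetic. The short sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Raw URLs per source, as reported in Section 3.1.
raw_total = 48_009 + 48_009 + 86_491 + 91_937

# Records removed during cleansing, as reported in Section 3.2.2.
duplicates_removed = 9_725
outliers_removed = 16_771

final_total = raw_total - duplicates_removed - outliers_removed

# The two class sizes reported for the final dataset.
class_total = 128_541 + 119_409

print(raw_total, final_total, class_total)  # 274446 247950 247950
```

The raw total, the total after cleansing, and the sum of the two class sizes are mutually consistent.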

4 Data description

The dataset available in the repository consists of a single CSV file with a total of 247,950 instances. Among these instances, 128,541 are classified as phishing URLs, while 119,409 are classified as legitimate URLs. Table 2 provides a comprehensive overview of the dataset, including 42 features associated with both phishing and legitimate URLs. Here, the target feature in the dataset is the “Type” column, which indicates whether a URL is classified as phishing (1) or legitimate (0). This binary classification nature of the target feature makes the dataset suitable for binary classification tasks. The remaining features are organized based on their distinct characteristics. For instance, the URL-related features, represented by columns F1–F16, offer valuable insights into the URLs. These features provide information such as the length of the URL, the presence of specific characters or symbols (e.g., dots, hyphens, slashes), and the count of digits or special characters within the URL. Most of these features consist of numeric values representing counts or lengths.
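As an illustration of how the target column might be separated from the features, the sketch below reads a tiny in-memory stand-in for the CSV file. Only the “Type” column name comes from the description above; the feature column names, the sample rows, and the helper name `split_features_and_labels` are hypothetical.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the CSV file; feature column names are hypothetical.
sample_csv = io.StringIO(
    "url_length,dot_count,Type\n"
    "25,2,0\n"
    "87,5,1\n"
    "31,2,0\n"
)


def split_features_and_labels(handle, target="Type"):
    """Return (feature rows, labels), with 1 = phishing and 0 = legitimate."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(int(row.pop(target)))
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


X, y = split_features_and_labels(sample_csv)
class_counts = Counter(y)  # class balance of the loaded rows
```

With the real file, the same function could be given an opened file handle in place of `sample_csv`, provided every non-target column is numeric.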

The domain-related features span F16–F24 and focus on attributes associated with the domain within the URL. These attributes include the length of the domain, the presence of dots or hyphens, the occurrence of special characters or digits, and the number of subdomains. These domain-related features incorporate both Boolean values and numeric counts, providing a comprehensive perspective on the characteristics of the domain.

The subdomain-related features (F24–F34) specifically examine the subdomain section of the URL. These features provide information about the presence of dots, hyphens, special characters, and digits within the subdomain. Additionally, these features calculate averages and counts of these elements. The subdomain-related features contribute to a more detailed analysis of the URLs.

Furthermore, the dataset includes a few other features (F35–F39) that determine the presence of a path, query, fragment, and anchor in the URL. These features employ Boolean values to indicate the existence or absence of these components. Lastly, the table incorporates two continuous features (F40 and F41) that calculate the Shannon entropy of the URL and domain, respectively. These features quantify the randomness or complexity of characters within the URL or domain. Higher values of these features indicate a higher degree of entropy.
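For reference, the Shannon entropy of a string is H = −Σ pᵢ log₂ pᵢ, where pᵢ is the relative frequency of the i-th distinct character. The sketch below is one common way to compute it; the base-2 logarithm and character-level frequencies are assumptions, since the text does not specify how F40 and F41 were computed.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (base 2 is an assumption)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)  # occurrences of each distinct character
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy("abcd")        # four equally likely characters
repeated = shannon_entropy("aaaa")       # a single repeated character
url_entropy = shannon_entropy("https://www.facebook.com/")
```

A string of one repeated character has zero entropy, while a string whose characters are all distinct and equally frequent reaches the maximum of log₂(n) bits for n distinct characters, which is why randomly generated hostnames tend to score higher.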

5 Comparison with existing datasets

Table 3 compares the proposed dataset with existing datasets, highlighting its distinctive features for phishing detection. In contrast to the limited datasets presented by Orunsolu et al. (2022) and Aljofey et al. (2022), which consist of 5,041 and 60,252 samples, respectively, the proposed dataset offers a substantially larger volume of data, comprising 247,950 samples. Zouina and Outtaj (2017) and Chiew et al. (2019) present more modest datasets, containing 2,000 and 10,000 samples, respectively. Vrbančič et al. (2020) offers a larger dataset with 88,647 samples; however, it lacks information on novel features and on the preprocessing applied, making it difficult to directly compare its effectiveness. Furthermore, the proposed dataset provides a diverse set of 42 features, including 10 novel features that are absent from other datasets. This feature set spans numeric, Boolean, and continuous data types, creating the potential for more sophisticated and effective phishing detection models.

Table 3. Comparison with existing datasets.

6 Limits and suggestions for future works

While the proposed dataset boasts several strengths, it is crucial to recognize and address its inherent limitations. Firstly, despite the dataset's innovation with 10 novel features, there is a lack of novelty in the approach to dataset preparation. Our methodology aligns with common practices used in the preparation of similar existing datasets. Future efforts should explore alternative approaches to dataset creation to enhance originality. Secondly, in the pursuit of a streamlined, efficient model that prioritizes simplicity, speed, and responsiveness, certain content-related features, such as web images, logos, the Document Object Model (DOM), as well as HTML and CSS structural elements, were deliberately excluded. Although this design decision was made to optimize speed and responsiveness, it is essential to acknowledge that the inclusion of these features could potentially contribute to improved accuracy. To this end, future research endeavors should investigate the impact of incorporating these omitted features, exploring whether their inclusion enhances the overall performance of the model.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://doi.org/10.17632/6tm2d6sz7p.1.

Author contributions

MT: Conceptualization, Writing – original draft. MI: Supervision, Writing – review & editing. TB: Writing – review & editing. AS: Data curation, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

We would like to extend our deepest gratitude to the ICT Division, People's Republic of Bangladesh, for their invaluable support in the successful completion of this study.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alahmari, S., Renaud, K., and Omoronyia, I. (2022). Moving beyond cyber security awareness and training to engendering security knowledge sharing. Inform. Syst. E-Busi. Manage. 21, 123–158. doi: 10.1007/s10257-022-00575-2
