Advances in Fault Tolerance for Large-scale HPC Systems

About this Research Topic

Submission deadlines

  1. Manuscript Summary Submission Deadline: 27 April 2025 | Manuscript Submission Deadline: 15 August 2025

  2. This Research Topic is still accepting articles.

Background

High-performance computing (HPC) has played a major role in the modelling and simulation of grand challenge problems across many scientific domains for decades. The significant scientific outcomes it has yielded in these domains have encouraged the community to build ever larger HPC systems for exploring large-scale problems. The advent of the exascale era, marked by the deployment of the Frontier system in 2022, highlights the recent expansion in HPC capabilities. Such systems are built through the intricate integration of a large number of different components, including processors, memory units, accelerators, interconnects, storage systems, and many others, resulting in highly complex architectures. This scale and complexity invariably lead to component faults, whose frequency increases manyfold as system sizes grow. It has been shown that in modern large-scale systems, the mean time between failures of any component can be less than an hour. If not addressed, executions in the presence of these faults lead to application failures, resulting in wasted resources and, as has been shown, up to 20% higher energy consumption. Thus, the HPC community faces a clear and present challenge: sustaining the long-running executions of the grand challenge applications for which larger systems continue to be built. Accordingly, fault tolerance in large-scale HPC systems was identified as one of the top 10 exascale research challenges in the DoE, USA, report published in 2014.

Providing fault tolerance requires a comprehensive set of research methods that address several challenges. Fault tolerance solutions should incur low performance and energy overheads. Mechanisms that rely on fault prediction must cope with the growing difficulty of making predictions on systems that keep evolving, with myriad components of increasing complexity. While early efforts predominantly targeted hardware failures, solutions for software-related faults and silent data corruptions (SDCs) have become increasingly essential in recent years, owing to the demands for accuracy in machine learning and allied fields as well as in numerical methods. As systems are built with smaller transistors and higher circuit density, bit flips and SDCs have become commonplace. In addition to these hardware-related causes, software complexity and varying reliability across inputs have also brought SDCs into sharp focus.

This Research Topic invites papers that present research methodologies on various aspects of fault tolerance for both hardware- and software-related faults. To gather further insights into effective fault management strategies that scale to exascale environments, we welcome articles addressing, but not limited to, the following themes:
• Checkpointing/recovery optimization techniques such as asynchronous and multi-level checkpointing.
• Proactive fault tolerance techniques including live migration and just-in-time checkpointing.
• Alternative techniques including replication strategies and algorithm-based fault tolerance.
• Resilience techniques across diverse platforms such as traditional HPC setups, cloud frameworks, and edge devices.
• Application-specific resilience techniques, particularly for AI applications.
• Best practices for maintenance and usage of failure logs in HPC systems.
• Analyzing failure logs, diagnosing and characterizing failures, correlating failure events.
• Predictive models for component, node, and system failures with sufficient lead times.
• Strategies for the analysis and mitigation of silent data corruptions (SDCs) and other software-related faults, especially in AI applications.

Preference will be given to works that combine several of the above items toward the development of comprehensive fault tolerance frameworks.

Topic Editor Sathish Vadhiyar has a funded project with Shell India focused on parallel frameworks for AI that involves developing fault tolerance for AI models. All other Topic Editors declare no conflicts of interest.

Article types and fees

This Research Topic accepts the following article types, unless otherwise specified in the Research Topic description:

  • Brief Research Report
  • Community Case Study
  • Conceptual Analysis
  • Data Report
  • Editorial
  • Hypothesis and Theory
  • Methods
  • Mini Review
  • Opinion

Articles that are accepted for publication by our external editors following rigorous peer review incur a publishing fee charged to authors, institutions, or funders.

Keywords: Fault tolerance, checkpointing, soft errors or silent data corruption, replication, failure logs and predictions

Important note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.

Frequently asked questions

  • Frontiers' Research Topics are collaborative hubs built around an emerging theme. Defined, managed, and led by renowned researchers, they bring communities together around a shared area of interest to stimulate collaboration and innovation.

    Unlike section journals, which serve established specialty communities, Research Topics are pioneer hubs, responding to the evolving scientific landscape and catering to new communities.

  • The goal of Frontiers' publishing program is to empower research communities to actively steer the course of scientific publishing. Our program was implemented as a three-part unit with fixed field journals, flexible specialty sections, and dynamically emerging Research Topics, connecting communities of different sizes and maturity.

    Research Topics originate from the scientific community. Many of our Research Topics are suggested by existing editorial board members who have identified critical challenges or areas of interest in their field.

  • As an editor, you can use Research Topics to build your journal, as well as your community, around emerging, cutting-edge research. As research trailblazers, Research Topics attract high-quality submissions from leading experts all over the world.

    A thriving Research Topic can potentially evolve into a new specialty section if there is sustained interest and a growing community around it.

  • Each Research Topic must be approved by the specialty chief editor, and it falls under the editorial oversight of our editorial boards, supported by our in-house research integrity team. The same standards and rigorous peer review processes apply to articles published as part of a Research Topic as for any other article we publish.

    In 2023, 80% of the Research Topics we published were edited or co-edited by our editorial board members, who are already familiar with their journal's scope, ethos, and publishing model. All other topics are guest edited by leaders in their field, each vetted and formally approved by the specialty chief editor.

  • Publishing your article within a Research Topic with other related articles increases its discoverability and visibility, which can lead to more views, downloads, and citations. Research Topics grow dynamically as more published articles are added, causing frequent revisiting, and further visibility.

    As Research Topics are multidisciplinary, they are cross-listed in several fields and section journals – increasing your reach even more and giving you the chance to expand your network and collaborate with researchers in different fields, all focusing on expanding knowledge around the same important topic.

    Our larger Research Topics are also converted into ebooks and receive social media promotion from our digital marketing team.

  • Frontiers offers multiple article types, but it will depend on the field and section journals in which the Research Topic will be featured. The available article types for a Research Topic will appear in the drop-down menu during the submission process.

  • Yes, we would love to hear your ideas for a topic. Most of our Research Topics are community-led and suggested by researchers in the field. Our in-house editorial team will contact you to talk about your idea and whether you’d like to edit the topic. If you’re an early-stage researcher, we will offer you the opportunity to coordinate your topic, with the support of a senior researcher as the topic editor. 

  • A team of guest editors (called topic editors) lead their Research Topic. This editorial team oversees the entire process, from the initial topic proposal to calls for participation, the peer review, and final publications.

    The team may also include topic coordinators, who help the topic editors send calls for participation, liaise with topic editors on abstracts, and support contributing authors. In some cases, they can also be assigned as reviewers.

  • As a topic editor (TE), you will take the lead on all editorial decisions for the Research Topic, starting with defining its scope. This allows you to curate research around a topic that interests you, bring together different perspectives from leading researchers across different fields and shape the future of your field. 

    You will choose your team of co-editors, curate a list of potential authors, send calls for participation and oversee the peer review process, accepting or recommending rejection for each manuscript submitted.

  • As a topic editor, you're supported at every stage by our in-house team. You will be assigned a single point of contact to help you on both editorial and technical matters. Your topic is managed through our user-friendly online platform, and the peer review process is supported by our industry-first AI review assistant (AIRA).

  • If you’re an early-stage researcher, we will offer you the opportunity to coordinate your topic, with the support of a senior researcher as the topic editor. This provides you with valuable editorial experience, improving your ability to critically evaluate research articles and enhancing your understanding of the quality standards and requirements for scientific publishing, as well as the opportunity to discover new research in your field, and expand your professional network.

  • Yes, certificates can be issued on request. We are happy to provide a certificate for your contribution to editing a successful Research Topic.

  • Research Topics thrive on collaboration, and their multidisciplinary approach around emerging, cutting-edge themes attracts leading researchers from all over the world.

  • As a topic editor, you can set the timeline for your Research Topic, and we will work with you at your pace. Typically, Research Topics are online and open for submissions within a few weeks and remain open for participation for 6 – 12 months. Individual articles within a Research Topic are published as soon as they are ready.

  • Our fee support program ensures that all articles that pass peer review, including those published in Research Topics, can benefit from open access – regardless of the author's field or funding situation.

    Authors and institutions with insufficient funding can apply for a discount on their publishing fees. A fee support application form is available on our website.

  • In line with our mission to promote healthy lives on a healthy planet, we do not provide printed materials. All our articles and ebooks are available under a CC-BY license, so you can share and print copies.

Manuscripts can be submitted to this Research Topic via the main journal or any other participating journal.
