METHODS article

Front. Robot. AI
Sec. Robot Learning and Evolution
Volume 11 - 2024 | doi: 10.3389/frobt.2024.1444188

HPRS: Hierarchical Potential-based Reward Shaping from Task Specifications

Provisionally accepted
Luigi Berducci 1*, Edgar A. Aguilar 2, Dejan Ničković 2, Radu Grosu 1
  • 1 Vienna University of Technology, Vienna, Austria
  • 2 Austrian Institute of Technology (AIT), Vienna, Austria

The final, formatted version of the article will be published soon.

    The automatic synthesis of policies for robotic systems through reinforcement learning relies upon, and is intimately guided by, a reward signal. Consequently, this signal should faithfully reflect the designer's intentions, which are often expressed as a collection of high-level requirements. Several works have proposed automated reward definitions derived from formal requirements, but they struggle to produce a signal that is both effective in training and able to capture multiple heterogeneous requirements. In this paper, we define a task as a partially-ordered set of safety, target, and comfort requirements, and introduce an automated methodology that embeds this natural order among requirements into the reward signal. We do so by automatically translating the requirements into a sum of safety, target, and comfort rewards, where the target reward is a function of the safety reward and the comfort reward is a function of the safety and target rewards. Using a potential-based formulation, we transform sparse rewards into dense ones, and formally prove that this transformation preserves policy optimality. We call our novel approach hierarchical potential-based reward shaping (HPRS). Our experiments on 8 robotics benchmarks demonstrate that HPRS generates policies satisfying complex hierarchical requirements. Moreover, compared with the state of the art, HPRS achieves faster convergence and superior performance with respect to the rank-preserving policy-assessment metric. By automatically balancing competing requirements, HPRS produces task-satisfying policies with improved comfort, without manual parameter tuning. Through ablation studies, we analyze the impact of individual requirement classes on emergent behavior. Our experiments show that HPRS benefits from comfort requirements when they align with the target and safety requirements, and ignores them when they conflict with safety or target. Finally, we validate the practical usability of HPRS in real-world robotics applications, including two sim-to-real experiments using F1TENTH vehicles. These experiments show that a hierarchical design of task specifications facilitates sim-to-real transfer without any domain adaptation.
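    As a rough illustration of the hierarchical composition described above, the sketch below shows one way to gate the target reward by safety, gate the comfort reward by both safety and target, and apply potential-based shaping. The requirement evaluators, state fields ("obstacle_dist", "goal_dist", "accel"), and thresholds are hypothetical placeholders, not the authors' implementation; only the gating structure and the shaping term F(s, s') = γΦ(s') − Φ(s) follow the abstract.

```python
import numpy as np

# Hypothetical requirement evaluators mapping a state to a degree of
# satisfaction in [0, 1] (1 = fully satisfied). Fields and thresholds
# are illustrative assumptions.
def safety_scores(state):
    return [np.clip(state["obstacle_dist"] / 0.5, 0.0, 1.0)]

def target_scores(state):
    return [np.clip(1.0 - state["goal_dist"], 0.0, 1.0)]

def comfort_scores(state):
    return [np.clip(1.0 - abs(state["accel"]) / 3.0, 0.0, 1.0)]

def hierarchical_potential(state):
    """Hierarchical potential: the target term counts only insofar as
    safety holds, and the comfort term only insofar as both safety and
    target hold (the partial order from the task specification)."""
    safety = min(safety_scores(state))   # every safety requirement must hold
    target = float(np.mean(target_scores(state)))
    comfort = float(np.mean(comfort_scores(state)))
    return safety + safety * target + safety * target * comfort

def shaped_reward(sparse_reward, state, next_state, gamma=0.99):
    """Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s).
    Adding F to any reward preserves the optimal policy (Ng et al., 1999)."""
    return (sparse_reward
            + gamma * hierarchical_potential(next_state)
            - hierarchical_potential(state))

# Example transition: the agent approaches the goal while staying safe,
# so the shaped signal is dense even when the sparse reward is zero.
s = {"obstacle_dist": 0.40, "goal_dist": 0.6, "accel": 1.0}
s_next = {"obstacle_dist": 0.45, "goal_dist": 0.5, "accel": 0.8}
print(shaped_reward(0.0, s, s_next))
```

    Because the shaping term telescopes along a trajectory, it alters the return only by the potential of the trajectory's endpoints, which is why the dense signal can be proven not to change which policies are optimal.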

    Keywords: robotics, robot learning, reinforcement learning, reward shaping, formal specifications

    Received: 05 Jun 2024; Accepted: 23 Dec 2024.

    Copyright: © 2024 Berducci, Aguilar, Ničković and Grosu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Luigi Berducci, Vienna University of Technology, Vienna, Austria

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.