Elastic step DQN : a novel multi-step algorithm to alleviate overestimation in deep Q-networks
- Authors: Ly, Adrian , Dazeley, Richard , Vamplew, Peter , Cruz, Francisco , Aryal, Sunil
- Date: 2024
- Type: Text , Journal article
- Relation: Neurocomputing Vol. 576 (2024)
- Full Text:
- Reviewed:
- Description: The Deep Q-Network (DQN) algorithm was the first reinforcement learning algorithm to use a deep neural network to successfully surpass human-level performance in a number of Atari learning environments. However, divergent and unstable behaviour have been long-standing issues in DQNs. The unstable behaviour is often characterised by overestimation in the Q-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving the agent's training performance. However, agents are often highly sensitive to the selection of the multi-step update horizon (n), and our empirical experiments show that a poorly chosen static value for n can in many cases lead to worse performance than single-step DQN. Inspired by the success of n-step DQN and the effects that multi-step updates have on overestimation bias, this paper proposes a new algorithm that we call ‘Elastic Step DQN’ (ES-DQN) to alleviate overestimation bias in DQNs. ES-DQN dynamically varies the step horizon in multi-step updates based on the similarity between states visited. Our empirical evaluation shows that ES-DQN outperforms n-step updates with a fixed n, Double DQN and Average DQN in several OpenAI Gym environments while at the same time alleviating the overestimation bias. © 2024 The Authors
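The core mechanism described in this abstract is an n-step TD target whose horizon n is chosen dynamically from the similarity of successive states. The sketch below illustrates that idea only; the cosine-similarity test, the threshold and the maximum horizon are illustrative assumptions rather than the paper's actual ES-DQN rule.

```python
import numpy as np

def elastic_nstep_target(transitions, q_bootstrap, gamma=0.99,
                         similarity_threshold=0.9, max_n=8):
    """Illustrative 'elastic' multi-step TD target.

    transitions: list of (state, reward, next_state, done) tuples, in order.
    q_bootstrap: bootstrap value max_a Q(s_{t+n}, a) evaluated at the cutoff.
    The horizon n is extended while consecutive states remain 'similar'
    (here: cosine similarity above a threshold), capped at max_n.
    """
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    n = 1
    for state, _, next_state, done in transitions[:max_n]:
        if done or cosine(state, next_state) < similarity_threshold:
            break
        n += 1
    n = min(n, len(transitions), max_n)

    # Standard n-step return: discounted rewards plus a bootstrap term.
    target = sum((gamma ** k) * transitions[k][1] for k in range(n))
    if not transitions[n - 1][3]:            # no bootstrap past a terminal state
        target += (gamma ** n) * q_bootstrap
    return target, n

# Toy usage with a three-step rollout of 2-D state features.
rollout = [(np.array([0.0, 1.0]), 1.0, np.array([0.1, 1.0]), False),
           (np.array([0.1, 1.0]), 0.0, np.array([0.9, -0.2]), False),
           (np.array([0.9, -0.2]), 1.0, np.array([1.0, -0.3]), True)]
target, n = elastic_nstep_target(rollout, q_bootstrap=0.0)
```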
Position : intent-aligned AI systems must optimize for agency preservation
- Authors: Mitelut, Catalin , Smith, Ben , Vamplew, Peter
- Date: 2024
- Type: Text , Conference paper
- Relation: 41st International Conference on Machine Learning, ICML 2024, Vienna, 21-27 July 2024, Proceedings of Machine Learning Research Vol. 235, p. 35851-35875
- Full Text:
- Reviewed:
- Description: A central approach to AI-safety research has been to generate aligned AI systems: i.e. systems that do not deceive users and yield actions or recommendations that humans might judge as consistent with their intentions and goals. Here we argue that truthful AIs aligned solely to human intent are insufficient and that preservation of long-term agency of humans may be a more robust standard that may need to be separated and explicitly optimized for. We discuss the science of intent and control and how human intent can be manipulated, and we provide a formal definition of agency-preserving AI-human interactions focusing on forward-looking explicit agency evaluations. Our work points to a novel pathway for human harm in AI-human interactions and proposes solutions to this challenge. Copyright 2024 by the author(s)
Utility-based reinforcement learning : unifying single-objective and multi-objective reinforcement learning
- Authors: Vamplew, Peter , Foale, Cameron , Hayes, Conor , Mannion, Patrick , Howley, Enda , Dazeley, Richard , Johnson, Scott , Källström, Johan , Ramos, Gabriel , Rădulescu, Roxana , Röpke, Willem , Roijers, Diederik
- Date: 2024
- Type: Text , Conference paper
- Relation: 23rd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2024, Auckland, 6-10 May 2024, Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS Vol. 2024-May, p. 2717-2721
- Full Text:
- Reviewed:
- Description: Research in multi-objective reinforcement learning (MORL) has introduced the utility-based paradigm, which makes use of both environmental rewards and a function that defines the utility derived by the user from those rewards. In this paper we extend this paradigm to the context of single-objective reinforcement learning (RL), and outline multiple potential benefits including the ability to perform multi-policy learning across tasks relating to uncertain objectives, risk-aware RL, discounting, and safe RL. We also examine the algorithmic implications of adopting a utility-based approach. © 2024 International Foundation for Autonomous Agents and Multiagent Systems.
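As a minimal sketch of the utility-based view applied to single-objective RL described above: policies are compared by the expected utility of their return rather than by expected return alone. The concave "risk-averse" utility and the synthetic return distributions below are assumptions made purely for illustration, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic (illustrative) return distributions for two policies.
returns = {
    "steady_policy": rng.normal(loc=10.0, scale=1.0, size=10_000),
    "gambling_policy": rng.normal(loc=11.0, scale=12.0, size=10_000),
}

def utility(g, floor=-20.0):
    """A concave (risk-averse) utility of the return g."""
    return np.log1p(np.clip(g, floor, None) - floor)

for name, g in returns.items():
    print(f"{name}: expected return {g.mean():.2f}, "
          f"expected utility {utility(g).mean():.3f}")
```

Because the utility is concave, the ranking under expected utility can differ from the ranking under expected return, which is one way a utility-based formulation can capture risk-aware preferences.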
A brief guide to multi-objective reinforcement learning and planning : JAAMAS track
- Authors: Hayes, Conor , Bargiacchi, Eugenio , Källström, Johan , Macfarlane, Matthew , Reymond, Mathieu , Verstraeten, Timothy , Zintgraf, Luisa , Dazeley, Richard , Heintz, Fredrik , Howley, Enda , Irissappane, Athirai , Mannion, Patrick , Nowé, Ann , Ramos, Gabriel , Restelli, Marcello , Vamplew, Peter , Roijers, Diederik
- Date: 2023
- Type: Text , Conference paper
- Relation: 22nd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2023, London, 29 May to 2 June 2023, Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS Vol. 2023-May, p. 1988-1990
- Full Text:
- Reviewed:
- Description: Real-world sequential decision-making tasks are usually complex, and require trade-offs between multiple - often conflicting - objectives. However, the majority of research in reinforcement learning (RL) and decision-theoretic planning assumes a single objective, or that multiple objectives can be handled via a predefined weighted sum over the objectives. Such approaches may oversimplify the underlying problem, and produce suboptimal results. This extended abstract outlines the limitations of using a semi-blind iterative process to solve multi-objective decision-making problems. Our extended paper [4] serves as a guide for the application of explicitly multi-objective methods to difficult problems. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
A conceptual framework for externally-influenced agents: an assisted reinforcement learning review
- Authors: Bignold, Adam , Cruz, Francisco , Taylor, Matthew , Brys, Tim , Dazeley, Richard , Vamplew, Peter , Foale, Cameron
- Date: 2023
- Type: Text , Journal article
- Relation: Journal of Ambient Intelligence and Humanized Computing Vol. 14, no. 4 (2023), p. 3621-3644
- Full Text:
- Reviewed:
- Description: A long-term goal of reinforcement learning agents is to be able to perform tasks in complex real-world scenarios. The use of external information is one way of scaling agents to more complex problems. However, there is a general lack of collaboration or interoperability between different approaches using external information. In this work, while reviewing externally-influenced methods, we propose a conceptual framework and taxonomy for assisted reinforcement learning, aimed at fostering collaboration by classifying and comparing various methods that use external information in the learning process. The proposed taxonomy details the relationship between the external information source and the learner agent, highlighting the process of information decomposition, structure, retention, and how it can be used to influence agent learning. As well as reviewing state-of-the-art methods, we identify current streams of reinforcement learning that use external information in order to improve the agent’s performance and its decision-making process. These include heuristic reinforcement learning, interactive reinforcement learning, learning from demonstration, transfer learning, and learning from multiple sources, among others. These streams of reinforcement learning operate with the shared objective of scaffolding the learner agent. Lastly, we discuss further possibilities for future work in the field of assisted reinforcement learning systems. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.
A NetHack learning environment language wrapper for autonomous agents
- Authors: Goodger, Nikolaj , Vamplew, Peter , Foale, Cameron , Dazeley, Richard
- Date: 2023
- Type: Text , Journal article
- Relation: Journal of Open Research Software Vol. 11 (2023)
- Full Text:
- Reviewed:
- Description: This paper describes a language wrapper for the NetHack Learning Environment (NLE) [1]. The wrapper replaces the non-language observations and actions with comparable language versions. The NLE offers a grand challenge for AI research while MiniHack [2] extends this potential to more specific and configurable tasks. By providing a language interface, we can enable further research on language agents and directly connect language models to a versatile environment. © 2023 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.
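The general pattern of such a wrapper, replacing structured observations with text that a language agent or language model can consume, can be sketched as below. This is a hypothetical illustration only and does not reproduce the published wrapper's actual interface; the class name, the use of gymnasium, and the rendering format are all assumptions.

```python
import gymnasium as gym

class TextObservationWrapper(gym.ObservationWrapper):
    """Hypothetical sketch: expose an environment's observations as text.

    This is not the published NLE language wrapper's API; it only
    illustrates the idea of swapping structured observations for
    comparable language versions.
    """
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Text(max_length=4096)

    def observation(self, obs):
        # Render whatever fields the wrapped environment exposes as a sentence.
        if isinstance(obs, dict):
            parts = [f"{key}: {value}" for key, value in obs.items()]
            return "You observe " + "; ".join(parts) + "."
        return str(obs)
```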
AI apology : interactive multi-objective reinforcement learning for human-aligned AI
- Authors: Harland, Hadassah , Dazeley, Richard , Nakisa, Bahareh , Cruz, Francisco , Vamplew, Peter
- Date: 2023
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 35, no. 23 (2023), p. 16917-16930
- Full Text:
- Reviewed:
- Description: For an Artificially Intelligent (AI) system to maintain alignment between human desires and its behaviour, it is important that the AI account for human preferences. This paper proposes and empirically evaluates the first approach to aligning agent behaviour to human preference via an apologetic framework. In practice, an apology may consist of an acknowledgement, an explanation and an intention for the improvement of future behaviour. We propose that such an apology, provided in response to recognition of undesirable behaviour, is one way in which an AI agent may both be transparent and trustworthy to a human user. Furthermore, we propose that behavioural adaptation as part of an apology is a viable approach to correcting undesirable behaviours. The Act-Assess-Apologise framework potentially could address both the practical and social needs of a human user, to recognise and make reparations against prior undesirable behaviour and adjust for the future. Applied to a dual-auxiliary impact minimisation problem, the apologetic agent achieved near-perfect determination and apology-provision accuracy in several non-trivial configurations. The agent subsequently demonstrated successful behaviour alignment, in some scenarios completely avoiding the impacts described by these objectives. © 2023, The Author(s).
Elastic step DDPG : multi-step reinforcement learning for improved sample efficiency
- Authors: Ly, Adrian , Dazeley, Richard , Vamplew, Peter , Cruz, Francisco , Aryal, Sunil
- Date: 2023
- Type: Text , Conference paper
- Relation: 2023 International Joint Conference on Neural Networks, IJCNN 2023 Vol. 2023-June
- Full Text: false
- Reviewed:
- Description: A major challenge in deep reinforcement learning is that it requires more data to converge to a policy for complex problems. One way to improve sample efficiency is to use n-step updates to reduce the number of samples required to converge to a good policy. However, n-step updates are known to be brittle and difficult to tune. Elastic Step DQN has shown that it is possible to automate the value of n in DQN to solve problems involving discrete action spaces; however, the efficacy of the technique when applied to more complex problems and to problems with continuous action spaces is yet to be shown. In this paper we adapt the innovations proposed by Elastic Step DQN to the DDPG algorithm and show empirically that Elastic Step DDPG achieves a much stronger final training policy and is more sample efficient than DDPG. © 2023 IEEE.
Explainable reinforcement learning for Broad-XAI: a conceptual framework and survey
- Authors: Dazeley, Richard , Vamplew, Peter , Cruz, Francisco
- Date: 2023
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 35, no. 23 (2023), p. 16893-16916
- Full Text:
- Reviewed:
- Description: Broad-XAI moves away from interpreting individual decisions based on a single datum and aims to provide integrated explanations from multiple machine learning algorithms into a coherent explanation of an agent’s behaviour that is aligned to the communication needs of the explainee. Reinforcement Learning (RL) methods, we propose, provide a potential backbone for the cognitive model required for the development of Broad-XAI. RL represents a suite of approaches that have had increasing success in solving a range of sequential decision-making problems. However, these algorithms operate as black-box problem solvers, where they obfuscate their decision-making policy through a complex array of values and functions. EXplainable RL (XRL) aims to develop techniques to extract concepts from the agent’s: perception of the environment; intrinsic/extrinsic motivations/beliefs; Q-values, goals and objectives. This paper aims to introduce the Causal XRL Framework (CXF), which unifies the current XRL research and uses RL as a backbone to the development of Broad-XAI. CXF is designed to incorporate many standard RL extensions and to integrate with external ontologies and communication facilities so that the agent can answer questions that explain the outcomes of its decisions. This paper aims to: establish XRL as a distinct branch of XAI; introduce a conceptual framework for XRL; review existing approaches to explaining agent behaviour; and identify opportunities for future research. Finally, this paper discusses how additional information can be extracted and ultimately integrated into models of communication, facilitating the development of Broad-XAI. © 2023, The Author(s).
Explainable robotic systems : understanding goal-driven actions in a reinforcement learning scenario
- Authors: Cruz, Francisco , Dazeley, Richard , Vamplew, Peter , Moreira, Ithan
- Date: 2023
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 35, no. 25 (2023), p. 18113-18130
- Full Text:
- Reviewed:
- Description: Robotic systems are more present in our society every day. In human–robot environments, it is crucial that end-users can correctly understand their robotic team-partners, in order to collaboratively complete a task. To increase action understanding, users demand more explainability about the decisions made by the robot in particular situations. Recently, explainable robotic systems have emerged as an alternative focused not only on completing a task satisfactorily, but also on justifying, in a human-like manner, the reasons that lead to making a decision. In reinforcement learning scenarios, a great effort has been focused on providing explanations using data-driven approaches, particularly from the visual input modality in deep learning-based systems. In this work, we focus rather on the decision-making process of reinforcement learning agents performing a task in a robotic scenario. Experimental results are obtained using three different set-ups, namely, a deterministic navigation task, a stochastic navigation task, and a continuous visual-based sorting object task. As a way to explain the goal-driven robot’s actions, we use the probability of success computed by three different proposed approaches: memory-based, learning-based, and introspection-based. The difference between these approaches is the amount of memory required to compute or estimate the probability of success as well as the kind of reinforcement learning representation where they could be used. In this regard, we use the memory-based approach as a baseline since it is obtained directly from the agent’s observations. When comparing the learning-based and the introspection-based approaches to this baseline, both are found to be suitable alternatives to compute the probability of success, obtaining high levels of similarity when compared using both the Pearson’s correlation and the mean squared error. © 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
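Of the three approaches named above, the memory-based baseline is the simplest to make concrete: it estimates the probability of success directly from the agent's observed episodes. The sketch below is a generic frequency-count version under that reading, not the paper's exact implementation.

```python
from collections import defaultdict

class MemoryBasedSuccessEstimator:
    """Illustrative memory-based estimate of the probability of success.

    Counts, for each state-action pair seen during training, how often the
    episode containing it eventually reached the goal. A generic sketch of
    the idea only; the paper's representation details are not reproduced.
    """
    def __init__(self):
        self.successes = defaultdict(int)
        self.visits = defaultdict(int)
        self._episode = []

    def record_step(self, state, action):
        self._episode.append((state, action))

    def end_episode(self, reached_goal):
        for sa in self._episode:
            self.visits[sa] += 1
            if reached_goal:
                self.successes[sa] += 1
        self._episode = []

    def probability_of_success(self, state, action):
        sa = (state, action)
        if self.visits[sa] == 0:
            return 0.0              # unseen pair: no evidence yet
        return self.successes[sa] / self.visits[sa]
```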
Human engagement providing evaluative and informative advice for interactive reinforcement learning
- Authors: Bignold, Adam , Cruz, Francisco , Dazeley, Richard , Vamplew, Peter , Foale, Cameron
- Date: 2023
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 35, no. 25 (2023), p. 18215-18230
- Full Text:
- Reviewed:
- Description: Interactive reinforcement learning proposes the use of externally sourced information in order to speed up the learning process. When interacting with a learner agent, humans may provide either evaluative or informative advice. Prior research has focused on the effect of human-sourced advice by including real-time feedback on the interactive reinforcement learning process, specifically aiming to improve the learning speed of the agent, while minimising the time demands on the human. This work focuses on answering which of two approaches, evaluative or informative, is the preferred instructional approach for humans. Moreover, this work presents an experimental setup for a human trial designed to compare the methods people use to deliver advice in terms of human engagement. The results obtained show that users giving informative advice to the learner agents provide more accurate advice, are willing to assist the learner agent for a longer time, and provide more advice per episode. Additionally, self-evaluation from participants using the informative approach has indicated that the agent’s ability to follow the advice is higher, and therefore, they feel their own advice to be of higher accuracy when compared to people providing evaluative advice. © 2022, The Author(s).
Scalar reward is not enough : JAAMAS Track
- Authors: Vamplew, Peter , Smith, Benjamin , Källström, Johan , Ramos, Gabriel , Rădulescu, Roxana , Roijers, Diederik , Hayes, Conor , Heintz, Fredrik , Mannion, Patrick , Libin, Pieter , Dazeley, Richard , Foale, Cameron
- Date: 2023
- Type: Text , Conference paper
- Relation: 22nd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2023, London, 29 May to 2 June 2023, Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, Vol. 2023-May, p. 839-841
- Full Text: false
- Reviewed:
- Description: Silver et al. [14] posit that scalar reward maximisation is sufficient to underpin all intelligence and provides a suitable basis for artificial general intelligence (AGI). This extended abstract summarises the counter-argument from our JAAMAS paper [19]. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
A practical guide to multi-objective reinforcement learning and planning
- Authors: Hayes, Conor , Rădulescu, Roxana , Bargiacchi, Eugenio , Källström, Johan , Macfarlane, Matthew , Reymond, Mathieu , Verstraeten, Timothy , Zintgraf, Luisa , Dazeley, Richard , Heintz, Fredrik , Howley, Enda , Irissappane, Athirai , Mannion, Patrick , Nowé, Ann , Ramos, Gabriel , Restelli, Marcello , Vamplew, Peter , Roijers, Diederik
- Date: 2022
- Type: Text , Journal article
- Relation: Autonomous Agents and Multi-Agent Systems Vol. 36, no. 1 (2022)
- Full Text:
- Reviewed:
- Description: Real-world sequential decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems. © 2022, The Author(s).
An online scalarization multi-objective reinforcement learning algorithm : TOPSIS Q-learning
- Authors: Mirzanejad, Mohammad , Ebrahimi, Morteza , Vamplew, Peter , Veisi, Hadi
- Date: 2022
- Type: Text , Journal article
- Relation: Knowledge Engineering Review Vol. 37, no. 4 (2022)
- Full Text: false
- Reviewed:
- Description: Conventional reinforcement learning focuses on problems with a single objective. However, many problems have multiple objectives or criteria that may be independent, related, or contradictory. In such cases, multi-objective reinforcement learning is used to propose a compromise among the solutions to balance the objectives. TOPSIS is a multi-criteria decision method that selects the alternative with the minimum distance from the positive ideal solution and the maximum distance from the negative ideal solution, so it can be used effectively in the decision-making process to select the next action. In this research, a single-policy algorithm called TOPSIS Q-Learning is presented, with a focus on its performance in online mode. Unlike other single-policy methods, in the first version of the algorithm there is no need for the user to specify the weights of the objectives. The user's preferences may not be completely definite, so all weight preferences are combined as decision criteria and a solution is generated by considering all of these preferences at once; the user can thus model uncertainty and changes in the objective weights around their specified preferences. If the user only wants to apply the algorithm to a specific set of weights, the second version of the algorithm efficiently accomplishes that.
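The TOPSIS selection rule described above, preferring the alternative closest to the positive ideal solution and furthest from the negative ideal solution, can be applied to vector-valued Q-estimates roughly as follows. The normalisation, the weights and the "all objectives are benefits" assumption are illustrative; this is not the paper's algorithm verbatim.

```python
import numpy as np

def topsis_rank_actions(q_vectors, weights):
    """Rank actions by a TOPSIS closeness score over multi-objective Q-values.

    q_vectors: array of shape (num_actions, num_objectives), assumed to be
    'benefit' objectives (higher is better). Illustrative sketch only.
    """
    q = np.asarray(q_vectors, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Vector-normalise each objective column, then apply the user weights.
    norm = np.linalg.norm(q, axis=0)
    norm[norm == 0] = 1.0
    v = (q / norm) * w

    ideal_best = v.max(axis=0)       # positive ideal solution
    ideal_worst = v.min(axis=0)      # negative ideal solution

    d_best = np.linalg.norm(v - ideal_best, axis=1)
    d_worst = np.linalg.norm(v - ideal_worst, axis=1)
    closeness = d_worst / (d_best + d_worst + 1e-12)
    return int(np.argmax(closeness)), closeness

# Example: vector Q-estimates for 3 actions over 2 objectives, equal weights.
best_action, scores = topsis_rank_actions([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
                                          [0.5, 0.5])
```

In a Q-learning loop, the returned index would play the role of the greedy action for the current state.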
Discrete-to-deep reinforcement learning methods
- Authors: Kurniawan, Budi , Vamplew, Peter , Papasimeon, Michael , Dazeley, Richard , Foale, Cameron
- Date: 2022
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 34, no. 3 (2022), p. 1713-1733
- Full Text:
- Reviewed:
- Description: Neural networks are effective function approximators, but hard to train in the reinforcement learning (RL) context mainly because samples are correlated. In complex problems, a neural RL approach is often able to learn a better solution than tabular RL, but generally takes longer. This paper proposes two methods, Discrete-to-Deep Supervised Policy Learning (D2D-SPL) and Discrete-to-Deep Supervised Q-value Learning (D2D-SQL), whose objective is to acquire the generalisability of a neural network at a cost nearer to that of a tabular method. Both methods combine RL and supervised learning (SL) and are based on the idea that a fast-learning tabular method can generate off-policy data to accelerate learning in neural RL. D2D-SPL uses the data to train a classifier which is then used as a controller for the RL problem. D2D-SQL uses the data to initialise a neural network which is then allowed to continue learning using another RL method. We demonstrate the viability of our algorithms with Cartpole, Lunar Lander and an aircraft manoeuvring problem, three continuous-space environments with low-dimensional state variables. Both methods learn at least 38% faster than baseline methods and yield policies that outperform them. © 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
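A minimal sketch of the D2D-SPL idea as summarised above, distilling a fast tabular learner's greedy behaviour into a neural classifier that then acts as the controller, is given below. The synthetic data, the features and the network size are assumptions; the paper's actual feature construction and the D2D-SQL variant are not shown.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for experience gathered by a tabular learner:
# low-dimensional states paired with the tabular policy's greedy actions.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(5000, 4))
greedy_actions = (states[:, 0] + 0.5 * states[:, 2] > 0).astype(int)

# Supervised policy learning: fit a classifier to imitate the tabular policy.
policy_net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300)
policy_net.fit(states, greedy_actions)

def act(state):
    """Use the distilled classifier as the control policy."""
    return int(policy_net.predict(np.asarray(state).reshape(1, -1))[0])
```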
Evaluating human-like explanations for robot actions in reinforcement learning scenarios
- Authors: Cruz, Francisco , Young, Charlotte , Dazeley, Richard , Vamplew, Peter
- Date: 2022
- Type: Text , Conference paper
- Relation: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2022, Kyoto, Japan, 23-27 October 2022, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Vol. 2022-October, p. 894-901
- Full Text:
- Reviewed:
- Description: Explainable artificial intelligence is a research field that tries to provide more transparency for autonomous intelligent systems. Explainability has been used, particularly in reinforcement learning and robotic scenarios, to better understand the robot decision-making process. Previous work, however, has been widely focused on providing technical explanations that can be better understood by AI practitioners than non-expert end-users. In this work, we make use of human-like explanations built from the probability of success to complete the goal that an autonomous robot shows after performing an action. These explanations are intended to be understood by people who have no or very little experience with artificial intelligence methods. This paper presents a user trial to study whether these explanations that focus on the probability an action has of succeeding in its goal constitute a suitable explanation for non-expert end-users. The results obtained show that non-expert participants rate robot explanations that focus on the probability of success higher and with less variance than technical explanations generated from Q-values, and also favor counterfactual explanations over standalone explanations. © 2022 IEEE.
Scalar reward is not enough : a response to Silver, Singh, Precup and Sutton (2021)
- Authors: Vamplew, Peter , Smith, Benjamin , Källström, Johan , Ramos, Gabriel , Rădulescu, Roxana , Roijers, Diederik , Hayes, Conor , Heintz, Fredrik , Mannion, Patrick , Libin, Pieter , Dazeley, Richard , Foale, Cameron
- Date: 2022
- Type: Text , Journal article
- Relation: Autonomous Agents and Multi-Agent Systems Vol. 36, no. 2 (2022)
- Full Text:
- Reviewed:
- Description: The recent paper “Reward is Enough” by Silver, Singh, Precup and Sutton posits that the concept of reward maximisation is sufficient to underpin all intelligence, both natural and artificial, and provides a suitable basis for the creation of artificial general intelligence. We contest the underlying assumption of Silver et al. that such reward can be scalar-valued. In this paper we explain why scalar rewards are insufficient to account for some aspects of both biological and computational intelligence, and argue in favour of explicitly multi-objective models of reward maximisation. Furthermore, we contend that even if scalar reward functions can trigger intelligent behaviour in specific cases, this type of reward is insufficient for the development of human-aligned artificial general intelligence due to unacceptable risks of unsafe or unethical behaviour. © 2022, The Author(s).
The impact of environmental stochasticity on value-based multiobjective reinforcement learning
- Authors: Vamplew, Peter , Foale, Cameron , Dazeley, Richard
- Date: 2022
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 34, no. 3 (2022), p. 1783-1799
- Full Text:
- Reviewed:
- Description: A common approach to address multiobjective problems using reinforcement learning methods is to extend model-free, value-based algorithms such as Q-learning to use a vector of Q-values in combination with an appropriate action selection mechanism that is often based on scalarisation. Most prior empirical evaluation of these approaches has focused on deterministic environments. This study examines the impact of stochasticity in rewards and state transitions on the behaviour of multi-objective Q-learning. It shows that the nature of the optimal solution depends on these environmental characteristics, and also on whether we desire to maximise the Expected Scalarised Return (ESR) or the Scalarised Expected Return (SER). We also identify a novel aim which may arise in some applications of maximising SER subject to satisfying constraints on the variation in return and show that this may require different solutions than ESR or conventional SER. The analysis of the interaction between environmental stochasticity and multi-objective Q-learning is supported by empirical evaluations on several simple multiobjective Markov Decision Processes with varying characteristics. This includes a demonstration of a novel approach to learning deterministic SER-optimal policies for environments with stochastic rewards. In addition, we report a previously unidentified issue with model-free, value-based approaches to multiobjective reinforcement learning in the context of environments with stochastic state transitions. Having highlighted the limitations of value-based model-free MORL methods, we discuss several alternative methods that may be more suitable for maximising SER in MOMDPs with stochastic transitions. © 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
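The ESR/SER distinction discussed above comes down to where the expectation is taken relative to the user's scalarisation (utility) function f, using the notation common in the MORL literature:

```latex
% Scalarised Expected Return: apply f to the expected vector return.
% Expected Scalarised Return: take the expectation of f applied to the return.
\begin{align}
  V^{\pi}_{\mathrm{SER}} &= f\!\left(\mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\,\mathbf{r}_{t} \;\middle|\; \pi\right]\right), &
  V^{\pi}_{\mathrm{ESR}} &= \mathbb{E}\!\left[\,f\!\left(\sum_{t=0}^{\infty} \gamma^{t}\,\mathbf{r}_{t}\right) \;\middle|\; \pi\right].
\end{align}
```

For nonlinear f combined with stochastic rewards or transitions, these two criteria generally favour different policies, which is the behaviour the empirical evaluation above examines.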
A prioritized objective actor-critic method for deep reinforcement learning
- Authors: Nguyen, Ngoc , Nguyen, Thanh , Vamplew, Peter , Dazeley, Richard , Nahavandi, Saeid
- Date: 2021
- Type: Text , Journal article
- Relation: Neural Computing and Applications Vol. 33, no. 16 (2021), p. 10335-10349
- Full Text: false
- Reviewed:
- Description: An increasing number of complex problems have naturally posed significant challenges in decision-making theory and reinforcement learning practices. These problems often involve multiple conflicting reward signals that inherently cause poor exploration by agents seeking a specific goal. In extreme cases, the agent gets stuck in a sub-optimal solution and starts behaving harmfully. To overcome such obstacles, we introduce two actor-critic deep reinforcement learning methods, namely Multi-Critic Single Policy (MCSP) and Single Critic Multi-Policy (SCMP), which can adjust agent behaviors to efficiently achieve a designated goal by adopting a weighted-sum scalarization of different objective functions. In particular, MCSP creates a human-centric policy that corresponds to a predefined priority weighting of the different objectives, whereas SCMP is capable of generating a mixed policy based on a set of priority weights, i.e., the generated policy uses the knowledge of different policies (each policy corresponds to a priority weight) to dynamically prioritize objectives in real time. We examine our methods by using the Asynchronous Advantage Actor-Critic (A3C) algorithm to utilize the multithreading mechanism for dynamically balancing training intensity of different policies into a single network. Finally, simulation results show that MCSP and SCMP significantly outperform A3C with respect to the mean of total rewards in two complex problems: Food Collector and Seaquest. © 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd. part of Springer Nature.
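The scalarisation step that both MCSP and SCMP rely on, as described above, is a weighted sum over the objective-specific reward signals. A minimal sketch of that step is shown below; the surrounding A3C machinery is omitted and the example weights are arbitrary.

```python
import numpy as np

def scalarise(reward_vector, priority_weights):
    """Weighted-sum scalarisation of per-objective rewards."""
    w = np.asarray(priority_weights, dtype=float)
    w = w / w.sum()                      # normalise the priority weights
    return float(np.dot(w, np.asarray(reward_vector, dtype=float)))

# e.g. two conflicting signals: task score vs. a harm penalty.
scalar_reward = scalarise([1.0, -3.0], [0.7, 0.3])
```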
An evaluation methodology for interactive reinforcement learning with simulated users
- Authors: Bignold, Adam , Cruz, Francisco , Dazeley, Richard , Vamplew, Peter , Foale, Cameron
- Date: 2021
- Type: Text , Journal article
- Relation: Biomimetics Vol. 6, no. 1 (2021), p. 1-15
- Full Text:
- Reviewed:
- Description: Interactive reinforcement learning methods utilise an external information source to evaluate decisions and accelerate learning. Previous work has shown that human advice could significantly improve learning agents’ performance. When evaluating reinforcement learning algorithms, it is common to repeat experiments as parameters are altered or to gain a sufficient sample size. In this regard, to require human interaction every time an experiment is restarted is undesirable, particularly when the expense in doing so can be considerable. Additionally, reusing the same people for the experiment introduces bias, as they will learn the behaviour of the agent and the dynamics of the environment. This paper presents a methodology for evaluating interactive reinforcement learning agents by employing simulated users. Simulated users allow human knowledge, bias, and interaction to be simulated. The use of simulated users allows the development and testing of reinforcement learning agents, and can provide indicative results of agent performance under defined human constraints. While simulated users are no replacement for actual humans, they do offer an affordable and fast alternative for evaluating assisted agents. We introduce a method for performing a preliminary evaluation utilising simulated users to show how performance changes depending on the type of user assisting the agent. Moreover, we describe how human interaction may be simulated, and present an experiment illustrating the applicability of simulating users in evaluating agent performance when assisted by different types of trainers. Experimental results show that the use of this methodology allows for greater insight into the performance of interactive reinforcement learning agents when advised by different users. The use of simulated users with varying characteristics allows for evaluation of the impact of those characteristics on the behaviour of the learning agent. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
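A simulated user of the kind described above can be reduced to a small number of parameters, such as the availability and accuracy of its advice. The sketch below is a generic version under that reading; the oracle policy, parameter names and default values are assumptions rather than the paper's user models.

```python
import random

class SimulatedUser:
    """Illustrative simulated advice-giver for interactive RL experiments.

    Parameterised by availability (how often advice is offered) and accuracy
    (how often the advice matches an oracle policy). A generic sketch only.
    """
    def __init__(self, oracle_policy, availability=0.5, accuracy=0.9, seed=0):
        self.oracle_policy = oracle_policy
        self.availability = availability
        self.accuracy = accuracy
        self.rng = random.Random(seed)

    def advise(self, state, action_space):
        if self.rng.random() > self.availability:
            return None                              # user stays silent
        if self.rng.random() <= self.accuracy:
            return self.oracle_policy(state)         # correct advice
        return self.rng.choice(action_space)         # noisy / mistaken advice

# Example: an always-"left" oracle, advice offered 80% of the time.
user = SimulatedUser(oracle_policy=lambda state: "left",
                     availability=0.8, accuracy=0.95)
advice = user.advise(state=(0, 0), action_space=["left", "right"])
```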