Li, Mao (2019) Reinforcement learning from internal, partially correct, and multiple demonstrations. PhD thesis, University of York.
Abstract
Typically, a reinforcement learning agent interacts with the environment and learns to select actions that maximise the cumulative reward over a trajectory of a task. Classic reinforcement learning, however, emphasises a knowledge-free learning process: the agent learns only from state-action-reward-next-state samples. This makes learning sample-inefficient, requiring a huge number of interactions to converge on an optimal policy. One way to address this challenge is to use records of human behaviour on the same task as demonstrations that speed up the agent's learning.
Demonstrations, however, do not come from the optimal policy and may conflict in many states, especially when they come from multiple sources. Meanwhile, the agent's own behaviour during learning can itself be used as demonstration data. To address these research gaps, three novel techniques are proposed in this thesis: introspective reinforcement learning, two-level Q-learning, and the radius-restrained weighted vote. Introspective reinforcement learning uses a priority queue as a filter to select qualified agent behaviours during the learning process as demonstrations; it applies reward shaping to give the agent an extra reward when its behaviour is similar to the demonstrations in the filter. Two-level Q-learning deals with conflicting demonstrations: two Q-tables (or Q-networks under function approximation) are maintained, storing state-expert values and state-action values respectively, so the agent not only learns a strategy from the selected actions but also learns, through trial and error, how to assign credit to the experts. The radius-restrained weighted vote derives a guidance policy from the demonstrations that lie within a radius (a hyper-parameter) of the current state; it weights each demonstration's vote by a Gaussian function of its distance to the current state and applies a softmax to the summed weighted votes over all candidate demonstrations.
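The following is a minimal sketch of the introspective idea as described in the abstract: a bounded priority queue keeps only the agent's highest-return behaviours, and a shaping bonus is paid when the current state-action pair resembles something in the queue. The class name, queue capacity, Gaussian similarity, and the way the bonus is computed are illustrative assumptions; the thesis's actual filtering and shaping scheme may differ.

```python
import heapq
import numpy as np

class IntrospectiveShapingSketch:
    """Keep the K highest-return (state, action) samples seen so far and shape toward them.

    All parameter names and defaults are hypothetical; only the priority-queue filter
    and the similarity-based shaping bonus follow the abstract's description.
    """

    def __init__(self, capacity=100, bonus=1.0, sigma=1.0):
        self.capacity = capacity
        self.bonus = bonus
        self.sigma = sigma
        self.queue = []  # min-heap of (return, state_tuple, action)

    def record(self, ret, state, action):
        """Offer one of the agent's own behaviours; only high-return samples survive."""
        heapq.heappush(self.queue, (ret, tuple(state), action))
        if len(self.queue) > self.capacity:
            heapq.heappop(self.queue)  # drop the lowest-return entry

    def shaping_reward(self, state, action):
        """Extra reward when (state, action) is close to a stored demonstration."""
        best = 0.0
        for _, demo_state, demo_action in self.queue:
            if demo_action == action:
                dist = np.linalg.norm(np.asarray(state) - np.asarray(demo_state))
                best = max(best, np.exp(-dist ** 2 / (2 * self.sigma ** 2)))
        return self.bonus * best
```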
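A tabular sketch of the two-level idea, assuming discrete states and actions: one table values experts at each state, the other values actions, and both receive ordinary one-step Q-learning updates from the observed reward. The class interface, the epsilon-greedy expert selection, and the shared learning rate are assumptions for illustration, not the thesis's exact algorithm.

```python
import numpy as np

class TwoLevelQLearningSketch:
    """One Q-table over (state, expert) and one over (state, action)."""

    def __init__(self, n_states, n_actions, n_experts,
                 alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
        self.q_expert = np.zeros((n_states, n_experts))  # state-expert values
        self.q_action = np.zeros((n_states, n_actions))  # state-action values
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = np.random.default_rng(seed)

    def select(self, state, expert_policies):
        """Pick an expert (epsilon-greedy on q_expert), then follow its suggested action."""
        if self.rng.random() < self.epsilon:
            expert = int(self.rng.integers(len(expert_policies)))
        else:
            expert = int(np.argmax(self.q_expert[state]))
        action = expert_policies[expert](state)  # the chosen expert suggests an action
        return expert, action

    def update(self, state, expert, action, reward, next_state):
        """Apply one-step Q-learning updates to both tables from the same transition."""
        td_a = reward + self.gamma * self.q_action[next_state].max() - self.q_action[state, action]
        self.q_action[state, action] += self.alpha * td_a
        td_e = reward + self.gamma * self.q_expert[next_state].max() - self.q_expert[state, expert]
        self.q_expert[state, expert] += self.alpha * td_e
```

Through the state-expert table, credit for good or bad outcomes flows back to the expert whose suggestion was followed, which is how conflicting demonstrations are arbitrated over time.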
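A sketch of the radius-restrained weighted vote as the abstract describes it: demonstrations within the radius vote for their actions with Gaussian distance weights, and a softmax over the summed votes yields the guidance policy. The function name, the uniform fallback when no demonstration is nearby, and the sigma parameter are assumptions.

```python
import numpy as np

def rrwv_guidance_policy(state, demo_states, demo_actions, n_actions,
                         radius=1.0, sigma=1.0):
    """Guidance distribution over actions from demonstrations near `state`.

    demo_states : (N, d) array of demonstrated states
    demo_actions: length-N sequence of the (integer) actions taken in those states
    """
    demo_states = np.asarray(demo_states, dtype=float)
    dists = np.linalg.norm(demo_states - np.asarray(state, dtype=float), axis=1)

    votes = np.zeros(n_actions)
    for dist, action in zip(dists, demo_actions):
        if dist <= radius:  # only demonstrations inside the radius may vote
            votes[action] += np.exp(-dist ** 2 / (2 * sigma ** 2))  # Gaussian-weighted vote

    if votes.sum() == 0.0:
        return np.full(n_actions, 1.0 / n_actions)  # no demonstration nearby: uniform guidance

    exp_votes = np.exp(votes - votes.max())  # softmax over the summed weighted votes
    return exp_votes / exp_votes.sum()
```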
Metadata
| Supervisors: | Kudenko, Daniel |
| --- | --- |
| Awarding institution: | University of York |
| Academic Units: | The University of York > Computer Science (York) |
| Identification Number/EthosID: | uk.bl.ethos.811380 |
| Depositing User: | Dr Mao Li |
| Date Deposited: | 31 Jul 2020 20:33 |
| Last Modified: | 21 Aug 2020 09:53 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:25660 |
Download
Examined Thesis (PDF)
Filename: PhD_Thesis_final_submission.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.