Policy Learning Using Weak Supervision

In Advances in Neural Information Processing Systems (NeurIPS), 2021
Weak supervision signals are everywhere! We provide a unified formulation of weakly supervised policy learning problems and propose PeerPL, a new way to perform policy evaluation under weak supervision.



Abstract

Most existing policy learning solutions require the learning agent to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is often infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available cheap weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environment is high.
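To make the correlated-agreement idea concrete, here is a minimal Python sketch of an agreement-based evaluation score. The function name correlated_agreement_score and the label-shuffling scheme are illustrative assumptions, not the paper's exact estimator: the score rewards agreeing with the weak signal on matched states but subtracts the agreement obtained on randomly re-paired states.

import numpy as np

def correlated_agreement_score(policy_actions, weak_labels, rng=None):
    # Hypothetical sketch: "correlated agreement" between the learning
    # agent's actions and the weak supervision (the "peer agent").
    #   policy_actions[i]: action the learned policy takes in state i
    #   weak_labels[i]:    action suggested by the weak signal in state i
    rng = np.random.default_rng() if rng is None else rng
    policy_actions = np.asarray(policy_actions)
    weak_labels = np.asarray(weak_labels)

    # Agreement on correctly matched (state, weak label) pairs.
    matched = (policy_actions == weak_labels).mean()

    # Agreement after shuffling the weak labels across states: the
    # policy's action in one state vs. the weak label of an
    # independently drawn state. Subtracting this term punishes a
    # policy that blindly overfits to the weak supervision.
    shuffled = (policy_actions == rng.permutation(weak_labels)).mean()

    return matched - shuffled

Note that a degenerate policy that always outputs the weak signal's most frequent action scores near zero under this sketch: its matched and shuffled agreements both approach the frequency of that action, so the baseline subtraction removes the advantage of simply matching the weak-label marginal.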

BibTeX
@inproceedings{wang2021policy,
  title     = {Policy Learning Using Weak Supervision},
  author    = {Jingkang Wang and Hongyi Guo and Zhaowei Zhu and Yang Liu},
  booktitle = {Thirty-Fifth Conference on Neural Information Processing Systems},
  year      = {2021},
  url       = {https://openreview.net/forum?id=UZgQhsTYe3R}
}
Text citation

Jingkang Wang, Hongyi Guo, Zhaowei Zhu, and Yang Liu. Policy Learning Using Weak Supervision. In Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.


PeerPL with Correlated Agreement
Results
PeerPL can also be plugged into DAgger! A sketch of this variant follows below.
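One hedged reading of that claim, following the peer-loss construction of Liu and Guo (2020) cited below: in each DAgger iteration, train on the aggregated (state, weak expert action) pairs with a peer-penalized imitation loss instead of plain cross-entropy. The function name peer_cross_entropy and the alpha weight below are illustrative assumptions, not the paper's exact objective.

import torch
import torch.nn.functional as F

def peer_cross_entropy(logits, weak_labels, alpha=1.0):
    # Illustrative peer-loss variant of the DAgger imitation loss.
    #   logits:      (N, A) policy logits on the aggregated DAgger states
    #   weak_labels: (N,)   possibly noisy expert actions (LongTensor)
    #   alpha:       weight on the peer penalty (a tunable assumption)
    # First term: ordinary behavioral cloning on the weak demonstrations.
    ce = F.cross_entropy(logits, weak_labels)
    # Peer term: score the policy's output on one state against the weak
    # label of an independently shuffled state, penalizing blind
    # agreement with the noisy supervision.
    perm = torch.randperm(weak_labels.shape[0])
    peer = F.cross_entropy(logits, weak_labels[perm])
    return ce - alpha * peer

Everything else in DAgger stays unchanged under this sketch: roll out the current policy, query the weak expert on the visited states, aggregate the data, and retrain with this loss.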


Past Work
@inproceedings{liu2020peer,
  title     = {Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates},
  author    = {Yang Liu and Hongyi Guo},
  booktitle = {Thirty-Seventh International Conference on Machine Learning},
  year      = {2020}
}

@inproceedings{wang2020reinforcement,
  title     = {Reinforcement Learning with Perturbed Rewards},
  author    = {Jingkang Wang and Yang Liu and Bo Li},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year      = {2020}
}