Paper link: https://ieeexplore.ieee.org/document/11062808
Complex game scenarios, with their lengthy interaction sequences and sparse feedback, face challenges such as difficult credit assignment. Current intelligent game algorithms mostly rely on hierarchical reinforcement learning to cope with these challenges. However, to avoid the "hierarchical generalization catastrophe" (decisions at different levels converging to similar behaviors while the skill set keeps shrinking), existing hierarchical game algorithms typically optimize the agents at each level independently, which leads to poor cross-level coordination and limits performance. To address this, this paper pioneers a "partial joint optimization" hierarchical decision-making architecture that, at the cost of a few additional exploration dimensions, achieves joint optimization of the agents across the levels of the hierarchy, improving game performance while effectively preventing the hierarchical generalization catastrophe. In addition, a time-event dual-driven mechanism is proposed to better balance agent response speed against credit assignment (see the sketch below). Experiments show that, compared with four SOTA algorithms, the proposed algorithm exhibits stronger exploration ability and achieves a win rate of at least 71% in cross-play games; it also won first place among intelligent algorithms in Subject 1 of the "Intelligent Air Game Algorithm Challenge," a special track of the 7th National Wargame Competition.
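The summary describes the time-event dual-driven (TED2) mechanism only at a high level and does not specify its trigger conditions. Below is a minimal sketch of how such a dual-driven decision trigger could look, assuming a fixed time-driven interval combined with hypothetical event triggers (a newly detected incoming missile, or a large jump in target bearing); all names and thresholds are illustrative assumptions, not the authors' implementation.

```python
import math

# Minimal sketch of a time-event dual-driven (TED2) decision trigger.
# All names, thresholds, and event definitions are hypothetical; the
# summary above does not specify the paper's exact trigger conditions.

DECISION_INTERVAL_S = 2.0   # time-driven: max seconds between high-level decisions
BEARING_JUMP_DEG = 30.0     # event-driven: large change in target bearing

class TED2Trigger:
    def __init__(self):
        self.last_decision_time = -math.inf
        self.last_bearing = None
        self.last_missile_count = 0

    def should_decide(self, t, obs):
        """Return True if the high-level agent should issue a new command now."""
        # Time-driven branch: bounds the worst-case response interval and keeps
        # the temporal credit-assignment horizon short.
        time_due = (t - self.last_decision_time) >= DECISION_INTERVAL_S

        # Event-driven branch: react immediately to salient situation changes
        # instead of waiting for the next fixed tick.
        missile_event = obs["incoming_missiles"] > self.last_missile_count
        bearing_event = (
            self.last_bearing is not None
            and abs(obs["target_bearing_deg"] - self.last_bearing) > BEARING_JUMP_DEG
        )

        if time_due or missile_event or bearing_event:
            self.last_decision_time = t
            self.last_bearing = obs["target_bearing_deg"]
            self.last_missile_count = obs["incoming_missiles"]
            return True
        return False

# Example usage:
# trigger = TED2Trigger()
# if trigger.should_decide(sim_time, {"incoming_missiles": 1, "target_bearing_deg": 45.0}):
#     option = high_level_policy.act(observation)
```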
C. Qian, X. Zhang*, L. Li, Y. Wang, M. Zhao and Y. Fang, "A Partial Joint Optimization Algorithm for Autonomous Air Combat Based on Hierarchical Reinforcement Learning," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2025.3579745
Abstract
Designing intelligent game strategies for autonomous air combat suffers from a vast exploration space, a lengthy decision-making process, and sparse rewards. Some existing approaches adopt a hierarchical framework to improve exploration efficiency. However, in these methods, agents in different layers are typically trained independently and operate at fixed frequencies, which limits their performance and hampers their ability to respond to highly dynamic combat situations. In view of this, we present PJOH-TED2, a partial-joint-optimization-based hierarchical (PJOH) learning framework with a time-event dual-driven (TED2) mechanism, for one-on-one beyond-visual-range (BVR) air combat. Specifically, the PJOH learning framework embeds the partial joint optimization mechanism into hierarchical reinforcement learning (HRL), dramatically improving exploration efficiency while enhancing integration across hierarchical levels. Moreover, the TED2 mechanism combines the advantages of event-driven and time-driven methods, which improves the agents' dynamic response speed while avoiding redundant actions. In addition, we evaluated this work through a series of games against state-of-the-art (SOTA) methods in a high-fidelity air combat simulation environment. The results empirically demonstrate that the proposed approach outperforms four SOTA methods with a win rate of at least 71%. Finally, this approach achieved first place among learning methods in the Intelligent Air Game Algorithm Challenge (IAGAC), organized by the Chinese Institute of Command and Control, among 43 teams.
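As a companion to the abstract, the following is a minimal sketch of what "partial joint optimization" across two hierarchy levels could look like: a high-level policy selects a macro option, a low-level policy produces maneuver commands conditioned on it, and a single policy-gradient-style loss couples both levels on the same trajectory instead of training each layer in isolation. The network sizes, return estimate, and weighting factor `lambda_low` are assumptions for illustration; this is not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

# Illustrative sketch of partial joint optimization across two hierarchy
# levels. Architectures, the return signal, and `lambda_low` are assumptions.

class HighLevelPolicy(nn.Module):
    """Selects a discrete sub-policy (macro option) from the situation."""
    def __init__(self, obs_dim, n_options):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_options))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class LowLevelPolicy(nn.Module):
    """Produces continuous maneuver parameters conditioned on obs and option."""
    def __init__(self, obs_dim, n_options, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + n_options, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))
    def forward(self, obs, option_onehot):
        mean = self.net(torch.cat([obs, option_onehot], dim=-1))
        return torch.distributions.Normal(mean, torch.ones_like(mean))

def joint_update(high, low, optimizer, batch, lambda_low=1.0):
    """One joint gradient step over both levels on the same trajectory batch.

    Unlike fully independent layer-wise training, both log-probabilities share
    the same return signal, so the two levels are coupled through one loss.
    """
    obs = batch["obs"]
    option_onehot = batch["option_onehot"]
    action = batch["action"]
    ret = batch["return"]
    logp_high = high(obs).log_prob(option_onehot.argmax(dim=-1))
    logp_low = low(obs, option_onehot).log_prob(action).sum(dim=-1)
    loss = -((logp_high + lambda_low * logp_low) * ret).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage: one optimizer spans both levels so gradients flow jointly.
# high = HighLevelPolicy(obs_dim=32, n_options=4)
# low = LowLevelPolicy(obs_dim=32, n_options=4, act_dim=3)
# optimizer = torch.optim.Adam(list(high.parameters()) + list(low.parameters()), lr=3e-4)
```

Coupling the levels through a shared loss, rather than optimizing every layer fully jointly or each one independently, is the spirit of "partial" joint optimization described in the summary above: it trades a modest increase in the explored space for better cross-level coordination.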