# Getting Started with AirSim Drone Simulation: A Hands-On Guide to Q-learning and Sarsa in Python (with Complete Code)
In the field of autonomous drone navigation, reinforcement learning is becoming one of the key technologies. Watching a drone complete a complex task through self-learning for the first time is hard to forget. This article walks you through implementing Q-learning and Sarsa in the AirSim simulation environment from scratch, so that a drone learns to move on its own.

## 1. Environment Preparation and Basic Configuration

### 1.1 Setting Up AirSim

First, install the AirSim simulation platform. Windows with Unreal Engine 4 is the recommended setup. The installation steps are:

1. Install Unreal Engine 4.27 from the Epic Games Launcher
2. Clone the AirSim repository: `git clone https://github.com/microsoft/AirSim.git`
3. Build the AirSim plugin and copy it into your Unreal project's `Plugins` directory

Once everything is configured, launch the Blocks sample scene. The default console command looks like this:

```bash
# Launch the drone simulation
start -ResX=1920 -ResY=1080 -windowed
```

Common troubleshooting:

- If you see a "Vulkan not supported" error, try adding the `-opengl3` flag
- Make sure the latest graphics driver is installed
- The Python API requires the `airsim` package: `pip install airsim`

### 1.2 Python Development Environment

It is recommended to create an isolated conda environment:

```bash
conda create -n airsim-rl python=3.8
conda activate airsim-rl
pip install numpy pandas pyyaml msgpack-rpc-python
```

A suggested project layout:

```
/airsim-drone-rl
│── /configs
│   └── drone_config.yaml
│── /src
│   ├── environment.py
│   ├── q_learning.py
│   └── sarsa.py
└── main.py
```

## 2. Reinforcement Learning Basics

### 2.1 Q-learning Core

Q-learning is a classic off-policy algorithm. Its update rule is:

Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

The key Python implementation:

```python
from collections import defaultdict

import numpy as np


class QLearningAgent:
    def __init__(self, actions, lr=0.1, gamma=0.9, epsilon=0.1):
        # One row of Q-values per state, one entry per action
        self.q_table = defaultdict(lambda: np.zeros(len(actions)))
        self.actions = actions
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        # ε-greedy: explore with probability ε, otherwise act greedily
        if np.random.uniform() < self.epsilon:
            return np.random.choice(self.actions)
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        max_next_q = np.max(self.q_table[next_state])
        self.q_table[state][action] += self.lr * (
            reward + self.gamma * max_next_q - current_q
        )
```

Note: the exploration rate ε in the ε-greedy policy should decay as training progresses. A typical schedule is ε = ε_max - (ε_max - ε_min) * episode / max_episodes.

### 2.2 How Sarsa Differs

Unlike Q-learning, Sarsa is an on-policy algorithm. Its update rule is:

Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

The main difference in code is the `learn` method:

```python
class SarsaAgent(QLearningAgent):
    def learn(self, state, action, reward, next_state, next_action):
        current_q = self.q_table[state][action]
        # Sarsa bootstraps from the action that will actually be taken next
        next_q = self.q_table[next_state][next_action]
        self.q_table[state][action] += self.lr * (
            reward + self.gamma * next_q - current_q
        )
```

A comparison of the two algorithms:

| Characteristic | Q-learning | Sarsa |
| --- | --- | --- |
| Policy type | off-policy | on-policy |
| Exploration | more aggressive | more conservative |
| Convergence speed | faster | slower |
| Stability | lower | higher |

## 3. Wrapping the AirSim Drone Environment

### 3.1 Drone Control Interface

Create a `DroneEnv` class that wraps the basic AirSim operations:

```python
import airsim
import numpy as np
import yaml


class DroneEnv:
    def __init__(self, config_file):
        with open(config_file) as f:
            self.config = yaml.safe_load(f)
        self.client = airsim.MultirotorClient()
        self.client.confirmConnection()
        self.client.enableApiControl(True)
        self.client.armDisarm(True)
        self.actions = ["move_forward", "move_backward", "hover"]
        self.state_space = [...]  # define the state space

    def reset(self):
        self.client.reset()
        self.client.enableApiControl(True)  # reset() disables API control, so re-enable it
        self.client.takeoffAsync().join()
        return self._get_state()

    def step(self, action):
        if action == 0:    # move forward
            self.client.moveByVelocityAsync(2, 0, 0, 1).join()
        elif action == 1:  # move backward
            self.client.moveByVelocityAsync(-2, 0, 0, 1).join()
        else:              # hover
            self.client.hoverAsync().join()

        next_state = self._get_state()
        reward = self._calculate_reward(next_state)
        done = self._check_done(next_state)
        return next_state, reward, done
```

### 3.2 State Design and Reward Function

The drone state can be simplified to its position coordinates:

```python
def _get_state(self):
    kinematics = self.client.simGetGroundTruthKinematics()
    pos = kinematics.position
    return (round(pos.x_val, 1), round(pos.y_val, 1))
```

An example reward function:

```python
def _calculate_reward(self, state):
    target = (10, 0)  # target position
    distance = np.sqrt((state[0] - target[0]) ** 2 + (state[1] - target[1]) ** 2)
    if distance < 1.0:    # reached the target
        return 100.0
    elif state[0] < -5:   # out of bounds
        return -100.0
    else:                 # distance-shaped reward
        return -distance
```
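The `step()` method above also calls a `_check_done()` helper that the original listing never defines. A minimal sketch, assuming the episode should end at the same target and boundary used by `_calculate_reward` (the method body and thresholds here are assumptions, not part of the original article):

```python
def _check_done(self, state):
    # Hypothetical helper: end the episode when the drone reaches the
    # target used in _calculate_reward or crosses the same boundary.
    target = (10, 0)
    distance = np.sqrt((state[0] - target[0]) ** 2 + (state[1] - target[1]) ** 2)
    return distance < 1.0 or state[0] < -5
```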
## 4. Complete Training Workflow

### 4.1 Main Training Loop

Putting the environment and the agent together, the full training code is:

```python
def train_drone():
    env = DroneEnv("configs/drone_config.yaml")
    agent = QLearningAgent(actions=list(range(len(env.actions))))

    for episode in range(1000):
        state = env.reset()
        total_reward = 0
        while True:
            action = agent.choose_action(state)
            next_state, reward, done = env.step(action)
            agent.learn(state, action, reward, next_state)
            state = next_state
            total_reward += reward
            if done:
                break
        print(f"Episode {episode}, Total Reward: {total_reward}")
```

### 4.2 Training Visualization and Debugging

Add monitoring of training progress:

```python
import matplotlib.pyplot as plt

# Inside the training loop, after appending total_reward to rewards_history
if episode % 50 == 0:
    plt.plot(rewards_history)
    plt.title("Training Progress")
    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.savefig("training_progress.png")
```

Common issues and fixes:

- Drone does not move: check that API control is enabled
- Training does not converge: adjust the reward function or the learning rate
- Action execution fails: add a delay between steps

## 5. Directions for Further Optimization

### 5.1 Extending the State Space

The current implementation only uses the drone's position; it can be extended with velocity:

```python
def _get_state(self):
    kinematics = self.client.simGetGroundTruthKinematics()
    pos = kinematics.position
    vel = kinematics.linear_velocity
    return (
        round(pos.x_val, 1),
        round(pos.y_val, 1),
        round(vel.x_val, 1),
        round(vel.y_val, 1),
    )
```

### 5.2 Moving to Deep Reinforcement Learning

Replace the Q-table with a neural network:

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam


class DQNAgent:
    def __init__(self, state_dim, action_dim):
        self.model = self._build_model(state_dim, action_dim)

    def _build_model(self, state_dim, action_dim):
        # Two hidden layers mapping the state to one Q-value per action
        model = Sequential([
            Dense(64, input_dim=state_dim, activation="relu"),
            Dense(64, activation="relu"),
            Dense(action_dim)
        ])
        model.compile(loss="mse", optimizer=Adam(0.001))
        return model
```

### 5.3 Multi-Drone Cooperative Training

Modify the environment configuration to support multiple drones:

```yaml
# configs/drone_config.yaml
drones:
  - name: Drone1
    start_pos: [0, 0, 0]
  - name: Drone2
    start_pos: [0, 5, 0]
```

In real projects I have found that a dynamically adjusted learning rate works better: start with a relatively large learning rate (0.1-0.3) to speed up convergence, then gradually reduce it below 0.01 to improve stability. Another practical tip is to add a small time penalty (-0.1 per step) to the reward function, which noticeably reduces oscillation of the drone around the target point.
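A minimal sketch of those two tips, built on the `QLearningAgent` and `_calculate_reward` shown earlier; the helper name `decayed_lr`, the linear schedule, and the exact constants are illustrative choices rather than part of the original code:

```python
def decayed_lr(episode, max_episodes, lr_max=0.3, lr_min=0.01):
    # Linear decay from a large initial learning rate down to a small one
    return lr_max - (lr_max - lr_min) * episode / max_episodes

# Inside the training loop, before agent.learn(...):
#     agent.lr = decayed_lr(episode, max_episodes=1000)

def _calculate_reward(self, state):
    # Same reward as before, minus a small constant each step so that
    # lingering or oscillating near the target is discouraged.
    target = (10, 0)
    distance = np.sqrt((state[0] - target[0]) ** 2 + (state[1] - target[1]) ** 2)
    if distance < 1.0:
        return 100.0
    elif state[0] < -5:
        return -100.0
    return -distance - 0.1  # per-step time penalty
```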