DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Abstract

Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from an anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates a 10× reduction in denoising steps compared to the vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with an aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090.
Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions.

1. Introduction

End-to-end autonomous driving, which directly learns the driving policy from raw sensor inputs, has gained significant attention in recent years due to advancements in perception models (detection [4, 17, 24, 42], tracking [54–56], online mapping [27, 28, 32], etc.). This provides a scalable and robust alternative to traditional rule-based motion planning, which often struggles to generalize to complex real-world driving environments.

Figure 1. The comparison of different end-to-end paradigms. (a) Single-mode regression [7, 16, 20]. (b) Sampling from vocabulary [3, 25]. (c) Vanilla diffusion policy [6, 19]. (d) The proposed truncated diffusion policy.

To effectively learn from data, mainstream end-to-end planners (e.g., Transfuser [7], UniAD [16], VAD [20]) typically regress a single-mode trajectory from an ego-query as shown in Fig. 1a. However, this paradigm does not account for the inherent uncertainty and multi-mode nature of driving behaviors.
Recently, VADv2 [20] introduces a large fixed vocabulary of anchor trajectories (4096 anchors) to discretize the continuous action space and capture a broader range of driving behaviors, and then samples from these anchors based on predicted scores as shown in Fig. 1b. However, this large fixed-vocabulary paradigm is fundamentally constrained by the number and quality of anchor trajectories, often failing in out-of-vocabulary scenarios. Furthermore, managing a large number of anchors presents significant computational challenges for real-time applications. Rather than discretizing the action space, the diffusion model [6] has proven to be a powerful generative decision-making policy in the robotics domain, which can directly sample multi-mode, physically plausible actions from a Gaussian distribution via an iterative denoising process.

This inspires us to replicate the success of the diffusion model in the robotics domain in end-to-end autonomous driving. We apply the vanilla robotic diffusion policy to the well-known single-mode-regression method, Transfuser [7], by proposing a variant, Transfuser_DP, which replaces the deterministic MLP regression head with a conditional diffusion model [34]. Though Transfuser_DP improves planning performance, two major issues arise: 1) The 20 denoising steps in the vanilla DDIM diffusion policy introduce heavy computational consumption during inference, as shown in Tab. 2, hindering real-time application for autonomous driving.
2) The trajectories sampled from different Gaussian noises severely overlap with each other, as illustrated in Fig. 2. This underscores the non-trivial challenge of taming diffusion models for dynamic, open-world traffic scenes.

Unlike the vanilla diffusion policy, which samples actions from random Gaussian noise conditioned on scene context, human drivers adhere to established driving patterns that they dynamically adjust in response to real-time traffic conditions. This insight motivates us to embed these prior driving patterns into the diffusion policy by partitioning the Gaussian distribution into multiple sub-Gaussian distributions centered around prior anchors, referred to as the anchored Gaussian distribution. It is implemented by truncating the diffusion schedule to introduce a small portion of Gaussian noise around the prior anchors, as shown in Fig. 3. Thanks to the multi-mode distributional expressivity of the diffusion model, the proposed truncated diffusion policy effectively covers the potential action space without requiring a large set of fixed anchors, as VADv2 does.
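As an illustrative sketch (not the paper's implementation), the contrast between pure Gaussian initialization and the anchored Gaussian distribution can be expressed as follows; the anchor count, trajectory shape, and noise scale below are hypothetical:

```python
import numpy as np

def sample_pure_gaussian(num_samples, horizon, rng):
    # Vanilla diffusion policy: initial samples are pure Gaussian noise,
    # far from any plausible driving trajectory.
    return rng.standard_normal((num_samples, horizon, 2))

def sample_anchored_gaussian(anchors, noise_scale, num_samples, rng):
    # Truncated diffusion policy: each initial sample is a prior anchor
    # trajectory plus a small portion of Gaussian noise, i.e. a draw from
    # a sub-Gaussian centered on a randomly chosen anchor.
    idx = rng.integers(0, len(anchors), size=num_samples)
    noise = noise_scale * rng.standard_normal((num_samples,) + anchors.shape[1:])
    return anchors[idx] + noise

rng = np.random.default_rng(0)
# Hypothetical priors: 20 anchor trajectories, 8 future waypoints, (x, y) each.
anchors = 5.0 * rng.standard_normal((20, 8, 2))
init = sample_anchored_gaussian(anchors, noise_scale=0.5, num_samples=4, rng=rng)
print(init.shape)  # (4, 8, 2)
```

Because every initial sample already lies near a plausible driving mode, the denoiser only has to bridge a small gap, which is what lets the schedule be truncated to very few steps.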
With more reasonable initial noise samples from the anchored Gaussian distribution, we can truncate the denoising process, reducing the required steps from 20 to just 2, a substantial speedup that satisfies the real-time requirements of autonomous driving.

To enhance the interaction with conditional scene context, we propose an efficient transformer-based diffusion decoder that interacts not only with structured queries from the perception module but also with Bird's Eye View (BEV) and perspective view (PV) features through a sparse deformable attention mechanism [62]. Additionally, we introduce a cascade mechanism to iteratively refine the trajectory reconstruction within the diffusion decoder at each denoising step.

With these innovations, we present DiffusionDrive, a diffusion model for real-time end-to-end autonomous driving. We benchmark our method on the planning-oriented NAVSIM dataset [10] using non-reactive simulation and closed-loop evaluations. Without bells and whistles, DiffusionDrive achieves 88.1 PDMS on the NAVSIM navtest split with the aligned ResNet-34 backbone, significantly outperforming previous state-of-the-art methods. Even compared to the NAVSIM challenge-winning solution Hydra-MDP-V8192-W-EP [25], which follows
VADv2 with 8192 anchor trajectories and further incorporates post-processing and additional supervision, DiffusionDrive still outperforms it by 1.6 PDMS through directly learning from human demonstrations and inferring without post-processing, while running at a real-time speed of 45 FPS on an NVIDIA 4090. We further validate the superiority of DiffusionDrive on the popular nuScenes dataset [2] with open-loop evaluations: DiffusionDrive runs 1.8× faster than VAD [20] and outperforms it with 20.8% lower L2 error and 63.6% lower collision rate with the same ResNet-50 backbone, demonstrating state-of-the-art planning performance.

Our contributions can be summarized as follows:
- We firstly introduce the diffusion model to the field of end-to-end autonomous driving and propose a novel truncated diffusion policy to address the issues of mode collapse and heavy computational overhead found in direct adaptation of the vanilla diffusion policy to the traffic scene.
- We design an efficient transformer-based diffusion decoder that interacts with the conditional information in a cascaded manner for better trajectory reconstruction.
- Without bells and whistles, DiffusionDrive significantly outperforms previous state-of-the-art methods, achieving a record-breaking 88.1 PDMS on the NAVSIM navtest split with the same backbone, while maintaining real-time performance at 45 FPS on an NVIDIA 4090.
- We qualitatively demonstrate that DiffusionDrive can generate more diverse and plausible trajectories, exhibiting high-quality multi-mode driving actions in various challenging scenarios.

2. Related Work

End-to-end autonomous driving. UniAD [16], as a pioneering work, demonstrates the potential of end-to-end autonomous driving by integrating multiple perception tasks to enhance planning performance. VAD [20] further explores the use of compact vectorized scene representations to improve efficiency. Subsequently, a series of works [5, 7, 12, 23, 26, 43, 45, 58] have adopted the single-trajectory planning paradigm to enhance planning performance further. More recently, VADv2 [3] shifts the paradigm towards multi-mode planning by scoring and sampling from a large fixed vocabulary of anchor trajectories. Hydra-MDP [25] improves the scoring mechanism of VADv2 by introducing extra supervision from a rule-based
scorer. SparseDrive [39] explores an alternative BEV-free solution. Unlike existing multi-mode planning approaches, we propose a novel paradigm that leverages powerful generative diffusion models for end-to-end autonomous driving.

Figure 2. Qualitative comparison of Transfuser, Transfuser_DP, and DiffusionDrive on challenging scenes of the NAVSIM navtest split. With the same inputs from front cameras and LiDAR, DiffusionDrive achieves the highest planning quality of the top-1 scoring trajectory, as illustrated in Tab. 2. We render the highlighted diverse trajectories predicted by DiffusionDrive in the front view. (a) and (b) show that the top-1 scoring trajectory of DiffusionDrive closely matches the ground truth for both going straight and turning left. Additionally, DiffusionDrive's top-10 scoring trajectory demonstrates high-quality lane changing, an ability not observed in the multi-mode Transfuser_DP and impossible for Transfuser.

Diffusion model for traffic simulation. Driving diffusion policy has been explored in traffic simulation by leveraging only abstract perception ground truth [8, 18, 21, 44]. MotionDiffuser [21] and CTG [60] are pioneering applications of diffusion models for multi-agent motion prediction, using a conditional diffusion model to sample target trajectories from Gaussian noise. CTG++ [59] further
incorporates a large language model (LLM) for language-driven guidance, improving usability and enabling realistic traffic simulations. Diffusion-ES [48] replaces reward-gradient-guided denoising with evolutionary search. Moving beyond diffusion models limited to traffic simulation with perception ground truth, our approach unlocks the potential of diffusion models for real-time, end-to-end autonomous driving through our proposed truncated diffusion policy and efficient diffusion decoder.

Figure 3. Illustration of the truncated diffusion policy by comparing with the vanilla diffusion policy. We truncate the diffusion process and only add a small portion of Gaussian noise to diffuse the anchor trajectories. Then, we train the diffusion model to reconstruct the ground-truth trajectory from the anchored Gaussian distribution with conditional scene context. During inference, we also truncate the denoising process by starting from the better samples in the anchored Gaussian distribution rather than from pure Gaussian noise.

Diffusion model for robotic policy learning. Diffusion policy [6] demonstrates great potential in robotic policy learning, effectively capturing multi-mode action distributions and high-dimensional action spaces. Diffuser [19] proposes an unconditional diffusion model for trajectory sampling, incorporating techniques such as classifier-free guidance and image inpainting to achieve guided sampling. Subsequently, numerous works have applied diffusion models to various robotic tasks, including stationary manipulation [1, 53], mobile manipulation [47], autonomous navigation [37, 51], quadruped locomotion [38], and dexterous manipulation [46]. However, directly applying the vanilla diffusion policy to end-to-end autonomous driving poses unique challenges, as it requires real-time efficiency and the generation of plausible multi-mode trajectories in dynamic and open-world traffic scenes. In this work, we propose a novel truncated diffusion policy to address these challenges, introducing concepts that have not yet been explored in the robotics field.

Diffusion model for image generation. Diffusion models have been extensively adopted for image generation tasks [33, 36, 49, 50, 61]. DDIM [35] enhances DDPM [14] by enabling efficient sampling with significantly fewer steps based on non-Markovian diffusion processes. Flow matching [30, 31] further optimizes the generative process by directly modeling continuous probability flows. TDPM [57] proposes truncated denoising, which initiates the generation process from an implicit intermediate distribution to accelerate sampling. In contrast to these approaches, our method introduces an explicit driving prior within the diffusion policy, effectively guiding the diffusion process toward more accurate and efficient generation tailored specifically for end-to-end autonomous driving.

3. Method
3.1. Preliminary

Task formulation. End-to-end autonomous driving takes raw sensor data as input and predicts the future trajectory of the ego-vehicle. The trajectory is represented as a sequence of waypoints $\tau = \{ (x_t, y_t) \}_{t=1}^{T_f}$, where $T_f$ denotes the planning horizon, and $(x_t, y_t)$ is the location of each waypoint at time $t$ in the current ego-vehicle coordinate system.

Conditional diffusion model. The conditional diffusion model poses a forward diffusion process as gradually adding noise to the data sample, which can be defined as:

$$q\left(\tau^i \mid \tau^0\right) = \mathcal{N}\left(\tau^i; \sqrt{\bar{\alpha}^i}\,\tau^0, \left(1 - \bar{\alpha}^i\right)\mathbf{I}\right),$$

where $\tau^0$ is the clean data sample, $\tau^i$ is the data sample with noise at timestep $i$ (note: we use superscript $i$ to denote the diffusion timestep), $\bar{\alpha}^i = \prod_{s=1}^{i} \alpha^s = \prod_{s=1}^{i} (1 - \beta^s)$, and $\beta^s$ is the noise schedule. We train the denoising process model $f_{\theta}(\tau^i, z, i)$ to predict $\tau^0$ from $\tau^i$.
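The forward corruption defined by the equation above can be sketched numerically as follows; the linear beta schedule and step count here are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def make_alpha_bar(betas):
    # alpha_bar^i = prod_{s<=i} alpha^s = prod_{s<=i} (1 - beta^s)
    return np.cumprod(1.0 - betas)

def q_sample(tau0, i, alpha_bar, rng):
    # Draw tau^i ~ N(sqrt(alpha_bar^i) * tau^0, (1 - alpha_bar^i) * I)
    # via the reparameterization tau^i = sqrt(a)*tau^0 + sqrt(1-a)*eps.
    a = alpha_bar[i]
    eps = rng.standard_normal(tau0.shape)
    return np.sqrt(a) * tau0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)   # assumed linear noise schedule
alpha_bar = make_alpha_bar(betas)
tau0 = rng.standard_normal((8, 2))    # a clean trajectory of 8 (x, y) waypoints
tau_noisy = q_sample(tau0, i=49, alpha_bar=alpha_bar, rng=rng)
print(tau_noisy.shape)  # (8, 2)
```

Truncating the schedule corresponds to only ever sampling at small $i$, where $\bar{\alpha}^i$ is close to 1 and the noisy trajectory stays near the clean one.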