CANN EasyAsc DSL a2 Cube-Vec-Cube-Vec Pattern


张建站

2026/6/3 4:35:15

10 min read

# a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Normalized Online Softmax)

[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development that improve developer productivity; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

- one cube stage that produces a score tile
- vec logic that updates running row max and running row sum
- a later cube stage that consumes the delayed probability tile
- a final vec stage that accumulates the delayed cube output
- one final vec-only divide by the accumulated row sum

Typical target formula:

    score_j   = q.float() @ k_j.float().t() * scale
    curr_m    = maximum(prev_m, rowmax(score_j))
    expdiff_j = exp(prev_m - curr_m)
    p_j       = exp(score_j - curr_m)
    row_sum   = row_sum * expdiff_j + p_j.sum(-1)
    pv_j      = p_j.half().float() @ v_j.float()
    out       = out * expdiff_j + pv_j
    out       = out / row_sum

The last line runs once, after every tile has been accumulated; all other lines run per tile `j`.

This is the normalized counterpart to `a2-cube-vec-cube-vec.md`. Use that older pattern only when the kernel stops at the unnormalized numerator.

## One-page route for the common case

If this file matches your contract, do **not** preload all of:

- `agent/references/constraints/reduction.md`
- `agent/references/constraints/vec-reduction-a2.md`
- `agent/references/constraints/vec-stride.md`
- `agent/references/constraints/online-softmax-tail.md`

This page now owns the common normalized-online-softmax authoring rules.
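The target formula at the top of this page can be checked on the host before any kernel is written. The following is a minimal pure-Python sketch for a single query row, not EasyAsc code: the helper names (`to_half`, `dot`, `streamed_attention_row`, `full_attention_row`) are hypothetical, Python floats (double precision) stand in for the kernel's float type, and `struct`'s `"e"` format is used to mimic the `p_j.half().float()` round trip.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE float16 and back, like p_j.half().float()."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def streamed_attention_row(q, k_tiles, v_tiles, scale):
    """Apply the per-tile update to one query row, then divide once at the end."""
    prev_m = -math.inf                    # running row max
    row_sum = 0.0                         # running row sum, kept in float
    out = [0.0] * len(v_tiles[0][0])      # unnormalized numerator
    for k_tile, v_tile in zip(k_tiles, v_tiles):
        score = [dot(q, k) * scale for k in k_tile]
        curr_m = max(prev_m, max(score))
        expdiff = math.exp(prev_m - curr_m)         # exp(-inf) is 0.0 on the first tile
        p = [math.exp(s - curr_m) for s in score]   # float probability tile
        row_sum = row_sum * expdiff + sum(p)        # updated from float p, before the cast
        p_half = [to_half(x) for x in p]            # cast only for the p @ v product
        pv = [dot(p_half, [v[d] for v in v_tile]) for d in range(len(out))]
        out = [o * expdiff + x for o, x in zip(out, pv)]
        prev_m = curr_m
    return [o / row_sum for o in out]               # single final divide


def full_attention_row(q, keys, values, scale):
    """Non-streamed softmax(q k^T * scale) @ v for the same row, without the half cast."""
    score = [dot(q, k) * scale for k in keys]
    m = max(score)
    p = [math.exp(s - m) for s in score]
    total = sum(p)
    return [dot(p, [v[d] for v in values]) / total for d in range(len(values[0]))]
```

Splitting the keys and values into tiles and comparing `streamed_attention_row` with `full_attention_row` should agree up to the float16 rounding of `p_j`, which is the only intentional difference between the two.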
Open the smaller constraint pages only when a specific failure mode still remains unclear after this file.

## Why this needs its own a2 pattern

The a2 hardware constraints are the same as the unnormalized case:

- cube -> vec cannot use `l0c_to_ub`
- vec -> cube cannot use `ub_to_l1_*`
- delayed cube output must come back to vec for final accumulation

But normalized online softmax adds two stability-sensitive requirements:

- running `row_sum` must be updated from the float `exp(...)` tile before any cast to half
- the final divide must happen only once, after all delayed numerator tiles have been accumulated

So the stable a2 flow is:

    GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score)
    -> vec(max, expdiff, exp, row_sum, cast p) -> GM(p_ws) -> L1 -> L0 -> L0C(pv)
    -> GM(pv_ws) -> UB(pv) -> UB(accum) -> final UB divide by row_sum -> GM(out)

## Workspaces and ownership edges

Use the same three GM workspaces as the unnormalized pattern:

- `score_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `L0C(score) -> UB(score)`
- `p_ws`
  - dtype: `half`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `UB(p_j.half()) -> L1(p_j)`
- `pv_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, D]`
  - purpose: `L0C(pv_j) -> UB(pv_j)`

Ownership edges:

- stage 1 cube -> vec: `CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`
- stage 1 vec -> stage 2 cube: `VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`
- stage 2 cube -> stage 3 vec: `CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`

## Stable schedule

Use the same one-tile lookahead loop as the unnormalized pattern:

    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni
            pass
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1
            pass

That gives:

- warmup: first iteration only produces
- steady state: produce `j` while consuming `j - 1`
- drain: final iteration only consumes the last delayed tile

## Shared `L0C` rule

Reuse one physical `L0C` family across the two cube stages.

This is the same capacity-driven choice as the unnormalized pattern:

- stage 1 needs float `[TILE_M, TILE_N]`
- stage 2 needs float `[TILE_M, D]` with validated `D = 128`
- a2 still has only `128 KB` of `L0C`

Keep one shared `l0c_cnt`, but do not merge unrelated counters just because `L0C` is shared.

## Counter layout

Keep these lifetimes separate:

- `l1qk_cnt`: stage-1 `q/k` loads
- `l1pv_cnt`: stage-2 `p/v` loads
- `l0c_cnt`: shared physical `L0C` family across the two cube stages
- `stage1_cnt`: delayed slot rhythm for `score_ws`, `p_ws`, and `expdiff`
- `stage2_cnt`: delayed slot rhythm for `p_ws` consumption and `pv_ws`

Running `row_sum` does not need its own delayed counter. It stays vec-resident for the whole inner loop and updates immediately in stage 1.

## Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

- running row max: `[HALF_M, 1]`
- running row sum: `[HALF_M, 1]`
- delayed `expdiff` slots: `DBuff(DT.float, [HALF_M, 1], Position.UB)`
- final numerator accumulation: `[HALF_M, D]`

Use `GetSubBlockIdx()` so each vec lane owns only its own `HALF_M` rows.

## Stable stage-1 update order

The normalized online update order matters:

1. compute `rowmax(score_j)` in `[HALF_M, 1]`
2. snapshot `prev_m` into the delayed `expdiff` slot with `add(..., zero)`
3. update `running_max = maximum(running_max, tile_max)`
4. turn the delayed slot into `exp(prev_m - curr_m)`
5. broadcast `running_max` and subtract from the score tile
6. compute the float probability tile `p_j = exp(score_j - curr_m)`
7. reduce `sum_j` from that float tile with `add` + `cadd`
8. update `running_sum = running_sum * expdiff_j + sum_j` in `[HALF_M, 1]`
9. cast `p_j` to `half` only now, because stage 2 wants the exact `p_j.half().float()` contract

Do not move the row-sum update after the cast.
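The stage-1 update order can be replayed on the host for a single row to see why the row sum must be taken before the cast. This is a minimal pure-Python sketch, not EasyAsc code: `stage1_row_update` and `to_half` are hypothetical names, Python floats stand in for the kernel's float type, and the UB broadcast and reduction calls are replaced by ordinary list operations.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE float16 and back, like p_j.half().float()."""
    return struct.unpack("e", struct.pack("e", x))[0]


def stage1_row_update(running_max, running_sum, score_row):
    """One stage-1 step for one row; returns (new_max, new_sum, expdiff, p_half)."""
    tile_max = max(score_row)                            # row max of this tile
    prev_m = running_max                                 # snapshot before the max update
    running_max = max(running_max, tile_max)             # running max update
    expdiff = math.exp(prev_m - running_max)             # delayed rescale factor
    p = [math.exp(s - running_max) for s in score_row]   # float probability tile
    tile_sum = sum(p)                                    # reduced from the float tile
    running_sum = running_sum * expdiff + tile_sum       # row-sum update, still float
    p_half = [to_half(x) for x in p]                     # cast last, only for stage 2
    return running_max, running_sum, expdiff, p_half
```

Calling `stage1_row_update(-math.inf, 0.0, [0.0, -0.1])` and comparing the returned row sum with `sum(p_half)` shows a small but nonzero gap, because float16 rounding has already been applied to `p_half`. Summing after the cast would therefore produce a different denominator than the reference formula.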
Moving the row-sum update after the cast would silently change the reference contract.

## Vec rules you usually need without extra docs

For the common `TILE_N = 128`, `D = 128` path, the usual extra questions are already answered here:

- keep `running_max`, `running_sum`, and delayed `expdiff` in scalar format `[HALF_M, 1]`
- snapshot scalar state with `add(dst, src, zero)`, not `ub_to_ub`
- `cmax` / `cadd` output dense scalars, so broadcast them with `brcb(dst, src, dst_blk_stride=1, dst_rep_stride=8)`
- when a wide `[HALF_M, 128]` buffer is paired with a narrow `[HALF_M, 8]` broadcast row, operate on `buf[:, 0:64]` and `buf[:, 64:128]` rather than on the full 128-column view in one vec call
- update `running_sum` from the float `p_j` tile before any cast to `half` or `hif8`
- for non-aligned `S2`, invalidate score columns before `cmax` with a sufficiently negative finite sentinel; `valid_n` on the GM load alone is not enough

These six rules cover the usual reasons people would otherwise open the separate reduction, vec-reduction, vec-stride, and tail files.

## Critical scalar-state rule on a2

Do **not** copy `[HALF_M, 1]` scalar-format state with `ub_to_ub`.

That applies to both:

- `prev_m`
- any temporary scalar snapshot you might be tempted to use for `row_sum`

Use `add(dst, src, zero)` for scalar-format copies, and keep both `running_max` and `running_sum` in `[M,1]` format until you explicitly need a broadcast.

## Final vec accumulation and divide

Stage 3 still matches the unnormalized pattern:

1. load delayed `pv_j` back into UB
2. `brcb` the delayed `expdiff` slot to `[HALF_M, 8]`
3. scale the two 64-column halves of `accum`
4. `add(accum, accum, pv_j)`

After the inner loop finishes:

1. `brcb` the final `running_sum` to `[HALF_M, 8]`
2. `div(accum[:, 0:64], accum[:, 0:64], row_sum_broadcast)`
3. `div(accum[:, 64:128], accum[:, 64:128], row_sum_broadcast)`
4. write the normalized result to GM

Why the divide happens at the end:

- `accum` must finish all delayed `pv_j` contributions first
- `row_sum` is the denominator for the whole streamed softmax, not one tile

## Extending the pattern to non-aligned `S2`

The initial validated contract for this pattern kept `S2 % 128 == 0` so the first implementation could ignore score-tail masking.

When `S2` is not aligned, do **not** stop at GM-boundary `valid_n` slicing. For normalized online softmax, padded score columns can still corrupt:

- `rowmax(score_j)`
- `curr_m`
- delayed `expdiff`
- `row_sum`

Stable rule:

- load `k/v` through `valid_n`
- keep local score buffers full-sized
- before `cmax`, force invalid score columns to behave like `-inf`
- when materializing that mask, use a sufficiently large finite negative fill value instead of literal `-inf`
- after `exp`, those same columns naturally behave like `0`

For the current `TILE_N = 128` layout, the simplest a2 implementation is:

- split the score tile into two `[HALF_M, 64]` halves
- use vec mask + finite-negative `dup(...)` on the affected half
- recompute `prev_valid_n` for the delayed `v` load in stage 2

Read next for the exact rule and mask-construction trick:

- `agent/references/constraints/online-softmax-tail.md`

## Validation target

Keep the first validated contract narrow:

- `D = 128`
- `S1 % 128 == 0`
- `S2 % 128 == 0`
- input `q/k/v` are `float16`
- output is `float32`

Suggested cases:

- `(1, 3, 256, 256, 128)` for the smallest two-tile online update
- `(1, 1, 256, 512, 128)`
- `(1, 3, 256, 512, 128)`
- `(1, 3, 2048, 4096, 128)`

For non-aligned `S2` extensions, add at least:

- one aligned baseline: `S2 % 128 == 0`
- one left-half tail: `S2 % 128 == 10`
- one cross-boundary case: `S2 % 128 == 65`
- one mid-right-half case: `S2 % 128 == 96`
- one last-column case: `S2 % 128 == 127`

## Files to study / deeper fallbacks

- `agent/example/kernels/a2/flash_attn_full.py`
- `agent/example/kernels/a2/flash_attn_unnorm.py`
- `agent/example/kernels/a2/flash_attn_score_pv.py`
- `agent/references/patterns/a2-cube-vec-cube-vec.md`
- `agent/references/constraints/reduction.md`: fallback only when the online update order is still unclear
- `agent/references/constraints/vec-reduction-a2.md`: fallback only when the `cmax` / `cadd` -> `brcb` detail is still unclear
- `agent/references/constraints/vec-stride.md`: fallback only when a sliced wide/narrow vec op is still unclear
- `agent/references/constraints/online-softmax-tail.md`: fallback only when the non-aligned `S2` mask construction itself is the question

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
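The tail rule for non-aligned `S2` (mask before the row max, with a finite negative fill) can be illustrated on the host. This is a minimal pure-Python sketch under stated assumptions: `NEG_FILL`, `mask_tail`, and `row_max_and_sum` are hypothetical names, the fill value is an arbitrary illustration rather than the value the kernels use, and list operations stand in for the vec mask and `dup(...)` calls.

```python
import math

# Hypothetical finite sentinel for illustration only; the kernel's value may differ.
NEG_FILL = -1.0e30


def mask_tail(score_row, valid_n, fill=NEG_FILL):
    """Keep the row full-sized but overwrite columns >= valid_n with `fill`."""
    return [s if col < valid_n else fill for col, s in enumerate(score_row)]


def row_max_and_sum(score_row):
    """Return (row max, sum of exp(score - max), exp tile) for one row."""
    m = max(score_row)
    p = [math.exp(s - m) for s in score_row]
    return m, sum(p), p


# Two valid columns followed by two padded columns holding stale data.
padded = [0.5, -1.0, 9.0, 9.0]

bad_max, bad_sum, _ = row_max_and_sum(padded)                  # padding wins the max
good_max, good_sum, good_p = row_max_and_sum(mask_tail(padded, valid_n=2))
```

With the mask applied, `good_max` and `good_sum` match what the two valid columns alone would give, and the masked columns contribute exactly `0.0` after `exp`. As a general floating-point fact (not a reason stated by the source), a literal `-inf` fill is fragile because `(-inf) - (-inf)` is NaN, which a fully masked row would hit when computing `prev_m - curr_m`.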
