CANN Autosync Constraints Guide

张建站

2026/5/9 12:10:42

10 min read

Autosync Constraints

[Free download link] cannbot-skills: CANNBot is a series of agents for improving CANN development efficiency, and this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when debugging or designing `auto_sync()` behavior. Do not use it as a general synchronization guide for every kernel.

Goal

Use `auto_sync()` correctly as a same-side queueing tool. Do not confuse it with cross-side ownership transfer.

1. Core rule

`auto_sync()` is only a same-side pipeline-ordering mechanism. It does not transfer ownership between cube and vec by itself.

Practical consequence:

- same-side queueing may use `auto_sync()`
- cube -> vec handoff still needs explicit `CvMutex`
- vec -> cube handoff still needs explicit `VcMutex`
- there is no autosync-managed `V -> MTE2` family, so a vec op such as `dup(...)` does not become ordered ahead of a later `gm_to_ub_pad` just because both are inside the same `auto_sync()` region

2. Mental model

Treat `auto_sync()` as:

- marker emission during authoring
- event synthesis later during split/lowering

Current high-level flow:

- authoring inserts `start_auto_sync` / `end_auto_sync`
- split translation rewrites cube and vec instruction lists
- autosync inserts event declarations plus `event_wait` / `event_set`
- duplicate event declarations are deduplicated later

`start_auto_sync` and `end_auto_sync` themselves are not the synchronization. The later synthesized events are.

3. Current hard-coded pipe pairs

Vec-side pairs:

- `MTE2 -> V` as `ubin`
- `V -> MTE3` as `ubout`

Cube-side pairs:

- `MTE2 -> MTE1` as `l1`
- `MTE1 -> M` as `l0`
- `M -> FIX` as `fix`

If an instruction is not classified into one of the supported pipe families, current `auto_sync()` will not reason about it.

Exact op-to-pipe mapping used by `auto_sync()`:

- `Pipe.MTE2`: `gm_to_l1_nd2nz`, `set_constant_to_l1`, `gm_to_ub_pad`
- `Pipe.MTE1`: `l1_to_l0`
- `Pipe.M`: `mmad`
- `Pipe.FIX`: `l0c_to_gm_nz2nd`, `l0c_to_l1`, `l0c_to_ub`
- `Pipe.MTE3`: `ub_to_gm_pad`, `ub_to_l1_nd2nz`, `ub_to_l1_nz`
- `Pipe.V`: remaining vec allowlist after subtracting the memory ops above

Because `gm_to_ub_pad`, `ub_to_l1_nd2nz`, `ub_to_l1_nz`, and `ub_to_gm_pad` are reassigned to `MTE2` / `MTE3`, they do not stay on `Pipe.V` for autosync matching.

Practical tail example:

- `dup(ub, 0.0)` runs on `V`
- a load from `gm[...]` into `ub[...]` lowers to `gm_to_ub_pad` on `MTE2`
- `auto_sync()` does not create a `V -> MTE2` ordering edge for that pair

If correctness depends on the zero-fill finishing before the reload starts, do not assume `auto_sync()` will serialize it. On the simulator path, the local barrier that actually drains cross-pipe work is `bar_all()`; single-pipe barriers such as `bar_v()` are no-ops there.

4. Event model

Each autosync-managed slot behaves like a two-token handshake:

- `ready`: producer -> consumer
- `valid`: consumer -> producer

Stable mental contract:

- producer waits on `valid`
- producer writes the slot
- producer sets `ready`
- consumer waits on `ready`
- consumer reads the slot
- consumer sets `valid`

If your kernel logic does not fit this contract, `auto_sync()` is probably the wrong tool for that edge.
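
As a rough illustration of the section 3 classification, the sketch below restates the op-to-pipe table and the five hard-coded pairs as plain Python dictionaries. It is not the easyasc implementation: `OP_TO_PIPE`, `SUPPORTED_PAIRS`, `pipe_of`, and `autosync_family` are hypothetical names, pipes are plain strings rather than `Pipe` members, and only two vec ops (`dup`, `ub_to_ub`) stand in for the real vec allowlist.

```python
from typing import Optional

# Op-to-pipe table restated from section 3 (pipes as plain strings).
OP_TO_PIPE = {
    "gm_to_l1_nd2nz": "MTE2",
    "set_constant_to_l1": "MTE2",
    "gm_to_ub_pad": "MTE2",
    "l1_to_l0": "MTE1",
    "mmad": "M",
    "l0c_to_gm_nz2nd": "FIX",
    "l0c_to_l1": "FIX",
    "l0c_to_ub": "FIX",
    "ub_to_gm_pad": "MTE3",
    "ub_to_l1_nd2nz": "MTE3",
    "ub_to_l1_nz": "MTE3",
    # Assumption: two vec ops the guide places on V; the real allowlist is larger.
    "dup": "V",
    "ub_to_ub": "V",
}

# (side, producer pipe, consumer pipe) -> family name, from section 3.
SUPPORTED_PAIRS = {
    ("vec", "MTE2", "V"): "ubin",
    ("vec", "V", "MTE3"): "ubout",
    ("cube", "MTE2", "MTE1"): "l1",
    ("cube", "MTE1", "M"): "l0",
    ("cube", "M", "FIX"): "fix",
}


def pipe_of(op_name: str) -> Optional[str]:
    """Return the pipe for an op, or None if the sketch does not classify it."""
    return OP_TO_PIPE.get(op_name)


def autosync_family(side: str, producer_op: str, consumer_op: str) -> Optional[str]:
    """Return the family name for a producer/consumer op pair, or None if unsupported."""
    src, dst = pipe_of(producer_op), pipe_of(consumer_op)
    if src is None or dst is None:
        return None
    return SUPPORTED_PAIRS.get((side, src, dst))


if __name__ == "__main__":
    # Load then vec op: a supported MTE2 -> V edge.
    print(autosync_family("vec", "gm_to_ub_pad", "dup"))
    # Vec op then load: there is no V -> MTE2 family, so nothing orders this edge.
    print(autosync_family("vec", "dup", "gm_to_ub_pad"))
```

The second lookup returning `None` is the tail example from section 3: the zero-fill and the later reload are not ordered by autosync, so that edge needs a different mechanism.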

5. Event metadata and naming

Synthesized event metadata rule:

- `ready`: name prefix `_tmp_{s|d}event_ready_`, `src_pipe` = producer pipe, `dst_pipe` = consumer pipe, `preset` = False, `idx` = 9998
- `valid`: name prefix `_tmp_{s|d}event_valid_`, `src_pipe` = consumer pipe, `dst_pipe` = producer pipe, `preset` = True, `idx` = 9999

Generated family names:

- vec `MTE2 -> V`: `_tmp_{s|d}event_{ready|valid}_ubin_<idx>`
- vec `V -> MTE3`: `_tmp_{s|d}event_{ready|valid}_ubout_<idx>`
- cube `MTE2 -> MTE1`: `_tmp_{s|d}event_{ready|valid}_l1_<idx>`
- cube `MTE1 -> M`: `_tmp_{s|d}event_{ready|valid}_l0_<idx>`
- cube `M -> FIX`: `_tmp_{s|d}event_{ready|valid}_fix_<idx>`

Read generated code literally. For example:

`DEvent<PIPE_V, PIPE_MTE2, true> _tmp_devent_valid_ubin_0;`

means the vec-side `ubin` slot-0 `valid` token; `V` is the consumer pipe, `MTE2` is the producer pipe.

6. Sync grouping is not just by buffer name

5a. Practical state machine for same-side reuse

When debugging generated `event_wait` / `event_set`, a compact state machine is often faster than staring at the whole region.

Useful mental states:

- `idle`: no producer token is currently held
- `producing`: producer already reacquired `valid` for the current round
- `consumed`: consumer already finished the current round and the next producer round must first close and reacquire `valid`

Useful transitions:

- first `src_pipe` of a round: `idle -> producing` via `valid.wait`
- matching `dst_pipe` of that round: `producing -> consumed` via `ready.set` then `ready.wait`
- next `src_pipe` after a completed round: `consumed -> producing` via `valid.set` then `valid.wait`

Practical consequence:

- for a `src -> dst -> src -> dst` pattern on the same autosync-managed family, the second producer round must reacquire `valid`
- if codegen shows only `valid.set` before the second producer round, the autosync insertion is unbalanced for slot reuse even if no warning is printed

Concrete cube-side example to look for:

- expected shape: `valid.wait ... ready.set/ready.wait ... valid.set valid.wait ...`
- suspicious shape: `valid.wait ... ready.set/ready.wait ... valid.set ...`

This check is especially useful for:

- repeated `l1_to_l0 -> mmad -> writeback` inside one inner loop
- repeated `mmad -> FIX` stages that reuse the same `DBuff` / `TBuff` family

Repository note:

- `testcases/parser/sync/test_autosync_event_metadata.py` contains a regression that checks the `src -> dst -> src -> dst` case reacquires `valid` on the second round
- the current implementation path is `easyasc/parser/asc_autosync.py`

Current grouping is based on normalized sync-key groups built from the participating tensors. That means the event family index depends on:

- logical buffer lineage
- slot expression
- whether the view comes from `DBuff` / `TBuff` or a plain `Tensor`

Pair-specific sync-key collection:

- vec `MTE2 -> V` (`ubin`): producer key from `gm_to_ub_pad` destination; consumer key from vec-op source
- vec `V -> MTE3` (`ubout`): producer key from vec-op destination; consumer key from `ub_to_*` source
- cube `MTE2 -> MTE1` (`l1`): producer key from `gm_to_l1_nd2nz` destination; consumer key from `l1_to_l0` source
- cube `MTE1 -> M` (`l0`): producer key from `l1_to_l0` destination; consumer key from `mmad` sources `src_a` / `src_b`
- cube `M -> FIX` (`fix`): producer key from `mmad` destination; consumer key from FIX-side source
- `call_micro` special case: if treated as source pipe `V`, uses `write_sync_keys`; if destination pipe `V`, uses `read_sync_keys`

Sync-key derivation (from `easyasc/utils/sync_key.py`):

- `DBuff` / `TBuff` view key shape: `(buffer, buffer_type, position, buffer_name, source_index_token)`
- plain `Tensor` view key shape: `(tensor, position, tensor_name)`
- scalar token rule: concrete `Var.value` uses numeric value; symbolic `Var` uses variable name; `None` becomes empty token
- temporary tensors with `_tmp_` prefix are ignored for grouping
- all keys are deduplicated, sorted, and packed into one normalized `SyncKeyGroup`
- if no usable keys remain, fallback is `(scope, buf_name, str(id(node)))`

Practical consequence:

- same buffer family name does not automatically mean same autosync event family
- changing slot lineage can silently create a different event family
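
The sync-key derivation rules above can be sketched in plain Python. This is only an illustration of the listed rules, not the code in `easyasc/utils/sync_key.py`: the `Var` and `View` dataclasses, `scalar_token`, `view_key`, and `sync_key_group` are hypothetical stand-ins, the leading `"buffer"` / `"tensor"` tags are assumed to be literal strings, and numeric slot values are stringified so that mixed keys stay sortable.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Var:
    """Hypothetical scalar: concrete when value is set, symbolic otherwise."""
    name: str
    value: Optional[int] = None


@dataclass(frozen=True)
class View:
    """Hypothetical tensor view taking part in a pipe pair."""
    name: str
    position: str                      # for example "UB" or "L1"
    buffer_type: Optional[str] = None  # "DBuff" / "TBuff"; None means plain Tensor
    slot: Optional[Var] = None         # slot expression for DBuff / TBuff views


def scalar_token(var: Optional[Var]) -> str:
    """Concrete value -> its number (as text), symbolic -> name, None -> empty token."""
    if var is None:
        return ""
    if var.value is not None:
        return str(var.value)
    return var.name


def view_key(view: View) -> Tuple[str, ...]:
    """Buffer views carry the slot token; plain Tensor views do not."""
    if view.buffer_type is not None:
        return ("buffer", view.buffer_type, view.position, view.name,
                scalar_token(view.slot))
    return ("tensor", view.position, view.name)


def sync_key_group(views: Iterable[View], scope: str, buf_name: str,
                   node: object) -> Tuple[Tuple[str, ...], ...]:
    """Drop _tmp_ tensors, deduplicate, sort, and fall back when nothing is left."""
    keys = {view_key(v) for v in views if not v.name.startswith("_tmp_")}
    if not keys:
        return ((scope, buf_name, str(id(node))),)
    return tuple(sorted(keys))


if __name__ == "__main__":
    shared = View("ub_buf", "UB", "DBuff", Var("cnt"))
    other_slot = View("ub_buf", "UB", "DBuff", Var("other_cnt"))
    node = object()
    # Same buffer name, different slot lineage: the groups differ.
    print(sync_key_group([shared], "s", "ub_buf", node)
          == sync_key_group([other_slot], "s", "ub_buf", node))
```

The final comparison prints `False`, which is the practical consequence stated above: a shared buffer family name alone does not put two stages into the same event family once the slot token differs.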

5b. Workspace-mediated MTE2-reload reuse rule

Use one shared local `DBuff` family plus one shared slot counter only when the real same-side reuse edge is a workspace-mediated `MTE2` reload into local storage.

Conclusion:

- this is not an a2-only rule
- this is not a vec-only rule
- the deciding factor is whether multiple stages are being matched as the same `MTE2 -> consumer` autosync family after reloading from workspace / GM
- current repository families with an `MTE2` producer include:
  - vec `MTE2 -> V` (`ubin`)
  - cube `MTE2 -> MTE1` (`l1`)
  - the rarer vec-side `MTE2 -> MTE3` (`ubrelay`) when a region really has only that pair

Logic chain:

- `auto_sync()` groups by pipe pair first, not by "these two stages happen to increment the same counter".
- The producer side must really be an `MTE2` read op such as `gm_to_ub_pad` or `gm_to_l1_nd2nz`, meaning the local slot is being refilled from workspace / GM.
- Family reuse only happens when the normalized sync-key group also matches. For `DBuff` / `TBuff`, that key includes the slot token `source_index_token`; for plain `Tensor`, it does not.
- Therefore "flag and buffer count move together" is a necessary bookkeeping consequence, not the root cause. The counter matters because it stabilizes the slot token for the same local buffer family under the same `MTE2` producer family.
- If the producer is not `MTE2`, this rule does not apply just because a local buffer is being reused.

Counterexample:

- `l0c_to_ub` is classified as `Pipe.FIX`, not `Pipe.MTE2`
- so a `FIX -> consumer` edge does not become an `MTE2` reload family merely by sharing a `DBuff` name or co-incrementing a counter
- in those cases the primary correctness mechanism is still the real `FIX`-side ownership / handoff model, for example `CvMutex(..., src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.V)` on cube -> vec paths

Practical authoring rule:

- use shared-slot `DBuff` + count reuse when you are intentionally preserving one workspace-mediated `MTE2` reload lineage across stages
- do not cargo-cult that pattern into intra-core transfers just because the temporary tensors look similar

Event-family index assignment in nested code:

- if a node is not mixed-scope, it does not allocate a new event family
- if a mixed-scope node has a sync-key group not yet seen, it gets the next new `idx`
- if a later mixed-scope node has the same normalized group, it reuses the same `idx`
- a parent preload plus child mixed pipeline can create multiple `ubin` families when buffer groups differ
- a child reusing the same `ubout` stream stays on `_ubout_0` without needing `_ubout_1`

6. `SEvent` vs `DEvent`

Do not assume `auto_sync()` always produces `DEvent`. Current rule:

- single-buffer style views may produce `SEvent`
- slot-buffer traffic usually produces `DEvent`

Plain `Tensor` views can collapse a stage into `SEvent` behavior.

Concrete comparison in this repo:

- `agent/example/kernels/a2/flash_attn_full_pj_hif8.py` keeps stage-1 `ub_score` and stage-2 `ub_pv` as separate plain `Tensor` scratch views, so the vec-side `ubin` story is the simpler baseline to read first
- `agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py` replaces those two views with one shared `DBuff` family plus `score_pv_cnt`; that gives the vec `MTE2 -> V` `ubin` edge stable slot lineage, which is why generated code can move to `_tmp_devent_*_ubin_*` queueing
- `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` and `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` now reuse that same shared local slot family, so their causal-mask variants inherit the same vec-side queueing structure instead of staying on the older plain-`Tensor` scratch path
- for these specific kernels, the improvement point is on the vec `ubin` edge, not on cube-side `l1` / `l0` / `fix`: those cube families were already `DBuff`-based in both kernels
- `stage1_cnt` and `stage2_cnt` still stay separate; the shared scratch gets its own counter because the queueing gain comes from the local slot family, not from merging delayed-stage lifetimes

7. When a region actually gets a handshake

A region only receives a full autosync handshake when both sides of the current pipe pair appear inside that region. If a region contains only the producer or only the consumer side, it does not own a complete handshake by itself.

Practical consequence:

- one large outer `auto_sync()` block may produce events only around some inner loops or branches
- nested code may own the real handshake instead of the parent block

8. Nested-region rule

If a child region already owns both sides of a pipe pair, the parent should not wrap it again with another full handshake. Otherwise you get dangling or duplicated token flow.

When parent and child both seem to control the same edge, suspect the region structure first.

8a. Event-pairing debug workflow

When debugging a concrete local event failure, inspect the autosync-expanded instruction stream before changing the kernel body.

Recommended workflow:

- Build `kernel.instructions` first by calling the `kernel` once with placeholder `GMTensor(...)` arguments.
- Run:
  - `split_instructions(...)`
  - `insert_auto_sync(..., mode="cube" | "vec")`
  - `eliminate_duplicated_event_creation(...)`
- Filter the expanded instructions down to one failing family:
  - cube: `l1`, `l0`, `fix`
  - vec: `ubin`, `ubout`, `ubrelay`
- Print only:
  - `event_wait`
  - `event_set`
  - the paired producer/consumer ops around them
- Compare that family's action sequence to a stable baseline kernel that uses the same pipe pair.

Useful interpretation rules:

- `event_wait timeout` usually means the next producer-side `event_set` never happened.
- `event_set on already-set flag` usually means the family got a duplicate `event_set` before a matching `event_wait` consumed the previous token.
- On the simulator path, `preset=True` starts the family with one published token. So the `set_count` and `wait_count` values reported on timeout usually indicate that the preset token was consumed and no later `event_set` was published.

Repository pointers:

- parser split / autosync: `easyasc/parser/asc.py`, `easyasc/parser/asc_autosync.py`
- simulator local-event semantics: `easyasc/simulator_v2/sync/local_events.py`
- good regression home: `testcases/parser/sync/test_autosync_event_metadata.py`

Practical output shape to look for:

- healthy same-family reuse: `wait(valid) ... set(ready) wait(ready) ... set(valid) wait(valid) ...`
- missing publish: `... wait(valid)` with no prior `set(valid)` for the next round
- duplicate publish: `... set(valid) ... set(valid) ...` without an intervening `wait(valid)`

This workflow is especially effective for:

- `M -> FIX` reuse around `l0c_to_ub` / `l0c_to_l1`
- `MTE2 -> MTE1` reuse around repeated `gm_to_l1_nd2nz`
- nested `start_loop` / `start_if` regions that own different autosync families

When the bug only appears in nested regions:

- inspect parent and child `buf_idx` / sync-key family identity, not just the pipe pair
- parent/child state transfer is only safe when both scopes refer to the same autosync family
- if parent and child are different families on the same pipe pair, carrying `ready` / `valid` state across the boundary can create a missing-set or duplicate-set failure

9. Warning rule

Treat this warning as real:

`WARNING: NOT balanced auto_sync events, please check the code logic!`

It means the region ended with an unfinished producer-side token story. Typical causes:

- only one side of a supported pair appears in the region
- the last producer action never reaches its matching consumer phase
- parent and child both partially own the same handshake
- slot lineage changes inside the region

Do not wave this away just because codegen still succeeds.

10. Stable authoring rules

Use these rules for current repository behavior:

- wrap one full same-side pipeline stage, not random fragments
- keep producer and consumer on stable `DBuff` / `TBuff` views when possible
- prefer one stable counter per slot family
- do not reuse one counter across different buffering lifetimes
- let `auto_sync()` handle same-side queueing only
- use explicit mutexes for cross-side ownership

Stable cross-side mappings in this repository:

- cube -> vec: `CvMutex(..., src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.V)`
- vec -> cube: `VcMutex(..., src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`

Explicit mutex lifetime rule:

- `lock()` before producer write
- `ready()` after producer completes
- `wait()` before consumer read
- `free()` immediately after consumer completes

10a. `ub_to_ub` belongs to Pipe V, not MTE3

Despite being a data move, `ub_to_ub` is serviced by the vec (V) pipe, not MTE3. Concrete consequences:

- Same-pipe ordering. A `ub_to_ub` step between two vec computations is already serialized by the V pipeline. You do not need an explicit `SEvent(Pipe.MTE3, Pipe.V)` / `SEvent(Pipe.V, Pipe.MTE3)` handshake to fence it against its vec producer or consumer.
- `auto_sync()` + `bar_v` fence. The autosync pass now unconditionally emits a trailing `bar_v` barrier after every `ub_to_ub` as a conservative fence for slot reuse (see `_insert_b_device_vec_barriers` in `easyasc/parser/asc_autosync.py`).
- MTE3 events stay for real MTE3 ops. Keep `V -> MTE3` handshakes only for ops that truly run on MTE3 (e.g. `ub_to_gm_pad`). `flash_attn_full_pj_hif8` keeps `accum_store_ready/valid` around the accum UB→GM write for this reason, but its earlier `quant_chunk_loaded/committed` events were redundant and have been removed.

If you see legacy kernels inserting V↔MTE3 events around a `ub_to_ub` chunked-quantize step, that is almost always leftover noise: delete the events and rely on pipe-V serial ordering plus the autosync trailing `bar_v`.

10b. `dup` participates in both read and write tracking

For WAW-hazard detection, autosync treats the destination of `dup` as both a consumer and a producer. Concrete consequence: back-to-back `dup`s that target the same tensor get a separating `bar_v`, matching intuitive hardware expectations. This is handled by `_VEC_DST_AS_READ_OPNAMES` in `easyasc/parser/asc_autosync.py`.

10c. Vec-only helpers can still break autosync through buffer lineage

Do not reason about autosync at the Python-helper boundary. Reason about it at the touched slot-family boundary.

Practical failure mode seen during a2 dense-attention-backward bring-up:

- a new helper contained only vec ops
- the helper reused a live stage buffer such as `dpbuf` as temporary quantization scratch
- the enclosing region then no longer had one clean `MTE2 -> V -> MTE3` story for that slot family
- autosync reported `WARNING: NOT balanced auto_sync events, please check the code logic!`

Why this happens:

- autosync groups by supported pipe pair plus normalized sync-key group, not by Python function name
- writing scratch data through a live `DBuff` / `TBuff` family changes the same slot family's vec-side history
- there is still no autosync-managed `V -> MTE2` repair path

Stable authoring rule:

- if a local family later participates in `ub_to_*`, keep that family on one coherent producer/consumer lineage
- do not borrow delayed or still-live stage buffers as temporary vec scratch
- give helper scratch its own plain `Tensor` or its own dedicated local buffer family

Concrete a2 lesson:

- reusing `dpbuf` as quant scratch was enough to create an unbalanced vec `ubout` story
- moving quant scratch onto dedicated chunk-sized tensors removed the warning

10d. When adding vec-side transforms, chunk the whole stage instead of splicing scratch into a half-tile region

If a stage originally processed one half-tile such as `[HALF_M, TILE_N]`, and you later add quantization, extra masking, or format conversion, prefer re-chunking the entire vec hot path so each chunk owns a full:

`MTE2 -> V -> MTE3`

Do not keep the old large region and inject extra vec writes into the middle of someone else's slot lifetime.

Stable a2 pattern:

- choose a row chunk such as `32 x 128` or `16 x 128`
- load `qk` / `dp` for one chunk from workspace or GM
- finish all vec work for that chunk
- write `p` / `dqk` for that chunk before advancing to the next one

Practical consequence:

- the warning usually disappears because each chunk owns a complete same-side handshake
- the code becomes easier to audit than one large region with partial parent/child ownership
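
The chunked pattern in 10d can be pictured with a schematic loop plan. This is plain Python with no easyasc calls: `row_chunks` and `plan_stage` are hypothetical helpers, and the stage labels are strings standing in for the real load, vec, and store ops.

```python
from typing import Iterator, List, Tuple


def row_chunks(total_rows: int, chunk_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield [start, stop) row ranges covering total_rows; the last chunk may be short."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, total_rows, chunk_rows):
        yield start, min(start + chunk_rows, total_rows)


def plan_stage(total_rows: int, chunk_rows: int) -> List[Tuple[str, int, int]]:
    """List the per-chunk steps so each chunk owns one load -> vec -> store sequence."""
    plan: List[Tuple[str, int, int]] = []
    for start, stop in row_chunks(total_rows, chunk_rows):
        plan.append(("MTE2 load", start, stop))   # reload this chunk from workspace / GM
        plan.append(("V compute", start, stop))   # all vec work for this chunk
        plan.append(("MTE3 store", start, stop))  # write back before the next chunk
    return plan


if __name__ == "__main__":
    for step in plan_stage(total_rows=64, chunk_rows=32):
        print(step)
```

The point of the shape is that no chunk's vec work is interleaved with another chunk's load or store, so every chunk contains both sides of the `MTE2 -> V` and `V -> MTE3` pairs.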

10e. Do not pass sliced views into helpers that internally `reinterpret(...)`

If a helper internally reinterprets its tensor argument, pass the full backing tensor unless the helper itself was written to be slice-safe.

Concrete failure mode:

- a helper rebuilt a packed causal mask through `causal_mask.reinterpret(DT.int)`
- the caller passed a sliced view such as `causal_mask_dynamic[0:HIF8_CHUNK_M, ...]`
- simulator-v2 then reported a storage/view-size failure such as `shared tensor view exceeds source storage`

Stable rule:

- pass the full tensor into the helper
- keep the helper's internal full-tile assumptions explicit
- slice later at the use site, for example when calling `select(...)`

10f. Warning triage must identify the broken autosync pattern and report it

When `WARNING: NOT balanced auto_sync events, please check the code logic!` appears, do not stop at "the warning exists" or "the result still passes".

Required debugging behavior for future agent work:

- actively search for the unmatched autosync path
- identify which supported pipe pair failed to close, for example `MTE2 -> V`, `V -> MTE3`, `MTE2 -> MTE1`, `MTE1 -> M`, or `M -> FIX`
- determine whether the real problem is:
  - only one side of the pair appearing in the region
  - parent and child regions both partially owning the same handshake
  - a slot-family lineage change caused by reusing a live buffer as scratch
  - a counter / slot token mismatch that moved later ops onto a different family
- report that concrete mismatch back to the user, not just the presence of the warning

Preferred feedback shape:

- which side reported the imbalance: cube or vec
- which family looks broken: `ubin`, `ubout`, `l1`, `l0`, or `fix`
- which local buffer family or region caused the mismatch
- whether the likely fix is region restructuring, buffer separation, or counter repair

Why this rule exists:

- the warning is often the first sign that the mental model is wrong
- users need the suspected broken pattern to decide whether to change the kernel structure, the helper scratch plan, or autosync assumptions
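
When triaging one family, the "missing publish" and "duplicate publish" shapes from the debug workflow can be checked mechanically once the `event_set` / `event_wait` actions for a single token have been extracted. The sketch below is a sequential replay under simplifying assumptions: `check_token_trace` is a hypothetical helper, the input is a hand-extracted list of `"set"` / `"wait"` strings for one token in the order they appear, and the flag model (one token, `preset` as the starting state) is a reading of the simulator notes in this guide rather than the simulator's own code.

```python
from typing import Iterable, Tuple


def check_token_trace(actions: Iterable[str], preset: bool) -> Tuple[str, int]:
    """Replay 'set' / 'wait' actions on one flag and report the first problem.

    Returns (verdict, index). The index is -1 when the trace is healthy.
    """
    flag = preset  # preset=True means one token is already published
    for i, action in enumerate(actions):
        if action == "set":
            if flag:
                # A second set before any wait consumed the previous token.
                return "duplicate publish", i
            flag = True
        elif action == "wait":
            if not flag:
                # Nothing was published for this wait to consume.
                return "missing publish", i
            flag = False
        else:
            raise ValueError(f"unknown action: {action!r}")
    return "healthy", -1


if __name__ == "__main__":
    # valid token (preset=True): wait, set, wait is the healthy reuse shape.
    print(check_token_trace(["wait", "set", "wait"], preset=True))
    # Second round waits on valid although no set(valid) was emitted.
    print(check_token_trace(["wait", "wait"], preset=True))
    # Two set(valid) without an intervening wait(valid).
    print(check_token_trace(["wait", "set", "set"], preset=True))
```

The returned index points at the first offending action, which gives a concrete location to report together with the family name and the buffer or region involved.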

11. Verification workflow

When replacing manual barriers with `auto_sync()`, validate in this order:

- inspect inserted event signatures in IR
- inspect generated C event declarations
- run a minimal reproducer with the same pipe topology
- rerun the real kernel shape

Check these details explicitly:

- event family names match the intended stage (`ubin`, `ubout`, `l1`, `l0`, `fix`)
- `valid` uses reversed pipe direction with `preset=True`
- `ready` uses forward pipe direction with `preset=False`
- nested scopes do not create duplicate event families unless slot grouping truly changed
- no `NOT balanced auto_sync events` warning appears

Files to study

- `easyasc/parser/asc_autosync.py`
- `easyasc/parser/asc.py`
- `easyasc/utils/sync_key.py`
- `easyasc/decorators.py`
- `testcases/parser/sync/test_autosync_event_metadata.py`
- `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py`
- `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py`

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
