CANNBot Kernel Debugging Guide
# Kernel Debugging Playbook

cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Use this playbook when an existing kernel is wrong, unstable, warning-heavy, or unclear. Debug in layers. Do not jump between random fixes.

## Goal

Find the first broken assumption. Fix the model, then fix the kernel. Do not keep stacking patches on top of an unclear design.

## Fast-path: match your symptom first

Most bug reports match one of the patterns below. Try these before running the full layer-by-layer review further down.

### Symptom-to-check map

- **Wrong everywhere** → check formula, transpose/layout, cast order, `shape_bindings`
- **Only large shapes fail** → check tile budgets, split mode, estimator choice, counter ownership across nested loops
- **Only tail tiles fail** → check `valid_*` handling, half-row vec writeback split, GM boundary slicing
- **Autosync warnings or weird pipeline stalls** → check same-side vs cross-side misunderstanding, event family grouping, counter reuse across different lifetimes, unsupported instruction not covered by autosync pairing
- **Local event timeout / already-set (`_tmp_*valid_*`, `_tmp_*ready_*`)** → classify the event failure first, dump autosync-expanded instructions, then compare the failing family against a stable kernel before changing the DSL
- **Simulator passes, generated path looks suspicious** → check parser lowering, codegen handlers, explicit event or mutex placement, assumptions hidden by simulator convenience
- **V2 `wait_vec` / `wait_cube` timeout** → see the V2 timeout section below; almost always the other lane's actor crashed silently
- **Kernel only fails when run alongside other tests** → see the V2 parallel-process section below

### Event pairing workflow for local event failures

Use this when V2 reports a lane-local event problem such as:

- `event_wait timeout: {name: _tmp_sevent_valid_fix_0, ...}`
- `event_set on already-set flag: _tmp_sevent_valid_l1_0 ...`

Debugging sequence:

1. Classify the failure from the runtime message.
   `event_wait timeout` usually means a
   missing `event_set` for the same family. `event_set on already-set flag` usually means a duplicate `event_set` before a matching `event_wait` consumed the token.
2. Read the counters literally.
   On the simulator path, `preset=True` events start with one published token. If the `set_count` and `wait_count` in a timeout report show no unconsumed token left, the preset token was consumed and the next producer-side `event_set` never happened.
3. Build the kernel instructions before inspecting split/autosync output.
   Call the `kernel` once with placeholder `GMTensor(...)` arguments so `kernel.instructions` is populated.
4. Dump the autosync-expanded lane instructions.
   Use `split_instructions(...)` plus `insert_auto_sync(...)`, then inspect only the failing side (`cube` or `vec`). Prefer printing just one family at a time: `l1`, `l0`, `fix`, `ubin`, or `ubout`.
5. Turn the event stream into an action sequence.
   Record only `event_wait` / `event_set` for the failing event name(s). Healthy reuse should look like alternating publish/consume rounds; a repeated `wait` or a repeated `set` without the opposite action in between is the broken edge.
6. Compare against a stable baseline kernel.
   Dump the same family from a nearby working kernel and diff the action sequence. This is often faster than reasoning from the fused kernel body.
7. Check nested autosync ownership next.
   If the failing edge sits around nested `start_loop` / `start_if` regions, inspect parent/child mixed-scope handling before touching the kernel. In particular, confirm whether parent and child are really the same autosync family, not just the same pipe pair.
8. Add a parser regression before rerunning the real kernel.
   Put the minimal reproducer in `testcases/parser/sync/test_autosync_event_metadata.py`. Fix the split/autosync behavior there first, then rerun the full simulator kernel.

When this workflow points to parser behavior, jump to: `agent/references/constraints/autosync.md`

### V2 simulator: `CvMutex` / `VcMutex` timeout (`wait_vec` / `wait_cube`)

When V2 reports a sync timeout such as:

`wait_vec timeout: {scope: intra_core, name: vec_to_cube_0_0, target_phase: 3, current_phase: 2, consumed_phase: 2}`

The
timeout almost always means the **other lane's actor thread crashed**, not that the sync logic itself is wrong. The crashing thread terminates silently, so the expected `vec_ready` or `cube_ready` signal is never published, and the waiting side eventually times out.

Debugging sequence:

1. Capture the real error on the other lane first. Patch `CoreRuntime.start` to wrap each `ControlActor.start()` in a try/except that prints the lane name and exception. The first non-timeout error is the real root cause.
2. Common root causes behind the silent crash:
   - **Float8 indexing**: PyTorch does not support `tensor[indices]` for `float8_e5m2` / `float8_e4m3fn`. Any `_gather_1d`, `_scatter_1d`, or fancy indexing on a float8 register or UB tensor raises `index_cpu not implemented for Float8_e5m2`. Fix by viewing as `torch.uint8` before indexing.
   - **Non-contiguous UB views in burst copy**: `ub_to_gm_pad` / `ub_to_l1_nz` use `.view(torch.uint8)` on the source. A column slice (non-unit stride) makes `.view()` fail. Fix with `_linear_view_from_pointer()`.
   - **Micro op not implemented**: a `vf()` body calls an op that `MicroRuntime` does not dispatch (`NotImplementedError`). The vec lane dies and its `free()` never fires.
3. After fixing the vec/cube error, the sync timeout resolves on its own.

Do **not** tune sync timeouts or phase counters to work around these failures: the counters are correct; the lane just never ran to completion.

### V2 simulator: do not run multiple simulator processes in parallel

Running multiple V2 simulator processes concurrently can produce **silent data corruption**. Root cause: per-lane `PipeWorker` threads are exposed to intra-process races under heavy CPU thread contention.
This primarily affects kernels using NZ layout ops (`ub_to_l1_nz`, `deinterleave`, `reg_to_ub`) or complex `vf` functions.

- Selecting the `legacy` simulator is still accepted but routes to the same V2 runtime; there is no sequential fallback to switch to.
- Always run kernel simulator tests sequentially, not in parallel with `&` or batch scripts.
- If a kernel produces incorrect results only when run alongside other tests, re-run it alone before investigating.

### V2 simulator launch rule: use a real script entry and keep `PYTHONPATH`

When launching helper comparisons or ad-hoc debugging runs, do not start the simulator from `stdin` entry points such as:

- `python - <<PY`
- `cat script.py | python`

V2 uses child processes plus worker threads. On the process-spawn path, Python must be able to re-import the parent `__main__` module from a real file. `stdin` entry points show up as `<stdin>`, so child startup fails with errors such as:

- `FileNotFoundError: ... /path/to/repo/<stdin>`
- a follow-on `EOFError` while `multiprocessing.Manager()` starts

Practical rule:

- put the repro in a real `.py` file and run that file
- include the repository root in `PYTHONPATH` whenever the script imports local modules from outside the repo root or from a temp directory

Typical safe form:

`PYTHONPATH=/abs/path/to/repo python /tmp/repro.py`

## Layer-by-layer review

Use this order when the fast-path sections above did not match or did not fix the bug:

1. contract and cast order
2. layout and shape bindings
3. tile and capacity assumptions
4. tail handling
5. sync and ownership
6. counters and lifetime separation
7. precision boundaries
8. parser/simulator/codegen implementation path

### 1. Re-check the exact contract

Verify the kernel against the real PyTorch formula. Common failure modes: wrong cast order, wrong transpose interpretation, wrong reshape meaning, accidental semantic drift. If the reference is still fuzzy, stop here and clarify it before changing the DSL code.

### 2. Re-check layout and shape binding assumptions

Verify tensor logical shapes, transpose site, `shape_bindings`, and repeated scalar dimension mapping.

Common signs: output shape is right but values are wrong everywhere; only some shapes fail; changing `M`, `N`, or `K` flips behavior unpredictably.

Repository reminder: if repeated scalar dimensions are ambiguous, try explicit `shape_bindings` before deeper kernel surgery.

### 3. Re-check tile and capacity assumptions

When the kernel is tiled, verify `TILE_M`, `TILE_N`, `TILE_K`, `m_split`, `n_split`, `splitk` / `splitn`, and the `L0A` / `L0B` / `L0C` byte budgets.

Repository reminders: keep `splitk` and `splitn` at 32; choose `splitk` when K-side staging is too large; choose `splitn` when N-side staging or the output tile is too large; do not author non-zero `L0C` row offsets on matmul destinations. For the exact per-device caps and DBuff formulas, see `agent/references/facts-authoring.md` and `agent/references/facts-device-runtime.md`.

If tile search is non-trivial, use `agent/scripts/estimate_matmul_datamove.py` instead of eyeballing it. Drill into `agent/references/constraints/tiling.md` for reasoning.

### 4. Re-check tail handling

Look at GM boundaries first, not local tensor sizes. Rule: local buffers stay full-tile sized; only GM read/write boundaries use `valid_m`, `valid_n`, `valid_k`.

For cube -> vec writeback, verify the standard half-row split:

- `half_rows = CeilDiv(valid_m, 2)`
- `row_begin = GetSubBlockIdx() * half_rows`
- `row_end = Min(row_begin + half_rows, valid_m)`

For a2 workspace-mediated cube -> vec tails: keep workspace writes and reads on stable tile shapes (`ws[..., 0:TILE_M, 0:TILE_N]` on cube; `ws[..., row_begin:row_begin + row_count, 0:TILE_N]` on vec). Apply `valid_n` with vec-side masking and final GM write boundaries, not by cropping the workspace column span first.

Symptoms of tail bugs: aligned cases pass but odd sizes fail; only the last tile is wrong; one vec subblock is correct and the other is garbage.

Drill: `agent/references/constraints/tail-safety.md`.
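
The half-row split from this section can be sketched in plain Python to check which rows each vec subblock owns for a given tail size. This is a minimal model, not the DSL itself: `ceil_div` and `half_row_range` are hypothetical helper names, `sub_block_idx` stands in for `GetSubBlockIdx()`, and the sketch assumes exactly two vec subblocks per tile.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def half_row_range(valid_m: int, sub_block_idx: int) -> Tuple[int, int]:
    """Return the [row_begin, row_end) range owned by one vec subblock.

    Mirrors the half-row split: each subblock takes ceil(valid_m / 2) rows,
    and the end is clamped to valid_m so the tail never overruns.
    """
    half_rows = ceil_div(valid_m, 2)
    row_begin = sub_block_idx * half_rows
    row_end = min(row_begin + half_rows, valid_m)
    return row_begin, row_end


if __name__ == "__main__":
    # Aligned, odd, and single-row tails (assumed two subblocks: 0 and 1).
    for valid_m in (8, 7, 1):
        ranges = [half_row_range(valid_m, idx) for idx in (0, 1)]
        print(f"valid_m={valid_m}: {ranges}")
```

With `valid_m = 1` the second subblock receives an empty range (`row_begin == row_end`), which is the case where that subblock must write nothing back to GM.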
For normalized online softmax with running `row_max` / `row_sum`, also see `agent/references/constraints/online-softmax-tail.md`.

### 5. Re-check sync ownership

Assume ownership is wrong until proven otherwise. `auto_sync()` only manages same-side ordering and does not replace cross-side ownership transfer. Cube -> vec handoff needs `CvMutex`; vec -> cube handoff needs `VcMutex`. Exact mutex signatures per device live in `agent/references/facts-device-runtime.md`.

If the issue smells like pipeline ordering: inspect where the producer finishes, where the consumer starts, whether `lock` / `ready` / `wait` / `free` surround the real ownership edge, and keep the critical section narrow.

Drill: `agent/references/constraints/autosync.md`.

### 6. Re-check counters and lifetimes

Many broken kernels are actually lifetime bugs. Verify which loop owns each buffer family, whether different lifetimes accidentally share one counter, and whether the same slot lineage is expressed consistently.

Rules: buffers with different lifetimes must use different counters; same-lifetime paired buffers may share one; reusing one counter across different loop-owned lifetimes can silently break autosync grouping and slot reasoning.

Drill: `agent/references/constraints/counters.md`.

### 7. Re-check precision boundaries

Verify where values change dtype. Common failures: casting too early, reducing in the wrong dtype, writing packed or quantized data too early, comparing against a reference with a different cast order.

Rule: keep matmul accumulation in `float`; downcast later unless the design proves otherwise.

Drill: `agent/references/constraints/precision.md`.

### 8. Inspect the real implementation path

If a rule is still unclear, inspect the actual implementation path instead of theorizing.
Device family mapping (`950` → C310, `b*` → C220) and common target files (`easyasc/stub_functions/`, `easyasc/parser/`, `easyasc/parser/asc_autosync.py`, `easyasc/kernelbase/kernelbase.py`, `easyasc/simulator_v2/`, `easyasc/shortcuts/matmul.py`) are in `agent/references/code-paths.md`.

Good debugging question: which exact instruction gets emitted, how the parser lowers it, how the simulator executes it, and whether the kernel assumption matches that path.

When the simulator itself produces an unexpected error: investigate the simulator path first; inspect the exact simulator stage, runtime view, and lowered instruction that failed; do not assume the upper-layer kernel is wrong just because the simulator failed first.

If simulator behavior still looks inconsistent with the intended model after real inspection: stop blind upper-layer edits, summarize the concrete simulator finding, then pause and discuss with the user.

## Build a minimal reproducer

When the full kernel is noisy, isolate one mechanism: one matmul, one handoff, one vec postprocess, one autosync chain, one tail tile. A minimal reproducer is usually faster than staring at a fused kernel.

Shrink-down order:

1. keep the original failing shape
2. remove later stages until only the first wrong stage remains
3. inside that stage keep only one subformula (`odo`, `rowmax`, one GM bridge)
4. shrink again if needed to one instruction and one view shape

## Treat warnings as real signals

Do not accept a passing result with unresolved warnings. Especially for `auto_sync`, warnings usually mean the lifetime model is off. If a warning persists after real inspection, stop blind iteration: either redesign the stage boundary or ask the user for clarification.

## Fallback references

- `agent/references/code-paths.md`
- `agent/references/simulator-v2.md`
- `doc/11_architecture_for_contributors.md`

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
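
As a supplementary sketch for the event pairing workflow earlier in this playbook, the "alternating publish/consume" check on a recorded action sequence can be modeled in a few lines of Python. This is a simplification under stated assumptions, not the simulator's logic: `first_broken_edge` is a hypothetical helper, the input is a hand-recorded list of `"set"` / `"wait"` strings for one event name, the event is modeled as a single binary flag, and the recorded order is assumed to be the order in which the actions took effect.

```python
from typing import Iterable, Optional, Tuple


def first_broken_edge(
    actions: Iterable[str], preset: bool = True
) -> Optional[Tuple[int, str]]:
    """Return (index, action) of the first action that breaks alternation.

    "set" publishes one token and "wait" consumes one. With preset=True the
    flag starts with one published token, so a healthy stream begins with
    "wait". Returns None when the whole stream alternates cleanly.
    """
    token_available = preset
    for index, action in enumerate(actions):
        if action == "set":
            if token_available:
                # Duplicate publish: matches "event_set on already-set flag".
                return index, action
            token_available = True
        elif action == "wait":
            if not token_available:
                # Nothing to consume: matches "event_wait timeout".
                return index, action
            token_available = False
        else:
            raise ValueError(f"unknown action: {action!r}")
    return None


if __name__ == "__main__":
    print(first_broken_edge(["wait", "set", "wait", "set"]))  # healthy stream
    print(first_broken_edge(["wait", "wait"]))                # missing set
    print(first_broken_edge(["wait", "set", "set"]))          # duplicate set
```

Running the same helper on the action sequence from a stable baseline kernel and on the failing kernel gives two indices to compare, which narrows the diff to the first diverging edge.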