Neurarch rules: structural checks that catch model bugs before training

Layer 1

Guardrail gates (6)

These run on every action the agent proposes, before anything touches the canvas. A blocking finding requires explicit user override. Source: src/utils/agentGuardrails.ts.

G01 Impact preview warn blast radius

Trigger

Any destructive action (delete_component, delete_components_matching, rename_components_matching, clear_canvas, replace_model) that affects more than 4 downstream layers, or where any affected layer would change shape.

Surfaces

Number of downstream layers affected, how many would reshape, how many carry weights (rebuild required).

Override

User confirms in the impact-preview card before the action lands.

G02 Param explosion block resource

Trigger

Estimated total parameter count grows by ≥ threshold × baseline (default 5×, slider 2-20×). Warns once growth reaches the threshold and blocks once it reaches 2× the threshold.

Why

A misread "scale this 10×" prompt should not silently turn a 60M model into a 6B model that won't fit anywhere.

Setting

useProviderStore.paramExplosionThreshold(). Settings → "Param explosion threshold".
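A minimal sketch of how the warn and block levels might combine, assuming the gate warns at the threshold and blocks at twice the threshold. The function and type names here are hypothetical and are not taken from agentGuardrails.ts.

```typescript
type ExplosionVerdict = "ok" | "warn" | "block";

// Assumed reading of the rule: warn once growth reaches the threshold,
// block once it reaches twice the threshold.
function classifyParamGrowth(
  baselineParams: number,
  proposedParams: number,
  threshold: number = 5, // default 5x; the settings slider allows 2-20x
): ExplosionVerdict {
  if (baselineParams <= 0) return "ok"; // nothing to compare against (assumption)
  const growth = proposedParams / baselineParams;
  if (growth >= 2 * threshold) return "block";
  if (growth >= threshold) return "warn";
  return "ok";
}

// The "60M model into a 6B model" case is 100x growth.
const verdict60MTo6B = classifyParamGrowth(60e6, 6e9);
```

With the default threshold, the 60M to 6B case lands well past the block level, while a 6× growth would only warn.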

G03 Cycle introduction block structure

Trigger

An add_connection closes a cycle in the existing DAG. DFS check on the union of current edges and proposed edges.

Why

NN forward graphs are acyclic. Recurrence lives inside the layer type (LSTM, GRU), not as a graph cycle. A connection that closes a cycle is almost always a mis-step.

Source

DAG requirement matches PyTorch nn.Module.forward semantics. torch.autograd docs note that gradient computation requires a DAG.
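The DFS check on the union of current and proposed edges can be sketched as follows. This is an illustrative three-state DFS, not the code in agentGuardrails.ts; the GraphEdge shape and the closesCycle name are assumptions.

```typescript
// Hypothetical edge shape; the real canvas model may differ.
interface GraphEdge {
  from: string;
  to: string;
}

// Returns true if the directed graph formed by `current` plus `proposed`
// contains a cycle. DFS states: undefined = unvisited, 1 = on stack, 2 = done.
function closesCycle(current: GraphEdge[], proposed: GraphEdge[]): boolean {
  const adjacency: Record<string, string[]> = Object.create(null);
  for (const e of current.concat(proposed)) {
    if (!adjacency[e.from]) adjacency[e.from] = [];
    if (!adjacency[e.to]) adjacency[e.to] = [];
    adjacency[e.from].push(e.to);
  }

  const state: Record<string, number> = Object.create(null);
  const visit = (node: string): boolean => {
    state[node] = 1;
    for (const next of adjacency[node]) {
      if (state[next] === 1) return true; // back edge: a cycle exists
      if (!state[next] && visit(next)) return true;
    }
    state[node] = 2;
    return false;
  };

  for (const node of Object.keys(adjacency)) {
    if (!state[node] && visit(node)) return true;
  }
  return false;
}
```

For example, with existing edges a → b → c, proposing c → a closes a cycle, while proposing a → c only adds a skip path.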

G04 Orphan layers warn structure

Trigger

add_component without afterName, and no follow-up add_connection referencing the new component as either endpoint.

Why

A new layer that's not wired is dead code. The agent occasionally proposes these when it forgets to chain edits; the warning flips the decision back to the user.

G05 Shape inference (new) block shape

Trigger

Propagates tensor shapes through the sandboxed post-action graph and surfaces only the issues the action would introduce. Sub-checks: attention embedDim % numHeads, GQA numHeads % numKVHeads, elementwise merge parent equality, concat axis compatibility, explicit linear inFeatures vs upstream, computeOutputShape throw, and outputs with NaN / 0 / negative dims at the first layer to introduce them.

Source

Per-layer transforms from componentRegistry.computeOutputShape. Head-dim convention from Vaswani et al. 2017 (Attention Is All You Need). GQA ratio from Ainslie et al. 2023 (GQA).

Why

Text-layer review tools (Cursor, Copilot) can't catch embedDim=384, numHeads=5 until the GPU rejects the kernel. This gate fires sub-millisecond, pre-apply.
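Two of the sub-checks (attention head divisibility and the GQA ratio) reduce to modulo tests. The sketch below assumes a hypothetical AttentionParams shape and is not the registry's actual interface.

```typescript
// Hypothetical parameter shape, for illustration only.
interface AttentionParams {
  embedDim: number;
  numHeads: number;
  numKVHeads?: number; // present only for grouped-query attention
}

// Returns one message per divisibility problem; an empty list means the layer passes.
function attentionShapeIssues(p: AttentionParams): string[] {
  const issues: string[] = [];
  if (p.numHeads <= 0 || p.embedDim % p.numHeads !== 0) {
    issues.push(`embedDim ${p.embedDim} is not divisible by numHeads ${p.numHeads}`);
  }
  if (
    p.numKVHeads !== undefined &&
    (p.numKVHeads <= 0 || p.numHeads % p.numKVHeads !== 0)
  ) {
    issues.push(`numHeads ${p.numHeads} is not divisible by numKVHeads ${p.numKVHeads}`);
  }
  return issues;
}

// embedDim=384 with numHeads=5 leaves a remainder of 4, so it is flagged.
const badHeads = attentionShapeIssues({ embedDim: 384, numHeads: 5 });
```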

G06 Param range warn param-range

Trigger

A single parameter value falls outside its conventional range on add_component or update_params — e.g. dropout above 1, stride of 0, numHeads of 0. Uses the same convention-based ranges the inspector shows inline.

Why

These are values the shape gate cannot see: the graph still propagates, the model still builds, and the run is quietly wrong. A dropout of 1.2 is not a shape error, it is a silently dead layer.

Source

src/utils/paramConstraints.ts::validateParamValue, shared with the inspector so CI and the canvas never disagree.

Layer 2

Architecture advisor rules (35)

These run live as you build, on every model change, with no agent involvement. They surface as inline warnings on the canvas and in the Advisor panel. Source: src/utils/architectureAdvisor.ts.

R01 No Input node error structure

Trigger

Model has ≥ 1 component but no component of type input.

Why

Without an Input layer, tensor shapes can't be propagated and generated code is incomplete.

Fix

Drag an Input layer from the I/O section of the palette.

R02 No Output node warn structure

Trigger

Model has > 1 component but no component of type output.

Why

Code generator can't tell where the forward pass terminates.

R03 Isolated components warn structure

Trigger

Component has no inbound and no outbound connections.

Why

Most often a stranded layer left over from a refactor. Code generator skips it but it inflates the visual graph.

R04 Dead-end layer warn structure

Trigger

A non-output node has inbound connections but no outbound connections.

Why

Compute happens but the result is discarded, so the layer never contributes to the loss and its weights never receive a useful gradient.
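R03 and R04 both come down to counting a node's inbound and outbound edges. A minimal sketch, assuming hypothetical CanvasNode and CanvasEdge shapes rather than the advisor's real types:

```typescript
// Hypothetical shapes; the real canvas model may differ.
interface CanvasNode {
  id: string;
  type: string;
}
interface CanvasEdge {
  from: string;
  to: string;
}

type Wiring = "isolated" | "dead-end" | "ok";

function classifyWiring(node: CanvasNode, edges: CanvasEdge[]): Wiring {
  const inbound = edges.filter((e) => e.to === node.id).length;
  const outbound = edges.filter((e) => e.from === node.id).length;
  if (inbound === 0 && outbound === 0) return "isolated"; // R03
  if (node.type !== "output" && inbound > 0 && outbound === 0) return "dead-end"; // R04
  return "ok";
}
```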

R05 BatchNorm / LayerNorm after activation warn ordering

Trigger

A normalization layer is connected directly downstream of an activation (relu, gelu, swish, etc.).

Why

Normalization is meant to stabilize the pre-activation distribution. Applying it after activation breaks the assumption and shifts already-nonlinear features back toward zero mean.

Source

Ioffe & Szegedy 2015 (BatchNorm), which places BN between the Linear/Conv layer and the activation.

R06 Dropout directly before BatchNorm info ordering

Trigger

A dropout layer is connected directly upstream of a BatchNorm.

Why

Dropout at train time changes the activation variance; BN's running statistics get distorted. A known train-vs-eval mismatch.

Source

Li et al. 2018, "Understanding the Disharmony Between Dropout and Batch Normalization".

R07 Softmax / Sigmoid directly before Output info ordering

Trigger

An explicit Softmax or Sigmoid layer is wired directly into Output.

Why

PyTorch's nn.CrossEntropyLoss applies LogSoftmax internally; an explicit Softmax before it double-applies and slows training. BCEWithLogitsLoss has the same issue with Sigmoid.

Source

torch.nn.CrossEntropyLoss docs.

R08 Normalization at output warn ordering

Trigger

A normalization layer is wired directly into Output without a final linear / classification head.

Why

Normalization zero-centers the logits, breaking calibration. The model's output scale is no longer meaningful.

R09 Deep network with no residual connections warn pattern

Trigger

≥ 8 weight-carrying layers (conv2d / linear / etc.) and zero residual / skip / add nodes.

Why

Gradient signal degrades through depth without skip connections.

Source

He et al. 2015, ResNet.

R10 Attention without positional encoding warn pattern

Trigger

Model contains attention layers but no positional encoding (sinusoidal, learned, RoPE, ALiBi).

Why

Self-attention is permutation-equivariant: it has no built-in notion of token order. Without PE, the model treats its input as a bag of words and can't learn order.

Source

Vaswani et al. 2017, Attention Is All You Need, Section 3.5.

R11 Sigmoid / Tanh in deep networks info pattern

Trigger

Sigmoid or Tanh activation appears in a network with ≥ 5 weight-carrying layers.

Why

Saturating activations cause vanishing gradients in deep stacks. ReLU / GELU / SiLU are the modern defaults.

Source

Glorot & Bengio 2010 on the vanishing gradient problem.

R12 Deep network with no normalization info pattern

Trigger

≥ 7 non-I/O layers and zero normalization layers (BN / LN / IN / GN / RMSNorm).

Why

Without normalization, deep nets converge slowly and are sensitive to initialization scale.

R13 Very high dropout rate warn performance

Trigger

Any dropout layer with p > 0.65.

Why

Common dropout rates are 0.1 to 0.5. Anything above 0.65 usually indicates a typo or a confused regularization strategy.

R14 Very large activation tensor warn performance

Trigger

Any layer's estimated activation tensor exceeds 50 M elements (~200 MB per sample at float32).

Why

Activations live in GPU memory through the backward pass. One bloated layer forces tiny batch sizes or OOM.
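The size estimate is a product of the tensor dimensions times the element width. A small sketch of that arithmetic, with hypothetical helper names and the 50 M element limit taken from the trigger above:

```typescript
const ACTIVATION_ELEMENT_LIMIT = 50e6; // 50 M elements, per the trigger

// Number of elements in one sample's activation tensor (batch dim excluded).
function activationElements(shape: number[]): number {
  return shape.reduce((acc, dim) => acc * dim, 1);
}

// Memory for one sample; float32 is 4 bytes per element.
function activationBytes(shape: number[], bytesPerElement: number = 4): number {
  return activationElements(shape) * bytesPerElement;
}

function exceedsActivationLimit(shape: number[]): boolean {
  return activationElements(shape) > ACTIVATION_ELEMENT_LIMIT;
}

// 64 channels at 1024 x 1024 is 67,108,864 elements, about 268 MB in float32.
const bigFeatureMap = exceedsActivationLimit([64, 1024, 1024]);
```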

R15 MoE without auxiliary loss info pattern

Trigger

Model contains an MoE layer but no auxiliary load-balance loss is referenced.

Why

Without an aux loss, routing tends to collapse onto a few dominant experts. Many implementations add a router-balance term with a small weight (around 0.01).

Source

Shazeer et al. 2017, Sparsely-Gated MoE, Section 4.

R16 GQA numHeads not divisible by numKVHeads error structure

Trigger

A groupedQueryAttention component has params where numHeads % numKVHeads != 0.

Why

GQA shares each K/V head across a group of query heads, so numKVHeads must divide numHeads evenly. Mis-set ratios crash on the first attention forward.

Source

Ainslie et al. 2023, GQA.

R17 SwiGLU intermediateSize convention info performance

Trigger

A swiGLU / gated FFN has intermediateSize that doesn't follow the common ~(2/3) × 4 × hidden convention used in LLaMA / Mistral.

Why

Param-count parity with standard FFN; mis-sized SwiGLU silently changes the FLOP and param budget.

Source

Shazeer 2020, GLU Variants.
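The convention and the parameter parity it buys can be worked through numerically. This sketch assumes the result is rounded up to a multiple of 256, as in the public LLaMA reference code; the rounding multiple and the function names are illustrative assumptions, not the advisor's actual check.

```typescript
// (2/3) x 4 x hidden, rounded up to a multiple of `multipleOf` (assumed 256).
function swigluIntermediateSize(hidden: number, multipleOf: number = 256): number {
  const raw = Math.floor((2 * 4 * hidden) / 3);
  return multipleOf * Math.ceil(raw / multipleOf);
}

// Standard FFN: up-projection (hidden x 4*hidden) plus down-projection, biases ignored.
function standardFfnParams(hidden: number): number {
  return 2 * hidden * (4 * hidden);
}

// SwiGLU: gate, up and down projections, i.e. three matrices of hidden x intermediate.
function swigluParams(hidden: number, intermediate: number): number {
  return 3 * hidden * intermediate;
}

const intermediate4096 = swigluIntermediateSize(4096); // 11008
const parityRatio =
  swigluParams(4096, intermediate4096) / standardFfnParams(4096); // about 1.008
```

The third matrix is what forces the 2/3 factor: three matrices at (8/3) × hidden cost the same as two matrices at 4 × hidden.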

R18 Conv feeds Linear with no flatten / pool error structure

Trigger

A convolution (conv1d/2d/3d or a depthwise/separable/transpose variant) connects directly into a linear layer.

Why

Conv outputs a multi-dimensional feature map; Linear expects a flat [batch, features] tensor. The forward pass raises a shape error, or silently mis-multiplies the spatial dims. Insert a Flatten or a Global Average Pool between them.

Source

Standard CNN classifier construction (e.g. LeNet / AlexNet head).

R19 Back-to-back activations warn ordering

Trigger

An activation node (relu/gelu/silu/sigmoid/tanh/softmax/…) connects directly into another activation node with no Linear/Conv/Norm between them.

Why

Two activations in a row add compute but little expressivity, and for idempotent activations such as ReLU a same-type repeat is an exact no-op. A common slip is ReLU → Softmax, which clips logits to ≥ 0 and distorts the output distribution.

Source

Composition of pointwise non-linearities adds no representational power without an intervening affine map (universal-approximation premise).

R20 Linear → Linear with no activation info pattern

Trigger

A linear layer connects directly into another linear layer with no activation between them.

Why

Two stacked linear maps collapse into one (W₂·W₁), so the extra layer costs parameters but adds no representational power — usually a forgotten ReLU. Deliberate low-rank / factorized projections (down-proj → up-proj) are the safe exception.

Source

Linear-map composition closure; the rank of the product cannot exceed the smaller intermediate dimension.
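The collapse can be checked numerically: applying two weight matrices in sequence gives the same result as applying their product once. The matrices below are arbitrary illustrative values, and biases are omitted for brevity (they fold the same way, into W₂·b₁ + b₂).

```typescript
type Matrix = number[][];

// Plain matrix product a·b (a is m x k, b is k x n).
function matMul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)),
  );
}

// Matrix-vector product m·x.
function matVec(m: Matrix, x: number[]): number[] {
  return m.map((row) => row.reduce((sum, v, k) => sum + v * x[k], 0));
}

const w1: Matrix = [[1, 2], [0, 1], [3, 0]]; // first Linear: 2 -> 3
const w2: Matrix = [[1, 0, 2], [0, 1, 1]];   // second Linear: 3 -> 2
const inputVec = [1, -1];

const stacked = matVec(w2, matVec(w1, inputVec));    // two layers: [5, 2]
const collapsed = matVec(matMul(w2, w1), inputVec);  // one merged layer: [5, 2]
```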

R21 Dropout immediately before Output warn ordering

Trigger

A dropout layer is the last node before the output node.

Why

In training this randomly zeroes the final logits fed to the loss; at eval it is a no-op, so train and eval behaviour diverge. Dropout belongs before the final projection (Linear/Conv), not after it.

Source

Srivastava et al. 2014 (Dropout) — dropout regularizes hidden representations, not output logits.

R22 Conv stride larger than kernel warn structure

Trigger

A downsampling convolution (conv1d/2d/3d, depthwise, or separable) has stride > kernelSize.

Why

The kernel jumps past its own footprint each step, so a band of the input is never read — a silent loss of information. Non-overlapping patches use stride == kernel (e.g. ViT 16/16); stride > kernel is almost always a typo.

Source

Convolution arithmetic — Dumoulin & Visin 2016.
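To see which inputs are never read, the sketch below walks a 1-D kernel across an input and lists the positions that fall between windows. It assumes no padding and dilation 1, and the function name is hypothetical.

```typescript
// Lists input indices that lie between kernel windows and are never read.
// Trailing positions after the last full window are not counted, since those
// are dropped at any stride when the input length does not line up.
function skippedInputIndices(
  inputLength: number,
  kernelSize: number,
  stride: number,
): number[] {
  if (kernelSize < 1 || stride < 1) {
    throw new Error("kernelSize and stride must be at least 1");
  }
  const covered: boolean[] = [];
  for (let i = 0; i < inputLength; i++) covered.push(false);

  let lastCovered = -1;
  for (let start = 0; start + kernelSize <= inputLength; start += stride) {
    for (let k = 0; k < kernelSize; k++) covered[start + k] = true;
    lastCovered = start + kernelSize - 1;
  }

  const skipped: number[] = [];
  for (let i = 0; i < lastCovered; i++) {
    if (!covered[i]) skipped.push(i);
  }
  return skipped;
}

const gaps = skippedInputIndices(12, 2, 3);    // [2, 5, 8]
const noGaps = skippedInputIndices(12, 3, 3);  // []
```

With kernel 2 and stride 3, every third input position is skipped; with stride equal to the kernel size, the windows tile the input exactly.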

R23 Non-spatial tensor into Conv warn structure

Trigger

A flatten or linear layer connects directly into a (non-transpose) convolution.

Why

Mirror of R18: a Conv needs a [channels, …spatial] feature map, but Flatten / Linear emit a flat [batch, features] vector. The forward pass raises a shape error unless the spatial dims are rebuilt with a Reshape / Unflatten first.

Source

PyTorch nn.Conv*d input-shape contract.

R24 Back-to-back normalization info pattern

Trigger

A normalization layer connects directly into another normalization layer (e.g. LayerNorm → BatchNorm).

Why

Normalizing an already-normalized tensor is redundant; the second layer mostly re-centres/re-scales what the first produced and just burns its own learnable parameters.

Source

Idempotence of standardization — Ioffe & Szegedy 2015.

R25 Duplicate positional encoding info pattern

Trigger

More than one positional-encoding layer (positionalEncoding, learnedPositionalEmbedding, rope, alibi) in the model.

Why

Position is normally injected once. Stacking absolute + rotary, or two of the same, double-counts position and tends to hurt more than help.

Source

Positional-encoding design — Vaswani et al. 2017, RoPE (Su et al. 2021).

R26 Pooling feeds Linear with no flatten warn structure

Trigger

A spatial pooling layer (maxpool, avgpool, adaptive pool) connects directly into a linear layer. Global pools are exempt — they already collapse spatial dims.

Why

Sibling of R18: pooling still emits a [channels, …spatial] map, but Linear expects a flat [batch, features] tensor, so the forward pass raises a shape error. Insert a Flatten or a Global Average Pool first.

Source

Standard CNN classifier head construction.

R27 Flatten before attention warn structure

Trigger

A flatten layer connects directly into an attention layer.

Why

Flatten collapses the sequence dimension into one long vector, leaving a length-1 sequence, so attention has nothing to relate. Keep the [sequence, dim] layout and flatten only after the attention stack.

Source

Attention operates over a sequence axis — Vaswani et al. 2017.

R28 ConvTranspose checkerboard risk info pattern

Trigger

A ConvTranspose2d has kernelSize not divisible by stride (with stride > 1).

Why

Uneven kernel overlap during upsampling deposits more weight on some output pixels than others, producing checkerboard artifacts. Make kernelSize a multiple of stride, or upsample with Upsample + Conv (resize-convolution).

Source

Odena et al. 2016, Deconvolution and Checkerboard Artifacts.
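The uneven overlap is easy to count in one dimension. In a transposed convolution, input position i and kernel tap k both write to output position i × stride + k, so the sketch below tallies how many writes each output position receives. It assumes no padding, no output padding and dilation 1; the function name is hypothetical.

```typescript
// Counts how many (input, kernel tap) pairs contribute to each output position
// of a 1-D transposed convolution.
function transposeConvOverlap(
  inputLength: number,
  kernelSize: number,
  stride: number,
): number[] {
  const outputLength = (inputLength - 1) * stride + kernelSize;
  const counts: number[] = [];
  for (let o = 0; o < outputLength; o++) counts.push(0);
  for (let i = 0; i < inputLength; i++) {
    for (let k = 0; k < kernelSize; k++) {
      counts[i * stride + k] += 1;
    }
  }
  return counts;
}

const uneven = transposeConvOverlap(4, 3, 2); // [1, 1, 2, 1, 2, 1, 2, 1, 1]
const even = transposeConvOverlap(4, 4, 2);   // [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

Kernel 3 with stride 2 alternates between one and two contributions in the interior, which is the 1-D form of the checkerboard; kernel 4 with stride 2 gives a uniform interior.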

R29 GroupNorm channels not divisible by numGroups error structure

Trigger

A GroupNorm layer's channel count is not an exact multiple of numGroups.

Why

GroupNorm splits channels into equal groups; a non-divisible count raises at construction. Parallel to the GQA / attention head-dim divisibility checks. Set numGroups to a divisor of the channel count.

Source

Wu & He 2018, Group Normalization.
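Picking a valid numGroups means picking a divisor of the channel count. A small helper, with a hypothetical name, that enumerates the options:

```typescript
// Every value of numGroups that splits `channels` into equal groups.
function validGroupCounts(channels: number): number[] {
  const divisors: number[] = [];
  for (let g = 1; g <= channels; g++) {
    if (channels % g === 0) divisors.push(g);
  }
  return divisors;
}

function groupNormIsValid(channels: number, numGroups: number): boolean {
  return numGroups > 0 && channels % numGroups === 0;
}

// 48 channels cannot be split into 32 groups, but 16 or 24 work.
const optionsFor48 = validGroupCounts(48); // [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]
```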

R30 Very large Linear layer warn performance

Trigger

A single Linear exceeds ~1B parameters (inFeatures × outFeatures).

Why

A dense layer that large (~4 GB float32) almost always means a feature map was flattened without pooling first. Add a Global Average Pool / more downsampling, or factorize the layer. Embedding and vocab-projection heads are the expected exception.

Source

Parameter-budget hygiene; a single matrix this size dominates model memory.

R31 Full multi-head attention at LLM scale info performance

Trigger

A transformer stack (≥6 attention layers) at LLM width (embedDim ≥ 2048) uses full multi-head attention everywhere, with no grouped-query or latent attention present.

Why

At real model scale the KV cache, not the weights, dominates serving memory. Full per-head K/V caching is what production LLMs move away from: grouped-query attention (e.g. 8:1) cuts the cache ~8×, and multi-head latent attention (MLA) shrinks it ~10× or more, with little effect on the total parameter count. Advice, not a bug: set numKVHeads below numHeads, or switch to mla.

Source

Serving-cost roofline; the same KV-cache math the app and the public serving calculator run.

R32 Default init assumes ReLU, saturating activation follows info pattern

Trigger

A Linear/Conv layer feeds a sigmoid or tanh activation directly.

Why

PyTorch's default Kaiming (He) init is derived for ReLU-family activations. Feeding a saturating activation from a He-initialized layer starts training in the saturated tails and shrinks early gradients. Use Xavier init with the matching gain, or a ReLU-family activation.

Source

Glorot & Bengio 2010 (Xavier) vs He et al. 2015 (Kaiming) derivation assumptions.

R33 Deep attention stack without depth-scaled init info pattern

Trigger

≥ 8 attention layers stacked in one model.

Why

Residual-branch outputs add up with depth; unscaled init lets activation variance grow layer over layer. GPT-2/LLaMA-family models scale the init of residual output projections by depth: a normal distribution with mean 0 and standard deviation 0.02 / √(2L), where L is the number of layers.

Source

GPT-2 (Radford et al. 2019) init convention; carried forward by LLaMA/Mistral.
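As a worked example of the scaling, the helper below computes the standard deviation 0.02 / √(2L) for a given depth. The function name is hypothetical, and the factor of 2 assumes two residual branches (attention and FFN) per layer.

```typescript
// Depth-scaled std for residual output projections: baseStd / sqrt(2 * numLayers).
function residualInitStd(numLayers: number, baseStd: number = 0.02): number {
  if (numLayers < 1) throw new Error("numLayers must be at least 1");
  return baseStd / Math.sqrt(2 * numLayers);
}

const std32 = residualInitStd(32); // 0.02 / sqrt(64) = 0.0025
const std8 = residualInitStd(8);   // 0.02 / sqrt(16) = 0.005
```

At 32 layers the residual projections start at one eighth of the base standard deviation.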

R34 LM head width disagrees with embedding vocab info structure

Trigger

The model has an embedding with a numeric vocabSize and attention layers, but the final Linear feeding the Output projects to a different width.

Why

A language model's head must project back to vocab size (and is usually weight-tied to the embedding); a mismatch means the model cannot emit token logits. A classifier head over N classes is the expected exception, which is why this is info, not error.

Source

Weight-tying convention (Press & Wolf 2017); standard LM head contract.

R35 KV cache exceeds serving budget at reference context warn performance

Trigger

Total fp16 K/V cache across all attention layers (GQA-aware: counts numKVHeads, skips MLA) exceeds 4 GB for a single 8,192-token sequence.

Why

That much cache for ONE sequence is gone before weights or activations load; on a 24 GB GPU it caps concurrency at a handful of requests. Raise the GQA ratio, switch to MLA, reduce depth/width, or accept a shorter serving context.

Source

Same per-token KV math as R31 and the serving calculator, applied as an absolute budget instead of a pattern.
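The per-sequence KV-cache size is 2 (K and V) × numKVHeads × headDim × sequence length × bytes per value, summed over attention layers. The sketch below assumes a hypothetical AttentionLayerSpec shape and reads "4 GB" as 4 × 1024³ bytes; neither is confirmed by the source.

```typescript
// Hypothetical layer description, for illustration only.
interface AttentionLayerSpec {
  numKVHeads: number;
  headDim: number;
  isMLA?: boolean; // MLA layers are skipped, as the trigger describes
}

const KV_BUDGET_BYTES = 4 * 1024 * 1024 * 1024; // assumed meaning of "4 GB"

function kvCacheBytes(
  layers: AttentionLayerSpec[],
  seqLen: number = 8192,
  bytesPerValue: number = 2, // fp16
): number {
  let total = 0;
  for (const layer of layers) {
    if (layer.isMLA) continue;
    total += 2 * layer.numKVHeads * layer.headDim * seqLen * bytesPerValue;
  }
  return total;
}

// Builds a stack of identical layers.
function repeatLayer(spec: AttentionLayerSpec, count: number): AttentionLayerSpec[] {
  const stack: AttentionLayerSpec[] = [];
  for (let i = 0; i < count; i++) stack.push(spec);
  return stack;
}

// 40 layers, 40 KV heads, headDim 128: 6,710,886,400 bytes, over budget.
const fullMha = kvCacheBytes(repeatLayer({ numKVHeads: 40, headDim: 128 }, 40));
// Same stack with 8 KV heads (5:1 GQA): 1,342,177,280 bytes, under budget.
const withGqa = kvCacheBytes(repeatLayer({ numKVHeads: 8, headDim: 128 }, 40));
```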
