✦ Structural rules reference

Every structural rule Neurarch runs on your model

35 deterministic checks, no LLM in the loop, no false positives from token salad. 5 of them gate the agent's proposed edits before they are applied, blocking broken ones and warning on risky ones. The other 30 lint live as you build. Every rule has a trigger condition you can verify against your code.

35 total rules
5 guardrail gates
30 advisor rules
~3ms full pass on a 100-layer model
Layer 1

Guardrail gates (5)

These run on every action the agent proposes, before anything touches the canvas. A blocking finding requires explicit user override. Source: src/utils/agentGuardrails.ts.

G01 Impact preview warn blast radius
Trigger
Any destructive action (delete_component, delete_components_matching, rename_components_matching, clear_canvas, replace_model) that affects more than 4 downstream layers, or where any affected layer is shape-changing.
Surfaces
Number of downstream layers affected, how many would reshape, how many carry weights (rebuild required).
Override
User confirms in the impact-preview card before the action lands.
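The downstream count in the trigger can be pictured as a graph traversal from the component being removed. The sketch below is illustrative only: the function name `downstream_layers` and the edge-list representation are assumptions, not Neurarch's internals.

```python
from collections import deque

def downstream_layers(edges, start):
    """Collect every layer reachable from `start` by following edges forward."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical six-layer chain: deleting "a" touches five downstream layers.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
affected = downstream_layers(chain, "a")
needs_preview = len(affected) > 4
```

Under this reading, deleting the first layer of the chain crosses the 4-layer limit and would raise the impact-preview card.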
G02 Param explosion block resource
Trigger
Estimated total parameter count reaches ≥ threshold × baseline (default 5×, slider 2-20×). Warns from 1× threshold, blocks from 2× threshold.
Why
A misread "scale this 10×" prompt should not silently turn a 60M model into a 6B model that won't fit anywhere.
Setting
useProviderStore.paramExplosionThreshold(). Settings → "Param explosion threshold".
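One reading of the trigger, as a minimal sketch: growth at or above the threshold warns, and growth at or above twice the threshold blocks. The helper name `classify_param_growth` and its return strings are hypothetical, and the exact boundary handling is an assumption.

```python
def classify_param_growth(baseline_params, proposed_params, threshold=5.0):
    """Return "ok", "warn", or "block" for a proposed edit (illustrative helper)."""
    if baseline_params <= 0:
        # Assumption: an empty canvas has no baseline to compare against.
        return "ok"
    ratio = proposed_params / baseline_params
    if ratio >= 2 * threshold:
        return "block"
    if ratio >= threshold:
        return "warn"
    return "ok"

# The 60M -> 6B example from the text is a 100x jump.
verdict = classify_param_growth(60_000_000, 6_000_000_000)
```

With the default 5× threshold, a 6× jump would warn and anything from 10× upward would block.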
G03 Cycle introduction block structure
Trigger
An add_connection closes a cycle in the existing DAG. DFS check on the union of current edges and proposed edges.
Why
NN forward graphs are acyclic. Recurrence lives inside the layer type (LSTM, GRU), not as a graph cycle. A connection that closes a cycle is almost always a mis-step.
Source
DAG requirement matches PyTorch nn.Module.forward semantics. The torch.autograd docs describe the graph recorded for gradient computation as a DAG.
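A DFS check over the union of current and proposed edges might look like the sketch below. It is a generic three-state cycle detector, not the code in agentGuardrails.ts; the function name `closes_cycle` and the `(src, dst)` tuple format are assumptions.

```python
def closes_cycle(current_edges, proposed_edges):
    """Return True if the union of both edge lists contains a directed cycle."""
    children = {}
    for src, dst in list(current_edges) + list(proposed_edges):
        children.setdefault(src, []).append(dst)
        children.setdefault(dst, [])

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in children}

    def visit(node):
        state[node] = ON_PATH
        for nxt in children[node]:
            if state[nxt] == ON_PATH:
                return True  # back edge: nxt is an ancestor on the current path
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in children)

existing = [("input", "linear1"), ("linear1", "relu"), ("relu", "output")]
skip_is_fine = closes_cycle(existing, [("input", "relu")])    # forward skip
loop_is_bad = closes_cycle(existing, [("relu", "linear1")])   # feeds back
```

A forward skip connection leaves the graph acyclic, while wiring a later layer back into an earlier one is flagged. The recursion is fine for graphs of a few hundred layers; a much deeper graph would want an explicit stack.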
G04 Orphan layers warn structure
Trigger
add_component without afterName, and no follow-up add_connection referencing the new component as either endpoint.
Why
A new layer that's not wired is dead code. The agent occasionally proposes these when it forgets to chain edits; the warning flips the decision back to the user.
G05 Shape inference (new) block shape
Trigger
Propagates tensor shapes through the sandboxed post-action graph and surfaces only the issues the action would introduce. Sub-checks: attention embedDim % numHeads, GQA numHeads % numKVHeads, elementwise merge parent equality, concat axis compatibility, explicit linear inFeatures vs upstream, computeOutputShape throw, and outputs with NaN / 0 / negative dims at the first layer to introduce them.
Source
Per-layer transforms from componentRegistry.computeOutputShape. Head-dim convention from Vaswani et al. 2017 (Attention Is All You Need). GQA ratio from Ainslie et al. 2023 (GQA).
Why
Text-level review tools (Cursor, Copilot) can't catch embedDim=384, numHeads=5 until the framework rejects the layer at runtime. This gate fires sub-millisecond, pre-apply.
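The two head-divisibility sub-checks are simple enough to verify by hand. The sketch below shows them in isolation; the function name `head_divisibility_issues` and the message wording are hypothetical, and the real gate runs these as part of full shape propagation.

```python
def head_divisibility_issues(embed_dim, num_heads, num_kv_heads=None):
    """List divisibility problems for one attention layer's parameters."""
    issues = []
    if embed_dim % num_heads != 0:
        issues.append(
            f"embedDim {embed_dim} is not divisible by numHeads {num_heads}"
        )
    if num_kv_heads is not None and num_heads % num_kv_heads != 0:
        issues.append(
            f"numHeads {num_heads} is not divisible by numKVHeads {num_kv_heads}"
        )
    return issues

# The example from the text: 384 / 5 leaves a remainder of 4.
bad = head_divisibility_issues(384, 5)
good = head_divisibility_issues(384, 6)
```

Here `bad` holds one finding and `good` is empty, since 384 / 6 gives a clean head dimension of 64.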
Layer 2

Architecture advisor rules (30)

These run live as you build, on every model change, with no agent involvement. They surface as inline warnings on the canvas and in the Advisor panel. Source: src/utils/architectureAdvisor.ts.

R01 No Input node error structure
Trigger
Model has ≥ 1 component but no component of type input.
Why
Without an Input layer, tensor shapes can't be propagated and generated code is incomplete.
Fix
Drag an Input layer from the I/O section of the palette.
R02 No Output node warn structure
Trigger
Model has > 1 component but no component of type output.
Why
Code generator can't tell where the forward pass terminates.
R03 Isolated components warn structure
Trigger
Component has no inbound and no outbound connections.
Why
Most often a stranded layer left over from a refactor. Code generator skips it but it inflates the visual graph.
R04 Dead-end layer warn structure
Trigger
A non-output node has inbound connections but no outbound connections.
Why
Compute happens but the result is discarded. Nothing downstream reaches the loss, so the layer receives no gradient and never trains.
R05 BatchNorm / LayerNorm after activation warn ordering
Trigger
A normalization layer is connected directly downstream of an activation (relu, gelu, swish, etc.).
Why
Normalization is meant to stabilize the pre-activation distribution. Applying it after activation breaks the assumption and shifts already-nonlinear features back toward zero mean.
Source
Ioffe & Szegedy 2015 (BatchNorm) places BN between Linear/Conv and the activation.
R06 Dropout directly before BatchNorm info ordering
Trigger
A dropout layer is connected directly upstream of a BatchNorm.
Why
Dropout at train time changes the activation variance; BN's running statistics get distorted. A known train-vs-eval mismatch.
R07 Softmax / Sigmoid directly before Output info ordering
Trigger
An explicit Softmax or Sigmoid layer is wired directly into Output.
Why
PyTorch's nn.CrossEntropyLoss applies LogSoftmax internally; an explicit Softmax before it double-applies and slows training. BCEWithLogitsLoss has the same issue with Sigmoid.
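The cost of the double application can be seen numerically without PyTorch. The sketch below reimplements softmax and a log-softmax-based cross-entropy in plain Python (a stand-in for nn.CrossEntropyLoss, not its actual code) and compares feeding it raw logits against feeding it already-softmaxed probabilities.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(inputs, target):
    """Cross-entropy that applies log-softmax to its inputs internally."""
    return -math.log(softmax(inputs)[target])

logits = [10.0, 0.0, 0.0]  # a confident, correct 3-class prediction
loss_from_logits = cross_entropy(logits, target=0)
loss_from_probs = cross_entropy(softmax(logits), target=0)

# Probabilities lie in [0, 1], so after a second softmax the loss for
# K classes can never fall below -log(e / (e + K - 1)).
floor_three_classes = -math.log(math.e / (math.e + 2))
```

With raw logits the loss is close to zero, as it should be for a confident correct prediction. With the extra Softmax it stays above roughly 0.55 no matter how confident the model becomes, which flattens the gradient signal.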
R08 Normalization at output warn ordering
Trigger
A normalization layer is wired directly into Output without a final linear / classification head.
Why
Normalization zero-centers the logits, breaking calibration. The model's output scale is no longer meaningful.
R09 Deep network with no residual connections warn pattern
Trigger
≥ 8 weight-carrying layers (conv2d / linear / etc.) and zero residual / skip / add nodes.
Why
Gradient signal degrades through depth without skip connections.
R10 Attention without positional encoding warn pattern
Trigger
Model contains attention layers but no positional encoding (sinusoidal, learned, RoPE, ALiBi).
Why
Self-attention is permutation-equivariant: it has no built-in notion of token order. Without PE, the model treats its input as a bag of words and can't learn order.
R11 Sigmoid / Tanh in deep networks info pattern
Trigger
Sigmoid or Tanh activation appears in a network with ≥ 5 weight-carrying layers.
Why
Saturating activations cause vanishing gradients in deep stacks. ReLU / GELU / SiLU are the modern defaults.
Source
Glorot & Bengio 2010 on the vanishing gradient problem.
R12 Deep network with no normalization info pattern
Trigger
≥ 7 non-I/O layers and zero normalization layers (BN / LN / IN / GN / RMSNorm).
Why
Without normalization, deep nets converge slowly and are sensitive to initialization scale.
R13 Very high dropout rate warn performance
Trigger
Any dropout layer with p > 0.65.
Why
Common dropout rates are 0.1 to 0.5. Anything above 0.65 usually indicates a typo or a confused regularization strategy.
R14 Very large activation tensor warn performance
Trigger
Any layer's estimated activation tensor exceeds 50 M elements (~200 MB per sample at float32).
Why
Activations live in GPU memory through the backward pass. One bloated layer forces tiny batch sizes or OOM.
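The 50 M element / ~200 MB figure follows from 4 bytes per float32 element. A small sketch of the arithmetic, with a hypothetical feature-map shape chosen for illustration:

```python
import math

def activation_megabytes(shape, bytes_per_element=4):
    """Per-sample activation size in MB (10**6 bytes) for a tensor of `shape`."""
    return math.prod(shape) * bytes_per_element / 1e6

# The rule's limit: 50 M float32 elements.
limit_mb = activation_megabytes((50_000_000,))

# Hypothetical example: 64 channels at 1024 x 1024 resolution, per sample.
feature_map = (64, 1024, 1024)
elements = math.prod(feature_map)
over_limit = elements > 50_000_000
```

The example feature map holds about 67 M elements per sample, so it would trip the rule before batch size is even considered.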
R15 MoE without auxiliary loss info pattern
Trigger
Model contains an MoE layer but no auxiliary load-balance loss is referenced.
Why
Without an aux loss, routing tends to collapse onto a few dominant experts. Many MoE recipes add a router load-balancing term with a small weight (around 0.01).
R16 GQA numHeads not divisible by numKVHeads error structure
Trigger
A groupedQueryAttention component has params where numHeads % numKVHeads != 0.
Why
GQA shares each key/value head across numHeads / numKVHeads query heads, so numKVHeads must divide numHeads evenly. Mis-set ratios crash on the first attention forward.
R17 SwiGLU intermediateSize convention info performance
Trigger
A swiGLU / gated FFN has intermediateSize that doesn't follow the common ~(2/3) × 4 × hidden convention used in LLaMA.
Why
SwiGLU uses three weight matrices instead of two, so the 2/3 factor keeps the param count on par with a standard 4× FFN. A mis-sized SwiGLU silently changes the FLOP and param budget.
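The convention and the parity argument can be checked with a few lines of arithmetic. The sketch below assumes the rounding used in the LLaMA reference code (round up to a multiple of 256); the function name and the `multiple_of` default are illustrative, and Neurarch's exact tolerance is not stated in the rule.

```python
def swiglu_intermediate_size(hidden, multiple_of=256):
    """(2/3) * 4 * hidden, rounded up to a multiple of `multiple_of`."""
    raw = int(2 * (4 * hidden) / 3)
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

hidden = 4096
intermediate = swiglu_intermediate_size(hidden)

# Standard FFN: two matrices, hidden -> 4*hidden -> hidden.
ffn_params = 2 * hidden * (4 * hidden)
# SwiGLU: three matrices (gate, up, down) through `intermediate`.
swiglu_params = 3 * hidden * intermediate
ratio = swiglu_params / ffn_params
```

For hidden = 4096 this gives an intermediate size of 11008 and a parameter ratio within about 1% of the standard FFN.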
R18 Conv feeds Linear with no flatten / pool error structure
Trigger
A convolution (conv1d/2d/3d or a depthwise/separable/transpose variant) connects directly into a linear layer.
Why
Conv outputs a multi-dimensional feature map; Linear expects a flat [batch, features] tensor. The forward pass raises a shape error, or silently mis-multiplies the spatial dims. Insert a Flatten or a Global Average Pool between them.
Source
Standard CNN classifier construction (e.g. LeNet / AlexNet head).
R19 Back-to-back activations warn ordering
Trigger
An activation node (relu/gelu/silu/sigmoid/tanh/softmax/…) connects directly into another activation node with no Linear/Conv/Norm between them.
Why
Two activations in a row add compute but little expressivity (repeating an idempotent one such as ReLU is a pure no-op). A common slip is ReLU → Softmax, which clips logits to ≥ 0 and distorts the output distribution.
Source
Composition of pointwise non-linearities adds no representational power without an intervening affine map (universal-approximation premise).
R20 Linear → Linear with no activation info pattern
Trigger
A linear layer connects directly into another linear layer with no activation between them.
Why
Two stacked linear maps collapse into one (W₂·W₁), so the extra layer costs parameters but adds no representational power — usually a forgotten ReLU. Deliberate low-rank / factorized projections (down-proj → up-proj) are the safe exception.
Source
Linear-map composition closure; the rank of the product cannot exceed the smaller intermediate dimension.
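The collapse is easy to confirm on small matrices. The sketch below uses plain Python lists and omits biases for brevity (they fold into a single bias the same way); the weights are arbitrary illustrative values.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def apply(weight, x):
    """Apply a bias-free linear layer: weight is (out x in), x is a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

w1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # Linear(2 -> 3)
w2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]     # Linear(3 -> 2)
x = [0.5, -2.0]

stacked = apply(w2, apply(w1, x))   # two layers, no activation between
merged = apply(matmul(w2, w1), x)   # one layer with weight W2 @ W1
```

Both paths produce the same output, so the two-layer stack computes nothing a single 2 → 2 layer could not.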
R21 Dropout immediately before Output warn ordering
Trigger
A dropout layer is the last node before the output node.
Why
In training this randomly zeroes the final logits fed to the loss; at eval it is a no-op, so train and eval behaviour diverge. Dropout belongs before the final projection (Linear/Conv), not after it.
Source
Srivastava et al. 2014 (Dropout) — dropout regularizes hidden representations, not output logits.
R22 Conv stride larger than kernel warn structure
Trigger
A downsampling convolution (conv1d/2d/3d, depthwise, or separable) has stride > kernelSize.
Why
The kernel jumps past its own footprint each step, so a band of the input is never read — a silent loss of information. Non-overlapping patches use stride == kernel (e.g. ViT 16/16); stride > kernel is almost always a typo.
Source
Convolution arithmetic — Dumoulin & Visin 2016.
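To see which inputs get skipped, the sketch below walks a 1-D kernel across an input and records the positions it never reads. It assumes no padding and no dilation; the function name `unread_positions` is illustrative.

```python
def unread_positions(length, kernel, stride):
    """Input indices that no kernel window covers (no padding, no dilation)."""
    read = set()
    for start in range(0, length - kernel + 1, stride):
        read.update(range(start, start + kernel))
    return [i for i in range(length) if i not in read]

# stride > kernel: windows cover 0-1, 3-4, 6-7 and skip the gaps.
skipped = unread_positions(length=10, kernel=2, stride=3)

# stride == kernel: non-overlapping patches, nothing skipped.
patches = unread_positions(length=9, kernel=3, stride=3)
```

With kernel 2 and stride 3, every third position falls between windows. With stride equal to the kernel size the input is tiled exactly.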
R23 Non-spatial tensor into Conv warn structure
Trigger
A flatten or linear layer connects directly into a (non-transpose) convolution.
Why
Mirror of R18: a Conv needs a [channels, …spatial] feature map, but Flatten / Linear emit a flat [batch, features] vector. The forward pass raises a shape error unless the spatial dims are rebuilt with a Reshape / Unflatten first.
Source
PyTorch nn.Conv*d input-shape contract.
R24 Back-to-back normalization info pattern
Trigger
A normalization layer connects directly into another normalization layer (e.g. LayerNorm → BatchNorm).
Why
Normalizing an already-normalized tensor is redundant; the second layer mostly re-centres/re-scales what the first produced and just burns its own learnable parameters.
Source
Idempotence of standardization — Ioffe & Szegedy 2015.
R25 Duplicate positional encoding info pattern
Trigger
More than one positional-encoding layer (positionalEncoding, learnedPositionalEmbedding, rope, alibi) in the model.
Why
Position is normally injected once. Stacking absolute + rotary, or two of the same, double-counts position and tends to hurt more than help.
Source
Positional-encoding design — Vaswani et al. 2017, RoPE (Su et al. 2021).
R26 Pooling feeds Linear with no flatten warn structure
Trigger
A spatial pooling layer (maxpool, avgpool, adaptive pool) connects directly into a linear layer. Global pools are exempt — they already collapse spatial dims.
Why
Sibling of R18: pooling still emits a [channels, …spatial] map, but Linear expects a flat [batch, features] tensor, so the forward pass raises a shape error. Insert a Flatten or a Global Average Pool first.
Source
Standard CNN classifier head construction.
R27 Flatten before attention warn structure
Trigger
A flatten layer connects directly into an attention layer.
Why
Flatten collapses the sequence dimension into one long vector, leaving a length-1 sequence, so attention has nothing to relate. Keep the [sequence, dim] layout and flatten only after the attention stack.
Source
Attention operates over a sequence axis — Vaswani et al. 2017.
R28 ConvTranspose checkerboard risk info pattern
Trigger
A ConvTranspose2d has kernelSize not divisible by stride (with stride > 1).
Why
Uneven kernel overlap during upsampling deposits more weight on some output pixels than others, producing checkerboard artifacts. Make kernelSize a multiple of stride, or upsample with Upsample + Conv (resize-convolution).
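The uneven overlap can be counted directly. In the 1-D sketch below, each input position writes into `kernel` consecutive output positions starting at `i * stride`, and the function tallies how many inputs land on each output. It assumes no padding or output padding; the function name is illustrative.

```python
def overlap_counts(num_inputs, kernel, stride):
    """How many input positions contribute to each transposed-conv output."""
    out_len = (num_inputs - 1) * stride + kernel
    counts = [0] * out_len
    for i in range(num_inputs):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts

uneven = overlap_counts(num_inputs=4, kernel=3, stride=2)  # 3 % 2 != 0
even = overlap_counts(num_inputs=4, kernel=4, stride=2)    # 4 % 2 == 0
```

With kernel 3 and stride 2 the interior alternates between one and two contributions, which is the checkerboard pattern. With kernel 4 the interior is uniform and only the borders differ.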
R29 GroupNorm channels not divisible by numGroups error structure
Trigger
A GroupNorm layer's channel count is not an exact multiple of numGroups.
Why
GroupNorm splits channels into equal groups; a non-divisible count raises at construction. Parallel to the GQA / attention head-dim divisibility checks. Set numGroups to a divisor of the channel count.
R30 Very large Linear layer warn performance
Trigger
A single Linear exceeds ~1B parameters (inFeatures × outFeatures).
Why
A dense layer that large (~4 GB float32) almost always means a feature map was flattened without pooling first. Add a Global Average Pool / more downsampling, or factorize the layer. Embedding and vocab-projection heads are the expected exception.
Source
Parameter-budget hygiene; a single matrix this size dominates model memory.

Try it on your own model

Paste a Hugging Face model ID or drop a .py file. The same 35 rules run client-side, and now also run in CI on raw .py files via the neurarch-lint GitHub Action, sub-millisecond.

Open Neurarch →