These checks run live as you build, on every model change, with no agent involvement.
They surface as inline warnings on the canvas and in the Advisor panel.
Source: src/utils/architectureAdvisor.ts.
R01
No Input node
error
structure
Trigger
Model has ≥ 1 component but no component of type input.
Why
Without an Input layer, tensor shapes can't be propagated and generated code is incomplete.
Fix
Drag an Input layer from the I/O section of the palette.
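As a rough illustration of how a rule like R01 could be expressed, here is a minimal sketch. The `Component` and `Finding` shapes and the `checkNoInput` name are hypothetical; the actual types in architectureAdvisor.ts are not shown in this reference.

```typescript
// Hypothetical shapes, assumed for illustration only.
interface Component {
  id: string;
  type: string;
}

interface Finding {
  ruleId: string;
  severity: "error" | "warn" | "info";
  category: string;
  message: string;
}

// R01 as described above: the model has at least one component,
// but none of them has type "input".
function checkNoInput(components: Component[]): Finding | null {
  if (components.length === 0) return null; // an empty canvas is not flagged
  const hasInput = components.some((c) => c.type === "input");
  if (hasInput) return null;
  return {
    ruleId: "R01",
    severity: "error",
    category: "structure",
    message: "No Input node",
  };
}
```

Returning `null` for a clean model keeps each rule a small pure function that can be re-run on every model change.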
R02
No Output node
warn
structure
Trigger
Model has > 1 component but no component of type output.
Why
The code generator can't tell where the forward pass terminates.
R03
Isolated components
warn
structure
Trigger
Component has no inbound and no outbound connections.
Why
Most often a stranded layer left over from a refactor. The code generator skips it, but it clutters the visual graph.
R04
Dead-end layer
warn
structure
Trigger
A non-output node has inbound connections but no outbound connections.
Why
Compute happens but the result is discarded, so the layer contributes nothing to the loss and its parameters never receive a gradient.
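R03 and R04 both come down to counting inbound and outbound connections per node. The sketch below shows one way to do that; the `GraphNode` and `Edge` shapes and the `classifyByDegree` name are assumptions made for this example, not the advisor's real data model.

```typescript
// Hypothetical graph shapes, assumed for illustration only.
interface GraphNode {
  id: string;
  type: string;
}

interface Edge {
  from: string;
  to: string;
}

// Classify nodes by connection count:
//  - isolated  (R03): no inbound and no outbound connections
//  - dead ends (R04): inbound but no outbound, and not an output node
function classifyByDegree(
  nodes: GraphNode[],
  edges: Edge[],
): { isolated: string[]; deadEnds: string[] } {
  const inbound = new Map<string, number>();
  const outbound = new Map<string, number>();
  for (const e of edges) {
    outbound.set(e.from, (outbound.get(e.from) ?? 0) + 1);
    inbound.set(e.to, (inbound.get(e.to) ?? 0) + 1);
  }

  const isolated: string[] = [];
  const deadEnds: string[] = [];
  for (const n of nodes) {
    const inDeg = inbound.get(n.id) ?? 0;
    const outDeg = outbound.get(n.id) ?? 0;
    if (inDeg === 0 && outDeg === 0) {
      isolated.push(n.id);
    } else if (inDeg > 0 && outDeg === 0 && n.type !== "output") {
      deadEnds.push(n.id);
    }
  }
  return { isolated, deadEnds };
}
```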
R05
BatchNorm / LayerNorm after activation
warn
ordering
Trigger
A normalization layer is connected directly downstream of an activation (relu, gelu, swish, etc.).
Why
Normalization is meant to stabilize the pre-activation distribution. Applying it after activation breaks the assumption and shifts already-nonlinear features back toward zero mean.
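Ordering rules such as R05 inspect single edges: is the upstream node of one kind and the downstream node of another? A minimal sketch follows. The type strings (`"relu"`, `"batchNorm"`, and so on) and the `findNormAfterActivation` helper are hypothetical; the advisor's actual identifiers may differ.

```typescript
// Hypothetical shapes and type names, assumed for illustration only.
interface LayerNode {
  id: string;
  type: string;
}

interface Link {
  from: string;
  to: string;
}

const ACTIVATION_TYPES = new Set(["relu", "gelu", "swish"]);
const NORM_TYPES = new Set(["batchNorm", "layerNorm"]);

// Return every link whose source is an activation and whose target is a
// normalization layer, i.e. the pattern R05 warns about.
function findNormAfterActivation(nodes: LayerNode[], links: Link[]): Link[] {
  const typeById = new Map<string, string>(
    nodes.map((n): [string, string] => [n.id, n.type]),
  );
  return links.filter((l) => {
    const up = typeById.get(l.from);
    const down = typeById.get(l.to);
    return (
      up !== undefined &&
      down !== undefined &&
      ACTIVATION_TYPES.has(up) &&
      NORM_TYPES.has(down)
    );
  });
}
```

Swapping the two sets for other type groups gives the same check for the other direct-edge rules in the ordering category.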
R06
Dropout directly before BatchNorm
info
ordering
Trigger
A dropout layer is connected directly upstream of a BatchNorm.
Why
Dropout at train time changes the activation variance; BN's running statistics get distorted. A known train-vs-eval mismatch.
R07
Softmax / Sigmoid directly before Output
info
ordering
Trigger
An explicit Softmax or Sigmoid layer is wired directly into Output.
Why
PyTorch's nn.CrossEntropyLoss applies LogSoftmax internally, so an explicit Softmax before it normalizes twice, which flattens the distribution the loss sees and slows training. nn.BCEWithLogitsLoss has the same issue with an explicit Sigmoid.
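The effect of normalizing twice can be seen numerically. This small sketch (a plain softmax written for the example, not the advisor's code) applies softmax once and then again to its own output, which is what happens when an explicit Softmax feeds a loss that normalizes internally.

```typescript
// Numerically stable softmax over a vector of logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const exampleLogits = [2, 0];

// What the loss would see with raw logits.
const once = softmax(exampleLogits);

// What the loss effectively sees when an explicit Softmax sits in front of it:
// the inputs are now confined to [0, 1], so the second pass is much flatter.
const twice = softmax(once);
```

Here the first class still wins after the second pass, but with a visibly smaller margin, so the gradient signal that separates the classes is weaker.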
R08
Normalization at output
warn
ordering
Trigger
A normalization layer is wired directly into Output without a final linear / classification head.
Why
Normalization zero-centers the logits, breaking calibration. The model's output scale is no longer meaningful.
R09
Deep network with no residual connections
warn
pattern
Trigger
≥ 8 weight-carrying layers (conv2d / linear / etc.) and zero residual / skip / add nodes.
Why
Gradient signal degrades through depth without skip connections.
R10
Attention without positional encoding
warn
pattern
Trigger
Model contains attention layers but no positional encoding (sinusoidal, learned, RoPE, ALiBi).
Why
Self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs. Without PE, the model treats the sequence as a bag of tokens and can't learn order.
R11
Sigmoid / Tanh in deep networks
info
pattern
Trigger
Sigmoid or Tanh activation appears in a network with ≥ 5 weight-carrying layers.
Why
Saturating activations cause vanishing gradients in deep stacks. ReLU / GELU / SiLU are the modern defaults.
R12
Deep network with no normalization
info
pattern
Trigger
≥ 7 non-I/O layers and zero normalization layers (BN / LN / IN / GN / RMSNorm).
Why
Without normalization, deep nets converge slowly and are sensitive to initialization scale.
R13
Very high dropout rate
warn
performance
Trigger
Any dropout layer with p > 0.65.
Why
Common dropout rates are 0.1 to 0.5. Anything above 0.65 usually indicates a typo or a confused regularization strategy.
R14
Very large activation tensor
warn
performance
Trigger
Any layer's estimated activation tensor exceeds 50 M elements (~200 MB per sample at float32).
Why
Activations live in GPU memory through the backward pass. One bloated layer forces tiny batch sizes or OOM.
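The 50 M element threshold and the ~200 MB figure follow from simple arithmetic: element count is the product of the per-sample shape, and float32 uses 4 bytes per element. A sketch, with hypothetical helper names and a per-sample shape that excludes the batch dimension:

```typescript
const BYTES_PER_FLOAT32 = 4;
const ELEMENT_LIMIT = 50000000; // 50 M elements, the R14 threshold

// Number of elements in one sample's activation tensor.
function activationElements(shape: number[]): number {
  return shape.reduce((acc, dim) => acc * dim, 1);
}

// Memory for one sample's activation at float32, in bytes.
function activationBytes(shape: number[]): number {
  return activationElements(shape) * BYTES_PER_FLOAT32;
}

function exceedsActivationLimit(shape: number[]): boolean {
  return activationElements(shape) > ELEMENT_LIMIT;
}
```

For example, a `[64, 1024, 1024]` feature map holds 67,108,864 elements per sample and would trip the rule, while `[64, 512, 512]` would not.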
R15
MoE without auxiliary loss
info
pattern
Trigger
Model contains an MoE layer but no auxiliary load-balance loss is referenced.
Why
Without an aux loss, routing tends to collapse onto a few dominant experts. A small load-balancing term is typical; Switch Transformer, for example, weights it at 0.01.
R16
GQA numHeads not divisible by numKVHeads
error
structure
Trigger
A groupedQueryAttention component has params where numHeads % numKVHeads != 0.
Why
GQA shares each KV head across a group of query heads, so numKVHeads must divide numHeads evenly. Mis-set ratios crash on the first attention forward.
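The R16 condition is a single modulo test. A minimal sketch, assuming a hypothetical `checkGqaHeads` helper that also reports the resulting group size:

```typescript
// Check the divisibility condition from R16 and report how many query heads
// would share each KV head. Helper name and return shape are illustrative.
function checkGqaHeads(
  numHeads: number,
  numKVHeads: number,
): { ok: boolean; groupSize: number | null } {
  if (numKVHeads <= 0 || numHeads <= 0) {
    return { ok: false, groupSize: null };
  }
  if (numHeads % numKVHeads !== 0) {
    return { ok: false, groupSize: null };
  }
  return { ok: true, groupSize: numHeads / numKVHeads };
}
```

With 32 query heads, 8 KV heads gives groups of 4, while 6 KV heads fails the check.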
R17
SwiGLU intermediateSize convention
info
performance
Trigger
A swiGLU / gated FFN has intermediateSize that doesn't follow the common ~(2/3) × 4 × hidden convention used in the original LLaMA models.
Why
The convention keeps the parameter count on par with a standard 4× FFN; a mis-sized SwiGLU silently changes the FLOP and parameter budget.
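The (2/3) × 4 factor comes from matching parameter counts: a standard FFN has two matrices of size hidden × 4·hidden (8·hidden² weights), while SwiGLU has three matrices of size hidden × m, so parity needs m ≈ 8·hidden / 3. The sketch below works this out. Rounding up to a multiple of 256 is an assumption made for this example (models typically round to a hardware-friendly size), and the function names are hypothetical.

```typescript
// Intermediate size that keeps SwiGLU at roughly the same parameter count as
// a standard 4x FFN. The multipleOf rounding is an illustrative assumption.
function swigluIntermediateSize(hidden: number, multipleOf = 256): number {
  const raw = Math.floor((2 * 4 * hidden) / 3);
  return multipleOf * Math.ceil(raw / multipleOf);
}

// Weights in a standard FFN: up-projection plus down-projection, 4x expansion.
function ffnParams(hidden: number): number {
  return 2 * hidden * (4 * hidden);
}

// Weights in a SwiGLU FFN: gate, up, and down projections.
function swigluParams(hidden: number, intermediate: number): number {
  return 3 * hidden * intermediate;
}
```

For hidden = 4096 this yields an intermediate size of 11008, and the two parameter counts land within about 1% of each other.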
R18
Conv feeds Linear with no flatten / pool
error
structure
Trigger
A convolution (conv1d/2d/3d or a depthwise/separable/transpose variant) connects directly into a linear layer.
Why
Conv outputs a multi-dimensional feature map; Linear expects a flat [batch, features] tensor. The forward pass raises a shape error, or silently mis-multiplies the spatial dims. Insert a Flatten or a Global Average Pool between them.
Source
Standard CNN classifier construction (e.g. LeNet / AlexNet head).
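Both suggested fixes determine the inFeatures the following Linear needs. A short sketch of the arithmetic, with hypothetical helper names and a per-sample feature map of shape [channels, height, width]:

```typescript
// Features seen by a Linear after Flatten: every spatial position of every
// channel becomes its own feature.
function flattenFeatures(channels: number, height: number, width: number): number {
  return channels * height * width;
}

// Features seen by a Linear after Global Average Pool: one value per channel,
// regardless of the spatial size.
function globalAvgPoolFeatures(channels: number): number {
  return channels;
}
```

For a 64-channel 7 × 7 map, Flatten produces 3136 features and Global Average Pool produces 64, so the choice changes the size of the classifier head by a factor of 49.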
R19
Back-to-back activations
warn
ordering
Trigger
An activation node (relu/gelu/silu/sigmoid/tanh/softmax/…) connects directly into another activation node with no Linear/Conv/Norm between them.
Why
Two activations in a row add compute but little expressivity; for idempotent activations such as ReLU, a same-type repeat is an exact no-op. A common slip is ReLU → Softmax, which clips logits to ≥ 0 and distorts the output distribution.
Source
A composition of pointwise non-linearities is itself just a pointwise non-linearity; added representational power comes from the intervening affine maps (universal-approximation premise).
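Two of the R19 cases are easy to verify by hand: ReLU applied twice gives the same result as ReLU applied once, and ReLU in front of a Softmax erases the difference between negative logits. A minimal sketch with a hand-written `relu`, for illustration only:

```typescript
// Elementwise ReLU over a vector.
const relu = (xs: number[]): number[] => xs.map((x) => Math.max(0, x));

const rawLogits = [-2, -1, 3];

// ReLU is idempotent: the second application changes nothing.
const reluOnce = relu(rawLogits);
const reluTwice = relu(reluOnce);

// After clipping, the first two logits are indistinguishable, so any
// downstream Softmax assigns them equal probability even though the
// raw logits ranked them differently.
const firstTwoCollapsed = reluOnce[0] === reluOnce[1];
```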
R20
Linear → Linear with no activation
info
pattern
Trigger
A linear layer connects directly into another linear layer with no activation between them.
Why
Two stacked linear maps collapse into one (W₂·W₁), so the extra layer costs parameters but adds no representational power — usually a forgotten ReLU. Deliberate low-rank / factorized projections (down-proj → up-proj) are the safe exception.
Source
Linear-map composition closure; the rank of the product cannot exceed the smaller intermediate dimension.
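The collapse can be checked directly: applying W₁ and then W₂ to a vector gives the same result as applying the single matrix W₂·W₁. The sketch below omits biases for brevity (with biases the composition is still a single affine map); the helper names are written for this example.

```typescript
type Matrix = number[][];

// Matrix-vector product.
function matVec(m: Matrix, v: number[]): number[] {
  return m.map((row) => row.reduce((acc, w, j) => acc + w * v[j], 0));
}

// Matrix-matrix product a · b.
function matMul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((acc, w, k) => acc + w * b[k][j], 0)),
  );
}

// First layer maps 2 -> 3 features, second maps 3 -> 2.
const W1: Matrix = [
  [1, 2],
  [3, 4],
  [5, 6],
];
const W2: Matrix = [
  [1, 0, -1],
  [2, 1, 0],
];
const x = [1, -1];

// Two stacked linear layers...
const stacked = matVec(W2, matVec(W1, x));

// ...versus one layer using the pre-multiplied weight matrix.
const collapsed = matVec(matMul(W2, W1), x);
```

Both paths produce the same output vector, which is why the extra layer adds parameters without adding capacity unless a non-linearity sits between the two.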
R21
Dropout immediately before Output
warn
ordering
Trigger
A dropout layer is the last node before the output node.
Why
In training this randomly zeroes the final logits fed to the loss; at eval it is a no-op, so train and eval behaviour diverge. Dropout belongs before the final projection (Linear/Conv), not after it.
R22
Conv stride larger than kernel
warn
structure
Trigger
A downsampling convolution (conv1d/2d/3d, depthwise, or separable) has stride > kernelSize.
Why
The kernel jumps past its own footprint each step, so a band of the input is never read — a silent loss of information. Non-overlapping patches use stride == kernel (e.g. ViT 16/16); stride > kernel is almost always a typo.
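To see which inputs are skipped, the sketch below walks a 1D convolution window across an unpadded input and records every position that no window touches. It is a simplified model (one dimension, no padding or dilation), and `unreadPositions` is a name made up for this example.

```typescript
// Input indices never covered by any kernel window, for a 1D convolution
// with no padding and no dilation.
function unreadPositions(
  inputLength: number,
  kernelSize: number,
  stride: number,
): number[] {
  if (kernelSize < 1 || stride < 1) {
    throw new RangeError("kernelSize and stride must be >= 1");
  }
  const read = new Array<boolean>(inputLength).fill(false);
  for (let start = 0; start + kernelSize <= inputLength; start += stride) {
    for (let k = 0; k < kernelSize; k++) {
      read[start + k] = true;
    }
  }
  const unread: number[] = [];
  read.forEach((wasRead, i) => {
    if (!wasRead) unread.push(i);
  });
  return unread;
}
```

With an input of length 9, kernel 2 and stride 3 leave positions 2, 5, and 8 unread, while kernel 3 and stride 3 cover every position.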
R23
Non-spatial tensor into Conv
warn
structure
Trigger
A flatten or linear layer connects directly into a (non-transpose) convolution.
Why
Mirror of R18: a Conv needs a [batch, channels, …spatial] feature map, but Flatten / Linear emit a flat [batch, features] tensor. The forward pass raises a shape error unless the spatial dims are rebuilt with a Reshape / Unflatten first.
Source
PyTorch nn.Conv*d input-shape contract.
R24
Back-to-back normalization
info
pattern
Trigger
A normalization layer connects directly into another normalization layer (e.g. LayerNorm → BatchNorm).
Why
Normalizing an already-normalized tensor is redundant; the second layer mostly re-centres/re-scales what the first produced and just burns its own learnable parameters.
R25
Duplicate positional encoding
info
pattern
Trigger
More than one positional-encoding layer (positionalEncoding, learnedPositionalEmbedding, rope, alibi) in the model.
Why
Position is normally injected once. Stacking absolute + rotary, or two of the same, double-counts position and tends to hurt more than help.
R26
Pooling feeds Linear with no flatten
warn
structure
Trigger
A spatial pooling layer (maxpool, avgpool, adaptive pool) connects directly into a linear layer. Global pools are exempt — they already collapse spatial dims.
Why
Sibling of R18: pooling still emits a [batch, channels, …spatial] map, but Linear expects a flat [batch, features] tensor, so the forward pass typically raises a shape error. Insert a Flatten or a Global Average Pool first.
Source
Standard CNN classifier head construction.
R27
Flatten before attention
warn
structure
Trigger
A flatten layer connects directly into an attention layer.
Why
Flatten collapses the sequence dimension into one long vector, leaving a length-1 sequence, so attention has nothing to relate. Keep the [sequence, dim] layout and flatten only after the attention stack.
R28
ConvTranspose checkerboard risk
info
pattern
Trigger
A ConvTranspose2d has kernelSize not divisible by stride (with stride > 1).
Why
Uneven kernel overlap during upsampling deposits more weight on some output pixels than others, producing checkerboard artifacts. Make kernelSize a multiple of stride, or upsample with Upsample + Conv (resize-convolution).
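The uneven overlap can be counted directly. In a transposed convolution each input position writes into `kernelSize` consecutive output positions, starting at `i * stride`. The sketch below tallies how many inputs contribute to each output position in 1D with no padding; `overlapCounts` is a hypothetical helper written for this example.

```typescript
// For a 1D transposed convolution without padding, count how many input
// positions contribute to each output position.
function overlapCounts(
  inputLength: number,
  kernelSize: number,
  stride: number,
): number[] {
  const outputLength = (inputLength - 1) * stride + kernelSize;
  const counts = new Array<number>(outputLength).fill(0);
  for (let i = 0; i < inputLength; i++) {
    for (let k = 0; k < kernelSize; k++) {
      counts[i * stride + k] += 1;
    }
  }
  return counts;
}

// kernelSize 3, stride 2: interior counts alternate between 1 and 2.
const uneven = overlapCounts(4, 3, 2);

// kernelSize 4, stride 2: interior counts are a constant 2.
const even = overlapCounts(4, 4, 2);
```

The alternating pattern in the first case is the 1D version of the checkerboard; when kernelSize is a multiple of stride the interior overlap is uniform.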
R29
GroupNorm channels not divisible by numGroups
error
structure
Trigger
A GroupNorm layer's channel count is not an exact multiple of numGroups.
Why
GroupNorm splits channels into equal groups; a non-divisible count raises at construction. Parallel to the GQA / attention head-dim divisibility checks. Set numGroups to a divisor of the channel count.
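When the check fails, it helps to list which group counts would work. A minimal sketch, with hypothetical helper names, that validates a setting and enumerates the divisors of the channel count:

```typescript
// True when the channels can be split into numGroups equal groups.
function isValidGroupCount(channels: number, numGroups: number): boolean {
  return numGroups > 0 && channels % numGroups === 0;
}

// Every numGroups value that divides the channel count exactly.
function validGroupCounts(channels: number): number[] {
  const divisors: number[] = [];
  for (let g = 1; g <= channels; g++) {
    if (channels % g === 0) divisors.push(g);
  }
  return divisors;
}
```

For a 48-channel layer, the common default of 32 groups is rejected, and the valid choices are 1, 2, 3, 4, 6, 8, 12, 16, 24, and 48.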
R30
Very large Linear layer
warn
performance
Trigger
A single Linear exceeds ~1B parameters (inFeatures × outFeatures).
Why
A dense layer that large (~4 GB float32) almost always means a feature map was flattened without pooling first. Add a Global Average Pool / more downsampling, or factorize the layer. Embedding and vocab-projection heads are the expected exception.
Source
Parameter-budget hygiene; a single matrix this size dominates model memory.
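The threshold is easy to reach by accident. The sketch below counts weights as inFeatures × outFeatures, as in the trigger above, and converts to float32 bytes; the helper names and the example layer sizes are assumptions chosen for illustration.

```typescript
const PARAM_LIMIT = 1000000000; // ~1B parameters, the R30 threshold
const FLOAT32_BYTES = 4;

// Weight count of a Linear layer (bias ignored, matching the trigger).
function linearParams(inFeatures: number, outFeatures: number): number {
  return inFeatures * outFeatures;
}

// Weight memory at float32, in bytes.
function linearBytes(inFeatures: number, outFeatures: number): number {
  return linearParams(inFeatures, outFeatures) * FLOAT32_BYTES;
}

function isVeryLargeLinear(inFeatures: number, outFeatures: number): boolean {
  return linearParams(inFeatures, outFeatures) > PARAM_LIMIT;
}

// A 512-channel 56 x 56 feature map flattened straight into a 1000-way head.
const flattenedIn = 512 * 56 * 56;
const flaggedHead = isVeryLargeLinear(flattenedIn, 1000);

// The same head after a Global Average Pool: 512 inputs instead.
const pooledHead = isVeryLargeLinear(512, 1000);
```

The flattened head has about 1.6 B weights (roughly 6.4 GB at float32), while the pooled head has 512,000.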