
Architecture Tricks

U-Net skips, BigramHash, SmearGate, and more

When the Transformer Is Not Enough

Every serious Parameter Golf submission starts with the same backbone: a Transformer. 899 of the 1,162 parsed entries use a vanilla Transformer. But the top scorers do not stop there. They bolt on extra components -- skip connections that echo U-Net, hash-based bigram features, gating mechanisms -- that squeeze more performance from the same parameter budget.

The numbers are striking. BigramHash appears in 583 submissions. SmearGate shows up in 396. U-Net skip connections appear in 275. And the best neural submission (PR #1056, 0.018 BPB) uses all three.

These are not separate architectures. They are modifications layered onto a standard Transformer, each addressing a different weakness. Understanding what each one does -- and why they help -- reveals how far you can push a small model when every byte counts.

BigramHash: Free Features from Character Pairs

Here is a question worth asking: what is the cheapest way to give a language model extra information about the input?

The answer, used by over half of all Parameter Golf submissions, is BigramHash. The idea is simple. Take each pair of adjacent characters in the input. Hash that pair into a fixed-size embedding table. Add the result to the token embedding before it enters the Transformer.

Why This Works

A Transformer processes tokens one at a time through self-attention. To understand the relationship between adjacent characters, it needs at least one attention layer. But bigram statistics -- which character tends to follow which -- are extremely predictable. The letter "q" is almost always followed by "u". The space before "the" is vastly more common than the space before "xyl".

BigramHash gives the model this local context for free, before the first attention layer even fires. The Transformer can then spend its limited capacity on harder, longer-range patterns instead of wasting attention heads on obvious bigram predictions.

The Hash Trick

Why hash instead of using a full lookup table? A naive bigram table for byte-level tokens would need 256 x 256 = 65,536 entries. At 512 dimensions each in 32-bit floats, that is 128 MB of embeddings alone -- wildly over budget.

Hashing maps those 65,536 possible bigrams down to a much smaller table (typically 8,192 or 16,384 entries). Collisions happen, but they are rare enough to be harmless. The hash approach costs a tiny fraction of the parameters while capturing most of the bigram signal.
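To make this concrete, here is the hash in isolation -- a sketch using the multiplier and table size quoted above (257 and 8,192 are common choices across submissions, not a fixed standard):

```python
def bigram_hash(prev: int, cur: int, table_size: int = 8192) -> int:
    """Map a pair of byte values to a row in a small embedding table.
    257 is a prime just above 256, so distinct byte pairs spread out
    before the modulo folds 65,536 possibilities into 8,192 rows."""
    return (prev * 257 + cur) % table_size

# Hash every adjacent pair in a short byte string
text = b"queen"
hashes = [bigram_hash(a, b) for a, b in zip(text, text[1:])]
print(hashes)
```

On average each row is shared by eight of the 65,536 possible byte pairs, but most byte pairs almost never occur in real text, which is why the collisions cost so little.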

Impact on Scores

The top 5 neural submissions all use BigramHash. Among submissions with val_bpb scores, those using BigramHash average significantly better than those without. It is essentially a free lunch -- a few thousand extra table entries for a meaningful BPB improvement.

SmearGate: Mixing Adjacent Token Information

SmearGate appears in 396 submissions, including the best neural entry (PR #1056, 0.018 BPB). Where BigramHash adds static bigram features, SmearGate provides a learned mechanism for blending information between adjacent positions.

The Core Idea

At each position in the sequence, SmearGate computes a gate between 0 and 1 (in practice often one gate value per embedding dimension). This gate controls how much of the previous token's representation gets "smeared" into the current position:

output[t] = gate[t] * input[t-1] + (1 - gate[t]) * input[t]

When the gate is 0, the position keeps its own representation untouched. When the gate is 1, it completely adopts its neighbor's representation. In practice, the gate learns intermediate values that blend the two.

Why Not Just Use Attention?

Attention can already mix information across positions. But attention is expensive -- it scales quadratically with sequence length -- and it operates at a coarse level. SmearGate is a cheap, local operation that handles the most common case: information flow between adjacent tokens.

Think of it like this. Attention is a conference call where everyone can talk to everyone. SmearGate is whispering to the person next to you. Both are useful, but whispering is faster and cheaper.

Where It Lives in the Architecture

SmearGate is typically applied right after the token embedding, before the first Transformer block. This means the Transformer layers receive inputs that already contain some local context. Combined with BigramHash, the Transformer starts with surprisingly rich representations before a single attention operation.


U-Net Skip Connections: Borrowing from Image Segmentation

The U-Net architecture was invented for biomedical image segmentation in 2015. It has an encoder that progressively compresses the input and a decoder that expands it back, with skip connections that pipe information directly from encoder layers to matching decoder layers.

275 Parameter Golf submissions borrow this idea. Instead of a straight stack of Transformer layers (layer 1 feeds layer 2 feeds layer 3 and so on), they create skip connections that link early layers directly to late layers.

How It Works in a Transformer

Imagine an 11-layer Transformer. In a standard architecture, layer 1's output feeds only into layer 2. With U-Net skips, the architecture might look like this:

Layer    | Receives Input From
---------|---------------------------
Layer 1  | Token embeddings
Layer 2  | Layer 1
Layer 3  | Layer 2
Layer 4  | Layer 3
Layer 5  | Layer 4
Layer 6  | Layer 5
Layer 7  | Layer 6 + Layer 5 (skip)
Layer 8  | Layer 7 + Layer 4 (skip)
Layer 9  | Layer 8 + Layer 3 (skip)
Layer 10 | Layer 9 + Layer 2 (skip)
Layer 11 | Layer 10 + Layer 1 (skip)

The skip connections are typically concatenated or added to the regular input, sometimes with a learned scaling factor.
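One way to wire this up is the following sketch, which assumes additive skips with a learned per-skip scaling factor (the class name and zero initialization are illustrative choices, not taken from any submission):

```python
import torch
import torch.nn as nn

class UNetSkipStack(nn.Module):
    """Odd-depth stack with U-Net-style skips: each early layer's output
    is saved and added back, scaled by a learned factor, at the input of
    its mirrored late layer (layer 1 -> layer 11, layer 2 -> layer 10, ...)."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        assert len(layers) % 2 == 1, "odd depth leaves one skip-free middle layer"
        self.layers = layers
        # One learned scale per skip, zero-initialized so training
        # starts from the plain sequential stack.
        self.skip_scale = nn.Parameter(torch.zeros(len(layers) // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = len(self.layers) // 2
        saved = []
        for i, layer in enumerate(self.layers):
            if i > half:                      # decoder side: add mirrored skip
                x = x + self.skip_scale[i - half - 1] * saved.pop()
            x = layer(x)
            if i < half:                      # encoder side: remember output
                saved.append(x)
        return x
```

With zero-initialized scales the skips start inert, so the optimizer only opens the ones it finds useful.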

Why This Helps Small Models

Large Transformers can afford to be wasteful. If a useful feature from layer 2 gets diluted by layers 3 through 10, the model has enough parameters to re-derive it. Small models cannot. Skip connections give late layers direct access to early features without re-computation.

This is especially valuable for language modeling because different linguistic features live at different depths. Character-level patterns (spelling, common subwords) emerge in early layers. Syntactic patterns (grammar, phrase structure) emerge in middle layers. Semantic patterns (meaning, topic) emerge in late layers. Skip connections let the final prediction layer draw on all three levels simultaneously.

A Representative U-Net Submission

PR #826 (0.295 BPB) describes itself as an "11-layer Transformer-like model with 512d, GQA 8/4, MLP 3.0x, BigramHash, SmearGate, XSA, Partial RoPE, LN Scale, U-Net skips, VE128." This kitchen-sink approach -- combining U-Net skips with every other trick on this page -- illustrates how these modifications are meant to be layered, not used in isolation.

Depth Recurrence: More Layers Without More Parameters

116 submissions use depth recurrence -- running the same set of Transformer layers multiple times in sequence. Instead of 11 unique layers, you might have 5 unique layers that execute twice each, giving you 10 effective layers with only 5 layers' worth of parameters.

The Trade-off

This is a direct trade of compute for parameters. Each recurrent pass through the shared layers costs the same FLOPs as a unique layer would. But the weights are shared, so you stay under the 16 MB artifact limit more easily.

The risk is that shared layers might not specialize. In a standard Transformer, layer 3 can learn completely different features from layer 7. When they share weights, both passes must use the same feature detectors. The model loses some representational diversity.

When It Works

Depth recurrence works best when:

  • You are severely parameter-constrained (which you always are in Parameter Golf)
  • The task benefits more from depth than from width
  • You combine recurrence with conditioning signals that differentiate the passes

Some submissions use FiLM conditioning -- feeding a pass index into each layer so the shared weights can behave differently on the first pass versus the second. This partially restores the specialization that weight sharing removes.
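A minimal sketch of depth recurrence with per-pass conditioning -- here a learned per-pass vector stands in for full FiLM, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Run the same shared layers `n_passes` times. A learned per-pass
    embedding is added before each pass so the shared weights can tell
    the passes apart -- a lightweight stand-in for FiLM conditioning."""

    def __init__(self, n_layers: int = 5, n_passes: int = 2, dim: int = 64):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(n_layers)
        )
        self.pass_embed = nn.Parameter(torch.zeros(n_passes, dim))
        self.n_passes = n_passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for p in range(self.n_passes):
            x = x + self.pass_embed[p]        # tell the weights which pass this is
            for layer in self.layers:
                x = x + layer(x)              # residual around each shared layer
        return x
```

Five shared layers run twice give ten effective layers for roughly half the parameters of ten unique ones.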

Compared to Simply Making the Model Wider

An alternative to depth recurrence is reducing the layer count and increasing the hidden dimension, holding the parameter count fixed. The empirical evidence from Parameter Golf leans toward depth recurrence being slightly better for language modeling, though the gap is small and depends on other architectural choices.

Other Architecture Modifications

Beyond the big four, several other modifications appear frequently across submissions.

Grouped Query Attention (GQA)

GQA appears in 209 submissions. Standard multi-head attention gives each head its own key, query, and value projections. GQA groups heads together, sharing key-value projections across multiple query heads. A common configuration is 8 query heads with 4 key-value heads (GQA 8/4).

The benefit is parameter savings. Key-value projections account for a significant fraction of attention parameters. Sharing them across head groups reduces this cost while retaining most of the expressiveness.
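A quick count of the projection parameters makes the saving concrete (biases omitted; the 512-dim model and 64-dim heads match the GQA 8/4 configuration mentioned above):

```python
# Attention projection parameters for a 512-dim model with 64-dim heads.
d_model, head_dim = 512, 64

def attn_proj_params(n_q_heads: int, n_kv_heads: int) -> int:
    q = d_model * n_q_heads * head_dim     # query projection
    k = d_model * n_kv_heads * head_dim    # key projection
    v = d_model * n_kv_heads * head_dim    # value projection
    out = n_q_heads * head_dim * d_model   # output projection
    return q + k + v + out

mha = attn_proj_params(8, 8)   # standard multi-head attention
gqa = attn_proj_params(8, 4)   # GQA 8/4
print(mha - gqa)               # parameters saved per attention layer
```

At these sizes, GQA 8/4 trims a quarter of each attention layer's projection parameters.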

Partial RoPE (Rotary Position Embeddings)

270 submissions use Partial RoPE, where rotary position embeddings are applied to only a fraction of the head dimensions rather than all of them. Typically, RoPE is applied to the first half of each head's dimensions while the second half uses no positional encoding.

The intuition is that some attention heads benefit from position awareness (to handle local patterns) while others work better as position-independent feature detectors. Partial RoPE lets both modes coexist within the same head.
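The half-and-half split can be sketched as follows -- the rotate-by-halves layout and frequency schedule here follow a common RoPE convention, but the exact details vary across submissions:

```python
import torch

def partial_rope(x: torch.Tensor, rope_frac: float = 0.5,
                 base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to the first `rope_frac` of the
    last dimension and pass the rest through untouched.
    x: (..., seq_len, head_dim)."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rope_frac)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    half = rot_dim // 2
    seq_len = x.shape[-2]
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * inv_freq
    cos, sin = angles.cos(), angles.sin()

    # Rotate dimension pairs (i, i + half) by position-dependent angles
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

Note that the untouched second half is exactly the position-independent channel the intuition calls for.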

XSA (Extended Self-Attention)

392 submissions use XSA, a variant of self-attention that extends the attention mechanism with additional learned projections or gating. The specific implementation varies across submissions, but the common thread is enhancing the standard attention computation at minimal parameter cost.

LN Scale (LayerNorm Scaling)

226 submissions use LN Scale, a technique that adds a learned per-layer scaling factor to LayerNorm outputs. This gives the model fine-grained control over information flow between layers. Layers that should have a stronger influence on the residual stream can amplify their output, while layers that should have a weaker influence can dampen it.

This costs exactly one scalar parameter per layer -- essentially free -- but gives the optimizer an extra lever to balance the contribution of each layer during training.
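One plausible reading of LN Scale, sketched as a LayerNorm wrapper with a single learned scalar (initialized to 1 so it starts neutral; the class name is illustrative):

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm followed by one learned scalar -- the single extra
    parameter per layer that LN Scale costs."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.scale = nn.Parameter(torch.ones(1))  # the one extra parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.norm(x)
```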

Implementing BigramHash and SmearGate

import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash-based bigram features added to token embeddings."""

    def __init__(self, table_size: int = 8192, embed_dim: int = 512):
        super().__init__()
        self.table = nn.Embedding(table_size, embed_dim)
        self.table_size = table_size

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Compute bigram hashes: hash(token[t-1], token[t])
        # Shift tokens to create pairs
        prev_tokens = torch.cat([
            torch.zeros_like(token_ids[:, :1]),  # pad start
            token_ids[:, :-1]
        ], dim=1)

        # Simple hash: (prev * 257 + current) mod table_size
        bigram_hashes = (prev_tokens * 257 + token_ids) % self.table_size

        return self.table(bigram_hashes)


class SmearGate(nn.Module):
    """Learned gate that blends adjacent token representations."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.gate_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, embed_dim)
        gate = torch.sigmoid(self.gate_proj(x))

        # Shift x to get previous position
        prev_x = torch.cat([
            torch.zeros_like(x[:, :1]),  # pad start with zeros
            x[:, :-1]
        ], dim=1)

        # Blend current position with previous
        return gate * prev_x + (1 - gate) * x


# Usage in a model's forward pass:
# embeddings = token_embed(input_ids) + bigram_hash(input_ids)
# embeddings = smear_gate(embeddings)
# output = transformer_layers(embeddings)

Stacking Tricks: The Winning Recipe

The most important lesson from Parameter Golf architecture modifications is that they are additive. The best submissions do not pick one trick -- they use all of them.

PR #1056 (0.018 BPB, best neural) uses: BigramHash, SmearGate, U-Net skip connections, GQA, Partial RoPE, XSA, and LN Scale. PR #826 (0.295 BPB) uses the same stack. PR #944 (0.017 BPB) uses BigramHash plus the standard suite.

Why They Compose Well

Each modification targets a different bottleneck:

Modification     | What It Addresses              | Parameter Cost
-----------------|--------------------------------|------------------------------
BigramHash       | Local character context        | ~4K embeddings
SmearGate        | Adjacent token blending        | One linear layer
U-Net skips      | Feature reuse across depth     | Zero (just wiring)
Depth recurrence | More depth per parameter       | Zero (shared weights)
GQA              | Attention parameter efficiency | Negative (saves params)
Partial RoPE     | Positional flexibility         | Zero (applies RoPE to subset)
LN Scale         | Layer contribution control     | One scalar per layer

None of these modifications conflict with each other. BigramHash and SmearGate both enhance the input before attention. U-Net skips and depth recurrence both change how layers connect. GQA and Partial RoPE both modify the attention mechanism itself. They operate on different parts of the architecture, which is why stacking them produces compounding gains.

The Practical Takeaway

If you are building a Parameter Golf submission, start with the standard Transformer baseline and add modifications one at a time. BigramHash is the single highest-impact addition based on submission counts and average BPB. SmearGate is the second. U-Net skips have a higher implementation cost but provide meaningful gains for models with 8 or more layers.

The architecture tricks in this article are not exotic research ideas. They are battle-tested by hundreds of submissions and used by every top competitor. The question is not whether to use them, but how to combine them for your specific parameter budget.