PR #413
Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations
by anantdgoel
val_bpb
1.4525
Architecture
Transformer
Optimizer
AdamW
Artifact Size
13.2 MB
Training Techniques
Architecture
Value Residual
Caches raw V vectors from layer 0 and mixes them into all subsequent layers via learnable scalars to preserve token identity through depth.
parameters: {"layers":9,"learnable_scalars":18}
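The Value Residual mix described above can be sketched as follows. This is a hypothetical reconstruction: the PR reports 18 learnable scalars over 9 layers, which suggests two scalars per layer, but the exact mixing rule, class name, and initialization here are assumptions.

```python
import torch


class ValueResidualMix(torch.nn.Module):
    """Mix cached layer-0 value vectors into a deeper layer's values.

    Sketch of a ResFormer-style value residual: each layer blends its
    own V with the raw V cached from layer 0 via two learnable scalars
    (an assumed interpretation of the PR's 18 scalars / 9 layers).
    """

    def __init__(self):
        super().__init__()
        # Initialized so the layer starts as the identity on its own values.
        self.w_self = torch.nn.Parameter(torch.ones(1))
        self.w_first = torch.nn.Parameter(torch.zeros(1))

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # v: this layer's value vectors; v0: cached raw values from layer 0.
        return self.w_self * v + self.w_first * v0
```

With this initialization, training starts from the vanilla transformer and learns how much layer-0 token identity to reinject per layer.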
Gated Attention
Applies a per-head sigmoid gate after SDPA output to allow heads to suppress output and reduce attention sinks.
parameters: {"bias_init":4}
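A minimal sketch of the per-head gate, assuming the simplest reading of the listed parameters: a learnable per-head bias passed through a sigmoid and multiplied onto the SDPA output (richer variants condition the gate on the hidden state; the PR lists only `bias_init: 4`, so a bias-only gate is shown).

```python
import torch
import torch.nn.functional as F


class GatedSDPA(torch.nn.Module):
    """Per-head sigmoid gate applied after scaled dot-product attention.

    With bias_init=4 the gates start at sigmoid(4) ≈ 0.982, i.e. nearly
    fully open, so training begins close to vanilla attention and heads
    can learn to suppress their output (reducing attention sinks).
    """

    def __init__(self, n_heads: int, bias_init: float = 4.0):
        super().__init__()
        self.gate_bias = torch.nn.Parameter(torch.full((n_heads,), bias_init))

    def forward(self, q, k, v):  # each (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        gate = torch.sigmoid(self.gate_bias).view(1, -1, 1, 1)
        return out * gate
```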
SmearGate
Architecture component used in the ablation setup and training stack.
parameters: null
BigramHash
Bigram hashing component used in the ablation setup and training stack.
parameters: {"buckets":4096}
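The PR lists only `buckets: 4096` for BigramHash, so the following is a speculative sketch of the usual pattern: hash each adjacent token pair into a fixed number of buckets whose learned embedding is then combined with the current token's embedding. The hash function and the combination rule are assumptions.

```python
def bigram_bucket(prev_id: int, cur_id: int, n_buckets: int = 4096) -> int:
    """Hash an adjacent token pair (prev, cur) into one of n_buckets ids.

    Hypothetical sketch; a common use is to add the bucket's learned
    embedding to the current token's embedding. The multiplicative hash
    below is illustrative, not necessarily the PR's choice.
    """
    return (prev_id * 1_000_003 + cur_id) % n_buckets
```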
XSA
Included in the training script as a model component; likely an attention-related architectural modification.
parameters: null
Partial RoPE
Partial rotary positional embedding variant used in the training script.
parameters: null
LN Scale
LayerNorm scaling modification included in the training script.
parameters: null
Weight Averaging
EMA
Exponential moving average of model weights maintained during training and used for evaluation.
parameters: null
Initialization
OrthoInit
Orthogonal initialization used for the model.
Evaluation
sliding window eval
Evaluates with overlapping windows advanced by a fixed stride, so each scored token retains long prior context.
parameters: {"stride":128}
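The span arithmetic for a strided sliding-window eval can be sketched as below. This assumes the common scheme implied by `stride: 128` and the eval length of 1024: windows advance by the stride, and only tokens not covered by an earlier window are scored, so each scored token sees up to 1024 - 128 tokens of prior context.

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 128):
    """Yield (start, end, n_scored) spans for strided sliding-window eval.

    Each window covers tokens [start, end); only the last n_scored
    positions (those not already scored by an earlier window) contribute
    to the loss, so every token past the first window is predicted with
    window - stride tokens of context.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break
```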
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"weight_decay_muon":0.04,"weight_decay_adam":0.04}
Other
other
PPM-C context mixer: blends classical Prediction by Partial Matching with neural softmax at evaluation time; reported as a negative result.
parameters: {"alpha":0.95,"order":2}
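The blending step of the mixer might look like the sketch below. The PR lists `alpha: 0.95` and `order: 2` but not the exact rule; a convex combination weighted toward the neural model is assumed here, with the PPM-C side supplied by an order-2 escape-method-C count model (not shown). Note the PR reports this mixer as a negative result.

```python
import numpy as np


def mix_with_ppm(p_neural: np.ndarray, p_ppm: np.ndarray,
                 alpha: float = 0.95) -> np.ndarray:
    """Blend neural softmax probabilities with PPM-C predictions.

    Hypothetical eval-time mixer: p = alpha * p_neural + (1 - alpha) * p_ppm,
    renormalized to guard against numerical drift.
    """
    mixed = alpha * p_neural + (1.0 - alpha) * p_ppm
    return mixed / mixed.sum(axis=-1, keepdims=True)
```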
Novel Contributions
- Value Residual (ResFormer) that mixes layer-0 value vectors into deeper layers with learnable scalars
- Gated Attention with per-head sigmoid gating after SDPA to reduce attention sinks
- Controlled ablation study showing the two techniques stack additively
- Negative result for PPM-C context mixing on SmearGate + BigramHash models