PR #413
Non-record: Value Residual (-0.015 BPB) + Gated Attention (-0.003 BPB) with ablations
by anantdgoel
val_bpb
1.4525
Architecture
Transformer
Optimizer
AdamW
Artifact Size
13.2 MB
Training Techniques
Architecture
Value Residual
Caches raw V vectors from layer 0 and mixes them into all subsequent layers via learnable scalars to preserve token identity through depth.
parameters: {"layers":9,"learnable_scalars":18}
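The Value Residual mix described above can be sketched as follows. This is a hypothetical reconstruction: the PR reports 18 learnable scalars over 9 layers, which suggests two scalars per layer, but the exact mixing rule, class name, and initialization here are assumptions.

```python
import torch


class ValueResidualMix(torch.nn.Module):
    """Mix cached layer-0 value vectors into a deeper layer's values.

    Sketch of a ResFormer-style value residual: each layer blends its
    own V with the raw V cached from layer 0 via two learnable scalars
    (an assumed interpretation of the PR's 18 scalars / 9 layers).
    """

    def __init__(self):
        super().__init__()
        # Initialized so the layer starts as the identity on its own values.
        self.w_self = torch.nn.Parameter(torch.ones(1))
        self.w_first = torch.nn.Parameter(torch.zeros(1))

    def forward(self, v: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # v: this layer's value vectors; v0: cached raw values from layer 0.
        return self.w_self * v + self.w_first * v0
```

With this initialization, training starts from the vanilla transformer and learns how much layer-0 token identity to reinject per layer.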
Gated Attention
Applies a per-head sigmoid gate after SDPA output to allow heads to suppress output and reduce attention sinks.
parameters: {"bias_init":4}
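A minimal sketch of the per-head gate, assuming the simplest reading of the listed parameters: a learnable per-head bias passed through a sigmoid and multiplied onto the SDPA output (richer variants condition the gate on the hidden state; the PR lists only `bias_init: 4`, so a bias-only gate is shown).

```python
import torch
import torch.nn.functional as F


class GatedSDPA(torch.nn.Module):
    """Per-head sigmoid gate applied after scaled dot-product attention.

    With bias_init=4 the gates start at sigmoid(4) ≈ 0.982, i.e. nearly
    fully open, so training begins close to vanilla attention and heads
    can learn to suppress their output (reducing attention sinks).
    """

    def __init__(self, n_heads: int, bias_init: float = 4.0):
        super().__init__()
        self.gate_bias = torch.nn.Parameter(torch.full((n_heads,), bias_init))

    def forward(self, q, k, v):  # each (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        gate = torch.sigmoid(self.gate_bias).view(1, -1, 1, 1)
        return out * gate
```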
SmearGate
Architecture component used in the ablation setup and training stack.
parameters: null
BigramHash
Bigram hashing component used in the ablation setup and training stack.
parameters: {"buckets":4096}
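The PR lists only `buckets: 4096` for BigramHash, so the following is a speculative sketch of the usual pattern: hash each adjacent token pair into a fixed number of buckets whose learned embedding is then combined with the current token's embedding. The hash function and the combination rule are assumptions.

```python
def bigram_bucket(prev_id: int, cur_id: int, n_buckets: int = 4096) -> int:
    """Hash an adjacent token pair (prev, cur) into one of n_buckets ids.

    Hypothetical sketch; a common use is to add the bucket's learned
    embedding to the current token's embedding. The multiplicative hash
    below is illustrative, not necessarily the PR's choice.
    """
    return (prev_id * 1_000_003 + cur_id) % n_buckets
```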
XSA
Included in the training script as a model component; likely an attention-related architectural modification.
parameters: null
Partial RoPE
Partial rotary positional embedding variant used in the training script.
parameters: null
LN Scale
LayerNorm scaling modification included in the training script.
parameters: null
Weight Averaging
EMA
Exponential moving average of model weights maintained during training and used for evaluation.
parameters: null
Initialization
OrthoInit
Orthogonal initialization used for the model.
Evaluation
sliding window eval
Evaluates with overlapping windows advanced by a fixed stride, so each scored token retains long prior context.
parameters: {"stride":128}
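The span arithmetic for a strided sliding-window eval can be sketched as below. This assumes the common scheme implied by `stride: 128` and the eval length of 1024: windows advance by the stride, and only tokens not covered by an earlier window are scored, so each scored token sees up to 1024 - 128 tokens of prior context.

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 128):
    """Yield (start, end, n_scored) spans for strided sliding-window eval.

    Each window covers tokens [start, end); only the last n_scored
    positions (those not already scored by an earlier window) contribute
    to the loss, so every token past the first window is predicted with
    window - stride tokens of context.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break
```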
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"weight_decay_muon":0.04,"weight_decay_adam":0.04}
Other
other
PPM-C context mixer: blends classical Prediction by Partial Matching with neural softmax at evaluation time; reported as a negative result.
parameters: {"alpha":0.95,"order":2}
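The blending step of the mixer might look like the sketch below. The PR lists `alpha: 0.95` and `order: 2` but not the exact rule; a convex combination weighted toward the neural model is assumed here, with the PPM-C side supplied by an order-2 escape-method-C count model (not shown). Note the PR reports this mixer as a negative result.

```python
import numpy as np


def mix_with_ppm(p_neural: np.ndarray, p_ppm: np.ndarray,
                 alpha: float = 0.95) -> np.ndarray:
    """Blend neural softmax probabilities with PPM-C predictions.

    Hypothetical eval-time mixer: p = alpha * p_neural + (1 - alpha) * p_ppm,
    renormalized to guard against numerical drift.
    """
    mixed = alpha * p_neural + (1.0 - alpha) * p_ppm
    return mixed / mixed.sum(axis=-1, keepdims=True)
```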
Novel Contributions
- Value Residual (ResFormer) that mixes layer-0 value vectors into deeper layers with learnable scalars
- Gated Attention with per-head sigmoid gating after SDPA to reduce attention sinks
- Controlled ablation study showing the two techniques stack additively
- Negative result for PPM-C context mixing on SmearGate + BigramHash models