PR #862
openRECORD: Denseformer+VRL+XSA on last 4 layers+Gradient Clipping (pending 8xH100 eval)
by grim-hitman0XX
val_bpb
1.3036
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
DenseFormer
Depth-weighted average over current and all past layer representations, including embedding output.
parameters: {"layers":9}
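A minimal sketch of the DenseFormer depth-weighted average (DWA), assuming the listed `{"layers":9}` means each of 9 blocks mixes its output with all earlier representations (index 0 being the embedding output) via learned per-depth scalars; the block and weight structures here are illustrative, not the record's actual code:

```python
def denseformer_forward(x_emb, blocks, dwa_weights):
    """Depth-Weighted Average (DWA) sketch: after each block, blend the
    current output with ALL past representations, including the embedding."""
    history = [x_emb]  # index 0 holds the embedding output
    h = x_emb
    for i, block in enumerate(blocks):
        h = block(h)
        history.append(h)
        # learned scalars, one per representation seen so far
        w = dwa_weights[i][: len(history)]
        h = sum(wj * hj for wj, hj in zip(w, history))
    return h
```

With all weight on the newest representation this reduces to a plain residual stack; the learned scalars let later layers re-read earlier depths directly.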
LeakyReLU
Uses squared LeakyReLU with negative slope 0.5 in place of squared ReLU as the MLP activation.
parameters: {"negative_slope":0.5}
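The activation swap is a one-liner; a scalar sketch of squared LeakyReLU(0.5) versus the usual squared ReLU (squaring the negative branch makes the function non-monotonic below zero, unlike ReLU², which is flat there):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(x)^2: like the common ReLU^2 MLP activation, but the
    # negative branch keeps a scaled (then squared) signal instead of zero.
    y = x if x >= 0 else negative_slope * x
    return y * y
```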
Value Residual
Caches the value tensor from layer 0 and blends it into later layers' value tensors with learned softmax-normalized scalars.
parameters: {"layers":"1-8"}
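A sketch of the value-residual blend for one later layer, assuming (per the description) two learned scalars that are softmax-normalized before mixing the layer's own values with the cached layer-0 values; the list-of-floats representation stands in for the real value tensors:

```python
import math

def blend_values(v_layer, v0, logits):
    # Softmax over the two learned scalars so the blend weights sum to 1.
    e = [math.exp(l) for l in logits]
    s = sum(e)
    w_cur, w_first = e[0] / s, e[1] / s
    # v0 is the value tensor cached from layer 0 (Value Residual Learning).
    return [w_cur * a + w_first * b for a, b in zip(v_layer, v0)]
```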
XSA
Cross-self attention applied to the last 4 layers, projecting the self-value component out of the attention output.
parameters: {"layers":4}
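A toy sketch of one reading of XSA, assuming "projecting out the self-value component" means removing each token's own value contribution `a[t][t] * v[t]` from the standard attention output, so the last 4 layers attend only cross-token; scalar values stand in for value vectors, and this interpretation is an assumption, not the record's confirmed implementation:

```python
def xsa_output(attn_weights, values):
    """Cross-self attention sketch: standard attention output minus each
    query token's own (self) value term a[t][t] * v[t]."""
    T = len(values)
    out = []
    for t in range(T):
        # sum over all source positions EXCEPT the query position itself
        o = sum(attn_weights[t][s] * values[s] for s in range(T) if s != t)
        out.append(o)
    return out
```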
Regularization
Gradient Clipping
parameters: {"norm":0.3}
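Global-norm clipping at 0.3 rescales every gradient by the same factor whenever the combined L2 norm exceeds the threshold; a flat-list sketch (real code would operate on per-parameter tensors):

```python
import math

def clip_global_norm(grads, max_norm=0.3):
    # Compute the global L2 norm across ALL gradients, then scale every
    # gradient by max_norm / norm when the norm exceeds max_norm (0.3 here).
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```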
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: 9
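The artifact pipeline above (symmetric int8 quantization of all weights, then zlib at level 9) can be sketched end to end; the round-trip helper names are illustrative:

```python
import zlib

def pack_weights(weights, level=9):
    # Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127],
    # then zlib-compress the quantized bytes at level 9.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = bytes(round(w / scale) & 0xFF for w in weights)  # two's-complement bytes
    return scale, zlib.compress(q, level)

def unpack_weights(scale, blob):
    # Decompress, reinterpret bytes as signed int8, and rescale.
    raw = zlib.decompress(blob)
    return [scale * (b - 256 if b > 127 else b) for b in raw]
```

Round-trip error is bounded by half a quantization step, i.e. about `max|w| / 254`.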
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_from":0.85,"warmup_steps":500}
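One reading of `other_params`, assuming `warmup_from` / `warmup_steps` describe a linear ramp of Muon's momentum from 0.85 up to the final 0.95 over the first 500 steps (the schedule shape is an assumption, not confirmed by the record):

```python
def muon_momentum(step, warmup_from=0.85, target=0.95, warmup_steps=500):
    # Linear momentum warmup: 0.85 -> 0.95 over the first 500 steps,
    # then held constant at the target for the rest of training.
    frac = min(step / warmup_steps, 1.0)
    return warmup_from + frac * (target - warmup_from)
```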
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
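A sketch of the warmdown schedule, assuming the common speedrun shape: learning rate held flat, then decayed linearly to zero over the final `warmdown_steps` (1200 here); `total_steps` is a hypothetical parameter not stated in the record:

```python
def lr_scale(step, total_steps, warmdown_steps=1200):
    # Flat LR until the warmdown window, then a linear ramp down to zero
    # over the last warmdown_steps training steps.
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```

The base learning rate is multiplied by this scale each step.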
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- DenseFormer depth-weighted averaging across all previous layer representations
- LeakyReLU(0.5) squared activation replacing ReLU squared
- Value Residual Learning blending layer-0 values into later layers
- Cross-self attention on the last 4 layers
- Global gradient clipping at 0.3
- int8 plus zlib artifact compression