PR #862
openRECORD: Denseformer+VRL+XSA on last 4 layers+Gradient Clipping (pending 8xH100 eval)
by grim-hitman0XX
val_bpb
1.3036
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
DenseFormer
Depth-weighted average over current and all past layer representations, including embedding output.
parameters: {"layers":9}
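A minimal sketch of the DenseFormer depth-weighted average (DWA), assuming the listed `{"layers":9}` means each of 9 blocks mixes its output with all earlier representations (index 0 being the embedding output) via learned per-depth scalars; the block and weight structures here are illustrative, not the record's actual code:

```python
def denseformer_forward(x_emb, blocks, dwa_weights):
    """Depth-Weighted Average (DWA) sketch: after each block, blend the
    current output with ALL past representations, including the embedding."""
    history = [x_emb]  # index 0 holds the embedding output
    h = x_emb
    for i, block in enumerate(blocks):
        h = block(h)
        history.append(h)
        # learned scalars, one per representation seen so far
        w = dwa_weights[i][: len(history)]
        h = sum(wj * hj for wj, hj in zip(w, history))
    return h
```

With all weight on the newest representation this reduces to a plain residual stack; the learned scalars let later layers re-read earlier depths directly.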
LeakyReLU
Uses squared LeakyReLU with negative slope 0.5 in place of squared ReLU as the MLP activation.
parameters: {"negative_slope":0.5}
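The activation swap is a one-liner; a scalar sketch of squared LeakyReLU(0.5) versus the usual squared ReLU (squaring the negative branch makes the function non-monotonic below zero, unlike ReLU², which is flat there):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(x)^2: like the common ReLU^2 MLP activation, but the
    # negative branch keeps a scaled (then squared) signal instead of zero.
    y = x if x >= 0 else negative_slope * x
    return y * y
```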
Value Residual
Caches the value tensor from layer 0 and blends it into later layers' value tensors with learned softmax-normalized scalars.
parameters: {"layers":"1-8"}
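A sketch of the value-residual blend for one later layer, assuming (per the description) two learned scalars that are softmax-normalized before mixing the layer's own values with the cached layer-0 values; the list-of-floats representation stands in for the real value tensors:

```python
import math

def blend_values(v_layer, v0, logits):
    # Softmax over the two learned scalars so the blend weights sum to 1.
    e = [math.exp(l) for l in logits]
    s = sum(e)
    w_cur, w_first = e[0] / s, e[1] / s
    # v0 is the value tensor cached from layer 0 (Value Residual Learning).
    return [w_cur * a + w_first * b for a, b in zip(v_layer, v0)]
```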
XSA
Cross-self attention applied to the last 4 layers, projecting the self-value component out of the attention output.
parameters: {"layers":4}
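A toy sketch of one reading of XSA, assuming "projecting out the self-value component" means removing each token's own value contribution `a[t][t] * v[t]` from the standard attention output, so the last 4 layers attend only cross-token; scalar values stand in for value vectors, and this interpretation is an assumption, not the record's confirmed implementation:

```python
def xsa_output(attn_weights, values):
    """Cross-self attention sketch: standard attention output minus each
    query token's own (self) value term a[t][t] * v[t]."""
    T = len(values)
    out = []
    for t in range(T):
        # sum over all source positions EXCEPT the query position itself
        o = sum(attn_weights[t][s] * values[s] for s in range(T) if s != t)
        out.append(o)
    return out
```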
Regularization
Gradient Clipping
parameters: {"norm":0.3}
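Global-norm clipping at 0.3 rescales every gradient by the same factor whenever the combined L2 norm exceeds the threshold; a flat-list sketch (real code would operate on per-parameter tensors):

```python
import math

def clip_global_norm(grads, max_norm=0.3):
    # Compute the global L2 norm across ALL gradients, then scale every
    # gradient by max_norm / norm when the norm exceeds max_norm (0.3 here).
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```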
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: 9
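The artifact pipeline above (symmetric int8 quantization of all weights, then zlib at level 9) can be sketched end to end; the round-trip helper names are illustrative:

```python
import zlib

def pack_weights(weights, level=9):
    # Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127],
    # then zlib-compress the quantized bytes at level 9.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = bytes(round(w / scale) & 0xFF for w in weights)  # two's-complement bytes
    return scale, zlib.compress(q, level)

def unpack_weights(scale, blob):
    # Decompress, reinterpret bytes as signed int8, and rescale.
    raw = zlib.decompress(blob)
    return [scale * (b - 256 if b > 127 else b) for b in raw]
```

Round-trip error is bounded by half a quantization step, i.e. about `max|w| / 254`.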
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"warmup_from":0.85,"warmup_steps":500}
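One reading of `other_params`, assuming `warmup_from` / `warmup_steps` describe a linear ramp of Muon's momentum from 0.85 up to the final 0.95 over the first 500 steps (the schedule shape is an assumption, not confirmed by the record):

```python
def muon_momentum(step, warmup_from=0.85, target=0.95, warmup_steps=500):
    # Linear momentum warmup: 0.85 -> 0.95 over the first 500 steps,
    # then held constant at the target for the rest of training.
    frac = min(step / warmup_steps, 1.0)
    return warmup_from + frac * (target - warmup_from)
```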
LR Schedule
warmdown
parameters: {"warmdown_steps":1200}
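A sketch of the warmdown schedule, assuming the common speedrun shape: learning rate held flat, then decayed linearly to zero over the final `warmdown_steps` (1200 here); `total_steps` is a hypothetical parameter not stated in the record:

```python
def lr_scale(step, total_steps, warmdown_steps=1200):
    # Flat LR until the warmdown window, then a linear ramp down to zero
    # over the last warmdown_steps training steps.
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```

The base learning rate is multiplied by this scale each step.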
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- DenseFormer depth-weighted averaging across all previous layer representations
- LeakyReLU(0.5) squared activation replacing ReLU squared
- Value Residual Learning blending layer-0 values into later layers
- Cross-self attention on the last 4 layers
- Global gradient clipping at 0.3
- int8 plus zlib artifact compression