PR #1453

open

Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean)

by iverbovoy
val_bpb
1.1324
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.40 MB

Training Techniques

Architecture
depth recurrence
Three shared transformer blocks repeated 4 times for 12 effective layers, with cross-repeat skip connections and per-repeat loop embeddings.
parameters: {"layers":3,"repeats":4,"effective_layers":12}
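The depth-recurrence idea can be sketched in a few lines: the same 3 blocks are applied 4 times, with a loop embedding telling the shared weights which repeat they are in. The exact form of the cross-repeat skip and loop embeddings is not specified in the PR, so the additive versions below are assumptions; the toy affine "blocks" are purely illustrative.

```python
import numpy as np

def depth_recurrent_forward(x, blocks, repeats, loop_embeds):
    """Apply the same weight-shared blocks `repeats` times.

    3 blocks x 4 repeats = 12 effective layers. Additive loop
    embeddings and cross-repeat skips are assumed forms, not the
    submission's exact implementation.
    """
    prev = None
    for r in range(repeats):
        x = x + loop_embeds[r]      # loop embedding marks the repeat index
        if prev is not None:
            x = x + prev            # cross-repeat skip (assumed additive)
        prev = x
        for block in blocks:        # the same 3 blocks, weights shared
            x = block(x)
    return x

# Toy usage: 3 "blocks" as small affine maps, repeated 4 times.
rng = np.random.default_rng(0)
dim = 8
blocks = []
for _ in range(3):
    W = rng.standard_normal((dim, dim))
    blocks.append(lambda h, W=W: h + 0.01 * (h @ W))
loop_embeds = 0.01 * rng.standard_normal((4, dim))
out = depth_recurrent_forward(np.zeros(dim), blocks, repeats=4,
                              loop_embeds=loop_embeds)
print(out.shape)  # (8,)
```

The key property is that only 3 blocks' worth of parameters are stored while the forward pass has depth 12, which is what lets the model fit the artifact budget.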
MLP3x
Wider 3x MLP configuration used to increase model capacity.
parameters: {"multiplier":3,"hidden_dim":2640}
XSA
Exclusive self-attention applied to the last 4 effective layers to prevent attention collapse.
parameters: {"last_n":4}
LeakyReLU
Squared LeakyReLU activation, LeakyReLU(x, slope=0.5)^2, used for better gradient flow in deep recurrent models.
parameters: {"slope":0.5}
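The activation is simple enough to write out directly. A plain elementwise square of the LeakyReLU output is assumed here (so negative inputs also map to positive values):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise.

    The slope keeps a nonzero gradient for negative inputs, and the
    square smooths the kink at zero.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y

print(leaky_relu_squared(np.array([-2.0, 0.0, 3.0])))  # [1. 0. 9.]
```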
Quantization
mixed int7/int5
bits: 7 (attention), 5 (MLP)
scope: attention and MLP weights
Weight Averaging
SWA
parameters: {"every":30,"start_frac":0.6}
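The SWA parameters translate to: starting at 60% of training, add a checkpoint to a running weight average every 30 iterations. A sketch under that reading (the incremental-mean update is the standard SWA formulation, not copied from the submission):

```python
def swa_checkpoints(total_iters, every=30, start_frac=0.6):
    """Iterations at which weights enter the stochastic weight average."""
    start = int(total_iters * start_frac)
    return [t for t in range(total_iters)
            if t >= start and (t - start) % every == 0]

def swa_update(avg, w, n):
    """Incremental mean: fold checkpoint w into an average of n checkpoints."""
    return (avg * n + w) / (n + 1)

print(swa_checkpoints(300))  # [180, 210, 240, 270]

# Scalar toy: averaging checkpoints 1.0, 3.0, 5.0 gives 3.0.
avg = 1.0
for i, w in enumerate([3.0, 5.0]):
    avg = swa_update(avg, w, i + 1)
print(avg)  # 3.0
```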
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.018,"scalar_lr":0.018,"tied_embed_lr":0.021}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
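Sliding-window evaluation feeds overlapping windows to the model but scores each token only once, so every scored token sees substantial left context. The PR leaves the parameters unspecified, so the window and stride below are illustrative assumptions:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, score_from) pairs for sliding-window eval.

    Window [start, start + window) is fed to the model, but only
    tokens from `score_from` onward contribute to val_bpb, so each
    token is scored exactly once. Window/stride are assumed values.
    """
    start = 0
    while start < n_tokens:
        score_from = 0 if start == 0 else start + (window - stride)
        yield start, score_from
        if start + window >= n_tokens:
            break
        start += stride

print(list(sliding_windows(2048)))  # [(0, 0), (512, 1024), (1024, 1536)]
```

Here the three windows score token ranges [0, 1024), [1024, 1536), and [1536, 2048): full coverage with no double counting.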
hedge mixer
parameters: {"experts":5,"parallel_gpus":8}
LR Schedule
warmdown
parameters: {"iters":3000}
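"Warmdown" here means the learning rate is held constant and then annealed over the final 3000 iterations. A linear ramp to zero is assumed (the PR does not state the shape), and the 6000-step total below is a hypothetical value for illustration; 0.018 is the submission's matrix_lr.

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=3000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    `warmdown_iters` steps (assumed linear shape)."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters

print(lr_at(0, 6000, 0.018))     # 0.018 (constant phase)
print(lr_at(4500, 6000, 0.018))  # 0.009 (halfway through warmdown)
```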
Regularization
logit softcap
parameters: {"value":30}
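Logit softcapping with value 30 bounds the logits smoothly rather than clipping them, using the standard cap * tanh(logits / cap) form:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly squash logits into (-cap, cap).

    Near zero this is approximately the identity; large logits
    saturate toward +/-cap, keeping the loss well-conditioned.
    """
    return cap * np.tanh(logits / cap)

capped = softcap(np.array([0.0, 30.0, 1000.0]))
print(capped)
```

A logit of 0 passes through unchanged, 30 maps to about 22.85, and extreme values like 1000 saturate at the cap.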
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Int7 attention with Int5 MLP mixed quantization to fit a wider model within the 16 MB budget
  • Depth recurrence with 3 shared blocks repeated 4 times for 12 effective layers
  • Parallelized hedge mixer evaluation across 8 GPUs to reduce eval time
  • Wider 3x MLP enabled by the size budget saved through mixed quantization
  • Progressive depth training with earlier phase transitions and SWA